Managing & Sharing Research Data: Guide

Managing and Sharing Research Data

Your research activity will generally create a lot of material, and understanding how to handle this is not always straightforward.

 


Creating and Using Research Data

 

Understanding the difference between research “data” and research “records” is often the first hurdle.

“Will I need this material to support a publication, or validate my research findings?”

“Will this item form part of a finalised data set once my work is complete?”

If the answer is “yes” to either question this will be part of your research data. Research records will usually need to be kept too, for audit purposes.

 

Research data can include:

  • Recorded outputs of observations, experiments, or simulations
  • Lab Books and Logs
  • Models created and used to perform simulations and experiments
  • Software tools created to capture, analyse, or otherwise use data
  • Documentation that describes the project context, methods used, and data outputs produced, including email correspondence between collaborators

Funding bodies usually like to see that the data you gather or create fills a gap in knowledge and require you to demonstrate this. It is often cost effective to re-use data created elsewhere in different ways, perhaps creating a “mash-up” of data from different sources to demonstrate something new. This is attractive to funding bodies, because it means they are not funding the same data gathering exercises twice.

Many public research funding bodies and publishers now require that data be made publicly available. You need to understand the terms of your funding agreement before you start, to make sure you take this into account.

You might also like...

  Association of Medical Research Charities

  Horizon 2020

  UKRI

  Concordat on Open Research Data

  MANTRA - Online Research Data Training Resources (University of Edinburgh)


Data Management Planning

 

A Data Management Plan (DMP) helps researchers and research students with their research methodology. Data management planning is an RGU requirement, and in many cases it is now becoming a funder requirement at the point of submission.

A DMP covers the following basics:

  • description, format and volume of data
  • data storage and back-up measures
  • data management roles and responsibilities
  • infrastructure, costing or resources needed
  • plans for sharing data including ethical and legal issues or restrictions on data sharing
  • copyright and intellectual property rights of data

 

To help researchers, templates are available via the DMPonline tool. The tool includes video tuition, and RGU users can log in using their institutional credentials. Researchers who plan to submit to a funder for which there is no prepared template can still use this tool, which will provide a standard simple template. Research students can also use this template for planning data handling during their studies.

Workshops on data management planning and data handling are held regularly throughout the academic year.

Data Management Plans: Examples

  ESRC / DFID Data Management Plan

  A Guide to Writing a Wellcome Trust Data Management Plan

  AHRC Example Technical Plan – University of Leeds

  AHRC Example Technical Plan – University of Bristol

 

Good Practice Tips

  • Know your legal, ethical and other obligations regarding research data, towards research participants, colleagues, research funders and institutions

  • Implement good practices in a consistent manner
  • Assign roles and responsibilities to relevant parties in the research
  • Design data management according to the needs and purpose of research
  • Incorporate data management measures as an integral part of your research cycle
  • Implement and review data management throughout research as part of research progression and review

Case Study Examples

Writing a Data Management Plan

In April 2010, the Digital Curation Centre (DCC) launched DMP Online, a web-based tool designed to help researchers and other data stakeholders develop data management plans according to the requirements of major research funders.

Using the tool researchers can create, store and update multiple versions of a data management plan at the grant application stage and during the research cycle. Plans can be customised and exported in various formats. Funder- and institution-specific best practice guidance is available.

The tool combines the DCC’s comprehensive ‘Checklist for a Data Management Plan’ with an analysis of research funder requirements. The DCC is working with partner organisations to include domain- and subject-specific guidance in the tool.

Submitting a Data Management Plan

The Rural Economy and Land Use (RELU) Programme has been at the forefront of implementing data management planning for research projects since 2004. Drawing on best practice in data management and sharing across three research councils (ESRC, NERC and BBSRC), RELU requires that all funded projects develop and implement a Data Management Plan to ensure that data are well managed throughout the duration of a research project. In a data management plan researchers describe:

  • the need for access to existing data sources
  • data to be produced by the research project
  • quality assurance and back-up procedures
  • plans for management and archiving of collected data
  • expected difficulties in making data available for secondary research and measures to overcome such difficulties
  • who holds copyright and Intellectual Property Rights of the data
  • who has data management responsibility roles within the research team

Formatting

 

The format and software in which research data is created usually depend on the hardware used, on how researchers plan to analyse the data, and in some cases on discipline-specific standards and customs.

Although many software packages are backward compatible and can import data created in previous versions, the safest option to guarantee long-term data access is to convert data to standard formats.

Keep it Organised!

Well-organised file names and folder structures make it easier to find and keep track of data files. Develop a system that works for your project and use it consistently. Whilst computers add basic information and properties to a file, this is not reliable data management. It is better to record essential information in file names or through the folder structure. Think carefully about how best to structure files in folders, in order to make it easy to locate and organise files and versions. When working in collaboration the need for an orderly structure is even greater.

Keep Track of Changes and Locations

It is important to ensure that different versions of files, related files held in different locations, and information that is cross-referenced between files are all subject to version control. It can be difficult to locate a correct version or to know how versions differ after some time has elapsed.

It is important to keep track of master versions of files, for example the latest iteration, especially where data files are shared between people or locations, e.g. on both a PC and a laptop. Checks and procedures may also need to be put in place to make sure that if the information in one file is altered, the related information in other files is also updated.

Because digital information can be copied or altered so easily, it is important to be able to demonstrate the authenticity of data and to be able to prevent unauthorised access to data that may potentially lead to unauthorised changes.

Ensure Good Quality Control!

Quality control of data is an integral part of all research and takes place at various stages. It is important to assign clear roles and responsibilities for data quality assurance at all stages of research and to develop suitable procedures before data gathering starts.  Quality control measures during data collection may include:

  • calibration of instruments
  • checking the truth of the record with an expert
  • using standardised methods and protocols for capturing observations
  • using computer-assisted interview software to standardise interviews and verify response consistency
  • checking data completeness
  • verifying random samples of the digital data against the original data
  • statistical analyses, such as frequencies and means, to detect errors and anomalous values
  • peer review
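As an illustration of the statistical checks listed above, the following minimal Python sketch counts value frequencies and flags readings that lie far from the mean. The function names, the sample readings and the two-standard-deviation threshold are assumptions chosen for this example; they are not a prescribed procedure, and a suitable threshold will depend on your data.

```python
from collections import Counter
from statistics import mean, stdev


def frequencies(values):
    """Count how often each value occurs, to help spot unexpected codes."""
    return Counter(values)


def flag_anomalies(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    centre = mean(values)
    spread = stdev(values)
    if spread == 0:
        return []
    return [v for v in values if abs(v - centre) > threshold * spread]


# Hypothetical data: 98.0 looks like a data entry error, and "y" like a
# miscoded "Y".
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 98.0, 10.0, 9.7]
codes = ["Y", "N", "Y", "y", "N"]

suspect_readings = flag_anomalies(readings)
code_counts = frequencies(codes)
```

Flagged values are candidates for checking against the original record, not automatic deletions.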

Good-quality, consistent transcription relies on transcription instructions or guidelines and a template to ensure uniformity across a collection. Full transcription is recommended for data sharing. If transcription is outsourced take care with:

  • data security when transmitting data between researcher and transcriber
  • data security procedures for the transcriber to follow
  • a non-disclosure agreement for the transcriber
  • transcriber instructions or guidelines, indicating required transcription style, layout and editing

Transcripts should:

  • have a unique identifier that labels an interview either through a name or number
  • have a uniform layout throughout a research project or data collection
  • use speaker tags to indicate turn-taking or question/answer sequence in conversations
  • carry line breaks between turn-takes
  • be page numbered
  • have a document cover sheet or header with brief interview or event details such as date, place, interviewer name, interviewee details

Include Data Documentation and Metadata

Data documentation explains how data was created, what it means, and what its content and structure are. It is part of good practice when creating, organising and managing data, and it is important to create sufficient contextual information to make sense of the data. Documentation may include:

  • names, labels and descriptions for variables, records and their values
  • explanation or definition of codes and classification schemes used
  • definitions of specialist terminology or acronyms used
  • codes of, and reasons for, missing values
  • derived data created after collection, with code, algorithm or command file
  • weighting and grossing variables created
  • data listing of annotations for cases, individuals or items

Metadata is the label attached to data to describe it. It is extremely important, because most people will forget the details of what a data file or data set contains. Typically metadata will include information on:

  • WHAT was collected
  • HOW it was collected
  • WHEN the data was collected
  • WHO collected it
  • WHAT format was used
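One lightweight way to keep this information with the data is a small "sidecar" file stored next to each data file. The sketch below is a minimal Python example of that idea; the field names, the `.metadata.json` suffix and the sample values are all assumptions made for illustration, and a data centre or repository may require its own metadata schema instead.

```python
import json
from pathlib import Path


def build_metadata(what, how, when, who, file_format):
    """Collect the five basic descriptive fields into a dictionary."""
    return {
        "what": what,
        "how": how,
        "when": when,
        "who": who,
        "format": file_format,
    }


def write_sidecar(data_file, metadata):
    """Write the metadata next to the data file as <name>.metadata.json."""
    sidecar = Path(str(data_file) + ".metadata.json")
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return sidecar


# Hypothetical example values.
record = build_metadata(
    what="Semi-structured interview transcripts",
    how="Face-to-face interviews, audio recorded and transcribed",
    when="2016-03-01 to 2016-05-31",
    who="Hypothetical Researcher",
    file_format="Plain text (UTF-8)",
)
```

Calling `write_sidecar("interviews.txt", record)` would create `interviews.txt.metadata.json` in the same folder.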

 

You might also like...

  File Formats & Version Control

  Transcript Model

  JISC Guide to Managing Digital Media

  JISC Guidance on File Names

  JISC Guidance on Digital File Formats

 

Good Practice Tips

Good data documentation includes:

  • the context of the data, project history, aim, objectives and hypotheses
  • data collection methods, sampling, instruments used, hardware and software used, scale and resolution, temporal and geographic coverage and secondary data sources used
  • structure of data files, study cases, relationships between files
  • data validation, checking, proofing, cleaning and quality assurance procedures carried out
  • changes made to data over time since their original creation and identification of different versions of data files
  • information on access and use conditions or data confidentiality

Good file naming conventions:

  • create meaningful but brief names
  • use file names to classify broad types of files
  • avoid using spaces and special characters
  • avoid very long file names
  • create “readme” files to act as memory aids explaining your file name convention
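The naming tips above can be applied consistently by generating file names rather than typing them by hand. The following Python sketch shows one possible convention (project, content type, date, version); the function name, the order of the parts and the 60-character limit are assumptions for illustration only, so adapt them to whatever convention your project agrees on.

```python
import re
from datetime import date


def make_file_name(project, content_type, created, version, extension):
    """Build a name such as 'reefsurvey_transcript_2016-03-01_v02.txt'."""

    def clean(part):
        # Drop spaces and special characters; keep only letters and digits.
        return re.sub(r"[^A-Za-z0-9]", "", part).lower()

    name = "{}_{}_{}_v{:02d}.{}".format(
        clean(project),
        clean(content_type),
        created.isoformat(),
        version,
        clean(extension),
    )
    if len(name) > 60:  # arbitrary limit chosen for this sketch
        raise ValueError("file name too long: " + name)
    return name


example_name = make_file_name("Reef Survey", "transcript", date(2016, 3, 1), 2, "txt")
```

Putting the date in year-month-day order means that files sort chronologically in a folder listing.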

Best practice to ensure authenticity is to:

  • keep a single master file of data
  • assign responsibility for master files to a single project team member
  • regulate write access to master versions of data files
  • record all changes to master files
  • maintain old master files in case later ones contain errors
  • archive copies of master files at regular intervals
  • develop a formal procedure for the destruction of master files

Version Control tips include:

  • decide how many versions of a file to keep, which versions to keep, for how long and how to organise versions
  • identify milestone versions to keep
  • uniquely identify files using a systematic naming convention
  • record version and status of a file, e.g. draft, interim, final, internal
  • record what changes are made to a file when a new version is created
  • record relationships between items where needed, e.g. relationship between code and the data file it is run against; between data file and related documentation or metadata; or between multiple files
  • track the location of files if they are stored in a variety of locations
  • regularly synchronise files in different locations, e.g. using MS SyncToy software
  • maintain single master files in a suitable file format to avoid version control problems associated with multiple working versions of files being developed in parallel
  • identify a single location for the storage of milestone and master versions
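Recording the version, status and changes for each file can be as simple as appending a row to a shared log. The sketch below is a minimal Python example using a CSV file; the column order, the function name and the file names are assumptions for illustration, and the status values are taken from the examples in the list above.

```python
import csv
from datetime import datetime, timezone

# Status labels taken from the examples in the tips above.
ALLOWED_STATUS = {"draft", "interim", "final", "internal"}


def log_version(log_path, file_name, version, status, changes):
    """Append one row (timestamp, file, version, status, changes) to the log."""
    if status not in ALLOWED_STATUS:
        raise ValueError("unknown status: " + status)
    entry = [
        datetime.now(timezone.utc).isoformat(timespec="seconds"),
        file_name,
        version,
        status,
        changes,
    ]
    # Mode "a" adds to the end of the log, so earlier entries are never overwritten.
    with open(log_path, "a", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow(entry)
    return entry
```

For example, `log_version("versions.csv", "reefsurvey_counts_v03.csv", 3, "final", "Removed duplicates")` adds one line to `versions.csv` (both file names are hypothetical).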

Case Study Examples

Documenting Data in NVivo

 

Researchers using qualitative data analysis packages, such as NVivo 9, to analyse data can use a range of the software’s features to describe and document data. Such descriptions both help during analysis and result in essential documentation when data is shared, as they can be exported from the project file alongside data at the end of research. Researchers can create classifications for persons (e.g. interviewees), data sources (e.g. interviews) and coding. Classifications can contain attributes such as the demographic characteristics of interviewees, pseudonyms used, and the date, time and place of interview. If researchers create generic classifications beforehand, attributes can be standardised across all sources or persons throughout the project. Existing template and pre-populated classification sheets can be imported into NVivo.

Documentation files like the methodology description, project plan, interview guidelines and consent form templates can be imported into the NVivo project file and stored in a ‘documentation’ folder in the Memos folder or linked from NVivo 9 externally. Additional documentation about analyses or data manipulations can be created in NVivo as memos. A date- and time-stamped project event log can record all project events carried out during the NVivo project cycle. Additional descriptions can be added to all objects created in, or imported to, the project file such as the project file itself, data, documents, memos, nodes and classifications. All textual documentation compiled during the NVivo project cycle can later be exported as textual files; classifications and event logs can be exported as spreadsheets to document preserved data collections. The structure of the project objects can be exported in groups or individually. Summary information about the project as a whole or groups of objects can be exported via project summary extract reports as a text, MS Excel or XML file.

Data Documentation

Online documentation for a data collection in the UK Data Archive Catalogue can include project instructions, questionnaires, technical reports, and user guides. Researchers typically create metadata records for their data by completing a data centre’s data deposit form or metadata editor, or by using a metadata creation tool, like Go-Geo! GeoDoc or the UK Location Metadata Editor. Providing detailed and meaningful dataset titles, descriptions, keywords and other information enables data centres to create rich resource-discovery metadata for archived data collections. Data centres accompany each dataset with a bibliographic citation that users are required to cite in research outputs to reference and acknowledge accurately the data source used. A citation gives credit to the data source and distributor and identifies data sources for validation.

File Formatting

The Wessex Archaeology Metric Archive Project has brought together metric animal bone data from a range of archaeological sites in England into a single database format. The dataset contains a selection of measurements commonly taken during Wessex Archaeology zooarchaeological analysis of animal bone fragments found during field investigations. It was created by the researchers in MS Excel and MS Access formats and deposited with the Archaeology Data Service (ADS) in the same formats. ADS has preserved the dataset in Oracle and in comma-separated values (CSV) format and disseminates the data both via an Oracle/Cold Fusion live interface and as downloadable CSV files.

File Conversions

The JISC-funded Data Management for Bio-Imaging project at the John Innes Centre developed Bioformats Converter software to batch convert bio-images from a variety of proprietary microscopy image formats to the Open Microscopy Environment format, OME-TIFF. OME-TIFF, an open file format that enables data sharing across platforms, maintains the original image metadata in the file in XML format.


Storing

 

You’ve invested a lot of time and effort in creating your data, so keep it safe. Throughout the life of your project you need to continuously think about solutions for storing data carefully. Many forms of storage media are inherently unreliable, and all file formats and physical storage media will ultimately become obsolete.

 

Back-up!

Making back-ups of files is an essential element of data management. Back-ups protect against accidental or malicious data loss through:

  • hardware or software failure
  • virus infection or malicious hacking
  • human error

It is worthwhile checking that you can recover the files you have backed up. External cloud-based storage is a good solution, but double check the security features offered, including recovery of files. If you plan to store any business-critical or personal information, make sure your chosen method complies with Data Protection legislation and best practice.

Share!

Sharing data between collaborators is a challenge. Anything sent by email persists in a number of unknown exchange servers – the sender’s, the receiver’s and others in-between – so relying on this as a method of data transfer is not good practice. Cloud-based or online file sharing services may be suitable for sharing certain types of data, but they are not recommended for data that may be confidential, because users do not control where data is ultimately stored. Researchers should be aware of the risks and benefits of each type of solution so they can make informed decisions about which to use.

Think of the Long Term!

In terms of long term storage of complete data sets once you are ready to publish, RGU library can help you protect, preserve, archive, and share your research data.

All research activity associated with RGU is an asset of the University and so RGU has a responsibility to secure, store and access all research data, within the bounds of any IP or confidentiality agreement.

To ensure this, RGU is providing R:\drives for researchers, including research students. These provide additional basic data storage space, which can be shared with named individuals who have an RGU login e.g. PIs, research team members, research students and supervisors. They do not provide additional processing or compute power.

Research students will have an R:\drive created for them shortly after they commence their studies, typically when they have completed Module 1 of PGCert.

Data is held securely and privately and so the R:\drive is ideal for confidential or sensitive data. The R:\drive can be accessed via Citrix remotely in the same way as H:\drives.

  Request for an R:\Drive

 

 

Good Practice Tips

  • Store data in non-proprietary or open standard formats
  • Create digital versions of paper documentation in PDF/A format for long-term preservation and storage
  • Often research data and outputs that have been created collaboratively are available via a web site. Although this is an excellent means of disseminating research, data can be particularly vulnerable if the host institution closes the web site. Do not therefore rely on this method as a robust means of securing data.
  • Copy or migrate data files to new media between two and five years after they were first created, since both optical and magnetic media are subject to physical degradation
  • Consider whether to back up particular files or the entire computer system (complete system image)
  • Consider the frequency of back-up needed: after each change to a data file or at regular intervals
  • Check the data integrity of stored data files at regular intervals
  • Use a storage strategy, even for a short-term project, with two different forms of storage, e.g. the cloud and a hard drive
  • Develop back-up strategies for all systems where data are held, including portable computers and devices, non-network computers and home-based computers
  • Organise and clearly label stored data so they are easy to locate and physically accessible
  • Ensure that areas and rooms for storage of digital or non-digital data are fit for the purpose, structurally sound, and free from the risk of flood and fire
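Checking the integrity of stored files at regular intervals is commonly done by recording a checksum when a file is stored and recomputing it later. The Python sketch below uses SHA-256 from the standard library; the function names and the chunk size are assumptions for illustration, and where the recorded checksums are kept (for example in a readme file) is left to your own storage strategy.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so that large data files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_checksum):
    """Return True if the file still matches the checksum recorded earlier."""
    return sha256_of(path) == recorded_checksum
```

A mismatch means the stored copy has changed since the checksum was recorded, whether through corruption or editing, and should be compared against a back-up.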

Case Studies and Examples

Data Backup and Storage

A research team carrying out coral reef research collects field data using handheld Personal Digital Assistants (PDAs). Digital data are transmitted daily to the institution’s network drive, where they are held in password-protected files. All data files are identified by an individual version number and creation date. Version information (version numbers and notes detailing differences between versions) is stored in a spreadsheet, also on the network drive. The institution’s network drive is fully backed-up onto Ultrium LTO2 data tapes. Incremental back-ups are made daily Monday to Thursday; full server back-ups are made from Friday to Sunday. Tapes are securely stored in a separate building. Upon completion of the research the data are deposited in the institution’s digital repository.

Survey of Anglo Welsh Dialects

In February 2008 the British Library (BL) received the recorded output of the Survey of Anglo-Welsh Dialects (SAWD), carried out by University College, Swansea, between 1969 and 1995. This survey recorded the English spoken in Wales by interviewing and tape-recording elderly speakers on topics including the farm and farming, the house and housekeeping, nature, animals, social activities and the weather. The collection was deposited in the form of 503 digital audio files, which were accessioned as .wav files in the BL’s Digital Library. Digital clones of all files are held at the Archive of Welsh English, alongside the original master recordings on 151 audio cassettes, from which the digital copies were created.

The BL’s Digital Library is mirrored on four sites – at Boston Spa, St Pancras, Aberystwyth and a ‘dark’ archive which is provided by a third party. Each of these servers has inbuilt integrity checks. The BL makes available access copies for users, in the form of .mp3 audio files, in the British Library Reading Rooms via the Soundserver system. A small set of audio extracts from the SAWD recordings is also available online on the BL’s Accents and Dialects web site, Sounds Familiar.


Sharing

 

Research data is a valuable resource, requiring a lot of time, money and effort to produce. Data often has a significant value beyond the original research. Sharing data:

  • encourages scientific enquiry and debate and promotes innovation and potential new data uses
  • leads to new collaborations between data users and data creators
  • maximises transparency and accountability and enables scrutiny of research findings
  • encourages the improvement and validation of research methods and reduces the cost of duplicating data collection
  • increases the impact and visibility of research, promotes the research that created the data and its outcomes and can provide a direct credit to the researcher as a research output in its own right
  • provides important resources for education and training

Many funders have adopted research data sharing policies and require researchers to share data and outputs. Journals increasingly require that the data forming the basis of a publication be shared or deposited in an accessible database or repository.

Consent and Confidentiality

Many researchers at the start of their career believe that the best way of handling confidential data is to destroy it. This is usually completely unnecessary and can invalidate research outputs, including theses.  Even personal and confidential data can be openly shared provided researchers have taken care to observe the law and obtain the right level of consent that takes into account plans for dissemination.

  • Make sure to obtain fully informed consent including your plans for dissemination and data sharing
  • Where needed, protect people’s identities by anonymising data, and make sure this is part of your research methodology and costing

Key legislation that may impact on the sharing of confidential data

  • Data Protection Act 1998
  • Freedom of Information Act 2000
  • Human Rights Act 1998

 

In many cases, data obtained from people can be shared while upholding both the letter and the spirit of data protection and research ethics principles:

  • most research data obtained from participants can be successfully shared without breaching confidentiality
  • it is important to distinguish between personal data collected and research data in general
  • data protection laws do not apply to anonymised data
  • personal data should not be disclosed, unless consent has been given for disclosure
  • identifiable information may be excluded from data sharing
  • even personal sensitive data can be shared if suitable procedures, precautions and safeguards are followed, as is done at major data centres

Research Data and Copyright

Researchers, and the institutions for which they work or where they study, typically hold copyright in their data. In the case of collaborative research or derived data, copyright may be held jointly by various researchers or institutions. Secondary users of data must obtain copyright clearance from the rights holder before data can be reproduced. Data can be copied for non-commercial teaching or research purposes without infringing copyright, under the fair dealing concept, providing that the owner of the data is acknowledged. When research data is submitted to a journal, researchers need to verify whether the publisher expects copyright transfer of the data.

 

Examples

  Consent Forms for Workshops

  Parental Consent Form

  Consent Form for Interviews

  Consent Form

 

Good Practice Tips

There are various ways to share research data, including:

  • depositing it in RGU’s institutional repository - OpenAIR
  • depositing it with a specialist data centre, data archive or data bank
  • submitting it to a journal to support a publication
  • making it available online via a project or institutional website
  • making it available informally between researchers on a peer-to-peer basis

Each of these ways of sharing data has advantages and disadvantages: data centres may not be able to accept all data submitted to them; institutional repositories may not be able to afford long-term maintenance of data or support for more complex research data; and websites are often ephemeral with little sustainability. Approaches to data sharing may vary according to research environments and disciplines, due to the varying nature of data types and their characteristics.

Case Study Examples

Recording Images of Participants in a Workshop

RGU is the lead partner in RiCORE, a Horizon 2020 project examining a number of aspects of the consenting process associated with the development of offshore energy installations. As part of the project, the team produced a video in which the partners discuss the aims and achievements of the project. For some sections of the video the film company took footage at one of the expert workshops and included that footage in the video, while the sound track discusses the workshops.

The footage in question includes the name badges and employing organisations of some of the participants. All participants signed a detailed consent form (see the section on consent) which allows the use of their image. However, the project manager queried whether it is acceptable for participants’ names and organisation names to be shown in the video, or whether they should be blurred. The University’s Data Protection Officer advised that it comes down to what the individual delegate’s expectation is, having signed the consent form in which participants were given the option to agree, or not, to their ‘identification as a contributor in reports, publications, written web material, video material, photographs and images’. Where participants have agreed, his advice was that it is not necessary to blur out details on individual badges.

It is a usual expectation that delegates at conferences who have agreed to the use of their image are also agreeing to their organisation’s name or initials being visible in photographs and seminar shots. From a data protection perspective, the University’s Data Protection Officer suggests that one needs to consider whether the use of the delegate’s personal data is ‘fair and lawful’; having the individual’s consent ensures that this requirement is met. In his view, displaying an individual’s organisation’s name or initials (which are not the delegate’s personal data) would not necessarily constitute a breach of privacy or confidentiality, again provided the individual has an expectation that this is likely to be made public.

Copyright and Using Third Party Data

The Stockholm Environment Institute (SEI) has created an integrated spatial database, Social and Environmental Conditions in Rural Areas (SECRA). This contains a wide range of socio-economic and environmental characteristics for all rural Census 2001 Super Output Areas (SOAs) for England. Multiple third party data sources were used, such as Census 2001 data, Land Cover Map data and data from the Land Registry, Environment Agency, Automobile Association, Royal Mail and British Trust for Ornithology. Derived data have been calculated and mapped onto SOAs. The researchers would like to distribute the database for wider use. Whilst the database contains no original third party data, only derived data, there is still joint copyright shared between the SEI and the various copyright holders of the third party data. The researchers have sought permission from all data owners to distribute the data and the copyright of all third party data is declared in the documentation. The database can therefore be distributed.

Copyright of Interviews with ‘elites’

A researcher has interviewed five retired cabinet ministers about their careers, producing audio recordings and full transcripts. The researcher then analyses the data and offers the recordings and transcripts to a data centre for preservation. However, the researcher did not get signed copyright transfers for further use of the interviewees’ words. In this case it would be problematic for a data centre to accept the data. Large extracts of the data cannot be quoted by secondary users. To do so would breach the interviewees’ copyright over their recorded words. This is equally a problem for the primary researcher. The researcher should have asked for transfer of copyright or a licence to use the data obtained through interviews, as the possibility exists that an interviewee may at some point wish to assert the right over their words, e.g. when publishing memoirs.

Copyright of Licensed Data

A researcher subscribes to access spatial AgCensus data from the data centre EDINA (Edinburgh). These data are then integrated with data collected by the researcher. As part of the ESRC research award contract, the data have to be offered for archiving at the UK Data Archive. Can such integrated data be offered? The subscription agreement on accessing AgCensus data states that data may not be transferred to any other person or body without prior written permission from EDINA. Therefore, the UK Data Archive cannot accept the integrated data unless the researcher obtains permission from EDINA. The researcher’s partial data, with the AgCensus data removed, can be archived. Secondary users could then re-combine these data with the AgCensus data, if they were to obtain their own AgCensus subscription.
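The separation described above can be illustrated with a minimal Python sketch. It assumes each row carries a shared key so that the licensed values can be re-attached later. The field names (`grid_ref`, `survey_score`, `cattle_count`) and both function names are hypothetical and are not taken from AgCensus or the UK Data Archive.

```python
def split_for_archiving(rows, licensed_fields, key="grid_ref"):
    """Return copies of the rows with licensed fields removed.

    The join key (a hypothetical 'grid_ref' here) must stay in the archived
    rows, otherwise secondary users could not re-combine the data.
    """
    if key in licensed_fields:
        raise ValueError("the join key must remain in the archived data")
    return [
        {name: value for name, value in row.items() if name not in licensed_fields}
        for row in rows
    ]


def recombine(archived_rows, licensed_rows, key="grid_ref"):
    """Re-attach licensed values that a secondary user obtained separately."""
    licensed_by_key = {row[key]: row for row in licensed_rows}
    return [{**row, **licensed_by_key.get(row[key], {})} for row in archived_rows]
```

A researcher would deposit only the output of `split_for_archiving`; a secondary user holding their own subscription could then call `recombine` with the licensed rows they obtained themselves.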

Copyright of Media Sources

A researcher has collated articles about the Prime Minister from The Guardian over the past ten years, using the LexisNexis newspaper database to source articles. They are then transcribed/copied by the researcher into a database so that content analysis can be applied. The researcher offers a copy of the database together with the original transcribed text to a data centre. Researchers cannot share either of these data sources as they do not have copyright in the original material. A data centre cannot accept these data as to do so would be a breach of copyright. The rights holders, in this case The Guardian and LexisNexis, would need to provide consent for archiving.

Data Sharing and Journals

The Publishing Network for Geoscientific and Environmental Data (PANGAEA) is an open access repository for various journals. Each deposited dataset is given a DOI, a unique and persistent identifier, so that the underlying data can be directly connected to the corresponding article. For example, PANGAEA and the publisher Elsevier have reciprocal linking between research data deposited with PANGAEA and corresponding articles in Elsevier journals.

Nature journals have a policy that requires authors to make data and materials available to readers as a condition of publication, preferably via public repositories. Appropriate discipline-specific repositories are suggested. Specifications regarding data standards, compliance or formats may also be provided.

For example, for research on small molecule crystal structures, authors should submit the data and materials to the Cambridge Structural Database (CSD) as a Crystallographic Information File, a standard file structure for the archiving and distribution of crystallographic information. After publication of a manuscript, deposited structures are included in the CSD, from where bona fide researchers can retrieve them for free. CSD has similar deposition agreements with many other journals.

Data Security and Anonymisation

UK Biobank aims to collect medical and genetic data from 500,000 middle-aged people across the UK in order to create a research resource to study the prevention and treatment of serious diseases. Stringent security, confidentiality and anonymisation measures are in place. UK Biobank holds personal data on recruited patients, their medical records and blood, urine and genetic samples, with data made available to approved researchers. Data or samples provided to researchers never include personal identifying details.

All data and samples are stored anonymously by removing any identifying information. This identifying information is encrypted and stored separately in a restricted access database that is controlled by senior UK Biobank staff. Identifying data and samples are only linked using a code that has no external meaning. Only a few people within UK Biobank have access to the key to the code for re-linking participants’ identifying information with data and samples. All staff sign confidentiality agreements as part of their employment contracts.
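The separation of identifying information from research data can be illustrated with a minimal Python sketch. This is not UK Biobank’s actual system: the field names, the function names and the in-memory stores are assumptions made for illustration, and the encryption and access restrictions described above are indicated only in comments.

```python
import secrets

# Which fields count as identifying is an assumption for this sketch.
IDENTIFYING_FIELDS = {"name", "address", "nhs_number"}


def pseudonymise(records):
    """Split each record into a research part and an identifying part.

    Both parts share a random code that has no external meaning.
    """
    research_store = []  # the part that could go to approved researchers
    identity_store = {}  # in practice: encrypted, restricted-access database
    for record in records:
        code = secrets.token_hex(8)  # random, not derived from the person
        identity_store[code] = {
            k: v for k, v in record.items() if k in IDENTIFYING_FIELDS
        }
        research = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
        research["code"] = code
        research_store.append(research)
    return research_store, identity_store


def relink(research_record, identity_store):
    """Re-attach identifying details; only holders of the key should do this."""
    return {**identity_store[research_record["code"]], **research_record}
```

Because the code is generated randomly rather than derived from a name or date of birth, someone who sees only the research store cannot work back to the individual.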

Sharing Confidential Data

The Biological Records Centre (BRC) is the national custodian of data on the distribution of wildlife in the British Isles. Data are provided by volunteers, researchers and organisations. BRC disseminates data for environmental decision-making, education and research. Data whose publication could present a significant threat to a species or habitat (e.g. nesting location of birds of prey) will be treated as confidential.

The BRC provides access to the data it holds via the National Biodiversity Network Gateway. Standard access controls are as follows:

  • public access to view and download all records at a minimum 10 km² level of resolution, and at higher resolution if the data provider agrees
  • registered users have access to view and download all except confidential records at the 1 km² level of resolution
  • conservation organisations have access to view and download all except confidential records at full resolution with attributes
  • conservation officers in statutory conservation agencies have access to view and download all records, including confidential records at full resolution with attributes
  • records that have been signified as confidential by a data provider will not be made available to the conservation agencies without the consent of the data provider
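The access tiers listed above can be expressed as a small decision function. The Python sketch below is one reading of those rules and not the Gateway’s implementation. It assumes that any user denied a higher tier falls back to the public 10 km level, and it omits the case where a data provider agrees to higher-resolution public access. The role constants and the function name are hypothetical.

```python
# Roles ordered from least to most privileged (names are illustrative).
PUBLIC, REGISTERED, CONSERVATION_ORG, STATUTORY_OFFICER = range(4)


def resolution_for(role, confidential, provider_consents=False):
    """Return the resolution at which a user may view a record."""
    # Statutory officers see everything at full resolution, but confidential
    # records only with the data provider's consent.
    if role == STATUTORY_OFFICER and (not confidential or provider_consents):
        return "full"
    # Conservation organisations see non-confidential records in full.
    if role == CONSERVATION_ORG and not confidential:
        return "full"
    # Registered users (and anyone above them) see non-confidential
    # records at the 1 km level.
    if role >= REGISTERED and not confidential:
        return "1 km"
    # Assumed fallback: the public 10 km level.
    return "10 km"
```

For example, `resolution_for(REGISTERED, confidential=False)` gives the 1 km level, whereas the same user looking at a confidential record is limited to the 10 km level.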

Access Restrictions vs. Open Access

Working with data owners, the Secure Data Service provides researchers with secure access to data that are too detailed, sensitive or confidential to be made available under the standard licences operated by its sister service, the Economic and Social Data Service (ESDS). The service’s security philosophy is based upon training and trust, leading-edge technology, licensing and legal frameworks (including the 2007 Statistics Act), and strict security policies and penalties endorsed by both the ONS and the ESRC. The technical model shares many similarities with the ONS Virtual Microdata Laboratory and the NORC Secure Data Enclave. It is based around a Citrix infrastructure which turns the end user’s computer into a remote terminal. All data processing is carried out on a central secure server; no data travels over the network. Outputs for publication are only released subject to Statistical Disclosure Control checks by trained Service staff.

Secure Data Service data cannot be downloaded. Researchers analyse the data remotely from their home institution at their desktop or in a safe room. The Service provides a ‘home away from home’ research facility with familiar statistical software and MS Office tools to make remote collaboration and analysis secure and convenient.

The clearing-house mechanism established following the Convention on Biological Diversity to promote information sharing has resulted in an exponential increase in openly accessible biodiversity and ecosystem data since 1992.

The Forest Spatial Information Catalogue is a web-based portal, developed by the Center for International Forestry Research (CIFOR), for public access to spatial data and maps. The catalogue holds satellite images, aerial photographs, land usage and forest cover maps, maps of protected areas, agricultural and demographic atlases and forest boundaries. For example, forest cover maps for the entire world, produced by the World Conservation Monitoring Centre in 1997, can be downloaded freely as digital vector data.

The Global Biodiversity Information Facility (GBIF) strives to make the world’s biodiversity data accessible everywhere in the world. The facility holds millions of species occurrence records based on specimens and observations, scientific and common names and classifications of living organisms, and map references for species records. Data are contributed by numerous international data providers. Geo-referenced records can be mapped to Google Earth.