Research Guides: Research Data Management: Describing Data

Documenting Data

Data documentation helps ensure that your data can be understood and interpreted by any user. It should explain how your data was created, the context for the data, the structure of the data and its contents, and any manipulations that have been done to the data. Also see: Guide to Writing a "readme" File.

What's important to document?

Context of data collection
Data collection methodology
Structure and organization of data files
Data validation and quality assurance
Data manipulations performed during analysis, starting from the raw data
Data confidentiality, access and use conditions

Data-level documentation

Variable names and descriptions
Definition of codes and classification schemes
Codes of, and reasons for, missing values
Definitions of specialty terminology and acronyms
Algorithms used to transform data
File format and software used
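
As a rough illustration of data-level documentation, the sketch below keeps variable descriptions, code definitions, and missing-value codes in a small machine-readable codebook and uses it to decode a raw file. The variable names, code values, and sample rows are all hypothetical and chosen only for the example; a real project would use its own.

```python
import csv
import io

# Hypothetical codebook: every variable name, description, and code below
# is an assumption made for illustration, not part of any standard.
CODEBOOK = {
    "resp_id": {"description": "Respondent identifier", "codes": None},
    "sex": {"description": "Self-reported sex", "codes": {"1": "female", "2": "male"}},
    "income": {"description": "Annual household income (USD)", "codes": None},
}

# Hypothetical missing-value codes and the reason each one records.
MISSING_CODES = {"-9": "refused", "-8": "not asked"}


def decode_row(row):
    """Translate raw coded values into labels; missing codes become None."""
    decoded = {}
    for name, value in row.items():
        if value in MISSING_CODES:
            decoded[name] = None
            continue
        codes = CODEBOOK[name]["codes"]
        decoded[name] = codes[value] if codes else value
    return decoded


def undocumented_variables(fieldnames):
    """Return column names that have no entry in the codebook."""
    return [name for name in fieldnames if name not in CODEBOOK]


# Small in-memory stand-in for a raw data file.
RAW = "resp_id,sex,income\n001,1,52000\n002,-9,-8\n"
reader = csv.DictReader(io.StringIO(RAW))
rows = [decode_row(r) for r in reader]
```

Checking column names against the codebook with `undocumented_variables` is one simple way to notice a variable that was added to a file but never described.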

Creating Metadata

Properly describing and documenting data allows users (yourself included) to understand and track important details of the work. In addition to describing data, having metadata about the data also facilitates search and retrieval of the data when deposited in a data repository. In a lab setting, much of the content used to describe data is initially collected in a notebook; metadata is a more formal, sharable expression of this information. Where no appropriate formal metadata standard exists, writing “readme” style metadata is an appropriate strategy.

Metadata is information about data, and describes basic characteristics, such as:

Recommended Metadata Elements

Title	Name of the project or collection of datasets
Creator	Names and institutions of the people who created the data
Date	Key dates associated with the data, such as dates covered by the data or date of creation
Description	Description of the resource
Keywords or subjects	Keywords or subjects describing the content of the data
Identifier	Unique number or alphanumeric string used to identify the data
Coverage (if applicable)	Geographic coverage
Language	Language of the resource
Publisher	Entity responsible for making the dataset available
Funding Agencies	Organization or agency that funded the research
Access restrictions	Where and how your data can be accessed by other researchers
Copyright	Copyright date and type
Format	What format your data is in
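
The elements in the table above can be captured in a simple structured record. The following is a minimal sketch, assuming a plain Python dictionary as the container; the choice of which elements count as required, the sample values, and the helper names `missing_elements` and `to_readme` are all assumptions made for illustration.

```python
import json

# Assumption: this guide does not say which elements are mandatory;
# this subset is picked only to demonstrate a completeness check.
REQUIRED = ["Title", "Creator", "Date", "Description", "Identifier"]

# Hypothetical record; every value here is a placeholder.
record = {
    "Title": "Example stream temperature survey",
    "Creator": "A. Researcher, Example University",
    "Date": "2020-06-01/2020-09-30",
    "Description": "Hourly water temperature readings from three sites",
    "Keywords or subjects": "hydrology; temperature",
    "Identifier": "example-dataset-0001",
    "Language": "English",
    "Format": "CSV",
}


def missing_elements(rec, required=REQUIRED):
    """List required elements that are absent or blank in the record."""
    return [name for name in required if not str(rec.get(name, "")).strip()]


def to_readme(rec):
    """Render the record as aligned 'Element : value' lines for a readme file."""
    width = max(len(key) for key in rec)
    return "\n".join(f"{key.ljust(width)} : {value}" for key, value in rec.items())


# The same record can be stored as JSON for machine use
# and as plain text for human readers.
as_json = json.dumps(record, indent=2)
as_text = to_readme(record)
```

Keeping one record and rendering it two ways avoids the readme and the machine-readable copy drifting apart.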

Metadata Redefined

Well-structured metadata not only supports the long-term discovery and preservation of research data, but also allows for the aggregation and simultaneous searching of research data from tens, hundreds, or thousands of researchers. This is why domain-specific repositories typically require highly structured metadata with your data submissions: it enables highly granular searches on their aggregated content. This in turn makes your data easier to find.

Metadata can take many different forms, from free text to standardized, structured, machine-readable, extensible content. Specific disciplines, repositories or data centers may guide or even dictate the content and format of metadata, possibly using a formal standard. Because creating standardized metadata can be difficult and time-consuming, another consideration when selecting a standard is the availability of tools that can help generate the metadata (e.g., Morpho allows for easy creation of EML, and Nesstar supports DDI data).

The Digital Curation Centre provides a catalog of common metadata standards, organized by discipline: http://www.dcc.ac.uk/resources/metadata-standards.

Some specific examples of metadata standards, both general and domain-specific, are:

Dublin Core - domain agnostic, basic and widely used metadata standard
DDI (Data Documentation Initiative) - common standard for social, behavioral and economic sciences, including survey data
EML (Ecological Metadata Language) - specific for ecology disciplines
ISO 19115 and FGDC-CSDGM (Federal Geographic Data Committee's Content Standard for Digital Geospatial Metadata) - for describing geospatial information
MINSEQE (Minimum Information about a high-throughput SEQuencing Experiment) - Genomics standard
FITS (Flexible Image Transport System) - Astronomy digital file standard that includes structured, embedded metadata
MIBBI - Minimum Information for Biological and Biomedical Investigations
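
To give a sense of what standardized, machine-readable metadata looks like, here is a minimal sketch that writes a few Dublin Core elements as XML using Python's standard library. The namespace URI is the Dublin Core element set namespace; the `metadata` wrapper element, the `to_dublin_core` helper, and the sample values are assumptions for illustration. A repository will normally specify its own wrapper and required fields.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dublin_core(fields):
    """Serialize (element, value) pairs as Dublin Core XML.

    The 'metadata' root element is a hypothetical wrapper chosen for
    this sketch; it is not mandated by Dublin Core.
    """
    root = ET.Element("metadata")
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values; the element names are drawn from the
# fifteen core Dublin Core elements.
xml_text = to_dublin_core([
    ("title", "Example stream temperature survey"),
    ("creator", "A. Researcher"),
    ("date", "2020-09-30"),
    ("subject", "hydrology"),
    ("language", "en"),
])
```

Because elements such as `subject` are repeatable in Dublin Core, the helper takes a list of pairs rather than a dictionary, so the same element name can appear more than once.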

Metadata Tools

Annotare
Forms-based software for annotating biomedical investigations and the resulting data. It supports biomedical ontologies, contains standard templates for common experimental types, and includes a design wizard for creating your own forms.
ISA Creator
An open source, stand-alone application that assists with planning and describing experiments and facilitates export and import of data directly to and from some public repositories. Additional tools are available in the ISA-Tools software suite for parsing ISA-Tab into R data structures, along with Perl and Python parsers for ISA-Tab. ISA-Tab is the required format for publishing data in Nature Publishing's Scientific Data journal. This software creates separate descriptive files for your experimental files.
Morpho
A tool for describing ecological experiments and creating a catalog of data and descriptions that you can query. It includes an interface to the Knowledge Network for Biocomplexity (KNB) for sharing, querying, viewing, and retrieving data.
OMERO
Repository software for importing, viewing, organizing, describing, analyzing, and sharing microscopy images from anywhere you have Internet access. It includes the ability to create user groups with different permissions for sharing data.
OntoMaton
Ontology searching and automated tagging via NCBO's Bioportal of biomedical ontologies within Google spreadsheets. OntoMaton is part of the ISA-Tools suite. Annotations are generated within your tabular data file.
RightField
Open source tool that allows searching and selecting of ontology terms from within Microsoft Excel. RightField allows you to assign a pre-determined list of options to a particular cell within the spreadsheet. All annotations are embedded within the spreadsheet. The user can select from the NCBO's BioPortal ontologies or import an ontology from a URL or your local machine.