A crucial step in the data workflow is preparing the data for analysis. This process includes data cleaning, where errors in the data are identified, corrected, or standardized for consistent formatting. It is essential to approach this step with the same level of care and attention to reproducibility as the subsequent analysis.
OpenRefine (formerly Google Refine) is a free, open-source tool designed for handling messy data. It facilitates data cleaning and transformation from one format to another.
In this section, you will learn:
OpenRefine can be used to efficiently clean and format data while automatically tracking any changes made. Users have expressed that this tool saves them months of manual work required for such edits.
The data for this lesson is a part of the Data Carpentry Social Sciences workshop. It is a teaching version of the Studying African Farmer-Led Irrigation (SAFI) database. The SAFI dataset represents interviews of farmers in two countries in eastern sub-Saharan Africa (Mozambique and Tanzania). These interviews were conducted between November 2016 and June 2017 and probed household features (e.g. construction materials used, number of household members), agricultural practices (e.g. water usage), and assets (e.g. number and types of livestock).
The data used in this lesson is a subset of the teaching version that has been intentionally ‘messed up’ for this lesson.
Download the data file to your computer.
OpenRefine is an open-source software primarily used for working with messy data. This Java-based tool is capable of data cleaning and data transformation. You can download OpenRefine on Windows.
openrefine.exe
. This will launch a command prompt window, but you can ignore that and wait for the browser to launch.Exiting OpenRefine
OpenRefine is an open-source software primarily used for working with messy data. This Java-based tool is capable of data cleaning and data transformation. You can download OpenRefine on macOS.
Exiting OpenRefine
OpenRefine is an open-source software primarily used for working with messy data. This Java-based tool is capable of data cleaning and data transformation. You can download OpenRefine on Linux.
./refine
into the terminal within the OpenRefine directory.Exiting OpenRefine
csv - file extension indicating that a text file that has values separated by commas (comma-separated-values)
clustering - method for finding different groups of values that may actually be representing the same thing
faceting - method for exploring the values in a variable. In this episode it is used to explore the values in order to identify errors in data entry
filter - t o select a subset of data from a dataframe
JSON - file extension indicating that the values in a text file are structured using JavaScript Object Notation (JSON)
RDF - file that extension indicating that the values in a file are structured using Resource Description Framework (RDF)
regular expressions (regex) - text string for describing a search pattern. They usually incorporate the use of wildcards to match letters, numbers, punctuation, spacing, or some combination
tsv - file extension indicating that a text file that has values separated by tabs (tab-separated-values)
xls - file extension indicating that a file is a spreadsheet created by Microsoft Excel
xlsx - file extension indicating that a file is a spreadsheet created by Microsoft Excel using XML
XML - file extension indicating that the values in a file are structured using Extensible Markup Language (XML)
Off-campus? Please use the Software VPN and choose the group UCIFull to access licensed content. For more information, please Click here
Software VPN is not available for guests, so they may not have access to some content when connecting from off-campus.