DPDS Social Sciences Workshop Resources


Summary

A crucial step in the data workflow is preparing the data for analysis. This process includes data cleaning, where errors in the data are identified, corrected, or standardized for consistent formatting. It is essential to approach this step with the same level of care and attention to reproducibility as the subsequent analysis.

OpenRefine (formerly Google Refine) is a free, open-source tool designed for handling messy data. It facilitates data cleaning and transformation from one format to another.

In this section, you will learn to:

  • create, export and import a project in OpenRefine
  • view and work on subsets of rows using facets and text filters
  • reduce variations in data through clustering, bulk editing and transformations
  • undo and redo actions and export the history of actions
  • save cleaned data in a widely supported file format

OpenRefine cleans and formats data efficiently while automatically tracking every change made, so each cleaning step can be reviewed, undone, or repeated. Users have reported that it saves them months of manual editing.
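Because the history of actions can be exported, the cleaning steps can also be inspected outside OpenRefine. The following is a minimal sketch, assuming the export is a JSON array of operation objects that each carry an `op` identifier and a `description`. The sample text, the column name `village`, and the helper `summarise_history` are made up for illustration, and the exact fields may differ between OpenRefine versions.

```python
import json

# Made-up sample of an exported history; the field names are assumptions.
sample_history = """
[
  {"op": "core/text-transform",
   "columnName": "village",
   "expression": "value.trim()",
   "description": "Text transform on cells in column village"},
  {"op": "core/mass-edit",
   "columnName": "village",
   "description": "Mass edit cells in column village"}
]
"""


def summarise_history(text):
    """Return (operation, description) pairs from an exported history."""
    operations = json.loads(text)
    return [
        (step.get("op", "unknown"), step.get("description", ""))
        for step in operations
    ]


# Print a numbered list of the recorded steps.
for number, (op, description) in enumerate(summarise_history(sample_history), start=1):
    print(f"{number}. {op}: {description}")
```

Keeping such a file next to the cleaned data documents how the data was changed, which supports the reproducibility goal described above.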

Data

The data for this lesson is a part of the Data Carpentry Social Sciences workshop. It is a teaching version of the Studying African Farmer-Led Irrigation (SAFI) database. The SAFI dataset represents interviews of farmers in two countries in eastern sub-Saharan Africa (Mozambique and Tanzania). These interviews were conducted between November 2016 and June 2017 and probed household features (e.g. construction materials used, number of household members), agricultural practices (e.g. water usage), and assets (e.g. number and types of livestock).

The data used in this lesson is a subset of the teaching version that has been intentionally ‘messed up’ for this lesson.

Download the data file to your computer.

Software

OpenRefine is open-source software used primarily for working with messy data. This Java-based tool supports both data cleaning and data transformation. The steps below install OpenRefine on Windows.

  • Check that you have Firefox, Edge, Opera or Chrome browsers installed and set as your default browser. OpenRefine runs in your default browser. It will not run correctly in Internet Explorer.
  • Download the software from openrefine.org.
  • Unzip the downloaded file into a directory by right-clicking and selecting “Extract…”. Name that directory something like OpenRefine.
  • Go to your newly created OpenRefine directory.
  • Launch OpenRefine by opening openrefine.exe. This will launch a command prompt window, but you can ignore that and wait for the browser to launch.
  • If you see Internet Explorer start, or OpenRefine does not automatically open for you, point one of the supported browsers at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program.
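If the browser does not open, it can help to confirm that the OpenRefine server is answering at all before troubleshooting further. The following is a minimal sketch using only the Python standard library. The function name `openrefine_is_running` is hypothetical, and the check only shows that something responds at the address, not that it is OpenRefine.

```python
import urllib.error
import urllib.request

# Default local address given in the instructions above.
OPENREFINE_URL = "http://127.0.0.1:3333/"


def openrefine_is_running(url: str = OPENREFINE_URL, timeout: float = 2.0) -> bool:
    """Return True if something answers HTTP requests at `url`."""
    # The server is local, so bypass any system-wide proxy settings.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # A server answered with an error status, so it is still running.
        return True
    except OSError:
        # Connection refused, timeout, and similar failures
        # (URLError is a subclass of OSError).
        return False


if __name__ == "__main__":
    if openrefine_is_running():
        print(f"A server is responding at {OPENREFINE_URL}")
    else:
        print(f"Nothing is responding at {OPENREFINE_URL}; has OpenRefine been started?")
```

The same check works on any operating system, because OpenRefine uses the same local address on each.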

Exiting OpenRefine

  • To exit OpenRefine, close all the browser tabs or windows, then navigate to the command line window.
  • To close this window and ensure OpenRefine exits properly, hold down [control] and press [c] on your keyboard. This will save all changes to your projects.
  • Remember, it’s important to close the browser window or tab first to ensure you’re not actively using OpenRefine before stopping the server. This prevents any unsaved changes from being lost.
  • After stopping the server, you can safely exit the terminal or command prompt window.

The steps below install OpenRefine on macOS.

  • Check that you have Firefox, Edge, Opera or Chrome browsers installed and set as your default browser. OpenRefine runs in your default browser. It will not run correctly in Internet Explorer.
  • Download the software from openrefine.org.
  • Unzip the downloaded file into a directory by double-clicking it. Name that directory something like OpenRefine.
  • Go to your newly created OpenRefine directory.
  • Drag the OpenRefine app into the Applications folder.
  • Launch OpenRefine: Control-click the app icon, then choose “Open” from the shortcut menu. For troubleshooting help, see the Apple support page.
  • If you are using a different browser than listed above, or if OpenRefine does not automatically open for you, point your browser at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program.

Exiting OpenRefine

  • To exit OpenRefine, close all the browser tabs or windows, then navigate to the command line window.
  • To close this window and ensure OpenRefine exits properly, hold down [control] and press [c] on your keyboard. This will save all changes to your projects.
  • Remember, it’s important to close the browser window or tab first to ensure you’re not actively using OpenRefine before stopping the server. This prevents any unsaved changes from being lost.
  • After stopping the server, you can safely exit the terminal or command prompt window.

The steps below install OpenRefine on Linux.

  • Check that you have Firefox or Chrome browsers installed and set as your default browser. OpenRefine runs in your default browser.
  • Download the software from openrefine.org.
  • Unzip the downloaded file into a directory. Name that directory something like OpenRefine.
  • Go to your newly created OpenRefine directory.
  • Launch OpenRefine by typing ./refine into the terminal within the OpenRefine directory.
  • If you are using a different browser than listed above, or if OpenRefine does not automatically open for you, point your browser at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program.

Exiting OpenRefine

  • To exit OpenRefine, close all the browser tabs or windows, then navigate to the command line window.
  • To close this window and ensure OpenRefine exits properly, hold down [control] and press [c] on your keyboard. This will save all changes to your projects.
  • Remember, it’s important to close the browser window or tab first to ensure you’re not actively using OpenRefine before stopping the server. This prevents any unsaved changes from being lost.
  • After stopping the server, you can safely exit the terminal or command prompt window.

Glossary

csv - file extension indicating that a text file has values separated by commas (comma-separated values)

clustering - method for finding groups of different values that may actually represent the same thing

faceting - method for exploring the values in a variable. In this episode it is used to explore the values in order to identify errors in data entry

filter - to select a subset of data from a dataframe

JSON - file extension indicating that the values in a text file are structured using JavaScript Object Notation (JSON)

RDF - file extension indicating that the values in a file are structured using Resource Description Framework (RDF)

regular expressions (regex) - text strings that describe a search pattern. They usually incorporate wildcards to match letters, numbers, punctuation, spacing, or some combination of these

tsv - file extension indicating that a text file has values separated by tabs (tab-separated values)

xls - file extension indicating that a file is a spreadsheet created by Microsoft Excel

xlsx - file extension indicating that a file is a spreadsheet created by Microsoft Excel using XML

XML - file extension indicating that the values in a file are structured using Extensible Markup Language (XML)
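
To make the faceting and clustering entries more concrete, here is a small Python sketch of both ideas. It is loosely modelled on key-collision clustering and is not OpenRefine's actual implementation. The village values and the helper names `fingerprint` and `cluster` are made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Made-up column values with typical data-entry variations.
values = ["Chirodzo", "chirodzo ", "Chirodzo.", "God", "Ruaca", "ruaca", "Ruaca"]

# Faceting: count how often each distinct value occurs.
facet = Counter(values)


def fingerprint(value: str) -> str:
    """Reduce a value to a normalised key so that near-duplicates collide."""
    text = value.strip().lower()
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    tokens = sorted(set(text.split()))   # unique, ordered tokens
    return " ".join(tokens)


def cluster(items):
    """Group distinct values that share the same fingerprint key."""
    groups = defaultdict(set)
    for item in items:
        groups[fingerprint(item)].add(item)
    # Only keys shared by more than one distinct value form a cluster.
    return [sorted(group) for group in groups.values() if len(group) > 1]


for value, count in facet.most_common():
    print(f"{value!r}: {count}")

for group in cluster(values):
    print("Possible variants of one value:", group)
```

The facet shows every distinct spelling with its count, which is how entry errors become visible. The clusters suggest which spellings could be merged in bulk, but a person should still confirm each merge.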