DPDS Social Sciences Workshop Resources


Summary

Good data organization is the foundation of any research project. Most researchers begin their projects with data stored in spreadsheets. Computers, however, have specific requirements for how data is organized. To use tools that make analysis more efficient, such as the programming languages R or Python, researchers need to structure their data in a form that computers can read.

In this section, you will learn:

  • good data entry practices - formatting data tables in spreadsheets
  • how to avoid common formatting mistakes
  • approaches for handling dates in spreadsheets
  • basic quality control and data manipulation in spreadsheets
  • exporting data from spreadsheets

Much of your time as a researcher will be spent on the initial 'data wrangling' stage, in which you organize the data so that it can be analyzed properly later on. Learning strategies for effective data organization can help you improve the formatting of existing data and plan new data collection so that wrangling takes less time.

Data

You need to download some files to follow this lesson:

1. Download the following three files:

2. Place these three files in a folder you can easily find and access on your computer.

Software

Microsoft provides Microsoft Office 365 ProPlus to UC Irvine students at no cost through a staff campus agreement program (MCCA). Under this agreement, current students can install the latest full version of Microsoft Office on their personally owned computers, smartphones, and tablets, and receive 1 TB of OneDrive cloud storage.

Full-time faculty and staff whose departments are enrolled in MCCA licensing are eligible to install Microsoft Office on their personal devices and computers. Note that all installations create an ongoing financial responsibility for your department. Your license may be revoked for non-payment.

To install Microsoft Excel on your personal computing device through UCI, check out the Office of Information Technology's Microsoft 365 page.

If you do not have or do not want to use Microsoft Excel, you can use LibreOffice. It is a free, open-source office suite, and its spreadsheet program is called Calc.

Windows

  • Install LibreOffice by going to the installation page. The version for Windows should automatically be selected. Click Download Version X.X.X (whichever is the most recent version).
  • Once the installer is downloaded, double-click it and follow the prompts to install LibreOffice.

macOS

  • Install LibreOffice by going to the installation page. The version for Mac should automatically be selected. Click Download Version X.X.X (whichever is the most recent version).
  • Once the installer is downloaded, double-click it and follow the prompts to install LibreOffice.

Linux

  • Install LibreOffice by going to the installation page. The version for Linux should automatically be selected. Click Download Version X.X.X (whichever is the most recent version).
  • Once the installer is downloaded, double click on it and LibreOffice should install.
  • Package manager option (these commands need administrator privileges, hence sudo):
    • pacman (Arch): sudo pacman -S libreoffice
    • yum (Fedora, CentOS): sudo yum install libreoffice
    • apt (Debian, Ubuntu): sudo apt install libreoffice

Glossary

cleaned data - data that has been manipulated post-collection to remove errors or inaccuracies, introduce desired formatting changes, or otherwise prepare the data for analysis

conditional formatting - formatting that is applied to a specific cell or range of cells depending on a set of criteria

CSV (comma separated values) format - a plain text file format in which values are separated by commas

factor - a variable that takes on a limited number of possible values (i.e. categorical data)

metadata - data which describes other data

null value - a value used to record observations missing from a dataset

observation - a single measurement or record of the object being recorded (e.g. the weight of a particular mouse)

plain text - unformatted text

quality assurance - any process which checks data for validity during entry

quality control - any process which removes problematic data from a dataset

raw data - data that has not been manipulated and represents actual recorded values

rich text - formatted text (e.g. text that appears bolded, colored or italicized)

string - a collection of characters (e.g. “thisisastring”)

TSV (tab separated values) format - a plain text file format in which values are separated by tabs

variable - a category of data being collected on the object being recorded (e.g. a mouse’s weight)
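
Several of the glossary terms can be illustrated together with a minimal Python sketch that uses only the standard library. The mouse table, its column names (mouse_id, sex, weight_g), and the choice of an empty field as the null value are assumptions made for illustration; they are not taken from the lesson's data files.

```python
import csv
import io

# Hypothetical CSV data: each row is one observation (one mouse) and
# each column is one variable. The empty weight for M02 is a null value.
csv_text = (
    "mouse_id,sex,weight_g\n"
    "M01,F,21.4\n"
    "M02,M,\n"
    "M03,F,19.8\n"
)

# Read the plain text as CSV; every value arrives as a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Convert weights to numbers, keeping the null value as None
# instead of inventing a number for it.
weights = [float(r["weight_g"]) if r["weight_g"] != "" else None for r in rows]

# A factor takes a limited number of values; collect the levels of "sex".
sex_levels = sorted({r["sex"] for r in rows})

# Write the same table as TSV: only the delimiter changes.
tsv_buffer = io.StringIO()
writer = csv.DictWriter(
    tsv_buffer,
    fieldnames=["mouse_id", "sex", "weight_g"],
    delimiter="\t",
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)
tsv_text = tsv_buffer.getvalue()

print(weights)
print(sex_levels)
print(tsv_text)
```

Because CSV and TSV are plain text, the same table can be opened in Excel, LibreOffice Calc, R, or Python without losing information, although any rich text formatting applied in a spreadsheet is not stored in these formats.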