Research Guides: Social Science Data: Getting Started

Connect from off campus

Researching from home? Remote access to the UC Irvine Libraries' licensed online resources is available to current UC Irvine students, faculty & staff. Visit our Connect from Off-Campus page for more information!

Social Science Librarians

Do you want to connect with the librarian who covers your discipline or topic? Or view the research guide that supports your discipline?

Melissa Beuoy
Research Librarian for Interdisciplinary Studies
Melissa supports Education, Gender Studies, Global & International Studies, DASA, and Language Science.
Annette Buckley
Research Librarian for Business and Economics
Annette supports Economics and Business.
Nicole Carpenter
Research Librarian for Social Sciences
Nicole supports Anthropology, Psychological & Cognitive Sciences, and Sociology.
Elizabeth V. Hernandez
Research Librarian for Criminology and Political Science
Elizabeth supports Criminology, Law, & Society, Political Science, Urban Planning & Public Policy, and Chicano/Latino Studies.

Data, Publishing, and Digital Scholarship

Data, Publishing, and Digital Scholarship (DPDS)
DPDS fosters the use of digital content and transformative technology in scholarship and academic activities. They provide consultative and technical support for a wide range of tools and platforms. They work with the campus community to publish, promote, and preserve the digital products of research through consultation, teaching, and systems administration. Their areas of expertise include data curation, research data management, computational research, digital humanities, and scholarly communication.

Data vs. statistics

Data and Statistics are related, but not exactly the same. Understanding the difference is helpful as you conduct your search!

Here is an over-simplified explanation of the difference:

Data

Data is the raw material produced by research, administrative record-keeping, scientific instruments, or other collection methods.
Most often: a row represents a record/observation/case and a column represents a variable.
When dealing with data, the unit of analysis is important: What was being studied? What is represented in each row? A person? A household? A country?
There are also different types of variables. Sometimes a number is just as it seems, but often a number is a code that stands for something.
- For example, in the Age column, "40" means that person is age 40. But in the "Gender" column, maybe "1" means "female" and "2" means "male."
Data is designed to be read by a machine. For a human to make sense of a dataset, we need documentation or a codebook, which tells us what the rows represent, what the variable names mean, what the codes mean, how the data was collected, and everything else we need to understand the data.
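
To make the idea of coded values concrete, here is a minimal Python sketch. The three records and the codebook are made up for illustration (they are not from a real dataset), and the coding follows the hypothetical example above, where "1" means "female" and "2" means "male." The function name `decode` is also just an illustrative choice.

```python
# Hypothetical dataset: each row is one person (the unit of analysis),
# and each key is a variable (a column).
rows = [
    {"age": 40, "gender": 1},
    {"age": 27, "gender": 2},
    {"age": 63, "gender": 1},
]

# Hypothetical codebook: says what each code in a column stands for.
# "age" has no entry because its numbers mean exactly what they say.
codebook = {"gender": {1: "female", 2: "male"}}


def decode(row, codebook):
    """Return a copy of the row with coded values replaced by their labels."""
    return {
        variable: codebook.get(variable, {}).get(value, value)
        for variable, value in row.items()
    }


decoded = [decode(row, codebook) for row in rows]
print(decoded[0])  # {'age': 40, 'gender': 'female'}
```

Without the codebook, nothing in the data itself tells you whether "1" means "female," "male," or something else entirely.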

Statistics

Statistics are an aggregated description of a dataset; they interpret or summarize the dataset.
Statistics are produced by some kind of analysis of a dataset.
When you find statistics, they're usually "ready to go" -- you can use them immediately in their current form. They're made to be read by humans!
Statistics usually appear as a single number, a table, or a chart.

An example:

According to the United States Census Bureau, 50.5% of the population of the United States is female: this is a statistic.

The statistic mentioned above was calculated from the Census Bureau's Decennial Census SF1 dataset, which has 311,591,917 cases/rows—one representing each person in the U.S.—each of which has an entry in the Sex variable/column.

Conveniently, someone at the Census Bureau analyzed the SF1 dataset to produce the statistic above; if that's what you were looking for, then great! However, if you wanted to do your own analysis, for example, examining Sex, Occupation, and Age (and there wasn't another statistic available to tell you what you wanted to know), then you'd need to use the dataset.
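
As a rough illustration of the difference, the Python sketch below computes one statistic from a tiny dataset and then runs a custom analysis that a published statistic might not cover. The five records are invented for this example and have nothing to do with actual Census data; the function names `percent_female` and `count_by` are likewise only illustrative.

```python
from collections import Counter

# Hypothetical dataset: one row per person.
people = [
    {"sex": "female", "age": 34, "occupation": "teacher"},
    {"sex": "male", "age": 51, "occupation": "nurse"},
    {"sex": "female", "age": 45, "occupation": "nurse"},
    {"sex": "male", "age": 29, "occupation": "teacher"},
    {"sex": "female", "age": 62, "occupation": "engineer"},
]


def percent_female(rows):
    """A single summary statistic: the share of rows where sex is female."""
    females = sum(1 for row in rows if row["sex"] == "female")
    return 100 * females / len(rows)


def count_by(rows, *variables):
    """A custom analysis: count rows for each combination of values."""
    return Counter(tuple(row[v] for v in variables) for row in rows)


print(percent_female(people))  # 60.0

# Something a ready-made statistic may not tell you:
# how many people fall into each sex/occupation combination.
table = count_by(people, "sex", "occupation")
print(table[("female", "nurse")])  # 1
```

The first function produces a statistic you could quote directly; the second is the kind of analysis that requires access to the dataset itself.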

Types of data

Aggregate/Macro Data vs. Microdata

Aggregate or Macro Data are higher-level data that have been compiled from smaller units of data. For example, the census data that you find in the American Community Survey have been aggregated to preserve the confidentiality of individual respondents. Microdata contain individual cases, usually individual people, or in the case of census data, individual households. The Public Use Microdata Sample (PUMS) for the Census provides access to the actual survey data from the Census, but eliminates information that would identify individuals.
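
The relationship between the two can be sketched in a few lines of Python. The household records, county labels, and the function name `aggregate_by_county` below are all hypothetical; the point is only that aggregate data are produced by summarizing microdata, and that the individual cases are no longer visible afterward.

```python
from collections import defaultdict

# Hypothetical microdata: one row per household.
households = [
    {"county": "A", "household_size": 3},
    {"county": "A", "household_size": 1},
    {"county": "B", "household_size": 4},
    {"county": "B", "household_size": 2},
    {"county": "B", "household_size": 2},
]


def aggregate_by_county(rows):
    """Collapse household-level rows into one summary per county."""
    totals = defaultdict(lambda: {"households": 0, "people": 0})
    for row in rows:
        county_total = totals[row["county"]]
        county_total["households"] += 1
        county_total["people"] += row["household_size"]
    return dict(totals)


aggregate = aggregate_by_county(households)
print(aggregate["B"])  # {'households': 3, 'people': 8}
```

You can always go from microdata to aggregate data, but not the other way around: from the county totals alone, there is no way to recover the size of any single household.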

Data Sets, Studies, and Series

In data archives like ICPSR, a data set or study is made up of the raw data file and any related files, usually the codebook and setup files. The codebook is your guide to making sense of the raw data. For survey data, the codebook usually contains the actual questionnaire and the values for the responses to each question. The setup files help the data display properly.

ICPSR uses the term series to describe collections of studies that have been repeated over time. For example, the National Health Interview Survey is conducted annually. In the ICPSR archive, you will find a description of the series that provides an overview. You will also find individual descriptions of each study (e.g., National Health Interview Survey, 2004). The study number in ICPSR refers to the individual survey.

Types of Data

Cross-Sectional describes data that are only collected once.

Time Series data track the same variable over time. The National Health Interview Survey is an example of time series data because the questions generally remain the same over time, but the individual respondents vary.

Longitudinal Studies describe surveys that are conducted repeatedly, in which the same group of respondents is surveyed each time. This allows for examining changes over the life course. The Project on Human Development in Chicago Neighborhoods (PHDCN) Series contains a longitudinal component that tracks changes in the lives of individuals over time through interviews.
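
One practical way to tell a longitudinal study from a survey that is simply repeated with new respondents is to check whether the same respondent IDs appear in every wave. The Python sketch below does this with made-up ID numbers; the function name `same_respondents` is hypothetical. It is a simplification: real longitudinal studies usually lose some respondents between waves, so an exact match is stricter than what you would see in practice.

```python
def same_respondents(waves):
    """True if every wave surveyed exactly the same set of respondent IDs."""
    id_sets = [set(wave) for wave in waves]
    return all(ids == id_sets[0] for ids in id_sets)


# Hypothetical respondent IDs for two waves of two different studies.
longitudinal_waves = [[101, 102, 103], [103, 101, 102]]
repeated_survey_waves = [[101, 102, 103], [204, 205, 206]]

print(same_respondents(longitudinal_waves))     # True
print(same_respondents(repeated_survey_waves))  # False
```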

Strategies for finding data

Finding datasets and statistics can be time-consuming. Here are some strategies to help you zero in on the right data.

Think about who might collect the data: Could it have been collected by a government agency? A nonprofit/nongovernmental organization? A private business or industry group? Academic researchers?
Browse the tabs on this guide to learn about many of the data sources available at UCI and online: Keep in mind that what is described on this guide is usually at the resource level, rather than at the dataset level or the variable level. Think about how your topic might be classified. For example, you won't find any mention of firefighters on this guide, but you will find the Bureau of Labor Statistics, which provides data about people in different professions, including fire safety.
Search the literature: This will give you clues about which data sets others are using to investigate your topic, and even if no specific data sets are mentioned, you might learn about the organizations, government entities, or others that are likely to collect related data. A great resource is the ICPSR Bibliography of Data-Related Literature, which allows you to search for articles and other publications first, and then link directly to the datasets used by those articles. If you find useful, but dated, statistics in an article, look at the source and then go to the source's website to find updated numbers.

Things to consider

Who cares about this information? Data cost a lot to collect. Who cares enough about the information to collect it? Some of the most likely data collectors include governments, marketers, trade groups, and advocacy associations. Depending on your subject area, finding useful data can be very challenging. However, many organizations that collect and share data also provide some sort of analysis of that data. For example, the World Bank collects and shares data, but also provides robust analysis in fact sheets, reports, publications, etc.

Here are a few things to think about when trying to find data/statistics:

- The most recent data may not be from this year. Because data takes time and money to collect, analyze, and disseminate, the most recent datasets or statistics may be a few years old. Some data are collected only once, and some only every 10 years.
- Follow the trail. Finding statistics can sometimes be an exercise in detective work. If you find a publication with useful statistics, look at the source of the statistics. If the article cites a source, e.g., the CDC or Pew Research, consult that source. It may provide additional statistics, updated statistics, or context that wasn't referenced in the article.
- Evaluate the source. As with all information, you should evaluate the source providing the analysis. Are they biased? Is the group or website reliable? Do they provide access to the data that the statistic came from?
- Read the statistic carefully. Pay close attention to any information provided about how the underlying data were collected and how the statistic was calculated. You don't want to misrepresent the statistic or its significance in your own writing.

Attribution and expression of gratitude

The information on this page has been adapted from library colleagues at NYU and UCSD. We are grateful for their efforts in thinking about how to convey this information and composing it.