Data and Statistics are related, but not exactly the same. Understanding the difference is helpful as you conduct your search!
Here is an over-simplified explanation of the difference:
According to the United States Census Bureau, 50.5% of the population of the United States is female: this is a statistic.
The statistic mentioned above was calculated from the Census Bureau's Decennial Census SF1 dataset, which has 311,591,917 cases/rows—one representing each person in the U.S.—each of which has an entry in the Sex variable/column.
Conveniently, someone at the Census Bureau analyzed the SF1 dataset to produce the statistic above; if that's what you were looking for, then great! However, if you wanted to do your own analysis, for example, examining Sex, Occupation, and Age (and there wasn't another statistic available to tell you what you wanted to know), then you'd need to use the dataset.
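To make the distinction concrete, here is a minimal Python sketch. The five-row `people` dataset and the `percent_with_value` helper are hypothetical and invented for illustration only; they are not Census Bureau data or tools. The sketch computes a single statistic from a dataset, then runs a custom analysis that no published statistic might cover.

```python
from collections import Counter

# Hypothetical toy dataset (NOT real census data): each dict is one case (row),
# and each key is a variable (column).
people = [
    {"sex": "F", "age": 34, "occupation": "teacher"},
    {"sex": "M", "age": 51, "occupation": "nurse"},
    {"sex": "F", "age": 27, "occupation": "engineer"},
    {"sex": "M", "age": 45, "occupation": "teacher"},
    {"sex": "F", "age": 62, "occupation": "nurse"},
]


def percent_with_value(rows, variable, value):
    """Return the percentage of rows whose `variable` equals `value`."""
    if not rows:
        raise ValueError("dataset is empty")
    matches = sum(1 for row in rows if row[variable] == value)
    return 100.0 * matches / len(rows)


# A statistic: one summary number calculated from the dataset.
percent_female = percent_with_value(people, "sex", "F")

# Your own analysis: occupations of women aged 30 or older.
# This needs the dataset itself, because it combines three variables.
occupations_of_women_30_plus = Counter(
    row["occupation"]
    for row in people
    if row["sex"] == "F" and row["age"] >= 30
)

print(f"{percent_female:.1f}% of cases are female")
print(dict(occupations_of_women_30_plus))
```

The first result is the kind of number you would find already published; the second is the kind of question that usually sends you to the dataset.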
Aggregate/Macro Data vs. Microdata
Aggregate or Macro Data are higher-level data that have been compiled from smaller units of data. For example, the census data that you find in the American Community Survey have been aggregated to preserve the confidentiality of individual respondents. Microdata contain individual cases, usually individual people, or in the case of census data, individual households. The Public Use Microdata Sample (PUMS) for the Census provides access to the actual survey data from the Census, but eliminates information that would identify individuals.
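The relationship between the two can be sketched in a few lines of Python. The household records, the county labels, and the `aggregate_by` helper below are all hypothetical, made up for illustration; they do not reflect the actual PUMS file layout. The idea is simply that aggregate data are what you get when individual records are rolled up into group totals.

```python
from collections import defaultdict

# Hypothetical microdata (NOT real census records): one row per household.
households = [
    {"household_id": 1, "county": "A", "size": 3},
    {"household_id": 2, "county": "A", "size": 1},
    {"household_id": 3, "county": "B", "size": 4},
    {"household_id": 4, "county": "B", "size": 2},
    {"household_id": 5, "county": "B", "size": 3},
]


def aggregate_by(records, group_key, value_key):
    """Roll individual records up into one summary row per group."""
    totals = defaultdict(lambda: {"households": 0, "people": 0})
    for record in records:
        group = totals[record[group_key]]
        group["households"] += 1
        group["people"] += record[value_key]
    # Add the mean household size; individual households are no longer visible.
    return {
        name: {**group, "mean_size": group["people"] / group["households"]}
        for name, group in totals.items()
    }


county_totals = aggregate_by(households, "county", "size")
for county, summary in sorted(county_totals.items()):
    print(county, summary)
```

Aggregation only goes one way: you can build the county totals from the household rows, but you cannot recover the household rows from the totals.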
Data Sets, Studies, and Series
In data archives like ICPSR, a data set or study is made up of the raw data file and any related files, usually the codebook and setup files. The codebook is your guide to making sense of the raw data. For survey data, the codebook usually contains the actual questionnaire and the values for the responses to each question. The setup files help the data display properly.
ICPSR uses the term series to describe collections of studies that have been repeated over time. For example, the National Health Interview Survey is conducted annually. In the ICPSR archive, you will find a description of the series that provides an overview. You will also find individual descriptions of each study (e.g., National Health Interview Survey, 2004). The study number in ICPSR refers to the individual survey.
Types of Data
Cross-Sectional describes data that are only collected once.
Time Series describes data that track the same variables over time. The National Health Interview Survey is an example of time series data because the questions generally remain the same over time, but the individual respondents vary.
Longitudinal Studies describe surveys that are conducted repeatedly, in which the same group of respondents is surveyed each time. This allows for examining changes over the life course. The Project on Human Development in Chicago Neighborhoods (PHDCN) Series contains a longitudinal component that tracks changes in the lives of individuals over time through interviews.
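One way to keep these three types apart is to ask how many times the data were collected and whether the same respondents appear each time. The Python sketch below encodes that rule of thumb. The `classify_design` function and the sample respondent IDs are hypothetical, and the rule is a simplification: real longitudinal studies lose and replace respondents over time, so an exact match between waves is an assumption made here for clarity.

```python
def classify_design(observations):
    """Classify a study from (wave, respondent_id) pairs.

    Simplified rule of thumb, using this guide's terms:
      - one wave                          -> "cross-sectional"
      - several waves, same respondents   -> "longitudinal"
      - several waves, respondents differ -> "time series"
    """
    respondents_by_wave = {}
    for wave, respondent_id in observations:
        respondents_by_wave.setdefault(wave, set()).add(respondent_id)

    if not respondents_by_wave:
        raise ValueError("no observations given")
    if len(respondents_by_wave) == 1:
        return "cross-sectional"

    groups = list(respondents_by_wave.values())
    if all(group == groups[0] for group in groups):
        return "longitudinal"
    return "time series"


# Hypothetical examples: (year of collection, respondent ID).
one_time_survey = [(2004, "r1"), (2004, "r2"), (2004, "r3")]
repeated_new_sample = [(2004, "r1"), (2004, "r2"), (2005, "r7"), (2005, "r8")]
same_people_followed = [(2004, "r1"), (2004, "r2"), (2005, "r1"), (2005, "r2")]

print(classify_design(one_time_survey))
print(classify_design(repeated_new_sample))
print(classify_design(same_people_followed))
```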
Finding datasets and statistics can be time-consuming. Here are some strategies to help you focus your search for the right data.
Search the literature: This will give you clues about which data sets others are using to investigate your topic, and even if no specific data sets are mentioned, you might learn about the organizations, government entities, or others that are likely to collect related data. A great resource is the ICPSR Bibliography of Data-Related Literature, which allows you to search for articles and other publications first, and then link directly to the datasets used by those articles. If you find useful, but dated, statistics in an article, look at the source and then go to the source's website to find updated numbers.
Who cares about this information? Data cost a lot to collect. Who cares enough about the information to collect it? Some of the most likely data collectors include governments, marketers, trade groups, and advocacy associations. Depending on your subject area, finding useful data can be very challenging. However, many organizations that collect and share data also provide some sort of analysis of that data. For example, the World Bank collects and shares data, but also provides robust analysis in fact sheets, reports, publications, etc.
Here are a few things to think about when trying to find data/statistics:
The most recent data may not be from this year. Because data take time and money to collect, analyze, and disseminate, the most recent datasets or statistics may be a few years old. Some data are collected only once; others are collected only every 10 years.
Follow the trail. Finding statistics can sometimes be an exercise in detective work. If you find a publication with useful statistics, look at the source of the statistics. If the article cites a source, e.g., the CDC or Pew Research, consult that source. It may provide additional statistics, updated statistics, or context that wasn't referenced in the article.
Evaluate the source. As with all information, you should evaluate the source providing the analysis. Are they biased? Is the group or website reliable? Do they provide access to the data that the statistic came from?
Read the statistic carefully. Pay close attention to any information provided about how the underlying data were collected and how the statistic was calculated. You don't want to misrepresent the statistic or its significance in your own writing.
The information on this page has been adapted from library colleagues at NYU and UCSD. We are grateful for their efforts in thinking about how to convey this information and composing it.