Research Guides: DPDS Social Sciences Workshop Resources: Part 3: Data Analysis and Visualization with R

Summary

This is an introduction to R designed for participants with no programming experience.

In this section, you will learn:

basic information about R syntax and the RStudio interface
how to import CSV files
the structure of data frames
how to deal with factors
how to add/remove rows and columns
how to calculate summary statistics from a data frame
a brief introduction to plotting

R and RStudio are separate downloads and installations. R is the underlying statistical computing environment, but using R alone is no fun. RStudio is a graphical integrated development environment (IDE) that makes using R much easier and more interactive.

Software

R is a programming language and environment for statistical computing and graphics. RStudio is open-source software for data science and scientific research that can be used for data analysis and visualization.

If you already have R and RStudio installed:

Open RStudio and click on “Help” > “Check for updates”. If a new version is available, quit RStudio and download the latest version of RStudio.
To check which version of R you are using, start RStudio; the first lines that appear in the console state the version of R you are running. Alternatively, you can type sessionInfo(), which also displays the version of R you are running. Go to the CRAN website and check whether a more recent version is available. If so, please download and install it. You can check here for more information on how to remove old versions from your system.
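
The version check described above can also be done from the console. The following is a minimal sketch using base R only; the comparison against 3.2 uses the minimum version this guide names for the other platforms, and the variable names are illustrative.

```r
# Print the version string of the running R installation.
r_version <- R.version.string
print(r_version)

# getRversion() returns a version object that can be compared directly,
# here against the minimum version this guide asks for (3.2).
meets_minimum <- getRversion() >= "3.2.0"
print(meets_minimum)

# sessionInfo() reports the R version together with the platform
# and the packages that are currently loaded.
info <- sessionInfo()
print(info$R.version$version.string)
```

If meets_minimum prints FALSE, the installed R is older than 3.2 and should be updated from CRAN.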

If you don’t have R and RStudio installed:

Download R from the CRAN website.
Run the .exe file that was just downloaded.
Go to the RStudio download page.
Under Installers select RStudio x.yy.zzz - Windows Vista/7/8/10 (where x, y, and z represent version numbers).
Double click the file to install it.
Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.

Download R for Windows
Use this link to download the free R software for Windows.
Download RStudio for Windows
Use this link to download the free RStudio software for Windows.

If you already have R and RStudio installed:

Open RStudio and click on “Help” > “Check for updates”. If a new version is available, quit RStudio and download the latest version of RStudio.
To check the version of R you are using, start RStudio; the first lines that appear in the console state the version of R you are running. Alternatively, you can type sessionInfo(), which also displays the version of R you are running. Go to the CRAN website and check whether a more recent version is available. If so, please download and install it. In any case, make sure you have at least R 3.2.

If you don’t have R and RStudio installed:

Download R from the CRAN website.
Select the .pkg file for the latest R version.
Double click on the downloaded file to install R.
It is also a good idea to install XQuartz (needed by some packages).
Go to the RStudio download page.
Under Installers select RStudio x.yy.zzz - Mac OS X 10.6+ (64-bit) (where x, y, and z represent version numbers).
Double click the file to install RStudio.
Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.

Download R for macOS
Use this link to download the free R software for macOS.
Download XQuartz for macOS
Use this link to download the free XQuartz for macOS.
Download RStudio for macOS
Use this link to download the free RStudio software for macOS.

Follow the instructions for your distribution from CRAN; they explain how to get the most recent version of R for common distributions. For most distributions you could use your package manager (e.g., for Debian/Ubuntu run sudo apt-get install r-base, and for Fedora sudo yum install R), but we don’t recommend this approach because the versions it provides are usually out of date. In any case, make sure you have at least R 3.2.
Go to the RStudio download page.
Under Installers select the version that matches your distribution, and install it with your preferred method (e.g., with Debian/Ubuntu sudo dpkg -i rstudio-x.yy.zzz-amd64.deb at the terminal).
Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.
After installing R and RStudio, you need to install the tidyverse and RSQLite packages. Start RStudio by double-clicking the icon and then type: install.packages(c("tidyverse", "RSQLite")). You can also do this by going to Tools -> Install Packages and typing the names of the packages you want to install, separated by a comma.
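
As a hedged sketch of the package step above, the following checks which of the two packages are still missing before installing anything. The variable names and the interactive() guard are assumptions of this example, not part of the workshop instructions.

```r
# Packages this guide asks you to install.
needed <- c("tidyverse", "RSQLite")

# Names of all packages currently installed in the library paths.
installed <- rownames(installed.packages())

# Keep only the packages that are not installed yet.
missing_pkgs <- setdiff(needed, installed)
print(missing_pkgs)

# Install only in an interactive session, so that running this as a
# script never starts a download by accident (a choice made for this sketch).
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Running install.packages(c("tidyverse", "RSQLite")) directly, as described above, works just as well; the check only avoids reinstalling packages that are already present.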

Download R for Linux
Use this link to download the free R software for Linux.
Download RStudio for Linux
Use this link to download the free RStudio software for Linux.

Glossary

Introduction to R

sqrt() # calculate the square root
round() # round a number
args() # find what arguments a function takes
length() # how many elements are in a particular vector
class() # the class (the type of element) of an object
str() # an overview of the object and the elements it contains
typeof() # determines the (R internal) type or storage mode of any object
c() # create vector; add elements to vector
[ ] # extract and subset vector
%in% # to test if a value is found in a vector
is.na() # test if there are missing values
na.omit() # returns the object with incomplete cases removed
complete.cases() # returns a logical vector indicating which cases are complete
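
To show how several of these functions fit together, here is a small sketch in base R. The vector and its values are made up for illustration.

```r
# Create a numeric vector with one missing value (made-up numbers).
weights <- c(50, 60, NA, 82)

length(weights)   # number of elements, including the NA
class(weights)    # "numeric"

# Subset with [ ]: first by position, then by a logical condition.
first_two <- weights[1:2]
heavy <- weights[!is.na(weights) & weights > 55]

# %in% tests whether each value on the left occurs in the vector on the right.
found <- c(60, 70) %in% weights

# Remove missing values before calculating.
clean <- weights[!is.na(weights)]
mean_weight <- mean(clean)

# round() and sqrt() can be nested like any other functions.
root <- sqrt(round(mean_weight))
```

Without removing the NA first, mean(weights) would itself return NA, which is why the missing value is filtered out before the calculation.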

Starting with Data

download.file() # download files from the internet to your computer
read_csv() # load CSV file into R memory
head() # shows the first 6 rows
view() # invoke a spreadsheet-style data viewer
read_delim() # load a file in table format into R memory
str() # check structure of the object and information about the class, length and content of each column
dim() # check dimension of data frame
nrow() # returns the number of rows
ncol() # returns the number of columns
tail() # shows the last 6 rows
names() # returns the column names (synonym of colnames() for data frame objects)
rownames() # returns the row names
summary() # summary statistics for each column
glimpse() # like str() applied to a data frame but tries to show as much data as possible
factor() # create factors
levels() # check levels of a factor
nlevels() # check number of levels of a factor
as.character() # convert an object to a character vector
as.numeric() # convert an object to a numeric vector
as.numeric(as.character(x)) # convert factors where the levels appear as characters to a numeric vector
as.numeric(levels(x))[x] # convert factors where the levels appear as numbers to a numeric vector
plot() # plot an object
addNA() # convert NA into a factor level
data.frame() # create a data.frame object
ymd() # convert a vector representing year, month, and day to a Date vector
paste() # concatenate vectors after converting to character
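
The factor conversions listed above are easy to get wrong, so here is a minimal base R sketch. The factor and its values are invented for illustration.

```r
# A factor whose labels happen to be numbers (made-up years).
year_fct <- factor(c("1990", "1983", "1977", "1998", "1990"))

levels(year_fct)    # the distinct labels, sorted
nlevels(year_fct)   # how many distinct labels there are

# as.numeric() on a factor returns the internal integer codes, not the labels.
codes <- as.numeric(year_fct)

# Two ways to recover the numbers stored in the labels.
years_a <- as.numeric(as.character(year_fct))
years_b <- as.numeric(levels(year_fct))[year_fct]

# Combine the results into a data frame and inspect it.
df <- data.frame(year = years_a, code = codes)
str(df)
dim(df)
```

Both conversion idioms give the same result here; the second one converts each level only once and then indexes by the factor, which can be faster on long vectors.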

Data Wrangling with dplyr and tidyr

str() # check structure of the object and information about the class, length and content of each column
view() # invoke a spreadsheet-style data viewer
select() # select columns of a data frame
filter() # allows you to select a subset of rows in a data frame
%>% # pipe: passes the output of one function as the first argument of the next, e.g. to select and filter in one step
mutate() # create new columns based on the values in existing columns
head() # shows the first 6 rows
group_by() # split the data into groups, apply some analysis to each group, and then combine the results.
summarize() # collapses each group into a single-row summary of that group
mean() # calculate the mean value of a vector
!is.na() # test which values are not missing
print() # print values to the console
min() # return the minimum value of a vector
arrange() # arrange rows by variables
desc() # transform a vector into a format that will be sorted in descending order
count() # counts the total number of records for each category
pivot_wider() # reshape a data frame from long to wide by spreading a key-value pair across multiple columns
pivot_longer() # reshape a data frame from wide to long by collapsing multiple columns into a key-value pair
replace_na() # replace NAs with specified values
n_distinct() # get a count of unique values
write_csv() # save to a csv formatted file
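
The following sketch chains several of the dplyr verbs above with the pipe. It assumes the dplyr package is installed; the data frame and its column names are made up for illustration.

```r
library(dplyr)

# Made-up survey-style data, for illustration only.
interviews <- data.frame(
  village = c("A", "A", "B", "B", "B", "B"),
  members = c(4, 6, NA, 4, 8, 6),
  rooms   = c(2, 2, 2, 1, 4, 2)
)

village_summary <- interviews %>%
  filter(!is.na(members)) %>%                    # drop rows with a missing value
  mutate(people_per_room = members / rooms) %>%  # new column from existing ones
  group_by(village) %>%                          # one group per village
  summarize(
    mean_people_per_room = mean(people_per_room),
    n_households = n()
  ) %>%
  arrange(desc(n_households))                    # largest group first

print(village_summary)
```

Each step receives the data frame produced by the previous one, so the pipeline reads top to bottom in the order the operations are applied.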

Data Visualization with ggplot2

read_csv() # load a csv formatted file into R memory
ggplot(data= , aes(x= , y= )) + geom_point( ) + facet_wrap() + theme_bw() + theme() # skeleton for creating plot layers
aes() # define aesthetic mappings: select the variables to be plotted and the variables that control presentation, such as size, shape, color, etc.
geom_*() # graphical representation of the data in the plot (points, lines, bars); add a geom to the plot with the + operator
facet_wrap() # splits one plot into multiple panels based on a factor included in the dataset
labs() # set labels to plot
theme_bw() # set the background to white
theme() # used to locally modify one or more theme elements in a specific ggplot object
+ # with the patchwork package loaded, place ggplots side by side
/ # with the patchwork package loaded, stack ggplots vertically
plot_layout() # set width and height of individual plots in a patchwork of plots
ggsave() # save a ggplot
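
A filled-in version of the plot skeleton might look like the sketch below. It assumes the ggplot2 package is installed; the data, labels, and output file name are invented for illustration.

```r
library(ggplot2)

# Made-up data, for illustration only.
plot_data <- data.frame(
  rooms   = c(1, 2, 3, 1, 2, 4),
  members = c(3, 5, 7, 2, 6, 9),
  village = c("A", "A", "A", "B", "B", "B")
)

p <- ggplot(data = plot_data, aes(x = rooms, y = members)) +
  geom_point() +                                  # draw one point per row
  facet_wrap(~ village) +                         # one panel per village
  labs(x = "Rooms", y = "Household members",
       title = "Household size by number of rooms") +
  theme_bw() +                                    # white background
  theme(plot.title = element_text(hjust = 0.5))   # centre the title

# Save to a temporary PDF; the file name is an arbitrary choice.
out_file <- file.path(tempdir(), "rooms_vs_members.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)
```

Note that the function is ggplot(), while ggplot2 is the name of the package that provides it.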

Processing JSON data

read_json() # load json object to an R object
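
As a small sketch of reading JSON, the example below assumes that read_json() is the function from the jsonlite package and that jsonlite is installed. The file contents are made up for illustration.

```r
library(jsonlite)

# Write a small made-up JSON document to a temporary file.
json_file <- file.path(tempdir(), "example.json")
writeLines(
  '[{"village": "A", "members": 3}, {"village": "B", "members": 7}]',
  json_file
)

# Default behaviour: a nested R list that mirrors the JSON structure.
as_list <- read_json(json_file)

# With simplifyVector = TRUE, an array of records becomes a data frame.
as_df <- read_json(json_file, simplifyVector = TRUE)
str(as_df)
```

The list form keeps the original nesting, while the simplified form is usually more convenient when the JSON is a flat array of records.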