
DSS Social Sciences Workshop Resources



Summary

This is an introduction to R designed for participants with no programming experience.

In this section, you will learn:

  • basic information about R syntax and the RStudio interface
  • how to import CSV files
  • the structure of data frames
  • how to deal with factors
  • how to add/remove rows and columns
  • how to calculate summary statistics from a data frame
  • a brief introduction to plotting

R and RStudio are separate downloads and installations. R is the underlying statistical computing environment, but using R alone is no fun. RStudio is a graphical integrated development environment (IDE) that makes using R much easier and more interactive. 

Software

R is a coding language and system for statistical computing and graphics. RStudio is a powerful, open-source software for data science and scientific research. It can be used for data analysis and visualization purposes. 

On Windows, if you already have R and RStudio installed:

  • Open RStudio, and click on “Help” > “Check for updates”. If a new version is available, quit RStudio, and download the latest version of RStudio.
  • To check which version of R you are using, start RStudio: the first thing that appears in the console is the version of R you are running. Alternatively, type sessionInfo(), which also displays the version of R you are running. Then go to the CRAN website and check whether a more recent version is available. If so, please download and install it. You may also want to remove old versions of R from your system.

If you don’t have R and RStudio installed:

  • Download R from the CRAN website.
  • Run the .exe file that was just downloaded.
  • Go to the RStudio download page.
  • Under Installers select RStudio x.yy.zzz - Windows Vista/7/8/10 (where x, y, and z represent version numbers).
  • Double click the file to install it.
  • Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.

On macOS, if you already have R and RStudio installed:

  • Open RStudio, and click on “Help” > “Check for updates”. If a new version is available, quit RStudio, and download the latest version of RStudio.
  • To check which version of R you are using, start RStudio: the first thing that appears in the console is the version of R you are running. Alternatively, type sessionInfo(), which also displays the version of R you are running. Then go to the CRAN website and check whether a more recent version is available. If so, please download and install it. In any case, make sure you have at least R 3.2.
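
The version check described above can also be done from the R console. The lines below are a minimal sketch using base R only; the variable names r_version and meets_minimum are illustrative, and 3.2.0 is simply the minimum version named in this guide.

```r
# Version string of the running R installation, e.g. "R version x.y.z (...)".
r_version <- R.version.string
print(r_version)

# Compare the running version against the minimum named in this guide (R 3.2).
meets_minimum <- getRversion() >= "3.2.0"
print(meets_minimum)
```

If meets_minimum prints FALSE, download a newer R from the CRAN website before continuing.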

If you don’t have R and RStudio installed:

  • Download R from the CRAN website.
  • Select the .pkg file for the latest R version.
  • Double click on the downloaded file to install R.
  • It is also a good idea to install XQuartz (needed by some packages).
  • Go to the RStudio download page.
  • Under Installers select RStudio x.yy.zzz - Mac OS X 10.6+ (64-bit) (where x, y, and z represent version numbers).
  • Double click the file to install RStudio.
  • Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.

On Linux:

  • Follow the instructions for your distribution on CRAN, which explain how to get the most recent version of R for common distributions. For most distributions, you could use your package manager (e.g., sudo apt-get install r-base on Debian/Ubuntu, or sudo yum install R on Fedora), but we don’t recommend this approach because the versions it provides are usually out of date. In any case, make sure you have at least R 3.2.
  • Go to the RStudio download page.
  • Under Installers select the version that matches your distribution, and install it with your preferred method (e.g., with Debian/Ubuntu sudo dpkg -i rstudio-x.yy.zzz-amd64.deb at the terminal).
  • Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.
  • After installing R and RStudio, you need to install the tidyverse and RSQLite packages. Start RStudio by double-clicking the icon and then type: install.packages(c("tidyverse", "RSQLite")). You can also do this by going to “Tools” > “Install Packages” and typing the names of the packages you want to install, separated by commas.
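
To see whether the two packages are already present before installing them, something like the following sketch may help. It uses base R only; the names needed, is_installed and missing_pkgs are illustrative, and the install call is left commented out because it needs internet access.

```r
# Packages named in the installation step above.
needed <- c("tidyverse", "RSQLite")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
print(is_installed)

# Names of the packages that still have to be installed.
missing_pkgs <- needed[!is_installed]

# To install whatever is missing, uncomment the next line (needs internet access):
# install.packages(missing_pkgs)
```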

Glossary

Introduction to R

  • sqrt() # calculate the square root
  • round() # round a number
  • args() # find what arguments a function takes
  • length() # how many elements are in a particular vector
  • class() # the class (the type of element) of an object
  • str() # an overview of the object and the elements it contains
  • typeof() # determines the (R internal) type or storage mode of any object
  • c() # create vector; add elements to vector
  • [ ] # extract and subset vector
  • %in% # to test if a value is found in a vector
  • is.na() # test if there are missing values
  • na.omit() # returns the object with incomplete cases removed
  • complete.cases() # identifies which elements are complete cases
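
As a quick illustration of several functions in this list, here is a minimal sketch in base R. The vector hh_size and its values are made up for the example.

```r
# A numeric vector of hypothetical household sizes; NA marks a missing value.
hh_size <- c(3, 7, NA, 4, 10)

length(hh_size)   # number of elements, including the NA
class(hh_size)    # class of the object
typeof(hh_size)   # internal storage type

# Subset with [ ]: keep the first two elements.
first_two <- hh_size[c(1, 2)]

# %in% tests, element by element, whether a value is found in another vector.
is_small <- hh_size %in% c(3, 4)

# is.na() flags missing values; na.omit() drops them.
n_missing <- sum(is.na(hh_size))
complete <- na.omit(hh_size)

# sqrt() and round(): square root of the mean, rounded to two decimals.
root_mean <- round(sqrt(mean(complete)), digits = 2)
```

Note that mean(hh_size) on the original vector would return NA, which is why the missing value is removed first.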

Starting with Data

  • download.file() # download files from the internet to your computer
  • read_csv() # load CSV file into R memory
  • head() # shows the first 6 rows
  • view() # invoke a spreadsheet-style data viewer
  • read_delim() # load a file in table format into R memory
  • str() # check structure of the object and information about the class, length and content of each column
  • dim() # check dimension of data frame
  • nrow() # returns the number of rows
  • ncol() # returns the number of columns
  • tail() # shows the last 6 rows
  • names() # returns the column names (synonym of colnames() for data frame objects)
  • rownames() # returns the row names
  • summary() # summary statistics for each column
  • glimpse() # like str() applied to a data frame but tries to show as much data as possible
  • factor() # create factors
  • levels() # check levels of a factor
  • nlevels() # check number of levels of a factor
  • as.character() # convert an object to a character vector
  • as.numeric() # convert an object to a numeric vector
  • as.numeric(as.character(x)) # convert a factor whose levels are numbers stored as text to a numeric vector
  • as.numeric(levels(x))[x] # equivalent conversion that converts the levels first; as.numeric(x) alone would return the internal integer codes
  • plot() # plot an object
  • addNA() # convert NA into a factor level
  • data.frame() # create a data.frame object
  • ymd() # convert a vector representing year, month, and day to a Date vector
  • paste() # concatenate vectors after converting to character
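
The factor functions above are easiest to understand on a small example. The sketch below uses base R only; the data frame interviews and its columns are hypothetical and stand in for an imported CSV file.

```r
# A small hypothetical data frame.
interviews <- data.frame(
  village   = c("God", "Chirodzo", "Ruaca", "God"),
  year_text = c("2016", "2017", "2016", "2017"),
  stringsAsFactors = FALSE
)

dim(interviews)     # number of rows and columns
names(interviews)   # column names

# Convert a character column to a factor and inspect its levels.
village_f <- factor(interviews$village)
levels(village_f)
nlevels(village_f)

# A factor whose levels are numbers stored as text.
year_f <- factor(interviews$year_text)

# as.numeric() alone returns the internal integer codes, not the years.
codes <- as.numeric(year_f)

# Either idiom below recovers the years themselves.
years_a <- as.numeric(as.character(year_f))
years_b <- as.numeric(levels(year_f))[year_f]
```

Comparing codes with years_a shows why converting a factor directly with as.numeric() is a common source of errors.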

Data Wrangling with dplyr and tidyr

  • str() # check structure of the object and information about the class, length and content of each column
  • view() # invoke a spreadsheet-style data viewer
  • select() # select columns of a data frame
  • filter() # allows you to select a subset of rows in a data frame
  • %>% # pipe: passes the output of one function to the next as its first argument, e.g. to select and filter in one chain
  • mutate() # create new columns based on the values in existing columns
  • head() # shows the first 6 rows
  • group_by() # split the data into groups, apply some analysis to each group, and then combine the results
  • summarize() # collapses each group into a single-row summary of that group
  • mean() # calculate the mean value of a vector
  • !is.na() # test which values are not missing
  • print() # print values to the console
  • min() # return the minimum value of a vector
  • arrange() # arrange rows by variables
  • desc() # transform a vector into a format that will be sorted in descending order
  • count() # counts the total number of records for each category
  • pivot_wider() # reshape a data frame from long to wide by spreading a key-value pair across multiple columns
  • pivot_longer() # reshape a data frame from wide to long by collapsing multiple columns into a key-value pair
  • replace_na() # replace NAs with specified values
  • n_distinct() # get a count of unique values
  • write_csv() # save to a csv formatted file
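
Several of these functions are usually chained together with the pipe. The following is a minimal sketch, assuming the dplyr package is installed; the data frame interviews and its columns are made up for the example.

```r
library(dplyr)

# A small hypothetical data frame of households.
interviews <- data.frame(
  village   = c("God", "God", "Ruaca", "Ruaca", "Ruaca"),
  no_membrs = c(3, 7, 10, NA, 4),
  rooms     = c(1, 1, 3, 2, 1),
  stringsAsFactors = FALSE
)

village_summary <- interviews %>%
  filter(!is.na(no_membrs)) %>%                     # drop rows with a missing household size
  mutate(people_per_room = no_membrs / rooms) %>%   # new column from existing columns
  group_by(village) %>%                             # one group per village
  summarize(
    mean_no_membrs       = mean(no_membrs),
    mean_people_per_room = mean(people_per_room),
    n_households         = n()
  ) %>%
  arrange(desc(mean_no_membrs))                     # largest mean first

print(village_summary)
```

Each step receives the data frame produced by the step before it, so the chain can be read from top to bottom.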

Data Visualization with ggplot2

  • read_csv() # load a csv formatted file into R memory
  • ggplot(data = , aes(x = , y = )) + geom_point() + facet_wrap() + theme_bw() + theme() # skeleton for creating plot layers
  • aes() # define the aesthetic mapping: which variables are plotted and which variables control the presentation, such as size, shape, color, etc.
  • geom_ # graphical representation of the data in the plot (points, lines, bars). To add a geom to the plot, use the + operator
  • facet_wrap() # split one plot into multiple plots based on a factor included in the dataset
  • labs() # set labels to plot
  • theme_bw() # set the background to white
  • theme() # used to locally modify one or more theme elements in a specific ggplot object
  • + # with the patchwork package, arrange ggplots side by side
  • / # with the patchwork package, arrange ggplots vertically
  • plot_layout() # set width and height of individual plots in a patchwork of plots
  • ggsave() # save a ggplot
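
The skeleton above can be filled in as follows. This is a minimal sketch, assuming the ggplot2 package is installed; the data frame, the axis labels and the file name in the commented ggsave() call are made up for the example.

```r
library(ggplot2)

# A small hypothetical data frame of households.
interviews <- data.frame(
  village   = c("God", "God", "Ruaca", "Ruaca"),
  no_membrs = c(3, 7, 10, 4),
  rooms     = c(1, 2, 3, 1)
)

# Build the plot layer by layer with the + operator.
members_plot <- ggplot(data = interviews, aes(x = no_membrs, y = rooms)) +
  geom_point() +                                        # one point per household
  facet_wrap(~ village) +                               # one panel per village
  labs(x = "Household members", y = "Number of rooms") +
  theme_bw()                                            # white background

# To write the plot to a file, uncomment the next line:
# ggsave("members_plot.png", plot = members_plot, width = 6, height = 4)
```

Typing members_plot at the console draws the plot in the RStudio Plots pane.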

Processing JSON data

  • read_json() # load json object to an R object
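
A minimal sketch of read_json(), assuming the jsonlite package is installed. The JSON content is made up and written to a temporary file so that the example is self-contained.

```r
library(jsonlite)

# Write a small hypothetical JSON array to a temporary file.
json_path <- tempfile(fileext = ".json")
writeLines(
  '[{"village": "God", "rooms": 1}, {"village": "Ruaca", "rooms": 3}]',
  json_path
)

# simplifyVector = TRUE converts an array of records into a data frame;
# the default (FALSE) returns a nested list instead.
households <- read_json(json_path, simplifyVector = TRUE)
str(households)
```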