Data analysis with Tidyverse(dplyr)

from reading data to graphing results

Jongoh Kim

LISER

February 2, 2023

Introduction


This training aims to teach you data analysis in R from A-Z.

Data analysis includes

  • reading in a data set

  • cleaning data / manipulating values in columns

  • creating new columns

  • merging data

  • statistical analysis

  • graphing results

  • writing reports

Prerequisite


This training is for people who have basic knowledge of programming!

The basic knowledge includes:

  • data type

  • if,else statement

  • for, while loop

  • basic concept of a user-defined function

Prerequisite


This training assumes you are using R Studio for coding your R script so

if you have not installed the program, please do by clicking this link.

Let’s clean up the R environment!


Let’s set the environment for our R-coding.

I strongly suggest you put these lines of code at the start of your code file

You will use them a lot.

# Clear plots
if(!is.null(dev.list())) dev.off()
# Clear console
cat("\014") 
# Clean workspace
rm(list=ls())

"OR the best way for R studio users
to really initialize the R environment"
#install.packages("rstudioapi")
##restart R
rstudioapi::restartSession()

Packages


Packages are a toolbox that you can get(install)

and use to make your life(coding) easier!

Tidyverse


The most famous package bundle in the R world!

My personal choice for R packages


String Manipulation

  • stringr, stringi

Data Analysis

  • dplyr and/or data.table

My personal choice for R packages


Graph

  • ggplot2

Report

  • rmarkdown, markdown, knitr

My personal choice for R packages


Pipeline tools

  • targets(snakemake for Python users)

What is a Pipeline tool?

Check out this website for more details.

The website states that:

With targets, you can maintain a reproducible workflow without repeating yourself.

  • targets learns how your pipeline fits together,
  • skips costly runtime for tasks that are already up to date,
  • runs only the necessary computation,
  • supports implicit parallel computing,
  • abstracts files as R objects,
  • and shows tangible evidence that the results match the underlying code and data.

Summary

  • targets
    • optional but much better to know if your work involves a lot of coding
  • ggplot2
    • essential!
    • personally I believe this package is the best for graphing results(better than matplotlib from Python)
  • rmarkdown packages
    • optional
    • you can stick to Latex or word

Summary

  • dplyr
    • essential!
    • %>%(piping), merging, sampling methods from dplyr are much more intuitive and easier to understand
  • data.table
    • essential if you are dealing with a data set that is larger than 30GB in CSV file format
    • it has a steep learning curve

Installing packages


Let’s install a package called ‘gapminder’


install.packages("gapminder")

Installing packages


Let’s install the ‘dplyr’ package, ‘stringr’ package, and ‘ggplot2’ package


#installing multiple packages at one-go
install.packages(c("dplyr", "stringr", "ggplot2"))

#you can also install the whole packages from the tidyverse bundle by
install.packages('tidyverse')

Calling packages


Intuitive way


library(dplyr)
library(ggplot2)
library(stringr)
library(stringi)

Calling packages


Elegant way


packages <- c("dplyr", "ggplot2", "stringr", "stringi")
lapply(packages, require, character.only=T)

Checking the description of the function


  1. You can type this in console
?lapply


  1. highlight the function and press f1


  1. or manually type the function name in the search bar of the Help panel, that is located on the bottom right.

Checking the description of the function


Setting working directory


#checking your current working directory
getwd()

#setting your working directory
setwd("Here_you_type_the_directory_path")

Project

Project


The better way to do data analysis with R is by creating a project!!


Project allows you:

  • to keep the R studio interface as it was before you closed it
  • not to type in the working directory every time
  • to manage your files more easily

Let’s try it out


Let’s create a “training” project at our desktop.


Follow the section 6 of this webpage.

Writing/exporting Data

Writing/exporting Data


Let’s initially focus on CSV files only!

First let’s read in the gapminder table from the gapminder package

library(gapminder)
#assigning the gapminder table to data
data <- gapminder

"OR"

#assigning the gapminder table to data
data <- gapminder::gapminder

Writing/exporting Data


Let’s export this table to a CSV file

under the data folder in the project directory

Don’t forget to create the data folder before running the code!

or you will get an error

#writing the CSV file
## decimal points = "."
write.csv(data, "data/gapminder.csv")

## decimal points = comma
write.csv2(data, "data/gapminder.csv")

Writing/exporting Data


It is faster to export a data if you use the data.table package

#you should have installed the data.table package first
## install.packages("data.table")

##write as a CSV file!
data.table::fwrite(data, "data/gapminder.csv",
                   dec = ".", #decimal points("." is the default)
                   sep = ",") #separator(comma is the default)

Reading in Data

CSV - basic


#basic way
?read.csv
#let's read the gapminder.csv file stored in the data folder
## decimal points = "."
data <- read.csv("data/gapminder.csv")
## decimal points = ","
data <- read.csv2("data/gapminder.csv")

CSV - data.table


It is fast!(link)


CSV - data.table


#data table way
library(data.table)

##install.packages("data.table")
data <- fread("data/gapminder.csv")

#advantages of fread
data <- data.table::fread("data/gapminder.csv",
                          encoding = "UTF-8", #choose encoding 
                          nrows = 10, #choose how many rows
                          select = c("country", "continent")) #choose which columns

STATA, SAS, and SPSS


#we need haven package!
install.packages("haven")
library(haven)

#STATA
data <- read_dta("stata.dta")
#SAS
data <- read_sas("sas.sas")
#SPSS
data <- read_spss("spss.spss")

other tables(tab delimited and etc.)


In general, Google/Stack Overflow is your friend.


?read.table

#tab delimited
data <- read.table(file = "data_file_path", 
                   header = T, #if header exists
                   sep = "\t", #tab
                   dec = ",") #using a comma for decimal points

#equivalent to read.csv
read.table(file = "data/gapminder.csv",
           header = T,
           sep = ",",
           dec = ".")

Calculating time to read a data file


Many times, researchers would like to know how long it takes to read a file.

Especially, if the file is large.


ct <- Sys.time()
data <- read.csv2("data/gapminder.csv")
print(Sys.time()-ct)
"Time difference of 0.01645303 secs"

Checking/Cleaning the data

Checking the data


There are two ways to check/see your data


  • pressing the variable(in our case “data”) button on your right-top panel
  • or typing the variable name in the console
  • type str(data)

Checking the data


For instance,


data <- gapminder::gapminder
str(data)
'tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...'

first and the last of the data


Maybe you simply want to check the top N rows or the last N rows


#checking the first 6 rows
head(data)
#checking the last 6 rows
tail(data)

#checking the first 10 rows
head(data, 10)
#checking the last 10 rows
tail(data, 10)

Checking unique values


Maybe you simply want to check unique values of a column


#checking unique values of the continent column
unique(data$continent)
"[1] Asia     Europe   Africa   Americas Oceania 
 Levels: Africa Americas Asia Europe Oceania"

Checking variable type


typeof(data$continent)
'[1] "integer"'
class(data$continent)
'[1] "factor"' #ordered categorical variable

#comparing factor with character
print(c("a", "b", "c"))
'"a" "b" "c"'
print(as.factor(c("a", "b", "c")))
"[1] a b c
 Levels: a b c"

Changing variable types


#changing variable types
data$continent[1]
"[1] Asia
 Levels: Africa Americas Asia Europe Oceania"
#character
as.character(data$continent[1])
"Asia"
#integer
as.integer(data$continent[1])
"3"
# as.numeric()
# as.factor()
# as.data.frame()

list and vectors


#vector can only have one data type!
c(1,2,3)
"[1] 1 2 3"
c(1,"a")
'[1] "1" "a"' #notice the quotation marks(")

#list can have different data types and name them!
temp <- list("a", 1)
temp
'[[1]]
[1] "a"

[[2]]
[1] 1'

list and vectors

# changing the list into a vector again
temp <- unlist(temp)
temp
'[1] "a" "1"'

# list can have sub-structures
temp <- list(country = c("France", "Germany"),
             capital = c("Paris", "Berlin"),
             population = c(67500000, 83130000))
'$country
[1] "France"  "Germany"

$capital
[1] "Paris"  "Berlin"

$population
[1] 67500000 83130000'

list and vectors


You should be careful when you want to extract a value from a list

#list
temp[1]
'$country
[1] "France"  "Germany"'
class(temp[1])
'[1] "list"'

#vector
temp[[1]]
#or
temp$country #think $ a sign to use to access a sub-structure(sub-list)
'[1] "France"  "Germany"'
class(temp[[1]])
'[1] "character"'

list and vectors


You should be careful when you want to extract a value from a list

#list
temp[1]
'$country
[1] "France"  "Germany"'
class(temp[1])
'[1] "list"'

#vector
temp[[1]]
#or
temp$country #think $ a sign to use to access a sub-structure(sub-list)
'[1] "France"  "Germany"'
class(temp[[1]])
'[1] "character"'

#trying to access a value of a sub-list(country)
temp[1][1] #wrong
'$country
[1] "France"  "Germany"' #retrieved the whole sub-list(country) instead
temp[[1]][1] #right
#or
temp$country[1]
'[1] "France"'

Cleaning missing values


First, let’s create an example data.frame(table in R)


df <- data.frame(country = c("France", "Germany", "Luxembourg"),
             capital = c("Paris", "Berlin", NA),
             population = c(67500000, 83130000, 639000))
df
"country capital population
1     France   Paris   67500000
2    Germany  Berlin   83130000
3 Luxembourg    <NA>     639000"

Cleaning missing values


Piping!

%>%


A special operator from the dplyr package!

Think it as “then” in English.

df %>% # call df
  filter(!is.na(capital))
#then filter observations that do not have a missing value(NA) in the capital column
# ! means NOT e.g. 1==1 1!=2

"NA is a special data type in R that indicates a missing value
(comparable to NaN value in Python)"

#the equivalent is
filter(df, !is.na(capital))

Cleaning missing values

The advantage of piping is that it allows you to write concise codes. For instance,


#basic way
lux <- filter(df, country=="Luxembourg") #getting Luxembourg observation
lux <- mutate(lux, capital = "Luxembourg City") #changing the capital value for Luxembourg
lux
"    country         capital population
1 Luxembourg Luxembourg City     639000"
#piping!
lux <- df %>%
  filter(country=="Luxembourg") %>%
  mutate(capital = "Luxembourg City")
lux
"    country         capital population
1 Luxembourg Luxembourg City     639000"

Cleaning missing values


Let’s say you have odd values for missing values e.g. -999


df <- data.frame(country = c("France", "Germany", "Luxembourg"),
             capital = c("Paris", "Berlin", -999),
             population = c(67500000, 83130000, "Non_existent"))
#retrieving luxembourg in two different ways
##1
df %>% 
  filter(capital==-999)
##2
df %>% 
  filter(population=="Non_existent")

Cleaning missing values


#retrieve an observation that has the value "-999" in the capital columns
df %>% 
  filter(capital=="-999")
#change the capital value 
df %>% 
  filter(capital=="-999") %>%
  mutate(capital="Luxembourg City")
#change the population value 
df %>% 
  filter(capital=="-999") %>%
  mutate(capital="Luxembourg City") %>%
  mutate(population = 639000)
#different version
df %>% 
  filter(capital=="-999") %>%
  mutate(capital="Luxembourg City",
         population = 639000)

Cleaning missing values


Don’t forget to assign the changed values to the original data frame!


df <- df %>% 
  filter(capital=="-999") %>%
  mutate(capital="Luxembourg City",
         population = 639000)
df
"     country         capital population
1  Luxembourg Luxembourg City     639000"
#I WANTED TO CHANGE THE VALUES AND STORE THEM!
#I DON'T WANT TO ONLY GET CLEAN LUX ROW!

Cleaning missing values


Actually you need to use ifelse() function to get all observations


#re-creating the data frame
df <- data.frame(country = c("France", "Germany", "Luxembourg"),
             capital = c("Paris", "Berlin", -999),
             population = c(67500000, 83130000, "Non_existent"))
#changing the capital values!
df %>% 
  mutate(capital=ifelse(capital=="-999", #condition!
                        "Lux City", #if TRUE
                        "FALSE")) #if FALSE
"    country  capital   population
1     France    FALSE     67500000
2    Germany    FALSE     83130000
3 Luxembourg Lux City Non_existent"

Cleaning missing values


Actually you need to use ifelse() function to get all observations


#keeping the other values
df <- df %>% 
  mutate(capital=ifelse(capital=="-999",
                        "Luxembourg City",
                        capital)) #keeping the rest values
#changing the population as well
df <- df %>% 
  mutate(population=ifelse(population=="Non_existent",
                        639000,
                        population)) %>%
  mutate(population = as.integer(population))

df
"    country         capital population
1     France           Paris   67500000
2    Germany          Berlin   83130000
3 Luxembourg Luxembourg City     639000"

Custom function


Sometimes, you need to create your custom function to detect missing values.

For instance,


check_miss <- function(column, bool=F){
  if(bool==F){#if one don't want T/F values for the result
    fin_value <- is.na(column) | stringi::stri_isempty(column)
    fin_value <- grep(T, fin_value) #returning the location where T is located
    return(fin_value)
  }else{
    fin_value <- is.na(column) | stringi::stri_isempty(column) #returning a boolean vector
    return(fin_value)
  }
}

Custom function


Let’s test the function


check_miss(c(NA, "Luxembourg" ,""))
"[1] 1 3"

check_miss(c(NA, "Luxembourg" ,""), bool = T)
"[1]  TRUE FALSE  TRUE"

Creating/deleting columns

Creating columns


It is simple! Just use mutate() function to create a new column


df <- df %>%
  mutate(total_pop = sum(population))
df
"     country         capital population total_pop
1     France           Paris   67500000 151269000
2    Germany          Berlin   83130000 151269000
3 Luxembourg Luxembourg City     639000 151269000"

Creating columns based on condition


There are two ways!

1.The intuitive way

#saving the value
temp_pop <- df %>%
  filter(country %in% c("France", "Luxembourg")) %>%
  summarize(french_pop = sum(population)) %>%
  unlist()
temp_pop
"french_pop 
  68139000"
#and then creating the column
df <- df %>%
  mutate(french_pop = temp_pop)
df
"     country         capital population total_pop french_pop
1     France           Paris   67500000 151269000   68139000
2    Germany          Berlin   83130000 151269000   68139000
3 Luxembourg Luxembourg City     639000 151269000   68139000"

Creating columns based on condition


2.The complex way

df <- df %>%
  #creating a column that indicates whether a country speaks french
  mutate(french = ifelse(country %in% c("France", "Luxembourg"),
                             T,
                             F)) %>%
  #grouping by french column
  group_by(french) %>%
  #creating a new column
  mutate(french_pop = ifelse(french==T,
                             sum(population),
                             NA)) %>%
  ungroup() #important!
df
" country    capital         population total_pop french_pop french
  <chr>      <chr>                <int>     <int>      <int> <lgl> 
1 France     Paris             67500000 151269000   68139000 TRUE  
2 Germany    Berlin            83130000 151269000         NA FALSE 
3 Luxembourg Luxembourg City     639000 151269000   68139000 TRUE "

Creating columns based on condition


If you forget to ungroup…

df <- df %>%
  mutate(french = ifelse(country %in% c("France", "Luxembourg"),
                             T,
                             F)) %>%
  group_by(french) %>%
  mutate(french_pop = ifelse(french==T,
                             sum(population),
                             NA))
#results in grouped sum by french still!
df %>%
  mutate(temp = sum(population))
"country    capital         population total_pop french_pop french     temp
  <chr>      <chr>                <int>     <int>      <int> <lgl>     <int>
1 France     Paris             67500000 151269000   68139000 TRUE   68139000
2 Germany    Berlin            83130000 151269000         NA FALSE  83130000
3 Luxembourg Luxembourg City     639000 151269000   68139000 TRUE   68139000"

Creating columns based on condition


If you forget to ungroup…

df <- df %>%
  mutate(french = ifelse(country %in% c("France", "Luxembourg"),
                             T,
                             F)) %>%
  group_by(french) %>%
  mutate(french_pop = ifelse(french==T,
                             sum(population),
                             NA))
#results in grouped sum by french still!
df %>%
  mutate(temp = sum(population))
"country    capital         population total_pop french_pop french     temp
  <chr>      <chr>                <int>     <int>      <int> <lgl>     <int>
1 France     Paris             67500000 151269000   68139000 TRUE   68139000
2 Germany    Berlin            83130000 151269000         NA FALSE  83130000
3 Luxembourg Luxembourg City     639000 151269000   68139000 TRUE   68139000"
#check it
is.grouped_df(df)
"[1] TRUE"

Deleting columns


#choose the columns you want to exclude
df %>%
  select(-french_pop, -french)
"# A tibble: 3 × 3
  country    capital         population
  <chr>      <chr>                <int>
1 France     Paris             67500000
2 Germany    Berlin            83130000
3 Luxembourg Luxembourg City     639000"
#choose the columns you only need
df %>%
  select(country, capital, population, total_pop)
"# A tibble: 3 × 4
  country    capital         population total_pop
  <chr>      <chr>                <int>     <int>
1 France     Paris             67500000 151269000
2 Germany    Berlin            83130000 151269000
3 Luxembourg Luxembourg City     639000 151269000"

Deleting columns

#choose the columns you want to exclude
df %>%
  select(-french_pop, -french)
"# A tibble: 3 × 3
  country    capital         population
  <chr>      <chr>                <int>
1 France     Paris             67500000
2 Germany    Berlin            83130000
3 Luxembourg Luxembourg City     639000"
#choose the columns you only need
df %>%
  select(country, capital, population, total_pop)
"# A tibble: 3 × 4
  country    capital         population total_pop
  <chr>      <chr>                <int>     <int>
1 France     Paris             67500000 151269000
2 Germany    Berlin            83130000 151269000
3 Luxembourg Luxembourg City     639000 151269000"
#declare the columns of interest first!
cols <- c("country", "capital", "population", "total_pop")
df %>%
  select(all_of(cols)) #there is also any_of()
"# A tibble: 3 × 4
  country    capital         population total_pop
  <chr>      <chr>                <int>     <int>
1 France     Paris             67500000 151269000
2 Germany    Berlin            83130000 151269000
3 Luxembourg Luxembourg City     639000 151269000"

Deleting columns


Let’s try an advanced example!

#advanced
df <- data.frame(country = "France",
                 capital = "Paris",
                 population = 67500000, 
                 Y1950_population = NA,
                 Y1960_population = "",
                 Y1970_population = -999)
df
" country capital population Y1950_population Y1960_population Y1970_population
1  France   Paris   67500000               NA                              -999"

Deleting columns


Find columns that you want to erase!

#advanced
df <- data.frame(country = "France",
                 capital = "Paris",
                 population = 67500000, 
                 Y1950_population = NA,
                 Y1960_population = "",
                 Y1970_population = -999)
df

#check the column names of df
colnames(df)
'[1] "country"          "capital"          "population"       "Y1950_population" "Y1960_population" "Y1970_population"'

#find the columns you want to delete!
grep("^Y", colnames(df))
"[1] 4 5 6"
grepl("^Y", colnames(df))
"[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE"

Deleting columns


Declare the columns you want to erase!

#advanced
df <- data.frame(country = "France",
                 capital = "Paris",
                 population = 67500000, 
                 Y1950_population = NA,
                 Y1960_population = "",
                 Y1970_population = -999)
df

#check the column names of df
colnames(df)
'[1] "country"          "capital"          "population"       "Y1950_population" "Y1960_population" "Y1970_population"'

#find the columns you want to delete!
grep("^Y", colnames(df))
"[1] 4 5 6"
grepl("^Y", colnames(df))
"[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE"

#declare the columns you want to erase!
del_cols <- colnames(df)[grep("^Y", colnames(df))]
del_cols
'Y1950_population" "Y1960_population" "Y1970_population"'

Deleting columns


Remove the columns from the data frame!

#advanced
df <- data.frame(country = "France",
                 capital = "Paris",
                 population = 67500000, 
                 Y1950_population = NA,
                 Y1960_population = "",
                 Y1970_population = -999)
df

#check the column names of df
colnames(df)
'[1] "country"          "capital"          "population"       "Y1950_population" "Y1960_population" "Y1970_population"'

#find the columns you want to delete!
grep("^Y", colnames(df))
"[1] 4 5 6"
grepl("^Y", colnames(df))
"[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE"

#declare the columns you want to erase!
del_cols <- colnames(df)[grep("^Y", colnames(df))]
del_cols
'Y1950_population" "Y1960_population" "Y1970_population"'

#exclude the columns!
df <- df %>%
  select(-all_of(del_cols))
df
"  country capital population
1  France   Paris   67500000"

Deleting columns


  • For more information on string matching based on regular expression, please check this website.

Merging data

Different joins

In R programming, people often use the term ‘join’ for merging the data, and many different types of join exist.

  1. left join
  2. right join
  3. inner join
  4. full join

Data


Let’s say we have two data sets: X and Y.

join


We would like the merge the two.


inner join

Animation


left join

Animation


right join

Animation


full join

Animation


etc.

Help page is accessible by typing

?left_join

it shows simple explanation of each join.


For different joins and animation visit this github page

Example


#setting the data sets
x.df <- data.frame(id = c(1,2),
                   country = c("France", "Germany"),
                   capital = c("Paris", "Berlin"),
                   population = c(67500000, 83130000))

y.df <- data.frame(id = c(2,3),
                   country = c("Germany", "Luxembourg"),
                   capital = c( "Berlin", "Luxembourg City"),
                   language = c( "DE",  "FR"))

left_join example


merged.df <- left_join(x = x.df, 
                       y = y.df,
                       by = "id") #by which column will you merge
merged.df
" id country.x capital.x population country.y capital.y language
1  1    France     Paris   67500000      <NA>      <NA>     <NA>
2  2   Germany    Berlin   83130000   Germany    Berlin       DE"

right_join example


merged.df <- right_join(x = x.df, 
                        y = y.df,
                        by = "id") #by which column will you merge
merged.df
" id country.x capital.x population  country.y       capital.y language
1  2   Germany    Berlin   83130000    Germany          Berlin       DE
2  3      <NA>      <NA>         NA Luxembourg Luxembourg City       FR"

inner_join example


merged.df <- inner_join(x = x.df, 
                        y = y.df,
                        by = "id") #by which column will you merge
merged.df
" id country.x capital.x population country.y capital.y language
1  2   Germany    Berlin   83130000   Germany    Berlin       DE"

full_join example


merged.df <- full_join(x = x.df, 
                       y = y.df,
                       by = "id") #by which column will you merge
merged.df
"id country.x capital.x population  country.y       capital.y language
1  1    France     Paris   67500000       <NA>            <NA>     <NA>
2  2   Germany    Berlin   83130000    Germany          Berlin       DE
3  3      <NA>      <NA>         NA Luxembourg Luxembourg City       FR"

advanced merge


You can change the suffix setting if “.x” & “.y” does not help you to have a full picture of the merging process

merged.df <- left_join(x = x.df, 
                       y = y.df,
                       by = "id", #by which column will you merge
                       suffix = c(".population", ".language"))
merged.df
"id country.population capital.population population country.language capital.language language
1  1             France              Paris   67500000             <NA>             <NA>     <NA>
2  2            Germany             Berlin   83130000          Germany           Berlin       DE"

advanced merge


Let’s remove redundant columns

merged.df <- left_join(x = x.df, 
                       y = y.df,
                       by = "id", #by which column will you merge
                       suffix = c(".population", ".language"))
merged.df

#delete redundant columns
##retrieving column names that has ".language"
grepl(".language", colnames(merged.df)) 
del_cols <- colnames(merged.df)[grepl(".language", colnames(merged.df))]
##remove!
merged.df <- merged.df %>%
  select(-all_of(del_cols))
merged.df
"id country.population capital.population population language
1  1             France              Paris   67500000     <NA>
2  2            Germany             Berlin   83130000       DE"

advanced merge


Let’s rename the columns and fill in the blank language cell!

merged.df <- left_join(x = x.df, 
                       y = y.df,
                       by = "id", #by which column will you merge
                       suffix = c(".population", ".language"))
merged.df

#delete redundant columns
##retrieving column names that has ".language"
grepl(".language", colnames(merged.df)) 
del_cols <- colnames(merged.df)[grepl(".language", colnames(merged.df))]
##remove!
merged.df <- merged.df %>%
  select(-all_of(del_cols))
merged.df
"id country.population capital.population population language
1  1             France              Paris   67500000     <NA>
2  2            Germany             Berlin   83130000       DE"

#rename column names
##declaring the column names you want to have
##removing all the ".population" part
col_names <- stringr::str_replace_all(string = colnames(merged.df),
                                      pattern = ".population",
                                      replacement =  "") 
colnames(merged.df) <- col_names
print(colnames(merged.df))
'[1] "id"         "country"    "capital"    "population" "language"  '

#filling in the language column for France
merged.df <- merged.df %>%
  mutate(language = ifelse(country=="France",
                           "FR",
                           language))
merged.df
"  id country capital population language
1  1  France   Paris   67500000       FR
2  2 Germany  Berlin   83130000       DE"

advanced merge


Let’s rename the columns and fill in the blank language cell!

merged.df <- left_join(x = x.df, 
                       y = y.df,
                       by = "id", #by which column will you merge
                       suffix = c(".population", ".language"))
merged.df

#delete redundant columns
##retrieving column names that has ".language"
grepl(".language", colnames(merged.df)) 
del_cols <- colnames(merged.df)[grepl(".language", colnames(merged.df))]
##remove!
merged.df <- merged.df %>%
  select(-all_of(del_cols))
merged.df
"id country.population capital.population population language
1  1             France              Paris   67500000     <NA>
2  2            Germany             Berlin   83130000       DE"

#rename column names
##declaring the column names you want to have
##removing all the ".population" part
col_names <- stringr::str_replace_all(string = colnames(merged.df),
                                      pattern = ".population",
                                      replacement =  "") 
colnames(merged.df) <- col_names
print(colnames(merged.df))
'[1] "id"         "country"    "capital"    "population" "language"  '

#filling in the language column for France
merged.df <- merged.df %>%
  mutate(language = ifelse(country=="France",
                           "FR",
                           language))
merged.df
"  id country capital population language
1  1  France   Paris   67500000       FR
2  2 Germany  Berlin   83130000       DE"

Simple statistical analysis

Checking average of a variable


Let’s read in a gapminder data set

library(gapminder)
#reading in the data
data <- gapminder
head(data)
'# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.'

Checking average of a variable


Simple version


summary(data$lifeExp)
" Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  23.60   48.20   60.71   59.47   70.85   82.60 "

Checking average of a variable


Calculating the average of Europe


data %>%
  filter(continent=="Europe") %>%
  summarise(life_avg_eu = mean(lifeExp))
" life_avg_eu
        <dbl>
1        71.9"

Checking average of a variable


Calculating the average & standard deviation of the life Expectancy by continent


data %>%
  group_by(continent) %>%
  summarise(life_avg = mean(lifeExp),
            life_sd = sd(lifeExp))
"# A tibble: 5 × 3
  continent life_avg life_sd
  <fct>        <dbl>   <dbl>
1 Africa        48.9    9.15
2 Americas      64.7    9.35
3 Asia          60.1   11.9 
4 Europe        71.9    5.43
5 Oceania       74.3    3.80"

Checking average of a variable


If you want to save the resulted table, don’t forget to save it!


life_sum.df <- data %>%
  group_by(continent) %>%
  summarise(life_avg = mean(lifeExp),
            life_sd = sd(lifeExp))
life_sum.df
"# A tibble: 5 × 3
  continent life_avg life_sd
  <fct>        <dbl>   <dbl>
1 Africa        48.9    9.15
2 Americas      64.7    9.35
3 Asia          60.1   11.9 
4 Europe        71.9    5.43
5 Oceania       74.3    3.80"

Checking average of a variable(Advanced)


Let’s group the data by multiple variables

sum.df <- data %>%
  group_by(continent, year) %>%
  summarise(life_avg = mean(lifeExp),
            gdp_avg = mean(gdpPercap),
            .groups = "keep") %>% #keeping the group variables
  ungroup()
sum.df
"# A tibble: 60 × 4
   continent  year life_avg gdp_avg
   <fct>     <int>    <dbl>   <dbl>
 1 Africa     1952     39.1   1253.
 2 Africa     1957     41.3   1385.
 3 Africa     1962     43.3   1598.
 4 Africa     1967     45.3   2050.
 5 Africa     1972     47.5   2340.
 6 Africa     1977     49.6   2586.
 7 Africa     1982     51.6   2482.
 8 Africa     1987     53.3   2283.
 9 Africa     1992     53.6   2282.
10 Africa     1997     53.6   2379.
# … with 50 more rows
# ℹ Use `print(n = ...)` to see more rows"

Sorting the data set


Let’s check which continent has the highest average life expectancy in each year by descending order

sum.df %>%
  arrange(-year, -life_avg)
"# A tibble: 60 × 4
   continent  year life_avg gdp_avg
   <fct>     <int>    <dbl>   <dbl>
 1 Oceania    2007     80.7  29810.
 2 Europe     2007     77.6  25054.
 3 Americas   2007     73.6  11003.
 4 Asia       2007     70.7  12473.
 5 Africa     2007     54.8   3089.
 6 Oceania    2002     79.7  26939.
 7 Europe     2002     76.7  21712.
 8 Americas   2002     72.4   9288.
 9 Asia       2002     69.2  10174.
10 Africa     2002     53.3   2599.
# … with 50 more rows
# ℹ Use `print(n = ...)` to see more rows"

Linear regression


Simple linear regression


#simple linear regression
ols <- lm(formula = lifeExp ~ gdpPercap + continent,
          data = data)
ols

Linear regression


Checking the result

#simple linear regression
ols <- lm(formula = lifeExp ~ gdpPercap + continent,
          data = data)
ols
summary(ols)
"Call:
lm(formula = lifeExp ~ gdpPercap + continent, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-49.241  -4.479   0.347   5.105  25.138 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       4.789e+01  3.398e-01  140.93   <2e-16 ***
gdpPercap         4.453e-04  2.350e-05   18.95   <2e-16 ***
continentAmericas 1.359e+01  6.008e-01   22.62   <2e-16 ***
continentAsia     8.658e+00  5.555e-01   15.59   <2e-16 ***
continentEurope   1.757e+01  6.257e-01   28.08   <2e-16 ***
continentOceania  1.815e+01  1.787e+00   10.15   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.39 on 1698 degrees of freedom
Multiple R-squared:  0.5793,    Adjusted R-squared:  0.5781 
F-statistic: 467.7 on 5 and 1698 DF,  p-value: < 2.2e-16"

Linear regression


Checking the regression result in detail

#ols result!
ols_result <- summary(ols)

#checking the coefficients
ols_result$coefficients
"                      Estimate   Std. Error   t value      Pr(>|t|)
(Intercept)       4.788852e+01 3.398053e-01 140.92927  0.000000e+00
gdpPercap         4.452704e-04 2.349795e-05  18.94933  8.552893e-73
continentAmericas 1.359272e+01 6.007856e-01  22.62491  2.822476e-99
continentAsia     8.657793e+00 5.554859e-01  15.58598  2.719424e-51
continentEurope   1.757234e+01 6.257430e-01  28.08236 7.595526e-143
continentOceania  1.814604e+01 1.787426e+00  10.15205  1.502557e-23"
class(ols_result$coefficients)
'"matrix" "array"'

Linear regression


Dealing with scientific notations(e)

#scientific notation
ols_result$coefficients[,1]
"     (Intercept)         gdpPercap continentAmericas     continentAsia 
     4.788852e+01      4.452704e-04      1.359272e+01      8.657793e+00 
  continentEurope  continentOceania 
     1.757234e+01      1.814604e+01"
format(ols_result$coefficients[,1], scientific = F)
'     (Intercept)         gdpPercap continentAmericas     continentAsia 
  "47.8885161789"   " 0.0004452704"   "13.5927218705"   " 8.6577934769" 
  continentEurope  continentOceania 
  "17.5723409668"   "18.1460410154" '
#NOTE THAT the numbers are characters now!

Linear regression

#scientific notation
ols_result$coefficients[,1]
"(Intercept)         gdpPercap continentAmericas     continentAsia 
     4.788852e+01      4.452704e-04      1.359272e+01      8.657793e+00 
  continentEurope  continentOceania 
     1.757234e+01      1.814604e+01"
format(ols_result$coefficients[,1], scientific = F)
' (Intercept)         gdpPercap continentAmericas     continentAsia 
  "47.8885161789"   " 0.0004452704"   "13.5927218705"   " 8.6577934769" 
  continentEurope  continentOceania 
  "17.5723409668"   "18.1460410154" '
#NOTE THAT the numbers are characters now!

#try round() function
round(ols_result$coefficients[,1],4) #rounding up at the fourth decimal point
'      (Intercept)         gdpPercap continentAmericas     continentAsia   continentEurope  continentOceania 
          47.8885            0.0004           13.5927            8.6578           17.5723           18.1460'

Other econometric methods


For more details of running different regressions, please check this website: Introduction to Econometrics with R.

Plotting graphs

ggplot


Now, it is time to plot pretty graphs!


We will use ggplot2 package for plotting graphs.


Please check out this link for more information.


Download the ggplot cheat sheet!

Our goal

Our goal is to draw a graph like this

Let’s clean the R environemnt first!

# Clear plots
if(!is.null(dev.list())) dev.off()
# Clear console
cat("\014") 
# Clean workspace
rm(list=ls())

#restart R
rstudioapi::restartSession()

#calling necessary packages
packages <- c("dplyr", "ggplot2", "stringr", "stringi")
lapply(packages, require, character.only=T)

basic dot plot


Let’s start with a basic plot

#reading in data again
data <- gapminder::gapminder

#simple dot graph for Italy
data %>%
  filter(country=="Italy") %>%
  ggplot() + 
  geom_point(aes(x = year, y = gdpPercap))

usual graph types


  • dot plot - geom_point()

  • line graph - geom_line()

  • bar plot - geom_bar(x)

  • bar plot - geom_col(x,y)

  • density graph - geom_density()

with multiple countries


Multiple countries!


Impossible to differentiate which point belongs to which country

#reading in data again
data <- gapminder::gapminder

#simple dot graph for some European countries
data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  geom_point(aes(x = year, y = gdpPercap))

Adding colors


Let’s add some colors according to a country!

#simple dot graph for some European countries
data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  #note that it is important to put color = country inside of aes()
  geom_point(aes(x = year, y = gdpPercap, color = country)) 

Adding colors


Let’s add some colors according to a country!

#simple bar(column) graph for some European countries
data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  geom_col(aes(x = year, y = gdpPercap, color = country)) 

#for bar graphs you should use fill!
data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  geom_col(aes(x = year, y = gdpPercap, fill = country)) 

Changing axis labels

#simple line graph for some European countries
data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  geom_line(aes(x = year, y = gdpPercap, color = country)) +
  labs(x="Year", y="Real GDP per capita in USD") #changing the axis label

Setting the plot title

#simple line graph for some European countries
data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  geom_line(aes(x = year, y = gdpPercap, color = country)) +
  labs(x="Year", y="Real GDP per capita in USD",
       title = "Plotting GDP per Capita of Five European Countries over Time")

Changing legend position

#simple line graph for some European countries
data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  geom_line(aes(x = year, y = gdpPercap, color = country)) +
  labs(x="Year", y="Real GDP per capita in USD",
       title = "Plotting GDP per Capita of Five European Countries over Time") + 
  theme(legend.position = "bottom")

Changing title/legend/axis text size

#simple line graph for some European countries
data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  geom_line(aes(x = year, y = gdpPercap, color = country)) +
  labs(x="Year", y="Real GDP per capita in USD",
       title = "Plotting GDP per Capita of Five European Countries over Time") + 
  theme(legend.position = "bottom",
        legend.title = element_text(size = 20), #changing legend title size
        legend.text = element_text(size = 20), #changing legend text size
        axis.text.x = element_text(size = 15), #changing axis text size
        axis.title.x = element_text(size = 20), #changing axis title size
        axis.text.y = element_text(size = 15), #changing axis text size
        axis.title.y = element_text(size = 20), #changing axis title size
        title = element_text(size = 20)) #changing title size

Changing legend box size

#simple line graph for some European countries
data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  geom_line(aes(x = year, y = gdpPercap, color = country)) +
  labs(x="Year", y="Real GDP per capita in USD",
       title = "Plotting GDP per Capita of Five European Countries over Time") + 
  theme(legend.position = "bottom",
        legend.title = element_text(size = 20), #changing legend title size
        legend.text = element_text(size = 20), #changing legend text size
        axis.text.x = element_text(size = 15), #changing axis text size
        axis.title.x = element_text(size = 20), #changing axis title size
        axis.text.y = element_text(size = 15), #changing axis text size
        axis.title.y = element_text(size = 20),  #changing axis title size
        title = element_text(size = 20), #changing title size
        legend.key.size = unit(1.5, 'cm')) #changing legend color box size

Changing legend title

#simple line graph for some European countries
base_plot <- data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  geom_line(aes(x = year, y = gdpPercap, color = country)) +
  labs(x="Year", y="Real GDP per capita in USD",
       title = "Plotting GDP per Capita of Five European Countries over Time") + 
  theme(legend.position = "bottom",
        legend.title = element_text(size = 20), #changing legend title size
        legend.text = element_text(size = 20), #changing legend text size
        axis.text.x = element_text(size = 15), #changing axis text size
        axis.title.x = element_text(size = 20), #changing axis title size
        axis.text.y = element_text(size = 15), #changing axis text size
        axis.title.y = element_text(size = 20),  #changing axis title size
        title = element_text(size = 20), #changing title size
        legend.key.size = unit(1.5, 'cm')) + #changing legend color box size
  scale_color_discrete(name = "Country") #changing legend title
base_plot

Formatting axis tick

#simple line graph for some European countries
base_plot <- data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  geom_line(aes(x = year, y = gdpPercap, color = country)) +
  labs(x="Year", y="Real GDP per capita in USD",
       title = "Plotting GDP per Capita of Five European Countries over Time") + 
  theme(legend.position = "bottom",
        legend.title = element_text(size = 20), #changing legend title size
        legend.text = element_text(size = 20), #changing legend text size
        axis.text.x = element_text(size = 15), #changing axis text size
        axis.title.x = element_text(size = 20), #changing axis title size
        axis.text.y = element_text(size = 15), #changing axis text size
        axis.title.y = element_text(size = 20),  #changing axis title size
        title = element_text(size = 20), #changing title size
        legend.key.size = unit(1.5, 'cm')) + #changing legend color box size
  scale_color_discrete(name = "Country") +#changing legend title
  scale_y_continuous(labels= scales::comma) #formatting the y axis tick
base_plot

Adding GDP text labels on the graph


First, let’s create a separate data frame for the GDP text labels.


#print(sort(unique(data$year)))
"[1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007"

#creating the GDP text data frame
gdp_text.df <- data %>%
  filter(country %in% c("United Kingdom", "Italy")) %>%
  filter(year==1952 | year==2007)

Adding GDP text labels on the graph


Second, let’s add another layer!


#print(sort(unique(data$year)))
"[1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007"

#creating the GDP text data frame
gdp_text.df <- data %>%
  filter(country %in% c("United Kingdom", "Italy")) %>%
  filter(year==1952 | year==2007)

#adding the text layer
mid_plot <- base_plot + 
  geom_text(data = gdp_text.df,
            aes(x = year, y = gdpPercap, label = gdpPercap))
mid_plot

Adjusting the text size/position

#print(sort(unique(data$year)))
"[1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007"

#creating the GDP text data frame
gdp_text.df <- data %>%
  filter(country %in% c("United Kingdom", "Italy")) %>%
  filter(year==1952 | year==2007)

#adding the text layer
mid_plot <- base_plot + 
  geom_text(data = gdp_text.df,
            aes(x = year, y = gdpPercap, label = gdpPercap),
            size = 7, #size adjustment
            vjust = 0.7) #vertical position adjustment
mid_plot

Formatting the GDP neatly

#print(sort(unique(data$year)))
"[1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007"

#creating the GDP text data frame
gdp_text.df <- data %>%
  filter(country %in% c("United Kingdom", "Italy")) %>%
  filter(year==1952 | year==2007)

#adding the text layer
mid_plot <- base_plot + 
  geom_text(data = gdp_text.df,
            aes(x = year, y = gdpPercap, 
                label = round(gdpPercap)), #rounding up the GDP!
            size = 7, #size adjustment
            vjust = 0.7) #position adjustment
mid_plot

Formatting the GDP neatly

#print(sort(unique(data$year)))
"[1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007"

#creating the GDP text data frame
gdp_text.df <- data %>%
  filter(country %in% c("United Kingdom", "Italy")) %>%
  filter(year==1952 | year==2007)

#adding the text layer
mid_plot <- base_plot + 
  geom_text(data = gdp_text.df,
            aes(x = year, y = gdpPercap, 
                #rounding up the GDP and adding a comma!!
                label = format(round(gdpPercap), big.mark = ",", big.interval = 3L)), 
            size = 7, #size adjustment
            vjust = 0.7) #position adjustment
mid_plot

Adding Country text labels on the graph


Let’s create a separate data frame for the country text labels.


#creating the country text data frame
country_text.df <- data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain"))%>%
  filter(year==2002)

Adding country text labels on the graph


Second, let’s add another layer!


#creating the country text data frame
country_text.df <- data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain"))%>%
  filter(year==2002)

#adding the text layer
final_plot <- mid_plot + 
  geom_text(data = country_text.df,
            aes(x = year, y = gdpPercap, label = country, color = country),
            size = 7,
            vjust = 0.7,
            show.legend = F) #not showing the legend
final_plot

Saving the graph


Let’s save the final_plot


#saving the final plot
ggsave(plot = final_plot,
       filename = "img/output.jpg", 
       #make sure you have created an 'img' folder in your project directory
       width = 40,
       height = 30,
       units = "cm")

Writing a report

R Markdown


What is an R Markdown?


-> Think as a latex file combined with R!


Check these links:

Let’s clean the R environemnt first!

# Clear plots
if(!is.null(dev.list())) dev.off()
# Clear console
cat("\014") 
# Clean workspace
rm(list=ls())

#restart R
rstudioapi::restartSession()

installing necessary packages


rmarkdown, markdown, knitr package are essential packages for

any R Markdown file

#type this in the console
install.packages(c("rmarkdown", "markdown", "knitr"))

creating a new R Markdown file

creating a new R Markdown file

choosing the output type(pdf)


This session only focuses on the pdf format.


If you are more interested in html format, please check out:


Setting the yaml header

yaml header is the first part of an R markdown file.

It decides many important aspects of your output.

---
title: "R markdown guide"
author: "Jongoh Kim"
date: "`r format(Sys.time(), '%d %B, %Y')`" #setting the date by dd month, yyyy
output: 
  pdf_document:
    toc: true #add a table of contents a the start of a document
    toc_depth: 3 #The lowest level of headings to add to table of contents (e.g. 2, 3)
    number_sections: true #numbering the sections
urlcolor: blue #setting the hyperlink color to blue
---

Jumping right into the R markdown

Check out the ‘markdown_example.Rmd’ file and

its output(markdown_example.pdf) from

the github page

Thanks!


Special thanks to Etienne Bacher for his help and slide codes!


Source code for slides:

https://github.com/jongohkim91/R-A-Z/index.qmd


Good resources

R for Data Science from Hadley Wickham and Garrett Grolemund:

https://r4ds.had.co.nz/


ggplot2:

https://ggplot2.tidyverse.org/


R Markdown: The Definitive Guide from Yihui Xie, J. J. Allaire, and Garrett Grolemund:

https://bookdown.org/yihui/rmarkdown/

Good resources


Communicate Data with R: https://communicate-data-with-r.netlify.app/