Data analysis with Tidyverse(dplyr)

from reading data to graphing results

Jongoh Kim

LISER

February 2, 2023

Introduction

This training aims to teach you data analysis in R from A-Z.

Data analysis includes

reading in a data set
cleaning data / manipulating values in columns
creating new columns
merging data
statistical analysis
graphing results
writing reports

Prerequisite

This training is for people who have basic knowledge of programming!

The basic knowledge includes:

data type
if,else statement
for, while loop
basic concept of a user-defined function

Prerequisite

This training assumes you are using R Studio for coding your R script so

if you have not installed the program, please do by clicking this link.

Let’s clean up the R environment!

Let’s set the environment for our R-coding.

I strongly suggest you put these lines of code at the start of your code file

You will use them a lot.

# Clear plots
if(!is.null(dev.list())) dev.off()
# Clear console
cat("\014") 
# Clean workspace
rm(list=ls())

"OR the best way for R studio users
to really initialize the R environment"
#install.packages("rstudioapi")
##restart R
rstudioapi::restartSession()

Packages

Packages are a toolbox that you can get(install)

and use to make your life(coding) easier!

Tidyverse

The most famous package bundle in the R world!

My personal choice for R packages

String Manipulation

stringr, stringi

Data Analysis

dplyr and/or data.table

My personal choice for R packages

Graph

ggplot2

Report

rmarkdown, markdown, knitr

My personal choice for R packages

Pipeline tools

targets(snakemake for Python users)

What is a Pipeline tool?

Check out this website for more details.

The website states that:

With targets, you can maintain a reproducible workflow without repeating yourself.

targets learns how your pipeline fits together,
skips costly runtime for tasks that are already up to date,
runs only the necessary computation,
supports implicit parallel computing,
abstracts files as R objects,
and shows tangible evidence that the results match the underlying code and data.

Summary

targets
- optional but much better to know if your work involves a lot of coding
ggplot2
- essential!
- personally I believe this package is the best for graphing results(better than matplotlib from Python)
rmarkdown packages
- optional
- you can stick to Latex or word

Summary

dplyr
- essential!
- %>%(piping), merging, sampling methods from dplyr are much more intuitive and easier to understand
data.table
- essential if you are dealing with a data set that is larger than 30GB in CSV file format
- it has a steep learning curve

Installing packages

Let’s install a package called ‘gapminder’

install.packages("gapminder")

Installing packages

Let’s install the ‘dplyr’ package, ‘stringr’ package, and ‘ggplot2’ package

#installing multiple packages at one-go
install.packages(c("dplyr", "stringr", "ggplot2"))

#you can also install the whole packages from the tidyverse bundle by
install.packages('tidyverse')

Calling packages

Intuitive way

library(dplyr)
library(ggplot2)
library(stringr)
library(stringi)

Calling packages

Elegant way

packages <- c("dplyr", "ggplot2", "stringr", "stringi")
lapply(packages, require, character.only=T)

Checking the description of the function

You can type this in console

?lapply

highlight the function and press f1

or manually type the function name in the search bar of the Help panel, that is located on the bottom right.

Checking the description of the function

Setting working directory

#checking your current working directory
getwd()

#setting your working directory
setwd("Here_you_type_the_directory_path")

Project

The better way to do data analysis with R is by creating a project!!

Project allows you:

to keep the R studio interface as it was before you closed it
not to type in the working directory every time
to manage your files more easily

Let’s try it out

Let’s create a “training” project at our desktop.

Follow the section 6 of this webpage.

Writing/exporting Data

Let’s initially focus on CSV files only!

First let’s read in the gapminder table from the gapminder package

library(gapminder)
#assigning the gapminder table to data
data <- gapminder

"OR"

#assigning the gapminder table to data
data <- gapminder::gapminder

Writing/exporting Data

Let’s export this table to a CSV file

under the data folder in the project directory

Don’t forget to create the data folder before running the code!

or you will get an error

#writing the CSV file
## decimal points = "."
write.csv(data, "data/gapminder.csv")

## decimal points = comma
write.csv2(data, "data/gapminder.csv")

Writing/exporting Data

It is faster to export a data if you use the data.table package

#you should have installed the data.table package first
## install.packages("data.table")

##write as a CSV file!
data.table::fwrite(data, "data/gapminder.csv",
                   dec = ".", #decimal points("." is the default)
                   sep = ",") #separator(comma is the default)

Reading in Data

CSV - basic

#basic way
?read.csv
#let's read the gapminder.csv file stored in the data folder
## decimal points = "."
data <- read.csv("data/gapminder.csv")
## decimal points = ","
data <- read.csv2("data/gapminder.csv")

CSV - data.table

It is fast!(link)

CSV - data.table

#data table way
library(data.table)

##install.packages("data.table")
data <- fread("data/gapminder.csv")

#advantages of fread
data <- data.table::fread("data/gapminder.csv",
                          encoding = "UTF-8", #choose encoding 
                          nrows = 10, #choose how many rows
                          select = c("country", "continent")) #choose which columns

STATA, SAS, and SPSS

#we need haven package!
install.packages("haven")
library(haven)

#STATA
data <- read_dta("stata.dta")
#SAS
data <- read_sas("sas.sas")
#SPSS
data <- read_spss("spss.spss")

other tables(tab delimited and etc.)

In general, Google/Stack Overflow is your friend.

?read.table

#tab delimited
data <- read.table(file = "data_file_path", 
                   header = T, #if header exists
                   sep = "\t", #tab
                   dec = ",") #using a comma for decimal points

#equivalent to read.csv
read.table(file = "data/gapminder.csv",
           header = T,
           sep = ",",
           dec = ".")

Calculating time to read a data file

Many times, researchers would like to know how long it takes to read a file.

Especially, if the file is large.

ct <- Sys.time()
data <- read.csv2("data/gapminder.csv")
print(Sys.time()-ct)
"Time difference of 0.01645303 secs"

Checking/Cleaning the data

Checking the data

There are two ways to check/see your data

pressing the variable(in our case “data”) button on your right-top panel
or typing the variable name in the console
type str(data)

Checking the data

For instance,

data <- gapminder::gapminder
str(data)
'tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...'

first and the last of the data

Maybe you simply want to check the top N rows or the last N rows

#checking the first 6 rows
head(data)
#checking the last 6 rows
tail(data)

#checking the first 10 rows
head(data, 10)
#checking the last 10 rows
tail(data, 10)

Checking unique values

Maybe you simply want to check unique values of a column

#checking unique values of the continent column
unique(data$continent)
"[1] Asia     Europe   Africa   Americas Oceania 
 Levels: Africa Americas Asia Europe Oceania"

Checking variable type

typeof(data$continent)
'[1] "integer"'
class(data$continent)
'[1] "factor"' #ordered categorical variable

#comparing factor with character
print(c("a", "b", "c"))
'"a" "b" "c"'
print(as.factor(c("a", "b", "c")))
"[1] a b c
 Levels: a b c"

Changing variable types

#changing variable types
data$continent[1]
"[1] Asia
 Levels: Africa Americas Asia Europe Oceania"
#character
as.character(data$continent[1])
"Asia"
#integer
as.integer(data$continent[1])
"3"
# as.numeric()
# as.factor()
# as.data.frame()

list and vectors

#vector can only have one data type!
c(1,2,3)
"[1] 1 2 3"
c(1,"a")
'[1] "1" "a"' #notice the quotation marks(")

#list can have different data types and name them!
temp <- list("a", 1)
temp
'[[1]]
[1] "a"

[[2]]
[1] 1'

list and vectors

# changing the list into a vector again
temp <- unlist(temp)
temp
'[1] "a" "1"'

# list can have sub-structures
temp <- list(country = c("France", "Germany"),
             capital = c("Paris", "Berlin"),
             population = c(67500000, 83130000))
'$country
[1] "France"  "Germany"

$capital
[1] "Paris"  "Berlin"

$population
[1] 67500000 83130000'

list and vectors

You should be careful when you want to extract a value from a list

#list
temp[1]
'$country
[1] "France"  "Germany"'
class(temp[1])
'[1] "list"'

#vector
temp[[1]]
#or
temp$country #think $ a sign to use to access a sub-structure(sub-list)
'[1] "France"  "Germany"'
class(temp[[1]])
'[1] "character"'

list and vectors

You should be careful when you want to extract a value from a list

#list
temp[1]
'$country
[1] "France"  "Germany"'
class(temp[1])
'[1] "list"'

#vector
temp[[1]]
#or
temp$country #think $ a sign to use to access a sub-structure(sub-list)
'[1] "France"  "Germany"'
class(temp[[1]])
'[1] "character"'

#trying to access a value of a sub-list(country)
temp[1][1] #wrong
'$country
[1] "France"  "Germany"' #retrieved the whole sub-list(country) instead
temp[[1]][1] #right
#or
temp$country[1]
'[1] "France"'

Cleaning missing values

First, let’s create an example data.frame(table in R)

df <- data.frame(country = c("France", "Germany", "Luxembourg"),
             capital = c("Paris", "Berlin", NA),
             population = c(67500000, 83130000, 639000))
df
"country capital population
1     France   Paris   67500000
2    Germany  Berlin   83130000
3 Luxembourg    <NA>     639000"

Cleaning missing values

Piping!

%>%

A special operator from the dplyr package!

Think it as “then” in English.

df %>% # call df
  filter(!is.na(capital))
#then filter observations that do not have a missing value(NA) in the capital column
# ! means NOT e.g. 1==1 1!=2

"NA is a special data type in R that indicates a missing value
(comparable to NaN value in Python)"

#the equivalent is
filter(df, !is.na(capital))

Cleaning missing values

The advantage of piping is that it allows you to write concise codes. For instance,

#basic way
lux <- filter(df, country=="Luxembourg") #getting Luxembourg observation
lux <- mutate(lux, capital = "Luxembourg City") #changing the capital value for Luxembourg
lux
"    country         capital population
1 Luxembourg Luxembourg City     639000"
#piping!
lux <- df %>%
  filter(country=="Luxembourg") %>%
  mutate(capital = "Luxembourg City")
lux
"    country         capital population
1 Luxembourg Luxembourg City     639000"

Cleaning missing values

Let’s say you have odd values for missing values e.g. -999

df <- data.frame(country = c("France", "Germany", "Luxembourg"),
             capital = c("Paris", "Berlin", -999),
             population = c(67500000, 83130000, "Non_existent"))
#retrieving luxembourg in two different ways
##1
df %>% 
  filter(capital==-999)
##2
df %>% 
  filter(population=="Non_existent")

Cleaning missing values

#retrieve an observation that has the value "-999" in the capital columns
df %>% 
  filter(capital=="-999")
#change the capital value 
df %>% 
  filter(capital=="-999") %>%
  mutate(capital="Luxembourg City")
#change the population value 
df %>% 
  filter(capital=="-999") %>%
  mutate(capital="Luxembourg City") %>%
  mutate(population = 639000)
#different version
df %>% 
  filter(capital=="-999") %>%
  mutate(capital="Luxembourg City",
         population = 639000)

Cleaning missing values

Don’t forget to assign the changed values to the original data frame!

df <- df %>% 
  filter(capital=="-999") %>%
  mutate(capital="Luxembourg City",
         population = 639000)
df
"     country         capital population
1  Luxembourg Luxembourg City     639000"
#I WANTED TO CHANGE THE VALUES AND STORE THEM!
#I DON'T WANT TO ONLY GET CLEAN LUX ROW!

Cleaning missing values

Actually you need to use ifelse() function to get all observations

#re-creating the data frame
df <- data.frame(country = c("France", "Germany", "Luxembourg"),
             capital = c("Paris", "Berlin", -999),
             population = c(67500000, 83130000, "Non_existent"))
#changing the capital values!
df %>% 
  mutate(capital=ifelse(capital=="-999", #condition!
                        "Lux City", #if TRUE
                        "FALSE")) #if FALSE
"    country  capital   population
1     France    FALSE     67500000
2    Germany    FALSE     83130000
3 Luxembourg Lux City Non_existent"

Cleaning missing values

Actually you need to use ifelse() function to get all observations

#keeping the other values
df <- df %>% 
  mutate(capital=ifelse(capital=="-999",
                        "Luxembourg City",
                        capital)) #keeping the rest values
#changing the population as well
df <- df %>% 
  mutate(population=ifelse(population=="Non_existent",
                        639000,
                        population)) %>%
  mutate(population = as.integer(population))

df
"    country         capital population
1     France           Paris   67500000
2    Germany          Berlin   83130000
3 Luxembourg Luxembourg City     639000"

Custom function

Sometimes, you need to create your custom function to detect missing values.

For instance,

check_miss <- function(column, bool=F){
  if(bool==F){#if one don't want T/F values for the result
    fin_value <- is.na(column) | stringi::stri_isempty(column)
    fin_value <- grep(T, fin_value) #returning the location where T is located
    return(fin_value)
  }else{
    fin_value <- is.na(column) | stringi::stri_isempty(column) #returning a boolean vector
    return(fin_value)
  }
}

Custom function

Let’s test the function

check_miss(c(NA, "Luxembourg" ,""))
"[1] 1 3"

check_miss(c(NA, "Luxembourg" ,""), bool = T)
"[1]  TRUE FALSE  TRUE"

Creating/deleting columns

Creating columns

It is simple! Just use mutate() function to create a new column

df <- df %>%
  mutate(total_pop = sum(population))
df
"     country         capital population total_pop
1     France           Paris   67500000 151269000
2    Germany          Berlin   83130000 151269000
3 Luxembourg Luxembourg City     639000 151269000"

Creating columns based on condition

There are two ways!

1.The intuitive way

#saving the value
temp_pop <- df %>%
  filter(country %in% c("France", "Luxembourg")) %>%
  summarize(french_pop = sum(population)) %>%
  unlist()
temp_pop
"french_pop 
  68139000"
#and then creating the column
df <- df %>%
  mutate(french_pop = temp_pop)
df
"     country         capital population total_pop french_pop
1     France           Paris   67500000 151269000   68139000
2    Germany          Berlin   83130000 151269000   68139000
3 Luxembourg Luxembourg City     639000 151269000   68139000"

Creating columns based on condition

2.The complex way

df <- df %>%
  #creating a column that indicates whether a country speaks french
  mutate(french = ifelse(country %in% c("France", "Luxembourg"),
                             T,
                             F)) %>%
  #grouping by french column
  group_by(french) %>%
  #creating a new column
  mutate(french_pop = ifelse(french==T,
                             sum(population),
                             NA)) %>%
  ungroup() #important!
df
" country    capital         population total_pop french_pop french
  <chr>      <chr>                <int>     <int>      <int> <lgl> 
1 France     Paris             67500000 151269000   68139000 TRUE  
2 Germany    Berlin            83130000 151269000         NA FALSE 
3 Luxembourg Luxembourg City     639000 151269000   68139000 TRUE "

Creating columns based on condition

If you forget to ungroup…

df <- df %>%
  mutate(french = ifelse(country %in% c("France", "Luxembourg"),
                             T,
                             F)) %>%
  group_by(french) %>%
  mutate(french_pop = ifelse(french==T,
                             sum(population),
                             NA))
#results in grouped sum by french still!
df %>%
  mutate(temp = sum(population))
"country    capital         population total_pop french_pop french     temp
  <chr>      <chr>                <int>     <int>      <int> <lgl>     <int>
1 France     Paris             67500000 151269000   68139000 TRUE   68139000
2 Germany    Berlin            83130000 151269000         NA FALSE  83130000
3 Luxembourg Luxembourg City     639000 151269000   68139000 TRUE   68139000"

Creating columns based on condition

If you forget to ungroup…

df <- df %>%
  mutate(french = ifelse(country %in% c("France", "Luxembourg"),
                             T,
                             F)) %>%
  group_by(french) %>%
  mutate(french_pop = ifelse(french==T,
                             sum(population),
                             NA))
#results in grouped sum by french still!
df %>%
  mutate(temp = sum(population))
"country    capital         population total_pop french_pop french     temp
  <chr>      <chr>                <int>     <int>      <int> <lgl>     <int>
1 France     Paris             67500000 151269000   68139000 TRUE   68139000
2 Germany    Berlin            83130000 151269000         NA FALSE  83130000
3 Luxembourg Luxembourg City     639000 151269000   68139000 TRUE   68139000"
#check it
is.grouped_df(df)
"[1] TRUE"

Deleting columns

#choose the columns you want to exclude
df %>%
  select(-french_pop, -french)
"# A tibble: 3 × 3
  country    capital         population
  <chr>      <chr>                <int>
1 France     Paris             67500000
2 Germany    Berlin            83130000
3 Luxembourg Luxembourg City     639000"
#choose the columns you only need
df %>%
  select(country, capital, population, total_pop)
"# A tibble: 3 × 4
  country    capital         population total_pop
  <chr>      <chr>                <int>     <int>
1 France     Paris             67500000 151269000
2 Germany    Berlin            83130000 151269000
3 Luxembourg Luxembourg City     639000 151269000"

Deleting columns

#choose the columns you want to exclude
df %>%
  select(-french_pop, -french)
"# A tibble: 3 × 3
  country    capital         population
  <chr>      <chr>                <int>
1 France     Paris             67500000
2 Germany    Berlin            83130000
3 Luxembourg Luxembourg City     639000"
#choose the columns you only need
df %>%
  select(country, capital, population, total_pop)
"# A tibble: 3 × 4
  country    capital         population total_pop
  <chr>      <chr>                <int>     <int>
1 France     Paris             67500000 151269000
2 Germany    Berlin            83130000 151269000
3 Luxembourg Luxembourg City     639000 151269000"
#declare the columns of interest first!
cols <- c("country", "capital", "population", "total_pop")
df %>%
  select(all_of(cols)) #there is also any_of()
"# A tibble: 3 × 4
  country    capital         population total_pop
  <chr>      <chr>                <int>     <int>
1 France     Paris             67500000 151269000
2 Germany    Berlin            83130000 151269000
3 Luxembourg Luxembourg City     639000 151269000"

Deleting columns

Let’s try an advanced example!

#advanced
df <- data.frame(country = "France",
                 capital = "Paris",
                 population = 67500000, 
                 Y1950_population = NA,
                 Y1960_population = "",
                 Y1970_population = -999)
df
" country capital population Y1950_population Y1960_population Y1970_population
1  France   Paris   67500000               NA                              -999"

Deleting columns

Find columns that you want to erase!

#advanced
df <- data.frame(country = "France",
                 capital = "Paris",
                 population = 67500000, 
                 Y1950_population = NA,
                 Y1960_population = "",
                 Y1970_population = -999)
df

#check the column names of df
colnames(df)
'[1] "country"          "capital"          "population"       "Y1950_population" "Y1960_population" "Y1970_population"'

#find the columns you want to delete!
grep("^Y", colnames(df))
"[1] 4 5 6"
grepl("^Y", colnames(df))
"[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE"

Deleting columns

Declare the columns you want to erase!

#advanced
df <- data.frame(country = "France",
                 capital = "Paris",
                 population = 67500000, 
                 Y1950_population = NA,
                 Y1960_population = "",
                 Y1970_population = -999)
df

#check the column names of df
colnames(df)
'[1] "country"          "capital"          "population"       "Y1950_population" "Y1960_population" "Y1970_population"'

#find the columns you want to delete!
grep("^Y", colnames(df))
"[1] 4 5 6"
grepl("^Y", colnames(df))
"[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE"

#declare the columns you want to erase!
del_cols <- colnames(df)[grep("^Y", colnames(df))]
del_cols
'Y1950_population" "Y1960_population" "Y1970_population"'

Deleting columns

Remove the columns from the data frame!

#advanced
df <- data.frame(country = "France",
                 capital = "Paris",
                 population = 67500000, 
                 Y1950_population = NA,
                 Y1960_population = "",
                 Y1970_population = -999)
df

#check the column names of df
colnames(df)
'[1] "country"          "capital"          "population"       "Y1950_population" "Y1960_population" "Y1970_population"'

#find the columns you want to delete!
grep("^Y", colnames(df))
"[1] 4 5 6"
grepl("^Y", colnames(df))
"[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE"

#declare the columns you want to erase!
del_cols <- colnames(df)[grep("^Y", colnames(df))]
del_cols
'Y1950_population" "Y1960_population" "Y1970_population"'

#exclude the columns!
df <- df %>%
  select(-all_of(del_cols))
df
"  country capital population
1  France   Paris   67500000"

Deleting columns

For more information on string matching based on regular expression, please check this website.

Merging data

Different joins

In R programming, people often use the term ‘join’ for merging the data, and many different types of join exist.

left join
right join
inner join
full join

Data

Let’s say we have two data sets: X and Y.

join

We would like the merge the two.

inner join

Animation

left join

Animation

right join

Animation

full join

Animation

etc.

Help page is accessible by typing

?left_join

it shows simple explanation of each join.

For different joins and animation visit this github page

Example

#setting the data sets
x.df <- data.frame(id = c(1,2),
                   country = c("France", "Germany"),
                   capital = c("Paris", "Berlin"),
                   population = c(67500000, 83130000))

y.df <- data.frame(id = c(2,3),
                   country = c("Germany", "Luxembourg"),
                   capital = c( "Berlin", "Luxembourg City"),
                   language = c( "DE",  "FR"))

left_join example

merged.df <- left_join(x = x.df, 
                       y = y.df,
                       by = "id") #by which column will you merge
merged.df
" id country.x capital.x population country.y capital.y language
1  1    France     Paris   67500000      <NA>      <NA>     <NA>
2  2   Germany    Berlin   83130000   Germany    Berlin       DE"

right_join example

merged.df <- right_join(x = x.df, 
                        y = y.df,
                        by = "id") #by which column will you merge
merged.df
" id country.x capital.x population  country.y       capital.y language
1  2   Germany    Berlin   83130000    Germany          Berlin       DE
2  3      <NA>      <NA>         NA Luxembourg Luxembourg City       FR"

inner_join example

merged.df <- inner_join(x = x.df, 
                        y = y.df,
                        by = "id") #by which column will you merge
merged.df
" id country.x capital.x population country.y capital.y language
1  2   Germany    Berlin   83130000   Germany    Berlin       DE"

full_join example

merged.df <- full_join(x = x.df, 
                       y = y.df,
                       by = "id") #by which column will you merge
merged.df
"id country.x capital.x population  country.y       capital.y language
1  1    France     Paris   67500000       <NA>            <NA>     <NA>
2  2   Germany    Berlin   83130000    Germany          Berlin       DE
3  3      <NA>      <NA>         NA Luxembourg Luxembourg City       FR"

advanced merge

You can change the suffix setting if “.x” & “.y” does not help you to have a full picture of the merging process

merged.df <- left_join(x = x.df, 
                       y = y.df,
                       by = "id", #by which column will you merge
                       suffix = c(".population", ".language"))
merged.df
"id country.population capital.population population country.language capital.language language
1  1             France              Paris   67500000             <NA>             <NA>     <NA>
2  2            Germany             Berlin   83130000          Germany           Berlin       DE"

advanced merge

Let’s remove redundant columns

merged.df <- left_join(x = x.df, 
                       y = y.df,
                       by = "id", #by which column will you merge
                       suffix = c(".population", ".language"))
merged.df

#delete redundant columns
##retrieving column names that has ".language"
grepl(".language", colnames(merged.df)) 
del_cols <- colnames(merged.df)[grepl(".language", colnames(merged.df))]
##remove!
merged.df <- merged.df %>%
  select(-all_of(del_cols))
merged.df
"id country.population capital.population population language
1  1             France              Paris   67500000     <NA>
2  2            Germany             Berlin   83130000       DE"

advanced merge

Let’s rename the columns and fill in the blank language cell!

merged.df <- left_join(x = x.df, 
                       y = y.df,
                       by = "id", #by which column will you merge
                       suffix = c(".population", ".language"))
merged.df

#delete redundant columns
##retrieving column names that has ".language"
grepl(".language", colnames(merged.df)) 
del_cols <- colnames(merged.df)[grepl(".language", colnames(merged.df))]
##remove!
merged.df <- merged.df %>%
  select(-all_of(del_cols))
merged.df
"id country.population capital.population population language
1  1             France              Paris   67500000     <NA>
2  2            Germany             Berlin   83130000       DE"

#rename column names
##declaring the column names you want to have
##removing all the ".population" part
col_names <- stringr::str_replace_all(string = colnames(merged.df),
                                      pattern = ".population",
                                      replacement =  "") 
colnames(merged.df) <- col_names
print(colnames(merged.df))
'[1] "id"         "country"    "capital"    "population" "language"  '

#filling in the language column for France
merged.df <- merged.df %>%
  mutate(language = ifelse(country=="France",
                           "FR",
                           language))
merged.df
"  id country capital population language
1  1  France   Paris   67500000       FR
2  2 Germany  Berlin   83130000       DE"

advanced merge

Let’s rename the columns and fill in the blank language cell!

merged.df <- left_join(x = x.df, 
                       y = y.df,
                       by = "id", #by which column will you merge
                       suffix = c(".population", ".language"))
merged.df

#delete redundant columns
##retrieving column names that has ".language"
grepl(".language", colnames(merged.df)) 
del_cols <- colnames(merged.df)[grepl(".language", colnames(merged.df))]
##remove!
merged.df <- merged.df %>%
  select(-all_of(del_cols))
merged.df
"id country.population capital.population population language
1  1             France              Paris   67500000     <NA>
2  2            Germany             Berlin   83130000       DE"

#rename column names
##declaring the column names you want to have
##removing all the ".population" part
col_names <- stringr::str_replace_all(string = colnames(merged.df),
                                      pattern = ".population",
                                      replacement =  "") 
colnames(merged.df) <- col_names
print(colnames(merged.df))
'[1] "id"         "country"    "capital"    "population" "language"  '

#filling in the language column for France
merged.df <- merged.df %>%
  mutate(language = ifelse(country=="France",
                           "FR",
                           language))
merged.df
"  id country capital population language
1  1  France   Paris   67500000       FR
2  2 Germany  Berlin   83130000       DE"

Simple statistical analysis

Checking average of a variable

Let’s read in a gapminder data set

library(gapminder)
#reading in the data
data <- gapminder
head(data)
'# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.'

Checking average of a variable

Simple version

summary(data$lifeExp)
" Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  23.60   48.20   60.71   59.47   70.85   82.60 "

Checking average of a variable

Calculating the average of Europe

data %>%
  filter(continent=="Europe") %>%
  summarise(life_avg_eu = mean(lifeExp))
" life_avg_eu
        <dbl>
1        71.9"

Checking average of a variable

Calculating the average & standard deviation of the life Expectancy by continent

data %>%
  group_by(continent) %>%
  summarise(life_avg = mean(lifeExp),
            life_sd = sd(lifeExp))
"# A tibble: 5 × 3
  continent life_avg life_sd
  <fct>        <dbl>   <dbl>
1 Africa        48.9    9.15
2 Americas      64.7    9.35
3 Asia          60.1   11.9 
4 Europe        71.9    5.43
5 Oceania       74.3    3.80"

Checking average of a variable

If you want to save the resulted table, don’t forget to save it!

life_sum.df <- data %>%
  group_by(continent) %>%
  summarise(life_avg = mean(lifeExp),
            life_sd = sd(lifeExp))
life_sum.df
"# A tibble: 5 × 3
  continent life_avg life_sd
  <fct>        <dbl>   <dbl>
1 Africa        48.9    9.15
2 Americas      64.7    9.35
3 Asia          60.1   11.9 
4 Europe        71.9    5.43
5 Oceania       74.3    3.80"

Checking average of a variable(Advanced)

Let’s group the data by multiple variables

sum.df <- data %>%
  group_by(continent, year) %>%
  summarise(life_avg = mean(lifeExp),
            gdp_avg = mean(gdpPercap),
            .groups = "keep") %>% #keeping the group variables
  ungroup()
sum.df
"# A tibble: 60 × 4
   continent  year life_avg gdp_avg
   <fct>     <int>    <dbl>   <dbl>
 1 Africa     1952     39.1   1253.
 2 Africa     1957     41.3   1385.
 3 Africa     1962     43.3   1598.
 4 Africa     1967     45.3   2050.
 5 Africa     1972     47.5   2340.
 6 Africa     1977     49.6   2586.
 7 Africa     1982     51.6   2482.
 8 Africa     1987     53.3   2283.
 9 Africa     1992     53.6   2282.
10 Africa     1997     53.6   2379.
# … with 50 more rows
# ℹ Use `print(n = ...)` to see more rows"

Sorting the data set

Let’s check which continent has the highest average life expectancy in each year by descending order

sum.df %>%
  arrange(-year, -life_avg)
"# A tibble: 60 × 4
   continent  year life_avg gdp_avg
   <fct>     <int>    <dbl>   <dbl>
 1 Oceania    2007     80.7  29810.
 2 Europe     2007     77.6  25054.
 3 Americas   2007     73.6  11003.
 4 Asia       2007     70.7  12473.
 5 Africa     2007     54.8   3089.
 6 Oceania    2002     79.7  26939.
 7 Europe     2002     76.7  21712.
 8 Americas   2002     72.4   9288.
 9 Asia       2002     69.2  10174.
10 Africa     2002     53.3   2599.
# … with 50 more rows
# ℹ Use `print(n = ...)` to see more rows"

Linear regression

Simple linear regression

#simple linear regression
ols <- lm(formula = lifeExp ~ gdpPercap + continent,
          data = data)
ols

Linear regression

Checking the result

#simple linear regression
ols <- lm(formula = lifeExp ~ gdpPercap + continent,
          data = data)
ols
summary(ols)
"Call:
lm(formula = lifeExp ~ gdpPercap + continent, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-49.241  -4.479   0.347   5.105  25.138 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       4.789e+01  3.398e-01  140.93   <2e-16 ***
gdpPercap         4.453e-04  2.350e-05   18.95   <2e-16 ***
continentAmericas 1.359e+01  6.008e-01   22.62   <2e-16 ***
continentAsia     8.658e+00  5.555e-01   15.59   <2e-16 ***
continentEurope   1.757e+01  6.257e-01   28.08   <2e-16 ***
continentOceania  1.815e+01  1.787e+00   10.15   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.39 on 1698 degrees of freedom
Multiple R-squared:  0.5793,    Adjusted R-squared:  0.5781 
F-statistic: 467.7 on 5 and 1698 DF,  p-value: < 2.2e-16"

Linear regression

Checking the regression result in detail

#ols result!
ols_result <- summary(ols)

#checking the coefficients
ols_result$coefficients
"                      Estimate   Std. Error   t value      Pr(>|t|)
(Intercept)       4.788852e+01 3.398053e-01 140.92927  0.000000e+00
gdpPercap         4.452704e-04 2.349795e-05  18.94933  8.552893e-73
continentAmericas 1.359272e+01 6.007856e-01  22.62491  2.822476e-99
continentAsia     8.657793e+00 5.554859e-01  15.58598  2.719424e-51
continentEurope   1.757234e+01 6.257430e-01  28.08236 7.595526e-143
continentOceania  1.814604e+01 1.787426e+00  10.15205  1.502557e-23"
class(ols_result$coefficients)
'"matrix" "array"'

Linear regression

Dealing with scientific notations(e)

#scientific notation
ols_result$coefficients[,1]
"     (Intercept)         gdpPercap continentAmericas     continentAsia 
     4.788852e+01      4.452704e-04      1.359272e+01      8.657793e+00 
  continentEurope  continentOceania 
     1.757234e+01      1.814604e+01"
format(ols_result$coefficients[,1], scientific = F)
'     (Intercept)         gdpPercap continentAmericas     continentAsia 
  "47.8885161789"   " 0.0004452704"   "13.5927218705"   " 8.6577934769" 
  continentEurope  continentOceania 
  "17.5723409668"   "18.1460410154" '
#NOTE THAT the numbers are characters now!

Linear regression

#scientific notation
ols_result$coefficients[,1]
"(Intercept)         gdpPercap continentAmericas     continentAsia 
     4.788852e+01      4.452704e-04      1.359272e+01      8.657793e+00 
  continentEurope  continentOceania 
     1.757234e+01      1.814604e+01"
format(ols_result$coefficients[,1], scientific = F)
' (Intercept)         gdpPercap continentAmericas     continentAsia 
  "47.8885161789"   " 0.0004452704"   "13.5927218705"   " 8.6577934769" 
  continentEurope  continentOceania 
  "17.5723409668"   "18.1460410154" '
#NOTE THAT the numbers are characters now!

#try round() function
round(ols_result$coefficients[,1],4) #rounding up at the fourth decimal point
'      (Intercept)         gdpPercap continentAmericas     continentAsia   continentEurope  continentOceania 
          47.8885            0.0004           13.5927            8.6578           17.5723           18.1460'

Other econometric methods

For more details of running different regressions, please check this website: Introduction to Econometrics with R.

Plotting graphs

ggplot

Now, it is time to plot pretty graphs!

We will use ggplot2 package for plotting graphs.

Please check out this link for more information.

Download the ggplot cheat sheet!

Our goal

Our goal is to draw a graph like this

Let’s clean the R environemnt first!

# Clear plots
if(!is.null(dev.list())) dev.off()
# Clear console
cat("\014") 
# Clean workspace
rm(list=ls())

#restart R
rstudioapi::restartSession()

#calling necessary packages
packages <- c("dplyr", "ggplot2", "stringr", "stringi")
lapply(packages, require, character.only=T)

basic dot plot

Let’s start with a basic plot

#reading in data again
data <- gapminder::gapminder

#simple dot graph for Italy
data %>%
  filter(country=="Italy") %>%
  ggplot() + 
  geom_point(aes(x = year, y = gdpPercap))

usual graph types

dot plot - geom_point()
line graph - geom_line()
bar plot - geom_bar(x)
bar plot - geom_col(x,y)
density graph - geom_density()

with multiple countries

Multiple countries!

Impossible to differentiate which point belongs to which country

#reading in data again
data <- gapminder::gapminder

#simple dot graph for some European countries
data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  geom_point(aes(x = year, y = gdpPercap))

Adding colors

Let’s add some colors according to a country!

#simple dot graph for some European countries
data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  #note that it is important to put color = country inside of aes()
  geom_point(aes(x = year, y = gdpPercap, color = country))

Adding colors

Let’s add some colors according to a country!

#simple bar(column) graph for some European countries
data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  geom_col(aes(x = year, y = gdpPercap, color = country)) 

#for bar graphs you should use fill!
data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  geom_col(aes(x = year, y = gdpPercap, fill = country))

Changing axis labels

#simple line graph for some European countries
data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  geom_line(aes(x = year, y = gdpPercap, color = country)) +
  labs(x="Year", y="Real GDP per capita in USD") #changing the axis label

Setting the plot title

#simple line graph for some European countries
data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  geom_line(aes(x = year, y = gdpPercap, color = country)) +
  labs(x="Year", y="Real GDP per capita in USD",
       title = "Plotting GDP per Capita of Five European Countries over Time")

Changing legend position

#simple line graph for some European countries
data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  geom_line(aes(x = year, y = gdpPercap, color = country)) +
  labs(x="Year", y="Real GDP per capita in USD",
       title = "Plotting GDP per Capita of Five European Countries over Time") + 
  theme(legend.position = "bottom")

Changing title/legend/axis text size

#simple line graph for some European countries
data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  geom_line(aes(x = year, y = gdpPercap, color = country)) +
  labs(x="Year", y="Real GDP per capita in USD",
       title = "Plotting GDP per Capita of Five European Countries over Time") + 
  theme(legend.position = "bottom",
        legend.title = element_text(size = 20), #changing legend title size
        legend.text = element_text(size = 20), #changing legend text size
        axis.text.x = element_text(size = 15), #changing axis text size
        axis.title.x = element_text(size = 20), #changing axis title size
        axis.text.y = element_text(size = 15), #changing axis text size
        axis.title.y = element_text(size = 20), #changing axis title size
        title = element_text(size = 20)) #changing title size

Changing legend box size

#simple line graph for some European countries
data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  geom_line(aes(x = year, y = gdpPercap, color = country)) +
  labs(x="Year", y="Real GDP per capita in USD",
       title = "Plotting GDP per Capita of Five European Countries over Time") + 
  theme(legend.position = "bottom",
        legend.title = element_text(size = 20), #changing legend title size
        legend.text = element_text(size = 20), #changing legend text size
        axis.text.x = element_text(size = 15), #changing axis text size
        axis.title.x = element_text(size = 20), #changing axis title size
        axis.text.y = element_text(size = 15), #changing axis text size
        axis.title.y = element_text(size = 20),  #changing axis title size
        title = element_text(size = 20), #changing title size
        legend.key.size = unit(1.5, 'cm')) #changing legend color box size

Changing legend title

#simple line graph for some European countries
base_plot <- data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  geom_line(aes(x = year, y = gdpPercap, color = country)) +
  labs(x="Year", y="Real GDP per capita in USD",
       title = "Plotting GDP per Capita of Five European Countries over Time") + 
  theme(legend.position = "bottom",
        legend.title = element_text(size = 20), #changing legend title size
        legend.text = element_text(size = 20), #changing legend text size
        axis.text.x = element_text(size = 15), #changing axis text size
        axis.title.x = element_text(size = 20), #changing axis title size
        axis.text.y = element_text(size = 15), #changing axis text size
        axis.title.y = element_text(size = 20),  #changing axis title size
        title = element_text(size = 20), #changing title size
        legend.key.size = unit(1.5, 'cm')) + #changing legend color box size
  scale_color_discrete(name = "Country") #changing legend title
base_plot

Formatting axis tick

#simple line graph for some European countries
base_plot <- data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain")) %>%
  ggplot() + 
  geom_line(aes(x = year, y = gdpPercap, color = country)) +
  labs(x="Year", y="Real GDP per capita in USD",
       title = "Plotting GDP per Capita of Five European Countries over Time") + 
  theme(legend.position = "bottom",
        legend.title = element_text(size = 20), #changing legend title size
        legend.text = element_text(size = 20), #changing legend text size
        axis.text.x = element_text(size = 15), #changing axis text size
        axis.title.x = element_text(size = 20), #changing axis title size
        axis.text.y = element_text(size = 15), #changing axis text size
        axis.title.y = element_text(size = 20),  #changing axis title size
        title = element_text(size = 20), #changing title size
        legend.key.size = unit(1.5, 'cm')) + #changing legend color box size
  scale_color_discrete(name = "Country") +#changing legend title
  scale_y_continuous(labels= scales::comma) #formatting the y axis tick
base_plot

Adding GDP text labels on the graph

First, let’s create a separate data frame for the GDP text labels.

#print(sort(unique(data$year)))
"[1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007"

#creating the GDP text data frame
gdp_text.df <- data %>%
  filter(country %in% c("United Kingdom", "Italy")) %>%
  filter(year==1952 | year==2007)

Adding GDP text labels on the graph

Second, let’s add another layer!

#print(sort(unique(data$year)))
"[1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007"

#creating the GDP text data frame
gdp_text.df <- data %>%
  filter(country %in% c("United Kingdom", "Italy")) %>%
  filter(year==1952 | year==2007)

#adding the text layer
mid_plot <- base_plot + 
  geom_text(data = gdp_text.df,
            aes(x = year, y = gdpPercap, label = gdpPercap))
mid_plot

Adjusting the text size/position

#print(sort(unique(data$year)))
"[1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007"

#creating the GDP text data frame
gdp_text.df <- data %>%
  filter(country %in% c("United Kingdom", "Italy")) %>%
  filter(year==1952 | year==2007)

#adding the text layer
mid_plot <- base_plot + 
  geom_text(data = gdp_text.df,
            aes(x = year, y = gdpPercap, label = gdpPercap),
            size = 7, #size adjustment
            vjust = 0.7) #vertical position adjustment
mid_plot

Formatting the GDP neatly

#print(sort(unique(data$year)))
"[1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007"

#creating the GDP text data frame
gdp_text.df <- data %>%
  filter(country %in% c("United Kingdom", "Italy")) %>%
  filter(year==1952 | year==2007)

#adding the text layer
mid_plot <- base_plot + 
  geom_text(data = gdp_text.df,
            aes(x = year, y = gdpPercap, 
                label = round(gdpPercap)), #rounding up the GDP!
            size = 7, #size adjustment
            vjust = 0.7) #position adjustment
mid_plot

Formatting the GDP neatly

#print(sort(unique(data$year)))
"[1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007"

#creating the GDP text data frame
gdp_text.df <- data %>%
  filter(country %in% c("United Kingdom", "Italy")) %>%
  filter(year==1952 | year==2007)

#adding the text layer
mid_plot <- base_plot + 
  geom_text(data = gdp_text.df,
            aes(x = year, y = gdpPercap, 
                #rounding up the GDP and adding a comma!!
                label = format(round(gdpPercap), big.mark = ",", big.interval = 3L)), 
            size = 7, #size adjustment
            vjust = 0.7) #position adjustment
mid_plot

Adding Country text labels on the graph

Let’s create a separate data frame for the country text labels.

#creating the country text data frame
country_text.df <- data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain"))%>%
  filter(year==2002)

Adding country text labels on the graph

Second, let’s add another layer!

#creating the country text data frame
country_text.df <- data %>%
  filter(country %in% c("United Kingdom", "Germany", "France", "Italy", "Spain"))%>%
  filter(year==2002)

#adding the text layer
final_plot <- mid_plot + 
  geom_text(data = country_text.df,
            aes(x = year, y = gdpPercap, label = country, color = country),
            size = 7,
            vjust = 0.7,
            show.legend = F) #not showing the legend
final_plot

Saving the graph

Let’s save the final_plot

#saving the final plot
ggsave(plot = final_plot,
       filename = "img/output.jpg", 
       #make sure you have created an 'img' folder in your project directory
       width = 40,
       height = 30,
       units = "cm")

Writing a report

R Markdown

What is an R Markdown?

-> Think as a latex file combined with R!

Check these links:

Let’s clean the R environemnt first!

# Clear plots
if(!is.null(dev.list())) dev.off()
# Clear console
cat("\014") 
# Clean workspace
rm(list=ls())

#restart R
rstudioapi::restartSession()

installing necessary packages

rmarkdown, markdown, knitr package are essential packages for

any R Markdown file

#type this in the console
install.packages(c("rmarkdown", "markdown", "knitr"))

creating a new R Markdown file

choosing the output type(pdf)

This session only focuses on the pdf format.

If you are more interested in html format, please check out:

Setting the yaml header

yaml header is the first part of an R markdown file.

It decides many important aspects of your output.

---
title: "R markdown guide"
author: "Jongoh Kim"
date: "`r format(Sys.time(), '%d %B, %Y')`" #setting the date by dd month, yyyy
output: 
  pdf_document:
    toc: true #add a table of contents a the start of a document
    toc_depth: 3 #The lowest level of headings to add to table of contents (e.g. 2, 3)
    number_sections: true #numbering the sections
urlcolor: blue #setting the hyperlink color to blue
---

Jumping right into the R markdown

Check out the ‘markdown_example.Rmd’ file and

its output(markdown_example.pdf) from

the github page

Thanks!

Special thanks to Etienne Bacher for his help and slide codes!

Source code for slides:

https://github.com/jongohkim91/R-A-Z/index.qmd

Good resources

R for Data Science from Hadley Wickham and Garrett Grolemund:

https://r4ds.had.co.nz/

ggplot2:

https://ggplot2.tidyverse.org/

R Markdown: The Definitive Guide from Yihui Xie, J. J. Allaire, and Garrett Grolemund:

https://bookdown.org/yihui/rmarkdown/

Good resources

Communicate Data with R: https://communicate-data-with-r.netlify.app/