(Embarrassingly) Easy Parallelization
LISER
February 7, 2023
This training aims to introduce you to (embarrassingly) simple parallel computing.
This training is for people who have intermediate knowledge of R programming, so you should already have some hands-on experience with R!
Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously.
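As a toy illustration of the idea (not part of the pipeline below; the slow_square helper and the 2-core cluster are made up for this example), base R's parallel package can spread independent tasks across cores:
library(parallel)
#a deliberately slow task: each call takes about one second
slow_square <- function(i){ Sys.sleep(1); i^2 }
#sequential: one core works through the four tasks one by one (~4 seconds)
seq_res <- sapply(1:4, slow_square)
#parallel: the same four tasks run simultaneously on 2 worker processes (~2 seconds)
cl <- makeCluster(2)
par_res <- parSapply(cl, 1:4, slow_square)
stopCluster(cl)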
Let’s say we have a dataset of news articles, comprising both real and fake news. We’re interested in calculating a negative/positive sentiment score for each article and looking at the distribution of those scores.
Top part of the _targets.R file
library(targets)
source("scripts/functions/parallel_functions.R")
#source("R/different_code.R")
# Set packages.
tar_option_set(packages = c("qs", "dplyr", "stringr", "stringi", "ggplot2", "data.table", "parallel", "tidytext", "stopwords"),
format = "qs")
# End this file with a list of target objects.
list(
#reading in the news data
tar_target(data,
read_news()),
#cleaning the text
tar_target(cleaning_text,
clean_text(data)),
#doing the sentiment analysis without parallelization
tar_target(sentiment_analysis,
extract_sentiment(data, cleaning_text))
)
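Once _targets.R is set up like this, you build the whole pipeline from the R console with tar_make():
library(targets)
#runs every outdated target in the pipeline, in dependency order
tar_make()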
#getting the sentiment scores from a list of texts (each text is a vector of words!)
extract_sentiment <- function(data, clean_text_list){
  print("Doing a simple sapply (for-loop)!")
  #creating the final table without the raw text column
  final.df <- data %>%
    select(-text)
  #tryCatch has no handlers here; it mirrors the parallel version later,
  #which adds a finally clause to shut down the cluster
  tryCatch(expr = {
    #getting the sentiment score for each cleaned text
    final.df[, sentiment_score := sapply(X = clean_text_list,
                                         FUN = get_sentiment_score,
                                         USE.NAMES = F)]
  })
  return(final.df)
}
#reading the result
result <- tar_read(sentiment_analysis)
#getting the first 6 rows without the date information
result %>% select(-date) %>% head()
   title                                                                   subject      is_real sentiment_score
1: As U.S. budget fight looms, Republicans flip their fiscal script        politicsNews TRUE     12
2: U.S. military to accept transgender recruits on Monday: Pentagon        politicsNews TRUE     14
3: Senior U.S. Republican senator: 'Let Mr. Mueller do his job'            politicsNews TRUE      6
4: FBI Russia probe helped by Australian diplomat tip-off: NYT             politicsNews TRUE      7
5: Trump wants Postal Service to charge 'much more' for Amazon shipments   politicsNews TRUE     -5
6: White House, Congress prepare for talks on spending, immigration        politicsNews TRUE      6
library(targets)
library(clustermq)
options(clustermq.scheduler = "multiprocess")
source("scripts/functions/parallel_functions.R")
#source("R/different_code.R")
# Set packages.
tar_option_set(packages = c("qs", "dplyr", "stringr", "stringi",
"ggplot2", "data.table", "parallel",
"tidytext", "stopwords"),
format = "qs")
Then you simply type:
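Presumably the command meant here is tar_make_clustermq(), the clustermq counterpart of tar_make(); workers = 2 is an example value, in line with the core-count advice below:
#runs the pipeline with 2 parallel clustermq workers
tar_make_clustermq(workers = 2)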
To be safe, leave at least 33% of your cores free to run your computer’s OS and other background programs. For instance, if you have 4 cores, use only 2!
library(targets)
library(future)
library(future.callr)
plan(callr)
source("scripts/functions/parallel_functions.R")
#source("R/different_code.R")
# Set packages.
tar_option_set(packages = c("qs", "dplyr", "stringr", "stringi", "ggplot2", "data.table", "parallel", "tidytext", "stopwords"),
format = "qs")
Then you simply type:
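The counterpart for the future backend would be tar_make_future(); again, workers = 2 is just an example value:
#runs the pipeline with 2 parallel future workers
tar_make_future(workers = 2)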
It is a bit different when you use the parallel package directly. The top part of _targets.R is the same as in the lapply version, with the parallel package among the loaded packages.
library(targets)
source("scripts/functions/parallel_functions.R")
# # configuring which script to run (run it once and it will create a _targets.yaml file in the project folder)
# tar_config_set(script = "scripts/2._targets_pattern.R")
# Set packages.
tar_option_set(packages = c("qs", "dplyr", "stringr", "stringi", "ggplot2", "data.table", "parallel", "tidytext", "stopwords"),
format = "qs")
The major difference lies in the function you call to do the parallel computing! For comparison, here is the sequential version again:
#getting the sentiment scores from a list of texts (each text is a vector of words!)
extract_sentiment <- function(data, clean_text_list){
  print("Doing a simple sapply (for-loop)!")
  #creating the final table without the raw text column
  final.df <- data %>%
    select(-text)
  tryCatch(expr = {
    #getting the sentiment score for each cleaned text
    final.df[, sentiment_score := sapply(X = clean_text_list,
                                         FUN = get_sentiment_score,
                                         USE.NAMES = F)]
  })
  return(final.df)
}
#getting the sentiment scores from a list of texts (each text is a vector of words!)
extract_sentiment <- function(data, clean_text_list){
  print("Number of cores that could be used:")
  print(parallel::detectCores(logical = F))
  #declaring the number of cores: leave at least 33% of your cores free for your OS & other programs
  num_cores <- floor(parallel::detectCores(logical = F) * 0.66)
  #create the cluster of worker processes
  cl <- makeCluster(num_cores)
  print("DON'T USE ALL YOUR CORES!")
  print(paste("Currently using", num_cores, "Cores!"))
  #creating the final table without the raw text column
  final.df <- data %>%
    select(-text)
  tryCatch(expr = {
    #getting the sentiment scores on the cluster: parSapply is the parallel counterpart of sapply
    final.df[, sentiment_score := parSapply(cl = cl,
                                            X = clean_text_list,
                                            FUN = get_sentiment_score,
                                            USE.NAMES = F)]
  },
  finally = {
    #always stop the cluster, even if an error occurred. IMPORTANT!
    stopCluster(cl)
  })
  return(final.df)
}
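Note that with this approach the pipeline itself is still run with a plain tar_make(); the parallelism lives inside the target, in the parSapply() call:
tar_make()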
You have to load the required packages inside the function, because each parallel worker starts a fresh R session!
#getting the sentiment score for each text
get_sentiment_score <- function(text){ #text should be a vector of words!
  #calling the packages again because parallel workers start fresh R sessions, so packages need to be reloaded!
  packages <- c("qs", "dplyr", "stringr", "stringi", "data.table", "parallel", "tidytext", "stopwords")
  lapply(packages, require, character.only = TRUE)
  #setting the words related to sentiments (the Bing lexicon of positive/negative words)
  sentiment_words <- get_sentiments("bing")
  #the slide cuts off here; a plausible completion: +1 for each positive word, -1 for each negative word
  word_signs <- ifelse(sentiment_words$sentiment[match(text, sentiment_words$word)] == "positive", 1, -1)
  return(sum(word_signs, na.rm = TRUE))
}
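A quick sanity check on a toy input (the expected score assumes the positive-minus-negative completion sketched above; "great" is a positive word and "terrible" a negative one in the Bing lexicon):
get_sentiment_score(c("what", "a", "great", "great", "but", "terrible", "film"))
#returns 2 - 1 = 1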
Special thanks to Etienne Bacher for his slide code!
Source code for slides:
https://github.com/jongohkim91/targets_parallelization/blob/master/index.qmd
Examples I used in this training - link
The {targets} R package user manual from Will Landau (the creator of the 'targets' package)
R Programming for Data Science from Roger D. Peng, Parallel Computation chapter: https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html
Parallel Processing in R from Josh Errickson (University of Michigan, Department of Statistics), with nice examples for parLapply: https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html#using-sockets-with-parlapply