Parallel Computing with targets package

(Embarrassingly) Easy Parallelization

Jongoh Kim

LISER

February 7, 2023

Introduction

Objective


This training aims to introduce you to (embarrassingly) simple parallel computing.

Prerequisite


This training is for people who have intermediate knowledge of R programming!

You should have at least the following experience:

  • you have comfortably used the apply functions (lapply, sapply, vapply)
  • you have basic knowledge of the targets package

What is parallel computing?


Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously.

What is embarrassingly parallel?


  • also called embarrassingly parallelizable, perfectly parallel, delightfully parallel or pleasingly parallel
  • little or no effort is needed to separate the problem into a number of parallel tasks

When can I do (embarrassingly) parallel computing?


  1. You have more than one core in your CPU.
  2. The parallel tasks have little or no dependency on each other or on each other's results.
    • e.g. a for loop whose iterations are independent (see the sketch below)
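
For instance, a loop whose iterations depend only on their own inputs is embarrassingly parallel, because each iteration could run on a different core. A minimal sketch:

#each iteration depends only on its own input, so the loop is embarrassingly parallel
squares <- numeric(10)
for (i in 1:10) squares[i] <- i^2

#the same computation as an apply call, which functions like parallel::parSapply() can distribute across cores
squares <- sapply(1:10, function(i) i^2)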

Three ways to do simple parallel computing with targets


  1. Easy setting (also works for HPC)
    • clustermq package
    • future package

  2. Hard setting
    • parallel package

Real Example

Setting


Let’s say we have a dataset of news articles, made up of both real and fake news. We’re interested in calculating a negative/positive sentiment score for each article and looking at the distribution of the scores.

Overall workflow
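
One way to inspect the overall workflow is targets' built-in dependency graph (it requires the visNetwork package):

#draw the pipeline's dependency graph interactively
tar_visnetwork()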

Without parallelization


Top part of the _targets.R file

library(targets)

source("scripts/functions/parallel_functions.R")
#source("R/different_code.R")

# Set packages.
tar_option_set(packages = c("qs", "dplyr", "stringr", "stringi", "ggplot2", "data.table", "parallel", "tidytext", "stopwords"),
               format = "qs")

Without parallelization


library(targets)

source("scripts/functions/parallel_functions.R")
#source("R/different_code.R")

# Set packages.
tar_option_set(packages = c("qs", "dplyr", "stringr", "stringi", "ggplot2", "data.table", "parallel", "tidytext", "stopwords"),
               format = "qs")

# End this file with a list of target objects.
list(
  #reading in the news data
  tar_target(data, 
             read_news()),
  
  #cleaning the text
  tar_target(cleaning_text, 
             clean_text(data)),
  
  #doing the sentiment analysis without parallelization
  tar_target(sentiment_analysis, 
             extract_sentiment(data, cleaning_text))
)
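
With this _targets.R in place, the non-parallel pipeline runs with the usual call:

#run the pipeline sequentially, one target after another
tar_make()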

The extract_sentiment function

#getting the sentiment scores from a list of texts (each text is a vector of words!)
extract_sentiment <- function(data, clean_text_list){
  print("Doing simple lapply(for-loop)!")
  #creating the final table
  final.df <- data %>%
    select(-text)
  tryCatch(expr = {
    #getting the sentiment score
    final.df[,sentiment_score:=sapply(X = clean_text_list,
                                      FUN = get_sentiment_score,
                                      USE.NAMES = F)]
  })
  return(final.df)
}

The output

#reading the result
result <- tar_read(sentiment_analysis)
#getting the first 6 rows without the date information
result %>% select(-date) %>% head()
"                                                                   title      subject is_real sentiment_score
1:      As U.S. budget fight looms, Republicans flip their fiscal script politicsNews    TRUE              12
2:      U.S. military to accept transgender recruits on Monday: Pentagon politicsNews    TRUE              14
3:          Senior U.S. Republican senator: 'Let Mr. Mueller do his job' politicsNews    TRUE               6
4:           FBI Russia probe helped by Australian diplomat tip-off: NYT politicsNews    TRUE               7
5: Trump wants Postal Service to charge 'much more' for Amazon shipments politicsNews    TRUE              -5
6:      White House, Congress prepare for talks on spending, immigration politicsNews    TRUE               6"

How long it took


With clustermq

library(targets)
library(clustermq)
options(clustermq.scheduler = "multiprocess")
source("scripts/functions/parallel_functions.R")
#source("R/different_code.R")

# Set packages.
tar_option_set(packages = c("qs", "dplyr", "stringr", "stringi", 
                            "ggplot2", "data.table", "parallel", 
                            "tidytext", "stopwords"),
               format = "qs")

With clustermq


Then you simply type:

#without saying how many cores you will use
tar_make_clustermq()

"OR"

#setting how many cores you will use
tar_make_clustermq(workers = 2)

REMEMBER!


To be safe, leave at least 33% of your cores to run your computer’s OS and other background programs. For instance, if you have 4 cores, use only 2!
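
A minimal sketch of that rule, using parallel::detectCores() to pick a worker count before launching the pipeline:

#keep roughly a third of the physical cores free for the OS and other programs
num_workers <- max(1, floor(parallel::detectCores(logical = FALSE) * 0.66))
tar_make_clustermq(workers = num_workers)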

How long it took


With Future

library(targets)
library(future)
library(future.callr)
plan(callr)
source("scripts/functions/parallel_functions.R")
#source("R/different_code.R")

# Set packages.
tar_option_set(packages = c("qs", "dplyr", "stringr", "stringi", "ggplot2", "data.table", "parallel", "tidytext", "stopwords"),
               format = "qs")

With Future


Then you simply type:

#without saying how many cores you will use
tar_make_future()

"OR"

#setting how many cores you will use
tar_make_future(workers = 2)
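
If you are unsure how many workers to request, future can report how many cores it considers available on the current machine (a quick check, not part of the original pipeline):

#how many cores future considers usable here
future::availableCores()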

How long it took


With Parallel

It is a bit different with parallel.

The top part of _targets.R is the same as in the lapply version, but this time the code relies on the parallel package.


library(targets)
source("scripts/functions/parallel_functions.R")


# # configuring the script it should run (run it once and it will create a _targets.yaml file in the project folder)
# tar_config_set(script = "scripts/2._targets_pattern.R")

# Set packages.
tar_option_set(packages = c("qs", "dplyr", "stringr", "stringi", "ggplot2", "data.table", "parallel", "tidytext", "stopwords"),
               format = "qs")

The difference


The major difference lies in the function you call to do the parallel computing!

The extract_sentiment function before

#getting the sentiment scores from a list of texts (each text is a vector of words!)
extract_sentiment <- function(data, clean_text_list){
  print("Doing simple lapply(for-loop)!")
  #creating the final table
  final.df <- data %>%
    select(-text)
  tryCatch(expr = {
    #getting the sentiment score
    final.df[,sentiment_score:=sapply(X = clean_text_list,
                                      FUN = get_sentiment_score,
                                      USE.NAMES = F)]
  })
  return(final.df)
}

The extract_sentiment function for parallel

#getting the sentiment scores from a list of texts (each text is a vector of words!)
extract_sentiment <- function(data, clean_text_list){
  print("Number of Cores that could be used:")
  print(parallel::detectCores(logical = F))
  
  #declaring the number of cores
  num_cores <- floor(parallel::detectCores(logical = F)*0.66) #leave at least 33% of your cores for your OS & other programs
  #create the cluster
  cl <- makeCluster(num_cores)
  
  print("DON'T USE ALL YOUR CORES!")
  print(paste("Currently using", num_cores, "Cores!"))

The extract_sentiment function for parallel

#getting the sentiment scores from a list of texts (each text is a vector of words!)
extract_sentiment <- function(data, clean_text_list){
  print("Number of Cores that could be used:")
  print(parallel::detectCores(logical = F))
  
  #declaring the number of cores
  num_cores <- floor(parallel::detectCores(logical = F)*0.66) #leave at least 33% of your cores for your OS & other programs
  #create the cluster
  cl <- makeCluster(num_cores)
  
  print("DON'T USE ALL YOUR CORES!")
  print(paste("Currently using", num_cores, "Cores!"))
  
  #creating the final table
  final.df <- data %>%
    select(-text)
  tryCatch(expr = {
    #getting the sentiment score
    final.df[,sentiment_score:=parSapply(cl = cl,
                                         X = clean_text_list,
                                         FUN = get_sentiment_score,
                                         USE.NAMES = F)]
  },
  finally = {
    #stop using the cluster IMPORTANT!
    stopCluster(cl)
  })
  return(final.df)
}
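
The finally block guarantees the cluster is shut down even when parSapply() fails. An equivalent, slightly shorter idiom (a sketch, not the original code) registers the cleanup right after creating the cluster:

cl <- makeCluster(num_cores)
#stop the cluster when the enclosing function exits, whether it succeeds or errors
on.exit(stopCluster(cl), add = TRUE)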

The get_sentiment_score function for parallel

You have to load the required packages inside the function, because each parallel worker starts a fresh R session!

#getting the sentiment score for each text
get_sentiment_score <- function(text){ #text should be a vector of words!
  #calling the packages again because each parallel worker starts a fresh R session!
  packages <- c("qs", "dplyr", "stringr", "stringi", "data.table", "parallel", "tidytext", "stopwords")
  lapply(packages, require, character.only = TRUE)
  
  #setting the words related to sentiments
  sentiment_words <- get_sentiments("bing")
  
  #(the original slide cuts off here; the lines below are an assumed minimal completion:
  # positive word matches minus negative word matches)
  matched <- sentiment_words$sentiment[match(text, sentiment_words$word)]
  sum(matched == "positive", na.rm = TRUE) - sum(matched == "negative", na.rm = TRUE)
}
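
An alternative to loading packages inside the worker function is to prepare each node once, right after makeCluster(). This sketch (not part of the original code) uses the standard parallel helpers:

#load tidytext on every worker once, instead of on every call
clusterEvalQ(cl, library(tidytext))
#copy objects the workers need from the main session, e.g. a precomputed lexicon
clusterExport(cl, varlist = "sentiment_words")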

How long it took


Thanks!


Special thanks to Etienne Bacher for his slide code!


Source code for slides:

https://github.com/jongohkim91/targets_parallelization/blob/master/index.qmd


Examples I used in this training - link

Good resources

The {targets} R package user manual by Will Landau (the creator of the targets package):

  1. The parallel computing in an HPC environment part: https://books.ropensci.org/targets/hpc.html

  2. The clustermq part: https://books.ropensci.org/targets/hpc.html#clustermq

  3. The future part: https://books.ropensci.org/targets/hpc.html#future

Good resources

R Programming for Data Science by Roger D. Peng, the Parallel Computation chapter: https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html


Parallel Processing in R by Josh Errickson (University of Michigan, Department of Statistics), with nice examples for parLapply: https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html#using-sockets-with-parlapply