What is parallel computing¶
- Data is becoming cheaper to collect and store, because machines are becoming cheaper and faster.
- Computers are becoming cheaper and faster too. However, the performance of a single processor core has not improved much recently; instead we are getting multi-core processors, and clusters of multi-core machines.
- To handle the large amounts of data collected by modern machines, we need to put these many computing units to work through parallel or distributed computing, in other words, to get many computers to work on the same task simultaneously.
Embarrassingly parallel¶
- Many problems are “embarrassingly parallel”. In other words, they can be split into many smaller tasks and handed off to many computers so that the computation is carried out simultaneously.
- If we have a single computer at our disposal and have to run $n$ models, each taking $s$ seconds, the total running time will be $n \cdot s$. If, however, we have $k < n$ computers we can run our models on, the total running time drops to about $n \cdot s / k$. In the old days this was how parallel code was run, and it is still how it is run on larger servers.
- However, modern computers have “multicore” processors, so a single machine can act like several computers running at once. The arithmetic is not quite as clean (other things are running on each core, there is overhead in transferring work between processors, etc.), but in general we see the same kind of gain.
Terminology¶
- A core: a general term for either a single processor on your own computer (technically you may have only one processor, but a modern processor like the Intel i7 has multiple cores, hence the term) or a single machine in a cluster network.
- A cluster: a collection of objects capable of hosting cores, either a network of machines or just the collection of cores on your personal computer.
- A node/machine: a single physical machine in the cluster.
- A process: a single running instance of R (or, more generally, of any program). Each core runs a single process.
When to parallelize¶
- It’s not as simple as it may seem.
- While in theory each added processor would linearly increase the throughput of a computation, there is overhead that reduces that efficiency.
- For example, the code and, importantly, the data need to be copied to each additional CPU, and this takes time and bandwidth.
- Plus, new processes and/or threads need to be created by the operating system, which also takes time. This overhead reduces the efficiency enough that realistic performance gains are much less than theoretical, and usually do not scale linearly as a function of processing power.
- For example, if the time that a computation takes is short, then the overhead of setting up these additional resources may actually overwhelm any advantages of the additional processing power, and the computation could potentially take longer!
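A minimal sketch of this effect (assuming a Unix-style OS where mclapply() can fork, and that parallel is installed); the exact numbers will vary by machine, but for such a cheap per-task computation the parallel version is often no faster, or even slower:
# Overhead illustration: the work per task is tiny, so setup costs dominate
library(parallel)
system.time(res_seq <- lapply(1:10000, sqrt))                   # sequential
system.time(res_par <- mclapply(1:10000, sqrt, mc.cores = 2))   # forked, 2 workers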
When to parallelize¶
You notice your computations are too slow, and wonder “why is that?” Should you store your data differently? Should you use different software? Should you buy more RAM? Should you “go cloud”?
There is no one-size-fits-all solution to speed problems.
Solving a RAM bottleneck may consume more CPU; solving a CPU bottleneck may consume more RAM. Parallelisation means using multiple CPUs simultaneously. It will thus help with CPU bottlenecks, but it may consume more RAM. Parallelising is thus ill advised when dealing with a RAM bottleneck.
How to diagnose?¶
When deciding if, and how, to parallelise, it is crucial that you diagnose your bottleneck. The good news is that this diagnosis is not too hard.
You never drive without looking at your dashboard; you should never program without looking at your system monitors. Windows users have their Task Manager; Linux users have top, or preferably, htop; Mac users have the Activity Monitor. The system monitor will inform you how your RAM and CPUs are being used.
If you forcefully terminate your computation, and R takes a long time to respond, you are probably dealing with a RAM bottleneck.
Profile your code to detect how much RAM and CPU are consumed by each line of code.
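A minimal sketch of such profiling using base R’s sampling profiler, Rprof() (the workload here is made up; the profvis package offers a more interactive, line-by-line view of the same information):
# Profile a toy workload for both CPU time and memory
Rprof("profile.out", memory.profiling = TRUE)   # start profiling
y <- replicate(100, {
  m <- matrix(rnorm(1e4), nrow = 100)
  solve(crossprod(m))                           # some CPU-heavy linear algebra
})
Rprof(NULL)                                     # stop profiling
summaryRprof("profile.out", memory = "both")    # where did the time and memory go?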
Parallel computing in R¶
- R comes with a number of packages for parallel computing. You may have heard about multicore, snow, foreach, etc.
- When you have a list of repetitive tasks, you may be able to speed them up by adding more computing power. If each task is completely independent of the others, then it is a prime candidate for executing those tasks in parallel, each on its own core.
Parallelize using parallel¶
The parallel package was introduced in 2011 to unify two popular parallelisation packages: snow and multicore. The multicore package was designed to parallelise via the fork mechanism, on Unix-like (e.g. Linux) machines. The snow package was designed to parallelise via other mechanisms. R processes started with snow are not forked, so they will not see the parent’s data; data has to be copied to the child processes. The good news: snow can start R processes on Windows machines, or on remote machines in a cluster.
- The parallel library can be used to send tasks (encoded as function calls) to each of the processing cores on your machine in parallel.
- The most popular function, mclapply(), essentially parallelizes calls to lapply(). mclapply() gathers up the responses from each of these function calls, and returns a list of responses that is the same length as the list or vector of input data (one return value per input item).
Remark on mclapply¶
- The mclapply() function (and the related mc* functions) works via the fork mechanism on Unix-style operating systems.
- Briefly, your R session is the main process, and when you call a function like mclapply(), you fork a series of sub-processes that operate independently from the main process (although they share a few low-level features).
- These sub-processes then execute your function on their subsets of the data, presumably on separate cores of your CPU. Once the computation is complete, each sub-process returns its results and then the sub-process is killed. The parallel package manages the logistics of forking the sub-processes and handling them once they’ve finished.
How many cores can I use?¶
The first thing you might want to check with the parallel package is whether your computer in fact has multiple cores that you can take advantage of.
library(parallel)
nCores <- detectCores()
nCores
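Note that detectCores() usually counts logical (hyper-threaded) CPUs. A small sketch of asking for physical cores only, using the logical argument of parallel::detectCores() (its behaviour is platform-dependent and it may return NA on some systems):
# Physical cores only; may be NA where this cannot be determined
detectCores(logical = FALSE)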
Let’s build a simple loop that uses sample() with replacement to do a bootstrap analysis.
- We select Sepal.Length and Species from the iris dataset, subset it to 100 observations, and then iterate across 10,000 trials, each time resampling the observations with replacement.
- For each trial, we run a logistic regression fitting species as a function of sepal length, and record the coefficients to be returned.
# Sequential computing
x <- iris[which(iris[,5] != "setosa"), c(1,5)]   # Sepal.Length and Species, 100 rows
trials <- seq(1, 10000)
boot_fx <- function(trial) {
  ind <- sample(100, 100, replace=TRUE)          # resample row indices with replacement
  result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
  r <- coefficients(result1)
  res <- rbind(data.frame(), r)                  # return the coefficients as a one-row data frame
}
system.time({
  results <- lapply(trials, boot_fx)
})
   user  system elapsed
 18.808   0.000  18.934
# Parallel computing using mclapply
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- seq(1, 10000)
boot_fx <- function(trial) {
  ind <- sample(100, 100, replace=TRUE)
  result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
  r <- coefficients(result1)
  res <- rbind(data.frame(), r)
}
system.time({
  results <- mclapply(trials, boot_fx, mc.cores = detectCores())
})
   user  system elapsed
 28.256   1.611   1.756
Note¶
When executing parallel jobs via mclapply() it is important to pre-calculate how much memory all of the processes together will require, and to make sure this is less than the total amount of memory available on your computer.
A major advantage of forked parallel processing is that global variables in the main R session are inherited by the child processes. This means the developer does not have to spend effort on identifying and exporting those variables to the parallel workers.
This only applies on Unix-style operating systems.
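A minimal sketch of that inheritance (assuming a Unix-style OS; big_constant is a made-up global defined only in the main session):
# Forked children see the parent's globals without any explicit export
library(parallel)
big_constant <- 42
unlist(mclapply(1:4, function(i) i * big_constant, mc.cores = 2))
# A socket cluster, by contrast, would first need clusterExport(cl, "big_constant")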
Parallelize with parLapply¶
- Using the forking mechanism on your computer is one way to execute parallel computation but it’s not the only way that the parallel package offers.
- Another way to build a “cluster” using the multiple cores on your computer is via sockets.
- A socket is simply a mechanism with which multiple processes or applications running on your computer (or different computers, for that matter) can communicate with each other.
- With parallel computation, data and results need to be passed back and forth between the parent and child processes and sockets can be used for that purpose.
parLapply¶
Building a socket cluster is simple to do in R with the makeCluster() function.
# Parallel computing using parLapply
library(snow)
cl <- makeCluster(nCores, type = 'SOCK')
system.time(results <- parLapply(cl, trials, boot_fx))
stopCluster(cl)
Attaching package: ‘snow’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, clusterSplit, makeCluster, parApply,
    parCapply, parLapply, parRapply, parSapply, splitIndices,
    stopCluster

Error in checkForRemoteErrors(val): 24 nodes produced errors; first error: object 'x' not found
Traceback:
1. system.time(results <- parLapply(cl, trials, boot_fx))
2. parLapply(cl, trials, boot_fx)
3. docall(c, clusterApply(cl, splitList(x, length(cl)), lapply, fun, ...))
4. do.call("fun", lapply(args, enquote))
5. lapply(args, enquote)
6. clusterApply(cl, splitList(x, length(cl)), lapply, fun, ...)
7. staticClusterApply(cl, fun, length(x), argfun)
8. checkForRemoteErrors(val)
9. stop(count, " nodes produced errors; first error: ", firstmsg)
Timing stopped at: 0.017 0.001 0.022
parLapply¶
- The advantage of this model is that it is supported on all operating systems.
- The disadvantages are increased communication overhead, and that global variables have to be identified and explicitly exported to each worker in the cluster before processing.
- Another advantage of cluster processing is that it also supports workers on external machines, possibly running in remote locations (a sketch follows the example below).
# Parallel computing using parLapply
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- seq(1, 10000)
cl <- makeCluster(nCores, type = 'SOCK')
clusterExport(cl, "x")   # explicitly ship the global 'x' to every worker
system.time(results <- parLapply(cl, trials, boot_fx))
stopCluster(cl)
   user  system elapsed
  0.025   0.003   1.670
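Because socket workers do not have to live on the local machine, the same pattern extends to other computers. A minimal sketch, assuming two hypothetical hosts named node1 and node2 that are reachable over SSH and have R installed ("PSOCK" is the parallel package's own socket cluster type):
# Hypothetical remote cluster: one R worker is started on each named host
cl <- parallel::makeCluster(c("node1", "node2"), type = "PSOCK")
parallel::clusterExport(cl, "x")            # globals still have to be shipped explicitly
results <- parallel::parLapply(cl, trials, boot_fx)
parallel::stopCluster(cl)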
Parallelize using foreach¶
- On a Unix-style OS: multicore + doMC + foreach.
- On Windows (and most other OSes): snow + doParallel + foreach.
- It is very easy to go back to sequential computing (see the sketch after the examples below).
# Almost any OS
library(doParallel)
cl <- makeCluster(nCores, type = 'SOCK')
registerDoParallel(cl)
result <- foreach(i = 1:10000, .combine = c) %dopar% sqrt(i)
stopCluster(cl)
class(result)
Loading required package: foreach
Loading required package: iterators
# Unix-style OS
library(doMC)
registerDoMC(nCores)
result <- foreach(i = 1:10000, .combine = c) %dopar% sqrt(i)
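As noted above, going back to sequential execution is easy. A minimal sketch using facilities of the foreach package itself (registerDoSEQ() registers the sequential backend, and %do% always evaluates sequentially):
# Option 1: keep %dopar% but register the sequential backend
registerDoSEQ()
result_seq <- foreach(i = 1:10000, .combine = c) %dopar% sqrt(i)
# Option 2: use %do%, which runs sequentially regardless of the registered backend
result_do <- foreach(i = 1:10000, .combine = c) %do% sqrt(i)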
Go back to the bootstrapping example¶
# Parallel computing using foreach
doMC::registerDoMC(cores = nCores)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000
system.time({
  r <- foreach(1:trials, .combine = rbind) %dopar% {
    ind <- sample(100, 100, replace=TRUE)
    result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
    coefficients(result1)
  }
})
   user  system elapsed
 26.224   1.586   2.270
Other packages for parallel computing in R¶
Many R packages come with parallel features (or arguments). You may refer to this link for more details. Examples are:
- future provides a lightweight and unified Future API for sequential and parallel processing of R expressions via futures.
- data.table is a venerable and powerful package written primarily by Matt Dowle. It is a high-performance implementation of R’s data frame construct, with an enhanced syntax. There have been innumerable benchmarks showcasing the power of the data.table package. It provides a highly optimized tabular data structure for most common analytical operations.
- The caret package (Classification And REgression Training) is a set of functions that streamline the process of creating predictive models. The package contains tools for data splitting, preprocessing, feature selection, model tuning using resampling, variable importance estimation, and other functionality.
- multidplyr is a backend for dplyr that partitions a data frame across multiple cores. You tell multidplyr how to split the data up with partition(), and then the data stays on each node until you explicitly retrieve it with collect(). This minimizes time spent moving data around and maximizes parallel performance.
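As a taste of the future ecosystem, here is a minimal sketch of the earlier bootstrap loop using future.apply::future_lapply() (future.apply is an add-on package built on future; this assumes it is installed and that x and boot_fx are defined as above; future detects and exports such globals automatically):
# Parallel computing using the future framework
library(future.apply)
plan(multisession, workers = nCores)    # background R sessions; works on all OSes
trials <- seq(1, 10000)
results <- future_lapply(trials, boot_fx, future.seed = TRUE)   # parallel-safe RNG
plan(sequential)                        # switch back to sequential processing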
Lab¶
- Think of a slow piece of code that you have written for one of your past assignments or projects.
- Parallelize it in Python.