Processing math: 100%

Statistical Computing

Lecture 5: Debugging and profiling R code

Yanfei Kang
yanfeikang@buaa.edu.cn

School of Economics and Management
Beihang University
http://yanfei.site

Debugging

Something’s Wrong!

R has a number of ways to indicate to you that something’s not right.

message: A generic notification/diagnostic message produced by the message() function; execution of the function continues
warning: An indication that something is wrong but not necessarily fatal; execution of the function continues. Warnings are generated by the warning() function
error: An indication that a fatal problem has occurred and execution of the function stops. Errors are produced by the stop() function.
condition: A generic concept for indicating that something unexpected has occurred; programmers can create their own custom conditions if they want.

Try log(-1).

Example

printmessage <- function(x) {
        if(x > 0)
                print("x is greater than zero")
        else
                print("x is less than or equal to zero")
        invisible(x)        
}

Now try printmessage(1) and printmessage(NA). What happened? How to fix this problem?

Solution

printmessage2 <- function(x) {
    if (is.na(x)) 
        print("x is a missing value!") else if (x > 0) 
        print("x is greater than zero") else print("x is less than or equal to zero")
    invisible(x)
}

Now try printmessage2(NA).

Example

Now try the following and see what happened.

x <- log(c(-1, 2))
printmessage2(x)

Solution

allowing vector arguments.
vectorizing the printmessage2() function to allow it to take vector arguments.

For the first solution, we check the length of the input.

printmessage3 <- function(x) {
    if (length(x) > 1L) 
        stop("'x' has length > 1")
    if (is.na(x)) 
        print("x is a missing value!") else if (x > 0) 
        print("x is greater than zero") else print("x is less than or equal to zero")
    invisible(x)
}

Now try printmessage3(1:2).

For the second solution, vectorizing the function can be accomplished easily with the Vectorize() function.

printmessage4 <- Vectorize(printmessage2)
out <- printmessage4(c(-1, 2))

Figuring Out What's Wrong

The primary task of debugging any R code is correctly diagnosing what the problem is. Some basic questions you need to ask are

What was your input? How did you call the function?
What were you expecting? Output, messages, other results?
What did you get?
How does what you get differ from what you were expecting?
Were your expectations correct in the first place?
Can you reproduce the problem (exactly)?

Debugging Tools in R

R provides a number of tools to help you with debugging your code.

traceback(): prints out the function call stack after an error occurs; does nothing if there’s no error
debug(): flags a function for “debug” mode which allows you to step through execution of a function one line at a time
browser(): suspends the execution of a function wherever it is called and puts the function in debug mode
trace(): allows you to insert debugging code into a function a specific places
recover(): allows you to modify the error behavior so that you can browse the function call stack

These functions are interactive tools specifically designed to allow you to pick through a function. There's also the more blunt technique of inserting print() or cat() statements in the function.

Using `traceback()`

The traceback() function prints out the function call stack after an error has occurred.
For example, you may have a function a() which subsequently calls function b() which calls c() and then d(). If an error occurs, it may not be immediately clear in which function the error occurred. The traceback() function shows you how many levels deep you were when the error occurred.

mean(x)
traceback()

Example

f <- function(x) {
  r <- x - g(x)
  r
}
g <- function(y) {
  r <- y * h(y)
  r
}
h <- function(z) {
  r <- log(z)
  if (r < 10)
    r^2
  else r^3
}

Try f(-1).
Try traceback().
Looking at the traceback is useful for figuring out roughly where an error occurred but it's not useful for more detailed debugging. For that you might turn to the debug() function.

Using `debug()`

The debug() function initiates an interactive debugger (also known as the "browser" in R) for a function.
With the debugger, you can step through an R function one expression at a time to pinpoint exactly where an error occurs.
The debug() function takes a function as its first argument. For debugging the f() function, try debug(f).
Now, every time you call the f() function it will launch the interactive debugger. To turn this behavior off you need to call the undebug() function.

Commands in `debug()`

The debugger calls the browser at the very top level of the function body. From there you can step through each expression in the body. There are a few special commands you can call in the browser:

n executes the current expression and moves to the next expression
c continues execution of the function and does not stop until either an error or the function exits
Q quits the browser

Example 1

Try f(-1).

While you are in the browser you can execute any other R function that might be available to you in a regular session.
You can use ls() to see what is in your current environment (the function environment) and print() to print out the values of R objects in the function environment.
You can turn off interactive debugging with the undebug() function.

Example 2

SS <- function(mu, x) {
  d <- x - mu
  d2 <- d ^ 2
  ss <- sum(d2)
  ss
}
debug(SS)
SS(1, rnorm(100))

Using `recover()`

The recover() function can be used to modify the error behavior of R when an error occurs. Normally, when an error occurs in a function, R will print out an error message, exit out of the function, and return you to your workspace to await further commands.
With recover() you can tell R that when an error occurs, it should halt execution at the exact point at which the error occurred. That can give you the opportunity to poke around in the environment in which the error occurred. This can be useful to see if there are any R objects or data that have been corrupted or mistakenly modified.

options(error = recover)    ## Change default R error behavior
f(-1)

The recover() function will first print out the function call stack when an error occurrs. Then, you can choose to jump around the call stack and investigate the problem. When you choose a frame number, you will be put in the browser (just like the interactive debugger triggered with debug()) and will have the ability to poke around.

Summary

There are three main indications of a problem/condition: message, warning, error; only an error is fatal
When analyzing a function with a problem, make sure you can reproduce the problem, clearly state your expectations and how the output differs from your expectation
Interactive debugging tools traceback, debug, browser, trace, and recover can be used to find problematic code in functions
Debugging tools are not a substitute for thinking!

Profiling R

The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.

—Donald Knuth

R Profiler

R comes with a profiler to help you optimize your code and improve its performance.
In generall, it's usually a bad idea to focus on optimizing your code at the very beginning of development.
Rather, in the beginning it's better to focus on translating your ideas into code and writing code that's coherent and readable.
The problem is that heavily optimized code tends to be obscure and difficult to read, making it harder to debug and revise. Better to get all the bugs out first, then focus on optimizing.

What to optimize

Optimizing the parts of your code that are running slowly
How do we know what parts those are? $\Rightarrow$ This is what the profiler is for.
Profiling is a systematic way to examine how much time is spent in different parts of a program.
Often you might write some code that runs fine once. But then later, you might put that same code in a big loop that runs 1,000 times. Now the original code that took 1 second to run is taking 1,000 seconds to run! Getting that little piece of original code to run faster will help the entire loop.

Basic principles

Design first, then optimize
Remember: Premature optimization is the root of all evil
Measure (collect data), don’t guess.
If you're going to be scientist, you need to apply the same principles here!

Using `system.time()`

The system.time() function computes the time (in seconds) needed to execute an expression. The function returns:

user time: time charged to the CPU(s) for this expression
elapsed time: "wall clock" time, the amount of time that passes for you as you're sitting there

The elapsed time may be greater than the user time if the CPU spends a lot of time waiting around.

## Elapsed time > user time
system.time(readLines("http://www.jhsph.edu"))

##    user  system elapsed 
##   0.042   0.005   4.956

The elapsed time may be smaller than the user time if your machine has multiple cores/processors (and is capable of using them).

library(plyr)
library(doMC)

## Loading required package: foreach

## Loading required package: iterators

## Loading required package: parallel

registerDoMC(cores = detectCores())
system.time(aaply(1:10000, 1, function(x) rnorm(1, mean = x), 
    .parallel = TRUE))

##    user  system elapsed 
##   3.790   0.745   2.518

Timing Longer Expressions

You can time longer expressions by wrapping them in curly braces within the call to system.time().

system.time({
    n <- 1000
    r <- numeric(n)
    for (i in 1:n) {
        x <- rnorm(n)
        r[i] <- mean(x)
    }
})

##    user  system elapsed 
##   0.105   0.004   0.109

If your expression is getting pretty long (more than 2 or 3 lines), it might be better to either break it into smaller pieces or to use the profiler. The problem is that if the expression is too long, you won't be able to identify which part of the code is causing the bottleneck.

The R Profiler

Using system.time() allows you to test certain functions or code blocks to see if they are taking excessive amounts of time.
However, this approach assumes that you already know where the problem is and can call system.time() on it that piece of code.
What if you don’t know where to start?
This is where the profiler comes in handy.

The R Profiler

The Rprof() function starts the profiler in R.
In conjunction with Rprof(), we will use the summaryRprof() function which summarizes the output from Rprof().
Rprof() keeps track of the function call stack at regularly sampled intervals and tabulates how much time is spent inside each function.
By default it will write its output to a file called Rprof.out. You can specify the name of the output file if you don't want to use this default.
Once you call the Rprof() function, everything that you do from then on will be measured by the profiler.
The profiler can be turned off by passing NULL to Rprof().

Example

Now let's play around some data with the profiler running.

Rprof()
data(diamonds, package = "ggplot2")
plot(price ~ carat, data = diamonds)
m <- lm(price ~ carat, data = diamonds)
abline(m, col = "red")
Rprof(NULL)

Using `summaryRprof()`

The summaryRprof() function tabulates the R profiler output and calculates how much time is spent in which function. There are two methods for normalizing the data.

"by.total" divides the time spend in each function by the total run time
"by.self" does the same as "by.total" but first subtracts out time spent in functions above the current function in the call stack.

Now try summaryRprof() and interprete the results.

Interactive Visualizations for Profiling R Code

profvis is a tool for helping you to understand how R spends its time.

library(profvis)

profvis({
  data(diamonds, package = "ggplot2")
  plot(price ~ carat, data = diamonds)
  m <- lm(price ~ carat, data = diamonds)
  abline(m, col = "red")
})

Note that Rstudio includes integrated support for profiling with profvis.

Summary

Rprof() runs the profiler for performance of analysis of R code
summaryRprof() summarizes the output of Rprof() and gives percent of time spent in each function (with two types of normalization)
Good to break your code into functions so that the profiler can give useful information about where time is being spent
RStudio includes integrated support for profiling with profvis
C or Fortran code is not profiled

Lab Session 4

In this lab, write your own code, enjoy the tools of debugging and profiling and write a short report of optimizing your code.

References

Chapters 20 and 21 of the book "R programming for data science".

Statistical Computing

Lecture 5: Debugging and profiling R code

Debugging

Something’s Wrong!

Example

Solution

Example

Solution

Figuring Out What's Wrong

Debugging Tools in R

Using traceback()

Example

Using debug()

Commands in debug()

Example 1

Example 2

Using recover()

Summary

Profiling R

R Profiler

What to optimize

Basic principles

Using system.time()

Timing Longer Expressions

The R Profiler

The R Profiler

Example

Using summaryRprof()

Interactive Visualizations for Profiling R Code

Summary

Lab Session 4

References

Using `traceback()`

Using `debug()`

Commands in `debug()`

Using `recover()`

Using `system.time()`

Using `summaryRprof()`