layout: true --- class: inverse, center, middle background-image: url(../figs/titlepage16-9.png) background-size: cover <br> <br> # Bayesian Statistics and Computing ## Lecture 5: Debugging and profiling R code <img src="../figs/slides.png" width="150px"/> #### *Yanfei Kang | BSC | Beihang University* --- class: inverse, center, middle # Debugging --- # Something’s Wrong! R has a number of ways to indicate to you that something’s not right, e.g., - `message`: A generic notification/diagnostic message; execution of the function continues. - `warning`: An indication that something is wrong but not necessarily fatal; execution of the function continues. - `error`: An indication that a fatal problem has occurred and execution of the function stops. Try `log(-1)`. --- # Example ```r printmessage <- function(x) { if(x > 0) print("x is greater than zero") else print("x is less than or equal to zero") invisible(x) } ``` Now try `printmessage(1)` and `printmessage(NA)`. What happened? How to fix this problem? --- # Solution ```r printmessage2 <- function(x) { if (is.na(x)) print("x is a missing value!") else if (x > 0) print("x is greater than zero") else print("x is less than or equal to zero") invisible(x) } ``` Now try `printmessage2(NA)`. --- # Example Now try the following and see what happened. ```r x <- log(c(-1, 2)) printmessage2(x) ``` --- # Solution 1. allowing vector arguments. 2. vectorizing the `printmessage2()` function to allow it to take vector arguments. For the first solution, we check the length of the input. ```r printmessage3 <- function(x) { if (length(x) > 1L) stop("'x' has length > 1") if (is.na(x)) print("x is a missing value!") else if (x > 0) print("x is greater than zero") else print("x is less than or equal to zero") invisible(x) } ``` Now try `printmessage3(1:2)`. --- # Solution For the second solution, vectorizing the function can be accomplished easily with the `Vectorize()` function. ```r printmessage4 <- Vectorize(printmessage2) out <- printmessage4(c(-1, 2)) ``` --- # Figuring Out What's Wrong The primary task of debugging any R code is correctly diagnosing what the problem is. Some basic questions you need to ask are - What was your input? How did you call the function? - What were you expecting? Output, messages, other results? - What did you get? - How does what you get differ from what you were expecting? - Were your expectations correct in the first place? - Can you reproduce the problem (exactly)? --- # Debugging Tools in R R provides a number of tools to help you with debugging your code. - `traceback()`: prints out the function call stack after an error occurs; does nothing if there’s no error - `debug()`: flags a function for “debug” mode which allows you to step through execution of a function one line at a time - `browser()`: suspends the execution of a function wherever it is called and puts the function in debug mode - `trace()`, `recover()`, etc. These functions are interactive tools specifically designed to allow you to pick through a function. There's also the more blunt technique of inserting `print()` or `cat()` statements in the function. --- # Using `traceback()` - The `traceback()` function prints out the *function call stack* after an error has occurred. - For example, you may have a function `a()` which subsequently calls function `b()` which calls `c()` and then `d()`. If an error occurs, it may not be immediately clear in which function the error occurred. The `traceback()` function shows you how many levels deep you were when the error occurred. ```r mean(x) traceback() ``` --- # Example ```r f <- function(x) { r <- x - g(x) r } g <- function(y) { r <- y * h(y) r } h <- function(z) { r <- log(z) if (r < 10) r^2 else r^3 } ``` - Try `f(-1)`. - Try `traceback()`. --- # Using `debug()` - The `debug()` function initiates an interactive debugger (also known as the "browser" in R) for a function. - With the debugger, you can step through an R function one expression at a time to pinpoint exactly where an error occurs. - The `debug()` function takes a function as its first argument. For debugging the `f()` function, try `debug(f)`. - Now, every time you call the `f()` function it will launch the interactive debugger. To turn this behavior off you need to call the `undebug()` function. --- # Commands in `debug()` The debugger calls the browser at the very top level of the function body. From there you can step through each expression in the body. There are a few special commands you can call in the browser: * `n` executes the current expression and moves to the next expression * `c` continues execution of the function and does not stop until either an error or the function exits * `Q` quits the browser --- # Example 1 Try `f(-1)`. - While you are in the browser you can execute any other R function that might be available to you in a regular session. - You can use `ls()` to see what is in your current environment (the function environment) and `print()` to print out the values of R objects in the function environment. - You can turn off interactive debugging with the `undebug()` function. --- # Example 2 ```r SS <- function(mu, x) { d <- x - mu d2 <- d ^ 2 ss <- sum(d2) ss } debug(SS) SS(1, rnorm(100)) ``` --- # Summary - There are three main indications of a problem/condition: `message`, `warning`, `error`; only an `error` is fatal. - When analyzing a function with a problem, make sure you can reproduce the problem, clearly state your expectations and how the output differs from your expectation. - Interactive debugging tools `traceback`, `debug`, `browser`, `trace`, and `recover` can be used to find problematic code in functions. - Debugging tools are not a substitute for thinking! --- class: inverse, center, middle # Profiling R > The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming. ---*Donald Knuth* --- # R Profiler - R comes with a profiler to help you optimize your code and improve its performance. - In general, it's usually a bad idea to focus on optimizing your code at the very beginning of development. - Rather, in the beginning it's better to focus on translating your ideas into code and writing code that's coherent and readable. - The problem is that heavily optimized code tends to be obscure and difficult to read, making it harder to debug and revise. Better to get all the bugs out first, then focus on optimizing. --- # What to optimize - Optimizing the parts of your code that are running *slowly*. - How do we know what parts those are? `\(\Rightarrow\)` This is what the profiler is for. - Profiling is a systematic way to examine how much time is spent in different parts of a program. - Often you might write some code that runs fine once. But then later, you might put that same code in a big loop that runs 1,000 times. Now the original code that took 1 second to run is taking 1,000 seconds to run! Getting that little piece of original code to run faster will help the entire loop. --- # Basic principles * Design first, then optimize. * Measure (collect data), don’t guess. * If you're going to be scientist, you need to apply the same principles here! --- # Using `system.time()` The `system.time()` function computes the time (in seconds) needed to execute an expression. The function returns: - *user time*: time charged to the CPU(s) for this expression - *elapsed time*: "wall clock" time, the amount of time that passes for *you* as you're sitting there --- # Using `system.time()` The elapsed time may be *greater than* the user time if the CPU spends a lot of time waiting around. ```r ## Elapsed time > user time system.time(readLines("http://yanfei.site")) #> user system elapsed #> 0.024 0.009 0.511 ``` The elapsed time may be *smaller than* the user time if your machine has multiple cores/processors (and is capable of using them). ```r library(plyr) library(doMC) registerDoMC(cores = detectCores()) system.time(aaply(1:10000, 1, function(x) rnorm(1, mean = x), .parallel = TRUE)) #> user system elapsed #> 2.335 0.720 1.401 ``` --- # Timing Longer Expressions You can time longer expressions by wrapping them in curly braces within the call to `system.time()`. ```r system.time({ n <- 1000 r <- numeric(n) for (i in 1:n) { x <- rnorm(n) r[i] <- mean(x) } }) #> user system elapsed #> 0.049 0.001 0.050 ``` If your expression is getting pretty long (more than 2 or 3 lines), it might be better to either break it into smaller pieces or to use the profiler. The problem is that if the expression is too long, you won't be able to identify which part of the code is causing the bottleneck. --- # The R Profiler - Using `system.time()` allows you to test certain functions or code blocks to see if they are taking excessive amounts of time. - However, this approach assumes that you already know where the problem is and can call `system.time()` on it that piece of code. - What if you don’t know where to start? - This is where the profiler comes in handy. --- # The R Profiler - The `Rprof()` function starts the profiler in R. - In conjunction with `Rprof()`, we will use the `summaryRprof()` function which summarizes the output from `Rprof()`. - `Rprof()` keeps track of the function call stack at regularly sampled intervals and tabulates how much time is spent inside each function. - By default it will write its output to a file called `Rprof.out`. You can specify the name of the output file if you don't want to use this default. - Once you call the `Rprof()` function, everything that you do from then on will be measured by the profiler. - The profiler can be turned off by passing `NULL` to `Rprof()`. --- # Example Now let's play around some data with the profiler running. ```r Rprof() data(diamonds, package = "ggplot2") plot(price ~ carat, data = diamonds) m <- lm(price ~ carat, data = diamonds) abline(m, col = "red") Rprof(NULL) ``` --- # Using `summaryRprof()` The `summaryRprof()` function tabulates the R profiler output and calculates how much time is spent in which function. There are two methods for normalizing the data. * "by.total" divides the time spend in each function by the total run time * "by.self" does the same as "by.total" but first subtracts out time spent in functions above the current function in the call stack. Now try `summaryRprof()` and interprete the results. --- # Interactive Visualizations for Profiling R Code `profvis` is a tool for helping you to understand how R spends its time. ```r library(profvis) profvis({ data(diamonds, package = "ggplot2") plot(price ~ carat, data = diamonds) m <- lm(price ~ carat, data = diamonds) abline(m, col = "red") }) ``` Note that Rstudio includes integrated support for profiling with `profvis`. --- # Summary * `Rprof()` runs the profiler for performance of analysis of R code * `summaryRprof()` summarizes the output of `Rprof()` and gives percent of time spent in each function (with two types of normalization) * Good to break your code into functions so that the profiler can give useful information about where time is being spent * RStudio includes integrated support for profiling with `profvis` * C or Fortran code is not profiled --- # Lab Session 4 In this lab, write your own code, enjoy the tools of debugging and profiling and write a short report of optimizing your code. --- # References Chapters 18 and 19 of the book [R programming for data science](https://bookdown.org/rdpeng/rprogdatascience/).