In the first chapter, we coded within the interactive mode of R/Python. When working on real-world projects, using an integrated development environment (IDE) is a more pragmatic choice. There are not many choices for an R IDE; among them, RStudio1 is the best one I have used so far. As for Python, I would recommend either Visual Studio Code2 or PyCharm3. But of course, you could use any text editor to write R/Python scripts.
Let’s write a simple script to print Hello World! in R/Python. I have made a directory chapter2 on my disk; inside the directory, the R script is saved as hello_world.R and the Python script is saved as hello_world.py.
chapter2/hello_world.R
chapter2/hello_world.py
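The scripts themselves are a single line each. As a sketch of what the Python version likely contains:

```python
# A minimal sketch of chapter2/hello_world.py
message = "Hello World!"
print(message)
```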
There are a few ways to run the R script. For example, we can run it from the console with the r -f filename command. Alternatively, we can open an R interactive session and use the source() function, which is the approach I recommend. As for the Python script, we can run it from the console.
Debugging is one of the most important aspects of programming. What is debugging? The programs we write may contain errors/bugs, and debugging is the step-by-step process of finding and removing them in order to get the desired results.
If you are smart enough, or the bugs are evident enough, you can debug the program in your head without using a computer at all. But in general we need some tools/techniques to help us with debugging.
Most programming languages provide printing functionality, which is a natural way of debugging: by placing print statements at different positions we may eventually catch the bugs. When I use print to debug, it feels like playing a game of Minesweeper. In Python, there is a module called logging4 which can be used for debugging like the print function, but in a more elegant fashion.
In R, there is a function browser() which interrupts the execution and allows inspection of the current environment. Similarly, Python’s pdb module provides richer debugging features. We will focus only on the basic usage of browser() and of the set_trace() function in the pdb module.
The essential difference between debugging with print() and debugging with browser() or set_trace() is that the latter functions allow us to debug interactively.
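As a sketch of how set_trace() is placed (the function and data here are just illustrations, not the book’s listing), pausing execution at the marked line would drop us into an interactive session where we can inspect variables and step through statements:

```python
import pdb

def running_total(values):
    total = 0
    # Uncommenting the next line pauses execution here and opens the
    # interactive debugger: `n` steps to the next statement, `c` continues,
    # and `q` quits -- analogous to browser() in R.
    # pdb.set_trace()
    for v in values:
        total += v
    return total

print(running_total([1, 2, 3]))
```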
Let’s write a function which takes a sorted vector/list v and a target value x as input and returns the leftmost index pos of the sorted vector/list such that v[pos]>=x. Since v is already sorted, we may simply loop through it from left to right to find pos.
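A sketch of the Python version (0-based indexing; returning None when no element qualifies, matching the behavior described below — the book’s actual listing may differ in details):

```python
def find_pos(v, x):
    # v is sorted ascending; return the leftmost index pos with v[pos] >= x,
    # or None if every element of v is smaller than x.
    for pos, value in enumerate(v):
        if value >= x:
            return pos
    return None

print(find_pos([1, 2, 5, 10], 5))
```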
chapter2/find_pos.R
chapter2/find_pos.py
Now let’s run these two scripts.
When x=11, the function returns NULL in R and None in Python, because there is no element in v greater than or equal to x.
The implementation above is trivial, but not efficient. If you have some background in data structures and algorithms, you probably know that this question can be solved by binary search. The essential idea of binary search comes from divide-and-conquer5. Since v is already sorted, we may divide it into two partitions by cutting it in the middle, obtaining a left partition and a right partition. Because v is sorted, both partitions are also sorted. If the target value x is larger than the rightmost element of the left partition, we can discard the left partition and search for x within the right partition; otherwise, we can discard the right partition and search for x within the left partition. Once we have determined which partition to search, we apply the idea recursively, so that in each step we halve the size of v. If the length of v is denoted as n, then in terms of big O notation6 the run time complexity of binary search is $\mathcal{O}(\log{}n)$, compared with $\mathcal{O}(n)$ for the for-loop implementation.
The code below implements the binary search solution to our question. (It is more intuitive to implement with recursion, but here I use iteration instead, since tail recursion optimization7 is not supported in R/Python.)
chapter2/find_binary_search_buggy.R
chapter2/find_binary_search_buggy.py
Now let’s run these two binary search scripts.
The binary search solutions don’t work as expected when x=11. We write two new scripts to debug them.
chapter2/find_binary_search_buggy_debug.R
chapter2/find_binary_search_buggy_debug.py
Let’s try to debug the programs with the help of browser() and set_trace().
In the R code snippet above, we placed the browser() function at the top of the function binary_search_buggy, so when we call the function we enter the debugging environment. By calling ls() we see all the variables in the current debugging scope, i.e., v, x. Typing n evaluates the next statement. After typing n a few times, we finally exit from the while loop because start = 4, so that start < end is FALSE. As a result, the function simply returns the value of start, i.e., 4. To exit from the debugging environment, we can type Q; to continue the execution, we can type c.
The root cause is that we didn’t correctly handle the corner case where the target value x is larger than the last/largest element in v.
Let’s debug the Python function using the pdb module.
Similar to R, the command n evaluates the next statement in pdb. The command l shows the current line of execution. The command b line_number sets the corresponding line as a breakpoint; and c continues the execution until the next breakpoint (if one exists).
In R, besides the browser() function there is a pair of functions, debug() and undebug(), which are also very handy when we try to debug a function, especially when the function is wrapped in a package. More specifically, the debug() function invokes the debugging environment whenever we call the function to be debugged. See the example below for how we invoke the debugging environment for the sd function (standard deviation calculation).
The binary_search solutions are fixed below.
chapter2/find_binary_search.R
chapter2/find_binary_search.py
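Based on the root cause identified above, a sketch of the fixed Python version first checks the corner case where x exceeds the largest element before running the usual halving loop (the book’s listing may differ in details):

```python
def binary_search(v, x):
    # Corner case: if x is larger than the largest element (or v is empty),
    # no position satisfies v[pos] >= x.
    if not v or v[-1] < x:
        return None
    start, end = 0, len(v) - 1
    while start < end:
        mid = (start + end) // 2
        if v[mid] < x:
            # x must be in the right partition; discard the left one
            start = mid + 1
        else:
            # v[mid] >= x, so the answer is mid or further left
            end = mid
    return start

print(binary_search([1, 2, 5, 10], 5))
```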
By benchmarking, I mean measuring the total running time of a piece of a program. A related term is profiling, which is more involved: profiling commonly aims at understanding the behavior of a program and optimizing it based on where the time is spent.
In R, I like using the microbenchmark package. In Python, the timeit module is a good tool when we want to benchmark a small bit of Python code.
As mentioned before, the run time complexity of binary search is better than that of a for-loop search. We can do benchmarking to compare the two algorithms.
chapter2/benchmark.R
In the R code above, times=1000 means we want to call the function 1000 times in the benchmarking process. The sample() function is used to draw samples from a set of elements; specifically, we pass the argument 1 to sample() to draw a single element. It is the first time we use the set.seed() function in this book. In R/Python, we draw random numbers using a pseudorandom number generator (PRNG) algorithm8. The sequence of numbers generated by a PRNG is completely determined by an initial value, i.e., the seed. Whenever a program involves the use of a PRNG, it is better to set the seed in order to get reproducible results (see the example below).
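The reproducibility point can be sketched in Python with the built-in random module (the seed value is arbitrary):

```python
import random

random.seed(2019)            # fix the PRNG state
first_draw = random.random()

random.seed(2019)            # resetting the same seed replays the sequence
second_draw = random.random()

# The two draws are identical because the PRNG sequence is completely
# determined by the seed.
print(first_draw == second_draw)
```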
Now let’s run the R script to see the benchmarking result.
The binary_search solution is much more efficient based on the benchmarking result.
Doing the same benchmarking in Python is a bit more complicated.
chapter2/benchmark.py
The most interesting part of the Python code above is from __main__ import. Let’s ignore it for now; we will revisit it later.
Below is the benchmarking result in Python (the unit is second).
In parallel computing, automatic vectorization9 means that a program written as a scalar implementation is converted by a compiler featuring auto-vectorization into a vector implementation, which processes multiple pairs of operands simultaneously. For example, let’s calculate the element-wise sum of two arrays x and y of the same length in the C programming language.
The C code above might be vectorized by the compiler so that the actual number of iterations performed is fewer than 4; if 4 pairs of operands are processed at once, there would be only 1 iteration. Automatic vectorization may make programs run much faster in languages like C. However, when we talk about vectorization in R/Python, it is different from automatic vectorization: vectorization in R/Python usually refers to the human effort spent avoiding for-loops. First, let’s see some examples of how for-loops may slow down your programs in R/Python.
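The element-wise sum from the C example can be sketched in Python with numpy; the vectorized form pushes the loop into compiled code, which is exactly the effect R/Python vectorization aims for:

```python
import numpy as np

x = np.arange(100_000, dtype=np.float64)
y = np.arange(100_000, dtype=np.float64)

# Scalar (for-loop) implementation: one Python-level iteration per element.
z_loop = np.empty_like(x)
for i in range(len(x)):
    z_loop[i] = x[i] + y[i]

# Vectorized implementation: a single call, with the loop run in compiled code.
z_vec = x + y

print(np.array_equal(z_loop, z_vec))
```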
chapter2/vectorization_1.R
Running the R code gives the following result on my local machine.
Please note that in this Python example we use the random submodule of the numpy module instead of the built-in random module, since the built-in random module doesn’t provide a vectorized random number generation function. Running the Python code gives the following result on my local machine.
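A sketch of the same comparison, using timeit’s callable interface (which sidesteps the from __main__ import setup used in benchmark.py) and the modern numpy Generator API — the book’s listing likely uses the older numpy.random functions, but the conclusion is the same:

```python
import timeit
import numpy as np

n = 100_000

def scalar_version():
    # draw standard normal r.v. one at a time in a Python-level loop
    rng = np.random.default_rng(0)
    return [rng.standard_normal() for _ in range(n)]

def vectorized_version():
    # one call generates the entire array in compiled code
    rng = np.random.default_rng(0)
    return rng.standard_normal(n)

number = 3
t_scalar = timeit.timeit(scalar_version, number=number)
t_vector = timeit.timeit(vectorized_version, number=number)
# timeit.timeit returns the TOTAL time over `number` runs;
# divide by `number` for the average time per run.
print(t_scalar / number, t_vector / number)
```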
Also note that timeit.timeit measures the total time to run the main statement number times, not the average time per run.
In both R and Python, the vectorized version of normal random variable (r.v.) generation is significantly faster than the scalar version. It is worth noting the usage of the print(f'') statement in the Python code, which is different from the way we printed the object of the Complex class in Chapter 1. In the code above, we use an f-string10, which is a literal string prefixed with 'f' containing expressions inside {} that are replaced with their values. The f-string was introduced in Python 3.6. If you are familiar with Scala, you may find this feature quite similar to the string interpolation mechanism introduced in Scala 2.10.
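A quick sketch of f-string substitution, including a format specification after the colon:

```python
name = "f-string"
value = 3.14159
# Expressions inside {} are evaluated and substituted into the string;
# an optional format spec (here .2f) follows a colon.
print(f"{name} rounds {value} to {value:.2f}")
```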
It’s also worth noting that many built-in functions in R are already vectorized, such as the basic arithmetic operators, the comparison operators, ifelse(), and the element-wise logical operators & and |. But the logical operators && and || are not vectorized.
In addition to vectorization, there are also some built-in functions which may help avoid the use of for-loops. For example, in R we might be able to use the apply family of functions to replace for-loops, and in Python the map() function can also be useful. In the Python pandas module, there are likewise many uses of the map/apply methods. In general, however, the use of apply/map functions has little or nothing to do with performance improvement; rather, appropriate use of such functions may help with the readability of the program. Compared with the apply family of functions in R, I think the do.call() function is more useful in practice. We will spend some time on do.call() later.
Considering the importance of vectorization in scientific programming, let’s try to get more familiar with vectorization through the Biham–Middleton–Levine (BML) traffic model11. The BML model is very important in modern studies of traffic flow since it exhibits a sharp phase transition from a free-flowing status to a fully jammed status. A simplified BML model can be characterized as follows:
- The model is initialized on a 2-D lattice, each site of which is either empty or occupied by a colored particle (blue or red);
- Particles are distributed randomly during initialization according to a uniform distribution; the two colors of particles are equally distributed;
- On even time steps, all blue particles attempt to move one site up; an attempt fails if the site to occupy is not empty;
- On odd time steps, all red particles attempt to move one site right; an attempt fails if the site to occupy is not empty;
- The lattice is assumed periodic, meaning that when a particle moves out of the lattice, it re-enters from the opposite side.
The BML model specified above is implemented in both R/Python as follows to illustrate the usage of vectorization.
chapter2/BML.R
Now we can create a simple BML system on a $5\times5$ lattice using the R code above.
In the initialization step, we used the inverse transform sampling approach12 to generate the status of each site. Inverse transform sampling is a basic but powerful method to generate an r.v. from any probability distribution given its cumulative distribution function (CDF). Reading the wiki page is enough to master this sampling method.
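The site initialization via inverse transform sampling can be sketched as follows. Here alpha (the occupied fraction) and the state encoding 0/1/2 are assumptions for illustration, not necessarily the book’s conventions: we draw u ~ Uniform(0, 1) and invert the discrete CDF by comparing u against the cumulative probabilities.

```python
import random

def init_site(alpha, rng):
    # Discrete distribution: P(empty) = 1 - alpha,
    # P(red) = alpha / 2, P(blue) = alpha / 2.
    u = rng.random()
    if u < 1 - alpha:
        return 0          # empty
    elif u < 1 - alpha / 2:
        return 1          # red
    else:
        return 2          # blue

rng = random.Random(42)
# a 5 x 5 lattice with half of the sites occupied in expectation
lattice = [[init_site(0.5, rng) for _ in range(5)] for _ in range(5)]
```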
The Python implementation is also given.
Please note that although we have imported numpy in BML.py, we import it again in the code above in order to set the random seed. If we changed the line to from BML import *, we wouldn’t need to import numpy again; but importing * from a module is not recommended.
According to the explanation on Wikipedia13, single-threading is the processing of one command at a time, and its opposite is multithreading. A process is an instance of a computer program executed by one or many threads14. Multithreading is not the same as parallelism. In a single-processor environment, the processor can switch execution between threads, which results in concurrent execution. However, it is possible for a process with multiple threads to run in a multiprocessor environment with each thread on a separate processor, which results in parallel execution.
Both R and Python are single-threaded. In Python, there is a threading package15 which supports multithreading on a single core, and it may suit some specific tasks. For example, in web scraping, multithreading on a single core may speed up the program if the download time exceeds the CPU processing time.
Now let’s talk about embarrassing parallelism via multiprocessing. An embarrassingly parallel problem is one where little or no effort is needed to separate the problem into a number of parallel tasks16. In R there are various packages supporting multiprocessing on multiple CPU cores, for example the parallel package, which is my favorite. In Python, there are also several modules available, such as multiprocessing, joblib and concurrent.futures. Let’s see an application of embarrassing parallelism: calculating $\pi$ using Monte Carlo simulation17.
Monte Carlo simulation provides a simple and straightforward way to estimate $\pi$. We know the area of a circle with radius 1 is just $\pi$. Thus, we can convert the original problem of calculating $\pi$ into a new problem: how to calculate the area of a circle with radius 1. We also know the area of a square with side length 2 is equal to 4. Thus, $\pi$ can be calculated as $4r_{c/s}$, where $r_{c/s}$ denotes the ratio of the area of a circle with radius 1 to the area of a square with side length 2. Now, how do we calculate the ratio $r_{c/s}$? When we randomly throw $n$ points into the square and $m$ of these points fall into the inscribed circle, we can estimate the ratio as $m/n$. As a result, a natural estimate of $\pi$ is $4m/n$. This problem is embarrassingly parallel by its nature. Let’s see how to implement the idea in R/Python.
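Before parallelizing, the estimator itself can be sketched serially (the function name and seed are just for illustration):

```python
import random

def estimate_pi(n, seed=0):
    rng = random.Random(seed)
    m = 0
    for _ in range(n):
        # throw a point uniformly into the square [-1, 1] x [-1, 1]
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:   # the point falls inside the unit circle
            m += 1
    return 4 * m / n             # pi is approximately 4 * m / n

print(estimate_pi(100_000))
```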
chapter2/pi.R
In the above R code snippet, we use the function mclapply, which is not currently available on some operating systems18. When it is not available, we may consider using parLapply instead.
chapter2/pi.py
In the above Python code snippet, we defined a function generate_points_parallel which returns a list of counts of inside points. The problem with this function is that each process in the pool shares the same random state; as a result, calling it yields a list of duplicated numbers. To overcome the issue, we defined another function, generate_points_parallel_set_seed. Let’s see the results of our estimation of $\pi$ running on a laptop.
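The per-task seeding fix can be sketched as follows; this is an illustration rather than the book’s listing, and it assumes an OS where multiprocessing works with top-level functions (e.g., fork on Linux):

```python
import multiprocessing as mp
import random

def count_inside(args):
    # Seeding a private PRNG per task avoids all workers sharing one
    # random state (the duplicated-results bug described above).
    seed, n = args
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            inside += 1
    return inside

if __name__ == "__main__":
    tasks = [(seed, 25_000) for seed in range(4)]  # distinct seed per task
    with mp.Pool(processes=2) as pool:
        counts = pool.map(count_inside, tasks)
    print(4 * sum(counts) / 100_000)
```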
We see that the winner in this example is vectorization, and that the parallel solution is better than the naive one. However, when the problem cannot be vectorized, we may use parallelization to achieve better performance.
R is a lazy programming language, since it uses lazy evaluation by default19. The lazy evaluation strategy delays the evaluation of an expression until its value is needed. When an expression in R is evaluated, it generally follows an outermost reduction order. Python, by contrast, is not a lazy programming language, and expressions in Python follow an innermost reduction order.
The code snippets below illustrate the innermost reduction vs. outermost reduction.
Because of the outermost reduction order, the R code snippet evaluates the function power first, and since the second argument is 0, the first argument need not be evaluated at all; thus, the function call returns 1. But the Python code snippet first evaluates the function call divide, and an exception is raised because of the division by zero.
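One way to mimic R’s behavior in Python is to delay an argument by wrapping it in a zero-argument function (a "thunk"); this is a sketch of the idea, not a pattern from the book:

```python
def power(x, p):
    # `x` is a zero-argument function, so it is only evaluated if actually
    # needed -- mimicking R's outermost reduction order.
    return 1 if p == 0 else x() ** p

# The division by zero never happens, because p == 0 means the thunk
# is never called.
print(power(lambda: 1 / 0, 0))
```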
Although Python is an eager language, it is possible to simulate lazy evaluation behavior. For example, the code inside a generator20 is evaluated when the generator is consumed, not when it is created. We can create a sequence of Fibonacci numbers with the help of a generator.
In the code snippet above, we create a generator which generates the sequence of Fibonacci numbers less than 10. When the generator is created, the sequence is not generated immediately; when we consume the generator in a loop, each element is generated as needed. It is also common to consume a generator with the next function.
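As a sketch (starting the sequence at 0; the book’s listing may start at 1), nothing inside the generator body runs until a value is requested:

```python
def fib_below(limit):
    # Generator: the body is only executed as values are requested.
    a, b = 0, 1
    while a < limit:
        yield a
        a, b = b, a + b

gen = fib_below(10)
print(next(gen))   # consume one element at a time with next()
print(list(gen))   # or drain the rest into a list
```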
The purpose of this section is not to teach C/C++; but if you know how to write C/C++, it may help improve the performance of your R/Python programs.
It is recommended to use vectorization techniques to speed up your program whenever possible. But what if it is infeasible to vectorize the algorithm? One potential option is to use C/C++ in your R/Python code. In my limited experience, it is much easier to use C/C++ in R, with the help of the Rcpp package, than in Python. In Python, more options are available, but none is as straightforward as Rcpp. In this section, let’s use Cython21. Cython is itself a programming language written in Python and C. With Rcpp, we can write C++ code directly and use it in native R code. With the help of Cython, we can write Python-like code which can be compiled to C or C++ and automatically wrapped into Python-importable modules. Cython can also wrap independent C or C++ code into Python-importable modules.
Let’s see how to calculate Fibonacci numbers with these tools. The first step is to install the Rcpp package in R and the Cython module (use pip) in Python. Once they are installed, we can write C++ code directly for R. As for Cython, we write the Python-like code, which is compiled to C/C++ later.
chapter2/Fibonacci.cpp
chapter2/Fibonacci.pyx
It’s worth noting that the extension of the Cython file above is pyx, not py. In Fibonacci.pyx, the function we define mostly follows native Python syntax, but in Cython we can add static type declarations, which are often helpful for performance although not mandatory. The keyword cdef makes the function Fibonacci_c invisible to Python, so we have to define the Fibonacci_static function as a wrapper which can be imported into Python. The function Fibonacci is left without static typing for benchmarking.
Let’s also implement the same functions in native R/Python for benchmarking purposes.
chapter2/Fibonacci.R
chapter2/Fibonacci.py
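A sketch of the native Python version used for comparison — the naive recursive definition, assuming the convention F(1) = F(2) = 1 (the book’s listing may use a different convention):

```python
def Fibonacci(n):
    # Naive recursion: exponential time, which is exactly why the
    # C++/Cython implementations shine in the benchmark.
    if n <= 2:
        return 1
    return Fibonacci(n - 1) + Fibonacci(n - 2)

print(Fibonacci(10))
```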
Now let’s compare the performance of different implementations.
The native implementation in R is much slower than the C++ implementation (note the difference in units).
The results show that the statically typed implementation is the fastest, and the native implementation in pure Python is the slowest. Again, the time measured by timeit.timeit is the total time to run the main statement 1000 times. The average time per run of the Fibonacci_static function is close to the average time per run of Fibonacci in R.
Functional (programming) languages define programs as mathematical functions and treat them as first-class22. In pure functional programming, a function takes input and returns output without making any side effect. A side effect occurs when a state variable outside the function is modified; for example, printing the value of a variable inside a function is a side effect.
Neither R nor Python is a purely functional language per se, although R behaves more functionally than Python23. If you have wondered why the return function can be omitted when returning a value from an R function, now you have the explanation: from an FP perspective, a function should return its output automatically.
I don’t think it makes much sense to debate whether we should choose OOP or FP in R/Python, as both languages support multiple programming paradigms (FP, OOP, etc.). If you want to do purist FP, neither R nor Python would be the best choice. For example, purist FP should avoid loops, because a loop always involves an iteration counter whose value changes over the iterations, which is obviously a side effect. Thus, recursion is very important in FP. But tail recursion optimization is not supported in R or Python, although it can be implemented via trampolines24.
Let’s introduce a few tools that could help to get a first impression of FP in R and Python.
We have seen anonymous functions before. Anonymous functions are commonly used in FP. To recap, let’s see how to define a useless anonymous function.
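In Python, an anonymous function is a lambda expression; a "useless" one can be created and applied on the spot without ever receiving a name:

```python
# Created, applied immediately, and never bound to a name.
print((lambda x: x ** 2)(3))
```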
Usually, we define anonymous functions because we want to pass them to other functions. Thus, the functions defined above are useless: they are not passed to any other function, and after the definition they are automatically deleted from memory.
Of course, we can assign names to anonymous functions – but then what would be the point of calling them anonymous?
Map can be used to avoid loops. Its first argument is a function.
In R, we use Map, which ships with base R; in Python, we use map, which is a built-in function.
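A sketch of map in Python, applying an anonymous function to each element of a sequence:

```python
squares = map(lambda x: x ** 2, [1, 2, 3, 4])
# map returns a lazy map object, not a list; convert it to see the values
print(list(squares))
```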
There are a few things to note from the code snippets above. First, the value returned by Map in R is a list rather than a vector. That is because Map is just a wrapper around the mapply function in R with the SIMPLIFY argument set to FALSE. If we want a vector returned, we can use mapply directly.
The value returned by map in Python is a map object, which is an iterator. The generator function we saw before is also an iterator. To use an iterator, we can convert it to a list, get the next value using the next function, or simply use a for loop.
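The three ways of consuming an iterator can be sketched on a small map object:

```python
it = map(lambda x: x + 1, [10, 20, 30])
print(next(it))   # take one element at a time
print(next(it))   # take the next one
print(list(it))   # drain whatever is left into a list
```

Note that an iterator is exhausted as it is consumed: after the two next calls, only one element remains for the final list.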
Similar to map, the first argument to Filter in R and filter in Python is also a function. This function is used to select the elements satisfying a condition specified in that first argument. For example, we can use it to get the even numbers.
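A sketch of the even-number selection with filter in Python:

```python
# Keep only the elements for which the predicate returns True.
evens = filter(lambda x: x % 2 == 0, range(10))
print(list(evens))
```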
Reduce behaves a bit differently from Map and Filter. Its first argument is a function taking exactly two arguments. The second argument is a sequence, to each element of which the function argument is applied from left to right. There is also an optional argument called the initializer; when provided, it is used as the first argument in the first application of the function argument of Reduce. The examples below show how it works.
Please note that in order to use reduce in Python, we have to import it from the functools module.
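A sketch of reduce in Python, with and without the initializer:

```python
from functools import reduce

# Left fold over the sequence: ((1 + 2) + 3) + 4
print(reduce(lambda acc, x: acc + x, [1, 2, 3, 4]))

# With an initializer, the fold starts from it: (((100 + 1) + 2) + 3) + 4
print(reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 100))
```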
Utilizing these functions flexibly may help to make the code more concise, at the cost of readability.
We have introduced the basics of R/Python programming so far, but there is much more to learn to become an advanced R/Python user. For example, appropriate use of iterators, generators and decorators can improve both the conciseness and readability of your Python code. Generators25 are commonly seen in machine learning programs to prepare training/testing samples. A decorator is a kind of syntactic sugar allowing the modification of a function’s behavior in a simple way. In R there are no built-in iterators, generators or decorators, but you may find third-party libraries mimicking these features, or you may try to implement your own.
One advantage of Python over R is that some built-in modules contain high-performance data structures and efficient implementations of commonly-used algorithms. For example, I enjoy using the deque structure in the Python collections module26, which has no built-in counterpart in R. The binary search algorithm we wrote earlier in this chapter can also be replaced by functions from the built-in bisect module27 in Python.
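For instance, bisect_left returns the leftmost index at which x could be inserted while keeping the list sorted, which is exactly the leftmost pos with v[pos] >= x; a thin wrapper (an illustration, mirroring the earlier convention of returning None) recovers our earlier function:

```python
from bisect import bisect_left

def find_pos(v, x):
    # bisect_left returns len(v) when every element is smaller than x;
    # map that case to None to match the earlier hand-written version.
    pos = bisect_left(v, x)
    return None if pos == len(v) else pos

print(find_pos([1, 2, 5, 10], 5))
print(find_pos([1, 2, 5, 10], 11))
```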
Another important aspect of programming is testing. Unit testing is a typical software testing method that is commonly adopted in practice. In R, there are two third-party packages, testthat and RUnit. In Python, the built-in unittest module is quite powerful, and the third-party module pytest28 is also very popular.
1 https://www.rstudio.com/products/rstudio
2 https://code.visualstudio.com/
3 https://www.jetbrains.com/pycharm
4 https://docs.python.org/3/library/logging.html
5 https://en.wikipedia.org/wiki/Divide-and-conquer_algorithm
6 https://en.wikipedia.org/wiki/Big_O_notation
7 https://en.wikipedia.org/wiki/Tail_call
8 https://en.wikipedia.org/wiki/Pseudorandom_number_generator
9 https://en.wikipedia.org/wiki/Automatic_vectorization
10 https://www.python.org/dev/peps/pep-0498
11 https://en.wikipedia.org/wiki/Biham-Middleton-Levine_traffic_model
12 https://en.wikipedia.org/wiki/Inverse_transform_sampling
13 https://en.wikipedia.org/wiki/Thread_(computing)
14 https://en.wikipedia.org/wiki/Process_(computing)
15 https://docs.python.org/3.7/library/threading.html
16 https://en.wikipedia.org/wiki/Embarrassingly_parallel
17 https://en.wikipedia.org/wiki/Monte_Carlo_method
18 https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf
19 Floréal Morandat, Brandon Hill, Leo Osvald, and Jan Vitek. Evaluating the design of the r language. In European Conference on Object-Oriented Programming, pages 104–131. Springer, 2012.
20 https://www.python.org/dev/peps/pep-0255
21 https://en.wikipedia.org/wiki/Cython
22 https://en.wikipedia.org/wiki/List_of_programming_languages_by_type#Functional_languages
23 John M Chambers et al. Object-oriented programming, functional programming and r. Statistical Science, 29(2):167–180, 2014.
24 https://en.wikipedia.org/wiki/Tail_call#Through_trampolining
25 https://docs.python.org/3/howto/functional.html
26 https://docs.python.org/3/library/collections.html
27 https://docs.python.org/3.7/library/bisect.html
28 https://docs.pytest.org/en/latest