R and Python are general-purpose programming languages that can be used for writing software in a variety of domains. But for now, let us start by using them as basic calculators. The first step is to have them installed. R1 and Python2 can be downloaded from their official websites. In this book, I use R 3.5 and Python 3.7.
To use R/Python as basic calculators, let’s get familiar with the interactive mode. After the installation, we can type R or python in a terminal to invoke the interactive mode (whether other spellings such as r or Python also work depends on whether commands are case sensitive on your system). Since Python 2 is installed by default on many machines, we type python3.7 instead in order to avoid invoking Python 2.
The messages displayed by invoking the interactive mode depend on both the version of R/Python installed and the machine. Thus, you may see different messages on your local machine.
As the messages say, to quit R we can type q(). R then asks whether the workspace should be saved and offers 3 options. Since we just want to use R as a basic calculator, we quit without saving the workspace.
To quit Python, we can simply type exit().
Once we are inside the interactive mode, we can use R/Python as a calculator.
From the code snippet above, R works perfectly as a calculator. However, errors are raised when we call log(2) and exp(2) in Python. The error messages are self-explanatory: the log and exp functions don’t exist in the current Python environment. In fact, both functions are defined in the math module in Python. A module3 is a file consisting of Python code. When we invoke the interactive mode of Python, a few built-in modules are loaded into the current environment by default, but the math module is not among them. That explains why we got the NameError when we tried to use the functions defined in the math module. To resolve the issue, we should first load the functions we need by using the import statement as follows.
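As a minimal sketch of the fix described above, the snippet below imports the math module and then calls log and exp through it. The second form, which imports the two names directly, is just one alternative style and not necessarily the one used in the original snippet.

```python
import math  # load the math module from the standard library

# log and exp live in math, so we reach them through the module name
print(math.log(2))  # natural logarithm of 2
print(math.exp(2))  # e raised to the power of 2

# alternatively, import only the names we need
from math import log, exp

print(log(exp(1)))  # approximately 1.0
```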
In the previous section we have seen how to use R/Python as calculators. Now, let’s see how to write real programs. First, let’s define some variables.
Here, we defined 4 different variables: a, b, x and e. To get the type of each variable, we can use the function typeof() in R and type() in Python.
The type of x in R is called character, and in Python it is called str.
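To make this concrete on the Python side, here is a small sketch. The values assigned to a, b, x and e are assumptions chosen for illustration; the chapter's own snippet may use different ones.

```python
# hypothetical values, chosen only to show four different types
a = 2               # an integer
b = 5.0             # a floating point number
x = 'Hello World'   # a string
e = True            # a boolean

for value in (a, b, x, e):
    # type() returns the class of the object; __name__ is its short name
    print(type(value).__name__)
```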
We have seen two functions, log and exp, when we used R/Python as calculators. A function is a block of code which performs a specific task. A major purpose of wrapping a block of code into a function is to reuse the code.
It is simple to define functions in R/Python.
Here, we defined a function fun1 in R/Python. This function takes x as input and returns the square of x. When we call a function, we simply type the function name followed by the input arguments inside a pair of parentheses. It is worth noting that neither input nor output is required to define a function. For example, we can define a function fun2 that prints Hello World! without any input or output.
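A Python sketch of the two functions described above could look like the following. The names fun1 and fun2 come from the text; the bodies are my reconstruction from the description.

```python
def fun1(x):
    # take x as input and return its square
    return x * x


def fun2():
    # no input and no explicit return value
    print('Hello World!')


print(fun1(3))
fun2()
```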
One major difference between R and Python code is that Python code is structured with indentation. Each logical line of R/Python code belongs to a certain group. In R, we use {} to determine the grouping of statements. In Python, however, the leading whitespace (spaces and tabs) at the beginning of a logical line is used to compute the indentation level of the line, which in turn determines the grouping of statements. Let’s see what happens if we remove the leading whitespace in the Python function above.
We got an IndentationError because of the missing indentation.
Let’s go back to fun1 and have a closer look at return. In Python, if we want to return something we have to use the keyword return explicitly. return is a function in R, but in Python it is a statement rather than a function, which is why no parentheses follow return in Python. In R, return is not required even when we need to return something from the function. Instead, we can just put the value to return on the last line of the function. That being said, we can define fun1 in R as follows.
Sometimes we want to give a default value to an argument for a function, and both R and Python allow functions to have default values.
In Python we have to put the arguments with default values at the end, which is not required in R. However, from a readability perspective, it is always better to put them at the end. You may have noticed the error message above about positional arguments. In Python there are two types of arguments, i.e., positional arguments and keyword arguments. Simply speaking, a keyword argument is preceded by an identifier in the function call, e.g., base in the example above, and positional arguments are the non-keyword arguments.
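The sketch below illustrates default values together with positional and keyword arguments. The function log_with_base is a hypothetical helper written for this illustration; only the argument name base is taken from the text.

```python
import math


def log_with_base(x, base=math.e):
    # hypothetical helper: logarithm of x, with the base defaulting to e
    return math.log(x) / math.log(base)


print(log_with_base(8, 2))       # both arguments passed by position
print(log_with_base(8, base=2))  # base passed as a keyword argument
print(log_with_base(math.e))     # base falls back to its default value

# log_with_base(base=2, 8) would be a SyntaxError, because a positional
# argument cannot follow a keyword argument
```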
To implement a complex logic in R/Python, we may need control flows.
Let’s define a function to return the absolute value of input.
The code snippet above shows how to use if/else in R/Python. The subtle difference between R and Python is that the condition after if must be enclosed in parentheses in R, while the parentheses are optional in Python.
We can also put if after else. In Python, we use elif as a shortcut.
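Here is a short Python sketch of if, elif and else working together. The function sign_label is a made-up example and not part of the original text.

```python
def sign_label(x):
    # hypothetical function illustrating if / elif / else
    if x > 0:
        return 'positive'
    elif x < 0:
        return 'negative'
    else:
        return 'zero'


for value in (3, -2, 0):
    print(value, sign_label(value))
```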
Similar to the usage of if in R, we also have to use parentheses after the keyword for in R. But in Python there should be no parentheses after for.
There is something more interesting than the for loop itself in the snippets above.
In the R code, the expression 1:3 creates a vector with elements 1, 2 and 3. In the Python code, we use the range() function for the first time. Let’s have a look at their types.
The range() function returns a range type object, which represents an immutable sequence of numbers. It can take three arguments, i.e., range(start, stop, step), but start and step are both optional. It’s critical to keep in mind that the stop argument, which defines the upper limit of the sequence, is exclusive. That is why we have to pass 4 as the stop argument to range() in order to loop from 1 to 3. The step argument specifies how much to increase from one number to the next.
The default values of start and step are 0 and 1, respectively.
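A few calls to range() make these rules easier to see. This is only a sketch; list() is used here just to display the numbers that a range object produces.

```python
print(list(range(4)))         # start defaults to 0, stop=4 is exclusive
print(list(range(1, 4)))      # the numbers 1, 2 and 3
print(list(range(1, 10, 3)))  # step of 3: 1, 4 and 7
print(type(range(1, 4)))      # a range object, not a list

# loop from 1 to 3 by passing 4 as the stop argument
total = 0
for i in range(1, 4):
    total += i
print(total)
```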
You may have noticed that in Python we can write i+=1 to add 1 to i, which is not feasible in R by default. Both for loops and while loops can be nested.
Break/continue helps if we want to exit a for/while loop early or skip a specific iteration. In R, the keyword for continue is called next, in contrast to continue in Python. The difference between break and continue is that break exits the innermost loop (when there are nested loops, only the innermost loop is affected), while continue just skips the current iteration and continues the loop if it is not finished.
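The following Python sketch uses both keywords in one loop; the particular conditions are arbitrary and chosen only for illustration.

```python
kept = []
for i in range(1, 10):
    if i % 2 == 0:
        continue  # skip even numbers and move on to the next iteration
    if i > 6:
        break     # leave the loop at the first odd number above 6
    kept.append(i)

print(kept)
```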
In the previous sections, we haven’t seen much difference between R and Python. However, when it comes to the built-in data structures, there are some significant differences, as we will see in this section.
In R, we can use the function c() to create a vector. A vector is a sequence of elements of the same type. In Python, we can use [] to create a list, which is also a sequence of elements, but the elements in a list don’t need to have the same type. To get the number of elements in a vector in R, we use the function length(); to get the number of elements in a list in Python, we use the function len().
In the code snippet above, the first element of the variable z in R is coerced from 1 (numeric) to “1” (character) since the elements must have the same type.
To access a specific element of a vector or list, we can use []. In R, sequence types are indexed starting from one; in contrast, sequence types in Python are indexed starting from zero.
What if the index to access is out of boundary?
In Python, a negative index means indexing from the end of the list. Thus, x[-1] points to the last element and x[-2] points to the second-last element of the list. But R doesn’t support negative indexing in the same way as Python. Specifically, in R x[-index] returns a new vector with x[index] excluded.
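On the Python side, zero-based and negative indexing can be sketched as follows, together with what happens when the index is out of bounds. The list values are arbitrary.

```python
x = [1, 2, 3]

print(x[0])   # first element, since indexing starts from zero
print(x[-1])  # last element
print(x[-2])  # second-last element

try:
    x[3]      # valid indices are 0, 1 and 2
except IndexError as err:
    print('IndexError:', err)
```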
When we try to access an index that is out of bounds, Python throws an IndexError. The behavior of R when indexing out of bounds is more interesting. First, when we try to access x[0] in R we get numeric(0), whose length is 0. Since its length is 0, numeric(0) can be interpreted as an empty numeric vector. When we try to access x[length(x)+1] we get NA. In R, there are also NaN and NULL.
NaN means “Not a Number”; checking its type shows that it is a “double”. 0/0 results in NaN in R. NA in R generally represents missing values, and NULL represents a NULL (empty) object. To check if a value is NA, NaN or NULL, we can use is.na(), is.nan() or is.null(), respectively.
In Python, there is no built-in NA or NaN. The counterpart of NULL in Python is None. In Python, we can use the is keyword or == to check if a value is equal to None.
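A quick sketch of both checks is given below. Both work for None, although is is the conventional choice in Python code.

```python
x = None

print(x is None)  # identity check, the conventional way
print(x == None)  # equality check, which also evaluates to True here
print(type(x))    # None has its own type, NoneType
```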
From the code snippet above, we also notice that in R the boolean values are written as “TRUE/FALSE”, compared with “True/False” in Python. Although in R “TRUE/FALSE” can also be abbreviated as “T/F”, I don’t recommend using the abbreviation.
One interesting fact is that we can’t add a NULL to a vector in R, but it is feasible to add a None to a list in Python.
Sometimes we want to create a vector/list with replicated elements, for example, a vector/list with all elements equal to 0.
When we use the * operator to replicate a list, there is one caveat: if the element inside the list is mutable, then the replicated elements all point to the same memory address. As a consequence, if one element is mutated the other elements are also affected.
How do we get a list with replicated elements that point to different memory addresses?
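The sketch below shows the caveat and one possible way around it. Building each element separately in a list comprehension is my suggestion here and may differ from the approach in the original snippet.

```python
# replicating with * copies the reference, not the inner list
shared = [[0]] * 3
shared[0].append(1)
print(shared)       # all three elements changed together

# building each inner list separately gives independent objects
independent = [[0] for _ in range(3)]
independent[0].append(1)
print(independent)  # only the first element changed
```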
Besides accessing a specific element of a vector/list, we may also need to do slicing, i.e., to select a subset of the vector/list. There are two basic approaches to slicing: integer-based slicing and condition-based slicing.
The code snippet above uses the hash character # for comments in both R and Python. Everything after # on the same line is treated as a comment (not executable). In the R code, we also used the function seq() to create a vector. When I see a function that I haven’t seen before, I either google it or use the built-in help mechanism. Specifically, use ? in R and help() in Python.
Condition-based slicing means to select a subset of the elements which satisfy certain conditions. In R, it is quite straightforward by using a boolean vector whose length is the same as the vector to slice.
The condition-based slicing in Python is quite different from that in R. The prerequisite is list comprehension which provides a concise way to create new lists in Python. For example, let’s create a list of squares of another list.
We can also use an if clause in a list comprehension to filter a list, which achieves list slicing.
It is also common to use if/else with list comprehension to achieve more complex operations. For example, given a list x, let’s create a new list y so that the non-negative elements in x are squared and the negative elements are replaced by 0s.
The example above shows the power of list comprehension. To use if with list comprehension, the if clause should be placed at the end, after the for clause; but to use if/else with list comprehension, the if/else expression should be placed before the for clause.
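The three forms of list comprehension discussed above can be sketched side by side. The input list is an arbitrary example.

```python
x = [1, -2, 3, -4]

# plain comprehension: square every element
squares = [e ** 2 for e in x]

# trailing if: keep only the elements that satisfy the condition
positives = [e for e in x if e > 0]

# if/else before for: square non-negative elements, replace negatives by 0
y = [e ** 2 if e >= 0 else 0 for e in x]

print(squares, positives, y)
```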
We can also modify the value of an element in a vector/list variable.
Two or multiple vectors/lists can be concatenated easily.
As the list structure in Python is mutable, there are many things we can do with list.
I like the list structure in Python much more than the vector structure in R. A list in Python has many more useful features, which can be found in the official Python documentation5.
Array is one of the most important data structures in scientific programming. In R, there is also an object type “matrix”, but in my own experience we can almost ignore its existence and use array instead. We can certainly use a list as an array in Python, but many linear algebra operations are not supported for the list type. Fortunately, there is an off-the-shelf Python package, numpy.
You may have noticed that the results of the R code snippet and the Python code snippet are different. The reason is that in R the conversion from a vector to an array is by column, but in numpy the reshape from a list to a 2D numpy.array is by row. There are two ways to reshape a list to a 2D numpy.array by column.
To learn more about numpy, the official website6 has great documentation/tutorials.
The term broadcasting describes how arrays with different shapes are handled during arithmetic operations. A simple example of broadcasting is given below.
However, the broadcasting rules in R and Python are not exactly the same.
From the R code, we see that broadcasting in R works like recycling along the columns. In Python, when the two arrays have different numbers of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading side. According to this rule, when we do x * y, the shape of x is (3, 2) but the shape of y is (3,). Thus, the shape of y is padded to (1, 3), which explains what happens when we evaluate x * y.
Yes, in R there is also an object type called list. The major difference between a vector and a list in R is that a list can contain elements of different types. A list in R supports integer-based access using [[]] (compared to [] for a vector).
A list in R can be named and supports access by name via either [[]] or the $ operator. But a vector in R can also be named and supports access by name.
However, elements of a list in Python can’t be named as in R. If we need to access elements by name in Python, we can use the dictionary structure. If you have used Java before, you may consider a dictionary in Python as the counterpart of HashMap in Java. Essentially, a dictionary in Python is a collection of key:value pairs.
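As a minimal sketch of access by name, the dictionary below stores a few key:value pairs. The keys and values are invented for illustration.

```python
# hypothetical record stored as key:value pairs
employee = {'name': 'Alice', 'age': 30}

print(employee['name'])         # access a value by its key
employee['title'] = 'Engineer'  # add a new key:value pair
age = employee.pop('age')       # remove a key and get its value back

print(age)
print(sorted(employee.keys()))
```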
Unlike a dictionary in Python, a list in R doesn’t support the pop() operation. Thus, in order to modify a list in R, a new one would be created explicitly or implicitly.
data.frame is a built-in type in R for data manipulation. In Python, there is no such built-in data structure since Python is a more general-purpose programming language. The solution for data.frame in Python is the pandas7 module.
Before we dive into data.frame, you may be curious why we need it. In other words, why can’t we just use vector, list, array/matrix and dictionary for all data manipulation tasks? In fact we could: data.frame is not a must-have feature for most ETL (extraction, transformation and load) operations. But data.frame provides a very intuitive way for us to understand a structured data set. A data.frame is usually flat with 2 dimensions, i.e., row and column. The row dimension is across multiple observations and the column dimension is across multiple attributes/features. If you are familiar with relational databases, a data.frame can be viewed as a table.
Let’s see an example of using data.frame to represent employees’ information in a company.
There are quite a few ways to create a data.frame. The most commonly used one is to create the data.frame object from an array/matrix. We may also need to convert a numeric data.frame to an array/matrix.
In general, operations on an array/matrix are much faster than those on a data.frame. In R, we may use the built-in function data.matrix to convert a data.frame to an array/matrix. In Python, we could use the function asarray in the numpy module.
Although data.frame is a built-in type, it is not very efficient for many operations. I would suggest using data.table8 whenever possible. dplyr9 is also a very popular package in R for data manipulation. Many good online resources are available for learning data.table and pandas.
We have talked about variables and functions so far. When a function has a name, its name is also a valid variable. After all, what is a variable?
In mathematics, a variable is a symbol that represents an element, and we do not care whether we conceptualize a variable in our mind or write it down on paper. In programming, however, a variable is not only a symbol. We have to understand that a variable is a name given to a memory location in a computer system. When we run x=2 in R or Python, the value 2 is stored somewhere in memory, and the variable (name) x points to this memory address. If we further run y=x, the variable y points to the same memory location as x. What if we run x=3? It doesn’t modify the memory which stores the value 2. Instead, the value 3 is now stored somewhere else in memory, and this memory location gets the name x. The variable y is not affected at all, and neither is the memory location it points to.
Almost everything in R or Python is an object, including these data structures we introduced in previous sections. Mutability is a property of objects, not variables, because a variable is just a name.
A list in Python is mutable, meaning that we can change the elements stored in the list object without copying the list object from one memory location to another. We can use the id function in Python to check the memory location of a variable. In the code below, we modify the first element of the list object named x. Since a Python list is mutable, the memory address of the list doesn’t change.
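The check described above can be sketched as follows. The actual value returned by id() varies between runs, so the snippet only compares the value before and after the modification.

```python
x = [1, 2, 3]
address_before = id(x)

x[0] = 10  # modify the first element in place
address_after = id(x)

print(x)
print(address_before == address_after)  # same list object as before
```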
Is there any immutable data structure in Python? Yes, for example a tuple, which contains a sequence of elements, is immutable. Element access and slicing of a tuple follow the same rules as for a list in Python.
If two Python variables point to the same memory, then when we modify the memory via one variable the other is also affected, as we expect (see the example below).
In contrast, the mutability of vector in R is more complex and sometimes confusing. First, let’s see the behavior when there is a single name given to the vector object stored in memory.
It is clear in this case the vector object is mutable since the memory address doesn’t change after the modification. What if there is an additional name given to the memory?
Before the modification, both variables a and b point to the same vector object in memory. But surprisingly, after the modification the memory address of variable a has changed, which is called “copy on modify” in R. Because of this behavior, the modification of a doesn’t affect the object stored at the old memory address, and thus the vector object is effectively immutable in this case. The mutability of an R list is similar to that of an R vector.
Most functions/methods in R and Python take some variables as arguments. What happens when we pass variables into a function?
In Python, the variable, i.e., the name of the object, is passed into a function. If the variable points to an immutable object, any modification to the variable, i.e., the name, doesn’t persist. However, when the variable points to a mutable object, the modification of the object stored in memory persists. Let’s see the examples below.
We see that the object is passed into the function by its name. If the object is immutable, a new copy is created in memory when any modification is made to the original object. When the object is mutable, no new copy is made, and thus the change persists outside the function.
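A small sketch contrasting the two cases in Python is shown below. The function names are hypothetical and were made up for this example.

```python
def try_to_change_int(n):
    # rebinding the local name does not touch the caller's object
    n = n + 1


def append_item(items):
    # mutating the list object is visible to the caller
    items.append(4)


count = 1
try_to_change_int(count)
print(count)   # the integer is unchanged

values = [1, 2, 3]
append_item(values)
print(values)  # the list has gained an element
```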
In R, the passed object is always copied on modification inside the function, and thus no modification can be made to the original object in memory.
People may argue that R functions are not as flexible as Python functions. However, it makes more sense to do functional programming in R since we usually can’t modify an object passed into a function.
What is the scope of a variable and why does it matter? Let’s first have a look at the code snippets below.
The results of the code above seem strange until we know the concept of variable scope. Inside a function, a variable may refer to a function argument/parameter, or it may be formally declared inside the function, in which case it is called a local variable. But in the code above, x is neither a function argument nor a local variable. How does the print() function know what the identifier x points to?
The scope of a variable determines where the variable is available/accessible (can be referenced). Both R and Python apply lexical/static scoping for variables, which sets the scope of a variable based on the structure of the program. In static scoping, when an ‘unknown’ variable is referenced, the function tries to find it in the most closely enclosing block. That explains how the print() function could find the variable x.
In the statement x=x+1 in the R code above, the first x is a local variable created by the = operator; the second x is referenced inside the function, so the static scoping rule applies. As a result, a local variable x equal to 2 is created, which is independent of the x outside the function var_func_2(). In Python, however, when a variable is assigned a value anywhere in a function, it is treated as a local variable throughout that function, which explains the UnboundLocalError.
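The Python behavior can be reproduced with a sketch like the one below. The function names are hypothetical and do not match the var_func_* names used in the original snippets.

```python
x = 1


def read_outer():
    # x is not assigned here, so it is looked up in the enclosing scope
    return x + 1


def assign_local():
    # x is assigned in this function, so Python treats it as local;
    # reading it before the assignment completes raises an error
    x = x + 1
    return x


print(read_outer())
try:
    assign_local()
except UnboundLocalError as err:
    print('UnboundLocalError:', err)
```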
Is it possible to change a variable that is declared outside a function from inside the function, without passing it as an argument? Based on the static scoping rule alone, it’s impossible. But there are workarounds in both R and Python. In R, we need the help of environments; in Python, we can use the keyword global.
So what is an environment in R? An environment is a place where objects are stored. When we start an interactive R session, an environment named .GlobalEnv is created automatically. We can also use the function environment() to get the current environment. The ls() function can take an environment as its argument to list all objects inside the environment.
The above code shows that each function has its own environment containing all function arguments and local variables declared inside the function. In order to change a variable declared outside a function, we need access to the environment enclosing that variable. In R, the function parent.env(e) returns the parent environment of the given environment e. Using this function, we are able to change the value of x declared in .GlobalEnv from inside a function which is also declared in .GlobalEnv. The global keyword in Python works in a totally different way, which is simple but less flexible.
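For the Python side, a minimal sketch of the global keyword follows. The variable counter and the function increment are invented for this illustration.

```python
counter = 0


def increment():
    # declare that counter refers to the module-level variable
    global counter
    counter += 1


increment()
increment()
print(counter)
```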
I seldom, if ever, use the global keyword in Python. But environments in R can be very handy on some occasions. In R, an environment can be used as a purely mutable version of the list data structure.
Actually, an object of an R6 class type is also an environment.
In Python, we can assign values to multiple variables in one line.
Even though there are no parentheses enclosing 1, 2 after the = operator in the left snippet above, a tuple is created first, and then the tuple is unpacked and assigned to x, y. Such a mechanism doesn’t exist in R, but we can define our own multiple assignment operator with the help of environments.
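The tuple packing and unpacking described above can be sketched in a few lines of Python. The swap at the end is an extra illustration of the same mechanism.

```python
x, y = 1, 2          # the right-hand side is packed into a tuple, then unpacked
pair = 1, 2          # without unpacking, we simply get the tuple
print(type(pair))

x, y = y, x          # the same mechanism lets us swap two variables
print(x, y)
```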
Before going into the script deeper, first let’s see the usage of the multiple assignment operator we defined.
In the %=% operator defined above, we used two functions, substitute and deparse, which are very powerful but less known to R novices. To better understand these functions as well as some other lesser-known R functions, the Rchaeology19 tutorial is worth reading.
It is also interesting to see that we defined the function recursive_assign inside the %=% function. Both R and Python support the concept of first-class functions. More specifically, a function in R/Python is an object, which can be
The essential idea behind the recursive_assign function is depth-first search (DFS), which is a fundamental graph traversal algorithm20. In the context of the recursive_assign function, we use DFS to traverse the parse tree of the left argument created by calling substitute(left).
All the code we have written so far follows the procedural programming paradigm10. We can also do functional programming (FP) and object-oriented programming (OOP) in R/Python. In this section, let’s focus on OOP in R/Python.
Class is the key concept in OOP. In R there are two commonly used built-in systems to define classes, i.e., S3 and S4. In addition, there is an external package R611 which defines R6 classes. S3 is a lightweight system, but its style is quite different from OOP in many other programming languages. The S4 system follows the principles of modern object-oriented programming much better than S3. However, using S4 classes is quite tedious. I will skip S3/S4 and introduce R6, which is closer to the class in Python.
Let’s build a class in R/Python to represent complex numbers.
By overriding the print function in the R6 class, we can have the object printed in the format real+imag j. To achieve the same effect in Python, we override the method __repr__. In Python, the functions defined in classes are called methods, and overriding a method means changing the implementation of a method provided by one of its ancestors. To understand the concept of ancestors in OOP, one needs to understand the concept of inheritance12.
You may be curious about the double underscores surrounding method names such as __init__ and __repr__. These methods are known as magic methods13. Magic methods can be very handy if we use them in suitable cases. For example, we can use the magic method __add__ to implement the + operator for the Complex class we defined above.
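To illustrate the magic methods mentioned above, here is a compact sketch of a Complex class. It is my own reconstruction and not the chapter's original listing; the attribute names real and imag are assumptions based on the printing format described in the text.

```python
class Complex:
    def __init__(self, real, imag):
        # store the real and imaginary parts (attribute names are assumed)
        self.real = real
        self.imag = imag

    def __repr__(self):
        # controls how the object is displayed, e.g. 1+2j
        return '{0}+{1}j'.format(self.real, self.imag)

    def __add__(self, other):
        # called when two Complex objects are combined with +
        return Complex(self.real + other.real, self.imag + other.imag)


z = Complex(1, 2) + Complex(3, 4)
print(z)
```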
In the definition of the magic method __repr__ in the Python code, the format method of the str object14 is used.
You may wonder why we use the set_real/imag methods in the Python class. Actually, they are not necessary. It is fine to modify the variables of a class directly. For example, in the code below we remove these methods, which simplifies the definition of the Complex class.
We can also implement the + operator for the Complex class in R, like what we have done in Python.
The most interesting part of the code above is `+.Complex`. First, why do we use `` to quote the function name? Before getting into this question, let’s have a look at Python 3’s variable naming rules15.
Within the ASCII range (U+0001..U+007F), the valid characters for identifiers (also referred to as names) are the same as in Python 2.x: the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9.
According to the rule, we can’t declare a variable with the name 2x. Compared with Python, in R we can also use . in variable names16. In addition, there is a workaround to use invalid variable names in R with the help of ``.
Now it is clear that `` is used in `+.Complex` to define a function with an otherwise invalid name. Placing .Complex after + is related to S3 method dispatch, which will not be discussed here.
In some programming languages the members (variables or methods) of a class can be declared with access modifiers, which specify the accessibility or scope of a member. In Python, class members don’t have explicit access modifiers, but it is still possible to specify the accessibility. By default, a class member can be accessed inside or outside the class definition. If the name of the member starts with a single underscore, the member should not be accessed outside the class definition by convention, but this is not enforced. If the name of the member starts with a double underscore, the member name is mangled and thus the member can’t be accessed by its original name outside the class definition. But inside the class definition these members can always be accessed. Let’s see the example below.
In this example, an error is thrown when we try to access __y or __func2 outside the class definition. But they are reachable within the class definition; such fields are usually called private fields.
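The naming conventions can be sketched with a small class. The class Example and its members are hypothetical and differ from the __y and __func2 members of the original snippet. The last line shows that mangling only renames the member; it does not make it truly inaccessible.

```python
class Example:
    def __init__(self):
        self.x = 1     # public member
        self._y = 2    # private by convention only
        self.__z = 3   # name-mangled to _Example__z

    def read_private(self):
        # inside the class, the original name works
        return self.__z


obj = Example()
print(obj.x, obj._y)
print(obj.read_private())

try:
    obj.__z            # the original name is not visible outside the class
except AttributeError as err:
    print('AttributeError:', err)

print(obj._Example__z)  # the mangled name is still reachable
```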
There are some items that I haven’t discussed so far, which are also important in order to master R/Python.
The install.packages() function in R and pip17 in Python are used to install packages/modules.
A virtual environment is a tool to manage dependencies in Python. There are different ways to create virtual environments in Python, but I suggest using the venv module shipped with Python 3. Unfortunately, as far as I know there is nothing like a real virtual environment in R, although there are quite a few package management tools/packages.
If you have used R before, you have probably heard the advice18 to use <- rather than = for value assignment. However, I always use = for value assignment. Let’s see an example where <- makes a difference when we do value assignment.
When the assignment is made inside an argument of a function call, <- and = have different effects. But it is not recommended to use <- inside an argument of a function call.
1 https://www.r-project.org
2 https://www.python.org
3 https://docs.python.org/3/tutorial/modules.html
4 https://cran.r-project.org/doc/manuals/r-devel/NEWS.html
5 https://docs.python.org/3/tutorial/datastructures.html
6 http://www.numpy.org
7 https://pandas.pydata.org/
8 https://cran.r-project.org/web/packages/data.table/index.html
9 https://dplyr.tidyverse.org
10 https://en.wikipedia.org/wiki/Comparison_of_programming_paradigms
11 https://cran.r-project.org/web/packages/R6/index.html
12 https://en.wikipedia.org/wiki/Inheritance_(object-oriented_programming)
13 https://rszalski.github.io/magicmethods
14 https://docs.python.org/3.7/library/string.html
15 https://docs.python.org/3.3/reference/lexical_analysis.html
16 https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Identifiers
17 https://packaging.python.org/tutorials/installing-packages
18 https://google.github.io/styleguide/Rguide.xml
19 https://cran.r-project.org/web/packages/rockchalk/vignettes/Rchaeology.pdf
20 https://en.wikipedia.org/wiki/Depth-first_search