Another Book on Data Science - Introduction to R/Python Programming

Sections in this Chapter:

Calculator
Variable & type
Functions
Control flows
Some built-in data structures
Revisit of variables
Object-oriented programming (OOP) in R/Python
Miscellaneous

Calculator

R and Python are general-purpose programming languages that can be used to write software in a variety of domains. For now, let us start by using them as basic calculators. The first step is to install them. R¹ and Python² can be downloaded from their official websites. In this book, I use R 3.5 and Python 3.7.

To use R/Python as basic calculators, let’s get familiar with the interactive mode. After the installation, we can type R or python in a terminal to invoke the interactive mode. (On a case-insensitive file system, such as the macOS default, r and Python may also work, but command names are case sensitive on most other systems.) Since Python 2 is installed by default on many machines, we type python3.7 instead to avoid invoking Python 2.

R

~ $ R

R version 3.5.1 (2018-07-02) -- "Feather Spray"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin15.6.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

Python

~ $ python3.7
Python 3.7.1 (default, Nov  6 2018, 18:45:35)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

The messages displayed by invoking the interactive mode depend on both the version of R/Python installed and the machine. Thus, you may see different messages on your local machine.

As the startup message says, we can type q() to quit R. R then asks whether the workspace should be saved, offering three options: y (yes), n (no) and c (cancel). Since we only want to use R as a basic calculator, we quit without saving the workspace.

R

> q()
Save workspace image? [y/n/c]: n
~ $

To quit Python, we can simply type exit().

Once we are inside the interactive mode, we can use R/Python as a calculator.

R

> 1+1
[1] 2
> 2*3+5
[1] 11
> log(2)
[1] 0.6931472
> exp(0)
[1] 1

Python

>>> 1+1
2
>>> 2*3+5
11
>>> log(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'log' is not defined
>>> exp(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'exp' is not defined

From the code snippet above, R works perfectly as a calculator. However, errors are raised when we call log(2) and exp(0) in Python. The error messages are self-explanatory: the functions log and exp don’t exist in the current Python environment. In fact, they are defined in the math module. A module³ is a file consisting of Python code. When we invoke the interactive mode of Python, only a few built-in modules are loaded into the current environment by default, and the math module is not one of them. That explains why we got the NameError when we tried to use functions defined in the math module. To resolve the issue, we first load the functions we need with the import statement, as follows.

Python

>>> from math import log,exp
>>> log(2)
0.6931471805599453
>>> exp(0)
1.0
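
The import statement comes in two common forms, which the short sketch below compares. It uses only the standard math module; the variable names are illustrative and not from the text.

```python
import math                  # bind the module name; functions are reached as math.<name>
from math import log, exp    # bind selected functions directly in the current namespace

value_via_module = math.log(2)   # qualified call
value_via_name = log(2)          # unqualified call to the same function

print(value_via_module == value_via_name)
print(math.log is log)           # both names refer to one function object
print(exp(0))
```

The qualified form keeps the origin of each function visible, while the second form is shorter to type in an interactive session.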

Variable & Type

In the previous section we have seen how to use R/Python as calculators. Now, let’s see how to write real programs. First, let’s define some variables.

R

> a=2
> b=5.0
> x='hello world'
> a
[1] 2
> b
[1] 5
> x
[1] "hello world"
> e=a*2+b
> e
[1] 9

Python

>>> a=2
>>> b=5.0
>>> x='hello world'
>>> a
2
>>> b
5.0
>>> x
'hello world'
>>> e=a*2+b
>>> e
9.0

Here, we defined 4 different variables a, b, x, e. To get the type of each variable, we can utilize the function typeof() in R and type() in Python, respectively.

R

> typeof(x)
[1] "character"
> typeof(e)
[1] "double"

Python

>>> type(x)
<class 'str'>
>>> type(e)
<class 'float'>

The type of x is called character in R and str in Python. Similarly, the type of e is called double in R and float in Python.

Functions

We have seen two functions log and exp when we use R/Python as calculators. A function is a block of code which performs a specific task. A major purpose of wrapping a block of code into a function is to reuse the code.

It is simple to define functions in R/Python.

R

> fun1=function(x){return(x*x)}
> fun1
function(x){return(x*x)}
> fun1(2)
[1] 4

Python

>>> def fun1(x):
...   return x*x # note the indentation
...
>>> fun1(2)
4

Here, we defined a function fun1 in R/Python. This function takes x as input and returns the square of x. To call a function, we simply type the function name followed by the input arguments inside a pair of parentheses. It is worth noting that a function requires neither input nor output. For example, the function fun2 defined later in this section prints Hello World! without taking any input or explicitly returning anything.

One major difference between R and Python code is that Python code is structured by indentation. Each logical line of R/Python code belongs to a certain group. In R, we use {} to determine the grouping of statements. In Python, however, the leading whitespace (spaces and tabs) at the beginning of a logical line is used to compute the indentation level of the line, which in turn determines the grouping of statements. Let’s see what happens if we remove the leading whitespace in the Python function above.

Python

>>> def fun1(x):
... return x*x # note the indentation
  File "<stdin>", line 2
    return x*x # note the indentation
         ^
IndentationError: expected an indented block

We got an IndentationError because of the missing indentation. Now let’s define the function fun2 mentioned earlier, which takes no input.

R

> fun2=function(){print('Hello World!')}
> fun2()
[1] "Hello World!"

Python

>>> def fun2(): print('Hello World!')
...
>>> fun2()
Hello World!

Let’s go back to fun1 and have a closer look at return. In Python, if we want to return something we have to use the keyword return explicitly. return is a function in R but a keyword in Python, which is why no parentheses follow return in Python. In R, return is not required even when we need to return something from the function: a function returns the value of the last expression it evaluates, so we can simply put the value to return on the last line of the function. That being said, we can define fun1 as follows.

R

> fun1=function(x){x*x}

Sometimes we want to give a default value to an argument for a function, and both R and Python allow functions to have default values.

R

> log_fun = function(x, base=2){
+ return(log(x, base))
+ }
> log_fun(5, base=2)
[1] 2.321928
> log_fun(5, 2)
[1] 2.321928
> log_fun(base=2, 5)
[1] 2.321928

Python

>>> import math # log_fun below calls math.log, so the module itself must be imported
>>> def log_fun(x, base=2):
...   return math.log(x, base)
...
>>> log_fun(5,2)
2.321928094887362
>>> log_fun(5, base=2)
2.321928094887362
>>> log_fun(base=2, 5)
  File "<stdin>", line 1
SyntaxError: positional argument follows keyword argument

In Python, parameters with default values must be placed after the parameters without default values in a function definition, which is not required in R. From a readability perspective, however, it is better to always put them at the end. You may have noticed the error message above about a positional argument. In Python there are two types of arguments in a function call, i.e., positional arguments and keyword arguments. Simply speaking, a keyword argument is preceded by an identifier and an equals sign, e.g., base=2 in the example above, and positional arguments are the non-keyword arguments. In a call, all positional arguments must come before any keyword argument.
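
The valid ways of calling a function with a default value can be summarized in a small sketch. It reuses log_fun from the example above; the variable names holding the results are illustrative.

```python
import math

def log_fun(x, base=2):
    """Return the logarithm of x; base defaults to 2."""
    return math.log(x, base)

default_base = log_fun(8)            # base falls back to its default value 2
positional = log_fun(8, 2)           # both arguments passed by position
keyword = log_fun(8, base=2)         # base passed by keyword, after the positional x
all_keywords = log_fun(base=2, x=8)  # keyword arguments may appear in any order

print(default_base, positional, keyword, all_keywords)
```

All four calls compute the same value. The order only becomes free once every argument is passed by keyword, as in the last call.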

Control flows

To implement complex logic in R/Python, we may need control flows.

If/else

Let’s define a function to return the absolute value of input.

R

> fun3=function(x){
+   if (x>=0){
+     return(x)}
+   else{
+     return(-x)}
+ }
> fun3(2.5)
[1] 2.5
> fun3(-2.5)
[1] 2.5

Python

>>> def fun3(x):
...   if x>=0:
...     return x
...   else:
...     return -x
...
>>> fun3(2.5)
2.5
>>> fun3(-2.5)
2.5

The code snippet above shows how to use if/else in R/Python. A subtle difference between R and Python is that the condition after if must be enclosed in parentheses in R, whereas the parentheses are optional in Python.

We can also put if after else to test a further condition. In Python, the keyword elif is used as a shortcut for else if.

R

> fun4=function(x){
+   if (x==0){
+     print('zero')}
+   else if (x>0){
+     print('positive')}
+   else{
+     print('negative')}
+ }
> fun4(0)
[1] "zero"
> fun4(1)
[1] "positive"
> fun4(-1)
[1] "negative"

Python

>>> def fun4(x):
...   if x==0:
...     print('zero')
...   elif x>0:
...     print('positive')
...   else:
...     print('negative')
...
>>> fun4(0)
zero
>>> fun4(1)
positive
>>> fun4(-1)
negative

For loop

Similar to the usage of if in R, we also have to use parentheses after the keyword for in R. In Python, the loop header is written without parentheses.

R

> for (i in 1:3){print(i)}
[1] 1
[1] 2
[1] 3

Python

>>> for i in range(1,4):print(i)
...
1
2
3

There is something more interesting than the for loop itself in the snippets above. In the R code, the expression 1:3 creates a vector with the elements 1, 2 and 3. In the Python code, we use the range() function for the first time. Let’s have a look at their types.

R

> typeof(1:3)
[1] "integer"

Python

>>> type(range(1,4))
<class 'range'>

The range() function returns a range object, which represents an immutable sequence of numbers. range() can take three arguments, i.e., range(start, stop, step), of which start and step are optional. It’s critical to keep in mind that the stop argument, which defines the upper limit of the sequence, is exclusive. That is why we have to pass 4 as the stop argument in order to loop from 1 to 3. The step argument specifies the increment from one number to the next. The default values of start and step are 0 and 1, respectively.
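
A few calls make the roles of start, stop and step concrete. This is a minimal sketch; list() is used only to display the numbers a range object produces, and the variable names are illustrative.

```python
# range(stop): start defaults to 0 and step defaults to 1
only_stop = list(range(4))
# range(start, stop): stop is exclusive
start_stop = list(range(1, 4))
# range(start, stop, step): every third number from 1 up to, but not including, 10
with_step = list(range(1, 10, 3))
# a negative step counts down; stop is still exclusive
counting_down = list(range(3, 0, -1))

print(only_stop)       # [0, 1, 2, 3]
print(start_stop)      # [1, 2, 3]
print(with_step)       # [1, 4, 7]
print(counting_down)   # [3, 2, 1]
```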

While loop

R

> i=1
> while (i<=3){
+   print(i)
+   i=i+1
+ }
[1] 1
[1] 2
[1] 3

Python

>>> i=1
>>> while i<=3:
...   print(i)
...   i+=1
...
1
2
3

You may have noticed that in Python we can write i+=1 to add 1 to i. R has no such operator, so we write i=i+1 instead. Both for loops and while loops can be nested.

Break/continue

Break/continue helps if we want to exit a for/while loop early, or to skip a specific iteration. In R, the keyword for continue is called next, in contrast to continue in Python. The difference between break and continue is that break exits the innermost loop (when loops are nested, only the innermost loop is affected), while continue skips the rest of the current iteration and moves on to the next one, if any.

R

> for (i in 1:3){
+   print(i)
+   if (i==1) break
+ }
[1] 1
> for (i in 1:3){
+   if (i==2){next}
+   print(i)
+ }
[1] 1
[1] 3

Python

>>> for i in range(1,4):
...   print(i)
...   if i==1: break
...
1
>>> for i in range(1,4):
...   if i==2: continue
...   print(i)
...
1
3
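
The remark that break only affects the innermost loop can be checked with two nested loops. The sketch below is illustrative; the list visited is a hypothetical name used to record which (i, j) pairs are reached.

```python
visited = []
for i in range(1, 4):          # outer loop
    for j in range(1, 4):      # inner loop
        if j == 2:
            break              # leaves only the inner loop
        visited.append((i, j))

print(visited)  # [(1, 1), (2, 1), (3, 1)]
```

The outer loop still runs for every value of i, while each inner loop stops as soon as j reaches 2.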

Some built-in Data Structures

In the previous sections, we haven’t seen much difference between R and Python. Regarding the built-in data structures, however, there are some significant differences, which we will see in this section.

vector in R and list in Python

In R, we can use the function c() to create a vector. A vector is a sequence of elements of the same type. In Python, we can use [] to create a list, which is also a sequence of elements, but the elements in a list don’t need to have the same type. To get the number of elements in a vector in R, we use the function length(); to get the number of elements in a list in Python, we use the function len().

R

> x=c(1,2,5,6)
> y=c('hello','world','!')
> x
[1] 1 2 5 6
> y
[1] "hello" "world" "!"
> length(x)
[1] 4
> z=c(1,'hello')
> z
[1] "1"     "hello"

Python

>>> x=[1,2,5,6]
>>> y=['hello','world','!']
>>> x
[1, 2, 5, 6]
>>> y
['hello', 'world', '!']
>>> len(x)
4
>>> z=[1,'hello']
>>> z
[1, 'hello']

In the code snippet above, the first element in the variable z in R is coerced from 1 (numeric) to “1” (character) since the elements must have the same type.

To access a specific element of a vector or list, we can use []. In R, sequence types are indexed starting from 1; in contrast, sequence types in Python are indexed starting from 0.

R

> x=c(1,2,5,6)
> x[1]
[1] 1

Python

>>> x=[1,2,5,6]
>>> x[1]
2
>>> x[0]
1

What if the index is out of bounds?

R

> x=c(1,2,5,6)
> x[-1]
[1] 2 5 6
> x[0]
numeric(0)
> x[length(x)+1]
[1] NA
> length(numeric(0))
[1] 0
> length(NA)
[1] 1

Python

>>> x=[1,2,5,6]
>>> x[-1]
6
>>> x[len(x)+1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range

In Python, negative index number means indexing from the end of the list. Thus, x[-1] points to the last element and x[-2] points to the second-last element of the list. But R doesn’t support indexing with negative number in the same way as Python. Specifically, in R x[-index] returns a new vector with x[index] excluded.

When we try to access an index that is out of bounds, Python throws an IndexError. The behavior of R in this situation is more interesting. First, when we try to access x[0] in R we get numeric(0), whose length is 0; it can be interpreted as an empty numeric vector. When we try to access x[length(x)+1] we get an NA. Besides NA, R also has NaN and NULL.

NaN means “Not a Number”; its type is “double”, and 0/0 results in NaN in R. NA generally represents a missing value, and NULL represents the NULL (empty) object. To check whether a value is NA, NaN or NULL, we can use is.na(), is.nan() or is.null(), respectively.

R

> typeof(NA)
[1] "logical"
> typeof(NaN)
[1] "double"
> typeof(NULL)
[1] "NULL"
> is.na(NA)
[1] TRUE
> is.null(NULL)
[1] TRUE
> is.nan(NaN)
[1] TRUE

Python

>>> type(None)
<class 'NoneType'>
>>> None is None
True
>>> 1 == None
False

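In Python, there is no built-in NA. NaN is not a keyword either, but it is available as a floating-point value, e.g., float('nan') or math.nan. The counterpart of NULL in Python is None. To check whether a value is None, the is keyword is preferred, although == also works.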
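
The checks for None and for NaN in Python can be sketched as follows. The example uses only the standard math module; the variable names are illustrative.

```python
import math

missing = None
not_a_number = float('nan')    # the same kind of value as math.nan

is_none = missing is None                        # identity test, the idiomatic check for None
equals_itself = not_a_number == not_a_number     # NaN never compares equal, even to itself
detected = math.isnan(not_a_number)              # math.isnan() is the reliable NaN check

print(is_none, equals_itself, detected)  # True False True
```

Because NaN is not equal to itself, a comparison with == cannot be used to detect it, which is why math.isnan() plays the role of is.nan() in R.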

From the code snippets above, we also notice that in R the boolean values are written as “TRUE/FALSE”, compared with “True/False” in Python. Although “TRUE/FALSE” can be abbreviated as “T/F” in R, I don’t recommend using the abbreviation, because T and F are ordinary variables that can be reassigned.

Interestingly, we can’t add a NULL to a vector in R (it is silently dropped), but it is feasible to add a None to a list in Python.

R

> x=c(1,NA,NaN,NULL)
> x
[1]   1  NA NaN
> length(x)
[1] 3

Python

>>> x=[1,None]
>>> x
[1, None]
>>> len(x)
2

Sometimes we want to create a vector/list with replicated elements, for example, a vector/list with all elements equal to 0.

R

> x=rep(0, 10)
> x
 [1] 0 0 0 0 0 0 0 0 0 0
> y=rep(c(0,1), 5)
> y
 [1] 0 1 0 1 0 1 0 1 0 1

Python

>>> x=[0]*10
>>> x
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>>> y=[0, 1]*5
>>> y
[0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

When we use the * operator to replicate a list, there is one caveat: the replicated elements all refer to the same object rather than to independent copies. This matters when the element is mutable, because mutating it through one position is then visible at every other position.

Python

>>> x=[0] # x is a list which is mutable
>>> y=[x]*5 # each element in y points to x
>>> y
[[0], [0], [0], [0], [0]]
>>> y[2]=2 # we point y[2] to 2 but x is not mutated
>>> y
[[0], [0], 2, [0], [0]]
>>> y[1][0]=-1 # we mutate x by changing y[1][0] from 0 to -1
>>> y
[[-1], [-1], 2, [-1], [-1]]
>>> x
[-1]

How can we get a list of replicated elements that are stored at different memory addresses?

Python

>>> x=[0]
>>> y=[x[:] for _ in range(5)] # [:] makes a copy of the list x; another solution is [list(x) for _ in range(5)]
>>> y
[[0], [0], [0], [0], [0]]
>>> y[0][0]=2
>>> y
[[2], [0], [0], [0], [0]]

Besides accessing a specific element of a vector/list, we may also need to do slicing, i.e., to select a subset of the vector/list. There are two basic approaches to slicing:

Integer-based

R

> x=c(1,2,3,4,5,6)
> x[2:4]
[1] 2 3 4
> x[c(1,2,5)] # a vector of indices
[1] 1 2 5
> x[seq(1,5,2)] # seq creates a vector to be used as indices
[1] 1 3 5

Python

>>> x=[1,2,3,4,5,6]
>>> x[1:4] # x[start:end] start is inclusive but end is exclusive
[2, 3, 4]
>>> x[0:5:2] # x[start:end:step]
[1, 3, 5]

The code snippets above use the hash character # for comments in both R and Python. Everything after # on the same line is treated as a comment (not executed). In the R code, we also used the function seq() to create a vector. When I see a function that I haven’t seen before, I either google it or use the built-in help mechanism. Specifically, use ? in R and help() in Python.

R

> ?seq

Python

>>> help(print)

Condition-based

Condition-based slicing means selecting the subset of elements that satisfy certain conditions. In R, this is quite straightforward with a boolean vector whose length is the same as that of the vector to slice.

R

> x=c(1,2,5,5,6,6)
> x[x %% 2==1] # %% is the modulo operator in R; we select the odd elements
[1] 1 5 5
> x %% 2==1 # results in a boolean vector with the same length as x
[1]  TRUE FALSE  TRUE  TRUE FALSE FALSE

The condition-based slicing in Python is quite different from that in R. The prerequisite is list comprehension which provides a concise way to create new lists in Python. For example, let’s create a list of squares of another list.

Python

>>> x=[1,2,5,5,6,6]
>>> [e**2 for e in x] # ** is the exponent operator, i.e., x**y means x to the power of y
[1, 4, 25, 25, 36, 36]

We can also add an if clause to a list comprehension to filter a list, which achieves condition-based slicing.

Python

>>> x=[1,2,5,5,6,6]
>>> [e for e in x if e%2==1] # % is the modulo operator in Python
[1, 5, 5]

It is also common to use if/else with list comprehension to achieve more complex operations. For example, given a list x, let’s create a new list y so that the non-negative elements in x are squared and the negative elements are replaced by 0s.

Python

>>> x=[1,-1,0,2,5,-3]
>>> [e**2 if e>=0 else 0 for e in x]
[1, 0, 0, 4, 25, 0]

The examples above show the power of list comprehension. To filter with if, the if clause is placed at the end, after the for clause; to choose between two values with if/else, the if/else expression is placed at the beginning, before the for clause.
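
The two placements can also be combined in one comprehension. The sketch below is only an illustration; the list names and the labels 'pos' and 'neg' are made up for the example.

```python
x = [1, -1, 0, 2, 5, -3]

# filter only: the if clause comes after the for clause
non_negative = [e for e in x if e >= 0]

# map every element: the if/else expression comes before the for clause
squared_or_zero = [e**2 if e >= 0 else 0 for e in x]

# both together: keep the odd elements, then label each of them by its sign
odd_signs = ['pos' if e > 0 else 'neg' for e in x if e % 2 == 1]

print(non_negative)     # [1, 0, 2, 5]
print(squared_or_zero)  # [1, 0, 0, 4, 25, 0]
print(odd_signs)        # ['pos', 'neg', 'pos', 'neg']
```

Note that in Python the result of % takes the sign of the divisor, so -1 % 2 and -3 % 2 are both 1 and the negative odd numbers pass the filter.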

We can also modify the value of an element in a vector/list variable.

R

> x=c(1,2,3)
> x[1]=-1
> x
[1] -1  2  3

Python

>>> x=[1,2,3]
>>> x[0]=-1
>>> x
[-1, 2, 3]

Two or more vectors/lists can be concatenated easily.

R

> x=c(1,2)
> y=c(3,4)
> z=c(5,6,7,8)
> c(x,y,z)
[1] 1 2 3 4 5 6 7 8

Python

>>> x=[1,2]
>>> y=[3,4]
>>> z=[5,6,7,8]
>>> x+y+z
[1, 2, 3, 4, 5, 6, 7, 8]

As the list structure in Python is mutable, there are many things we can do with a list.

Python

>>> x=[1,2,3]
>>> x.append(4) # append a single value to the list x
>>> x
[1, 2, 3, 4]
>>> y=[5,6]
>>> x.extend(y) # extend x with the elements of list y
>>> x
[1, 2, 3, 4, 5, 6]
>>> last=x.pop() # pop the last element from x
>>> last
6
>>> x
[1, 2, 3, 4, 5]

I like the list structure in Python much more than the vector structure in R. A list in Python has many more useful features, which can be found in the official Python documentation⁵.
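
A few more of these list methods are sketched below. They all modify the list in place; the selection of methods is mine and is not meant to be complete.

```python
x = [3, 1, 2]
x.insert(0, 5)          # insert 5 at index 0             -> [5, 3, 1, 2]
x.remove(1)             # remove the first occurrence of 1 -> [5, 3, 2]
position = x.index(3)   # index of the first occurrence of 3
x.sort()                # sort in place                    -> [2, 3, 5]
x.reverse()             # reverse in place                 -> [5, 3, 2]

print(x, position)
```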

array

Array is one of the most important data structures in scientific programming. In R, there is also an object type “matrix”, but in my experience we can almost ignore its existence and use array instead. We can certainly use a list as an array in Python, but many linear algebra operations are not supported for the list type. Fortunately, the Python package numpy provides arrays off the shelf.

R

> x=1:12
> array1=array(x,c(4,3)) # convert vector x to a 4 rows * 3 cols array
> array1
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
> y=1:6
> array2=array(y,c(3,2)) # convert vector y to a 3 rows * 2 cols array
> array2
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> array3 = array1 %*% array2 # %*% is the matrix multiplication operator
> array3
     [,1] [,2]
[1,]   38   83
[2,]   44   98
[3,]   50  113
[4,]   56  128
> dim(array3) # get the dimension of array3
[1] 4 2

Python

>>> import numpy as np # we import the numpy module and alias it as np
>>> array1=np.reshape(list(range(1,13)),(4,3)) # convert a list to a 2d np.array
>>> array1
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
>>> type(array1)
<class 'numpy.ndarray'>
>>> array2=np.reshape(list(range(1,7)),(3,2))
>>> array2
array([[1, 2],
       [3, 4],
       [5, 6]])
>>> array3=np.dot(array1,array2) # matrix multiplication using np.dot()
>>> array3
array([[ 22,  28],
       [ 49,  64],
       [ 76, 100],
       [103, 136]])
>>> array3.shape # get the shape(dimension) of array3
(4, 2)

You may have noticed that the results of the R and Python code snippets are different. The reason is that R fills an array from a vector by column, whereas numpy reshapes a list into a 2D numpy.array by row. There are two ways to reshape a list into a 2D numpy.array by column.

Python

>>> array1=np.reshape(list(range(1,13)),(4,3),order='F') # use order='F'
>>> array1
array([[ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11],
       [ 4,  8, 12]])
>>> array2=np.reshape(list(range(1,7)),(2,3)).T # use .T to transpose an array
>>> array2
array([[1, 4],
       [2, 5],
       [3, 6]])
>>> np.dot(array1,array2) # now we get the same result as using R
array([[ 38,  83],
       [ 44,  98],
       [ 50, 113],
       [ 56, 128]])
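
The difference between the two fill orders can also be sketched in plain Python, without numpy. The helper names reshape_by_row and reshape_by_col are hypothetical and only illustrate the index arithmetic; they are not part of numpy or R.

```python
def reshape_by_row(values, n_rows, n_cols):
    """Fill each row completely before moving to the next one (numpy's default order)."""
    return [[values[r * n_cols + c] for c in range(n_cols)] for r in range(n_rows)]

def reshape_by_col(values, n_rows, n_cols):
    """Fill each column completely before moving to the next one (the order R uses)."""
    return [[values[c * n_rows + r] for c in range(n_cols)] for r in range(n_rows)]

values = list(range(1, 13))
by_row = reshape_by_row(values, 4, 3)
by_col = reshape_by_col(values, 4, 3)

print(by_row)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
print(by_col)  # [[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]]
```

In the by-row case the element at row r and column c comes from position r * n_cols + c of the flat list, and in the by-column case from position c * n_rows + r.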

To learn more about numpy, the official website⁶ has great documentation/tutorials.

broadcasting

The term broadcasting describes how arrays with different shapes are handled during arithmetic operations. A simple example of broadcasting is given below.

R

> x = c(1, 2, 3)
> x+1
[1] 2 3 4

Python

>>> import numpy as np
>>> x = np.array([1, 2, 3])
>>> x + 1
array([2, 3, 4])

However, the broadcasting rules in R and Python are not exactly the same.

R

> x = array(c(1:6), c(3,2))
> y = c(1, 2, 3)
> z = c(1, 2)
> # point-wise multiplication
> x * y
     [,1] [,2]
[1,]    1    4
[2,]    4   10
[3,]    9   18
> x*z
     [,1] [,2]
[1,]    1    8
[2,]    4    5
[3,]    3   12

Python

>>> import numpy as np
>>> x = np.array([[1, 2], [3, 4], [5, 6]])
>>> y = np.array([1, 2, 3])
>>> z = np.array([1, 2])
>>> # point-wise multiplication
>>> x * y
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (3,2) (3,)
>>> x * z
array([[ 1,  4],
       [ 3,  8],
       [ 5, 12]])

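From the R code, we see that broadcasting in R works by recycling the shorter vector down the columns of the array. In Python, when the two arrays have different numbers of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading side. Two dimensions are then compatible if they are equal or if one of them is 1. According to this rule, when we do x * y, the shape of x is (3, 2) but the shape of y is (3,). Thus, the shape of y is padded to (1, 3). The trailing dimensions 2 and 3 are neither equal nor 1, which explains the ValueError raised by x * y. For x * z, the shape of z is padded from (2,) to (1, 2), which is compatible with (3, 2), so each row of x is multiplied by z.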
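
The padding rule can be written down as a short function that works on shapes only. This is a sketch in plain Python under the rule stated above; broadcast_shape is a hypothetical helper written for this illustration, not numpy's own implementation.

```python
def broadcast_shape(shape_a, shape_b):
    """Return the shape of the broadcast result, or raise ValueError if incompatible."""
    n_dims = max(len(shape_a), len(shape_b))
    # pad the shorter shape with ones on its leading (left) side
    padded_a = (1,) * (n_dims - len(shape_a)) + tuple(shape_a)
    padded_b = (1,) * (n_dims - len(shape_b)) + tuple(shape_b)
    result = []
    for dim_a, dim_b in zip(padded_a, padded_b):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)      # equal sizes, or b is stretched along this axis
        elif dim_a == 1:
            result.append(dim_b)      # a is stretched along this axis
        else:
            raise ValueError(f"shapes {shape_a} and {shape_b} cannot be broadcast")
    return tuple(result)

print(broadcast_shape((3, 2), (2,)))     # the case x * z
print(broadcast_shape((3, 2), (3, 1)))   # y reshaped to a column
try:
    broadcast_shape((3, 2), (3,))        # the case x * y
except ValueError as err:
    print(err)
```

The second call suggests how to obtain R's result for x * y in numpy: giving y the shape (3, 1), e.g., with y.reshape(3, 1), makes the shapes compatible, so each row of x is multiplied by the corresponding element of y.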

list in R and dictionary in Python

Yes, in R there is also an object type called list. The major difference between a vector and a list in R is that a list could contain different types of elements. list in R supports integer-based accessing using [[]] (compared to [] for vector).

R

> x=list(1,'hello world!')
> x
[[1]]
[1] 1

[[2]]
[1] "hello world!"

> x[[1]]
[1] 1
> x[[2]]
[1] "hello world!"
> length(x)
[1] 2

A list in R can be named and then supports access by name via either [[]] or the $ operator. A vector in R can also be named and accessed by name, via [].

R

> x=c('a'=1,'b'=2)
> names(x)
[1] "a" "b"
> x['b']
b
2
> l=list('a'=1,'b'=2)
> l[['b']]
[1] 2
> l$b
[1] 2
> names(l)
[1] "a" "b"

However, the elements of a list in Python can’t be named as in R. If we need access by name in Python, we can use the dictionary structure. If you have used Java before, you may consider a dictionary in Python as the counterpart of HashMap in Java. Essentially, a dictionary in Python is a collection of key:value pairs.

Python

>>> x={'a':1,'b':2} # {key:value} pairs
>>> x
{'a': 1, 'b': 2}
>>> x['a']
1
>>> x['b']
2
>>> len(x) # number of key:value pairs
2
>>> x.pop('a') # remove the key 'a' and we get its value 1
1
>>> x
{'b': 2}

Unlike a dictionary in Python, a list in R doesn’t support the pop() operation. Thus, in order to modify a list in R, a new one is created explicitly or implicitly.
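
Besides pop(), a dictionary in Python supports several other in-place operations. The sketch below shows a few common ones; the keys and variable names are illustrative.

```python
x = {'a': 1, 'b': 2}

x['c'] = 3                 # add a new key:value pair in place
x['a'] = 10                # overwrite the value of an existing key
has_b = 'b' in x           # membership test on the keys
fallback = x.get('z', 0)   # get() returns a default instead of raising KeyError
pairs = list(x.items())    # key:value pairs, in insertion order for Python 3.7+

print(x, has_b, fallback, pairs)
```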

data.frame

data.frame is a built-in type in R for data manipulation. In Python, there is no such built-in data structure since Python is a more general-purpose programming language. The solution for data.frame in Python is the pandas⁷ module.

Before we dive into data.frame, you may be curious why we need it. In other words, why can’t we just use vector, list, array/matrix and dictionary for all data manipulation tasks? I would say we can: data.frame is not a must-have feature for most ETL (extract, transform and load) operations. But data.frame provides a very intuitive way for us to understand a structured data set. A data.frame is usually flat with 2 dimensions, i.e., row and column. The row dimension runs across observations and the column dimension runs across attributes/features. If you are familiar with relational databases, a data.frame can be viewed as a table.

Let’s see an example of using data.frame to represent employees’ information in a company.

R

> employee_df = data.frame(name=c("A", "B", "C"),department=c("Engineering","Operations","Sales"))
> employee_df
  name  department
1    A Engineering
2    B  Operations
3    C       Sales

Python

>>> import pandas as pd
>>> employee_df=pd.DataFrame({'name':['A','B','C'],'department':["Engineering","Operations","Sales"]})
>>> employee_df
  name   department
0    A  Engineering
1    B   Operations
2    C        Sales

There are quite a few ways to create data.frame. The most commonly used one is to create data.frame object from array/matrix. We may also need to convert a numeric data.frame to an array/matrix.

R

> x=array(rnorm(12),c(3,4))
> x
           [,1]       [,2]      [,3]       [,4]
[1,] -0.8101246 -0.8594136 -2.260810  0.5727590
[2,] -0.9175476  0.1345982  1.067628 -0.7643533
[3,]  0.7865971 -1.9046711 -0.154928 -0.6807527
> random_df=as.data.frame(x)
> random_df
          V1         V2        V3         V4
1 -0.8101246 -0.8594136 -2.260810  0.5727590
2 -0.9175476  0.1345982  1.067628 -0.7643533
3  0.7865971 -1.9046711 -0.154928 -0.6807527
> data.matrix(random_df)
             V1         V2        V3         V4
[1,] -0.8101246 -0.8594136 -2.260810  0.5727590
[2,] -0.9175476  0.1345982  1.067628 -0.7643533
[3,]  0.7865971 -1.9046711 -0.154928 -0.6807527

Python

>>> import numpy as np
>>> import pandas as pd
>>> x=np.random.normal(size=(3,4))
>>> x
array([[-0.54164878, -0.14285267, -0.39835535, -0.81522719],
       [ 0.01540508,  0.63556266,  0.16800583,  0.17594448],
       [-1.21598262,  0.52860817, -0.61757696,  0.18445057]])
>>> random_df=pd.DataFrame(x)
>>> random_df
          0         1         2         3
0 -0.541649 -0.142853 -0.398355 -0.815227
1  0.015405  0.635563  0.168006  0.175944
2 -1.215983  0.528608 -0.617577  0.184451
>>> np.asarray(random_df)
array([[-0.54164878, -0.14285267, -0.39835535, -0.81522719],
       [ 0.01540508,  0.63556266,  0.16800583,  0.17594448],
       [-1.21598262,  0.52860817, -0.61757696,  0.18445057]])

In general, operations on an array/matrix are much faster than those on a data frame. In R, we may use the built-in function data.matrix to convert a data.frame to an array/matrix. In Python, we could use the function asarray in the numpy module.

Although data.frame is a built-in type, it is not very efficient for many operations. I would suggest using data.table⁸ whenever possible. dplyr⁹ is also a very popular package in R for data manipulation. Many good resources are available online to learn data.table and pandas.

Revisit of variables

We have talked about variables and functions so far. When a function has a name, its name is also a valid variable. After all, what is a variable?

In mathematics, a variable is a symbol that represents an element, and we do not care whether we conceptualize a variable in our mind or write it down on paper. In programming, however, a variable is not only a symbol. We have to understand that a variable is a name given to a memory location in a computer system. When we run x=2 in R or Python, the value 2 is stored somewhere in memory, and the variable (name) x points to this memory address. If we further run y=x, the variable y points to the same memory location as x. What if we run x=3? It doesn’t modify the memory which stores the value 2. Instead, the value 3 is stored somewhere else in memory, and that memory location now has the name x. The variable y is not affected at all, nor is the memory location it points to.
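
The sequence x=2, y=x, x=3 described above can be traced in Python with the is operator, which tests whether two names refer to the same object. This is a minimal sketch; same_before and same_after are illustrative names.

```python
x = 2
y = x                   # y is bound to the same object as x
same_before = x is y    # True: two names, one object

x = 3                   # rebinding x leaves the object that y refers to untouched
same_after = x is y     # False: x now names a different object

print(same_before, same_after, y)  # True False 2
```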

Mutability

Almost everything in R or Python is an object, including these data structures we introduced in previous sections. Mutability is a property of objects, not variables, because a variable is just a name.

A list in Python is mutable meaning that we could change the elements stored in the list object without copying the list object from one memory location to another. We can use the id function in Python to check the memory location for a variable. In the code below, we modified the first element of the list object with name x. And since Python list is mutable, the memory address of the list doesn’t change.

Python

>>> x=list(range(1,1001)) # list() converts a range object to a list
>>> hex(id(x)) # print the memory address of x
'0x10592d908'
>>> x[0]=1.0 # from integer to float
>>> hex(id(x))
'0x10592d908'

Is there any immutable data structure in Python? Yes, for example tuple, which contains a sequence of elements, is immutable. Element access and slicing of a tuple follow the same rules as for a list in Python.

Python

>>> x=(1,2,3,) # use () to create a tuple in Python, it is better to always put a comma at the end
>>> type(x)
<class 'tuple'>
>>> len(x)
3
>>> x[0]
1
>>> x[0]=-1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment

If we have two Python variables pointing to the same memory, when we modify the memory via one variable the other is also affected, as we expect (see the example below).

Python

>>> x=[1,2,3]
>>> y=x
>>> id(x)
4535459104
>>> id(y)
4535459104
>>> x[0]=0
>>> id(x)
4535459104
>>> id(y)
4535459104
>>> x
[0, 2, 3]
>>> y
[0, 2, 3]
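
When this sharing is not wanted, one common approach is to copy the list explicitly. The sketch below is only an illustration of that idea and uses the standard copy module for the nested case.

```python
import copy

x = [1, 2, 3]
alias = x            # second name for the same list object
shallow = x.copy()   # new list object with the same elements

x[0] = 0
print(alias)    # the alias sees the change
print(shallow)  # the copy does not

# For nested lists a shallow copy still shares the inner lists,
# so copy.deepcopy is needed for full independence.
nested = [[1, 2], [3, 4]]
shallow_nested = nested.copy()
deep_nested = copy.deepcopy(nested)
nested[0][0] = 0
print(shallow_nested[0][0], deep_nested[0][0])
```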

In contrast, the mutability of a vector in R is more complex and sometimes confusing. First, let’s see the behavior when a single name is given to the vector object stored in memory.

> a=c(1,2,3)
> .Internal(inspect(a))
@7fe94408f3c8 14 REALSXP g0c3 [NAM(1)] (len=3, tl=0) 1,2,3
> a[1]=0
> .Internal(inspect(a))
@7fe94408f3c8 14 REALSXP g0c3 [NAM(1)] (len=3, tl=0) 0,2,3

It is clear that in this case the vector object is mutable, since the memory address doesn’t change after the modification. What if an additional name is given to the same memory?

> a=c(1,2,3)
> b=a
> .Internal(inspect(a))
@7fe94408f238 14 REALSXP g0c3 [NAM(2)] (len=3, tl=0) 1,2,3
> .Internal(inspect(b))
@7fe94408f238 14 REALSXP g0c3 [NAM(2)] (len=3, tl=0) 1,2,3
> a[1]=0
> .Internal(inspect(a))
@7fe94408f0a8 14 REALSXP g0c3 [NAM(1)] (len=3, tl=0) 0,2,3
> .Internal(inspect(b))
@7fe94408f238 14 REALSXP g0c3 [NAM(2)] (len=3, tl=0) 1,2,3
> a
[1] 0 2 3
> b
[1] 1 2 3

Before the modification, both variables a and b point to the same vector object in memory. But surprisingly, after the modification the memory address of variable a has changed. This behavior is called “copy on modify” in R. Because of it, the modification of a doesn’t affect the object stored at the old address, and thus the vector object behaves as immutable in this case. The mutability of an R list is similar to that of an R vector. A copy is also made when an assignment changes the type of the vector, e.g., from integer to double, as the tracemem output below shows.

> x=c(1:3)
> tracemem(x) # report a message whenever x is copied to a new address
[1] "<0x7ff360c95c08>"
> x[1]=-x[1] # type not changed, i.e., from integer to integer
> tracemem(x)
[1] "<0x7ff360c95c08>"
> x[1]=-1.0 # type changed, i.e., from integer to double
tracemem[0x7ff360c95c08 -> 0x7ff3604692d8]:

Variable as function argument

Most functions/methods in R and Python take some variables as arguments. What happens when we pass the variables into a function?

In Python, the variable, i.e., the name of the object, is passed into a function. If the variable points to an immutable object, any modification made through the variable inside the function doesn’t persist. However, when the variable points to a mutable object, the modification of the object stored in memory persists. Let’s see the examples below.

Python

>>> def g(x):
...   print(id(x))
...   x-=1
...   print(id(x))
...   print(x)
>>> a=1
>>> id(a)
4531658512
>>> g(a)
4531658512
4531658480
0
>>> a
1

Python

>>> def f(x):
...   print(id(x)) # print() is needed: a bare id(x) inside a function shows nothing
...   x[0]-=1
...   print(id(x))
>>> a=[1,2,3]
>>> id(a)
4535423616
>>> f(a)
4535423616
4535423616
>>> a
[0, 2, 3]

We see that the object is passed into the function by its name. If the object is immutable, a new object is created in memory when a modification is attempted, and the original object is left unchanged. When the object is mutable, no new copy is made and thus the change persists outside the function.
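
The difference between rebinding the parameter name and modifying the object it points to can be seen in the following sketch. The function names rebind and mutate are hypothetical and exist only for this illustration.

```python
def rebind(values):
    # Assignment to the parameter name only rebinds the local name.
    values = [0, 0, 0]
    return id(values)

def mutate(values):
    # Item assignment changes the object itself, which the caller also sees.
    values[0] = 0
    return id(values)

a = [1, 2, 3]
rebind_id = rebind(a)
after_rebind = list(a)   # snapshot of a after calling rebind

mutate_id = mutate(a)
after_mutate = list(a)   # snapshot of a after calling mutate

print(after_rebind, after_mutate)
```

Even for a mutable argument, plain assignment to the parameter never reaches the caller; only in-place operations on the object do.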

In R, the passed object is always copied on a modification inside the function, and thus no modification can be made on the original object in memory.

> f=function(x){
+   print(.Internal(inspect(x)))
+   x[1]=x[1]-1
+   print(.Internal(inspect(x)))
+   print(x)
+ }
>
> a=c(1,2,3)
> .Internal(inspect(a))
@7fe945538688 14 REALSXP g0c3 [NAM(1)] (len=3, tl=0) 1,2,3
> f(a)
@7fe945538688 14 REALSXP g0c3 [NAM(3)] (len=3, tl=0) 1,2,3
[1] 1 2 3
@7fe945538598 14 REALSXP g0c3 [NAM(1)] (len=3, tl=0) 0,2,3
[1] 0 2 3
[1] 0 2 3
> a
[1] 1 2 3

People may argue that R functions are not as flexible as Python functions. However, it makes more sense to do functional programming in R since we usually can’t modify an object passed into a function.
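
The same functional style is available in Python if a function returns a new object instead of modifying its argument. The sketch below assumes a hypothetical helper named decrement_first.

```python
def decrement_first(values):
    """Return a new list whose first element is reduced by one."""
    # Build a new list rather than assigning into `values`.
    return [values[0] - 1] + values[1:]

a = [1, 2, 3]
b = decrement_first(a)
print(a, b)
```

The caller’s list a is left as it was, which mirrors what happens to the R vector a in the example above.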

Scope of variables

What is the scope of a variable and why does it matter? Let’s first have a look at the code snippets below.

> x=1
> var_func_1 = function(){print(x)}
> var_func_1()
[1] 1
> var_func_2 = function(){x=x+1; print(x)}
> var_func_2()
[1] 2
> x
[1] 1

Python

>>> x=1
>>> def var_func_1():print(x)
>>> var_func_1()
1
>>> def var_func_2():x+=1
...
>>> var_func_2()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in var_func_2
UnboundLocalError: local variable 'x' referenced before assignment

The results of the code above seem strange until we know the concept of variable scope. Inside a function, a variable may refer to a function argument/parameter, or it may be declared inside the function, in which case it is called a local variable. But in the code above, x is neither a function argument nor a local variable. How does the print() function know what the identifier x points to?

The scope of a variable determines where the variable is available/accessible (can be referenced). Both R and Python apply lexical/static scoping for variables, which sets the scope of a variable based on the structure of the program. In static scoping, when an ‘unknown’ variable is referenced, the function tries to find it in the most closely enclosing block. That explains how the print() function could find the variable x.

In the R code above, in x=x+1 the first x is a local variable created by the = operator; the second x is referenced inside the function before any local x exists, so the static scoping rule applies and the global x is found. As a result, a local variable x equal to 2 is created, which is independent of the x outside the function var_func_2(). In Python, however, when a variable is assigned a value anywhere in a function body, the variable is treated as local throughout that function, and that explains the UnboundLocalError.
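
The three situations in Python (read only, assign a new local, read before assigning) can be put side by side. This is a minimal sketch with hypothetical function names.

```python
x = 1

def read_only():
    # No assignment to x here, so the name is looked up in the enclosing
    # (here: module-level) scope.
    return x + 1

def shadow():
    # Assigning to x makes it local for the whole function body;
    # the module-level x is left alone.
    x = 10
    return x + 1

def broken():
    # x is local because of the assignment, but it is read before it is bound.
    x += 1
    return x

try:
    broken()
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__

print(read_only(), shadow(), x, error_name)
```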

Is it possible to change a variable inside a function which is declared outside the function without passing it as an argument? Based on the static scoping rule alone, it’s impossible. But there are workarounds in both R and Python. In R, we need the help of environments; in Python we can use the keyword global.

So what is an environment in R? An environment is a place where objects are stored. When we invoke the interactive R session, an environment named .GlobalEnv is created automatically. We can also use the function environment() to get the current environment. The ls() function can take an environment as its argument to list all objects inside the environment.

> typeof(.GlobalEnv)
[1] "environment"
> environment()
<environment: R_GlobalEnv>
> x=1
> ls(environment())
[1] "x"
> env_func_1=function(x){
+   y=x+1
+   print(environment())
+   ls(environment())
+ }
> env_func_1(2)
<environment: 0x7fc59d165a20>
[1] "x" "y"
> env_func_2=function(){print(environment())}
> env_func_2()
<environment: 0x7fc59d16f520>

The above code shows that each function call has its own environment containing all function arguments and local variables declared inside the function. In order to change a variable declared outside of a function, we need access to the environment that encloses the variable. In R, the function parent.env(e) returns the parent environment of the given environment e. Using this function, we are able to change the value of x declared in .GlobalEnv from inside a function which is also declared in .GlobalEnv. The global keyword in Python works in a totally different way, which is simpler but less flexible.

> x=1
> env_func_3=function(){
+   cur_env=environment()
+   par_env=parent.env(cur_env)
+   par_env$x=2
+ }
> env_func_3()
> x
[1] 2

Python

>>> def env_func_3():
...   global x
...   x = 2
...
>>> x=1
>>> env_func_3()
>>> x
2
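
The global keyword only reaches module-level names. For a variable that lives in an enclosing function, Python provides the related nonlocal keyword. The sketch below, with the hypothetical names make_counter and increment, shows that case.

```python
def make_counter():
    count = 0  # local to make_counter, enclosing scope for increment

    def increment():
        nonlocal count  # rebind the variable of the enclosing function
        count += 1
        return count

    return increment

counter = make_counter()
first = counter()
second = counter()
print(first, second)
```

Each call to make_counter creates a fresh count, so separate counters do not interfere with each other.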

I seldom, if ever, use the global keyword in Python. But environments in R can be very handy on some occasions. In R, an environment can be used as a purely mutable version of the list data structure.

# list is not purely mutable
> x=list(1)
> tracemem(x)
[1] "<0x7f829183f6f8>"
> x$a=2
> tracemem(x)
[1] "<0x7f828f4d05c8>"

# environment is purely mutable
> x=new.env()
> x
<environment: 0x7f8290aee7e8>
> x$a=2
> x
<environment: 0x7f8290aee7e8>

Actually, the object of an R6 class type is also an environment.

> # load the Complex class that we defined in chapter 1
> x = Complex$new(1,2)
> typeof(x)
[1] "environment"

In Python, we can assign values to multiple variables in one line.

Python

>>> x,y = 1,2
>>> x
1
>>> y
2

Python

>>> x,y=(1,2)
>>> print(x,y)
1 2
>>> (x,y)=(1,2)
>>> print(x,y)
1 2
>>> [x,y]=(1,2)
>>> print(x,y)
1 2

Even though in the first snippet above (x,y = 1,2) there are no parentheses around 1, 2 after the = operator, a tuple is created first and then the tuple is unpacked and assigned to x,y. Such a mechanism doesn’t exist in R, but we can define our own multiple assignment operator with the help of environments.

`%=%` = function(left, right) {
  # we require the RHS to be a list strictly
  stopifnot(is.list(right))
  # dest_env is the destination environment enclosing the variables on LHS
  dest_env = parent.env(environment())
  left = substitute(left)

  recursive_assign = function(left, right, dest_env) {
    if (length(left) == 1) {
      assign(x = deparse(left),
             value = right,
             envir = dest_env)
      return()
    }
    if (length(left) != length(right) + 1) {
      stop("LHS and RHS must have the same shapes")
    }

    for (i in 2:length(left)) {
      recursive_assign(left[[i]], right[[i - 1]],  dest_env)
    }
  }

  recursive_assign(left, right, dest_env)
}

Before going deeper into the script, let’s first see the usage of the multiple assignment operator we defined.

> source('multi_assignment.R')
> c(x,y,z) %=% list(1,"Hello World!",c(2,3))
> x
[1] 1
> y
[1] "Hello World!"
> z
[1] 2 3
> list(a,b) %=% list(1,as.Date('2019-01-01'))
> a
[1] 1
> b
[1] "2019-01-01"
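
For comparison, Python’s built-in unpacking already handles nested shapes, which is the kind of behavior the R operator above emulates. The following sketch uses illustrative values only.

```python
# Nested targets mirror the shape of the right-hand side.
(a, b), c = (1, 2), "Hello World!"

# A starred target collects the remaining elements into a list.
head, *rest = [1, 2, 3, 4]

# Swapping works because the right-hand tuple is built before assignment.
x, y = 1, 2
x, y = y, x

print(a, b, c, head, rest, x, y)
```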

In the %=% operator defined above, we used two functions, substitute and deparse, which are very powerful but less known to R novices. To better understand these functions as well as some other less known R functions, the Rchaeology¹⁹ tutorial is worth reading.

It is also interesting to see that we defined the function recursive_assign inside the %=% function. Both R and Python support the concept of first-class functions. More specifically, a function in R/Python is an object, which can be

stored as a variable;
passed as a function argument;
returned from a function.
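
The three properties listed above can be sketched in Python as follows. All function names here are hypothetical and serve only as an illustration.

```python
def square(v):
    return v * v

# 1. stored as a variable
f = square

# 2. passed as a function argument
def apply_twice(func, v):
    return func(func(v))

# 3. returned from a function
def make_adder(n):
    def add(v):
        return v + n
    return add

add_three = make_adder(3)
print(f(4), apply_twice(square, 3), add_three(4))
```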

The essential idea behind the recursive_assign function is a depth-first search (DFS), which is a fundamental graph traversal algorithm²⁰. In the context of the recursive_assign function, we use DFS to traverse the parse tree of the left argument created by calling substitute(left).
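
To make the depth-first traversal concrete, here is a Python sketch of the same idea. It is not a translation of the R operator: it assumes the left-hand side is given as nested lists of name strings and writes the results into a plain dict named env, both of which are choices made only for this illustration.

```python
def recursive_assign(names, values, dest):
    """Depth-first walk over `names`, binding each leaf name in `dest`."""
    if isinstance(names, str):
        # Leaf node: a single name receives the whole value.
        dest[names] = values
        return
    if len(names) != len(values):
        raise ValueError("LHS and RHS must have the same shapes")
    for name, value in zip(names, values):
        # Visit each child subtree completely before moving to the next one.
        recursive_assign(name, value, dest)

env = {}
recursive_assign(["x", ["y", "z"]], [1, ["Hello World!", (2, 3)]], env)
print(env)
```

The recursion goes as deep as possible into the first child before it moves on to its siblings, which is what makes the traversal depth-first.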

Object-oriented programming (OOP) in R/Python

All the code we wrote above follows the procedural programming paradigm¹⁰. We can also do functional programming (FP) and OOP in R/Python. In this section, let’s focus on OOP in R/Python.

Class is the key concept in OOP. In R there are two commonly used built-in systems to define classes, i.e., S3 and S4. In addition, there is an external package R6¹¹ which defines R6 classes. S3 is a light-weight system, but its style is quite different from OOP in many other programming languages. The S4 system follows the principles of modern object-oriented programming much better than S3. However, the usage of S4 classes is quite tedious. I will skip S3/S4 and introduce R6, which is closer to the class in Python.

Let’s build a class in R/Python to represent complex numbers.

> library(R6) # load the R6 package
>
> Complex = R6Class("Complex",
+   public = list( # only elements declared in this list are accessible by the object of this class
+     real = NULL,
+     imag = NULL,
+     # the initialize function would be called automatically when we create an object of the class
+     initialize = function(real,imag){
+       # call functions to change real and imag values
+       self$set_real(real)
+       self$set_imag(imag)
+     },
+     # define a function to change the real value
+     set_real = function(real){
+       self$real=real
+     },
+     # define a function to change the imag value
+     set_imag = function(imag){
+       self$imag=imag
+     },
+     # override print function
+     print = function(){
+       cat(paste0(as.character(self$real), '+', as.character(self$imag),'j'), '\n')
+     }
+   )
+ )
> # let's create a complex number object based on the Complex class we defined above using the new function
> x = Complex$new(1,2)
> x
1+2j
> x$real # the public attributes of x could be accessed by $ operator
[1] 1

Python

>>> class Complex:
...   # the __init__ function would be called automatically when we create an object of the class
...   def __init__(self,real,imag):
...     self.real = None
...     self.imag = None
...     self.set_real(real)
...     self.set_imag(imag)
...   # define a function to change the real value
...   def set_real(self,real):
...     self.real=real
...   # define a function to change the imag value
...   def set_imag(self,imag):
...     self.imag=imag
...   def __repr__(self):
...     return "{0}+{1}j".format(self.real,self.imag)
...
>>> x = Complex(1,2)
>>> x
1+2j
>>> x.real # different from the $ operator in R, here we use . to access the attribute of an object
1

By overriding the print function in the R6 class, we can have the object printed in the format of real+imag j. To achieve the same effect in Python, we override the method __repr__. In Python, the functions defined in classes are called methods. Overriding a method means changing the implementation of a method provided by one of the class’s ancestors. To understand the concept of ancestors in OOP, one needs to understand the concept of inheritance¹².
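
As a small illustration of inheritance and overriding, consider the sketch below. The classes Shape and Square are hypothetical and are not used elsewhere in this chapter.

```python
class Shape:
    def area(self):
        return 0.0

    def describe(self):
        # describe() is inherited unchanged by subclasses
        return "{0} with area {1}".format(type(self).__name__, self.area())

class Square(Shape):  # Square inherits from its ancestor Shape
    def __init__(self, side):
        self.side = side

    def area(self):
        # override: replaces the implementation provided by Shape
        return self.side ** 2

s = Square(3)
print(s.describe())
```

Square does not define describe, so the call is resolved in Shape, while area is resolved in Square because the subclass overrides it.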

You may be curious about the double underscores surrounding method names such as __init__ and __repr__. These methods are known as magic methods¹³. Magic methods can be very handy if we use them in suitable cases. For example, we can use the magic method __add__ to implement the + operator for the Complex class we defined above.

In the definition of the magic method __repr__ in the Python code, the format method of str object¹⁴ is used.

You may wonder why we use the set_real/set_imag methods in the Python class. Actually, they are not necessary. It is fine to modify the attributes of an object directly. For example, in the code below we remove these methods, which simplifies the definition of the Complex class.

Python

>>> class Complex:
...   def __init__(self, real, imag):
...     # we can modify real and imag directly
...     self.real = real
...     self.imag = imag
...   def __repr__(self):
...     return "{0}+{1}j".format(self.real, self.imag)
...   def __add__(self, another):
...     return Complex(self.real + another.real, self.imag + another.imag)
...
>>> x = Complex(1,2)
>>> y = Complex(2,4)
>>> x+y # + operator works now
3+6j
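
Other magic methods follow the same pattern. The sketch below is one possible extension, not part of the class above: it assumes we also want value-based equality (__eq__) and addition with plain numbers on either side (__radd__).

```python
class Complex:
    def __init__(self, real, imag):
        self.real = real
        self.imag = imag

    def __repr__(self):
        return "{0}+{1}j".format(self.real, self.imag)

    def __eq__(self, other):
        # compare by value instead of by identity
        if not isinstance(other, Complex):
            return NotImplemented
        return (self.real, self.imag) == (other.real, other.imag)

    def __add__(self, other):
        if isinstance(other, Complex):
            return Complex(self.real + other.real, self.imag + other.imag)
        if isinstance(other, (int, float)):
            # treat a plain number as a complex number with zero imaginary part
            return Complex(self.real + other, self.imag)
        return NotImplemented  # let Python try the other operand or raise TypeError

    # called for `number + Complex(...)`, where the left operand is not a Complex
    __radd__ = __add__

x = Complex(1, 2)
print(x + 1, 1 + x, x + Complex(2, 4) == Complex(3, 6))
```

Returning NotImplemented rather than raising an error lets Python fall back to the other operand’s methods before it reports a TypeError.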

We can also implement the + operator for the Complex class in R, as we did in Python.

> `+.Complex` = function(x,y){
+   Complex$new(x$real+y$real,x$imag+y$imag)
+ }
> x=Complex$new(1,2)
> y=Complex$new(2,4)
> x+y
3+6j

The most interesting part of the code above is `+.Complex`. First, why do we use `` to quote the function name? Before getting into this question, let’s have a look at Python 3’s variable naming rules¹⁵.

Within the ASCII range (U+0001..U+007F), the valid characters for identifiers (also referred to as names) are the same as in Python 2.x: the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9.

According to the rule, we can’t declare a variable with the name 2x. Unlike in Python, in R we can also use . in variable names¹⁶. Moreover, there is a workaround to use invalid variable names in R with the help of ``.

> 2x = 5
Error: unexpected symbol in "2x"
> .x = 3
> .x
[1] 3
> `+2x%` = 0
> `+2x%`
[1] 0

Python

>>> 2x = 5
  File "<stdin>", line 1
    2x = 5
     ^
SyntaxError: invalid syntax
>>> .x = 3
  File "<stdin>", line 1
    .x = 3
    ^
SyntaxError: invalid syntax
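
In Python, the naming rule can also be checked programmatically with the built-in str.isidentifier method and the standard keyword module. The candidate names below are chosen only for illustration.

```python
import keyword

candidates = ["x", "_x", "x2", "2x", ".x", "my.var", "class"]

# str.isidentifier() checks the lexical rule only; reserved words such as
# "class" pass it, so keyword.iskeyword() is checked separately.
usable = {
    name: name.isidentifier() and not keyword.iskeyword(name)
    for name in candidates
}

for name, ok in usable.items():
    print("{0!r}: {1}".format(name, ok))
```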

Now it is clear that the `` in `+.Complex` is used to define a function with an otherwise invalid name. Placing .Complex after + is related to S3 method dispatching, which is not discussed here.

Member Accessibility in Python

In some programming languages, the members (variables or methods) of a class can be declared with access modifiers which specify the accessibility or scope of a member. In Python, class members don’t have explicit access modifiers, but it is still possible to specify the accessibility. By default, a class member can be accessed inside or outside the class definition. If the name of the member starts with a single underscore, the member should not be accessed outside the class definition by convention, but this is not enforced. If the name of the member starts with a double underscore, the member name is mangled and thus the member can’t be accessed by its original name outside the class definition. But inside the class definition these members can always be accessed. Let’s see the example below.

Python

>>> class ProtectedMemberClass:
...     """
...     _x and _func1 are protected and are recommended not to be accessed out of the class definition, but not enforced.
...     __y and __func2 are private and the names would be mangled
...     """
...     def __init__(self, val1, val2):
...         self._x = val1
...         self.__y = val2
...     def _func1(self):
...         print("protected _func1 called")
...     def __func2(self):
...         print("private __func2 called")
...     def func3(self):
...         # inside the class definition, we can access all these members
...         print("self._x is {0} and self.__y is {1}".format(
...             self._x, self.__y))
...         self._func1()
...         self.__func2()
...
>>> p = ProtectedMemberClass(0, 1)
>>> p._x
0
>>> p.__y
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'ProtectedMemberClass' object has no attribute '__y'
>>> p._func1()
protected _func1 called
>>> p.__func2()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'ProtectedMemberClass' object has no attribute '__func2'
>>> p.func3()
self._x is 0 and self.__y is 1
protected _func1 called
private __func2 called

In this example, an error is thrown when we try to access __y or __func2 outside the class definition. But they are reachable within the class definition and these fields are usually called private fields.
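
Name mangling renames the member rather than hiding it: a name such as __balance used inside a class named Account is stored as _Account__balance. The sketch below uses a hypothetical Account class to show where the value actually lives.

```python
class Account:  # hypothetical class, for illustration only
    def __init__(self, balance):
        self.__balance = balance  # stored as _Account__balance

    def balance(self):
        return self.__balance

acct = Account(100)
attribute_names = sorted(vars(acct))     # names actually stored on the instance
mangled_value = acct._Account__balance   # works, but defeats the purpose of the prefix
has_plain_name = hasattr(acct, "__balance")
print(attribute_names, mangled_value, has_plain_name)
```

So the double underscore is a safeguard against accidental access and name clashes in subclasses, not a strict access control mechanism.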

Miscellaneous

There are some items that I haven’t discussed so far, which are also important in order to master R/Python.

Package/module installation

Use install.packages() function in R
Use R IDE to install packages
Use pip¹⁷ to install modules in Python

Virtual environment

A virtual environment is a tool to manage dependencies in Python. There are different ways to create virtual environments in Python, but I suggest using the venv module shipped with Python 3. Unfortunately, as far as I know there is nothing like a real virtual environment in R, although there are quite a few package management tools/packages.
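
A virtual environment is usually created from the command line with python3 -m venv followed by a directory name. The same module can also be called from Python, as in the sketch below. The directory name demo-env and the use of a temporary directory are assumptions made only for this illustration.

```python
import os
import tempfile
import venv

# Create the environment in a throwaway directory (hypothetical location).
env_dir = os.path.join(tempfile.mkdtemp(), "demo-env")

# with_pip=False keeps the example fast and offline; pass with_pip=True
# to have pip bootstrapped into the new environment.
venv.create(env_dir, with_pip=False)

# Every virtual environment carries a pyvenv.cfg file at its root.
print(os.path.exists(os.path.join(env_dir, "pyvenv.cfg")))
```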

<- vs. =

If you have used R before, you have probably heard of the advice¹⁸ to use <- rather than = for value assignment. However, I always use = for value assignment. Let’s see an example where <- makes a difference when we do value assignment.

> x=1
> a=list(x <- 2)
> a
[[1]]
[1] 2

> x
[1] 2

> x=1
> a=list(x = 2)
> a
$x
[1] 2

> x
[1] 1

When the assignment is made inside an argument of a function call, <- and = have different effects. But it is not recommended to use <- inside an argument of a function call.

¹ https://www.r-project.org

² https://www.python.org

³ https://docs.python.org/3/tutorial/modules.html

⁴ https://cran.r-project.org/doc/manuals/r-devel/NEWS.html

⁵ https://docs.python.org/3/tutorial/datastructures.html

⁶ http://www.numpy.org

⁷ https://pandas.pydata.org/

⁸ https://cran.r-project.org/web/packages/data.table/index.html

⁹ https://dplyr.tidyverse.org

¹⁰ https://en.wikipedia.org/wiki/Comparison_of_programming_paradigms

¹¹ https://cran.r-project.org/web/packages/R6/index.html

¹² https://en.wikipedia.org/wiki/Inheritance_(object-oriented_programming)

¹³ https://rszalski.github.io/magicmethods

¹⁴ https://docs.python.org/3.7/library/string.html

¹⁵ https://docs.python.org/3.3/reference/lexical_analysis.html

¹⁶ https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Identifiers

¹⁷ https://packaging.python.org/tutorials/installing-packages

¹⁸ https://google.github.io/styleguide/Rguide.xml

¹⁹ https://cran.r-project.org/web/packages/rockchalk/vignettes/Rchaeology.pdf

²⁰ https://en.wikipedia.org/wiki/Depth-first_search