Upon the feedbacks and requests from readers of the first few chapters, I made a decision to spend a whole chapter in the introduction of data.table and pandas. Of course, there are many other great data analysis tools in both R and Python. For example, many R users like using dplyr to build up data analysis pipelines. The performance of data.table is superior and that is the main reason I feel there is no need to use other tools for the same tasks in R. But if you are a big fan for the pipe operator \%>\% you may use data.table and dplyr together. Regarding the big data ecosystem, Apache Spark has API in both R and Python. Recently, there are also some emerging projects aiming at better usability and performance, such as Apache Arrow1, Modin2.
SQL
Similar to the previous chapters, I will introduce the tools side by side. However, I feel before diving into the world of data.table and pandas, it is better to talk a little bit about SQL3. SQL is a Query language designed for managing data in relational database management system (RDBMS). Some of the most popular RDBMSs include MS SQL Server, MySQL, PostgreSQL, etc. Different RDBMSs may use SQL languages with major or subtle differences.
If you have never used RDBMS you may wonder why we need it?
first, we need a system to store the data;
second, we also need a system that allows us to easily access, manage and update the data.
Let’s assume there is a table mtcars in a database (I’m using sqlite3 in this book) and see some simple tasks we can do with SQL queries.
mtcars data loaded from R 3.5.1
name
mpg
cyl
disp
hp
drat
wt
qsec
vs
am
gear
carb
Mazda RX4
21
6
160
110
3.9
2.62
16.46
0
1
4
4
Mazda RX4 Wag
21
6
160
110
3.9
2.875
17.02
0
1
4
4
Datsun 710
22.8
4
108
93
3.85
2.32
18.61
1
1
4
1
Hornet 4 Drive
21.4
6
258
110
3.08
3.215
19.44
1
0
3
1
Hornet Sportabout
18.7
8
360
175
3.15
3.44
17.02
0
0
3
2
Valiant
18.1
6
225
105
2.76
3.46
20.22
1
0
3
1
Duster 360
14.3
8
360
245
3.21
3.57
15.84
0
0
3
4
Merc 240D
24.4
4
146.7
62
3.69
3.19
20
1
0
4
2
Merc 230
22.8
4
140.8
95
3.92
3.15
22.9
1
0
4
2
Merc 280
19.2
6
167.6
123
3.92
3.44
18.3
1
0
4
4
Merc 280C
17.8
6
167.6
123
3.92
3.44
18.9
1
0
4
4
Merc 450SE
16.4
8
275.8
180
3.07
4.07
17.4
0
0
3
3
Merc 450SL
17.3
8
275.8
180
3.07
3.73
17.6
0
0
3
3
Merc 450SLC
15.2
8
275.8
180
3.07
3.78
18
0
0
3
3
Cadillac Fleetwood
10.4
8
472
205
2.93
5.25
17.98
0
0
3
4
Lincoln Continental
10.4
8
460
215
3
5.424
17.82
0
0
3
4
Chrysler Imperial
14.7
8
440
230
3.23
5.345
17.42
0
0
3
4
Fiat 128
32.4
4
78.7
66
4.08
2.2
19.47
1
1
4
1
Honda Civic
30.4
4
75.7
52
4.93
1.615
18.52
1
1
4
2
Toyota Corolla
33.9
4
71.1
65
4.22
1.835
19.9
1
1
4
1
Toyota Corona
21.5
4
120.1
97
3.7
2.465
20.01
1
0
3
1
Dodge Challenger
15.5
8
318
150
2.76
3.52
16.87
0
0
3
2
AMC Javelin
15.2
8
304
150
3.15
3.435
17.3
0
0
3
2
Camaro Z28
13.3
8
350
245
3.73
3.84
15.41
0
0
3
4
Pontiac Firebird
19.2
8
400
175
3.08
3.845
17.05
0
0
3
2
Fiat X1-9
27.3
4
79
66
4.08
1.935
18.9
1
1
4
1
Porsche 914-2
26
4
120.3
91
4.43
2.14
16.7
0
1
5
2
Lotus Europa
30.4
4
95.1
113
3.77
1.513
16.9
1
1
5
2
Ford Pantera L
15.8
8
351
264
4.22
3.17
14.5
0
1
5
4
Ferrari Dino
19.7
6
145
175
3.62
2.77
15.5
0
1
5
6
Maserati Bora
15
8
301
335
3.54
3.57
14.6
0
1
5
8
Volvo 142E
21.4
4
121
109
4.11
2.78
18.6
1
1
4
2
SQL
In the example above, I select two rows from the table using the syntax select from. The keyword limit in sqlite specifies the number of rows to return. In other RMDBSs, we may need to use top instead.
It is straightforward to select on conditions with the where keyword.
SQL
We can definitely use more conditions with the where clause.
SQL
Staring at the example above, what are we doing? Actually, we are just accessing specific rows and columns from the table in database with select from where.
We can also do something a bit of more fancy, for example, to get the maximum, the minimum and the average of mpg for all vehicles grouped by the number of cylinders.
SQL
In the above example, there are a few things worth noting. We use as to create an alias for a variable; we group the original rows by the number of cylinders with the keyword group by; and we sort the output rows with the keyword order by. max, min and avg are all built-in functions that we can use directly.
It is also possible to have user-defined functions in SQL as what we usually do in other programming languages.
SQL is a very powerful tool for data analysis, but it works on RMDBS and generally we can’t apply R or Python functions to the database tables directly. Many practitioners in data science have to work with database at times but more often they need to work in a programming languages such as R or Python. We have introduced the data.frame in both R and Python in previous chapters. A data.frame is just like a database table that you may operate within the corresponding language. Usually a data.frame is stored in memory, but of course it can also be deserialized for storage in hard disks.
With the data.frame-like objects, we could build up a better data processing pipeline by only reading the original data from the database and storing the final output to the database if necessary. Most of works we can do with data.frame-like objects may also be done in RMBDS with SQL. But they may require intense interactions with a database, which is not preferred if they could be avoided.
Now, let’s get started with data.table and pandas. In this book, I will use data.table 1.12.0 and pandas 0.24.0.
Get started with data.table & pandas
R
Python
I have saved the mtcars data to a csv4 file (code/chapter3/mtcars.csv), which is loaded from R 3.5.1. Although the mtcars data is loaded into R environment by default, let’s load the data by reading the raw csv file for learning purpose.
RPython
The type of mtcars_dt is data.table, not data.frame. Here we use the fread function from data.table to read a file and the output type is a data.table directly. Regarding reading csv in R, a very good package is readr for very large files, but the output has a data.frame type. In practice, it is very common to convert a data.frame to data.table with the function as.data.table.
Indexing & selecting data
Before to introduce the indexing rules in data.table and pandas, it’s better to understand the key in data.table and the index in pandas.
What is the key in a data.table? We have talked about RMDBS and SQL in the previous section. With select from where we can easily access specific rows satisfying certain conditions. When the database table is too large, a database index is used to improve the performance of data retrieval operations. Essentially, a database index is a data structure, and to maintain the data structure additional cost (for example, space) may be required. The reason to use key in data.table and index in pandas is very similar. Now let’s see how to set key and index.
RPython
There are quite a few things worth noting from the above code snippets. When we use the setkey function the quotes for the column name is optional. So setkey(mtcars_dt, name) is equivalent to setkey(mtcars_dt, 'name'). But in pandas, quotes are required. The effect of setkey is in place, which means no copies of the data made at all. But in pandas, by default set_index set the index on a copy of the data and the modified copy is returned. Thus, in order to make the effect in place, we have the set the argument inplace=True explicitly. Another difference is that setkey would sort the original data in place automatically but set_index does not. It’s also worth noting every pandas data.frame has an index; and by default it is numpy.arange(n) where n is the number of rows. But there is no default key in a data.table.
In the above example, we only use a single column as the key/index. It is possible to use multiple columns as well.
R
Python
To use multiple columns as the key in data.table, we use the function setkeyv. It is also interesting that we have to use index.names rather than index.name to get the multiple column names of the index (which is called MultiIndex) in pandas. There are duplicated combinations of (cyl, gear) in the data, which implies key or index could be duplicated.
Once the key/index set, we can access rows with given the indices fast.
R
Here is a bit of explanation for the code above. We can simply use [] to access the rows with the specified key values if the key has a character type. But if the key has a numeric type, list() is required to enclose the key values. In data.table, .() is just an alias of list(), which means we would get the same results with mtcars_dt[list(6,4)]. Of course, we can use also do mtcars_dt[.('Merc 230')] which is equivalent to mtcars_dt[.('Merc 230')].
Python
Compared to data.table, we need to use the loc method when accessing rows based on index. The loc method also takes boolean conditions.
Python
When using boolean conditions, loc could be ignored for convenience.
Python
If the key/index is not needed, we can remove the key or reset the index. For data.table we can set a new key to override the existing one which then becomes a column. But in pandas, set_index method removes the exiting index which also disappears from the data.frame.
RPython
In Chapter 1, we introduced integer-based indexing for list/vector. It is also applicable to data frame and data.table
RPython
So far we have seen how to access specific rows. What about columns? Accessing columns in data.table and pandas is quite straightforward. For data.table, we can use \$ sign to access a single column or a vector to specify multiple columns inside []. For data.frame in pandas, we can use . to access a single column or a list to specify multiple columns inside [].
R
Python
In addition to passing a vector to access multiple columns in data.table, we can also use the .(variable_1,variable_2,...).
It is also possible to do integer-based columns slicing.
R
Python
To access specific rows and specific columns, there are two strategies:
select rows and then select columns in a chain;
select rows and columns simultaneously.
Let’s see some examples.
RPython
As we have seen, using the setkey function for data.table sorts the data.table automatically. However, sorting is not always desired. In data.table, there is another function setindex/setindexv which has similar effects to setkey/setkeyv but doesn’t sort the data.table. In addition, one data.table could have multiple indices, but it cannot have multiple keys.
R
Add/Remove/Update
First, let’s see how to delete a single column.
R
Python
The := operator in data.table can be used to add/remove/update columns, by reference. Thus, when we use := no copies of the data is created. Getting familiar with this operator is critical to master data.table.
Next, let’s see how to delete multiple columns at the same time.
R
Python
The interesting fact of the code above is that in R the \%in\% function is vectorized, but the in function in Python is not.
Adding a single column to an existing data.table or DataFrame is as straightforward as removing.
RPython
Adding multiple columns is a bit of tricky compared with adding a single column.
RPython
In the R code, we use `:=` to create multiple columns. In the Python code, we put the new columns into a dictionary and use the assign function with the dictionary unpacking operator **. To learn the dictionary unpacking operator, please refer to official document5. The assign method of a DataFrame doesn’t have inplace argument so we need to assign the modified DataFrame to the original one explicitly.
Now let’s see how to update values. We can update the entire column or just the column on specific rows.
RPython
We can also combine the technique of rows indexing with column update.
RPython
In addition to :=, we can also use set function to modify values in a data.table. When used properly, the performance gain could be significant. Let’s a fictional use case.
R
We see that updating the values with the set function in this example is as fast as updating the values in an array.
Group by
At the beginning of this Chapter, we have seen an example with group by in SQL query. group by is a very powerful operation, and it is also available in data.table and pandas. Let’s try to get the average mpg grouped by cyl.
RPython
Group by also works on multiple columns.
RPython
We can also create multiple columns with group by. My feeling is data.table is more expressive.
RPython
In data.table, there is also a keyword called keyby which enables group by and sort operations together.
R
It is even possible to add expressions after the by keyword.
R
Join
Join6 combines columns from one or more tables for RMDBs. We also have the Join operation available in data.table and pandas. We only talk about 3 different types of joins here, .i.e., inner join, left join, and right join. Left join and right join are also referred as outer join.
Let’s make two tables to join.
RPython
We can join tables with or without the help of index/key. In general, joining on index/key is more fast. Thus, I recommend always to set key/index for join operations.
R
To join data.table A and B, the syntax A[B] and B[A] only work when the keys of A and B are set. The function merge from data.table package works with or without key.
R
The join operations for data.table are summarized in table below.
Syntax of join in data.table
type
syntax 1
syntax 2
A inner join B
A[B, nomatch=0] or B[A, nomatch=0]
merge(A, B, all.x=FALSE, all.y=FALSE)
A left join B
B[A]
merge(A, B, all.x=TRUE, all.y=FALSE)
B right join A
A[B]
merge(A, B, all.x=FALSE, all.y=TRUE)
In pandas, there are also different ways to join DataFrame. Let’s just focus on the basic method with the merge function (other methods may also be based on this function).
Python
We have learned the very basics of data.table and pandas. In fact, there are lots of other useful features in both tools which are not covered in this chapter. For example, the .I/.N symbol in data.table, and the stack/unstack method in pandas.