Machine learning is such a huge topic that we can only touch on a bit of it in this book. There are different learning paradigms.
Basically, supervised learning refers to the learning task in which we have input data and output data, and the purpose is to learn a functional map from the input data to the output data. In contrast to supervised learning, there is unsupervised learning, in which only the input data is available. In addition to supervised/unsupervised learning, there are also other learning paradigms, such as reinforcement learning 20 and transfer learning 21.
There is no strict distinction between different learning paradigms. For example, semi-supervised learning 22 falls between supervised learning and unsupervised learning, and the supervised learning paradigm can also be found inside some modern reinforcement learning algorithms.
In this book, we have talked about quite a few supervised learning models, such as linear regression and its variants, and logistic regression. We will also spend some effort on tree-based models in this chapter. There are many other interesting supervised learning models that we don’t cover in this book, such as the support vector machine 23 and linear discriminant analysis 24. Almost all well-known supervised learning models’ implementations can be found online, and I recommend learning from reading the source code. As for the usage, there is an abundance of off-the-shelf options in both R and Python. Algorithm-wise, my favorite supervised learning models include linear models, gradient boosting trees, and (deep) neural network models. Linear models are simple with great interpretability, and it is rare that gradient boosting tree models or deep neural network models cannot match the prediction accuracy of other models for a specific problem.
A population is a complete set of elements of interest for a specific problem. It is usually defined by the researcher of the problem. A population can be either finite or infinite. For example, if we define the set of real numbers as our population, it is infinite. But if we are only interested in the integers between 1 and 100, then we get a finite population.
A random sample is a set of elements selected from a population. The size of a random sample could be larger than the population since an element can be selected multiple times, i.e., when sampling with replacement.
Why do we care about random samples? Because we are interested in the population from which the random sample is taken, and the random sample helps us make inferences about the population.
In predictive modeling, each element in the population has a set of attributes which are usually called features or covariates, as well as a label. For example, a bank may use the mortgage applicant’s personal information (FICO score, years of employment, debt to income ratio, etc.) as the covariates, and the status of the mortgage (default, or paid off) as the label. A predictive model can be used to predict the final status of the mortgage for a new applicant; such models are classification models. When a mortgage is in default status, the applicant may have already made partial payments. Thus, a model to predict the amount of loss for the bank is also useful; such models are regression models.
But why do we need a predictive model? If we knew the labels of the entire population, there would be nothing to learn. All we would need is a database table or a dictionary (HashMap) for lookup. The issue is that many problems in the real world don’t allow us to have the labels of the entire population. And thus, we need to learn or infer based on the random sample collected from the unknown population.
The universal approximation theorem says that a single hidden layer neural network can approximate any continuous function ($\mathbf{R}^n\rightarrow\mathbf{R}$) with a sufficient number of neurons under mild assumptions on the activation function (for example, the sigmoid activation function)1. There are also other universal approximators, such as decision trees.
Not all machine learning models can approximate universally, for example, linear regression without polynomial terms. If we had the data from the entire population, we could fit the population with a universal approximator. But as we have discussed earlier, when the entire population is available there is no need to fit a machine learning model for prediction. But if only a random sample is available, is a universal approximator still the best choice? It depends.
One of the risks of fitting a random sample with a predictive model is overfitting (see the figure below). Using universal approximators as the predictive model may even amplify such risk.
To mitigate the risk of overfitting, we usually use cross-validation to assess how accurately a predictive model is able to predict for unseen data. The following steps specify how to perform cross-validation.
Cross-validation can also be considered a metaheuristic algorithm in the sense that it is not problem-specific: it doesn’t matter what type of predictive model we use.
There are some ready-to-use tools in R and Python modules for cross-validation. But for pedagogical purposes, let’s do it step by step. We do the cross-validation using the Lasso regression we built in the previous chapter on the Boston dataset.
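As a rough illustration of the procedure, here is a self-contained sketch in which scikit-learn’s Lasso and synthetic data stand in for the book’s own Lasso implementation and the Boston data; the names and numbers are for illustration only.

```python
# Sketch: 5-fold cross-validation of a Lasso regression (scikit-learn's Lasso
# and synthetic data stand in for the book's own implementation on Boston).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
rmse_per_fold = []
for train_idx, test_idx in kf.split(X):
    model = Lasso(alpha=0.1)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    # the error on the hold-out fold estimates performance on unseen data
    rmse_per_fold.append(np.sqrt(np.mean((pred - y[test_idx]) ** 2)))

print("5-fold RMSE:", [round(v, 2) for v in rmse_per_fold])
```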
Running the code snippets, we have the $5$-fold cross-validation accuracy as follows.
We add the line sys.path.append("..") in the Python code, otherwise it would throw an error because of the import mechanism2.
In the example above, we measure the root mean squared error (RMSE) as the accuracy metric of the linear regression model. There are various metrics to evaluate the accuracy of predictive models.
For regression models, RMSE, mean absolute error (MAE)3 and mean absolute percentage error (MAPE)4 are some of the commonly-used evaluation metrics. You may have heard of the coefficient of determination ($R^2$ or adjusted $R^{2}$) in Statistics. But from a predictive modeling perspective, $R^{2}$ is not a metric that evaluates the predictive power of the model since its calculation is based on the training data. What we are actually interested in is the model performance on the unseen data. In Statistics, goodness of fit is a term to describe how well a model fits the observations, and $R^{2}$ is one of these measures of goodness of fit. In predictive modeling, we care more about the error of the model on the unseen data, which is called the generalization error. But of course, it is possible to calculate the counterpart of $R^{2}$ on testing data.
| metric | formula |
| --- | --- |
| RMSE | $\sqrt{\frac{\sum_{i=1}^{n}(\hat{y}_{i}-y_{i})^{2}}{n}}$ |
| MAE | $\frac{\sum_{i=1}^{n}\lvert\hat{y}_{i}-y_{i}\rvert}{n}$ |
| MAPE | $\frac{1}{n}\sum_{i=1}^{n}\big\lvert\frac{\hat{y}_{i}-y_{i}}{y_{i}}\big\rvert$ |
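For illustration, these metrics can be computed in a few lines of numpy (the numbers below are made up).

```python
# Sketch: computing RMSE, MAE and MAPE with numpy (made-up numbers).
import numpy as np

y_true = np.array([3.0, 2.5, 4.0, 5.5])
y_pred = np.array([2.8, 2.9, 4.4, 5.0])

rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
mae = np.mean(np.abs(y_pred - y_true))
mape = np.mean(np.abs((y_pred - y_true) / y_true))   # assumes no zeros in y_true

print(rmse, mae, mape)
```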
The most intuitive metric for classification models is the accuracy, which is the percentage of correctly classified instances. To calculate the accuracy, we need to assign a label to each instance to classify. Recall the logistic regression we introduced in Chapter 5: the direct output of a logistic regression consists of probabilities rather than labels. In that case, we need to convert the probability output to a label for the accuracy calculation. For example, consider a classification problem where the possible labels of each instance are $0$, $1$, $2$ and $3$. If the predicted probabilities are $0.2$, $0.25$, $0.5$, $0.05$ for labels $0$, $1$, $2$ and $3$ respectively, then the predicted label is $2$ since its corresponding probability is the largest.
But actually we don’t always care about the labels of an instance. For example, a classification model for mortgage default built in a bank may only be used to calculate the expected monetary loss. Another example is a recommendation system that predicts probabilities which are used for ranking items. In such cases, the model performance could be evaluated by logloss, AUC, etc., using the output probabilities directly. We have seen in Chapter 5 that the loss function of logistic regression is the log-likelihood function. Actually, logloss is just the negative log-likelihood averaged over the testing data, and thus it can also be used for classification models with more than 2 classes (labels) because the likelihood function is not restricted to the Bernoulli distribution (the extension of the Bernoulli distribution to more than two outcomes is called the categorical distribution5). Another name for logloss is cross-entropy loss.
In practice, AUC (Area Under the ROC Curve) is a very popular evaluation metric for binary classification problems. AUC is bounded between 0 and 1. A perfect model leads to an AUC equal to 1. If a model’s predictions are $100\%$ wrong, the resulting AUC is equal to 0. But if we knew a binary classification model always made $100\%$ wrong predictions, we could instead use $1-\hat{y}$ as the corrected prediction, and as a result we would get a perfect model with an AUC of 1 based on the corrected predictions. Actually, a model that makes completely random guesses leads to an AUC equal to 0.5. Thus, in practice the evaluated AUC is usually between 0.5 and 1.
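As a quick illustration, logloss and AUC can be computed from predicted probabilities with sklearn.metrics (the numbers below are made up).

```python
# Sketch: evaluating a binary classifier with logloss and AUC computed from
# the predicted probabilities (made-up numbers).
from sklearn.metrics import log_loss, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.4, 0.8, 0.65, 0.9, 0.3]   # predicted probability of class 1

print("logloss:", log_loss(y_true, y_prob))
print("AUC:", roc_auc_score(y_true, y_prob))
```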
There are also many other metrics, such as recall, precision, and the F1 score6.
The selection of evaluation metrics in predictive modeling is important but also subjective. Sometimes we may also need to define a customized evaluation metric.
Many evaluation metrics can be found in the R package Metrics and the Python module sklearn.metrics.
According to the explanation of feature engineering7 on Wikipedia, feature engineering is the process of using domain knowledge to create new features based on existing features. In reality, it is not rare to see feature engineering lead to better prediction accuracy, and sometimes I use feature engineering too. But I think feature engineering should become less and less useful in the future as machine learning algorithms become more and more intelligent.
Let’s consider three features $x_{1},x_{2}$ and $x_{3}$ and assume the actual model is specified as $y=f(x_{1},g(x_{2}, x_{3}))$. After all, $y$ is still a function of $x_{1},x_{2}$ and $x_{3}$, and we can write it as $y=h(x_{1},x_{2}, x_{3})$. Thus, even without creating the new feature $x_{4}=g(x_{2}, x_{3})$ explicitly, a universal approximator should ideally be able to learn (i.e., approximate) the function $h$ from the data. This idea is also supported by the Kolmogorov–Arnold representation theorem8, which says that any continuous real-valued multivariate function can be written as a finite composition of continuous functions of a single variable and addition:
$$
\begin{equation}
f(x_{1},\ldots,x_{m})=\sum_{q=0}^{2m} \Phi_{q}\Big(\sum_{p=1}^{m} \phi_{q,p} (x_{p})\Big) .
\label{eq:ka}
\end{equation}
$$
As of today, machine learning algorithms are not that intelligent, so it is still worth trying feature engineering, especially when domain knowledge is available.
If you are familiar with dimension reduction, embedding can be considered as something similar. Dimension reduction aims at reducing the dimension of $\boldsymbol{X}$. It sounds interesting and promising if we can transform a high-dimensional dataset into a low-dimensional dataset and feed the low-dimensional data to the machine learning model. However, I don’t think this is a good idea in general because it is not guaranteed that the low-dimensional predictors keep all the information related to the response variable. Actually, many machine learning models are capable of handling high-dimensional predictors directly.
Embedding also transforms the features into a new space, which usually has a lower dimension. But generally embedding is not done by the traditional dimension reduction techniques (for example, principal component analysis). In natural language processing, a word can be embedded into a vector space by word2vec9 (or other techniques). When an instance is associated with an image, we may consider using an autoencoder10 to encode/embed the image into a lower-dimensional space, which is usually achieved by (deep) neural networks.
Collinearity is one of the cliches in machine learning. For non-linear models, collinearity is usually not a problem. For linear models, I recommend reading this discussion11 to see when it is not a problem.
We have seen how the Lasso solutions of linear models can be used for feature selection in Chapter 5. What about non-linear models? There are some model-specific techniques for feature selection. Also, there is a metaheuristic approach to select features – cross-validation. Specifically, we can try different combinations of the features and use cross-validation to select the set of features which results in the best cross-validation evaluation metric. However, the major problem of this approach is its efficiency. When the number of features is too large, it is impossible to try all different combinations with limited computational resources. Thus, it is better to use the model-specific feature selection techniques in practice.
To tune model parameters, such as the $\lambda$ in Lasso, we can also use cross-validation. But again, the efficiency is our major concern.
Can we have feature selection and parameter tuning done automatically? Actually, automated machine learning has been a hot research topic in both academia and industry.
A decision tree consists of a bunch of nodes. In a decision tree there is a node with no parent node, which is called the root node. A node without any children is called a leaf node.
The length (number of nodes) of the longest path from the root node to a leaf node is called the depth of the tree. For example, the depth of the tree above is 4. Each leaf node has a label. In regression tasks, the label is a real number; in classification tasks the label could be a real number which is used to get the class indirectly (for example, fed into a sigmoid function to get the probability), or an integer representing the predicted class directly.
Each node except the leaves in a decision tree is associated with a splitting rule. These splitting rules determine to which leaf an instance belongs. A rule is just a function that takes a feature as input and returns true or false as output. For example, a rule on the root could be $x_{1}<0$; if it is true, we go to the left child node, otherwise we go to the right child node. Once we arrive at a leaf, we get the predicted value based on the label of the leaf.
To get a closer look, let’s try to implement a binary tree structure for regression tasks in R/Python from scratch.
Let’s implement the binary tree as a recursive data structure, which is composed partially of similar instances of the same data structure. More specifically, a binary tree can be decomposed into three components, i.e., its root node, the left subtree under the root, and the right subtree of the root. To define a binary (decision) tree, we only need to define these three components. And to define the left and right subtrees, this decomposition is applied recursively until the leaves.
Now we have the big picture of how to define a binary tree. However, to make the binary tree a decision tree, we also need to define the splitting rules. For simplicity, we assume there is no missing value in our data and all variables are numeric. Then the splitting rule of a node is composed of two components, i.e., the variable to split on, and the corresponding breakpoint for splitting.
There is one more component we need to define in a decision tree, that is, the predict method which takes an instance as input and returns the prediction.
Now we are ready to define our binary decision tree.
chapter7/tree.R
chapter7/tree.py
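The full implementations live in the two files above; as a rough sketch of the structure (the attribute names split_variable, split_breakpoint and value are illustrative and may differ from the book’s code), a recursive binary decision tree with a predict method and a depth property could look like this.

```python
# A rough sketch of a recursive binary decision tree for regression
# (illustrative names; the book's chapter7/tree.py may differ in detail).
class Tree:
    def __init__(self, split_variable=None, split_breakpoint=None,
                 value=None, left=None, right=None):
        self.split_variable = split_variable      # index of the feature to split on
        self.split_breakpoint = split_breakpoint  # breakpoint of the splitting rule
        self.value = value                        # label of the node when it is a leaf
        self.left = left                          # left subtree
        self.right = right                        # right subtree

    @property
    def is_leaf(self):
        return self.left is None and self.right is None

    @property
    def depth(self):
        # depth = number of nodes on the longest path from this node to a leaf
        if self.is_leaf:
            return 1
        left_depth = self.left.depth if self.left else 0
        right_depth = self.right.depth if self.right else 0
        return 1 + max(left_depth, right_depth)

    def predict(self, x):
        # follow the splitting rules until a leaf is reached
        if self.is_leaf:
            return self.value
        if x[self.split_variable] < self.split_breakpoint:
            return self.left.predict(x)
        return self.right.predict(x)


# usage: a tiny pseudo tree of depth 2
tree = Tree(split_variable=0, split_breakpoint=0.0,
            left=Tree(value=-1.0), right=Tree(value=1.0))
print(tree.depth)            # 2
print(tree.predict([-0.5]))  # -1.0
```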
You may have noticed the usage of @property in our Python implementation. It is one of the built-in decorators in Python. We won’t talk too much about decorators. Basically, adding this decorator makes the method depth behave like a property, in the sense that we can access self.depth instead of calling self.depth() to get the depth.
In the R implementation, invisible(self) is returned in the print method, which may seem strange. It is due to an issue12 of the R6 class related to the S3 dispatch mechanism, which is not introduced in this book.
The above implementation doesn’t involve the training or fitting of the decision tree. In this book, we won’t talk about how to fit a traditional decision tree model due to its limited usage in the context of modern machine learning. Let’s see how to use the decision tree structure we defined above by creating a pseudo decision tree illustrated below.
It’s worth noting that decision trees can approximate universally.
What is a gradient boosting machine (GBM) (or gradient boosting regression)? Essentially, a GBM is just a forest of decision trees.
If you have heard of random forest (RF), you may know that a random forest is also a bunch of trees. What is the difference between a GBM and RF?
Looking at the fitted trees from a RF and a GBM, there is no way to tell whether the trees were fitted by a RF or a GBM. The major difference is how these trees are trained, rather than the trees themselves. A minor difference is how these trees are used for prediction. In many RF implementations, the prediction for an instance is the average prediction of the trees within the forest. If it is a classification task, there are two ways to get the final prediction – (a) predict the class with majority voting directly, i.e., the predicted class is the one with the highest frequency among the predicted classes of all trees; (b) predict the probability based on the frequencies, for example, if three out of five trees output class 1 then the predicted probability of class 1 is equal to $3/5$. In many GBM implementations, the prediction (for regression tasks) is the sum of the predictions of the individual trees.
GBM fits trees sequentially, but RF fits trees independently. The obvious advantage of fitting trees independently is that it can be done in parallel. But accuracy-wise, GBM usually performs better according to my limited experience.
We have seen the structure of a single decision tree in GBM. Now it’s time to see how to get these trees fitted in GBM. Let’s start from the first tree.
To grow a tree, we start from its root node. In GBM fitting, usually we pre-determine a maximum depth $d$ for each tree to grow, and the final tree’s depth may be equal to or less than the maximum depth $d$. At a high level, the tree is grown in a recursive fashion. Specifically, first we attempt to split the current node, and if the splitting improves the performance we grow the left subtree and the right subtree under it. When we grow the left subtree, its maximum depth is $d-1$, and the same applies to the right subtree. We can define a tree grow function for this purpose which takes a root node $Node_{root}$ (initially also a leaf) as input. The pseudocode of the tree grow function is illustrated below.
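A minimal Python-flavored sketch of this recursion is given below; the split_node helper, which decides whether a beneficial split exists and constructs the two child leaves, is assumed here and discussed later in this section.

```python
# Python-flavored pseudocode of the recursive tree-grow procedure.
# `split_node` is an assumed helper: given a leaf node, it returns None when no
# beneficial split exists, or the two child leaves produced by the best split.
def grow_tree(node, max_depth, split_node):
    """Grow the (sub)tree rooted at `node` up to `max_depth` levels, DFS-style."""
    if max_depth <= 1:
        return node                           # maximum depth reached; stay a leaf
    children = split_node(node)
    if children is None:
        return node                           # splitting does not improve the loss
    node.left, node.right = children
    grow_tree(node.left, max_depth - 1, split_node)    # grow the left subtree
    grow_tree(node.right, max_depth - 1, split_node)   # grow the right subtree
    return node
```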
In fact, the algorithm of tree growing is just a DFS algorithm.
To complete the pseudo algorithm above, we need a split function which takes a leaf node as input and returns a boolean value as output. If true is returned, we do the splitting, i.e., we grow the left/right subtrees. So now the challenge is how to define the split function, which requires understanding the theory behind GBM.
Similar to other regression models we have seen so far, GBM with $K$ trees has a loss function which is defined below.
$$
\begin{equation}
\mathcal{L}=\sum_{i=1}^{n} {(y_{i}-\hat{y}_{i})^{2}},
\label{eq:gbm0}
\end{equation}
$$
where
$$
\begin{equation}
\hat{y}_{i} = \sum_{t=1}^{K} {f_{t}(\boldsymbol{x}_{i})}.
\label{eq:treepred}
\end{equation}
$$
$f_{t}$ denotes the prediction of the $t^{th}$ tree in the forest. As we mentioned previously, the fitting is done sequentially. When we fit the $t^{th}$ tree, all the previous $t-1$ trees are fixed, and the loss function for fitting the $t^{th}$ tree is given below.
$$
\begin{equation}
\mathcal{L}^{(t)}=\sum_{i=1}^{n} {(y_{i}- \sum_{l=1}^{t-1} {f_{l}(\boldsymbol{x}_{i})} - f_{t}(\boldsymbol{x}_{i}))^{2}} .
\label{eq:gbm_t}
\end{equation}
$$
In practice, a regularized loss function \eqref{eq:treepred1} is used instead of \eqref{eq:gbm_t} to reduce overfitting.
$$
\begin{equation}
\mathcal{L}^{(t)}=\sum_{i=1}^{n} {(y_{i}- \sum_{l=1}^{t-1} {f_{l}(\boldsymbol{x}_{i})} - f_{t}(\boldsymbol{x}_{i}))^{2}} + \Phi(f_{t}).
\label{eq:treepred1}
\end{equation}
$$
Let’s follow the paper13 and use the number of leaves as well as the L2 penalty of the values (also called weights) of the leaves for regularization. The loss function then becomes
$$
\begin{equation}
\mathcal{L}^{(t)}=\sum_{i=1}^{n} {(y_{i}- \sum_{l=1}^{t-1} {f_{l}(\boldsymbol{x}_{i})} - f_{t}(\boldsymbol{x}_{i}))^{2}} + \gamma T + \frac 1 2 {\lambda \sum_{j=1}^{T}{\omega_{j}^{2}}} ,
\label{eq:growtree0}
\end{equation}
$$
where $T$ is the number of leaves of the current tree and $\omega_{j}$ is the value associated with its $j^{th}$ leaf.
Again, we get an optimization problem, i.e., to minimize the loss function \eqref{eq:growtree0}. The minimization problem can also be viewed as a quadratic programming problem. However, it seems different from the other optimization problems we have seen before, in the sense that the decision tree $f_{t}$ is a non-parametric model. A model is non-parametric if the model structure is learned from the data rather than pre-determined.
A common approach used in GBM is the second order approximation. By second order approximation, the loss function becomes
$$
\begin{equation}
\mathcal{L}^{(t)}\approx \sum_{i=1}^{n} {(y_{i} - \sum_{l=1}^{t-1} {f_{l}(\boldsymbol{x}_{i})})^2} + \sum_{i=1}^{n} {(g_{i}f_{t}(\boldsymbol{x}_{i}) +\frac 1 2 {h_{i}f_{t}(\boldsymbol{x}_{i})^{2}})} + \gamma T + \frac 1 2 {\lambda \sum_{j=1}^{T}{\omega_{j}^{2}}} ,
\label{eq:growtree}
\end{equation}
$$
where $g_{i}=2(\sum_{l=1}^{t-1} {f_{l}(\boldsymbol{x}_{i})} - y_{i})$ and $h_{i}=2$ are the first and second order derivatives of the loss term $(y_{i}- \sum_{l=1}^{t-1} {f_{l}(\boldsymbol{x}_{i})}-f_{t}(\boldsymbol{x}_{i}))^{2}$ with respect to $f_{t}(\boldsymbol{x}_{i})$, evaluated at $f_{t}(\boldsymbol{x}_{i})=0$.
Let’s implement the function to calculate $g$ and $h$ and put them into util.py.
chapter7/util.py
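For the squared error loss used here, a minimal sketch of such a helper could be the following (the function name gh_lm and its signature are illustrative).

```python
# util.py (sketch): first and second order derivatives of the squared error
# loss with respect to the current prediction; `pred` is the accumulated
# prediction of the previously fitted trees.
import numpy as np

def gh_lm(actual, pred):
    g = 2.0 * (pred - actual)            # first order derivative
    h = 2.0 * np.ones_like(actual)       # second order derivative (a constant)
    return g, h
```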
Since the first term in \eqref{eq:growtree} is a constant, let’s ignore it.
$$
\begin{equation}
\mathcal{L}^{(t)}\approx\sum_{i=1}^{n} {(g_{i}f_{t}(\boldsymbol{x}_{i}) +\frac 1 2 {h_{i}f_{t}(\boldsymbol{x}_{i})^{2}})} + \gamma T + \frac 1 2 {\lambda \sum_{j=1}^{T}{\omega_{j}^{2}}} .
\label{eq:growtree1}
\end{equation}
$$
Let’s think about the prediction of the current tree for an instance. The training instances fall under the leaves of the tree, and thus the prediction for an instance is the value $\omega$ associated with the leaf that the instance falls into. Based on this fact, the loss function can be further rewritten as follows,
$$
\begin{equation}
\mathcal{L}^{(t)} \approx \sum_{j=1}^{T} \Big(\omega_{j} \sum_{i\in I_{j}} g_{i} + \frac 1 2 \omega_{j}^{2} \big( \sum_{i\in I_{j}} h_{i} +\lambda \big) \Big) + \gamma T ,
\label{eq:growtree2}
\end{equation}
$$
where $I_{j}$ denotes the set of training instances falling into the $j^{th}$ leaf. When the structure of the tree is fixed, the loss function \eqref{eq:growtree2} is a quadratic convex function of $\omega_{j}$, and the optimal solution can be obtained by setting the derivative to zero.
$$
\begin{equation}
\omega_{j}=- \frac {\sum_{i\in I_{j} } g_{i}} {\sum_{i\in I_{j}} h_{i} +\lambda}
\label{eq:omega}
\end{equation}
$$
Plugging \eqref{eq:omega} into the loss function results in the minimal loss of the current tree structure
$$
\begin{equation}
-\frac 1 2 \sum_{j=1}^{T} {\frac {(\sum_{i\in I_{j} }g_{i})^{2}} {\sum_{i\in I_{j}}h_{i} +\lambda} } + \gamma T .
\label{eq:minimalloss}
\end{equation}
$$
Now let’s go back to the split function required in the tree grow function discussed previously. How do we determine whether to split a leaf? \eqref{eq:minimalloss} gives the solution – we can calculate the loss reduction after splitting, which is given below
$$
\begin{equation}
\frac 1 2 {\bigg(\frac {(\sum_{i\in I_{left}} g_{i})^{2}} {\sum_{i\in I_{left}} h_{i}+ \lambda } + \frac {(\sum_{i\in I_{right}} g_{i})^{2}} {\sum_{i\in I_{right}} h_{i} + \lambda} - \frac {(\sum_{i\in I} g_{i})^{2}} {\sum_{i\in I} h_{i}+\lambda} \bigg)} - \gamma .
\label{eq:lossreduction}
\end{equation}
$$
If the loss reduction is positive, the split function returns true; otherwise it returns false.
So far, we have a few ingredients ready to implement our own GBM, which are listed below:
- the Tree class defined;
- the functions to calculate $g$ and $h$;
- the loss reduction \eqref{eq:lossreduction} on which the split function is based.

However, there are a few additional items we need to go through before the implementation.
In Chapter 6, we have seen how stochasticity works in iterative optimization algorithms. Stochasticity can be applied both instance-wise and feature-wise, and this technique is very important in the fitting of GBM. More specifically, we apply stochasticity both instance-wise and feature-wise. Instance-wise, when we fit a new tree, we randomly draw a subsample from the training sample. Feature-wise, we randomly select a few features/variables when we fit a new tree. The stochasticity technique helps to reduce overfitting, and the extent of the stochasticity can be controlled by arguments.
Like the gradient descent algorithm, in GBM there is also a learning rate parameter which scales the values of the leaves after a tree is fitted.
In practice, we may also not want to have too few instances under a leaf, in order to reduce potential overfitting. When there are too few instances under a leaf, we may simply stop the splitting process.
Now we have almost all the ingredients to make a working GBM. Let’s define the split_node function in the code snippet below.
chapter7/grow.py
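The actual split_node function lives in the file above; a rough numpy sketch of the core computation, scanning candidate features and breakpoints and keeping the split with the largest loss reduction \eqref{eq:lossreduction}, might look like this (names are illustrative).

```python
# Sketch of the split search: for the instances in a leaf, find the feature and
# breakpoint with the largest loss reduction; return None if no split gives a
# positive reduction.
import numpy as np

def find_best_split(x, g, h, lam, gamma):
    """x: (n, m) features of the instances in the leaf; g, h: derivative arrays."""
    G, H = g.sum(), h.sum()
    parent_score = G ** 2 / (H + lam)
    best_gain, best_split = 0.0, None
    n, m = x.shape
    for j in range(m):                              # candidate features
        order = np.argsort(x[:, j])
        g_sorted, h_sorted = g[order], h[order]
        G_left = H_left = 0.0
        for i in range(n - 1):                      # candidate breakpoints
            G_left += g_sorted[i]
            H_left += h_sorted[i]
            G_right, H_right = G - G_left, H - H_left
            gain = 0.5 * (G_left ** 2 / (H_left + lam)
                          + G_right ** 2 / (H_right + lam)
                          - parent_score) - gamma
            if gain > best_gain:
                breakpoint = (x[order[i], j] + x[order[i + 1], j]) / 2.0
                best_gain, best_split = gain, (j, breakpoint)
    return best_split          # None means the leaf should not be split
```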
And the implementation of the GBM class is given below.
chapter7/gbm.py
Now let’s see the performance of our GBM implemented from scratch.
chapter7/test_gbm.py
Running the code above, we have output as below.
We don’t implement the model in R, but it is not difficult to do based on the Python implementation above.
GBM can be used with various loss functions, and the major difference is the implementation of the first/second order derivatives, i.e., $g$ and $h$.
Regardless of the performance, there are two major missing features in our implementation: a) cross-validation and b) early stopping.
We have talked about cross-validation, but what is early stopping? It is a very useful technique in GBM. Usually, the cross-validated loss decreases as we add new trees at the beginning, but at a certain point the loss may start increasing as more trees are fitted (due to overfitting). Thus, we may select the best number of trees based on the cross-validated loss; specifically, we stop the fitting process when the cross-validated loss doesn’t decrease. In practice, we don’t want to stop the fitting immediately when the cross-validated loss starts increasing. Instead, we specify a number of trees, e.g., 50, as a buffer, after which the fitting process stops if the cross-validated loss hasn’t decreased.
Early stopping is also used in other machine learning models, for example, neural networks. Ideally, we would like to have early stopping based on the cross-validated loss, but when the training process is time-consuming, it’s fine to use the loss on a testing data set14.
The commonly used GBM packages include XGBoost 15, LightGBM 16 and CatBoost 17.
Let’s see how to use XGBoost for the same regression task on the Boston dataset.
chapter7/xgb.R
chapter7/xgb.py
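The two files above contain the R and Python versions; as a self-contained Python sketch of the same workflow (cross-validation with xgb.cv to pick the number of trees, then training with xgb.train), with synthetic data and made-up hyperparameters:

```python
# Sketch: a minimal XGBoost workflow - cross-validation to pick the number of
# boosting rounds, then training the final model (synthetic data and made-up
# hyperparameters for illustration).
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=42)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "reg:squarederror",
    "eta": 0.1,                 # learning rate
    "max_depth": 4,
    "subsample": 0.8,           # instance-wise stochasticity
    "colsample_bytree": 0.8,    # feature-wise stochasticity
    "lambda": 1.0,              # L2 regularization on leaf weights
}

# 5-fold cross-validation with early stopping to choose the number of trees
cv = xgb.cv(params, dtrain, num_boost_round=500, nfold=5,
            early_stopping_rounds=50, metrics="rmse", seed=42)
best_rounds = len(cv)
print("best number of trees:", best_rounds)

# train the final model with the tuned number of trees
model = xgb.train(params, dtrain, num_boost_round=best_rounds)
print("train RMSE:", np.sqrt(np.mean((model.predict(dtrain) - y) ** 2)))
```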
The parameters subsample and colsample_bytree control the stochasticity, within the range of $[0, 1]$. If we set these two parameters to 1, then all instances and all features are selected for fitting every tree.
The two code snippets illustrate a minimal workflow of fitting a GBM model. First, we conduct (hyper) parameter tuning (such as learning rate, number of trees, regularization parameters, stochasticity parameters) with cross-validation, and next we train the model with the tuned parameters.
Running the code snippets, we have the following results.
In XGBoost, we could also use linear regression models as the booster (or base learner) instead of decision trees. However, when 'booster':'gblinear' is used, the sum of the predictions from all boosters in the model is equivalent to the prediction from a single (combined) linear model. In that sense, what we get is just a Lasso solution of a linear regression model.
GBM can be used for different tasks, such as classification, ranking, survival analysis, etc. When we use GBM for predictive modeling, missing value imputation is not required, which is one big advantage over linear models (although in our own implementation we don’t consider missing values, for simplicity). In GBM, if a feature is categorical we could do label-encoding18, i.e., map the feature to integers directly without creating dummy variables (such as one-hot encoding). Of course, one-hot encoding19 can also be used. But when one-hot encoding creates many new columns, the probability that at least one column derived from the original categorical feature is selected during feature-wise subsampling is higher than for a single numerical variable. In other words, we are implicitly assigning a higher prior weight to the categorical feature in the feature-wise stochasticity.
For quite a few real-world prediction problems, monotonic constraints are desired. Monotonic constraints are either increasing or decreasing. An increasing constraint for feature $x_k$ refers to the relationship that $f(x_1,...,x_k,...,x_m)\le f(x_1,...,x_k',...,x_m)$ if $x_k\le x_k'$. For example, an increasing constraint for the number of bedrooms in a house price prediction model makes lots of sense. Using gradient boosting tree regression models we can enforce such monotonic constraints in a straightforward manner: after we get the best split for the current node, we check whether the monotonic constraint is violated by the split, and the split is not adopted if the constraint is broken.
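In XGBoost, for example, such constraints can be passed via the monotone_constraints parameter; a small sketch with made-up data is given below (1 means increasing, -1 decreasing, 0 unconstrained).

```python
# Sketch: an increasing monotonic constraint on the first feature in XGBoost
# (made-up data for illustration).
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] + rng.normal(size=200)    # the response increases with feature 0

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "reg:squarederror",
    "max_depth": 3,
    "monotone_constraints": "(1,0)",        # increasing in feature 0, unconstrained in feature 1
}
model = xgb.train(params, dtrain, num_boost_round=50)
```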
For many supervised learning tasks, we could formulate the problem as an optimization problem by writing down the loss function as a function of the training input and output. In unsupervised learning problems there are no labels/outputs, so it is more difficult to formulate unsupervised learning problems in a unified approach. Still, some unsupervised learning problems can be formulated as optimization problems. Let’s see a few examples briefly.
PCA is a very popular technique in many engineering disciplines. As you may have heard, PCA is a technique for dimension reduction. More specifically, let $\boldsymbol{x}$ denote an $n\times m$ matrix, where each row of $\boldsymbol{x}$, denoted as $\boldsymbol{x_i}'$, represents a point in $\mathbb{R}^m$. Sometimes the dimension $m$ could be relatively large and we don’t like that. In order to represent the data in a more compact way, we want to transform the raw data points into a new coordinate system in $\mathbb{R}^p$ where $p<m$. The $k^{th}$ ($k=1,...,p$) coordinate of $\boldsymbol{x_i}$ in the new coordinate system can be written as $\boldsymbol{w_k}'\boldsymbol{x_i}$. However, there are infinitely many choices of $\boldsymbol{w_k}$ for the transformation, and we need a guiding principle for it. The principle is to make the data points projected onto the first transformed coordinate have the largest variance, and $\boldsymbol{w_1}'\boldsymbol{x_i}$ is called the first principal component. The remaining transformations and principal components can be found iteratively.
So now we see how PCA is formulated as an optimization problem. However, under the above setting, there are infinitely many solutions for $\boldsymbol{w_k}$, so we usually add a constraint on $\boldsymbol{w_k}$ in PCA, i.e., $|\boldsymbol{w_k}|=1$. The solution to this optimization problem is surprisingly elegant – the optimal $\boldsymbol{w_k}$ are the eigenvectors of the covariance matrix of $\boldsymbol{x}$. Now let’s try to conduct a PCA with eigendecomposition (it can also be done with other decompositions).
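A minimal numpy sketch of this procedure, i.e., eigendecomposition of the sample covariance matrix followed by projection onto the leading eigenvectors, is given below with made-up data.

```python
# Sketch: PCA via eigendecomposition of the covariance matrix (made-up data).
import numpy as np

rng = np.random.default_rng(42)
x = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.2], [1.2, 1.0]], size=500)  # n*m data

cov = np.cov(x, rowvar=False)               # m*m covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh since the covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]           # sort by decreasing variance
w = eigvecs[:, order]                       # the columns of w are the w_k

principal_components = (x - x.mean(axis=0)) @ w   # coordinates in the new system
print("proportion of variance explained:", eigvals[order] / eigvals.sum())
```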
In fact, it does not matter whether the raw data points are centered or not if the eigendecomposition is performed on the covariance matrix. If you prefer to decompose $\boldsymbol{x}'\boldsymbol{x}$ directly, centering is a necessary step. In practice, we usually don’t perform PCA in this manual way since there are a lot of functions/packages available in both R and Python.
In previous chapters we have seen how to fit a distribution to data. What if the data points actually come from multiple distributions, for example, a mixture of two Gaussian distributions? If we knew which distribution each observed data point comes from, it would not be difficult to estimate the parameters. But in some situations it is impossible to tell the actual distribution a point is sampled from. Usually, a mixture of multiple distributions can be estimated by the maximum likelihood method: we derive the likelihood function of the observed data points and then we have an optimization problem.
Suppose we have a random sample with sample size $n$, and each observation $x_i;i=1,...,n$ is from one of $K$ multivariate Gaussian distributions. The $k^{th}$ distribution is denoted as $\mathcal{N}_k(\mu_k,\Sigma_k)$. We want to estimate the parameters of these Gaussian distributions.
As we discussed in Chapter 4, there are two commonly used approaches for distribution fitting, i.e., the method of moments and maximum likelihood estimation. In this case, we use maximum likelihood estimation because the likelihood function can be easily derived, as shown below.
$$
\begin{equation}
\mathcal{P} = \prod_{i=1}^{n}{\sum_{k=1}^{K}{\pi_k}f(x_i|\mu_k,\Sigma_k) },
\label{eq:gmm1}
\end{equation}
$$
where $\pi_k$ represents the probability that a randomly selected data point belongs to distribution $k$.
And thus, the log-likelihood function becomes:
$$
\begin{equation}
\mathcal{L} = \sum_{i=1}^{n}\log\Big(\sum_{k=1}^{K}{\pi_k}f(x_i|\mu_k,\Sigma_k)\Big).
\label{eq:gmm2}
\end{equation}
$$
Let’s try to implement the above idea in R with the optim function.
The estimates of the parameters are not accurate. Why? Remember we emphasized the importance of convexity in Chapter 6. Actually, the log-likelihood function given in \eqref{eq:gmm2} is not convex. For non-convex problems, the optim function may not converge. In practice, the EM algorithm 25 is frequently applied to fit mixture models.
The theory of the EM algorithm is out of the scope of this book; let’s just have a look at the implementation for this specific problem. Basically, there are two steps, i.e., the E-step and the M-step, which run in an iterative fashion. In the $t^{th}$ E-step, we update the membership probability $w_{i,k}$, which represents the probability that $x_i$ belongs to distribution $k$, based on the current parameter estimates as follows (here $\alpha_k^{(t)}$ denotes the estimate of the mixing probability $\pi_k$ at iteration $t$).
$$
\begin{equation}
w_{i, k}^{(t)} = \frac {\alpha_k^{(t)} f(x_i|\mu_k^{(t)},\Sigma_k^{(t)})} {\sum_{k=1}^{K} {\alpha_k^{(t)} f(x_i|\mu_k^{(t)},\Sigma_k^{(t)})} }.
\label{eq:7_e}
\end{equation}
$$
And in the $t^{th}$ M-step, for each $k$ we update the parameters as follows,
$$
\begin{equation}
\begin{split}
& \alpha_k^{(t+1)} = \frac {\sum_{i=1}^{n} {w_{i,k}^{(t)}} } {n} ,\\
& \mu_k^{(t+1)}= \frac {\sum_{i=1}^{n} {w_{i,k}^{(t)} x_i }} {\sum_{i=1}^{n} {w_{i,k}^{(t)}}}, \\
& \Sigma_k^{(t+1)}= \frac {\sum_{i=1}^{n} { w_{i,k}^{(t)} (x_i-\mu_k^{(t)})(x_i-\mu_k^{(t)})' }} {\sum_{i=1}^{n} {w_{i,k}^{(t)}}} .
\end{split}
\end{equation}
$$
The Python code below implements the above EM update schedule for Gaussian mixture model.
chapter7/gmm.py
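The actual implementation is in the file above; a compact sketch of the same E/M updates with scipy could look like the following (the initialization scheme is illustrative).

```python
# Sketch of the EM updates for a Gaussian mixture model (illustrative
# initialization; the book's chapter7/gmm.py may differ in detail).
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(x, k, n_iter=100, seed=42):
    """x: (n, m) data matrix; k: number of Gaussian components."""
    n, m = x.shape
    rng = np.random.default_rng(seed)
    alpha = np.full(k, 1.0 / k)                     # mixing probabilities
    mu = x[rng.choice(n, size=k, replace=False)]    # initialize the means from the data
    sigma = np.array([np.cov(x, rowvar=False) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: membership probabilities w[i, j]
        dens = np.column_stack([
            alpha[j] * multivariate_normal.pdf(x, mu[j], sigma[j]) for j in range(k)
        ])
        w = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update alpha, mu and sigma for each component
        nk = w.sum(axis=0)
        alpha = nk / n
        mu = (w.T @ x) / nk[:, None]
        for j in range(k):
            d = x - mu[j]
            sigma[j] = (w[:, j, None] * d).T @ d / nk[j]
    return alpha, mu, sigma, w
```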
To save space, we use a third-party package in R rather than implementing it from scratch.
In the above code, $z_i$ represents the membership probability. We plot the data points with their clusters in the figure below.
Clustering is the task of grouping similar objects together. Actually in the example above we have used Gaussian mixture model for clustering. There are many clustering algorithms and one of the most famous clustering algorithms might be $K$-means 26.
Different from supervised/unsupervised learning, reinforcement learning (RL) tries to learn a strategy for decision-making. AlphaGo 27 and OpenAI Five 28 are two of the most famous real-world use cases. There are many good resources for learning RL 33 34, and it is impossible to give an in-depth introduction to RL in this short section. But we can go through a specific example for a first impression.
Let’s play a hypothetical game, in which the player is asked to make a sequence of decisions. Each time, the decision is to choose a number from 1 and 2. If number 1 is chosen, the next decision is made 1 time unit later; if number 2 is chosen, the next decision is made 2 time units later. No matter which number is chosen, the player always gets \$1 as the immediate reward. The game ends after $100$ time units, and the goal is to collect as much reward as possible.
The game seems very simple and the best strategy is to choose number 1 in each step. This problem cannot be solved by supervised or unsupervised learning techniques. And one obvious distinction in this problem is that we don’t have any training data to learn from. This problem falls into the agent-environment interaction paradigm.
In step $t$, the agent takes an action, and the action $a^{(t)}$ in turn has an effect on the environment. As a result, the state of the environment goes from $s^{(t)}$ to $s^{(t+1)}$ and a reward $r^{(t+1)}$ is returned to the agent. Usually, the goal in RL is to pick actions so that the cumulative reward $\sum_{t=0}^{m}{r^{(t+1)}}$ is maximized. When ${m\to\infty}$, the cumulative reward may not be bounded. In that case, the future reward can be discounted by $\lambda\in{[0, 1]}$ and we want to maximize $\sum_{t=0}^{\infty}{\lambda^{t}r^{(t+1)}}$. In the following step, the agent picks the next action $a^{(t+1)}$ based on the information collected. In general, the environment state and reward at step $t$ are random variables. If the probability distribution of the next state and reward only depends on the current environment state and the action picked, the Markov property holds, which results in a Markov decision process (MDP). A majority of RL studies focus on MDPs.
With the advance of deep learning, deep learning-based RL has made significant progress recently. Many deep RL algorithms have been developed, such as the deep Q-network (DQN) algorithm, which we discuss in more detail in the next section.
As we mentioned, usually there are no data to learn from in RL. Both supervised and unsupervised learning need data to perform optimization on some objective function. So does RL, but in a different way: RL can be viewed as a simulation-based optimization approach, because RL runs simulations of the agent-environment interaction.
There are many RL algorithms available online. In the next section we will have a look at the DQN algorithm.
To introduce DQN we first define an action-value function $Q_{\pi}(s,a)$ and the optimal action-value function $Q_{\pi}^{*}(s,a)$. The action-value function is defined as
$$
\begin{equation}
Q_{\pi}(s,a)=E[R|s,a,\pi],
\label{eq:rl_e}
\end{equation}
$$
where $R$ denotes the random cumulative (discounted) rewards to obtain if the agent starts from state $s$, takes the action $a$, and then always follows the policy (strategy) $\pi$.
Based on the definition of the action-value function, the optimal action-value function is defined as
$$
\begin{equation}
Q_{\pi}^{*}(s,a)=\max_{\pi} Q_{\pi}(s,a).
\label{eq:rl_e1}
\end{equation}
$$
The optimal action-value function represents the optimal expected cumulative (discounted) rewards to obtain if the agent starts from state $s$ and takes action $a$. If we knew the exact optimal action-value function, it would be easy to derive the optimal policy $\pi$ – we could just choose the action that maximizes $Q_{\pi}^{*}(s,a)$. But how to get the optimal action-value function is the key problem, and DQN aims at approximating the optimal action-value function by (deep) neural networks. Let $Q(s,a;\theta)$ denote the neural network parameterized by $\theta$.
Compared with the vanilla DQN, double DQN is more commonly used due to its advantages. The training of these neural networks is still done in a supervised manner, i.e., we specify a loss function and update the parameters with gradient-based methods. An important idea to improve the performance of DQN is experience replay, which is a brilliant idea worth learning about, but we will not cover it here.
Let’s see how to do RL in practice. Although there are some RL tools in R, I recommend using Python because there are much more mature tools in the deep learning ecosystem. To find the optimal strategy of the game we talked about in the previous section, we will use two packages, i.e., gym 29 and stable_baselines 30. The gym package helps us define the environment, and the stable_baselines package contains functionalities for building the DQN agent.
In gym there is a class Env, based on which we can define our own environment class. More specifically, our environment class has to have three methods customized to fit the problem, i.e., the __init__ method, the reset method and the step method. The reset method specifies how to reset the environment before we start the learning. The step method specifies how the environment reacts to the action, and it should return the next state and the reward; it should also return whether the current game is finished or not. Regarding our game, the state that the agent observes in each step could be the current time in the game, and the reward is always 1. If the current environment time is greater than the time horizon, the game is finished. We define the environment class in the following code snippet.
chapter7/game_env.py
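The actual environment is defined in the file above; a rough sketch of what such a class could look like with the classic gym API (the class name GameEnv and the observation/action encodings are illustrative) is given below.

```python
# Sketch of a custom gym environment for the hypothetical game (illustrative
# names; the book's chapter7/game_env.py may differ in detail).
import gym
import numpy as np
from gym import spaces

class GameEnv(gym.Env):
    """Choose number 1 or 2; each choice yields $1 and advances time by 1 or 2
    units; the game is finished once the time horizon is exceeded."""

    def __init__(self, horizon=100):
        super().__init__()
        self.horizon = horizon
        self.action_space = spaces.Discrete(2)   # action 0 -> number 1, action 1 -> number 2
        self.observation_space = spaces.Box(low=0.0, high=np.inf,
                                            shape=(1,), dtype=np.float32)
        self.time = 0

    def reset(self):
        # reset the environment before learning starts
        self.time = 0
        return np.array([self.time], dtype=np.float32)

    def step(self, action):
        # the environment reacts to the action: advance time by 1 or 2 units
        self.time += int(action) + 1
        reward = 1.0                             # the player always gets $1
        done = self.time > self.horizon          # the game is finished beyond the horizon
        return np.array([self.time], dtype=np.float32), reward, done, {}
```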
And the code below shows how to use a DQN agent to play the game with the help of the stable_baselines package.
chapter7/game_env_run.py
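The actual script is in the file above; a rough sketch with stable_baselines could look like this (the module/class names and the number of training timesteps are illustrative).

```python
# Sketch: training and evaluating a DQN agent with stable_baselines
# (GameEnv is the illustrative environment class sketched above).
from stable_baselines import DQN

from game_env import GameEnv

env = GameEnv(horizon=100)
model = DQN("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=10000)

# play one episode with the learned policy
obs, total_reward, done = env.reset(), 0.0, False
while not done:
    action, _ = model.predict(obs)
    obs, reward, done, _ = env.step(action)
    total_reward += reward
print("total reward:", total_reward)
```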
By running the code above, we have the following output.
In the game discussed above, the environment has no stochasticity and the optimal policy does not depend on the environment’s state at all, since it is a simplified example to illustrate the usage of the tools. Real-world problems are usually much more complicated than this example, but we could follow the same procedure, i.e., define the environment for the problem and then build the agent for learning.
In Chapter 6 we talked about gradient-based optimization algorithms, such as vanilla gradient descent, and we implemented the gradient descent method for linear regression. If you have used modern machine learning frameworks such as tensorflow, you may remember that in tensorflow there is no need to define the gradient update step manually; instead the framework handles the update magically. That is because these frameworks have automatic differentiation implemented.
It is worth distinguishing numerical differentiation, symbolic differentiation and automatic differentiation. Symbolic differentiation 31 aims at finding the derivative of a given formula with respect to a specified variable. It takes a symbolic formula as input and also returns a symbolic formula. For example, by using symbolic differentiation we could simplify $\partial {(x^2+y^2)}/\partial{x}$ to $2x$. Numerical differentiation is often used when the expression/formula of the function is unknown (or in a black box, for example, a computer program). Compared with symbolic differentiation, numerical differentiation and automatic differentiation try to evaluate the derivative of a function at a fixed point. For example, we may want to know the value of $\partial{f(x,y)}/\partial{x}$ when $x=x_0;y=y_0$. The most basic formula for numerical differentiation is $\lim_{h \to 0} (f(x+h)-f(x))/h$. In implementation, this limit could be approximated by setting $h$ to a very small value, for example $10^{-6}$. Automatic differentiation also aims at finding the numerical values of derivatives, in a more efficient approach based on the chain rule. However, unlike numerical differentiation, automatic differentiation requires the exact function formula/structure. Also, automatic differentiation does not need the approximation of the limit which is done in numerical differentiation. At the lowest level, it evaluates the derivatives based on symbolic rules. Thus automatic differentiation is partly symbolic and partly numerical 35.
Let’s see an example of how to do symbolic differentiation in R and Python.
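For example, a minimal Python sketch with sympy (in R, the built-in D() or deriv() functions provide similar functionality):

```python
# Sketch: symbolic differentiation in Python with sympy.
import sympy

x, y = sympy.symbols("x y")
expr = x ** 2 + y ** 2
print(sympy.diff(expr, x))   # 2*x
```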
It’s worth noting there are many other symbolic operations we could do in R/Python, for example, symbolic integration.
Automatic differentiation is extremely important in modern machine learning. Many deep learning models can be viewed as function compositions. Let’s take the two-hidden-layer neural network illustrated below as an example. Let $x$ denote the input and $f_{u},g_{v}$ denote the first and second hidden layers, respectively. The output $y$ is written as $y=g_v(f_u(x))$. The learning task is to estimate the parameters $u,v$ in these hidden layers. If we use a gradient descent approach for the parameter estimation, the gradients with respect to both $u$ and $v$ have to be evaluated. However, there might be overlapping computations if we simply perform the two differentiation operations independently. By utilizing the automatic differentiation technique, it is possible to reduce the computational complexity.
Let’s see how we could utilize automatic differentiation to simplify our linear regression implementation.
chapter7/linear_regression_ad.py
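The actual implementation is in the file above; a rough sketch of the same idea, with a loss method returning a closure and gradient descent driven by autograd, could look like this (the class name and hyperparameters are illustrative).

```python
# A rough sketch of linear regression fitted with automatic differentiation
# (autograd); the layout mirrors the description below but the exact names and
# details of chapter7/linear_regression_ad.py may differ.
import numpy
import autograd.numpy as np          # drop-in replacement for numpy
from autograd import grad

class LinearRegressionAD:
    def __init__(self, learning_rate=0.1, n_iter=2000):
        self.learning_rate = learning_rate
        self.n_iter = n_iter
        self.coef = None

    def loss(self, X, y):
        """Return a closure computing the mean squared error for given coefficients."""
        def loss_with_coef(coef):
            pred = np.dot(X, coef[1:]) + coef[0]     # intercept + linear terms
            return np.mean((pred - y) ** 2)
        return loss_with_coef

    def fit(self, X, y):
        loss_with_coef = self.loss(X, y)
        grad_fun = grad(loss_with_coef)              # gradient via automatic differentiation
        coef = np.zeros(X.shape[1] + 1)
        for _ in range(self.n_iter):
            coef -= self.learning_rate * grad_fun(coef)   # vanilla gradient descent step
        self.coef = coef
        return self


if __name__ == "__main__":
    rng = numpy.random.RandomState(42)
    X = rng.normal(size=(200, 3))
    y = X @ numpy.array([1.0, -2.0, 0.5]) + 0.3 + rng.normal(scale=0.1, size=200)
    print(LinearRegressionAD().fit(X, y).coef)
```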
In the implementation above, we use autograd.numpy to replace the ordinary numpy package in order to perform automatic differentiation on the functions of interest. In the loss method, we return a function loss_with_coef which captures the data required for the computation. This type of function is also called a closure 32.
1 Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930–945, 1993.
2 https://docs.python.org/3/reference/import.html
3 https://en.wikipedia.org/wiki/Mean_absolute_error
4 https://en.wikipedia.org/wiki/Mean_absolute_percentage_error
5 https://en.wikipedia.org/wiki/Categorical_distribution
6 https://en.wikipedia.org/wiki/Precision_and_recall
7 https://en.wikipedia.org/wiki/Feature_engineering
8 https://en.wikipedia.org/wiki/Kolmogorov-Arnold_representation_theorem
9 Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
10 https://en.wikipedia.org/wiki/Autoencoder
11 https://statisticalhorizons.com/multicollinearity
12 https://github.com/r-lib/R6/issues/140
13 Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794. ACM, 2016.
14 https://en.wikipedia.org/wiki/Early_stopping
15 https://github.com/dmlc/xgboost
16 https://github.com/microsoft/LightGBM
17 https://github.com/catboost/catboost
18 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
19 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
20 https://en.wikipedia.org/wiki/Reinforcement_learning
21 https://en.wikipedia.org/wiki/Transfer_learning
22 https://en.wikipedia.org/wiki/Semi-supervised_learning
23 https://en.wikipedia.org/wiki/Support-vector_machine
24 https://en.wikipedia.org/wiki/Linear_discriminant_analysis
25 https://en.wikipedia.org/wiki/Expectation-maximization_algorithm
26 https://en.wikipedia.org/wiki/K-means_clustering
27 https://deepmind.com/research/case-studies/alphago-the-story-so-far
28 https://openai.com/projects/five
29 https://gym.openai.com
30 https://stable-baselines.readthedocs.io/en/master
31 https://www.cs.utexas.edu/users/novak/asg-symdif.html
32 https://en.wikipedia.org/wiki/Closure_(computer_programming)
33 http://incompleteideas.net/book/the-book.html
34 https://spinningup.openai.com/en/latest/
35 https://arxiv.org/abs/1502.05767