In this chapter we will cover a few topics related to random variables (r.v.) and distributions.
Distributions describe random phenomena in terms of probabilities, and they are closely connected to random variables: a random variable is specified by its distribution. A r.v. can be either discrete or continuous. The number of calls received by a call center is an example of a discrete r.v., while the amount of time taken to read this book is a continuous random variable.
There are infinitely many distributions. Many important distributions fall into distribution families (i.e., parametric sets of probability distributions of a certain form). For example, the multivariate normal distribution belongs to the exponential family^1.
The cumulative distribution function (CDF) specifies how a real-valued random variable $X$ is distributed, i.e.,
$$
\begin{equation}\label{eq:4_0_1}
F_X(x)=P(X\le x).
\end{equation}
$$
It is worth noting that a CDF is always monotonic (non-decreasing).
When the derivative of the CDF exists, we call the derivative of $F_X$ the probability density function (PDF), denoted by the lower-case $f_X$. The PDF is associated with continuous random variables. For discrete random variables, the counterpart of the PDF is the probability mass function (PMF). The PMF gives the probability that a random variable equals some value, but the PDF does not represent probabilities directly.
The quantile function is the inverse of CDF, i.e.,
$$
\begin{equation}\label{eq:4_0_1_2}
Q_X(p)=\inf \{ x\in \boldsymbol{R}:p\le F_X(x) \}.
\end{equation}
$$
The PDF, CDF and quantile functions are all heavily used in quantitative analysis. In R and Python, we can find a number of functions to evaluate them. The best-known distribution is probably the univariate normal (Gaussian) distribution, so let's use Gaussian random variables for illustration.
We see that qnorm(pnorm(x)) = x.
In Python, we use the functions in numpy and scipy.stats for the same purpose.
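As a minimal Python sketch (the input values are chosen arbitrarily for illustration), we can evaluate the Gaussian PDF, CDF and quantile function with scipy.stats.norm and verify that the quantile function inverts the CDF:

```python
import numpy as np
from scipy.stats import norm

x = np.array([-1.5, 0.0, 2.0])

pdf_vals = norm.pdf(x)          # PDF of the standard Gaussian, analogous to dnorm in R
cdf_vals = norm.cdf(x)          # CDF, analogous to pnorm
quantiles = norm.ppf(cdf_vals)  # quantile function (inverse CDF), analogous to qnorm

# the quantile function inverts the CDF, so we recover x
print(np.allclose(quantiles, x))  # True
```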
A random variable could also be multivariate. In fact, the univariate normal distribution is a special case of the multivariate normal distribution, whose PDF is given by
$$
\begin{equation}\label{eq:4_0_1_3}
p(\boldsymbol{x};\boldsymbol{\mu},\boldsymbol{\Sigma})=\frac {1} {(2\pi)^{m/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\left(-\frac 1 2 (\boldsymbol{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\right),
\end{equation}
$$
where $\boldsymbol{\mu}$ is the mean vector and $\boldsymbol{\Sigma}$ is the covariance matrix of the $m$-dimensional random variable $\boldsymbol{x}$.
Sampling from distributions is involved in many algorithms, such as Monte Carlo simulation. First, let’s see a simple example in which we draw samples from a 3-dimensional normal distribution.
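A minimal Python sketch of this example is given below; the mean vector and covariance matrix are made-up values for illustration:

```python
import numpy as np

# mean vector and covariance matrix of a 3-dimensional normal distribution
# (the specific values are chosen purely for illustration)
mu = np.array([0.0, 1.0, 2.0])
sigma = np.array([[1.0, 0.2, 0.1],
                  [0.2, 1.5, 0.3],
                  [0.1, 0.3, 2.0]])

np.random.seed(42)
samples = np.random.multivariate_normal(mu, sigma, size=1000)  # shape (1000, 3)

print(samples.mean(axis=0))           # should be close to mu
print(np.cov(samples, rowvar=False))  # should be close to sigma
```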
Please note that in the example above we do not calculate quantiles. For multivariate distributions, quantiles are not necessarily uniquely defined points.
Sampling from the Gaussian or other well-known distributions can be as simple as calling a function. But what if we want to draw samples from an arbitrary distribution given its CDF? Inversion sampling is a generic solution. The key idea of inversion sampling is that $F_{X}(X)$ always follows a uniform distribution between $0$ and $1$. There are two steps to draw a sample with the inversion sampling approach: first, draw $u$ from the uniform distribution on $[0,1]$; second, compute $x=Q_X(u)$, i.e., apply the quantile function to $u$.
Let’s see how to use the inversion sampling technique to sample from the exponential distribution with CDF $F_X(x;\lambda) = 1-e^{-\lambda x}$.
In R, we could use the builtin quantile function (qexp) in step 2; however, for many distributions there is no builtin quantile function available, and then we need to specify the quantile function ourselves, as illustrated in the Python sketch below.
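Below is a minimal Python sketch of inversion sampling for the exponential distribution, with the quantile function specified manually (the rate $\lambda$ and the sample size are chosen for illustration):

```python
import numpy as np

def exp_quantile(p, lam):
    # quantile function of the exponential distribution, obtained by
    # inverting the CDF F(x) = 1 - exp(-lam * x)
    return -np.log(1.0 - p) / lam

np.random.seed(0)
lam = 1.0
n = 10000
u = np.random.uniform(0.0, 1.0, n)  # step 1: draw u ~ Uniform(0, 1)
x = exp_quantile(u, lam)            # step 2: transform u with the quantile function

print(x.mean())  # should be close to 1/lam
```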
Rejection sampling is another basic algorithm to draw samples for a random variable $X$ given its PDF $f_X$. The basic idea is to draw a sample $y$ for a random variable $Y$ with PDF $f_Y$ and accept it with probability $f_X(y)/(Mf_Y(y))$, where $M$ is selected such that $f_X(x)/(Mf_Y(x))\le1$ for all $x$. If the generated sample is rejected, the sampling procedure is repeated until an acceptance. The distribution $f_Y$ is called the proposal distribution. More theoretical details of rejection sampling can be found on Wikipedia^2.
Let’s try to draw samples from an exponential distribution truncated between $0$ and $b$. The PDF of this random variable is $f_X(x;\lambda, b) = \lambda e^{-\lambda x}/( 1-e^{-\lambda b})$ for $0\le x\le b$.
A naive approach is to sample from the untruncated exponential distribution and only accept the samples smaller than $b$, which is implemented as follows.
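A minimal Python sketch of this naive approach is shown below (the values of $\lambda$, $b$ and the sample size are illustrative):

```python
import numpy as np

np.random.seed(0)
lam, b, n = 1.0, 2.0, 5000

x = np.array([])
while len(x) < n:
    # draw from the untruncated exponential distribution ...
    candidates = np.random.exponential(scale=1.0 / lam, size=n)
    # ... and keep only the samples falling inside [0, b]
    x = np.concatenate([x, candidates[candidates <= b]])
x = x[:n]
```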
After running the code snippet above, the samples from the truncated exponential distribution are stored in x.
Now let’s use the rejection sampling technique for this task. Since we want to sample the random variable between $0$ and $b$, one natural choice of the proposal distribution $f_Y$ is a uniform distribution between $0$ and $b$ and we choose $M=b\lambda/(1-e^{-\lambda b})$. As a result, the acceptance probability $f_X(x)/(Mf_Y(x))$ becomes $e^{-\lambda x}$.
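A minimal Python sketch of the rejection sampling approach could look as follows (again with illustrative parameter values):

```python
import numpy as np

np.random.seed(1)
lam, b, n = 1.0, 2.0, 5000

samples = np.array([])
while len(samples) < n:
    y = np.random.uniform(0.0, b, size=n)   # draw from the proposal f_Y = Uniform(0, b)
    u = np.random.uniform(0.0, 1.0, size=n)
    # accept with probability f_X(y)/(M f_Y(y)) = exp(-lam * y)
    samples = np.concatenate([samples, y[u <= np.exp(-lam * y)]])
samples = samples[:n]
```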
We have seen basic examples of how to draw samples with inversion sampling and rejection sampling. Now let’s work on a more challenging problem.
Without loss of generality, let’s consider a unit sphere, i.e., radius $r=1$. We want to draw i.i.d. points uniformly from the surface of the unit sphere. The problem appears simple at first glance – we could use the spherical coordinate system and draw samples for $\phi$ and $\theta$. The question is how to sample $\phi$ and $\theta$. A straightforward idea is to draw independent and uniform samples of $\phi$ from $0$ to $2\pi$ and of $\theta$ from $0$ to $\pi$, respectively. However, this idea is incorrect, as analyzed below.
Let’s use $f_P(\phi,\theta)$ to denote the PDF of the joint distribution of $(\phi,\theta)$. Integrating this PDF over its support gives
$$
\begin{equation}\label{eq:4_0_2}
1 = \int_{0}^{2\pi} \int_{0}^{\pi} f_P(\phi,\theta) d\phi d\theta = \int_{0}^{2\pi} \int_{0}^{\pi} f_\Phi(\phi) f_{\Theta|\Phi}(\theta|\phi) d\phi d\theta.
\end{equation}
$$
If we require that $\Phi$ follows a uniform distribution between $0$ and $2\pi$, then $f_\Phi(\phi)=1/(2\pi)$, and
$$
\begin{equation}\label{eq:4_0_3}
1=\int_{0}^{\pi} f_{\Theta|\Phi}(\theta|\phi) d\theta.
\end{equation}
$$
One solution to \eqref{eq:4_0_3} is $f_{\Theta|\Phi}(\theta|\phi)=\sin(\theta)/2$. In fact, for points uniformly distributed on the sphere surface, the probability of landing in a small patch is proportional to the surface area element $\sin(\theta)\,d\theta\,d\phi$, which is why $\Theta$ cannot be uniform.
Thus, we can generate the samples of $\Phi$ from the uniform distribution and the samples of $\Theta$ from the distribution whose PDF is $\sin(\theta)/2$. Sampling $\Phi$ is trivial, but how about $\Theta$? We can use the inversion sampling technique. The CDF of $\Theta$ is $(1-\cos(\theta))/2$ for $0\le\theta\le \pi$, and the quantile function is $Q_\Theta(p)=\arccos(1-2p)$.
Sampling from the unit sphere is implemented below.
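A minimal Python sketch (with an illustrative sample size) is given below:

```python
import numpy as np

np.random.seed(0)
n = 1000

phi = np.random.uniform(0.0, 2.0 * np.pi, n)                    # phi ~ Uniform(0, 2*pi)
theta = np.arccos(1.0 - 2.0 * np.random.uniform(0.0, 1.0, n))   # inversion sampling for theta

# convert the spherical coordinates to Cartesian coordinates
x = np.sin(theta) * np.cos(phi)
y = np.sin(theta) * np.sin(phi)
z = np.cos(theta)

print(np.allclose(x**2 + y**2 + z**2, 1.0))  # all points lie on the unit sphere
```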
There are also other solutions to this problem, which will not be discussed in this book. A related problem is to draw samples from inside a sphere. We could solve this problem with a similar approach, or use rejection sampling, i.e., sample uniformly from the enclosing cube and accept the points that fall inside the sphere (the acceptance ratio is $\pi/6$).
We are not trying to introduce these concepts from scratch. This section is more like a recap.
In the previous section, we saw the PDF of the multivariate normal distribution in \eqref{eq:4_0_1_3}. A multivariate distribution is also called a joint distribution, since the multivariate random variable can be viewed as a joint of multiple univariate random variables. The joint PDF gives the probability density of a set of random variables. Sometimes we may only be interested in the probability distribution of a single random variable in the set, and that distribution is called the marginal distribution. The PDF of a marginal distribution can be obtained by integrating the joint PDF over all the other random variables. For example, integrating \eqref{eq:4_0_1_3} over all but one coordinate gives the PDF of a univariate normal distribution.
The joint distribution is the distribution of the whole population. In the context of a bivariate Gaussian random variable $(X_1,X_2)$, the joint PDF $f_{X_1,X_2}(x_1,x_2)$ specifies the probability density for all pairs of $(X_1,X_2)$ in the 2-dimensional plane. The marginal distribution of $X_1$ is still about the whole population because we are not ruling out any points from the support of the distribution function. Sometimes we are interested in a subpopulation only, for example, the subset of $(X_1,X_2)$ where $X_2=2$ or $X_2>5$. We can use a conditional distribution to describe the probability distribution of a subpopulation, and the symbol $|$ is frequently used to denote it. For example, we use $f_{X_1|X_2}(x_1|x_2=0)$ to represent the distribution of $X_1$ conditional on $X_2=0$. By the rule of conditional probability $P(A|B)=P(A,B)/P(B)$, the calculation of $f_{X_1|X_2}(x_1|x_2)$ is straightforward, i.e., $f_{X_1|X_2}(x_1|x_2)=f_{X_1,X_2}(x_1,x_2)/f_{X_2}(x_2)$.
The most well-known joint distribution is the multivariate Gaussian distribution, which has many important and useful properties. For example, given the observation of $(X_1,…,X_k)$ from $(X_1,…,X_m)$, the conditional distribution of $(X_{k+1},…,X_m)$ still follows a multivariate Gaussian distribution, which is essential to Gaussian process regression^3.
We have seen the extension from the univariate Gaussian distribution to the multivariate Gaussian distribution, but how about other distributions? For example, what is the joint distribution of two univariate exponential distributions? We can use a copula^4 for such a purpose. For the random variable $(X_1,…,X_m)$, let $(U_1,…,U_m)=(F_{X_1}(X_1),…,F_{X_m}(X_m))$, where $F_{X_k}$ is the CDF of $X_k$. We know each $U_k$ follows a uniform distribution between $0$ and $1$. Let $C(U_1,…,U_m)$ denote the joint CDF of $(U_1,…,U_m)$; this CDF is called a copula.
There are different copula functions, and one commonly used copula is the Gaussian copula. The standard Gaussian copula is specified as below.
$$
\begin{equation}\label{eq:4_0_3_0}
C^{Gauss}_{\Sigma}(u_1,…,u_m)=\Phi_{\Sigma}(\Phi^{-1}(u_1),…,\Phi^{-1}(u_m)),
\end{equation}
$$
where $\Phi$ denotes the CDF of the standard Gaussian distribution, and $\Phi_{\Sigma}$ denotes the CDF of a multivariate Gaussian distribution with mean $\boldsymbol{0}$ and correlation matrix $\Sigma$.
Let’s see an example of drawing samples from a bivariate exponential distribution constructed via the Gaussian copula. The basic idea of sampling multivariate random variables via a copula is to sample $(U_1,…,U_m)$ first and then transform it to the desired random variables.
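A minimal Python sketch is given below; the correlation parameter of the copula and the rates of the exponential marginals are made-up values for illustration:

```python
import numpy as np
from scipy.stats import norm, expon

np.random.seed(0)
n = 2000
rho = 0.6              # correlation parameter of the Gaussian copula (illustrative)
lam1, lam2 = 1.0, 0.5  # rates of the two exponential marginals (illustrative)

# step 1: sample from a bivariate standard normal with correlation rho
cov = np.array([[1.0, rho], [rho, 1.0]])
z = np.random.multivariate_normal([0.0, 0.0], cov, size=n)

# step 2: transform to uniform variables with the standard normal CDF
u = norm.cdf(z)

# step 3: transform the uniforms to exponential marginals with the quantile function
x1 = expon.ppf(u[:, 0], scale=1.0 / lam1)
x2 = expon.ppf(u[:, 1], scale=1.0 / lam2)
```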
We plot 2000 samples generated from the bivariate exponential distribution constructed via copula in the figure below.
With the help of copulas, we can even construct a joint distribution with marginals from different distributions. For example, we can make a joint distribution of a uniformly distributed random variable and an exponentially distributed random variable.
Statistics is used to solve real-world problems with data. In many cases we have a collection of observations of a random variable and want to know the distribution the observations follow. In fact, there are two questions involved in fitting a distribution. First, which distribution to fit? Second, given the distribution, how to estimate its parameters? These are essentially the same questions that we have to answer in supervised learning, where we need to choose a model and estimate its parameters (if the model has parameters). We can also call these two questions model selection and model fitting. Usually, model selection is done based on the results of model fitting.
There are two widely used methods in distribution fitting – the method of moments and the maximum likelihood method. In this section we will see the method of moments; the maximum likelihood method will be introduced in Chapter 6. The $k^{th}$ moment of a random variable is defined as $\mu_k=E(X^k)$. If there are $m$ parameters, we usually derive the first $m$ theoretical moments in terms of the parameters, and by equating these theoretical moments to the sample moments $\hat{\mu}_k=\frac{1}{n}\sum_{i=1}^n{x_i^k}$ we obtain the estimates.
Let’s take the univariate Gaussian distribution as an example. We want to estimate the mean $\mu$ and variance $\sigma^2$. The first and second theoretical moments are $\mu$ and $\mu^2+\sigma^2$. Thus, the estimates $\hat\mu$ and $\hat{\sigma}^2$ are $\frac{1}{n}\sum_{i=1}^n{x_i}$ and $\frac{1}{n}\sum_{i=1}^n{x_i^2}-(\frac{1}{n}\sum_{i=1}^n{x_i})^2=\frac{1}{n}\sum_{i=1}^n{(x_i-\hat\mu)^2}$, respectively. The code snippet below shows the implementation.
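A minimal Python sketch of the method-of-moments estimation is given below; the data are simulated from a normal distribution with made-up parameters:

```python
import numpy as np

np.random.seed(0)
x = np.random.normal(loc=2.0, scale=1.5, size=10000)  # simulated data (true mu=2, sigma=1.5)

# method of moments: equate the first two sample moments to the theoretical moments
mu_hat = x.mean()
sigma2_hat = (x ** 2).mean() - mu_hat ** 2  # equivalent to the mean of (x - mu_hat)^2

print(mu_hat, np.sqrt(sigma2_hat))
```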
We could also fit another distribution to data generated from a normal distribution. But which fit is better? One answer is to compare the likelihood functions evaluated at the fitted parameters and choose the one that gives the larger likelihood value.
Please note that different methods of fitting a distribution may lead to different parameter estimates. For example, for some distributions the estimate of the population variance using the maximum likelihood method differs from that using the method of moments (for the Gaussian distribution the two methods happen to coincide). Also note that an estimator may be biased; for instance, the estimator $\hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n{(x_i-\hat\mu)^2}$ above is a biased estimator of the population variance, whereas the estimator of the population mean is unbiased.
In the previous section we saw parameter estimation for a distribution. However, the parameter estimates from either the method of moments or maximum likelihood estimation are not the exact values of the unknown parameters. There are uncertainties associated with distribution fitting because the data to which the distribution is fit is usually just a random sample rather than the whole population. Suppose we could repeat the distribution fitting process $m$ times and each time collect a random sample of size $n$ (i.e., a collection of $n$ observations of the random variable of interest); then we would get $m$ estimates of the distribution parameters. Which estimate is the best to use? In fact, all these $m$ estimates are observations of the estimator random variable. An estimator is a function of the random sample, and it is itself a random variable. For example, $\hat\mu=\frac{1}{n}\sum_{i=1}^n X_i$ is an estimator for the $\mu$ parameter of a Gaussian distribution.
Now we know that when we fit a distribution, the parameter estimates are not the exact values of the unknown parameters. The question is how to quantify the uncertainty. To answer this question, we had better know the distribution of the estimator. In the example of the Gaussian distribution, what distribution does the estimator $\hat\mu$ follow? It is straightforward to see that this estimator still follows a Gaussian distribution, since each $X_i$ follows a Gaussian distribution (a sum of independent Gaussian random variables still follows a Gaussian distribution). But what if $X_i$ comes from an arbitrary distribution? We may still be able to derive the exact distribution of the sample mean $\hat{\mu}$, but sometimes the derivation is not easy. A simpler idea is to use the central limit theorem (CLT), which states that the distribution of the mean of a random sample from a population with finite variance is approximately normal when the sample size is large enough. More specifically, $\sqrt{n}(\bar{X}-\mu)/\sigma\xrightarrow{d} N(0,1)$, where $\mu$ and $\sigma$ are the population mean and standard deviation, respectively. Sometimes we do not have the actual value of the population standard deviation; the sample standard deviation $S$ can be used instead, and thus $\sqrt{n}(\bar{X}-\mu)/S\xrightarrow{d} N(0,1)$.
We know that if $Z$ is a standard Gaussian random variable, $P(- Z_{(1+\alpha)/2} < Z\le Z_{(1+\alpha)/2}) = \alpha$, where $Z_{u}$ denotes the $u$ quantile of the standard Gaussian distribution.
By the CLT, $P(- Z_{(1+\alpha)/2} \le \sqrt{n}(\bar{X}-\mu)/S \le Z_{(1+\alpha)/2}) = \alpha$, which further leads to $P(\bar{X}- Z_{(1+\alpha)/2}S/\sqrt{n} \le \mu \le \bar{X}+ Z_{(1+\alpha)/2}S/\sqrt{n}) = \alpha$. The interval $\bar{X} \pm Z_{(1+\alpha)/2}S/\sqrt{n}$ is called an $\alpha$ confidence interval (CI) for the population mean $\mu$. For example, since $Z_{(1+0.95)/2}\approx1.96$, the 95% CI is constructed as $\bar{X} \pm 1.96S/\sqrt{n}$. Of course, if we know the exact value of $\sigma$, we can use $\bar{X} \pm 1.96\sigma/\sqrt{n}$ instead.
We show an example of the confidence interval calculation for the population mean of a normal distribution.
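A minimal Python sketch is given below; the simulated data and the sample size are purely illustrative:

```python
import numpy as np
from scipy.stats import norm

np.random.seed(0)
x = np.random.normal(loc=5.0, scale=2.0, size=200)  # a random sample (true mean 5)

alpha = 0.95
z = norm.ppf((1 + alpha) / 2)   # ~ 1.96 for a 95% CI
x_bar = x.mean()
s = x.std(ddof=1)               # sample standard deviation
half_width = z * s / np.sqrt(len(x))

print(x_bar - half_width, x_bar + half_width)  # the 95% CI for the population mean
```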
The interpretation of a CI is tricky. A 95% CI does not mean that the probability that a particular constructed CI contains the true population mean is $0.95$. A constructed CI is again a random quantity, because it is computed from a random sample. Following the classic textbook explanation: if we repeat the procedure of creating CIs many times, the proportion of the constructed CIs that contain the true parameter is approximately equal to $\alpha$. Let’s do an example to see that point.
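A minimal Python sketch of this simulation is given below (the true parameters, sample size and number of repetitions are illustrative):

```python
import numpy as np
from scipy.stats import norm

np.random.seed(0)
mu_true, sigma_true = 5.0, 2.0
n, repeats, alpha = 200, 1000, 0.95
z = norm.ppf((1 + alpha) / 2)

covered = 0
for _ in range(repeats):
    x = np.random.normal(mu_true, sigma_true, n)
    half_width = z * x.std(ddof=1) / np.sqrt(n)
    # check whether the constructed CI contains the true mean
    if x.mean() - half_width <= mu_true <= x.mean() + half_width:
        covered += 1

print(covered / repeats)  # should be close to 0.95
```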
So far we have seen how to create a CI for the population mean. What if we are interested in quantifying the uncertainty of other parameters, for example, the variance of a random variable? If we estimate these parameters with the maximum likelihood method, we can still construct the CI in a similar fashion using large-sample theory; however, we will not discuss it in this book.
Alternatively, we could use the bootstrap technique.
The bootstrap is simple yet powerful. It is a simulation-based technique. If we want to estimate a quantity $\theta$, we first write the estimator for $\theta$ as a function of a random sample, i.e., $\hat{\theta}=g(X_1,…,X_n)$. Next, we draw a bootstrap sample, i.e., a sample of size $n$ drawn with replacement from the observed data, calculate $\hat{\theta}$ on it, and repeat this process $B$ times to get a collection of estimates denoted $\hat{\theta}^{(1)},…,\hat{\theta}^{(B)}$. From these simulated values of $\hat{\theta}$, we can simply use the percentiles $\hat{\theta}_{(1-\alpha)/2}$ and $\hat{\theta}_{(1+\alpha)/2}$ to construct an $\alpha$ CI. There are also other variants of the bootstrap method based on similar ideas.
Let’s try to use bootstrap to construct a 95% CI for the population variance of a Gaussian distributed random variable.
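A minimal Python sketch of the percentile bootstrap is given below; the simulated data and the number of bootstrap replications $B$ are illustrative:

```python
import numpy as np

np.random.seed(0)
x = np.random.normal(loc=0.0, scale=2.0, size=500)  # observed sample (true variance 4)

B = 2000
boot_vars = np.empty(B)
for i in range(B):
    # draw a bootstrap sample with replacement from the observed data
    xb = np.random.choice(x, size=len(x), replace=True)
    boot_vars[i] = xb.var(ddof=1)

# percentile bootstrap 95% CI for the population variance
ci = np.percentile(boot_vars, [2.5, 97.5])
print(ci)
```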
We have talked about the confidence interval, which is used to quantify the uncertainty in parameter estimation. The root cause of this uncertainty is that we do inference based on random samples. Hypothesis testing is another technique closely related to confidence intervals.
A statistical hypothesis is an assertion about one or more populations; a hypothesis test uses random samples as evidence to judge the assertion. Usually a hypothesis test involves two hypotheses: the null hypothesis $H_0$ and the alternative hypothesis $H_1$.
When we perform a hypothesis test, there are two possible outcomes: a) reject $H_0$ if the evidence is likely to support the alternative hypothesis, and b) fail to reject $H_0$ because of insufficient evidence.
A key point in understanding hypothesis testing is the significance level, usually denoted $\alpha$. Rejecting the null hypothesis when it is true is called a type I error, and the significance level is the probability of committing a type I error. Failing to reject the null hypothesis when the alternative hypothesis is true is called a type II error, and the probability of committing a type II error is denoted $\beta$. $1-\beta$ is called the power of a test.
To conduct a hypothesis test, there are a few steps to follow. First, we specify the null and alternative hypotheses and the significance level. Next, we calculate the test statistic based on the data collected. Finally, we calculate the $p$-value. If the $p$-value is smaller than the significance level, we reject the null hypothesis; otherwise we fail to reject it. Some books describe a procedure that compares the test statistic with a critical region, which is essentially the same as the $p$-value approach. The real challenge in conducting a hypothesis test is calculating the $p$-value, which depends on which hypothesis test is used. Please note that the $p$-value itself is a random variable, since it is calculated from the random sample; and when the null hypothesis is true, the distribution of the $p$-value is uniform between $0$ and $1$ (for a continuous test statistic).
The $p$-value is also a conditional probability. A major misinterpretation of the $p$-value is that it is the probability that the null hypothesis is true given the observed data. Actually, the $p$-value is the probability, computed under the assumption that the null hypothesis is true, of observing a result at least as extreme as the one observed.
For many reasons, we will not go in depth into the calculation of $p$-values in this book. The basic idea is to figure out the distribution of the test statistic. Let’s skip the theory behind it and go to the tools in R/Python.
The one-sample $t$ test is probably the most basic and useful hypothesis test. It can determine whether the sample mean is statistically different from a hypothesized population mean for a continuous random variable. To use the one-sample $t$ test, some assumptions are usually required. For example, the observations should be independent. Another assumption is normality, i.e., the population should be normally distributed or approximately normally distributed. How critical the normality assumption is remains debated, but that discussion is beyond the scope of this book.
In a one-sample $t$ test, the alternative hypothesis can be two-sided or one-sided. A two-sided $H_1$ does not specify whether the population mean is greater or smaller than the hypothesized population mean; in contrast, a one-sided $H_1$ specifies the direction.
Now let’s see how to perform a one-sample $t$ test in R/Python.
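A minimal Python sketch of a two-sided one-sample $t$ test using scipy.stats.ttest_1samp is given below (the data and the hypothesized mean are made up for illustration):

```python
import numpy as np
from scipy.stats import ttest_1samp

np.random.seed(0)
x = np.random.normal(loc=1.0, scale=1.0, size=50)  # simulated observations

# two-sided test of H0: the population mean equals 0.8 (hypothesized value for illustration)
stat, p_value = ttest_1samp(x, popmean=0.8)
print(stat, p_value)  # reject H0 at the 0.05 level if p_value < 0.05
```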
In R, both one-sided and two-sided one-sample $t$ tests can be performed directly with the t.test function, while the Python sketch above only shows a two-sided test. It is feasible to perform a one-sided test in an indirect manner with the same Python function, but I don’t think it’s worth discussing here. For hypothesis testing, it seems R is a better choice than Python.
The two-sample $t$ test is a bit more complex than the one-sample $t$ test. There are two types commonly used in practice: the paired $t$ test and the unpaired $t$ test. In a paired $t$ test, the samples are paired together. For example, we may want to know if there is a significant difference between human blood pressure in the morning and in the evening, and hypothesis testing may help. To do so, we may conduct an experiment and collect the morning and evening blood pressures from a number of participants. Let $X_i$ denote the morning blood pressure and $Y_i$ denote the evening blood pressure of participant $i$. We should pair $X_i$ and $Y_i$ since they are measured from the same person. Then the paired $t$ test can be used to compare the population means. The null hypothesis usually states that the difference of the two population means is equal to a hypothesized value. Just like the one-sample $t$ test, we can do a one-sided or two-sided paired $t$ test.
Now let’s see how to do a paired $t$ test in R.
A paired $t$ test can also be done in Python (e.g., with scipy.stats.ttest_rel), but we will not show the examples.
The unpaired $t$ test also concerns the difference between two population means, but the samples are not paired. For example, we may want to study whether the average blood pressure of men is higher than that of women. In the unpaired $t$ test, we also have to specify whether we assume the two populations have equal variance or not.
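A minimal Python sketch of an unpaired two-sample $t$ test using scipy.stats.ttest_ind is given below; the simulated blood pressure data are made up for illustration:

```python
import numpy as np
from scipy.stats import ttest_ind

np.random.seed(0)
men = np.random.normal(loc=128.0, scale=10.0, size=60)    # hypothetical blood pressures
women = np.random.normal(loc=124.0, scale=10.0, size=55)

# unpaired two-sample t test; equal_var=False gives Welch's t test,
# which does not assume the two populations have equal variance
stat, p_value = ttest_ind(men, women, equal_var=False)
print(stat, p_value)
```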
There are many other important hypothesis tests, such as the chi-squared test^5 and the likelihood-ratio test^6.
1 https://en.wikipedia.org/wiki/Exponential_family
2 https://en.wikipedia.org/wiki/Rejection_sampling
3 http://www.gaussianprocess.org/gpml/chapters/RW2.pdf
4 https://en.wikipedia.org/wiki/Copula_(probability_theory)
5 https://en.wikipedia.org/wiki/Chi-squared_test
6 https://en.wikipedia.org/wiki/Likelihood-ratio_test