Gaussian Process Regressor - Part I

  1. Introduction
  2. Derivation
    2.1. The basic form
    2.2. Using basis functions for the mean function
    2.3. Computational aspects
  3. Implementations for GPR

This time let’s talk about a non-parametric model.

Introduction

A non-parametric model keeps all the training data and uses it directly when making predictions. In contrast, a parametric model distills the training data into a set of "parameters" before prediction, and only those parameters are used for prediction.

Gaussian process regression (GPR), a non-parametric model, has been around for quite a while. It has deep connections to methods such as the radial basis function network (RBFN). In geostatistics, and later in engineering, the method is called kriging. GPR with an explicit mean function is also called generalized least squares (GLS). GPR has two good properties. First, it can handle noise in the sample data. Second, the model provides not only the interpolated value but also the uncertainty of the prediction. These properties can be particularly useful in engineering applications; if there is time, this matter will be explained later.

The idea of GPR is simple. It assumes that all data points are sampled from a Gaussian process, so they follow a joint Gaussian distribution. A new data point then follows a Gaussian distribution as well, conditioned on the training data. The mean of that conditional distribution becomes the prediction, and its variance becomes the error estimate. In this article, the derivation of GPR is presented, with some discussion of computational aspects. Most of the material is based on the GPML book (for the derivation) and the manual of the DACE toolbox (for the computation). The DACE manual also provides the derivation from the GLS point of view.

Derivation

The basic form

Assume the unknown function follows a GP,

$$ f(\mathbf{x}) \sim \mathcal{GP}\big(m(\mathbf{x}),\, k(\mathbf{x},\mathbf{x}')\big) $$

where $m(\mathbf{x})$ is the mean function and $k(\mathbf{x},\mathbf{x}')$ is the covariance function. By definition, they satisfy

$$ \begin{aligned} m(\mathbf{x}) &= \mathrm{E}[f(\mathbf{x})] \\ k(\mathbf{x},\mathbf{x}') &= \mathrm{E}[(f(\mathbf{x})-m(\mathbf{x}))(f(\mathbf{x}')-m(\mathbf{x}'))] \end{aligned} $$

The probabilistic distribution of a set of points $\mathcal{D}=\{\mathbf{X},\mathbf{y}\}$ sampled from the GP is Gaussian,

$$ \mathbf{y} \sim \mathcal{N}(\mathbf{m}, \mathbf{K}) $$

where $[\mathbf{m}]_i = m(\mathbf{X}_i)$ and $[\mathbf{K}]_{ij} = k(\mathbf{X}_i, \mathbf{X}_j)$. Here the input $\mathbf{X}$ can be a matrix with each row being one sample, but only scalar output is considered for now. Now split the points into two sets: the training set $\mathbf{X}_s$ with known output $\mathbf{y}_s$, and the target set $\mathbf{X}_u$ with unknown output $\mathbf{y}_u$. The joint distribution is still Gaussian, with the mean

$$ \mathbf{m}^T = [\mathbf{m}(\mathbf{X}_s)^T, \mathbf{m}(\mathbf{X}_u)^T] \equiv [\mathbf{m}_s^T, \mathbf{m}_u^T] $$

and the block covariance matrix

$$ \mathbf{K} = \begin{bmatrix} \mathbf{K}_{ss} & \mathbf{K}_{su} \\ \mathbf{K}_{us} & \mathbf{K}_{uu} \end{bmatrix} $$

where the $(i,j)$th element of each block is associated with the $i$th and $j$th rows of the corresponding inputs,

$$ [\mathbf{K}_{ab}]_{ij} = k([\mathbf{X}_a]_i, [\mathbf{X}_b]_j) $$

When $\mathbf{y}_s$ is noisy, a white noise term with variance $\sigma_n^2$ is added to the Gaussian distribution, which amounts to adding a diagonal matrix to $\mathbf{K}_{ss}$,

$$ \mathbf{K}_y = \mathbf{K}_{ss} + \sigma_n^2 \mathbf{I}_{ss} $$
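As a concrete illustration of how these covariance blocks are assembled, here is a minimal Python sketch (not part of the original derivation); the squared-exponential kernel and the hyperparameters `ell`, `sigma_f`, `sigma_n` are assumptions made for the example.

```python
import numpy as np

def sq_exp_kernel(Xa, Xb, ell=1.0, sigma_f=1.0):
    """Squared-exponential covariance k(x, x') = sigma_f^2 * exp(-|x - x'|^2 / (2 ell^2))."""
    # Pairwise squared Euclidean distances between rows of Xa and Xb.
    d2 = np.sum(Xa**2, axis=1)[:, None] + np.sum(Xb**2, axis=1)[None, :] - 2.0 * Xa @ Xb.T
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

# Training inputs X_s (one sample per row) and target inputs X_u.
Xs = np.linspace(0.0, 1.0, 6).reshape(-1, 1)
Xu = np.linspace(0.0, 1.0, 21).reshape(-1, 1)

sigma_n = 1e-2                                  # noise standard deviation
K_ss = sq_exp_kernel(Xs, Xs)                    # training-training block
K_su = sq_exp_kernel(Xs, Xu)                    # training-target block
K_us = K_su.T                                   # target-training block
K_uu = sq_exp_kernel(Xu, Xu)                    # target-target block
K_y  = K_ss + sigma_n**2 * np.eye(len(Xs))      # noisy training covariance
```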

The distribution of $\mathbf{y}_u$ given $\mathbf{y}_s$ is found by conditioning the joint Gaussian,

$$ \mathbf{y}_u \,|\, \mathbf{y}_s \sim \mathcal{N}(\mathbf{m}_p, \mathbf{K}_p) $$

where the predictive mean is

$$ \mathbf{m}_p = \mathbf{m}_u + \mathbf{K}_{us}\mathbf{K}_y^{-1}(\mathbf{y}_s - \mathbf{m}_s) $$

and the covariance is

$$ \mathbf{K}_p = \mathbf{K}_{uu} - \mathbf{K}_{us}\mathbf{K}_y^{-1}\mathbf{K}_{su} $$
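A minimal sketch of these conditional formulas, assuming the covariance blocks from the previous snippet and, for simplicity, a known (here zero) mean function; the function name `gp_predict` is mine.

```python
import numpy as np

def gp_predict(K_y, K_us, K_uu, y_s, m_s, m_u):
    """Conditional Gaussian: returns the predictive mean m_p and covariance K_p."""
    # Use linear solves instead of forming K_y^{-1} explicitly.
    alpha = np.linalg.solve(K_y, y_s - m_s)          # K_y^{-1} (y_s - m_s)
    m_p = m_u + K_us @ alpha
    K_p = K_uu - K_us @ np.linalg.solve(K_y, K_us.T) # K_uu - K_us K_y^{-1} K_su
    return m_p, K_p

# Usage with the blocks built above and a zero mean function (an assumption for this sketch):
# y_s = np.sin(2 * np.pi * Xs).ravel()
# m_p, K_p = gp_predict(K_y, K_us, K_uu, y_s, np.zeros(len(Xs)), np.zeros(len(Xu)))
# std = np.sqrt(np.diag(K_p))   # per-point predictive standard deviation
```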

Using basis functions for the mean function

The form of the mean function is usually unknown beforehand, so it is more common to use a set of basis functions,

$$ m(\mathbf{x}) = \mathbf{h}(\mathbf{x})^T \boldsymbol{\beta} $$

The coefficients $\boldsymbol{\beta}$ have to be fitted from the sample data. Under a Bayesian framework, $\boldsymbol{\beta}$ is assumed to follow

$$ \boldsymbol{\beta} \sim \mathcal{N}(\mathbf{b}, \mathbf{B}) $$

Taking $\boldsymbol{\beta}$ into account, the GP from the previous section is modified as

$$ f(\mathbf{x}) \sim \mathcal{GP}\big(\mathbf{h}(\mathbf{x})^T\mathbf{b},\; k(\mathbf{x},\mathbf{x}') + \mathbf{h}(\mathbf{x})^T \mathbf{B}\, \mathbf{h}(\mathbf{x}')\big) $$

Extra terms due to the basis functions are then added to the covariance blocks,

$$ \mathbf{K}_{ab} = \mathbf{K}(\mathbf{X}_a, \mathbf{X}_b) + \mathbf{H}_a \mathbf{B} \mathbf{H}_b^T, \quad [\mathbf{H}_*]_i = \mathbf{h}([\mathbf{X}_*]_i)^T $$
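For illustration, one possible way to build the $\mathbf{H}$ matrices is sketched below, assuming a linear basis $\mathbf{h}(\mathbf{x}) = [1, x_1, \dots, x_d]^T$; the basis choice is an assumption for the example.

```python
import numpy as np

def basis(X):
    """h(x) = [1, x_1, ..., x_d]; row i of the returned matrix is h(x_i)^T."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

# H_s = basis(Xs); H_u = basis(Xu)
# With a Gaussian prior beta ~ N(b, B), each covariance block gains an extra term:
# K_ab_aug = K_ab + H_a @ B @ H_b.T
```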

The extra terms make the predictive mean and covariance from the previous section cumbersome to compute directly. Simplifications are needed, and they are enabled by the matrix inversion lemma,

$$ (\mathbf{A} + \mathbf{U}\mathbf{C}\mathbf{V}^T)^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{U}(\mathbf{C}^{-1} + \mathbf{V}^T\mathbf{A}^{-1}\mathbf{U})^{-1}\mathbf{V}^T\mathbf{A}^{-1} $$

In the current case, the lemma is applied as follows,

$$ (\mathbf{K}_y + \mathbf{H}\mathbf{B}\mathbf{H}^T)^{-1} = \mathbf{K}_y^{-1} - \mathbf{K}_y^{-1}\mathbf{H}\mathbf{C}^{-1}\mathbf{H}^T\mathbf{K}_y^{-1} \equiv \mathbf{K}_y^{-1} - \mathbf{A} $$

where $\mathbf{C} = \mathbf{B}^{-1} + \mathbf{H}^T\mathbf{K}_y^{-1}\mathbf{H}$ and the subscript $s$ is dropped from $\mathbf{H}_s$. Furthermore, the following identities can be derived,

$$ \begin{aligned} \mathbf{B}\mathbf{H}^T(\mathbf{K}_y^{-1} - \mathbf{A}) &= \mathbf{C}^{-1}\mathbf{H}^T\mathbf{K}_y^{-1} \\ (\mathbf{K}_y^{-1} - \mathbf{A})\mathbf{H}\mathbf{B} &= \mathbf{K}_y^{-1}\mathbf{H}\mathbf{C}^{-1} \end{aligned} $$

The simplification procedure basically cancels out matrices by combining the $\mathbf{H}$ and $(\mathbf{K}_y^{-1} - \mathbf{A})$ terms.
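The lemma and the first identity are easy to check numerically; the following sketch (my own, with random symmetric positive-definite matrices standing in for $\mathbf{K}_y$ and $\mathbf{B}$) does exactly that.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 8, 3
# Random SPD K_y and B, random H.
M = rng.standard_normal((n, n)); K_y = M @ M.T + n * np.eye(n)
N = rng.standard_normal((q, q)); B = N @ N.T + q * np.eye(q)
H = rng.standard_normal((n, q))

Ky_inv = np.linalg.inv(K_y)
C = np.linalg.inv(B) + H.T @ Ky_inv @ H
A = Ky_inv @ H @ np.linalg.inv(C) @ H.T @ Ky_inv

# Matrix inversion lemma: (K_y + H B H^T)^{-1} == K_y^{-1} - A
assert np.allclose(np.linalg.inv(K_y + H @ B @ H.T), Ky_inv - A)

# First identity: B H^T (K_y^{-1} - A) == C^{-1} H^T K_y^{-1}
assert np.allclose(B @ H.T @ (Ky_inv - A), np.linalg.solve(C, H.T @ Ky_inv))
```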

The predictive mean simplifies to

$$ \begin{aligned} \mathbf{m}_p^* &= \mathbf{H}_u\mathbf{b} + (\mathbf{K}_{us} + \mathbf{H}_u\mathbf{B}\mathbf{H}^T)(\mathbf{K}_y + \mathbf{H}\mathbf{B}\mathbf{H}^T)^{-1}(\mathbf{y}_s - \mathbf{H}\mathbf{b}) \\ &= \mathbf{K}_{us}\mathbf{K}_y^{-1}\mathbf{y}_s + \mathbf{D}^T\mathbf{C}^{-1}(\mathbf{H}^T\mathbf{K}_y^{-1}\mathbf{y}_s + \mathbf{B}^{-1}\mathbf{b}) \end{aligned} $$

where $\mathbf{D} = \mathbf{H}_u^T - \mathbf{H}^T\mathbf{K}_y^{-1}\mathbf{K}_{su}$. The covariance matrix simplifies to

$$ \begin{aligned} \mathbf{K}_p^* &= \mathbf{K}_{uu} + \mathbf{H}_u\mathbf{B}\mathbf{H}_u^T - (\mathbf{K}_{us} + \mathbf{H}_u\mathbf{B}\mathbf{H}^T)(\mathbf{K}_y + \mathbf{H}\mathbf{B}\mathbf{H}^T)^{-1}(\mathbf{K}_{su} + \mathbf{H}\mathbf{B}\mathbf{H}_u^T) \\ &= \mathbf{K}_p + \mathbf{D}^T\mathbf{C}^{-1}\mathbf{D} \end{aligned} $$
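As a sanity check of this algebra, the sketch below (my own; random matrices stand in for the kernel blocks and basis matrices) compares the direct expressions on the first lines with the simplified $\mathbf{D}$/$\mathbf{C}$ forms on the second lines.

```python
import numpy as np

rng = np.random.default_rng(1)
n_s, n_u, q = 7, 4, 2

# Random jointly-SPD covariance split into blocks (stands in for the kernel matrices).
M = rng.standard_normal((n_s + n_u, n_s + n_u))
K = M @ M.T + (n_s + n_u) * np.eye(n_s + n_u)
K_ss, K_su = K[:n_s, :n_s], K[:n_s, n_s:]
K_us, K_uu = K[n_s:, :n_s], K[n_s:, n_s:]
K_y = K_ss + 0.01 * np.eye(n_s)

H   = rng.standard_normal((n_s, q))   # basis matrix at training inputs
H_u = rng.standard_normal((n_u, q))   # basis matrix at target inputs
Nq  = rng.standard_normal((q, q)); B = Nq @ Nq.T + q * np.eye(q)
b   = rng.standard_normal(q)
y_s = rng.standard_normal(n_s)

Ky_inv = np.linalg.inv(K_y)
C = np.linalg.inv(B) + H.T @ Ky_inv @ H
D = H_u.T - H.T @ Ky_inv @ K_su

# Direct forms (first lines of the equations above).
G = np.linalg.inv(K_y + H @ B @ H.T)
m_direct = H_u @ b + (K_us + H_u @ B @ H.T) @ G @ (y_s - H @ b)
K_direct = K_uu + H_u @ B @ H_u.T - (K_us + H_u @ B @ H.T) @ G @ (K_su + H @ B @ H_u.T)

# Simplified forms (second lines of the equations above).
m_simple = K_us @ Ky_inv @ y_s + D.T @ np.linalg.solve(C, H.T @ Ky_inv @ y_s + np.linalg.solve(B, b))
K_p      = K_uu - K_us @ Ky_inv @ K_su
K_simple = K_p + D.T @ np.linalg.solve(C, D)

assert np.allclose(m_direct, m_simple) and np.allclose(K_direct, K_simple)
```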

Finally, consider the case where the coefficients have a uniform (vague) prior instead of a Gaussian one, i.e. $\mathbf{B}^{-1} \rightarrow \mathbf{O}$, so that the prior mean $\mathbf{b}$ drops out of the formulas. With that, we arrive at the commonly used form of GPR,

$$ \begin{aligned} \mathbf{m}_p^* &= \mathbf{K}_{us}\mathbf{K}_y^{-1}\mathbf{y}_s + \mathbf{D}^T(\mathbf{H}^T\mathbf{K}_y^{-1}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{K}_y^{-1}\mathbf{y}_s \\ &= \mathbf{K}_{us}\bar{\mathbf{g}} + \mathbf{H}_u\bar{\mathbf{b}} \\ \mathbf{K}_p^* &= \mathbf{K}_p + \mathbf{D}^T(\mathbf{H}^T\mathbf{K}_y^{-1}\mathbf{H})^{-1}\mathbf{D} \end{aligned} $$

where $\bar{\mathbf{b}} = (\mathbf{H}^T\mathbf{K}_y^{-1}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{K}_y^{-1}\mathbf{y}_s$ and $\bar{\mathbf{g}} = \mathbf{K}_y^{-1}(\mathbf{y}_s - \mathbf{H}\bar{\mathbf{b}})$. The term $\bar{\mathbf{b}}$ is essentially the set of mean-function coefficients fitted from the sample data. Also, note that some people argue that the last term in $\mathbf{K}_p^*$ can be neglected for simplicity [Sasena2002].
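The following is a minimal, direct implementation of these final formulas using plain linear solves; the function and variable names are mine, and a numerically stabler variant is discussed in the next subsection.

```python
import numpy as np

def gpr_predict_universal(K_y, K_us, K_uu, H, H_u, y_s):
    """GPR with a basis-function mean and a vague prior on the coefficients (B^{-1} -> 0)."""
    Ky_inv_ys = np.linalg.solve(K_y, y_s)            # K_y^{-1} y_s
    Ky_inv_H  = np.linalg.solve(K_y, H)              # K_y^{-1} H
    HtKiH     = H.T @ Ky_inv_H                       # H^T K_y^{-1} H
    b_bar = np.linalg.solve(HtKiH, H.T @ Ky_inv_ys)  # fitted mean-function coefficients
    g_bar = np.linalg.solve(K_y, y_s - H @ b_bar)    # K_y^{-1} (y_s - H b_bar)
    m_p = K_us @ g_bar + H_u @ b_bar                 # predictive mean
    D = H_u.T - H.T @ np.linalg.solve(K_y, K_us.T)   # H_u^T - H^T K_y^{-1} K_su
    K_p = K_uu - K_us @ np.linalg.solve(K_y, K_us.T) + D.T @ np.linalg.solve(HtKiH, D)
    return m_p, K_p
```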

Computational aspects

The computation of $\mathbf{m}_p^*$ and $\mathbf{K}_p^*$ can be tricky because the matrix inversions may be ill-conditioned. The strategy is to combine a Cholesky decomposition and a QR decomposition to stabilize the two inversions, $\mathbf{K}_y^{-1}$ and $(\mathbf{H}^T\mathbf{K}_y^{-1}\mathbf{H})^{-1}$. For the inversion of $\mathbf{K}_y$,

$$ \mathbf{K}_y^{-1}\mathbf{x} = (\mathbf{L}\mathbf{L}^T)^{-1}\mathbf{x} = \mathbf{L}^{-T}(\mathbf{L}^{-1}\mathbf{x}) $$

where $\mathbf{L}$ is a lower triangular matrix, so the matrix inversion is converted into two consecutive triangular solves. To invert $\mathbf{H}^T\mathbf{K}_y^{-1}\mathbf{H}$,

$$ \mathbf{H}^T\mathbf{K}_y^{-1}\mathbf{H} = (\mathbf{L}^{-1}\mathbf{H})^T(\mathbf{L}^{-1}\mathbf{H}) \equiv \mathbf{F}^T\mathbf{F} = \mathbf{R}^T\mathbf{R} $$

where the QR decomposition $\mathbf{Q}\mathbf{R} = \mathbf{F}$ is used, $\mathbf{Q}$ has orthonormal columns, and $\mathbf{R}$ is an upper triangular matrix. $\mathbf{Q}$ is tall and slim, because its number of rows equals the number of samples while its number of columns equals the number of basis functions. With the stabilized matrix inversions introduced above, the quantities $\bar{\mathbf{b}}$ and $\bar{\mathbf{g}}$ in $\mathbf{m}_p^*$ are computed as follows,

$$ \begin{aligned} \bar{\mathbf{b}} &= (\mathbf{H}^T\mathbf{K}_y^{-1}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{K}_y^{-1}\mathbf{y}_s = \mathbf{R}^{-1}[\mathbf{Q}^T(\mathbf{L}^{-1}\mathbf{y}_s)] \\ \bar{\mathbf{g}} &= \mathbf{K}_y^{-1}(\mathbf{y}_s - \mathbf{H}\bar{\mathbf{b}}) = \mathbf{L}^{-T}[\mathbf{L}^{-1}(\mathbf{y}_s - \mathbf{H}\bar{\mathbf{b}})] \end{aligned} $$

The covariance matrix is computed as follows,

$$ \mathbf{K}_p^* = \mathbf{K}_{uu} - (\mathbf{L}^{-1}\mathbf{K}_{su})^2 + \big(\mathbf{R}^{-T}[\mathbf{H}_u^T - \mathbf{F}^T(\mathbf{L}^{-1}\mathbf{K}_{su})]\big)^2 $$

where $(\Box)^2 \equiv \Box^T\Box$.
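A sketch of this stabilized procedure, assuming NumPy/SciPy; the function `gpr_predict_stable` and its argument names are my own, not taken from the DACE toolbox.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular, qr

def gpr_predict_stable(K_y, K_su, K_uu, H, H_u, y_s):
    """Stabilized GPR prediction via Cholesky (K_y = L L^T) and thin QR (L^{-1} H = Q R)."""
    L = cholesky(K_y, lower=True)                        # K_y = L L^T
    F = solve_triangular(L, H, lower=True)               # F = L^{-1} H
    Q, R = qr(F, mode='economic')                        # thin QR: F = Q R
    Linv_ys = solve_triangular(L, y_s, lower=True)       # L^{-1} y_s
    b_bar = solve_triangular(R, Q.T @ Linv_ys)           # R^{-1} [Q^T (L^{-1} y_s)]
    g_bar = solve_triangular(L, solve_triangular(L, y_s - H @ b_bar, lower=True),
                             lower=True, trans='T')      # L^{-T} [L^{-1} (y_s - H b_bar)]
    m_p = K_su.T @ g_bar + H_u @ b_bar                   # predictive mean

    V = solve_triangular(L, K_su, lower=True)            # L^{-1} K_su
    W = solve_triangular(R, H_u.T - F.T @ V, trans='T')  # R^{-T} [H_u^T - F^T (L^{-1} K_su)]
    K_p = K_uu - V.T @ V + W.T @ W                       # predictive covariance
    return m_p, K_p
```

Compared with forming $\mathbf{K}_y^{-1}$ explicitly, every inversion here is a triangular solve, which is both cheaper and better behaved when $\mathbf{K}_y$ is close to singular.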

Implementations for GPR

There are a lot of libraries that implement GPR. A big list is provided here, although it was last updated in 2011. Besides those in the list, there are four noteworthy libraries. The first one is the DACE MATLAB toolbox, which has some GLS flavor. It appears to be very popular in engineering fields, especially those that rely heavily on the MATLAB ecosystem. The second one is the sklearn Python package. Up to version 0.17, GPR is implemented in the gaussian_process.GaussianProcess class as a translation of the DACE toolbox. Afterwards, GPR was re-implemented in the gaussian_process.GaussianProcessRegressor class using the formulation in GPML. Unfortunately, the new implementation only supports a zero mean function. The third library is GPy, a large, comprehensive repository of GP-related models. Finally, the fourth library is GPflow, an elegant implementation under the TensorFlow framework with roots in GPy. Modern variants of GPR are also included, some of which will be discussed in the follow-up articles.
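For reference, a minimal usage example of the modern scikit-learn interface is sketched below; the toy data and kernel hyperparameters are arbitrary choices of mine, and, consistent with the note above, the mean function is implicitly zero.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy 1-D data; the kernel choice and its initial hyperparameters are illustrative only.
X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + 0.05 * np.random.default_rng(0).standard_normal(20)

kernel = RBF(length_scale=0.2) + WhiteKernel(noise_level=1e-3)   # k(x, x') plus a white-noise term
gpr = GaussianProcessRegressor(kernel=kernel).fit(X, y)

X_new = np.linspace(0.0, 1.0, 100).reshape(-1, 1)
mean, std = gpr.predict(X_new, return_std=True)   # predictive mean and standard deviation
```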