In the last article, the Gaussian process regression (GPR) model was discussed. Given the training data, it looks as if all the coefficients, the predictive mean, and the variances can be determined from those formulations. Well, that is not the whole story. There are some extra coefficients in the covariance function of the Gaussian process that still need to be determined, namely the hyperparameters.
The determination of the hyperparameters is a little more involved than the formulations in the last article, as the underlying optimization problem is non-convex.
Covariance function
A covariance function in the Gaussian process takes the form

$$k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 R(\mathbf{x}_i, \mathbf{x}_j)$$

where $\sigma_f^2$ is the process variance, and $R$ is typically stationary, i.e. $R(\mathbf{x}, \mathbf{x}') = R(\mathbf{x} - \mathbf{x}')$, and monotonically decreasing. It is also assumed that $R(\mathbf{0}) = 1$, meaning that a point correlates with itself the most. When the observation is noisy, the covariance function is modified as

$$k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 R(\mathbf{x}_i, \mathbf{x}_j) + \sigma_n^2 \delta_{ij} \equiv \sigma_f^2 \left[ R(\mathbf{x}_i, \mathbf{x}_j) + \tilde{\sigma}_n^2 \delta_{ij} \right]$$

A typical choice of $R$ is the squared exponential function,

$$R(\mathbf{x}_i, \mathbf{x}_j) = \exp\left[ -(\mathbf{x}_i - \mathbf{x}_j)^T M (\mathbf{x}_i - \mathbf{x}_j) \right]$$

where in the isotropic case

$$M = \frac{1}{2l^2} I$$

meaning that the length scales in all dimensions are the same; while in the anisotropic case

$$M = \frac{1}{2}\,\mathrm{diag}(\mathbf{l})^{-2}$$

is a diagonal matrix with positive diagonal entries, representing the length scales in different dimensions. In both cases, the length scales have to be figured out from the training data. The latter case is sometimes also called automatic relevance determination (ARD), meaning that the training process automatically determines whether a particular dimension has a significant influence on the output. See the DACE manual for more discussion.
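To make the notation concrete, here is a minimal NumPy sketch of the noisy squared exponential covariance in the ARD form; the function name, the example data, and the length scales are made up for illustration.

```python
import numpy as np

def ard_sq_exp_cov(X1, X2, sigma_f2, length_scales, sigma_n2=0.0):
    """k(xi, xj) = sigma_f^2 exp[-(xi-xj)^T M (xi-xj)] + sigma_n^2 delta_ij,
    with M = diag(l)^(-2) / 2 (the ARD / anisotropic case)."""
    # Scale each dimension by its length scale, then form squared distances.
    X1s = X1 / length_scales
    X2s = X2 / length_scales
    sq_dist = ((X1s[:, None, :] - X2s[None, :, :]) ** 2).sum(axis=-1)
    K = sigma_f2 * np.exp(-0.5 * sq_dist)
    if X1 is X2:
        # Noise only enters on the diagonal (delta_ij), i.e. when i == j.
        K += sigma_n2 * np.eye(X1.shape[0])
    return K

# Hypothetical usage: 5 points in 2 dimensions, one length scale per dimension.
X = np.random.default_rng(0).normal(size=(5, 2))
K = ard_sq_exp_cov(X, X, sigma_f2=1.5, length_scales=np.array([0.7, 2.0]), sigma_n2=1e-6)
```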
The length scales $\mathbf{l}$ and the variances $\sigma_f^2$ and $\sigma_n^2$ are the hyperparameters to be determined in the training process. In the general sense, the coefficients $\bar{\mathbf{b}}$ for the basis functions are hyperparameters too.
Non-dimensionalization
Before proceeding to the training process, it is beneficial to non-dimensionalize some quantities in the GPR model. First, the covariance matrices $K_y$ and $K_{su}$ can be "non-dimensionalized" by $\sigma_f^2$,

$$\begin{aligned} K_y &\equiv \sigma_f^2 \tilde{K}_y \equiv \sigma_f^2 (\tilde{K}_{ss} + \tilde{\sigma}_n^2 I) \\ K_{su} &\equiv \sigma_f^2 \tilde{K}_{su} \end{aligned}$$
Subsequently, examine the effect of $\sigma_f^2$ on the predictive mean,

$$\mathbf{m}_p^* = K_{su}^T \bar{\mathbf{g}} + H_u^T \bar{\mathbf{b}} = \tilde{K}_{su}^T \tilde{\mathbf{g}} + H_u^T \tilde{\mathbf{b}}$$

where

$$\begin{aligned} \bar{\mathbf{b}} &= (H^T K_y^{-1} H)^{-1} H^T K_y^{-1} \mathbf{y}_s = (H^T \tilde{K}_y^{-1} H)^{-1} H^T \tilde{K}_y^{-1} \mathbf{y}_s \equiv \tilde{\mathbf{b}} \\ \bar{\mathbf{g}} &= K_y^{-1} (\mathbf{y}_s - H \bar{\mathbf{b}}) = \sigma_f^{-2} \tilde{K}_y^{-1} (\mathbf{y}_s - H \tilde{\mathbf{b}}) \equiv \sigma_f^{-2} \tilde{\mathbf{g}} \end{aligned}$$

And the covariance,

$$K_p^* = K_{uu} - K_{us} K_y^{-1} K_{su} + D^T (H^T K_y^{-1} H)^{-1} D = \sigma_f^2 \left[ \tilde{K}_{uu} - \tilde{K}_{us} \tilde{K}_y^{-1} \tilde{K}_{su} + \tilde{D}^T (H^T \tilde{K}_y^{-1} H)^{-1} \tilde{D} \right]$$

where

$$D = H_u^T - H^T K_y^{-1} K_{su} = H_u^T - H^T \tilde{K}_y^{-1} \tilde{K}_{su} \equiv \tilde{D}$$

In sum, $\sigma_f^2$ has no direct effect on the predictive mean, but the predictive covariance is proportional to $\sigma_f^2$, once the other hyperparameters are given. The question is, how to obtain the hyperparameters?
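This scaling argument can be checked numerically. The sketch below assembles the predictive mean and covariance from the formulas above, using a constant basis function and a one-dimensional toy data set, and evaluates them for two values of $\sigma_f^2$: the mean stays the same, while the covariance scales proportionally. All names and data are made up, and the array shapes are arranged so the products are conformable, which may differ slightly from the transpose convention above.

```python
import numpy as np

def gpr_predict(Ky, Ksu, Kuu, H, Hu, ys):
    # Predictive mean and covariance with explicit basis functions,
    # following the formulas above (H: n x m, Hu: n_u x m).
    Ky_inv = np.linalg.inv(Ky)
    A = H.T @ Ky_inv @ H                       # H^T Ky^{-1} H
    b_bar = np.linalg.solve(A, H.T @ Ky_inv @ ys)
    g_bar = Ky_inv @ (ys - H @ b_bar)
    mean = Ksu.T @ g_bar + Hu @ b_bar
    D = Hu.T - H.T @ Ky_inv @ Ksu              # m x n_u
    cov = Kuu - Ksu.T @ Ky_inv @ Ksu + D.T @ np.linalg.solve(A, D)
    return mean, cov

rng = np.random.default_rng(1)
Xs, Xu = rng.normal(size=(8, 1)), rng.normal(size=(3, 1))
ys = np.sin(Xs).ravel()
H, Hu = np.ones((8, 1)), np.ones((3, 1))       # constant mean basis function

def corr(A, B):                                 # unit-variance SE correlation
    return np.exp(-0.5 * (A - B.T) ** 2)

R_ss, R_su, R_uu = corr(Xs, Xs), corr(Xs, Xu), corr(Xu, Xu)
noise = 1e-8 * np.eye(8)                        # sigma_n2_tilde * I
for sf2 in (1.0, 4.0):                          # scale the whole kernel by sigma_f^2
    m, C = gpr_predict(sf2 * (R_ss + noise), sf2 * R_su, sf2 * R_uu, H, Hu, ys)
    print(m.round(6), (np.trace(C) / sf2).round(6))  # mean unchanged; trace(C)/sf2 unchanged
```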
Learning the hyperparameters
Log-likelihood
The hyperparameters are determined using maximum likelihood estimation (MLE). With the joint Gaussian distribution, the log marginal likelihood of the training data is

$$\mathcal{L} = \log p(\mathbf{y}_s \mid X_s, \mathbf{b}, B) = -\frac{1}{2} (\mathbf{y}_s - H\mathbf{b})^T (K_y + H^T B H)^{-1} (\mathbf{y}_s - H\mathbf{b}) - \frac{1}{2} \log |K_y + H^T B H| - \frac{n}{2} \log 2\pi$$

where $n$ is the number of training data points. Utilizing the determinant counterpart of the matrix inversion lemma,

$$|A + U C V^T| = |A|\,|C|\,|C^{-1} + V^T A^{-1} U|$$

and letting $\mathbf{b} = \mathbf{0}$ and $B^{-1} \to O$ as was done last time,

$$-2\mathcal{L} = \mathbf{y}_s^T \left[ K_y^{-1} - K_y^{-1} H (H^T K_y^{-1} H)^{-1} H^T K_y^{-1} \right] \mathbf{y}_s + \log |K_y| + \log |H^T K_y^{-1} H| + (n - m) \log 2\pi$$

where $m$ is the number of basis functions, introduced due to the singularity caused by $|B|$. An interesting simplification can be applied to the first term in the above expression,

$$\begin{aligned} I_1 &= \mathbf{y}_s^T \left[ K_y^{-1} - K_y^{-1} H (H^T K_y^{-1} H)^{-1} H^T K_y^{-1} \right] \mathbf{y}_s = \mathbf{y}_s^T K_y^{-1} (\mathbf{y}_s - H\bar{\mathbf{b}}) \equiv I_2 \\ I_1 &= (\mathbf{y}_s - H\bar{\mathbf{b}})^T K_y^{-1} \mathbf{y}_s \equiv I_3 \\ I_1 &= \mathbf{y}_s^T K_y^{-1} \mathbf{y}_s - \bar{\mathbf{b}}^T H^T K_y^{-1} H \bar{\mathbf{b}} \equiv I_4 \\ I_1 &= I_2 + I_3 - I_4 = (\mathbf{y}_s - H\bar{\mathbf{b}})^T K_y^{-1} (\mathbf{y}_s - H\bar{\mathbf{b}}) \end{aligned}$$

indicating that the term $I_1$ behaves as if it is centered at $H\bar{\mathbf{b}}$, with a covariance matrix $K_y$. Furthermore, factoring $\mathcal{L}$ with $\sigma_f^2$,

$$-2\tilde{\mathcal{L}}_1 = \sigma_f^{-2} (\mathbf{y}_s - H\tilde{\mathbf{b}})^T \tilde{K}_y^{-1} (\mathbf{y}_s - H\tilde{\mathbf{b}}) + \log |\tilde{K}_y| + \log |H^T \tilde{K}_y^{-1} H| + (n - m) \log \sigma_f^2$$

where the constant term is neglected.
If the prior on the coefficients of the basis functions is ignored, the log-likelihood simplifies to

$$-2\tilde{\mathcal{L}}_2 = \sigma_f^{-2} (\mathbf{y}_s - H\tilde{\mathbf{b}})^T \tilde{K}_y^{-1} (\mathbf{y}_s - H\tilde{\mathbf{b}}) + \log |\tilde{K}_y| + n \log \sigma_f^2$$
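As a sanity check, the two negative log-likelihoods above can be evaluated directly. The sketch below is a straightforward (not numerically optimized) implementation; the function and variable names are made up.

```python
import numpy as np

def neg2_loglik(Ky_tilde, H, ys, sigma_f2, with_basis_terms=True):
    # Evaluate -2*L1_tilde (with_basis_terms=True) or -2*L2_tilde (False),
    # dropping the constant terms, given the scaled covariance Ky_tilde.
    n, m = H.shape
    K_inv = np.linalg.inv(Ky_tilde)
    A = H.T @ K_inv @ H                           # H^T Ky_tilde^{-1} H
    b_tilde = np.linalg.solve(A, H.T @ K_inv @ ys)
    r = ys - H @ b_tilde
    quad = (r @ K_inv @ r) / sigma_f2
    _, logdet_K = np.linalg.slogdet(Ky_tilde)
    if with_basis_terms:
        _, logdet_A = np.linalg.slogdet(A)
        return quad + logdet_K + logdet_A + (n - m) * np.log(sigma_f2)
    return quad + logdet_K + n * np.log(sigma_f2)
```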
Process variances
A natural next step would be finding $\sigma_f^2$ by setting $\partial(-2\tilde{\mathcal{L}})/\partial\sigma_f^2 = 0$ and solving for $\sigma_f^2$,

$$\sigma_f^2 = \frac{1}{N} (\mathbf{y}_s - H\tilde{\mathbf{b}})^T \tilde{K}_y^{-1} (\mathbf{y}_s - H\tilde{\mathbf{b}})$$

where $N = n - m$ for $\tilde{\mathcal{L}}_1$, and $N = n$ for $\tilde{\mathcal{L}}_2$. The former is analogous to the bias-corrected sample variance (dividing by $n - m$), while the latter is the MLE. When there are multiple outputs, the output variables are usually assumed to be independent, and $\sigma_f^2$ can be computed separately for each output.
Plugging this value back into the likelihood and removing the constant terms [Welch1992], one obtains the reduced log-likelihood,

$$\begin{aligned} -F_1 &= \log |\tilde{K}_y| + \log |H^T \tilde{K}_y^{-1} H| + (n - m) \log \sigma_f^2 \\ -F_2 &= \log |\tilde{K}_y| + n \log \sigma_f^2 \end{aligned}$$

where the only remaining unknown hyperparameters are the length scales $\mathbf{l}$. Note that $F_2$ should be used if a general mean function, without priors on its coefficients, is employed.
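Putting the two pieces together, here is a hedged sketch of the closed-form variance estimate and the resulting reduced log-likelihood, to be minimized over the length scales that parameterize $\tilde{K}_y$; again, the names are mine.

```python
import numpy as np

def sigma_f2_and_reduced_nll(Ky_tilde, H, ys, use_F1=True):
    # Profile out sigma_f^2 with its closed-form estimate and return
    # (sigma_f^2, -F), where -F is minimized over the length scales.
    n, m = H.shape
    K_inv = np.linalg.inv(Ky_tilde)
    A = H.T @ K_inv @ H
    b_tilde = np.linalg.solve(A, H.T @ K_inv @ ys)
    r = ys - H @ b_tilde
    N = n - m if use_F1 else n
    sigma_f2 = (r @ K_inv @ r) / N                # closed-form variance estimate
    _, logdet_K = np.linalg.slogdet(Ky_tilde)
    if use_F1:
        _, logdet_A = np.linalg.slogdet(A)
        return sigma_f2, logdet_K + logdet_A + (n - m) * np.log(sigma_f2)
    return sigma_f2, logdet_K + n * np.log(sigma_f2)
```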
Finding the hyperparameters
There appear to be two strategies for determining the hyperparameters: the two-step method and the one-step method.
The two-step method utilizes the reduced log-likelihood. This is the method used in the DACE toolbox and its Python translation in sklearn. The hyperparameters are divided into two groups: (1) the length scales $\mathbf{l}$, and (2) the variances $\sigma_f^2$ and $\sigma_n^2$. The primary unknown is $\mathbf{l}$; $\sigma_f^2$ is a function of $\mathbf{l}$, and $\sigma_n^2$ is set externally, mainly as a device for numerical stability (namely the "nugget"). The objective function is $F_2$. In the multi-output case, $F$ for each output is computed using the same set of $\mathbf{l}$, and the results are summed up as the final objective function.
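A minimal sketch of the two-step strategy is given below, under the assumption that the nugget $\tilde{\sigma}_n^2$ is fixed and that a derivative-free scipy routine stands in for DACE's own pattern-search optimizer. It reuses the hypothetical `sigma_f2_and_reduced_nll` helper sketched above, and the data are synthetic.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
Xs = rng.uniform(size=(20, 2))
ys = np.sin(3 * Xs[:, 0]) + Xs[:, 1]
H = np.ones((20, 1))                              # constant basis function
sigma_n2_tilde = 1e-10                            # externally fixed "nugget"

def objective(log_l):
    # Step 1 objective: F2 as a function of the log length scales only.
    l = np.exp(log_l)                             # log-parameterization keeps l positive
    d = (Xs[:, None, :] - Xs[None, :, :]) / l
    Ky_tilde = np.exp(-0.5 * (d ** 2).sum(-1)) + sigma_n2_tilde * np.eye(len(Xs))
    _, nll = sigma_f2_and_reduced_nll(Ky_tilde, H, ys, use_F1=False)
    return nll

res = minimize(objective, x0=np.zeros(2), method="Nelder-Mead")
# Step 2: recover sigma_f^2 in closed form at the optimal length scales.
length_scales = np.exp(res.x)
```

In the multi-output case, the same objective would simply be summed over the outputs with a shared set of length scales, as described above.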
The one-step method tries to determine all the hyperparameters, even including $\bar{\mathbf{b}}$, using a gradient-based method. The difficulties are the efficiency of gradient-based methods with this many variables and the implementation of the gradients. The first item is not a major issue given the developments in modern computational science. The second item could indeed be troublesome. But that is where the elegance of GPflow is revealed. It is based on TensorFlow, and TF handles the gradients efficiently with its computation graph model. In GPflow, one can focus on the objective functions. In its GPR implementation, $\tilde{\mathcal{L}}_2$ is employed. A limitation is that, in the multi-output case, all the outputs have to use the same $\sigma_f^2$. A remedy is to add a wrapper to its GPR class and compute the MLEs of $\sigma_f^2$ for each output.
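For reference, a hedged GPflow sketch of the one-step approach, assuming the GPflow 2.x API (the details differ in older versions, which were closer to the static computation-graph style described above); the data are synthetic.

```python
import numpy as np
import gpflow

rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 2))
Y = np.sin(3 * X[:, :1]) + 0.05 * rng.normal(size=(30, 1))

# ARD squared exponential kernel: one length scale per input dimension.
kernel = gpflow.kernels.SquaredExponential(lengthscales=[1.0, 1.0])
model = gpflow.models.GPR(data=(X, Y), kernel=kernel)

# One-step training: length scales, signal variance, and noise variance are
# optimized together with a gradient-based method; TensorFlow supplies the
# gradients automatically.
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)
gpflow.utilities.print_summary(model)
```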
In the newer version of GPR in sklearn, the gradient of $F_2$ w.r.t. $\mathbf{l}$ is introduced, and gradient-based optimization methods like L-BFGS-B are employed. However, in the new implementation, the mean function is hardcoded to be zero, meaning that $F_1 = F_2$. Furthermore, the variance $\sigma_f^2$ is fixed to one, probably for easier computation of the gradients. This version essentially rules out the second group of hyperparameters and determines the first group using a gradient-based method. Therefore, the approach in this version is closer to the one-step method.
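A corresponding hedged sklearn sketch (current GaussianProcessRegressor API; defaults may differ across versions), using an anisotropic RBF kernel whose log length scales are optimized with L-BFGS-B; the data are synthetic.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 2))
y = np.sin(3 * X[:, 0]) + 0.05 * rng.normal(size=30)

# Anisotropic (ARD) RBF kernel plus a learned noise term; the default optimizer
# is L-BFGS-B on the log marginal likelihood w.r.t. the log hyperparameters,
# and n_restarts_optimizer adds random restarts against local optima.
kernel = RBF(length_scale=[1.0, 1.0]) + WhiteKernel(noise_level=1e-5)
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)
gpr.fit(X, y)
print(gpr.kernel_)   # fitted length scales and noise level
```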
The GPy implementation will not be discussed here, as I am not familiar with it.