Gaussian Process Regressor - Addendum

  1. Updating scheme
  2. Gradients

A collection of extra topics as a sequel to the GPR series.

Updating scheme

Consider a scenario where a GP model is trained using a large sample data set ($N$ points). Recall from the first and the second articles that the training process determines the hyperparameters and generates a number of coefficient matrices for the efficient computation of the predictive mean and variance. Now suppose a few new samples ($M$ points, $M \ll N$) are to be appended to the data set. The question is, does one simply retrain the GP model?

Not necessarily. The new samples can reduce the prediction error and the overall uncertainty but may have limited effect on the hyperparameters. Therefore, with a few new samples, one can update the GP model by keeping the old hyperparameters and recomputing only the coefficient matrices, including $\mathbf{L}$, $\mathbf{F}$, $\mathbf{Q}$ and $\mathbf{R}$.

Strictly speaking, the addition of new samples will result in the recomputation of everything, even though the hyperparameters remain the same: the std and mean of the samples are modified, so is the correlation matrix $\mathbf{K}$, and so are the follow-up matrix decompositions. However, since $M \ll N$, one can assume that the std and mean of the new sample data set are the same as those of the old one. As a result, the portion of $\mathbf{K}$ associated with the old sample data set remains the same, and the new coefficient matrices can be computed via partial matrix decompositions, which saves a considerable amount of time: the cost drops from $O(N^3)$ to $O(N^2)$.

The new covariance matrix is

$$\mathbf{K} = \begin{bmatrix} \mathbf{K}_{ss} & \mathbf{K}_{sn} \\ \mathbf{K}_{ns} & \mathbf{K}_{nn} \end{bmatrix}$$

where $\mathbf{K}_{ss}$ is the block associated with the old data set, whose Cholesky factor $\mathbf{L}_s$ is known. The blocks $\mathbf{K}_{ns} = \mathbf{K}_{sn}^T$ and $\mathbf{K}_{nn}$ are due to the new samples. The Cholesky factor $\mathbf{L}$ of $\mathbf{K}$ is obtained via the following procedure of partial Cholesky decomposition (a code sketch follows the list),

  1. Linear solve: $\mathbf{L}_{sn} = \mathbf{L}_s^{-1}\mathbf{K}_{sn}$.
  2. Cholesky decomposition of the Schur complement: $\mathbf{K}_{nn} - \mathbf{L}_{sn}^T\mathbf{L}_{sn} = \mathbf{L}_n\mathbf{L}_n^T$.
  3. Assemble the Cholesky factor, $\mathbf{L} = \begin{bmatrix} \mathbf{L}_s & \mathbf{O} \\ \mathbf{L}_{sn}^T & \mathbf{L}_n \end{bmatrix}$.
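As a concrete illustration, here is a minimal NumPy/SciPy sketch of the procedure, assuming the blocks above are available as plain NumPy arrays; the function and variable names are illustrative, not the implementation from the series.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def update_cholesky(L_s, K_sn, K_nn):
    """Extend the Cholesky factor L_s of K_ss to the factor of the full K."""
    # 1. Linear solve: L_sn = L_s^{-1} K_sn
    L_sn = solve_triangular(L_s, K_sn, lower=True)
    # 2. Cholesky factor of the Schur complement K_nn - L_sn^T L_sn
    L_n = cholesky(K_nn - L_sn.T @ L_sn, lower=True)
    # 3. Assemble [[L_s, 0], [L_sn^T, L_n]]
    N, M = L_sn.shape
    L = np.zeros((N + M, N + M))
    L[:N, :N] = L_s
    L[N:, :N] = L_sn.T
    L[N:, N:] = L_n
    return L, L_sn, L_n
```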

Next, the new $\mathbf{F}$ matrix is

$$\mathbf{F} = \mathbf{L}^{-1}\mathbf{H} = \begin{bmatrix} \mathbf{L}_s & \mathbf{O} \\ \mathbf{L}_{sn}^T & \mathbf{L}_n \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{H}_s \\ \mathbf{H}_n \end{bmatrix} = \begin{bmatrix} \mathbf{F}_s \\ \mathbf{L}_n^{-1} (\mathbf{H}_n - \mathbf{L}_{sn}^T\mathbf{F}_s) \end{bmatrix}$$

where block matrix inversion is used and $\mathbf{F}_s = \mathbf{L}_s^{-1}\mathbf{H}_s$ is known from the old model.
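Continuing the sketch, the new rows of $\mathbf{F}$ cost only one additional triangular solve (again, the names are illustrative):

```python
import numpy as np
from scipy.linalg import solve_triangular

def update_F(F_s, H_n, L_sn, L_n):
    """Append the rows of F contributed by the new samples."""
    # F_n = L_n^{-1} (H_n - L_sn^T F_s), from the block inverse of L
    F_n = solve_triangular(L_n, H_n - L_sn.T @ F_s, lower=True)
    return np.vstack([F_s, F_n])
```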

Finally, since $\mathbf{F}_s = \mathbf{Q}_s\mathbf{R}_s$ is known, one can utilize a QR update algorithm to obtain the QR decomposition of $\mathbf{F}$. Such an algorithm is implemented in standard packages for scientific computation, such as scipy.linalg.qr_insert.
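A sketch of this last step, assuming `Q_s` and `R_s` are the factors of a full (not economic) QR decomposition of $\mathbf{F}_s$, which is the form `scipy.linalg.qr_insert` expects here; the shapes are illustrative only.

```python
import numpy as np
from scipy.linalg import qr, qr_insert

rng = np.random.default_rng(0)
N, M, p = 200, 5, 3                      # old samples, new samples, basis functions
F_s = rng.standard_normal((N, p))        # stands in for the old F matrix
F_n = rng.standard_normal((M, p))        # stands in for the new rows from update_F

Q_s, R_s = qr(F_s)                                 # full mode: Q_s is (N, N), R_s is (N, p)
Q, R = qr_insert(Q_s, R_s, F_n, N, which='row')    # append the M new rows at the bottom

# Sanity check: the updated factors reproduce the stacked matrix.
# The p x p triangular block used in the predictive formulas is R[:p, :p].
assert np.allclose(Q @ R, np.vstack([F_s, F_n]))
```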

With the new coefficient matrices $\mathbf{L}$, $\mathbf{F}$, $\mathbf{Q}$ and $\mathbf{R}$, the GP model is updated with the new sample data set.

Gradients

In some scenarios where a GP model is involved, the gradients of the mean and the variance w.r.t. the input are required. One example is surrogate-based optimization, where the GP model serves as the surrogate and a gradient-based algorithm is chosen to find the optima. It is necessary to compute the gradients analytically (or using automatic differentiation) instead of using finite differences. The latter could destabilize the iterations of gradient descent near some “singularity” points at which the GP model is ill-defined.

Note that, in the trivial treatment of the multiple output case, the process variances $\sigma_f^2$ of the output are determined independently. Therefore, it is sufficient to consider the single output case. From the first article of the GPR series, the mean and variance at a single point are, respectively,

$$\begin{aligned} m(\mathbf{x}) &= \bar{\mathbf{g}}^T\mathbf{k}_u(\mathbf{x}) + \bar{\mathbf{b}}^T\mathbf{h}_u(\mathbf{x}) \\ \sigma^2(\mathbf{x}) &= \sigma_f^2 - [\mathbf{L}^{-1}\mathbf{k}_u(\mathbf{x})]^2 + [\mathbf{R}^{-T} [\mathbf{h}_u(\mathbf{x}) - \mathbf{F}^T[\mathbf{L}^{-1}\mathbf{k}_u(\mathbf{x})]]]^2 \\ &\equiv \sigma_f^2 - \mathbf{e}_1^T\mathbf{e}_1 + \mathbf{e}_2^T\mathbf{e}_2 \end{aligned}$$

where $\mathbf{k}_u = \mathbf{K}_{su}$ and $\mathbf{h}_u = \mathbf{H}_u^T$, and

$$\begin{aligned} \bar{\mathbf{b}} &= \mathbf{R}^{-1} [\mathbf{Q}^T(\mathbf{L}^{-1}\mathbf{y}_s)] \\ \bar{\mathbf{g}} &= \mathbf{L}^{-T} [\mathbf{L}^{-1} (\mathbf{y}_s - \mathbf{H}\bar{\mathbf{b}})] \end{aligned}$$
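For later use, here is a minimal sketch of these two formulas. The argument names and shapes are assumptions for illustration: `k` is $\mathbf{k}_u(\mathbf{x})$ with shape `(N,)`, `h` is $\mathbf{h}_u(\mathbf{x})$ with shape `(p,)`, `L` is the $N \times N$ Cholesky factor, `F` is $N \times p$, and `R` is the $p \times p$ triangular factor from the QR decomposition of $\mathbf{F}$.

```python
from scipy.linalg import solve_triangular

def predict_single(k, h, L, F, R, g_bar, b_bar, sigma_f2):
    """Predictive mean and variance at a single point (single-output case)."""
    mean = g_bar @ k + b_bar @ h
    e1 = solve_triangular(L, k, lower=True)                # e1 = L^{-1} k_u
    e2 = solve_triangular(R.T, h - F.T @ e1, lower=True)   # e2 = R^{-T} (h_u - F^T e1)
    var = sigma_f2 - e1 @ e1 + e2 @ e2
    return mean, var, e1, e2
```

The intermediate vectors `e1` and `e2` are returned because the variance gradient below reuses them.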

The gradient of the mean is computed in a straightforward manner,

$$\frac{\partial m}{\partial \mathbf{x}} = \bar{\mathbf{g}}^T\frac{\partial \mathbf{k}_u}{\partial \mathbf{x}} + \bar{\mathbf{b}}^T\frac{\partial \mathbf{h}_u}{\partial \mathbf{x}}$$

where $\frac{\partial \mathbf{k}_u}{\partial \mathbf{x}}$ and $\frac{\partial \mathbf{h}_u}{\partial \mathbf{x}}$ are obtained from the correlation and mean functions, respectively.

The computation of the gradient of the variance is a little more involved,

$$\begin{aligned} \frac{\partial \sigma^2}{\partial \mathbf{x}} &= -2\mathbf{e}_1^T\mathbf{L}^{-1}\frac{\partial \mathbf{k}_u}{\partial \mathbf{x}} + 2\mathbf{e}_2^T\mathbf{R}^{-T}\left(\frac{\partial \mathbf{h}_u}{\partial \mathbf{x}} - \mathbf{F}^T\mathbf{L}^{-1}\frac{\partial \mathbf{k}_u}{\partial \mathbf{x}}\right) \\ &= -2[\mathbf{L}^{-T}(\mathbf{F}\mathbf{r} + \mathbf{e}_1)]^T \frac{\partial \mathbf{k}_u}{\partial \mathbf{x}} + 2\mathbf{r}^T\frac{\partial \mathbf{h}_u}{\partial \mathbf{x}} \end{aligned}$$

where $\mathbf{r} = \mathbf{R}^{-1}\mathbf{e}_2$. The computation is organized so as to minimize the number of triangular linear solves.
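A sketch of both gradients, reusing `e1` and `e2` from `predict_single` above. Here `dk` and `dh` stand for the Jacobians $\partial\mathbf{k}_u/\partial\mathbf{x}$ (shape `(N, d)`) and $\partial\mathbf{h}_u/\partial\mathbf{x}$ (shape `(p, d)`) supplied by the correlation and mean functions; as before, the names are illustrative.

```python
from scipy.linalg import solve_triangular

def predict_gradients(dk, dh, L, F, R, g_bar, b_bar, e1, e2):
    """Gradients of the predictive mean and variance w.r.t. the d input dimensions."""
    dm = g_bar @ dk + b_bar @ dh                         # mean gradient, shape (d,)
    r = solve_triangular(R, e2, lower=False)             # r = R^{-1} e2
    w = solve_triangular(L.T, F @ r + e1, lower=False)   # w = L^{-T} (F r + e1)
    dv = -2.0 * (w @ dk) + 2.0 * (r @ dh)                # variance gradient, shape (d,)
    return dm, dv
```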