Gaussian Process Regressor - Part III

  1. Limitations of GPR
  2. Towards Large Datasets
  3. A Unifying View
    3.1. Outline of framework
    3.2. Approximations
  4. More on GPRFITC

Finally, let’s talk about some variants of GPR.

Limitations of GPR

In previous articles (here and here), the classical formulation of Gaussian process regression (GPR) was presented, and the relevant software implementations were discussed and compared. GPR is non-parametric, and thus very flexible: it can handle essentially any data set, and it provides an error estimate along with each prediction. However, a non-parametric model like GPR has inherent disadvantages. The prediction depends on the whole training data set; the model does not distill the data into a compact set of parameters, so it is essentially brute-force curve fitting. Furthermore, the model is dense: making a prediction requires access to all the training data, resulting in expensive dense matrix operations. Given $N$ training data points, the training stage costs $O(N^3)$ due to the dense matrix inversion, and the prediction stage costs $O(N^2)$ due to the dense matrix multiplication.
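To make these costs concrete, here is a minimal NumPy sketch of zero-mean GPR; the kernel choice and all names are illustrative only, not the exact implementation from the earlier articles.

```python
import numpy as np

def rbf(X1, X2, ell=1.0, sf2=1.0):
    """Squared-exponential kernel, used here purely for illustration."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return sf2 * np.exp(-0.5 * d2 / ell**2)

def gpr_fit_predict(Xs, ys, Xu, sn2=1e-2):
    """Zero-mean GPR: the dense N x N operations dominate the cost."""
    N = len(Xs)
    Kss = rbf(Xs, Xs) + sn2 * np.eye(N)      # dense N x N covariance
    L = np.linalg.cholesky(Kss)              # O(N^3): the training bottleneck
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, ys))
    Kus = rbf(Xu, Xs)                        # prediction touches all N training points
    mean = Kus @ alpha                       # O(N^2) work per prediction batch
    v = np.linalg.solve(L, Kus.T)
    var = rbf(Xu, Xu) - v.T @ v              # predictive covariance
    return mean, var
```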

Towards Large Datasets

In an effort to cure the weaknesses of GPR, many modern variants have been developed. The basic idea is to reduce the effective $N$ in the model. The classification of these variants has been well discussed in Chap. 8 of GPML and in Quin2005.

The variants can be roughly classified into two types. In the first type, the covariance matrices are replaced by low-rank approximations, so that the cost of the matrix operations depends on the rank instead of $N$. The major criticism is that the GP becomes degenerate due to the approximation: a degenerate GP predicts low or even zero variance in regions far away from the training data, which is not desirable. In the second type, a subset of the data points is selected to represent the whole data set, thus directly reducing $N$. The challenge is how to select these points; a greedy method might work but is probably suboptimal. There is also the variational approach that sparsifies the GPR, but I am not familiar with this aspect yet.

One interesting approach is the relevance vector machine (RVM). It applies the idea of automatic relevance determination (ARD) to find the training data points that are of the most "relevance" to the model. Subsequently, a kernel is constructed from these "relevance vectors" to form the final prediction model. Due to this construction of the kernel, RVM is a degenerate GP, which is criticized in Rasm2005. A fun fact about RVM is that its 2001 algorithm is patented, which partially discouraged its integration into some mainstream packages. Its faster 2003 algorithm is not patented, though. One good Python implementation of RVM can be found here.

Another approach of the second type is GPR with pseudo-inputs Snel2008, also termed the fully independent training conditional (FITC) approximation. This approach determines a set of "pseudo-inputs" from the training data, not necessarily points in the data set itself, and builds the GPR over them. The GP remains non-degenerate, but the effective $N$ can be reduced significantly, sometimes by two orders of magnitude, while the prediction quality deteriorates only slightly.
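As a taste of how this looks in practice, here is a hedged usage sketch based on GPflow's FITC model; the class name `gpflow.models.GPRFITC` and its constructor arguments are assumptions that may differ across GPflow versions, and the data, kernel, and number of pseudo-inputs are purely illustrative.

```python
import numpy as np
import gpflow

# Toy 1-D data set; sizes and values are illustrative only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (2000, 1))
Y = np.sin(12 * X) + 0.1 * rng.standard_normal((2000, 1))

# A small set of pseudo-inputs, initialized from a random subset of the data.
Z = X[rng.choice(len(X), size=20, replace=False)].copy()

# Assumed GPflow API: the data, a kernel, and the pseudo-inputs (inducing points).
model = gpflow.models.GPRFITC(
    data=(X, Y),
    kernel=gpflow.kernels.SquaredExponential(),
    inducing_variable=Z,
)

# Hyperparameters, including the pseudo-input locations, are trained jointly
# by a gradient-based optimizer (see the last section of this article).
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)
mean, var = model.predict_f(rng.uniform(0, 1, (5, 1)))
```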

A Unifying View

In Quin2005, a unified framework for sparse GPR (and GPR itself) is proposed, from which the variants discussed above can be derived.

Outline of framework

In the framework, one has to differentiate between two types of variables: the observed noisy output $\mathbf{y}$, and the underlying noise-free latent output $\mathbf{f}$,

$$\mathbf{y} = \mathbf{f} + \epsilon$$

where $\epsilon\sim\mathcal{N}(0,\sigma_n^2)$ is the noise, assumed to be additive and independent. The conditional therefore satisfies $\mathbf{y}|\mathbf{f}\sim\mathcal{N}(\mathbf{f},\sigma_n^2\mathbf{I})$. The training data is considered noisy, i.e. $\mathbf{y}_s$ is provided for training. However, the prediction is expected to be noise-free, i.e. $\mathbf{f}_u$ is desired. Using a Bayesian approach, the prediction can be formulated as follows,

$$\begin{aligned} p(\mathbf{f}_u|\mathbf{y}_s) &= \int p(\mathbf{f}_s,\mathbf{f}_u|\mathbf{y}_s)\, d\mathbf{f}_s \\ &= \int \frac{p(\mathbf{y}_s|\mathbf{f}_s,\mathbf{f}_u)\,p(\mathbf{f}_s,\mathbf{f}_u)}{p(\mathbf{y}_s)}\, d\mathbf{f}_s \\ &= \frac{1}{p(\mathbf{y}_s)} \int p(\mathbf{y}_s|\mathbf{f}_s)\,p(\mathbf{f}_s,\mathbf{f}_u)\, d\mathbf{f}_s \end{aligned}$$

where $\mathbf{f}_s$ is integrated out, or "marginalized", as it is of no interest to the prediction. The preceding term $p(\mathbf{y}_s)$ is essentially a normalizing factor and will not be computed explicitly. The remaining term $p(\mathbf{f}_s,\mathbf{f}_u)$ in the integrand satisfies the joint Gaussian distribution, as in the previous derivations. This is where the large dense covariance matrix is introduced, and this is where the simplification will come in.

In the framework, a set of inducing points $\mathbf{f}_i$ is assumed to be sampled from the GP, like $\mathbf{f}_s$ and $\mathbf{f}_u$. Therefore, the probability should satisfy,

$$p(\mathbf{f}_s,\mathbf{f}_u) = \int p(\mathbf{f}_s,\mathbf{f}_u,\mathbf{f}_i)\, d\mathbf{f}_i = \int p(\mathbf{f}_s,\mathbf{f}_u|\mathbf{f}_i)\,p(\mathbf{f}_i)\, d\mathbf{f}_i$$

where $\mathbf{f}_i\sim\mathcal{N}(\mathbf{m}_i,\mathbf{K}_{ii})$. The formulation up to here is still exact; carrying out the integration for $p(\mathbf{f}_u|\mathbf{y}_s)$ would recover the regular GPR model.

Now here comes the key assumption in the framework: $\mathbf{f}_s$ and $\mathbf{f}_u$ are conditionally independent given $\mathbf{f}_i$,

$$p(\mathbf{f}_s,\mathbf{f}_u) \approx q(\mathbf{f}_s,\mathbf{f}_u) = \int q(\mathbf{f}_s|\mathbf{f}_i)\, q(\mathbf{f}_u|\mathbf{f}_i)\, p(\mathbf{f}_i)\, d\mathbf{f}_i$$

This means the dependency of $\mathbf{f}_u$ on $\mathbf{f}_s$ is indirectly induced by $\mathbf{f}_i$. As a reference, the exact forms of the conditionals and the covariance matrix are,

$$\left\{ \begin{aligned} \mathbf{f}_s|\mathbf{f}_i &\sim \mathcal{N}\left(\mathbf{m}_s-\mathbf{K}_{si}\mathbf{K}_{ii}^{-1}(\mathbf{f}_i-\mathbf{m}_i),\ \mathbf{K}_{ss}-\mathbf{Q}_{ss}\right) \\ \mathbf{f}_u|\mathbf{f}_i &\sim \mathcal{N}\left(\mathbf{m}_u-\mathbf{K}_{ui}\mathbf{K}_{ii}^{-1}(\mathbf{f}_i-\mathbf{m}_i),\ \mathbf{K}_{uu}-\mathbf{Q}_{uu}\right) \end{aligned} \right. \qquad \mathbf{K}= \begin{bmatrix} \mathbf{K}_{ss} & \mathbf{K}_{su} \\ \mathbf{K}_{us} & \mathbf{K}_{uu} \end{bmatrix}$$

where $\mathbf{Q}_{ab}=\mathbf{K}_{ai}\mathbf{K}_{ii}^{-1}\mathbf{K}_{ib}$.
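Since $\mathbf{Q}_{ab}$ appears in everything that follows, here is a small NumPy helper for it, computed through a Cholesky factor of $\mathbf{K}_{ii}$; the function name and the jitter term are my own choices for the sketch.

```python
import numpy as np

def Q(K_ai, K_ii, K_ib, jitter=1e-8):
    """Q_ab = K_ai K_ii^{-1} K_ib, via a Cholesky factor of K_ii."""
    L_i = np.linalg.cholesky(K_ii + jitter * np.eye(len(K_ii)))
    A = np.linalg.solve(L_i, K_ib)            # L_i^{-1} K_ib
    B = np.linalg.solve(L_i, K_ai.T)          # L_i^{-1} K_ia
    return B.T @ A                            # (K_ai L_i^{-T})(L_i^{-1} K_ib)
```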

Approximations

Under the above framework, all the approximations are actually simplifying or approximating the term $\mathbf{K}_{**}-\mathbf{Q}_{**}$. For example, the low-rank approximation approach is effectively using,

$$\left\{ \begin{aligned} \mathbf{f}_s|\mathbf{f}_i &\sim \mathcal{N}\left(\mathbf{m}_s-\mathbf{K}_{si}\mathbf{K}_{ii}^{-1}(\mathbf{f}_i-\mathbf{m}_i),\ \mathbf{O}\right) \\ \mathbf{f}_u|\mathbf{f}_i &\sim \mathcal{N}\left(\mathbf{m}_u-\mathbf{K}_{ui}\mathbf{K}_{ii}^{-1}(\mathbf{f}_i-\mathbf{m}_i),\ \mathbf{O}\right) \end{aligned} \right. \qquad \mathbf{K}= \begin{bmatrix} \mathbf{Q}_{ss} & \mathbf{Q}_{su} \\ \mathbf{Q}_{us} & \mathbf{Q}_{uu} \end{bmatrix}$$

The covariance matrices are neglected and the "distributions" become deterministic, hence the term Deterministic Inducing Conditional (DIC) in the framework. The zero covariance matrix is exactly the cause of the zero error estimation in this type of approach.
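A quick numerical illustration of the degeneracy, reusing the `rbf` and `Q` helpers sketched above (the inducing inputs and the far-away test point are arbitrary): since the degenerate predictive variance is bounded from above by $\mathbf{Q}_{uu}$, showing that $\mathbf{Q}_{uu}$ collapses far from the data is enough.

```python
# Continuing the sketches above (rbf and Q already defined).
Xi = np.linspace(0.0, 1.0, 20)[:, None]       # inducing inputs inside [0, 1]
Xu_far = np.array([[50.0]])                   # a test point far from the data

K_uu = rbf(Xu_far, Xu_far)                    # exact prior variance: sf2
Q_uu = Q(rbf(Xu_far, Xi), rbf(Xi, Xi), rbf(Xi, Xu_far))
print(K_uu, Q_uu)                             # Q_uu is ~0: the degenerate GP reports
                                              # essentially zero uncertainty out here
```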

The approximation of interest in this article, the GPRFITC model, uses the following conditionals,

$$\left\{ \begin{aligned} \mathbf{f}_s|\mathbf{f}_i &\sim \mathcal{N}\left(\mathbf{m}_s-\mathbf{K}_{si}\mathbf{K}_{ii}^{-1}(\mathbf{f}_i-\mathbf{m}_i),\ \mathrm{diag}(\mathbf{K}_{ss}-\mathbf{Q}_{ss})\right) \\ \mathbf{f}_u|\mathbf{f}_i &\sim \text{exact} \end{aligned} \right. \qquad \mathbf{K}= \begin{bmatrix} \mathbf{Q}_{ss} + \mathrm{diag}(\mathbf{K}_{ss}-\mathbf{Q}_{ss}) & \mathbf{Q}_{su} \\ \mathbf{Q}_{us} & \mathbf{K}_{uu} \end{bmatrix}$$

That is, only the diagonal of the training conditional covariance is kept, which together with $p(\mathbf{f}_i)$ yields the effective covariance block $\mathbf{Q}_{ss} + \mathrm{diag}(\mathbf{K}_{ss}-\mathbf{Q}_{ss})$ shown in $\mathbf{K}$. Note that $q(\mathbf{f}_u|\mathbf{f}_i)$ is kept exact only when a single test output is requested. Otherwise, its covariance matrix is treated in the same way as that of $q(\mathbf{f}_s|\mathbf{f}_i)$.
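To make the structure of this covariance explicit, here is a short sketch that assembles $\mathbf{Q}_{ss} + \mathrm{diag}(\mathbf{K}_{ss}-\mathbf{Q}_{ss})$, again reusing the `rbf` and `Q` helpers; the function and variable names are mine.

```python
# Continuing the sketches above (rbf and Q already defined).
def fitc_train_cov(Xs, Xi):
    """FITC effective training covariance: Q_ss + diag(K_ss - Q_ss)."""
    K_si = rbf(Xs, Xi)
    Q_ss = Q(K_si, rbf(Xi, Xi), K_si.T)
    K_ss_diag = np.diag(rbf(Xs, Xs))          # only the diagonal of K_ss is needed
    # Exact diagonal, low-rank off-diagonal; the noise sigma_n^2 I is added later.
    return Q_ss + np.diag(K_ss_diag - np.diag(Q_ss))
```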

More on GPRFITC

In GPRFITC, the predictive mean becomes,

$$\begin{aligned} \mathbf{m}_p &= \mathbf{m}_u - \mathbf{Q}_{us}(\mathbf{Q}_{ss}+\mathbf{V})^{-1}(\mathbf{y}_s-\mathbf{m}_s) \\ &= \mathbf{m}_u - \mathbf{K}_{ui}\mathbf{M}^{-1}\mathbf{K}_{is}\mathbf{V}^{-1}(\mathbf{y}_s-\mathbf{m}_s) \end{aligned}$$

where $\mathbf{V}=\mathrm{diag}(\mathbf{K}_{ss}-\mathbf{Q}_{ss})+\sigma_n^2\mathbf{I}$, and $\mathbf{M}=\mathbf{K}_{ii}+\mathbf{K}_{is}\mathbf{V}^{-1}\mathbf{K}_{si}$. The predictive covariance is,

$$\begin{aligned} \mathbf{K}_p &= \mathbf{K}_{uu} - \mathbf{Q}_{us}(\mathbf{Q}_{ss}+\mathbf{V})^{-1}\mathbf{Q}_{su} \\ &= \mathbf{K}_{uu} - \mathbf{Q}_{uu} + \mathbf{K}_{ui}\mathbf{M}^{-1}\mathbf{K}_{iu} \end{aligned}$$

Note that, in terms of computational cost, $N$ is effectively reduced to the number of inducing points.

There are three matrix inversions involved in the prediction. $\mathbf{V}$ is diagonal, and its inversion is trivial. The inversion inside $\mathbf{Q}_{ss}$ is done using Cholesky decomposition (again),

$$\begin{aligned} \mathbf{Q}_{ss} &= \mathbf{K}_{si}\mathbf{K}_{ii}^{-1}\mathbf{K}_{is} = \mathbf{K}_{si}(\mathbf{L}_i\mathbf{L}_i^T)^{-1}\mathbf{K}_{is} \\ &\equiv \bar{\mathbf{K}}_{is}^T\bar{\mathbf{K}}_{is} \end{aligned}$$

where $\bar{\mathbf{K}}_{is}=\mathbf{L}_i^{-1}\mathbf{K}_{is}$. Subsequently, the inversion of $\mathbf{M}$ utilizes the decomposition of $\mathbf{K}_{ii}$,

$$\begin{aligned} \mathbf{M}^{-1} &= (\mathbf{K}_{ii}+\mathbf{K}_{is}\mathbf{V}^{-1}\mathbf{K}_{si})^{-1} = \mathbf{L}_i^{-T}\left(\mathbf{I}+\bar{\mathbf{K}}_{is}\mathbf{V}^{-1}\bar{\mathbf{K}}_{is}^T\right)^{-1}\mathbf{L}_i^{-1} \\ &\equiv \mathbf{L}_i^{-T}(\mathbf{L}_m\mathbf{L}_m^T)^{-1}\mathbf{L}_i^{-1} \\ &\equiv \mathbf{L}^{-T}\mathbf{L}^{-1} \end{aligned}$$

where Cholesky decomposition is applied in the middle. The last two expressions are then used in the computation of $\mathbf{m}_p$ and $\mathbf{K}_p$.
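A sketch of this computational strategy, reusing the `rbf` kernel from above; it computes the two Cholesky factors, $\bar{\mathbf{K}}_{is}$, the diagonal of $\mathbf{V}$, and the predictive covariance $\mathbf{K}_p$. All names are illustrative and the mean functions are assumed to be zero.

```python
# Continuing the sketches above (rbf already defined).
def fitc_factors(Xs, Xi, sn2, jitter=1e-8):
    """Cholesky factors L_i, L_m plus Kbar_is and the diagonal of V."""
    K_si = rbf(Xs, Xi)
    L_i = np.linalg.cholesky(rbf(Xi, Xi) + jitter * np.eye(len(Xi)))
    Kbar_is = np.linalg.solve(L_i, K_si.T)                  # L_i^{-1} K_is
    V = np.diag(rbf(Xs, Xs)) - np.sum(Kbar_is**2, 0) + sn2  # diag(K_ss - Q_ss) + sn2
    L_m = np.linalg.cholesky(np.eye(len(Xi)) + (Kbar_is / V) @ Kbar_is.T)
    return L_i, Kbar_is, V, L_m

def fitc_predictive_cov(Xu, Xi, L_i, L_m):
    """K_p = K_uu - Q_uu + K_ui M^{-1} K_iu, with M^{-1} = L_i^{-T}(L_m L_m^T)^{-1}L_i^{-1}."""
    A = np.linalg.solve(L_i, rbf(Xi, Xu))                   # L_i^{-1} K_iu
    B = np.linalg.solve(L_m, A)                             # L_m^{-1} L_i^{-1} K_iu
    return rbf(Xu, Xu) - A.T @ A + B.T @ B                  # K_uu - Q_uu + K_ui M^{-1} K_iu
```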

The hyperparameters of GPRFITC can be trained by maximizing the marginal likelihood,

$$-2\mathcal{L} = (\mathbf{y}_s-\mathbf{m}_s)^T(\mathbf{Q}_{ss}+\mathbf{V})^{-1}(\mathbf{y}_s-\mathbf{m}_s) + \log|\mathbf{Q}_{ss}+\mathbf{V}| + N\log 2\pi$$

where the matrix inversion is handled as before. For the determinant,

$$\begin{aligned} \log|\mathbf{Q}_{ss}+\mathbf{V}| &= \log\left(|\mathbf{V}|\,\left|\mathbf{I}+\bar{\mathbf{K}}_{is}\mathbf{V}^{-1}\bar{\mathbf{K}}_{is}^T\right|\right) \\ &= \log|\mathbf{V}| + 2\log|\mathbf{L}_m| \end{aligned}$$

The hyperparameters in $\mathcal{L}$ include those of the regular GPR model (the length scales and the process and noise variances) as well as extra parameters: the locations of the inducing points, whose count is proportional to the input dimension times the number of inducing points. Due to the large number of hyperparameters, the only viable training approach is a gradient-based method. The gradients of $\mathcal{L}$ can be found in Appendix C of Snel2008. Nevertheless, as mentioned before, the gradients are handled automatically in the TensorFlow-based GPflow code.
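As a small numerical sanity check on the determinant identity above, continuing the sketches and reusing `fitc_factors`; the toy data and sizes are arbitrary.

```python
# Continuing the sketches above (rbf and fitc_factors already defined).
rng = np.random.default_rng(0)
Xs, Xi, sn2 = rng.random((200, 1)), rng.random((15, 1)), 1e-2

L_i, Kbar_is, V, L_m = fitc_factors(Xs, Xi, sn2)
Q_ss = Kbar_is.T @ Kbar_is                                # full Q_ss, for checking only
direct = np.linalg.slogdet(Q_ss + np.diag(V))[1]          # O(N^3) dense route
factored = np.sum(np.log(V)) + 2 * np.sum(np.log(np.diag(L_m)))
print(direct, factored)                                   # agree to round-off error
```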


This is the end of this series of articles on GPR, which covers its derivation, computational strategy, determination of hyperparameters, and extension to other variants.