In the last article, the Gaussian process regression (GPR) model was discussed. Given the training data, it looks as if all the coefficients, the predictive mean, and the variances can be determined from those formulations. Well, that is not the whole story. There are some extra coefficients in the covariance function of the Gaussian process that still need to be determined, namely the hyperparameters.
The determination of the hyperparameters is a little more involved than the formulations in the last article, as the underlying optimization problem is non-convex.
Covariance function
A covariance function in the Gaussian process takes the form

$$k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 R(\mathbf{x}_i, \mathbf{x}_j)$$

where $\sigma_f^2$ is the process variance, and $R$ is typically stationary, i.e. $R(\mathbf{x}, \mathbf{x}') = R(\mathbf{x} - \mathbf{x}')$, and monotonically decreasing. It is also assumed that $R(\mathbf{0}) = 1$, meaning that a point correlates with itself the most. When the observation is noisy, the covariance function is modified as

$$k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 R(\mathbf{x}_i, \mathbf{x}_j) + \sigma_n^2 \delta_{ij} \equiv \sigma_f^2 \left[ R(\mathbf{x}_i, \mathbf{x}_j) + \tilde{\sigma}_n^2 \delta_{ij} \right]$$

A typical choice of $R$ is the squared exponential function,

$$R(\mathbf{x}_i, \mathbf{x}_j) = \exp\left[ -(\mathbf{x}_i - \mathbf{x}_j)^T M (\mathbf{x}_i - \mathbf{x}_j) \right]$$

where in the isotropic case

$$M = \frac{1}{2l^2} I$$

meaning that the length scales in all dimensions are the same; while in the anisotropic case

$$M = \frac{1}{2}\,\mathrm{diag}(\mathbf{l})^{-2}$$

is a diagonal matrix with positive diagonal entries, representing the length scales in different dimensions. In both cases, the length scales have to be figured out from the training data. The latter case is sometimes also called automatic relevance determination (ARD), meaning that the training process automatically determines whether a particular dimension has a significant influence on the output. See the DACE manual for more discussion.
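To make the notation concrete, here is a minimal NumPy sketch of the noisy squared exponential covariance in the ARD form; the function name, the example data, and the length scales are made up for illustration.

```python
import numpy as np

def ard_sq_exp_cov(X1, X2, sigma_f2, length_scales, sigma_n2=0.0):
    """k(xi, xj) = sigma_f^2 exp[-(xi-xj)^T M (xi-xj)] + sigma_n^2 delta_ij,
    with M = diag(l)^(-2) / 2 (the ARD / anisotropic case)."""
    # Scale each dimension by its length scale, then form squared distances.
    X1s = X1 / length_scales
    X2s = X2 / length_scales
    sq_dist = ((X1s[:, None, :] - X2s[None, :, :]) ** 2).sum(axis=-1)
    K = sigma_f2 * np.exp(-0.5 * sq_dist)
    if X1 is X2:
        # Noise only enters on the diagonal (delta_ij), i.e. when i == j.
        K += sigma_n2 * np.eye(X1.shape[0])
    return K

# Hypothetical usage: 5 points in 2 dimensions, one length scale per dimension.
X = np.random.default_rng(0).normal(size=(5, 2))
K = ard_sq_exp_cov(X, X, sigma_f2=1.5, length_scales=np.array([0.7, 2.0]), sigma_n2=1e-6)
```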
The length scales $\mathbf{l}$ and the variances $\sigma_f^2$ and $\sigma_n^2$ are the hyperparameters to be determined in the training process. In the general sense, the coefficients $\bar{\mathbf{b}}$ for the basis functions are hyperparameters too.
Non-dimensionalization
Before proceeding to the training process, it is beneficial to non-dimensionalize some quantities in the GPR model. First, the covariance matrices $K_y$ and $K_{su}$ can be "non-dimensionalized" by $\sigma_f^2$,

$$\begin{aligned} K_y &\equiv \sigma_f^2 \tilde{K}_y \equiv \sigma_f^2 (\tilde{K}_{ss} + \tilde{\sigma}_n^2 I) \\ K_{su} &\equiv \sigma_f^2 \tilde{K}_{su} \end{aligned}$$
Subsequently, examine the effect of $\sigma_f^2$ on the predictive mean,

$$\mathbf{m}_p^* = K_{su}^T \bar{\mathbf{g}} + H_u^T \bar{\mathbf{b}} = \tilde{K}_{su}^T \tilde{\mathbf{g}} + H_u^T \tilde{\mathbf{b}}$$

where

$$\begin{aligned} \bar{\mathbf{b}} &= (H^T K_y^{-1} H)^{-1} H^T K_y^{-1} \mathbf{y}_s = (H^T \tilde{K}_y^{-1} H)^{-1} H^T \tilde{K}_y^{-1} \mathbf{y}_s \equiv \tilde{\mathbf{b}} \\ \bar{\mathbf{g}} &= K_y^{-1} (\mathbf{y}_s - H \bar{\mathbf{b}}) = \sigma_f^{-2} \tilde{K}_y^{-1} (\mathbf{y}_s - H \tilde{\mathbf{b}}) \equiv \sigma_f^{-2} \tilde{\mathbf{g}} \end{aligned}$$

And the covariance,

$$K_p^* = K_{uu} - K_{us} K_y^{-1} K_{su} + D^T (H^T K_y^{-1} H)^{-1} D = \sigma_f^2 \left[ \tilde{K}_{uu} - \tilde{K}_{us} \tilde{K}_y^{-1} \tilde{K}_{su} + \tilde{D}^T (H^T \tilde{K}_y^{-1} H)^{-1} \tilde{D} \right]$$

where

$$D = H_u^T - H^T K_y^{-1} K_{su} = H_u^T - H^T \tilde{K}_y^{-1} \tilde{K}_{su} \equiv \tilde{D}$$

In sum, $\sigma_f^2$ has no direct effect on the predictive mean, but the predictive covariance is proportional to $\sigma_f^2$, once the other hyperparameters are given. The question is, how to obtain the hyperparameters?
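This scaling argument can be checked numerically. The sketch below assembles the predictive mean and covariance from the formulas above, using a constant basis function and a one-dimensional toy data set, and evaluates them for two values of $\sigma_f^2$: the mean stays the same, while the covariance scales proportionally. All names and data are made up, and the array shapes are arranged so the products are conformable, which may differ slightly from the transpose convention above.

```python
import numpy as np

def gpr_predict(Ky, Ksu, Kuu, H, Hu, ys):
    # Predictive mean and covariance with explicit basis functions,
    # following the formulas above (H: n x m, Hu: n_u x m).
    Ky_inv = np.linalg.inv(Ky)
    A = H.T @ Ky_inv @ H                       # H^T Ky^{-1} H
    b_bar = np.linalg.solve(A, H.T @ Ky_inv @ ys)
    g_bar = Ky_inv @ (ys - H @ b_bar)
    mean = Ksu.T @ g_bar + Hu @ b_bar
    D = Hu.T - H.T @ Ky_inv @ Ksu              # m x n_u
    cov = Kuu - Ksu.T @ Ky_inv @ Ksu + D.T @ np.linalg.solve(A, D)
    return mean, cov

rng = np.random.default_rng(1)
Xs, Xu = rng.normal(size=(8, 1)), rng.normal(size=(3, 1))
ys = np.sin(Xs).ravel()
H, Hu = np.ones((8, 1)), np.ones((3, 1))       # constant mean basis function

def corr(A, B):                                 # unit-variance SE correlation
    return np.exp(-0.5 * (A - B.T) ** 2)

R_ss, R_su, R_uu = corr(Xs, Xs), corr(Xs, Xu), corr(Xu, Xu)
noise = 1e-8 * np.eye(8)                        # sigma_n2_tilde * I
for sf2 in (1.0, 4.0):                          # scale the whole kernel by sigma_f^2
    m, C = gpr_predict(sf2 * (R_ss + noise), sf2 * R_su, sf2 * R_uu, H, Hu, ys)
    print(m.round(6), (np.trace(C) / sf2).round(6))  # mean unchanged; trace(C)/sf2 unchanged
```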
Learning the hyperparameters
Log-likelihood
The hyperparameters are determined using maximum likelihood estimation (MLE). With the joint Gaussian distribution, the log marginal likelihood of the training data is

$$\mathcal{L} = \log p(\mathbf{y}_s \mid X_s, \mathbf{b}, B) = -\frac{1}{2} (\mathbf{y}_s - H\mathbf{b})^T (K_y + H^T B H)^{-1} (\mathbf{y}_s - H\mathbf{b}) - \frac{1}{2} \log |K_y + H^T B H| - \frac{n}{2} \log 2\pi$$

where $n$ is the number of training data points. Utilizing the determinant counterpart of the matrix inversion lemma,

$$|A + U C V^T| = |A|\,|C|\,|C^{-1} + V^T A^{-1} U|$$

and letting $\mathbf{b} = \mathbf{0}$ and $B^{-1} \to O$ as was done last time,

$$-2\mathcal{L} = \mathbf{y}_s^T \left[ K_y^{-1} - K_y^{-1} H (H^T K_y^{-1} H)^{-1} H^T K_y^{-1} \right] \mathbf{y}_s + \log |K_y| + \log |H^T K_y^{-1} H| + (n - m) \log 2\pi$$

where $m$ is the number of basis functions, introduced due to the singularity caused by $|B|$. An interesting simplification can be applied to the first term in the above expression,

$$\begin{aligned} I_1 &= \mathbf{y}_s^T \left[ K_y^{-1} - K_y^{-1} H (H^T K_y^{-1} H)^{-1} H^T K_y^{-1} \right] \mathbf{y}_s = \mathbf{y}_s^T K_y^{-1} (\mathbf{y}_s - H\bar{\mathbf{b}}) \equiv I_2 \\ I_1 &= (\mathbf{y}_s - H\bar{\mathbf{b}})^T K_y^{-1} \mathbf{y}_s \equiv I_3 \\ I_1 &= \mathbf{y}_s^T K_y^{-1} \mathbf{y}_s - \bar{\mathbf{b}}^T H^T K_y^{-1} H \bar{\mathbf{b}} \equiv I_4 \\ I_1 &= I_2 + I_3 - I_4 = (\mathbf{y}_s - H\bar{\mathbf{b}})^T K_y^{-1} (\mathbf{y}_s - H\bar{\mathbf{b}}) \end{aligned}$$

indicating that the term $I_1$ behaves as if it is centered at $H\bar{\mathbf{b}}$, with a covariance matrix $K_y$. Furthermore, factoring $\mathcal{L}$ with $\sigma_f^2$,

$$-2\tilde{\mathcal{L}}_1 = \sigma_f^{-2} (\mathbf{y}_s - H\tilde{\mathbf{b}})^T \tilde{K}_y^{-1} (\mathbf{y}_s - H\tilde{\mathbf{b}}) + \log |\tilde{K}_y| + \log |H^T \tilde{K}_y^{-1} H| + (n - m) \log \sigma_f^2$$

where the constant term is neglected.
If the prior on the coefficients of the basis functions is ignored, the log-likelihood simplifies to

$$-2\tilde{\mathcal{L}}_2 = \sigma_f^{-2} (\mathbf{y}_s - H\tilde{\mathbf{b}})^T \tilde{K}_y^{-1} (\mathbf{y}_s - H\tilde{\mathbf{b}}) + \log |\tilde{K}_y| + n \log \sigma_f^2$$
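As a sanity check, the two negative log-likelihoods above can be evaluated directly. The sketch below is a straightforward (not numerically optimized) implementation; the function and variable names are made up.

```python
import numpy as np

def neg2_loglik(Ky_tilde, H, ys, sigma_f2, with_basis_terms=True):
    # Evaluate -2*L1_tilde (with_basis_terms=True) or -2*L2_tilde (False),
    # dropping the constant terms, given the scaled covariance Ky_tilde.
    n, m = H.shape
    K_inv = np.linalg.inv(Ky_tilde)
    A = H.T @ K_inv @ H                           # H^T Ky_tilde^{-1} H
    b_tilde = np.linalg.solve(A, H.T @ K_inv @ ys)
    r = ys - H @ b_tilde
    quad = (r @ K_inv @ r) / sigma_f2
    _, logdet_K = np.linalg.slogdet(Ky_tilde)
    if with_basis_terms:
        _, logdet_A = np.linalg.slogdet(A)
        return quad + logdet_K + logdet_A + (n - m) * np.log(sigma_f2)
    return quad + logdet_K + n * np.log(sigma_f2)
```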
Process variances
A natural next step would be finding $\sigma_f^2$ by setting $\partial(-2\tilde{\mathcal{L}})/\partial\sigma_f^2 = 0$ and solving for $\sigma_f^2$,

$$\sigma_f^2 = \frac{1}{N} (\mathbf{y}_s - H\tilde{\mathbf{b}})^T \tilde{K}_y^{-1} (\mathbf{y}_s - H\tilde{\mathbf{b}})$$

where $N = n - m$ for $\tilde{\mathcal{L}}_1$, and $N = n$ for $\tilde{\mathcal{L}}_2$. The former is analogous to the bias-corrected sample variance (dividing by $n - m$), while the latter is the MLE. When there are multiple outputs, the output variables are usually assumed to be independent, and $\sigma_f^2$ can be computed separately for each output.
Plugging this value back into the likelihood and removing the constant terms [Welch1992], one obtains the reduced log-likelihood,

$$\begin{aligned} -F_1 &= \log |\tilde{K}_y| + \log |H^T \tilde{K}_y^{-1} H| + (n - m) \log \sigma_f^2 \\ -F_2 &= \log |\tilde{K}_y| + n \log \sigma_f^2 \end{aligned}$$

where the only remaining unknown hyperparameters are the length scales $\mathbf{l}$. Note that $F_2$ should be used if a general mean function, without priors on its coefficients, is employed.
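Putting the two pieces together, here is a hedged sketch of the closed-form variance estimate and the resulting reduced log-likelihood, to be minimized over the length scales that parameterize $\tilde{K}_y$; again, the names are mine.

```python
import numpy as np

def sigma_f2_and_reduced_nll(Ky_tilde, H, ys, use_F1=True):
    # Profile out sigma_f^2 with its closed-form estimate and return
    # (sigma_f^2, -F), where -F is minimized over the length scales.
    n, m = H.shape
    K_inv = np.linalg.inv(Ky_tilde)
    A = H.T @ K_inv @ H
    b_tilde = np.linalg.solve(A, H.T @ K_inv @ ys)
    r = ys - H @ b_tilde
    N = n - m if use_F1 else n
    sigma_f2 = (r @ K_inv @ r) / N                # closed-form variance estimate
    _, logdet_K = np.linalg.slogdet(Ky_tilde)
    if use_F1:
        _, logdet_A = np.linalg.slogdet(A)
        return sigma_f2, logdet_K + logdet_A + (n - m) * np.log(sigma_f2)
    return sigma_f2, logdet_K + n * np.log(sigma_f2)
```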
Finding the hyperparameters
There appear to be two strategies for determining the hyperparameters: the two-step method and the one-step method.
The two-step method utilizes the reduced log-likelihood. This is the method used in the DACE toolbox and its Python translation in sklearn. The hyperparameters are divided into two groups: (1) the length scales $\mathbf{l}$, and (2) the variances $\sigma_f^2$ and $\sigma_n^2$. The primary unknown is $\mathbf{l}$; $\sigma_f^2$ is a function of $\mathbf{l}$, and $\sigma_n^2$ is set externally, mainly as a device for numerical stability (namely the "nugget"). The objective function is $F_2$. In the multi-output case, $F$ for each output is computed using the same set of $\mathbf{l}$, and the results are summed up as the final objective function.
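A minimal sketch of the two-step strategy is given below, under the assumption that the nugget $\tilde{\sigma}_n^2$ is fixed and that a derivative-free scipy routine stands in for DACE's own pattern-search optimizer. It reuses the hypothetical `sigma_f2_and_reduced_nll` helper sketched above, and the data are synthetic.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
Xs = rng.uniform(size=(20, 2))
ys = np.sin(3 * Xs[:, 0]) + Xs[:, 1]
H = np.ones((20, 1))                              # constant basis function
sigma_n2_tilde = 1e-10                            # externally fixed "nugget"

def objective(log_l):
    # Step 1 objective: F2 as a function of the log length scales only.
    l = np.exp(log_l)                             # log-parameterization keeps l positive
    d = (Xs[:, None, :] - Xs[None, :, :]) / l
    Ky_tilde = np.exp(-0.5 * (d ** 2).sum(-1)) + sigma_n2_tilde * np.eye(len(Xs))
    _, nll = sigma_f2_and_reduced_nll(Ky_tilde, H, ys, use_F1=False)
    return nll

res = minimize(objective, x0=np.zeros(2), method="Nelder-Mead")
# Step 2: recover sigma_f^2 in closed form at the optimal length scales.
length_scales = np.exp(res.x)
```

In the multi-output case, the same objective would simply be summed over the outputs with a shared set of length scales, as described above.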
The one-step method tries to determine all the hyperparameters, even including $\bar{\mathbf{b}}$, using a gradient-based method. The difficulties are the efficiency of gradient-based methods with this many variables and the implementation of the gradients. The first item is not a major issue given the developments in modern computational science. The second item could indeed be troublesome. But that is where the elegance of GPflow is revealed. It is based on TensorFlow, and TF handles the gradients efficiently with its computation graph model. In GPflow, one can focus on the objective functions. In its GPR implementation, $\tilde{\mathcal{L}}_2$ is employed. A limitation is that, in the multi-output case, all the outputs have to use the same $\sigma_f^2$. A remedy is to add a wrapper to its GPR class and compute the MLEs of $\sigma_f^2$ for each output.
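For reference, a hedged GPflow sketch of the one-step approach, assuming the GPflow 2.x API (the details differ in older versions, which were closer to the static computation-graph style described above); the data are synthetic.

```python
import numpy as np
import gpflow

rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 2))
Y = np.sin(3 * X[:, :1]) + 0.05 * rng.normal(size=(30, 1))

# ARD squared exponential kernel: one length scale per input dimension.
kernel = gpflow.kernels.SquaredExponential(lengthscales=[1.0, 1.0])
model = gpflow.models.GPR(data=(X, Y), kernel=kernel)

# One-step training: length scales, signal variance, and noise variance are
# optimized together with a gradient-based method; TensorFlow supplies the
# gradients automatically.
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)
gpflow.utilities.print_summary(model)
```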
In the newer version of GPR in sklearn, the gradient of $F_2$ w.r.t. $\mathbf{l}$ is introduced, and gradient-based optimization methods like L-BFGS-B are employed. However, in the new implementation, the mean function is hardcoded to be zero, meaning that $F_1 = F_2$. Furthermore, the variance $\sigma_f^2$ is fixed to one, probably for easier computation of the gradients. This version essentially rules out the second group of hyperparameters and determines the first group using a gradient-based method. Therefore, the approach in this version is closer to the one-step method.
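A corresponding hedged sklearn sketch (current GaussianProcessRegressor API; defaults may differ across versions), using an anisotropic RBF kernel whose log length scales are optimized with L-BFGS-B; the data are synthetic.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 2))
y = np.sin(3 * X[:, 0]) + 0.05 * rng.normal(size=30)

# Anisotropic (ARD) RBF kernel plus a learned noise term; the default optimizer
# is L-BFGS-B on the log marginal likelihood w.r.t. the log hyperparameters,
# and n_restarts_optimizer adds random restarts against local optima.
kernel = RBF(length_scale=[1.0, 1.0]) + WhiteKernel(noise_level=1e-5)
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)
gpr.fit(X, y)
print(gpr.kernel_)   # fitted length scales and noise level
```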
The GPy implementation will not be discussed here, as I am not familiar with it.