We say a function follows a GP, $$ f(\vx) \sim \GP(m(\vx),k(\vx,\vx')) $$ where $m(\vx)$ is the mean function and $k(\vx,\vx')$ is the covariance function, when $$ \begin{aligned} m(\vx) &= \mathrm{E}[f(\vx)] \\ k(\vx,\vx') &= \mathrm{E}[(f(\vx)-m(\vx))(f(\vx')-m(\vx'))] \end{aligned} $$ The distribution of a set of points $\cD=\{ \vX,\vy\}$ sampled from the GP is Gaussian, $$ \vy\sim\cN(\vm,\vK) $$ where $[\vm]_i=m(\vX_i)$ and $[\vK]_{ij}=k(\vX_i,\vX_j)$.
A covariance function of a Gaussian process is defined as, $$ k(\vx_i,\vx_j) = \sfs R(\vx_i,\vx_j) $$ where $\sfs$ is the signal variance and $R(\vx_i,\vx_j)$ is a correlation function with $R(\vx,\vx)=1$.
When the observation is noisy, the covariance function is modified as, $$ k(\vx_i,\vx_j) = \sfs R(\vx_i,\vx_j) + \sns \delta_{ij} \equiv \sfs [R(\vx_i,\vx_j) + \tsn \delta_{ij}] $$ where $\sns$ is the noise variance, $\tsn=\sns/\sfs$ is the noise-to-signal ratio, and $\delta_{ij}$ is the Kronecker delta.
Some possible covariance functions (with $r=||\vx-\vx'||$) are the squared exponential, Matérn 3/2, and piecewise polynomial kernels plotted in the cell below.
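The plotting cell calls SEKernel, M3Kernel, and PPKernel as functions of the distance (the array X plays the role of $r$, and the variable r is the length scale). These helpers are assumed to be defined elsewhere in the notebook; a minimal sketch of what they might look like is:
# Hypothetical kernel definitions (the notebook's actual helpers may differ)
import numpy as np

def SEKernel(r, l):
    # Squared exponential: exp(-r^2 / (2 l^2))
    return np.exp(-0.5 * (r / l)**2)

def M3Kernel(r, l):
    # Matern 3/2: (1 + sqrt(3) r / l) exp(-sqrt(3) r / l)
    a = np.sqrt(3.0) * np.abs(r) / l
    return (1.0 + a) * np.exp(-a)

def PPKernel(r, l):
    # Compactly supported piecewise polynomial (Wendland-type), zero for r > l
    t = np.minimum(np.abs(r) / l, 1.0)
    return (1.0 - t)**4 * (4.0 * t + 1.0)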
# Example covariance functions
import numpy as np
import matplotlib.pyplot as plt

N = 1001
r = 0.1                   # length scale of each kernel
X = np.linspace(0, 1, N)  # distances at which the kernels are evaluated
figure = plt.figure(figsize=(10,6))
plt.plot(X, SEKernel(X,r), 'b-', label='Squared Exponential')
plt.plot(X, M3Kernel(X,r), 'r-', label='Matern 3/2')
plt.plot(X, PPKernel(X,r), 'k-', label='Piecewise Polynomial')
plt.xlabel('r')
plt.ylabel('Cov')
_ = plt.legend(loc=0)
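The next cell plots realizations S_SE, S_MT, and S_PP drawn from zero-mean GPs with the three covariance functions. These arrays are assumed to be generated elsewhere; one possible way to produce them, using the hypothetical kernels sketched above and a small jitter for numerical stability, is:
# Draw one zero-mean sample path per kernel (sketch; the names S_SE, S_MT, S_PP are assumed)
rng = np.random.default_rng(0)

def sampleGP(kernel, X, l, jitter=1e-8):
    # Build the covariance matrix from pairwise distances and draw one realization
    R = np.abs(X[:, None] - X[None, :])
    K = kernel(R, l) + jitter * np.eye(len(X))
    L = np.linalg.cholesky(K)
    return L @ rng.standard_normal(len(X))

S_SE = sampleGP(SEKernel, X, r)
S_MT = sampleGP(M3Kernel, X, r)
S_PP = sampleGP(PPKernel, X, r)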
# Example realizations of GPs with different covariance functions
figure = plt.figure(figsize=(10,6))
plt.plot(X, S_SE, 'b-', label='Squared Exponential')
plt.plot(X, S_MT, 'r-', label='Matern 3/2')
plt.plot(X, S_PP, 'k-', label='Piecewise Polynomial')
plt.xlabel("X")
plt.ylabel("SAMPLE")
_=plt.legend(loc=0)
A typical choice of $R$ is the squared exponential function, $$ R(\vx_i,\vx_j) = \exp[-(\vx_i-\vx_j)^T\vM(\vx_i-\vx_j)] $$ In the isotropic case, $\vM=\frac{1}{2l^2}\vI$, i.e. the length scale is the same in all dimensions; in the anisotropic case, $\vM=\half \mathrm{diag}(\vl)^{-2}$ is a diagonal matrix with positive entries, representing a separate length scale for each dimension.
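As an illustration, the anisotropic squared exponential kernel with one length scale per dimension might be coded as follows (a sketch; the name SEKernelARD and its signature are not part of the notebook's helpers):
# Anisotropic squared exponential kernel, M = diag(l)^{-2} / 2 (illustrative sketch)
def SEKernelARD(Xa, Xb, ls):
    # Xa: (na, d), Xb: (nb, d), ls: (d,) length scales; returns the (na, nb) correlation matrix
    D = (Xa[:, None, :] - Xb[None, :, :]) / ls   # scaled differences
    return np.exp(-0.5 * np.sum(D**2, axis=-1))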
Hyperparameters so far: the signal variance $\sfs$, the noise variance $\sns$ (or the ratio $\tsn$), and the length scale $l$ (or $\vl$).
Return to the dataset $\cD=\{ \vX,\vy\}$ sampled from GP: $$ \vy\sim\cN(\vm,\vK) $$ where $[\vm]_i=m(\vX_i)$ and $[\vK]_{ij}=k(\vX_i,\vX_j)$. Note that scalar output is assumed.
Now split the points into two sets: the training set $\vX_s$ with known output $\vy_s$, and the target set $\vX_u$ with unknown output $\vy_u$. The joint distribution is still Gaussian, with the following mean, $$ \vm^T = [\vm(\vX_s)^T, \vm(\vX_u)^T] \equiv [\vm_s^T, \vm_u^T] $$ and the block covariance matrix, $$ \vK = \begin{bmatrix} \vK_{ss} & \vK_{su} \\ \vK_{us} & \vK_{uu} \end{bmatrix} $$ where the $(i,j)$th element of each block is associated with the $i$th row of $\vX_a$ and the $j$th row of $\vX_b$, $$ [\vK_{ab}]_{ij} = k([\vX_a]_i, [\vX_b]_j) $$
When $\vy_s$ is noisy, a white noise term with variance $\sns$ is added to the Gaussian distribution, which adds a diagonal matrix to $\vK_{ss}$, $$ \vK_y = \vK_{ss} + \sns \vI_{ss} $$
The distribution of $\vy_u$ given $\vy_s$ is found by conditional probability, $$ \vy_u|\vy_s \sim \cN(\vm_p, \vK_p) $$ where the predictive mean is, $$ \vm_p = \vm_u + \vK_{us}\vK_y^{-1}(\vy_s-\vm_s) $$ and the (predictive) covariance is, $$ \vK_p = \vK_{uu} - \vK_{us}\vK_y^{-1}\vK_{su} $$
What are the differences between the predictive mean and the kernel ridge formulation?
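The cell that follows relies on helpers genGP and plotGP assumed to be provided with the notebook. A minimal sketch of the core posterior computation such a helper might perform, directly implementing the conditional mean and covariance above with a zero prior mean, is:
# Sketch of a GP posterior computation (genGP/plotGP themselves are the notebook's assumed helpers)
def posteriorGP(Xu, Xs, ys, kernel, l, sn2=0.0, nsmp=0, seed=0):
    rng = np.random.default_rng(seed)
    # Covariance blocks, assuming a zero prior mean
    Kss = kernel(np.abs(Xs[:, None] - Xs[None, :]), l) + sn2 * np.eye(len(Xs))
    Kus = kernel(np.abs(Xu[:, None] - Xs[None, :]), l)
    Kuu = kernel(np.abs(Xu[:, None] - Xu[None, :]), l)
    # Predictive mean and covariance: m_p = K_us K_y^{-1} y_s,  K_p = K_uu - K_us K_y^{-1} K_su
    A = np.linalg.solve(Kss, np.column_stack([ys, Kus.T]))
    mp = Kus @ A[:, 0]
    Kp = Kuu - Kus @ A[:, 1:]
    std = np.sqrt(np.maximum(np.diag(Kp), 0.0))
    # Optional posterior samples
    fp = None
    if nsmp > 0:
        L = np.linalg.cholesky(Kp + 1e-8 * np.eye(len(Xu)))
        fp = mp[:, None] + L @ rng.standard_normal((len(Xu), nsmp))
    return mp, std, fp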
# Posterior GP's
data = np.array([
[0.2, 0.4, 0.8], # Input
[0.8,-0.2, 0.2] # Output
])
mp, std, fp = genGP(X, data, SEKernel, r, nsmp=5)
plotGP(X, data, mp, std, fp)
The form of the mean function is usually unknown beforehand, so it is more common to use a set of basis functions, $$ m(\vx) = \vh(\vx)^T\cb $$ The coefficients $\cb$ have to be fitted from the sample data. Under a Bayesian framework, $\cb$ is assumed to follow, $$ \cb \sim \cN(\vb, \vB) $$ Taking $\cb$ into account, the GP is modified as, $$ f(\vx) \sim \GP(\vh(\vx)^T\vb,k(\vx,\vx')+\vh(\vx)^T\vB\vh(\vx')) $$ Extra terms due to the basis functions shall be added to the covariance matrix, $$ \vK_{ab} = \vK(\vX_a, \vX_b) + \vH_a \vB \vH_b^T,\quad [\vH_*]_i=\vh([\vX_*]_i)^T $$
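For instance, with a constant-plus-linear basis $\vh(\vx)^T = [1, \vx^T]$ (only an example choice, not the notebook's), the basis matrix $\vH_*$ could be assembled as:
# Basis matrix H for the (assumed) constant + linear basis, h(x)^T = [1, x^T]
def basisMatrix(X):
    X = np.asarray(X)
    if X.ndim == 1:
        X = X[:, None]                            # treat 1-D input as n points in 1 dimension
    return np.hstack([np.ones((len(X), 1)), X])   # (n, 1 + d); row i is h(X_i)^T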
The extra terms make the predictive mean and covariance from the previous section extremely cumbersome to compute. The needed simplifications are enabled by the matrix inversion lemma, $$ (\vA+\vU\vC\vV^T)^{-1} = \vA^{-1} - \vA^{-1}\vU (\vC^{-1}+\vV^T\vA^{-1}\vU)^{-1} \vV^T\vA^{-1} $$
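A quick numerical sanity check of the lemma (a throwaway sketch, not part of the GPR code):
# Verify the matrix inversion (Woodbury) lemma on random matrices
rng = np.random.default_rng(1)
n, k = 6, 2
A = np.diag(rng.uniform(1.0, 2.0, n))
U = rng.standard_normal((n, k))
V = rng.standard_normal((n, k))
C = np.diag(rng.uniform(1.0, 2.0, k))
lhs = np.linalg.inv(A + U @ C @ V.T)
Ai = np.linalg.inv(A)
rhs = Ai - Ai @ U @ np.linalg.inv(np.linalg.inv(C) + V.T @ Ai @ U) @ V.T @ Ai
print(np.allclose(lhs, rhs))   # expected: True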
Finally, consider the case where the coefficients are distributed uniformly rather than Gaussian, i.e. the vague-prior limit $\vb\rightarrow 0$ and $\vB^{-1}\rightarrow \vO$.
With that, we arrive at the commonly used form of GPR, $$ \begin{aligned} \vm_p^* &= \vK_{us}\vKyi\vy_s + \vD^T (\vH^T\vKyi\vH)^{-1} \vH^T\vKyi\vy_s \\ &= \vK_{us}\bar\vg + \vH_u\bar\vb \\ \vK_p^* &= \vK_p + \vD^T (\vH^T\vKyi\vH)^{-1} \vD \end{aligned} $$ where $\vD=\vH_u^T-\vH^T\vKyi\vK_{su}$, $\bar\vb=(\vH^T\vKyi\vH)^{-1}\vH^T\vKyi\vy_s$, and $\bar\vg=\vKyi(\vy_s-\vH\bar\vb)$. The term $\bar\vb$ is essentially the vector of coefficients of the mean function fitted from the sample data. Also, note that some authors argue that the last term in $\vK_p^*$ can be neglected for simplicity [Sasena2002].
The computation of $\vm_p^*$ and $\vK_p^*$ can be tricky because the matrix inversions involved, $\vK_y^{-1}$ and $(\vH^T\vKyi\vH)^{-1}$, are often ill-conditioned. The strategy is to combine a Cholesky decomposition and a QR decomposition to stabilize them: factor $\vK_y=\vL\vL^T$ (Cholesky) and $\vL^{-1}\vH\equiv\vF=\vQ\vR$ (thin QR), so that $\vH^T\vKyi\vH=\vF^T\vF=\vR^T\vR$.
After the stabilization, the quantities $\bar\vb$ and $\bar\vg$ in $\vm_p^*$ are computed as follows, $$ \begin{aligned} \bar\vb &= (\vH^T\vKyi\vH)^{-1}\vH^T\vKyi\vy_s = \vR^{-1} [\vQ^T(\vL^{-1}\vy_s)] \\ \bar\vg &= \vKyi(\vy_s-\vH\bar\vb) = \vL^{-T} [\vL^{-1} (\vy_s-\vH\bar\vb)] \end{aligned} $$
The covariance matrix is computed as follows, $$ \vK_p^* = \vK_{uu} - (\vL^{-1}\vK_{su})^2 + (\vR^{-T} [\vH_u^T - \vF^T(\vL^{-1}\vK_{su})])^2 $$ where $(\Box)^2=\Box^T\Box$.
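A sketch of this stabilized prediction with NumPy/SciPy (the function predictGPR and its argument names are illustrative, not the notebook's own helpers):
# Stabilized GPR prediction via Cholesky + QR (illustrative sketch)
from scipy.linalg import cholesky, qr, solve_triangular

def predictGPR(Kuu, Kus, Ky, Hs, Hu, ys):
    # K_y = L L^T (Cholesky), F = L^{-1} H = Q R (thin QR)
    L = cholesky(Ky, lower=True)
    F = solve_triangular(L, Hs, lower=True)
    Q, R = qr(F, mode='economic')
    Ly = solve_triangular(L, ys, lower=True)                 # L^{-1} y_s
    # b_bar = R^{-1} [Q^T (L^{-1} y_s)],  g_bar = L^{-T} [L^{-1} (y_s - H b_bar)]
    b_bar = solve_triangular(R, Q.T @ Ly, lower=False)
    g_bar = solve_triangular(L.T, solve_triangular(L, ys - Hs @ b_bar, lower=True), lower=False)
    # Predictive mean: m_p = K_us g_bar + H_u b_bar
    mp = Kus @ g_bar + Hu @ b_bar
    # Predictive covariance: K_p = K_uu - V^T V + W^T W
    V = solve_triangular(L, Kus.T, lower=True)               # L^{-1} K_su
    W = solve_triangular(R.T, Hu.T - F.T @ V, lower=True)    # R^{-T} [H_u^T - F^T (L^{-1} K_su)]
    Kp = Kuu - V.T @ V + W.T @ W
    return mp, Kp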
Are we done? Not quite: the hyperparameters still need to be determined.
For easier treatment, we non-dimensionalize some quantities in the GPR model.
The covariance matrices $\vK_y$ and $\vK_{su}$ can be "non-dimensionalized" by $\sfs$, $$ \begin{aligned} \vK_y &\equiv \sfs \tvK_y \equiv \sfs (\tvK_{ss} + \tsn\vI) \\ \vK_{su} &\equiv \sfs \tvK_{su} \end{aligned} $$
Subsequently, the predictive mean becomes, $$ \vm_p^* = \vK_{su}^T\bar\vg + \vH_u\bar\vb = \tvK_{su}^T\tvg + \vH_u\tvb $$ where $\tvb=(\vH^T\tKyi\vH)^{-1}\vH^T\tKyi\vy_s=\bar\vb$ and $\tvg=\tKyi(\vy_s-\vH\tvb)=\sfs\bar\vg$.
The hyperparameters are determined using maximum likelihood estimation (MLE). With the joint Gaussian distribution, the log marginal likelihood of the training data is, $$ \begin{aligned} \cL &= \log p(\vy_s|\vX_s,\vb,\vB) \\ &= - \half(\vy_s-\vH\vb)^T (\vK_y+\vH\vB\vH^T)^{-1} (\vy_s-\vH\vb) \\ &- \half \log |\vK_y+\vH\vB\vH^T| - \frac{n}{2}\log 2\pi \end{aligned} $$ where $n$ is the number of training data points.
Factoring $\sfs$ out of $\cL$ (in the vague-prior limit), $$ -2\tcL_1 = \sigma_f^{-2} (\vy_s - \vH\tvb)^T \tKyi (\vy_s - \vH\tvb) + \log|\tvK_y| + \log|\vH^T\tKyi\vH| + (n-m)\log\sfs $$ where $m$ is the number of basis functions and the constant term is neglected.
If the prior on the coefficients of the basis functions is ignored, the log-likelihood simplifies to, $$ -2\tcL_2 = \sigma_f^{-2} (\vy_s - \vH\tvb)^T \tKyi (\vy_s - \vH\tvb) + \log|\tvK_y| + n\log\sfs $$
A natural next step is to find $\sfs$ by setting $\partial(-2\tcL)/\partial\sfs=0$ and solving for $\sfs$, $$ \sfs = \frac{1}{N}(\vy_s - \vH\tvb)^T \tKyi (\vy_s - \vH\tvb) $$ where $N=n-m$ for $\tcL_1$, and $N=n$ for $\tcL_2$. The former is the unbiased estimate of the variance, while the latter is the MLE. When there are multiple outputs, the output variables are usually assumed to be independent, and $\sfs$ can be computed separately for each output.
Plugging this value back into the likelihood and removing the constant terms [Welch1992], one obtains the reduced log-likelihoods, $$ \begin{aligned} -\cF_1 &= \log|\tvK_y| + \log|\vH^T\tKyi\vH| + (n-m)\log\sfs \\ -\cF_2 &= \log|\tvK_y| + n\log\sfs \end{aligned} $$ where the only remaining unknown hyperparameters are the length scales $\vl$. Note that $\cF_2$ should be used if a general mean function, without priors on its coefficients, is employed.
The minimization of $-\cF_2$ is equivalent to the minimization of $$ -\cF^* = \sfs|\tvK_y|^{1/n} $$ where $|\tvK_y|=|\tilde{\vL}|^2$ using Cholesky decomposition $\tvK_y=\tilde{\vL}\tilde{\vL}^T$.
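A sketch of how the length scale could be fitted by minimizing this reduced quantity, using the hypothetical SEKernel and basisMatrix helpers above, the data array from the earlier cell, and scipy.optimize (an assumed small noise-to-signal ratio tsn is included for stability; this is only one possible implementation, not the notebook's):
# Fit the length scale by minimizing sf^2 |K~_y|^{1/n}, then recover sf^2 and b_bar (sketch)
from scipy.optimize import minimize_scalar

def reducedNegLogLik(l, Xs, ys, Hs, tsn=1e-6):
    n = len(Xs)
    tKy = SEKernel(np.abs(Xs[:, None] - Xs[None, :]), l) + tsn * np.eye(n)
    tL = np.linalg.cholesky(tKy)
    # GLS fit of the basis coefficients, then the profiled variance sf^2 = Q / n
    F = np.linalg.solve(tL, Hs)
    Ly = np.linalg.solve(tL, ys)
    b = np.linalg.lstsq(F, Ly, rcond=None)[0]
    sf2 = np.sum((Ly - F @ b)**2) / n
    # |K~_y|^{1/n} from the Cholesky factor, evaluated in log space for stability
    return sf2 * np.exp(2.0 * np.sum(np.log(np.diag(tL))) / n)

Xs, ys = data[0], data[1]
Hs = basisMatrix(Xs)
res = minimize_scalar(lambda l: reducedNegLogLik(l, Xs, ys, Hs), bounds=(1e-2, 1.0), method='bounded')
print('fitted length scale:', res.x)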
In the homework, you will implement this two-step method: the length scales are first found by minimizing the reduced likelihood, after which $\sfs$ and the remaining quantities follow from the closed-form expressions above.