Inference for Gaussian Process Parameters

Read 15.1 to 15.2.5 (inclusive) of “Machine Learning: A Probabilistic Perspective” by Kevin Murphy.
You can find a pdf on the internet.

The goal is to infer the parameters of a simple Gaussian Process model.

Our model is \[ y_i = f(x_i) + \epsilon_i \]

To keep things simple, let’s assume \(x_i \in \mathbb{R}\).

Here \(\epsilon_i \sim N(0,\sigma^2_y)\) and \[ (f(x_1),f(x_2),\ldots,f(x_N))' \equiv f(x) \sim N(0,K), \;\; K_{ij} = \kappa(x_i,x_j). \] Then, \[ Y \,|\, x \sim N(0,K_y), \;\; K_y = K + \sigma^2_y \, I. \] Now let \(\kappa\) depend on some parameters so that we have \(\kappa(x_i,x_j;\gamma)\) and hence

\[ K_y = K(\gamma) + \sigma^2_y \, I. \]

The example in Murphy is \[ \kappa(x_i,x_j;\sigma_f,l) = \sigma^2_f \, \exp\!\Big(-\frac{1}{2l^2} \, (x_i-x_j)^2\Big), \]

so \(\gamma = (\sigma_f,l)\).
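
For concreteness, here is a minimal sketch of this kernel in Python/NumPy; the function name and vectorization are my own choices, not Murphy's.

```python
import numpy as np

def se_kernel(x1, x2, sigma_f, l):
    """Squared exponential kernel k(x_i, x_j) = sigma_f^2 exp(-(x_i - x_j)^2 / (2 l^2)).

    x1, x2 are 1-d arrays; returns the len(x1) x len(x2) matrix K.
    """
    diff = x1[:, None] - x2[None, :]              # pairwise differences x_i - x_j
    return sigma_f**2 * np.exp(-0.5 * diff**2 / l**2)
```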

Letting \(\theta = (\gamma,\sigma_y)\), we have \[ K_y(\theta) = K(\gamma) + \sigma^2_y \, I. \]

Then \(Y \, |\, x \sim N(0,K_y(\theta))\) and \[ \log(p(y \, | \, x,\theta)) = - \frac{1}{2} y' K_y(\theta)^{-1} y - \frac{1}{2} \log(|K_y(\theta)|) - \frac{N}{2} \log(2 \pi), \]

which is Equation 15.22 of Murphy.
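
Before worrying about Algorithm 15.1, you could evaluate this expression directly with generic solvers as a sanity check. A minimal sketch, assuming my own parameter ordering \(\theta = (\sigma_f, l, \sigma_y)\):

```python
import numpy as np

def log_marginal_likelihood(theta, x, y):
    """Direct evaluation of log p(y | x, theta) for theta = (sigma_f, l, sigma_y)."""
    sigma_f, l, sigma_y = theta
    N = len(y)
    diff = x[:, None] - x[None, :]
    K = sigma_f**2 * np.exp(-0.5 * diff**2 / l**2)   # K(gamma)
    K_y = K + sigma_y**2 * np.eye(N)                 # K_y(theta)
    alpha = np.linalg.solve(K_y, y)                  # K_y^{-1} y
    sign, logdet = np.linalg.slogdet(K_y)            # log |K_y|
    return -0.5 * y @ alpha - 0.5 * logdet - 0.5 * N * np.log(2 * np.pi)
```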

Now the idea of the project is to explore inference for \(\theta\) by computing the likelihood.

For example, you could use Algorithm 15.1 in Murphy to compute \(\log(p(y \, | \, x,\theta))\).

A simple thing to try would be to fix all of the elements of \(\theta\) but one: \(\theta = (\theta_i,\theta_{-i})\). Fix \(\theta_{-i}\) and then evaluate \[ \log(p(\theta_i \,|\, \theta_{-i},y)) \propto \log(p(y \,|\, x, \theta)) + \log(p(\theta)) \]

on a grid of values for \(\theta_i\). Here \(p(\theta)\) is the prior. The use of a prior is optional.
You could then use co-ordinate descent or a griddy Gibbs sampler.
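
Here is a minimal sketch of the grid evaluation, assuming a flat prior (so the \(\log(p(\theta))\) term drops out), the `log_marginal_likelihood` sketch above, and data arrays `x` and `y`; it varies the length-scale \(l\) with the other elements of \(\theta\) fixed.

```python
import numpy as np

# Grid over the length-scale l, holding sigma_f and sigma_y fixed; with a flat
# prior, log p(l | rest, y) is just the log likelihood up to a constant.
sigma_f, sigma_y = 1.0, 0.1                        # fixed values (arbitrary, for illustration)
l_grid = np.linspace(0.05, 2.0, 100)
log_post = np.array([log_marginal_likelihood((sigma_f, l, sigma_y), x, y)
                     for l in l_grid])
l_hat = l_grid[np.argmax(log_post)]                # grid maximizer of the conditional
```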

Recall that co-ordinate descent means you just optimize over one element of \(\theta\) at a time, given all the other elements.
Each iteration of co-ordinate descent then cycles through the elements, optimizing each one given the others.
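
A sketch of grid-based co-ordinate descent (my own construction; `logpost` stands for any function returning the log likelihood plus, optionally, the log prior):

```python
import numpy as np

def coordinate_descent(logpost, theta0, grids, n_sweeps=20):
    """Cycle through the elements of theta, maximizing logpost over one
    coordinate at a time (on a grid) with the others held fixed.

    logpost: function theta -> log p(y | x, theta) (+ log prior if used)
    theta0 : starting vector; grids: list of 1-d grids, one per coordinate.
    """
    theta = np.array(theta0, dtype=float)
    for _ in range(n_sweeps):
        for i, grid in enumerate(grids):
            vals = []
            for g in grid:
                theta[i] = g
                vals.append(logpost(theta))
            theta[i] = grid[int(np.argmax(vals))]  # best value for coordinate i
    return theta
```

The same per-coordinate grid evaluation is the building block for a griddy Gibbs sampler: instead of taking the arg-max along each grid, you would draw \(\theta_i\) from the normalized grid probabilities.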

Or you could use gradient descent.
Equations 15.23 and 15.24 give the gradient of \(\log(p(y \, | \, x,\theta))\) with respect to \(\theta\).
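
For reference, the standard form of that gradient (which should correspond to 15.23 and 15.24) is, writing \(\alpha = K_y(\theta)^{-1} y\),

\[ \frac{\partial}{\partial \theta_j} \log(p(y \, | \, x,\theta)) = \frac{1}{2} \, y' K_y^{-1} \frac{\partial K_y}{\partial \theta_j} K_y^{-1} y - \frac{1}{2} \, \mathrm{tr}\Big( K_y^{-1} \frac{\partial K_y}{\partial \theta_j} \Big) = \frac{1}{2} \, \mathrm{tr}\Big( (\alpha \alpha' - K_y^{-1}) \frac{\partial K_y}{\partial \theta_j} \Big). \]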

I would plot the likelihood of two elements of \(\theta\) given the third.
For example, a thorough exploration of the likelihood in terms of \(\gamma\) for fixed \(\sigma_y\) should be very interesting.
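
A sketch of such a plot, assuming the `log_marginal_likelihood` function and data arrays `x`, `y` from the earlier sketches (the grid ranges are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

# Log likelihood surface over (sigma_f, l) with sigma_y held fixed.
sigma_y = 0.1
sf_grid = np.linspace(0.1, 3.0, 60)
l_grid = np.linspace(0.05, 2.0, 60)
ll = np.array([[log_marginal_likelihood((sf, l, sigma_y), x, y)
                for sf in sf_grid] for l in l_grid])

plt.contourf(sf_grid, l_grid, ll, levels=30)
plt.xlabel(r"$\sigma_f$")
plt.ylabel(r"$l$")
plt.colorbar(label="log likelihood")
plt.show()
```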

Note that Algorithm 15.1 just says that to compute \(\log(p(y \, | \, x,\theta))\) you let \(L\) be the Cholesky factor of \(K_y(\theta)\).

Then use the Cholesky to compute \(|K_y(\theta)|\) and \(K_y(\theta)^{-1} y \equiv \alpha\).

See page 228 of Murphy for discussion of the “backslash operator” which he uses in Algorithm 15.1.
But \(L \backslash y\) just means solve the equations \(L x = y\), i.e. \(x = L^{-1} y\), and we can do this efficiently for triangular \(L\) by forward substitution.

That is, if \(K_y = LL'\), then

\[ \alpha = K_y^{-1} y = (LL')^{-1} y = (L^{-1})' L^{-1} y, \] and \[ \log(|K_y|^{1/2}) = \sum_i \log(L_{ii}). \] Line 6 of Algorithm 15.1 then has \[ \log(p(y \, | \, x,\theta)) = -\frac{1}{2} \, y' \alpha \; - \sum_i \log(L_{ii}) - \frac{N}{2} \, \log(2 \pi). \]
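
Putting this together, here is a minimal Python sketch of Algorithm 15.1 (my own translation, using SciPy's triangular solver for the backslash steps):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gp_log_likelihood(theta, x, y):
    """log p(y | x, theta) via the Cholesky factor, for theta = (sigma_f, l, sigma_y)."""
    sigma_f, l, sigma_y = theta
    N = len(y)
    diff = x[:, None] - x[None, :]
    K_y = sigma_f**2 * np.exp(-0.5 * diff**2 / l**2) + sigma_y**2 * np.eye(N)
    L = cholesky(K_y, lower=True)                   # K_y = L L'
    z = solve_triangular(L, y, lower=True)          # z = L^{-1} y   ("L \ y")
    alpha = solve_triangular(L.T, z, lower=False)   # alpha = (L^{-1})' L^{-1} y = K_y^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))            # = (1/2) log |K_y|
            - 0.5 * N * np.log(2 * np.pi))
```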

Use real and simulated data to see how your procedure works.
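
For the simulated data, a minimal sketch that draws from the model above, with “true” parameter values of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

# "True" parameters (arbitrary choices, for illustration)
sigma_f_true, l_true, sigma_y_true = 1.0, 0.5, 0.1

N = 50
x = np.sort(rng.uniform(0.0, 5.0, N))
diff = x[:, None] - x[None, :]
K = sigma_f_true**2 * np.exp(-0.5 * diff**2 / l_true**2)
K += 1e-10 * np.eye(N)                          # jitter for numerical stability

# Draw f ~ N(0, K) and add observation noise
f = rng.multivariate_normal(np.zeros(N), K)
y = f + sigma_y_true * rng.standard_normal(N)
```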