This note is a brief addendum to Section 3.3 of Bishop on Bayesian Linear Regression. Some of the derivations in that section assume, for simplicity, that the prior mean on the weights is zero. Here we’ll relax this assumption and see what happens to the equivalent kernel.
Background
The setting in that section is that, given an input $\xx$, we predict the output to be a linear combination of features computed from that input, $$ y(\xx, \ww) = \phi(\xx)^T \ww.$$ Our uncertainty about the prediction comes from our uncertainty in the weights, which, given a dataset of inputs $\xx_1, \dots, \xx_N$ and corresponding outputs $t_1, \dots, t_N$, we found to be $$p(\ww|\text{training data}) = \mathcal{N}(\ww|\mm_N, \SS_N),$$ where the mean is $$\mm_N = \SS_N(\SS_0^{-1} \mm_0 + \beta \Phi^T \tt),$$ and the precision is $$\SS_N^{-1} = \SS_0^{-1} + \beta \Phi^T \Phi.$$ Here $\Phi^T = [\phi(\xx_1) \dots \phi(\xx_N)]$ is the transposed design matrix, $\tt$ is the vector of outputs, $\beta$ is the noise precision, and $\mm_0$ and $\SS_0$ are the prior mean and covariance on the weights.
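To make these formulas concrete, here is a minimal NumPy sketch that computes $\mm_N$ and $\SS_N$ for a made-up dataset, feature map, and prior; all of the specific choices (polynomial features, the values of $N$, $M$, and $\beta$, the random prior) are purely illustrative, and the variable names simply mirror the symbols above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D dataset and a simple polynomial feature map (illustrative choices).
N, M = 20, 4                           # number of data points, number of features
beta = 25.0                            # noise precision, assumed known
x = np.linspace(-1, 1, N)
t = np.sin(np.pi * x) + rng.normal(scale=1 / np.sqrt(beta), size=N)

def phi(xn):
    """Feature vector phi(x): powers of x up to degree M - 1."""
    return np.array([xn ** j for j in range(M)])

Phi = np.stack([phi(xn) for xn in x])  # N x M design matrix; Phi^T has columns phi(x_n)

# An arbitrary prior on the weights: mean m0, symmetric positive-definite covariance S0.
m0 = rng.normal(size=M)
A = rng.normal(size=(M, M))
S0 = A @ A.T + np.eye(M)

# Posterior precision and mean, exactly as in the formulas above.
S0_inv = np.linalg.inv(S0)
SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)
```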
Equivalent kernel
An interesting result of that section is that we can think of prediction as a weighted combination of the training data, where each datapoint is weighted by its similarity to the point we’re predicting at, with similarity measured by an equivalent kernel. This was derived, for simplicity, for the case where the prior mean is 0 and the prior covariance is isotropic, $\SS_0 = \alpha^{-1} \II$. In that case, the posterior mean on the weights simplifies to $$\mm_N = \beta \SS_N \Phi^T \tt.$$ The mean prediction at a given point $\xx$ is then $$ y(\xx, \mm_N) = \phi(\xx)^T \mm_N = \beta \phi(\xx)^T \SS_N \Phi^T \tt.$$ We can expand this out into a weighted combination of the training targets, $$y(\xx,\mm_N) = \sum_n \beta \phi(\xx)^T \SS_N \phi(\xx_n) t_n = \sum_n k(\xx, \xx_n) t_n,$$ where we’ve defined the equivalent kernel as $$ k(\xx, \xx') \triangleq \beta \phi(\xx)^T \SS_N \phi(\xx').$$
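As a quick numerical sanity check of this identity, we can reuse the sketch above (phi, Phi, x, t, M, beta) with a zero-mean, isotropic prior (the value of the prior precision $\alpha$ below is arbitrary) and confirm that the kernel-weighted sum reproduces the usual mean prediction.

```python
# Zero-mean, isotropic prior: m0 = 0, S0 = (1 / alpha) * I, with alpha the prior precision.
alpha = 2.0
SN_iso = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
mN_iso = beta * SN_iso @ Phi.T @ t

def k(xa, xb):
    """Equivalent kernel k(x, x') = beta * phi(x)^T S_N phi(x')."""
    return beta * phi(xa) @ SN_iso @ phi(xb)

x_star = 0.3                                              # an arbitrary test input
direct = phi(x_star) @ mN_iso                             # phi(x)^T m_N
kernel = sum(k(x_star, xn) * tn for xn, tn in zip(x, t))  # sum_n k(x, x_n) t_n
assert np.isclose(direct, kernel)                         # the two expressions agree
```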
What happens if we allow the mean and covariance of the prior on the weights to be arbitrary?
The general case
To see what happens in the general case, we can consider the covariance of the predictions, \begin{align}\cov(y(\xx), y(\xx')) &= \cov(\phi(\xx)^T \ww, \ww^T \phi(\xx'))\\ &= \phi(\xx)^T \cov(\ww) \phi(\xx')\\ &= \phi(\xx)^T \SS_N \phi(\xx')\\ &= \beta^{-1} k(\xx, \xx').\end{align}
Notice that the prior mean doesn’t play any role here, and the prior covariance, $\SS_0$, enters only through $\SS_N$. This suggests that the equivalent kernel is independent of the prior mean, and depends on the prior covariance only via its effect on $\SS_N$.
Returning to the mean of the prediction at a given point $\xx$, the expression is indeed more complex in the general case: $$ y(\xx, \mm_N) = \phi(\xx)^T \SS_N (\SS_0^{-1} \mm_0 + \beta \Phi^T \tt).$$ To express this in terms of the equivalent kernel, we find any solution $\bb$ to $$ \SS_0^{-1} \mm_0 = \beta \Phi^T \bb.$$ In the cases of interest here, $\Phi^T$ will have many more columns (data points) than rows (features), so as long as the feature vectors $\phi(\xx_n)$ span the feature space there will be infinitely many solutions, and we can pick any of them. We can then write the mean prediction as
\begin{align} y(\xx, \mm_N) &= \phi(\xx)^T \SS_N (\beta \Phi^T \bb + \beta \Phi^T \tt)\\
&= \beta \phi(\xx)^T \SS_N \Phi^T(\bb + \tt)\\
&=\sum_n \beta \phi(\xx)^T \SS_N \phi(\xx_n)(b_n + t_n)\\
&=\sum_n k(\xx, \xx_n) (b_n + t_n).\end{align}
So the mean prediction is still a kernel-weighted combination of the training targets, but with each target offset by an amount $b_n$ determined by the prior mean and covariance through $\bb$.
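The same check goes through in the general case, reusing the arbitrary-prior quantities from the first sketch (m0, S0_inv, mN, SN). One convenient choice of $\bb$ is the minimum-norm solution given by the pseudo-inverse, though, as noted above, any solution would do.

```python
# One particular solution b of S0^{-1} m0 = beta * Phi^T b: the minimum-norm
# (pseudo-inverse) solution.  Any other solution would work just as well.
b = np.linalg.pinv(beta * Phi.T) @ (S0_inv @ m0)
assert np.allclose(beta * Phi.T @ b, S0_inv @ m0)   # b really solves the system

def k_gen(xa, xb):
    """Equivalent kernel with the general prior: beta * phi(x)^T S_N phi(x')."""
    return beta * phi(xa) @ SN @ phi(xb)

x_star = 0.3
direct = phi(x_star) @ mN                           # phi(x)^T m_N
kernel = sum(k_gen(x_star, xn) * (bn + tn) for xn, bn, tn in zip(x, b, t))
assert np.isclose(direct, kernel)                   # the offsets b_n account for the prior mean
```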
Summary
We reviewed the equivalent kernel derivation and showed that when we allow the prior on the weights to be arbitrary, the equivalent kernel changes only through the effect of the prior covariance on $\SS_N$, and the prediction rule changes only by offsetting each training target by an amount determined by the prior.
$$\blacksquare$$