In section 8.1.4 of “Pattern Recognition and Machine Learning,” Bishop shows how a Bayesian network of scalar random variables where each is Gaussian distributed given its parents, and linear in the parents, produces a multivariate Gaussian joint distribution. I discuss that section in the video below.
In this post we flesh out some of the calculations in that section that show how to compute the mean and covariance of the multivariate Gaussian.
Code that demonstrates these computations is here.
Setting up the Network
We’re given a directed acyclic graph on $N$ nodes, with $\pa_i$ indicating the parents of node $i$. Each node’s conditional distribution given its parents is determined by \begin{align} x_i|\pa_i = b_i + \sqrt{v_i}\, \veps_i + \sum_{j \in \pa_i} w_{ij} x_j, \label{eq:cond} \end{align}where in Eq. \ref{eq:cond} $$\veps_i \sim \mathcal{N}(0, 1).$$
It’s important to emphasize that $v_i$ is the conditional variance of $x_i$ given its parents, not its marginal variance.
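To make the setup concrete, here is a minimal sketch (in NumPy, and not the linked code) of drawing ancestral samples according to Eq. \ref{eq:cond}. The three-node chain, with its offsets, conditional variances, and weights, is a made-up example, and the nodes are numbered so that every parent precedes its children.

```python
import numpy as np

# A small example network: node 0 -> node 1 -> node 2 (indices chosen so that
# every parent has a smaller index than its child).
parents = {0: [], 1: [0], 2: [1]}
b = np.array([1.0, -0.5, 2.0])   # offsets b_i
v = np.array([1.0, 0.5, 2.0])    # conditional variances v_i
W = np.zeros((3, 3))             # weights w_ij, row i, column j in pa_i
W[1, 0] = 0.8
W[2, 1] = -1.2

def sample(rng):
    """One ancestral sample: x_i = b_i + sum_j w_ij x_j + sqrt(v_i) * eps_i."""
    x = np.zeros(len(b))
    for i in range(len(b)):      # parents have smaller indices, so they come first
        eps = rng.standard_normal()
        x[i] = b[i] + W[i, parents[i]] @ x[parents[i]] + np.sqrt(v[i]) * eps
    return x

rng = np.random.default_rng(0)
samples = np.array([sample(rng) for _ in range(100_000)])
```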
We next want to compute the marginal statistics of each node. Because our distribution is a multivariate Gaussian, all we’ll need to specify it completely are the means and variances of each node, and the pairwise covariances.
These quantities would be easy to compute if we had the marginal probability distributions. For example, the mean of a node would of course be $$ \EE x_i = \int p(x_i) x_i dx_i.$$
However, we’re not given marginal distributions, but conditional distributions. So we have to express our expectations in terms of these known quantities.
The marginal distribution on each node decomposes as $$p(x_i) = \int_{\pa_i} p(x_i, \pa_i) d\pa_i = \int_{\pa_i} p(x_i | \pa_i)\, p(\pa_i)\, d\pa_i.$$
Applying this to the mean, we have \begin{align*} \EE x_i &= \int x_i p(x_i) dx_i\\ &= \int_{\pa_i} \left[ \int_{x_i} x_i p(x_i | \pa_i) dx_i \right] p(\pa_i) d\pa_i. \end{align*}
The term in square brackets is the conditional mean of $x_i$ given its parents. We’re given the conditional distribution of $x_i$, so we can read off the mean, $$\dots = \int_{\pa_i} [b_i + \sum_{j \in \pa_i} w_{ij} x_j]\; p(\pa_i) d\pa_i.$$ Integrating each parent value $x_j$ against $p(\pa_i)$ replaces it with its mean $\overline{x}_j$, so we get \begin{align}\EE x_i = b_i + \sum_{j \in \pa_i} w_{ij} \overline{x}_j.\label{eq:mean}\end{align}
This is the perhaps intuitive statement that the mean of a variable is its own offset $b_i$ plus a weighted sum of its parents’ means.
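As a sketch of how Eq. \ref{eq:mean} turns into code, reusing the hypothetical `parents`, `b`, and `W` defined above, one pass over the nodes (parents before children) yields all the marginal means:

```python
def marginal_means(parents, b, W):
    """Eq. (mean): E[x_i] = b_i + sum over parents j of w_ij * E[x_j]."""
    mu = np.zeros(len(b))
    for i in range(len(b)):   # parents have smaller indices, so they are done first
        mu[i] = b[i] + W[i, parents[i]] @ mu[parents[i]]
    return mu

mu = marginal_means(parents, b, W)   # [1.0, 0.3, 1.64] for the chain above
# samples.mean(axis=0) should agree with mu up to sampling error.
```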
Computing the covariances
To compute the covariances we need to calculate $$ \cov(x_i, x_j) = \EE_{x_i, x_j} (x_i - \overline{x_i}) (x_j - \overline{x_j}).$$ We’ll be computing this in terms of the parents of $j$, so it’s useful to write out the integral, \begin{align*}\cov(x_i, x_j) &= \int p(x_i, x_j) (x_i - \ol{x_i})(x_j - \ol{x_j}) dx_i dx_j \\ &=\int p(x_i, x_j, \pa_j) (x_i - \ol{x_i})(x_j - \ol{x_j}) dx_i dx_j d\pa_j \\ &= \int p(x_j | x_i, \pa_j) p(x_i, \pa_j) (x_i - \ol{x_i})(x_j - \ol{x_j}) dx_i dx_j d\pa_j \end{align*}
$j > i$
We will first consider the case when $i \neq j$. We’ll assume the nodes are numbered in a topological (ancestral) order, so that every parent has a smaller index than its children. For reasons that will become clear below, we will be computing the covariances from lower-numbered nodes to higher, so we will take $j > i$, and assume that we already have all the covariances of terms with indices less than $j$.
From Eq. \ref{eq:mean} we have the mean of node $j$ as $$ \overline{x}_j = \sum_{k \in \pa_j} w_{jk} \overline{x_k} + b_j.$$ Therefore, the deviation of $x_j$ relative to its mean is $$ x_j - \overline{x}_j = \sum_{k \in \pa_j} w_{jk} (x_k - \overline{x_k}) + \sqrt{v_j} \veps_j.$$
The covariance is then \begin{align*} \cov(x_i, x_j) &= \int p(x_j | x_i, \pa_j) p(x_i, \pa_j) (x_i - \ol{x_i})\left( \sum_{k \in \pa_j} w_{jk} (x_k - \overline{x_k}) + \sqrt{v_j} \veps_j\right) dx_i dx_j d\pa_j. \end{align*} We can split this into two parts: \begin{align} \dots &= \int p(x_i, \pa_j) \left(\sum_{k \in \pa_j} w_{jk}(x_i - \ol{x_i}) (x_k - \overline{x_k}) \right) dx_i d\pa_j \label{eq:cov_pa} \\ &+ \int p(x_i, \pa_j)p(x_j|x_i, \pa_j) (x_i - \ol{x_i}) \sqrt{v_j} \veps_j \; dx_i dx_j d\pa_j\label{eq:cov_noise}\end{align}
The first part measures how much $i$ covaries with the parents of $j$. We will see that it contributes a weighted sum of those covariances. First, \begin{align*} &\int p(x_i, \pa_j) \left(\sum_{k \in \pa_j} w_{jk}(x_i - \ol{x_i}) (x_k - \overline{x_k}) \right) dx_i d\pa_j \\&= \sum_{k \in \pa_j} w_{jk} \int p(x_i, \pa_j) (x_i - \ol{x_i}) (x_k - \ol{x_k}) dx_i d\pa_j. \end{align*}
We can separate $k$ from $\pa_j$ and decompose the latter as $$\pa_j = \{k\} \cup \pa_{j/k}.$$ In these terms, \begin{align*} \dots = \sum_{k \in \pa_j} w_{jk} &\int p(x_i, x_k, \pa_{j/k}) (x_i - \ol{x_i}) (x_k - \ol{x_k}) dx_i dx_k d\pa_{j/k}.\end{align*} Since the remaining factors only involve $i$ and $k$, we can integrate out $\pa_{j/k}$ and get \begin{align}\dots= \sum_{k \in \pa_j} w_{jk} &\int p(x_i, x_k) (x_i - \ol{x_i}) (x_k - \ol{x_k}) dx_i dx_k \nonumber \\ = \sum_{k \in \pa_j} w_{jk} &\;\cov(x_i, x_k).\label{eq:cov}\end{align}
Next, we consider the contribution of Eq. \ref{eq:cov_noise}. What’s crucial here is that node $j$ is downstream of node $i$. This makes the injected noise at node $j$ independent of the activity of node $i$. If instead node $i$ were downstream of $j$, then the noise injected into $j$ would contribute to its activity, and trickle down into $i$, making the latter’s value dependent on the noise.
Since we’re evaluating our covariances so that $j$ is downstream of $i$, the injected noise $\veps_j$ is independent of the activity of $i$ and of $j$’s other parents. Changing variables from $x_j$ to $\veps_j$, so that $p(x_j|x_i, \pa_j)\, dx_j = p(\veps_j)\, d\veps_j$, the expression in Eq. \ref{eq:cov_noise} simplifies to \begin{align*} &\int p(x_i, \pa_j)p(x_j|x_i, \pa_j) (x_i - \ol{x_i}) \sqrt{v_j} \veps_j \; dx_i dx_j d\pa_j \\= &\int p(x_i, \pa_j)p(\veps_j) (x_i - \ol{x_i}) \sqrt{v_j} \veps_j \; dx_i d\veps_j d\pa_j\\ = &\;\sqrt{v_j}\left[\int p(x_i) (x_i - \ol{x_i})\, dx_i\right]\left[\int p(\veps_j)\, \veps_j\, d\veps_j\right]\\ = &\;0.\end{align*}
Combining this with Eq. \ref{eq:cov} gives us the covariance of $i$ with $j>i$: \begin{align} \EE (x_i - \ol{x_i}) (x_j - \ol{x_j}) = \sum_{k \in \pa_j} w_{jk} \cov(x_i, x_k).\label{eq:cov_ij}\end{align}
Because we’re computing covariances in order from lower-numbered nodes to higher, we will already have all the covariances we need in this expression.
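In code, Eq. \ref{eq:cov_ij} is a single weighted sum. Here is a minimal sketch (the function name is mine, and it assumes `Sigma` already holds the covariances of $x_i$ with every parent of $j$):

```python
def cov_with_downstream(i, j, parents, W, Sigma):
    """Eq. (cov_ij): cov(x_i, x_j) for a node j downstream of i (j > i)."""
    return sum(W[j, k] * Sigma[i, k] for k in parents[j])
```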
$i = j$
Next, we’ll consider the case when $i = j$, i.e., we’ll compute the variance of node $j$. Note again that this is not $v_j$, the variance of the injected noise, which is the conditional variance given the activity of all the parent nodes. The marginal variance of this node will be at least as large as $v_j$, because it also includes the contributions from all the parents.
To compute the variance, we’ll again expand in terms of parents: \begin{align*} \cov(x_j, x_j) &= \int p(x_j) (x_j - \ol{x_j})^2 dx_j\\ &= \int p(x_j|\pa_j) p(\pa_j) (x_j - \ol{x_j})^2 dx_j d\pa_j \\ &= \int p(x_j|\pa_j) p(\pa_j) \left(\sum_{k \in \pa_j} w_{jk}(x_k - \ol{x_k}) + \sqrt{v_j} \veps_j\right)^2 \; dx_j d\pa_j \\ &=\int p(x_j|\pa_j) p(\pa_j) \left(\sum_{k_1, k_2 \in \pa_j} w_{jk_1}w_{jk_2}(x_{k_1} - \ol{x_{k_1}})(x_{k_2} - \ol{x_{k_2}}) + 2\sqrt{v_j} \veps_j \sum_{k \in \pa_j} w_{jk}(x_k - \ol{x_k}) + v_j \veps_j^2 \right) \; dx_j d\pa_j \end{align*}
The first term in the integral contributes weighted sums of covariances among the parents, $$ \sum_{k_1, k_2 \in \pa_j} w_{jk_1} w_{jk_2} \cov(x_{k_1}, x_{k_2}).$$
The second term integrates to zero because the noise at child nodes is independent of the activity at parent nodes.
The last term integrates to $v_j$, since $\veps_j$ is zero-mean with unit variance, so $\EE \veps_j^2 = 1$.
Combining these, we get \begin{align} \cov(x_j, x_j) = \sum_{k_1, k_2 \in \pa_j} w_{jk_1} w_{jk_2} \cov(x_{k_1}, x_{k_2}) + v_j.\label{eq:cov_j} \end{align}
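Putting Eqs. \ref{eq:cov_ij} and \ref{eq:cov_j} together, here is a minimal sketch (again my own, not the linked code) that fills in the whole covariance matrix in one pass, lower-numbered nodes first, so that every covariance needed on the right-hand side is already available:

```python
def covariance_matrix(parents, v, W):
    n = len(v)
    Sigma = np.zeros((n, n))
    for j in range(n):
        # Off-diagonal entries, Eq. (cov_ij): cov(x_i, x_j) for all i < j.
        for i in range(j):
            Sigma[i, j] = Sigma[j, i] = sum(W[j, k] * Sigma[i, k]
                                            for k in parents[j])
        # Diagonal entry, Eq. (cov_j): weighted parent covariances plus the injected noise.
        Sigma[j, j] = v[j] + sum(W[j, k1] * W[j, k2] * Sigma[k1, k2]
                                 for k1 in parents[j] for k2 in parents[j])
    return Sigma

Sigma = covariance_matrix(parents, v, W)
# np.cov(samples, rowvar=False) should agree with Sigma up to sampling error.
```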
$$ \blacksquare$$