These are my notes based on Section 8.1.3 “Discrete variables” in PRML. Watch the related video here.
We use graphical models to represent the joint probability distribution of a collection of random variables, in a way that captures their conditional independencies. To demonstrate that, in this section we’re going to consider discrete random variables. We’ll first consider the joint distributions themselves, then show how they can be expressed as Bayesian networks in a way that reveals their dependencies. These dependencies will require more and more parameters to specify as the number of variables increases, and this will be reflected in the graphical model. The graphical representation will then suggest ways of reducing the number of parameters, at the cost of expressivity.
The state representation
We’ll use a one-hot representation and represent each variable as a $K$-element vector $$\xx = [x_1, x_2, \dots, x_K],$$ with binary elements, $$x_i \in \{0,1\} \quad \forall i, $$ where exactly one is 1, $$\sum_{i=1}^K x_i = 1.$$
To provide a probabilistic description of such variables, we need to specify the probabilities $\mu_1, \mu_2, \dots, \mu_K$ of each of the $K$ states. That only requires $K-1$ parameters, since given the probabilities of any $K-1$ states, the probability of the last state is 1 minus the summed probabilities of the others. So, packing the state probabilities into a vector as $$\bmu = [\mu_1, \dots, \mu_K],$$ we have \begin{align} p(\xx = [1,0,\dots,0,0]|\bmu) &= \mu_1 \\ p(\xx=[0,1,\dots,0,0]|\bmu) &= \mu_2\\ \vdots\\ p(\xx=[0,0,\dots,1,0]|\bmu) &= \mu_{K-1} \\ p(\xx=[0,0,\dots,0,1]|\bmu) &= \mu_K = 1 - \sum_{k=1}^{K-1} \mu_k \\
p(\xx=\text{anything else}|\bmu) &= 0 \\ \end{align}
If we assume that we’ll only be dealing with valid one-hot states, we can write these probabilities as $$ p(\xx|\bmu) = \begin{cases} \mu_1 & \text{if } x_1 = 1\\ \mu_2 & \text{if } x_2 = 1 \\ \vdots \\ \mu_K & \text{if } x_K = 1 \end{cases}.$$ Because the $x_i$ are binary, we can write this expression compactly as $$ p(\xx|\bmu) = \mu_1^{x_1} \mu_2^{x_2} \dots \mu_K^{x_K} = \prod_{k=1}^K \mu_k^{x_k}.$$
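As a quick illustration (not from the book), here is a minimal numpy sketch of this probability; the values of `mu` and the chosen state are made up:

```python
import numpy as np

# Hypothetical example: K = 4 states, with x in state 2 (one-hot).
mu = np.array([0.1, 0.2, 0.3, 0.4])   # state probabilities, sum to 1
x = np.array([0, 1, 0, 0])            # one-hot representation of state 2

# p(x | mu) = prod_k mu_k ** x_k; the exponent picks out the active state.
p = np.prod(mu ** x)
print(p)  # 0.2, i.e. mu_2
```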
Joint distributions
The description above provides a complete probabilistic description of a single $K$-state random variable. If we have two such variables, a complete description requires specifying the probabilities of all $K^2$ combinations of states. This is the joint distribution of the two variables.
It’s going to get messy writing out combinations of one-hot vectors, so instead, when convenient we’ll refer to such binary vectors by the index of their non-zero element. So for example $$ \xx = 2 \quad\text{means}\quad \xx = [0,1,0,\dots,0].$$
With this convention in hand, we can proceed as before, and write the probabilities of the joint distribution as $$ p(\xx_1, \xx_2 | \bmu) = \begin{cases} \mu_{11} & \text{if } \xx_1 = 1, \xx_2 = 1 \\ \mu_{12} & \text{if } \xx_1 = 1, \xx_2 = 2 \\ \vdots \\ \mu_{KK}& \text{if } \xx_1 = K, \xx_2 = K \end{cases}.$$
As before, we can write this compactly as a product using the one-hot nature of the vectors. Take, for example, the contribution of the first term. By our convention, $\xx_1 = 1$ means that $x_{11} = 1$, and similarly for $\xx_2 = 1$. We want the corresponding component of our product to be $\mu_{11}$ if both of these conditions are met, and 1 otherwise. We can achieve this using the term $\mu_{11}^{x_{11} x_{21}}$, which is $\mu_{11}$ if both $x_{11}$ and $x_{21}$ are 1, and 1 otherwise.
Multiplying all these terms, we get $$p(\xx_1, \xx_2 | \bmu) = \mu_{11}^{x_{11} x_{21}} \mu_{12}^{x_{12} x_{21}} \dots \mu_{KK}^{x_{1K} x_{2K}} = \prod_{i=1}^K \prod_{j=1}^K \mu_{ij}^{x_{1i} x_{2j}}.$$
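A small sketch of the two-variable version, again with made-up numbers: $\bmu$ is now a $K \times K$ table, and the double product collapses to a single lookup:

```python
import numpy as np

K = 3
rng = np.random.default_rng(0)
mu = rng.random((K, K))
mu /= mu.sum()               # mu[i, j] = p(x1 = i+1, x2 = j+1); all entries sum to 1

x1 = np.array([0, 1, 0])     # x1 = 2 in our index convention
x2 = np.array([0, 0, 1])     # x2 = 3

# p(x1, x2 | mu) = prod_{i,j} mu_{ij} ** (x_{1i} * x_{2j})
p = np.prod(mu ** np.outer(x1, x2))
print(np.isclose(p, mu[1, 2]))   # True: the product picks out the (2, 3) entry
```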
We will now need $K^2 - 1$ parameter values to specify this system: one for every $\mu_{ij}$, except the last one since they must all sum to 1.
The pattern is now clear: if we had three variables, the joint probability would be $$ p(\xx_1, \xx_2, \xx_3 | \bmu) = \prod_{i=1}^K \prod_{j=1}^K \prod_{k=1}^K \mu_{ijk}^{x_{1i} x_{2j} x_{3k}},$$ and we would need $K^3 - 1$ parameter values to specify this system. In general, if we have $M$ such variables, we need $K^M - 1$ parameter values. In other words, the number of parameters we need grows exponentially with the number of variables.
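For general $M$, the table of parameters becomes an $M$-dimensional array with $K^M$ entries, which is where the exponential blow-up shows up in practice; a hypothetical sketch:

```python
import numpy as np

K, M = 3, 5
rng = np.random.default_rng(0)
mu = rng.random((K,) * M)
mu /= mu.sum()            # full joint table: K**M entries, K**M - 1 free parameters

states = (1, 0, 2, 2, 1)  # 0-based indices: x1 = 2, x2 = 1, x3 = 3, x4 = 3, x5 = 2
print(mu[states])         # evaluating the joint probability is a single lookup
print(mu.size)            # 243 = 3**5 entries to store
```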
Representation as a Bayesian network
We can express the distribution $p(\xx_1, \dots, \xx_M)$ as a directed graph called a Bayesian network. In such a graph, the nodes correspond to the variables. The edges correspond to dependencies among the variables.
To determine these dependencies, we repeatedly apply the product rule of probability to decompose the joint distribution into the product
$$ p(\xx_1,\dots,\xx_M) = p(\xx_1)p(\xx_2|\xx_1) p(\xx_3|\xx_1, \xx_2) \dots p(\xx_M|\xx_1,\dots,\xx_{M-1}).$$ Since we haven’t made any assumptions about the form of the joint distribution, this expression holds generically, for any joint distribution.
This decomposition expresses the dependencies among our random variables. The first variable, $\xx_1$, has no dependencies. The second, $\xx_2$, depends on the first, as specified by $p(\xx_2|\xx_1)$. The third depends on the first two, and so on.
We express these dependencies graphically by drawing arrows from a random variable to all the other random variables that depend on it. For example, we can express the decomposition $$p(\xx_1,\xx_2,\xx_3,\xx_4) = p(\xx_1) p(\xx_2|\xx_1) p(\xx_3|\xx_1,\xx_2)p(\xx_4|\xx_1,\xx_2,\xx_3)$$ as

The number of arrows landing on a given random variable determines the number of other variables it depends on, and therefore the number of parameters required to specify that dependency.
In our example above, $\xx_1$ has no arrows landing on it, so we only need $K-1$ parameters to specify it: the probabilities of each of its $K$ states.
The next variable, $\xx_2$, has one arrow landing on it, from $\xx_1$. This means that to completely specify $\xx_2$, we need to specify the probabilities of each of its $K$ states, for each state of $\xx_1$. This requires $K-1$ values for each of the $K$ states of $\xx_1$, for a total of $K (K-1)$ values to specify $p(\xx_2|\xx_1)$.
Continuing in this way, we see that we need $K^2 (K-1)$ values to specify $p(\xx_3|\xx_1, \xx_2)$ and so on.
For $M$ variables, the total number of parameters we need is therefore $$\sum_{m=1}^M (K-1)K^{m-1} = (K-1) {K^M - 1 \over K - 1} = K^M - 1.$$ As expected, this is the same count we computed before decomposing the joint probability.
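The geometric-series identity is easy to sanity-check numerically; a minimal sketch with arbitrary choices of $K$ and $M$:

```python
K, M = 4, 6

# (K - 1) free probabilities for p(x_m | x_1, ..., x_{m-1}),
# for each of the K**(m-1) joint states of its parents.
chain_rule_count = sum((K - 1) * K ** (m - 1) for m in range(1, M + 1))

print(chain_rule_count, K**M - 1)   # both print 4095
```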
Reducing the parameter counts
We saw above that the number of parameters needed to specify our joint distribution depends on the dependencies among the variables. In the generic setting, we need $O(K^M)$ values to determine the joint distribution, that is, exponential in the number of variables.
Many problems have 100s or 1000s of variables, and such exponential growth in the number of parameters will quickly exhaust our computing resources. The source of the exponential increase is that we are allowing the dependencies among variables to be arbitrarily complex. We will now discuss three ways of reducing the parameter counts by simplifying these dependencies:
- Localizing the dependencies.
- Parameter sharing.
- Parameterizing the dependencies.
1. Localization of dependencies
In our general formulation above, each variable depended on all previous variables. The final variable, $\xx_M$, depended on every other variable in the system. This was reflected in $\xx_M$ receiving arrows from all other variables. This dependency alone contributed $(K-1)K^{M-1} = O(K^M)$ values to our parameter count.
One way to reduce our parameter count is to make such dependencies more “local”. In the most extreme setting, we can remove all dependencies, making each node independent of the others:

This graphical model corresponds to the factorized probability distribution $$p(\xx_1, \xx_2, \xx_3, \xx_4) = p(\xx_1)p(\xx_2)p(\xx_3)p(\xx_4).$$ The dependencies of each variable are now maximally local, since no variable depends on any other. The number of parameters we need has correspondingly dropped to $M (K-1)$, and is linear in the number of variables $M$. This has come at the cost of expressivity: we’ve given up the possibility of capturing any dependence among our random variables.
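Under full independence, evaluating the joint is just a product of $M$ separate categorical probabilities, with $M(K-1)$ free parameters in total; a sketch with made-up numbers:

```python
import numpy as np

K, M = 3, 4
rng = np.random.default_rng(1)
mus = rng.random((M, K))
mus /= mus.sum(axis=1, keepdims=True)   # one length-K probability vector per variable

states = [0, 2, 1, 2]                   # 0-based state of each variable
p = np.prod(mus[np.arange(M), states])  # p(x1) p(x2) p(x3) p(x4)
print(p, M * (K - 1))                   # joint probability and parameter count (8)
```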
A less drastic way to reduce the parameter count is by cutting all edges in the graph except between immediate neighbours:

In this case, we need $(K-1)$ values to determine $p(\xx_1)$, and $(K-1)K$ for each $p(\xx_k|\xx_{k-1})$. This gives a total of $(K-1) + (K-1)K (M-1) = O(K^2 M)$. This is quadratic in the number of states, but still linear in the number of variables.
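Here is a minimal sketch of this chain-structured model, with a separate (randomly filled) conditional table for each edge; the variable names are just illustrative:

```python
import numpy as np

K, M = 3, 5
rng = np.random.default_rng(2)

mu1 = rng.random(K)
mu1 /= mu1.sum()                            # p(x1)
trans = rng.random((M - 1, K, K))
trans /= trans.sum(axis=2, keepdims=True)   # one K x K conditional table per edge of the chain

states = [0, 2, 1, 1, 0]                    # 0-based states of x1, ..., x5
p = mu1[states[0]]
for m in range(1, M):
    p *= trans[m - 1, states[m - 1], states[m]]   # p(x_{m+1} | x_m)

n_params = (K - 1) + (M - 1) * K * (K - 1)  # 2 + 4 * 3 * 2 = 26, linear in M
print(p, n_params)
```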
2. Parameter sharing
A related way to reduce our parameter count is to constrain certain parameters to have the same value. An important application of this idea is convolutional neural networks.
To see how to represent this, we’ll make the dependence on the parameters explicit:

Here the elements of the vector $\bmu_1$ determine the probabilities of each of the $K$ states of $\xx_1$. On the other hand, $\bmu_2$ to $\bmu_M$ are $K \times K$ matrices containing the conditional probabilities $p(\xx_m | \xx_{m-1})$, for each of the $K^2$ combinations of $\xx_m$ and $\xx_{m-1}$.
In the most general setting, these matrices can all be different, requiring us to provide $K - 1 + (M-1)K(K-1)$ parameters, as we saw above.
However, in many settings it makes sense to require that the conditional probabilities be the same for each variable. For example, if we imagine that the variables correspond to the states of a physical system at $M$ successive points in time, then it might make sense to assume that the dynamics, which determine the transitions between states, remain stationary in time. We represent that graphically by replacing the separate parameters for each variable, $\bmu_2, \dots, \bmu_M,$ with a single parameter $\bmu$:

In this case, our parameter count has dropped further, to $K-1$ values for $\xx_1$ plus $(K-1)K$ values for the single matrix, stored in $\bmu$, that determines the transitions from each state to the next.
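With sharing, the $M-1$ conditional tables collapse into one; the same hypothetical chain can then be sampled forward by reusing a single transition matrix:

```python
import numpy as np

K, M = 3, 5
rng = np.random.default_rng(3)

mu1 = rng.random(K)
mu1 /= mu1.sum()                     # p(x1): K - 1 free parameters
A = rng.random((K, K))
A /= A.sum(axis=1, keepdims=True)    # single shared transition table: K(K - 1) free parameters

x = [rng.choice(K, p=mu1)]           # sample x1
for _ in range(M - 1):
    x.append(rng.choice(K, p=A[x[-1]]))   # sample x_m given x_{m-1}, reusing the same A

print(x, (K - 1) + K * (K - 1))      # sampled (0-based) states and total parameter count: 8
```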
3. Parameterizing dependencies
Another way to reduce the parameter counts is to parameterize dependencies. Let’s go back to our original, general dependency structure, where the final variable $\xx_M$ depended on all the previous ones. To fully specify that dependence, we needed to specify what the probability of a given state of $\xx_M$ was for each of the $K^{M-1}$ possible states of the remaining variables. This gave us maximum flexibility, at the cost of having to provide a large number of parameter values.
We can reduce the number of required parameters by restricting the dependency to live within a parameterized family. For example, consider the case where all the variables are binary. Then $K = 2$, and we can represent each variable $\xx_i$ as the binary scalar $x_i$. Instead of letting the dependence of $x_M$ on the previous variables be arbitrary, requiring $2^{M-1}$ values, we can compute its activation probability using a linear combination of the other variables passed through a logistic sigmoid, $$ p(x_M=1|x_1, \dots, x_{M-1}) = \sigma(w_0 + \sum_{i=1}^{M-1} w_i x_i).$$ This only requires $M$ values, one for each $w_i$, dramatically reducing the number of parameters required, at the cost of limiting the complexity of the relationships we can capture.
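A minimal sketch of this parameterized conditional for binary variables; the weights here are random placeholders rather than anything fitted:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

M = 6
rng = np.random.default_rng(4)
w0 = rng.normal()               # bias w_0
w = rng.normal(size=M - 1)      # one weight per parent variable: M parameters in total

x_parents = np.array([1, 0, 1, 1, 0])   # binary states of x_1, ..., x_{M-1}
p_on = sigmoid(w0 + w @ x_parents)      # p(x_M = 1 | x_1, ..., x_{M-1})
print(p_on)
```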
Note also that because $x_M$ still depends on $x_1, \dots, x_{M-1}$, the graphical model of its dependencies is the same as before:

So the graph can only tell us who depends on whom. The nature of the dependencies and the number of parameters required are usually not visible from the graph, although, as we have seen, more arrows into a variable typically mean more parameters.
Summary
In this post we’ve seen how using the product rule to decompose a joint distribution gives rise to a representation of that distribution as a directed graph called a Bayesian network, whose nodes represent variables and whose edges represent dependencies. We’ve also observed the parameter explosion that occurs as the number of variables increases, and how limiting the complexity of the dependencies, which we can represent by removing edges from our graph, can tame this growth.
$$\blacksquare$$