Graph Spectra and Clustering

(These notes are principally based on conversations with ChatGPT and the notes of Dan Spielman, for example these.)

I’m currently interested in detecting the clustering of odour responses. The motivation is that this would reveal points in time at which specific computations become possible. Here, a computation is defined as a linear discrimination, separating one set of odours from another. We don’t know which such discrimination, if any, the system is performing. So, rather than checking every possible alternative, we can just check whether the odour responses cluster, which is what makes discrimination possible.

The approach to this problem that we will consider here is based on graph theory. One reason for this is that it will allow us to be flexible in our notion of clustering, rather than being limited to e.g. convex shapes, as in the k-means approach to clustering.

The vertices of the graph we will use correspond to the population responses to each repetition of each odour. That is, $M$ odours and $R$ repetitions yield $M R$ vertices.

We connect each vertex to every other with an edge, and weight that edge by some notion of the similarity of responses. For example, we can take a probabilistic view and consider each population response $\xx_i$ as determining a mean against which all other responses, for example $\xx_j$, are compared, using some fixed scale $\sigma$. In that case, we can set the weight of the edge connecting vertices $i$ and $j$ as $$ W_{ij} = p(\xx_j|\xx_i) \propto e^{-{1\over 2 \sigma^2} \|\xx_j - \xx_i\|_2^2}.$$
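To make this concrete, here is a minimal sketch in Python/NumPy (the array `X`, holding one population response per row, and the scale `sigma` are my own hypothetical names, not anything fixed by the analysis):

```python
import numpy as np

def affinity_matrix(X, sigma):
    """Gaussian affinities W_ij = exp(-||x_j - x_i||^2 / (2 sigma^2)).

    X     : (n_responses, n_neurons) array, one population response per row.
    sigma : scale controlling how quickly affinity decays with distance.
    """
    # Pairwise squared Euclidean distances between responses.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)   # drop self-edges; keeping W_ii = 1 is also a reasonable choice
    return W

# e.g. M odours x R repetitions stacked into the rows of X:
# W = affinity_matrix(X, sigma=1.0)
```

Whether to keep or zero the diagonal (self-affinities) is a modelling choice; I zero it here so that the random walk introduced below has no self-loops.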

However we choose to compute these affinities, the resulting values can be tabulated in a matrix, whose eigenvalues, i.e. spectrum, tell us how clustered the graph is.

The intuition behind how this works is based on random walks. We can start on a node and move to other nodes according to the edge-weights, treated as probabilities. We can run this forward for some fixed number of steps, and then ask what the probability of returning to our starting point is.

At one extreme, we can consider a graph that’s fully connected with equal weights. In that case, our particle would move freely among the nodes. We would expect the probability of return to be low as time went on, approaching some asymptotic value. This graph would represent the “one giant cluster”, or “no clusters”, situation.

At the other extreme, we can consider a graph consisting of isolated vertices, with all edge weights set to 0. In that case our starting particle would be stuck at its starting location, and the probability of return would be 1. This graph would represent the “all singleton clusters” situation.

Intermediate situations would correspond to the “true” clusterings we’re after, in which, hopefully, the responses would cluster according to odour, or groups of odours.

How do we actually compute the probabilities we need?

First, we can convert our weights to transition probabilities. We can consider $W_{ij}$ to be the un-normalized probability of transitioning from node $j$ to node $i$ in one step. To normalize it, we first compute the sum of all such transitions out of node $j$, which (since $\WW$ is symmetric) is just the sum of the weights at node $j$. This is known as the node degree, $$ D_j = \sum_{i} W_{ij}.$$

The (normalized) probabilities are then $$ P_{ij} = W_{ij}/D_j,$$ or in matrix form, $$ \PP = \WW \DD^{-1},$$ where $\DD$ is the diagonal matrix of node degrees.
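In code, with the affinity matrix $\WW$ in hand, this is a one-liner; a minimal, self-contained sketch that also checks each column of $\PP$ sums to one:

```python
import numpy as np

def transition_matrix(W):
    """Column-stochastic transition matrix P = W D^{-1}.

    P[i, j] is the probability of stepping from node j to node i.
    """
    D = W.sum(axis=0)        # degrees: D_j = sum_i W_ij
    return W / D[None, :]    # divide column j by D_j

# Sanity check on a small random symmetric affinity matrix.
rng = np.random.default_rng(0)
A = rng.random((5, 5))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)
P = transition_matrix(W)
assert np.allclose(P.sum(axis=0), 1.0)   # columns sum to one
```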

If we start with some initial distribution over nodes, $\xx_0$, then the distribution $n$ steps later will be $$ \xx_n = \PP^n \xx_0.$$

The elements of $\PP^n$ tell us the probability of ending up at some terminal state given some initial state. The diagonal elements are the return probabilities, the probability of starting and ending at the same place. Given our discussion above, we can sum these diagonal elements to get a notion of clustering. So we can define $$ C = \sum_i (\PP^n)_{ii} = \tr{(\PP^n)} = \sum_i \lambda_i^n,$$ where $\lambda_i$ are the eigenvalues of $\PP$.
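Here is a small sketch of this signature, together with a brute-force check against the trace of $\PP^n$ (the two-block affinity matrix is just an invented toy example):

```python
import numpy as np

def return_signature(P, n):
    """C = tr(P^n) = sum_i lambda_i^n, the summed n-step return probabilities."""
    lam = np.linalg.eigvals(P)
    return float(np.sum(lam ** n).real)

# Toy graph: two tight blocks joined by weak edges.
W = np.array([[0.0, 1.0, 1.0, 0.01, 0.01],
              [1.0, 0.0, 1.0, 0.01, 0.01],
              [1.0, 1.0, 0.0, 0.01, 0.01],
              [0.01, 0.01, 0.01, 0.0, 1.0],
              [0.01, 0.01, 0.01, 1.0, 0.0]])
P = W / W.sum(axis=0)

C = return_signature(P, n=10)
assert np.isclose(C, np.trace(np.linalg.matrix_power(P, 10)))
```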

So summing the spectrum tells us something about how clustered the data is.

We can extend our random walk to continuous time and arrive at a description of diffusion on the graph. The connection to clustering will be the same. “Heat” will diffuse more easily within clusters than between clusters, and we can sum how much heat is left at each starting node at a given point in time.

We can think of $\PP$ as containing the transition probabilities per unit time. For smaller intervals of time, we can imagine that there’s some probability that the particle stays put, which increases as the interval shrinks. So, $$ \xx(t + \Delta t) = (1 - \Delta t) \xx(t) + \Delta t \PP \xx(t).$$

In this formulation, $\xx(t + \Delta t) = \xx(t)$ when $\Delta t = 0$, and $\PP \xx(t)$ when $\Delta t = 1$, capturing our notion that $\PP$ should represent the transitions at unit time-steps.

If we rearrange the above, we get $${ \xx(t + \Delta t) - \xx(t) \over \Delta t} = (\PP - \II)\xx(t).$$ Taking the limit as $\Delta t\to 0$ gives $$ {d\xx \over dt} = -(\II - \PP)\xx(t), \implies \xx(t) = e^{-t(\II - \PP)} \xx(0).$$
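Numerically, the solution is just a matrix exponential; a quick sketch using `scipy.linalg.expm` (the three-node graph is an arbitrary example):

```python
import numpy as np
from scipy.linalg import expm

def diffuse(P, x0, t):
    """Solve dx/dt = -(I - P) x, i.e. x(t) = expm(-t (I - P)) x0."""
    n = P.shape[0]
    return expm(-t * (np.eye(n) - P)) @ x0

# Two strongly coupled nodes plus one weakly attached node.
W = np.array([[0.0, 1.0, 0.05],
              [1.0, 0.0, 0.05],
              [0.05, 0.05, 0.0]])
P = W / W.sum(axis=0)

x0 = np.array([1.0, 0.0, 0.0])      # all probability mass starts on node 0
xt = diffuse(P, x0, t=2.0)
assert np.isclose(xt.sum(), 1.0)    # mass is conserved as it diffuses
```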

So, assuming all the eigenvalues of $\II - \PP$ are $\ge 0$, we have a bunch of modes (given by the eigenvectors) decaying at rates determined by the eigenvalues of $\II - \PP$. The slowly decaying modes, those with small eigenvalues, are the ones that capture our clusters. But what do we know about the eigenvalues?

We know that the eigenvalues are real, because $\PP$ is similar to a symmetric matrix: $$ \DD^{-\half} \PP \DD^{\half} = \DD^{-\half} \WW \DD^{-1} \DD^{\half} = \DD^{-\half} \WW \DD^{-\half}.$$
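This is easy to check numerically: the eigenvalues of $\PP$ match those of the symmetric matrix $\DD^{-\half} \WW \DD^{-\half}$. A quick sketch, on an arbitrary random affinity matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((6, 6))
W = (A + A.T) / 2                    # symmetric affinities
np.fill_diagonal(W, 0.0)
D = W.sum(axis=0)

P = W / D[None, :]                   # W D^{-1}
S = W / np.sqrt(np.outer(D, D))      # D^{-1/2} W D^{-1/2}, symmetric

evals_P = np.sort(np.linalg.eigvals(P).real)
evals_S = np.sort(np.linalg.eigvalsh(S))     # symmetric solver: guaranteed real
assert np.allclose(evals_P, evals_S)
```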

We know that the eigenvalues are non-negative, because $\II - \PP$ is diagonally dominant along its columns: the diagonal elements, $1 - P_{ii}$, are greater than or equal, in fact just equal, to the summed magnitudes of the off-diagonal elements in the same column, $\sum_{j\neq i} |P_{ji}| = \sum_{j \neq i} P_{ji} = 1 - P_{ii} \triangleq R_i.$ Gershgorin’s circle theorem (applied to the columns, i.e. to the transpose) says that the eigenvalues lie within discs of radius $R_i$ around the corresponding diagonal elements, and by diagonal dominance, all of these discs remain in the non-negative half of the real line.
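A quick numerical illustration of the Gershgorin argument, checking that every column disc of $\II - \PP$ sits at or to the right of zero:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((6, 6))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)
P = W / W.sum(axis=0)

L = np.eye(6) - P
centers = np.diag(L)                              # diagonal elements 1 - P_ii
radii = np.abs(L).sum(axis=0) - np.abs(centers)   # summed off-diagonal magnitudes per column
assert np.all(centers - radii >= -1e-12)          # each disc stays in the non-negative half-line
```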

We also know that we’ll have at least one 0 eigenvalue, corresponding to the constant (left) eigenvector, since $\bone^\top(\II - \PP) = \bone^\top - \bone^\top = \bzero^\top.$ This reflects the normalization of the probabilities in $\PP$: each of its columns sums to one.

Following this intuition, we see that each connected component of the graph contributes a 0 eigenvalue: the indicator vector of the component plays the role of $\bone$, since the transition probabilities are normalized to one within each component.

Thus we can see that in the extreme case of disconnected components (where connection weights between components are 0), the number of 0 eigenvalues indicates the number of clusters. When the clusters aren’t completely disconnected, but only loosely connected, then only the first eigenvalue will be exactly 0. But the number of ‘small’ eigenvalues gives an approximate number of clusters.
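The sketch below makes this concrete on two small cliques, first completely disconnected and then joined by one weak edge (the weights are arbitrary toy values):

```python
import numpy as np
from scipy.linalg import block_diag, eigvals

def laplacian_spectrum(W):
    """Sorted eigenvalues of I - P for the graph with affinity matrix W."""
    P = W / W.sum(axis=0)
    return np.sort(eigvals(np.eye(len(W)) - P).real)

clique = np.ones((3, 3)) - np.eye(3)

W_disconnected = block_diag(clique, clique)        # two separate components
W_weak = W_disconnected.copy()
W_weak[0, 3] = W_weak[3, 0] = 0.01                 # one weak edge between the clusters

print(laplacian_spectrum(W_disconnected)[:3])      # two zeros: one per component
print(laplacian_spectrum(W_weak)[:3])              # one exact zero plus one 'small' eigenvalue
```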

Just like in the discrete case, we can count these small eigenvalues by computing the trace of $e^{-t(\II - \PP)},$ $$C(t) = \tr\left(e^{-t(\II - \PP)}\right) = \sum_k e^{-t \lambda_k}.$$ Depending on what time point we evaluate this at, the small eigenvalues will contribute $O(1)$ terms, while the larger ones get squashed towards 0.
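As a sketch, the same two weakly linked cliques as above give a heat trace that plateaus near 2, the number of clusters, at intermediate times:

```python
import numpy as np

def heat_trace(W, t):
    """C(t) = tr(exp(-t (I - P))) = sum_k exp(-t * lambda_k)."""
    P = W / W.sum(axis=0)
    lam = np.linalg.eigvals(np.eye(len(W)) - P).real
    return float(np.exp(-t * lam).sum())

# Two 3-node cliques joined by a single weak edge.
clique = np.ones((3, 3)) - np.eye(3)
W = np.zeros((6, 6))
W[:3, :3] = clique
W[3:, 3:] = clique
W[0, 3] = W[3, 0] = 0.01

for t in (0.1, 10.0, 1000.0):
    # roughly all 6 modes at small t, ~2 at intermediate t, decaying towards 1 as t grows
    print(t, heat_trace(W, t))
```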

How do we pick the $t$ at which to evaluate the trace? We can think of it as a low-pass filter. For small values of $t$, most eigenvalues will contribute to our metric. As we increase $t$, the contributions of the larger eigenvalues get squashed to zero, leaving only the few small ones. This will continue until, as $t \to \infty$, heat has spread fully through the graph, and our signature only picks up the contribution of the constant mode, with eigenvalue 0.

We can avoid selecting a $t$ by integrating it out (at the cost, as we’ll see below, of an extra parameter). For example, we can compute $$ C = \int_0^\infty \sum_{k\ge 2}e^{-t \lambda_k} \; dt = \sum_{k \ge 2} { 1 \over \lambda_k}.$$ Notice that we’ve skipped the first eigenvalue (at 0), to keep the measure from blowing up.

Small eigenvalues in the denominator can still make this blow up, so we can consider a regularized version by performing a weighted integration, $$ C(\alpha) = \int_0^\infty e^{-\alpha t} \sum_{k\ge 2} e^{-t \lambda_k} \; dt = \sum_{k \ge 2} {1 \over \lambda_k + \alpha}.$$ Here $\alpha$ provides a scale to separate the small and large eigenvalues. We can then estimate the number of clusters as $$ \kappa(\alpha) = \alpha C(\alpha) + 1= \sum_{k\ge 1} {1 \over 1 + {\lambda_k \over \alpha}},$$ where, due to the regularization, we can now sum over all eigenvalues.
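A sketch of this estimator on the two weakly linked cliques from above; with $\alpha$ chosen between the small and large eigenvalues, $\kappa(\alpha)$ comes out close to 2:

```python
import numpy as np

def estimated_num_clusters(W, alpha):
    """kappa(alpha) = sum_k 1 / (1 + lambda_k / alpha), over eigenvalues of I - P."""
    P = W / W.sum(axis=0)
    lam = np.linalg.eigvals(np.eye(len(W)) - P).real
    lam = np.clip(lam, 0.0, None)        # remove tiny negative values from round-off
    return float(np.sum(1.0 / (1.0 + lam / alpha)))

clique = np.ones((3, 3)) - np.eye(3)
W = np.zeros((6, 6))
W[:3, :3] = clique
W[3:, 3:] = clique
W[0, 3] = W[3, 0] = 0.01

print(estimated_num_clusters(W, alpha=0.05))   # roughly 2 for this two-cluster graph
```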

$\blacksquare$

