Estimating the intrinsic dimensionality of data with the participation ratio

Many datasets are samples of the values of a given set of $N$ features. We can visualize these data as points in an $N$-dimensional space, with each point corresponding to one of the samples. Visualization encourages geometric characterization. A basic geometric property is dimensionality: what is the dimension of the space in which the data live? Since each data point consists of $N$ features, one answer is that the data are $N$-dimensional. This is known as the ambient, or extrinsic, dimensionality of a dataset.

Often, however, our $N$-dimensional data don’t fill $N$-dimensional space uniformly in all directions. Instead, they occupy a lower-dimensional subset of this space. As a simple example, consider a dataset consisting of points sampled from a line in the plane. Each point is specified with an $x$ and a $y$ coordinate. Therefore the extrinsic dimensionality of this dataset is 2. However, the points don’t occupy all directions in the two dimensions equally. They are restricted to a one-dimensional subset, the line. The dimension of the space the data occupy is its intrinsic dimensionality. For our toy dataset, this dimensionality is 1.

The intrinsic dimensionality of our toy dataset can be determined by inspection. How do we estimate it for real datasets? There are several notions of intrinsic dimensionality. The approach we’re interested in is based on quantifying the dimensionality of the linear subspace (point, line, plane, …) of the ambient space that the data occupy. We identify this subspace by using principal component analysis (PCA) to compute an orthonormal set of $N$ directions (the principal components, or PCs) in the $N$-dimensional feature space, ordered by the amount of data variance they capture, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_N$. If the data fill space uniformly in all directions, these sorted variances should decrease slowly: the variance along each direction should be approximately the same, apart from fluctuations due to sampling.

If, however, the data live on a low-dimensional subset of the space, then the first few variances, corresponding to the directions spanning this low-dimensional subset, should be large. The remaining variances, corresponding to the directions that span the space the data don’t occupy, should be small. Applied to our toy example of data sampled from the line, the first PC would correspond to the line itself and would capture all of the variance of the data. The second PC would be orthogonal to the first, and would not capture any variance. Therefore, in this example, one of the two PCs captures all of the variance, in agreement with the intrinsic dimensionality, 1, of the data.

This intuitive approach suggests that the intrinsic dimensionality of a dataset is related to the number of principal component variances that are somehow ‘large.’ The participation ratio (PR) quantifies this intuition. It is defined as $$\text{PR}(\lambda_1,\dots,\lambda_N) = {(\lambda_1 + \dots + \lambda_N)^2 \over \lambda_1^2 + \dots + \lambda^2_N}.$$ Below we will try to understand this expression by examining some of its properties. We will then take a geometric perspective to show how one could derive this expression.
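To make this concrete, below is a minimal sketch of computing the PR directly from data, assuming NumPy is available; the helper name `participation_ratio` and the toy dataset are ours. The principal component variances are simply the eigenvalues of the covariance matrix of the data.

```python
# Minimal sketch: the PCA variances are the eigenvalues of the data's
# covariance matrix, and the PR is computed directly from them.
import numpy as np

def participation_ratio(variances):
    """PR = (sum of variances)^2 / (sum of squared variances)."""
    v = np.asarray(variances, dtype=float)
    return v.sum() ** 2 / np.sum(v ** 2)

rng = np.random.default_rng(0)

# Toy dataset: 1000 points on the line y = 2x (extrinsic dimensionality 2).
t = rng.standard_normal(1000)
X = np.column_stack([t, 2 * t])                    # shape (samples, features)

# Eigenvalues of the feature-by-feature covariance matrix = PCA variances.
lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))
print(participation_ratio(lam))                    # ~1.0: the data lie on a line
```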

Basic properties of the participation ratio

To better understand the participation ratio we begin with some basic observations. First, the PR is non-negative, since variances are non-negative. When all the variances are zero, PR becomes 0/0 and is therefore undefined. However, if we ignore such degenerate datasets, then $\text{PR} > 0$. This agrees with our intuitive notion of dimensionality that sets containing more than a single point should have positive dimension.

A second property of the participation ratio that also agrees with our intuitive notions of dimensionality is its invariance to the overall scale of the data. This is because when we scale the data by some amount, the variances all scale by the square of this amount, which can be factored out of both the numerator and denominator and cancelled out. One way such scaling would occur would be if our data were all measured in the same units, say, meters, and we changed units of measurement to, say, millimetres. Our values would increase 1000-fold, but we would not expect the dimensionality of the data to change, and indeed, when measured by the participation ratio, it does not.
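A quick numerical sanity check of this property (a sketch with made-up variances; the factor of $1000^2$ corresponds to converting measurements from meters to millimetres):

```python
import numpy as np

def participation_ratio(variances):
    v = np.asarray(variances, dtype=float)
    return v.sum() ** 2 / np.sum(v ** 2)

lam_m = np.array([3.0, 1.0, 0.5])        # PCA variances with data in meters
lam_mm = 1000.0 ** 2 * lam_m             # same data measured in millimetres

# Both calls print the same value: the common factor cancels.
print(participation_ratio(lam_m), participation_ratio(lam_mm))
```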

A third property of the participation ratio, one that does contradict the ideas about dimensionality we inherit from Euclidean space, is that it is real-valued, since the variances it depends on are real-valued. That is, unlike the dimensionality of Euclidean space, which only takes on integer values, the dimensionality of a set as measured by the participation ratio can be, for example, 1.2. Note that this is completely unrelated to another real-valued notion of dimensionality, the fractal dimension.

To further understand the participation ratio, we expand the numerator in its definition and find that $$\begin{align}\text{PR} &= {\lambda_1^2 + \dots + \lambda_N^2 + 2 \sum_{i=1}^{N-1} \sum_{j>i} \lambda_i \lambda_j \over \lambda_1^2 + \dots + \lambda_N^2} = 1 + 2{ \sum_{i=1}^{N-1} \sum_{j>i} \lambda_i \lambda_j \over \lambda_1^2 + \dots + \lambda_N^2}.\end{align}$$ Since the variances are non-negative, we observe that $$ \text{PR} \ge 1.$$ The lower bound is achieved when all the cross terms $\lambda_i \lambda_j$ are zero which, setting aside the degenerate all-zero case, happens exactly when only one of the variances is non-zero. In other words, PR = 1 if and only if the data lie along a one-dimensional subspace, i.e. a line.

At the other extreme, we can consider the case when the data are distributed uniformly in all directions. In that case, all the variances will have the same value, which we call $\lambda$. We would expect the dimensionality to be that of the ambient space, $N$. Indeed $$ \begin{align}\text{PR} &= 1 + 2{ \sum_{i=1}^{N-1} \sum_{j>i} \lambda^2 \over N \lambda^2}= 1 + 2{N(N-1) \lambda^2 \over 2 N \lambda^2} = 1 + (N-1) = N. \end{align}$$

Similar reasoning covers cases of intermediate dimensionality. For example, if the data are distributed uniformly in all directions of an $M < N$ dimensional subspace, then $M$ of the $N$ variances will have the same value, $\lambda$, and the remainder will be zero. The variances that are zero drop out of the expression for PR, so we’re in the same situation as above when the data filled the full $N$-dimensional space uniformly in all directions, except now restricted to $M$-dimensions. Therefore, $\text{PR}=M$ in such a case.
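As a sketch of this case (assuming NumPy; the dimensions, sample size, and random orthonormal embedding are illustrative choices), data spread isotropically over an $M$-dimensional subspace of an $N$-dimensional feature space should give a PR close to $M$:

```python
import numpy as np

def participation_ratio(variances):
    v = np.asarray(variances, dtype=float)
    return v.sum() ** 2 / np.sum(v ** 2)

rng = np.random.default_rng(1)
N, M, samples = 10, 3, 10_000

# Isotropic Gaussian in M dimensions, embedded into N dimensions via a random
# orthonormal map (reduced QR factorization yields orthonormal columns).
Q, _ = np.linalg.qr(rng.standard_normal((N, M)))
X = rng.standard_normal((samples, M)) @ Q.T        # shape (samples, N)

lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))
print(participation_ratio(lam))                    # close to M = 3
```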

Deriving the participation ratio

By examining the formula for the participation ratio we have seen that it has many of the properties that we would expect for a measure of the intrinsic dimensionality of a dataset: it is lower bounded by 1 when the data live on a line, upper bounded by the ambient dimensionality when the data fill space uniformly in all directions, and is invariant to the overall scaling of the dataset. But where does the formula come from? How could we have derived it ourselves?

One route to deriving the participation ratio begins by considering the principal component variances as an $N$-dimensional vector $$ \boldsymbol{\lambda} \triangleq [\lambda_1, \lambda_2, \dots, \lambda_N].$$ We could define a dimensionality measure $D_0$ by counting the number of non-zero variances. This corresponds to the $\ell_0$ norm of the vector of variances $$D_0(\boldsymbol{\lambda}) \triangleq \sum_{i=1}^N (\lambda_i > 0) = \|\boldsymbol{\lambda}\|_0.$$
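In code, $D_0$ is just a count of the non-zero variances. A small sketch (the helper name `d0` and the numerical tolerance are ours; in exact arithmetic the threshold would simply be zero):

```python
import numpy as np

def d0(variances, tol=1e-12):
    """Count the PCA variances that are (numerically) non-zero."""
    v = np.asarray(variances, dtype=float)
    return int(np.sum(v > tol))

print(d0([2.5, 0.0, 0.0]))    # 1: all variance along a single direction
print(d0([1.0, 1.0, 1.0]))    # 3: variance along every direction
```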

$D_0$ already has a few of the properties we want of a dimensionality: (1) it is lower bounded by 1, except when the data consist of a single point, which it handles more gracefully than the PR by assigning such degenerate datasets a dimensionality of 0; (2) it is upper bounded by the ambient dimensionality; (3) it is invariant to the overall scaling of the data; (4) it is also integer-valued, in agreement with our Euclidean intuition.

Unfortunately, $D_0$ is overly sensitive: any variance whatsoever along a previously unoccupied direction will add a whole integer value to the dimensionality. Returning to our toy example of points along a line, if any of these points deviated, even by an angstrom, from the line, the dimensionality of the dataset as measured by $D_0$ would immediately increase to 2, which seems unreasonable. It is more reasonable to expect the contribution of each direction to the dimensionality to be proportional to its variance. We then arrive at our second attempt at dimensionality, $$ D_1(\boldsymbol{\lambda}) \triangleq \sum_{i=1}^N \lambda_i = \sum_{i=1}^N |\lambda_i| = \|\boldsymbol{\lambda}\|_1.$$
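The sketch below illustrates this over-sensitivity (the noise level and helper names are illustrative): a tiny amount of noise off the line pushes $D_0$ from 1 to 2, while the participation ratio defined earlier barely moves.

```python
import numpy as np

def d0(variances, tol=1e-12):
    v = np.asarray(variances, dtype=float)
    return int(np.sum(v > tol))

def participation_ratio(variances):
    v = np.asarray(variances, dtype=float)
    return v.sum() ** 2 / np.sum(v ** 2)

rng = np.random.default_rng(2)
t = rng.standard_normal(1000)
X = np.column_stack([t, 2 * t])                      # points exactly on a line
X_noisy = X + 1e-4 * rng.standard_normal(X.shape)    # tiny deviations off the line

for data in (X, X_noisy):
    lam = np.linalg.eigvalsh(np.cov(data, rowvar=False))
    print(d0(lam), participation_ratio(lam))
# D_0 jumps from 1 to 2; the PR stays very close to 1.
```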

$D_1$ is thus just the $\ell_1$ norm of the vector of variances. Unfortunately, $D_1$ no longer has the nice range and scaling properties of $D_0$. For example, it is no longer invariant to the overall scaling of the data, nor does it have an upper bound. We can therefore increase the $D_1$ dimensionality of our toy dataset on the line simply by scaling all the values, which contradicts our intuition of how dimensionality should behave.

We need to normalize our dimensionality measure somehow. A key property that we require is invariance to the overall data scaling. Viewed geometrically, this means dimensionality should not depend on the magnitude of $\boldsymbol{\lambda}$, but only its direction. A natural idea is then to normalize $D_1$ by the length of the variance vector. This yields our third attempt at measuring intrinsic dimensionality, $$ D_2(\boldsymbol{\lambda}) \triangleq {D_1(\boldsymbol{\lambda}) \over \sqrt{\lambda_1^2 + \dots + \lambda_N^2}} = { \|\boldsymbol{\lambda}\|_1 \over \|\boldsymbol{\lambda}\|_2}.$$

$D_2$ now has our desired scale-invariance property, since if we scale our data by a factor $k$, variances scale by $k^2$, and $$D_2(k^2 \boldsymbol{\lambda}) = { \|k^2 \boldsymbol{\lambda}\|_1 \over \|k^2 \boldsymbol{\lambda}\|_2} = { k^2 \| \boldsymbol{\lambda}\|_1 \over k^2 \|\boldsymbol{\lambda}\|_2} = { \| \boldsymbol{\lambda}\|_1 \over \|\boldsymbol{\lambda}\|_2} = D_2 (\boldsymbol{\lambda}).$$ Another way to see this is to observe that $D_2$ is the $\ell_1$ norm of the unit vector $\hat{\boldsymbol{\lambda}}$ in the direction of $\boldsymbol{\lambda}$, $$D_2(\boldsymbol{\lambda}) = { \|\boldsymbol{\lambda}\|_1 \over \|\boldsymbol{\lambda}\|_2} = \left\| {\boldsymbol{\lambda} \over \|\boldsymbol{\lambda}\|_2} \right\|_1 = \| \hat{\boldsymbol{\lambda}} \|_1 = D_2 (\hat{\boldsymbol{\lambda}}).$$ Since scaling the variances doesn’t change their unit vector, it also doesn’t change their $D_2$ dimensionality.
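A short numerical sketch of these two observations (with made-up variances):

```python
import numpy as np

def d2(variances):
    v = np.asarray(variances, dtype=float)
    return np.linalg.norm(v, 1) / np.linalg.norm(v, 2)

lam = np.array([3.0, 1.0, 0.5])

print(d2(lam))                  # some value between 1 and sqrt(3)
print(d2(7.0 ** 2 * lam))       # scaling the data by 7 leaves D_2 unchanged

# D_2 is also the l1 norm of the unit vector in the direction of lam.
lam_hat = lam / np.linalg.norm(lam, 2)
print(np.linalg.norm(lam_hat, 1))
```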

To understand the range of values $D_2$ can take, we stay with this view of $D_2$ as the $\ell_1$ norm of the unit vector $\hat{\boldsymbol{\lambda}}$ and examine the $\ell_1$ norms of different unit vectors. We will do this first in two dimensions. The set of all unit vectors in two dimensions is of course the unit circle. Because variances are non-negative, we can focus on the subset of the unit circle in the first quadrant. To determine the $\ell_1$ norms of the points in this subset, we overlay the contours of this norm on our quarter of the unit circle. The contours in the first quadrant consist of lines at a constant angle of 135 degrees to the $x$-axis, starting at a point of value zero at the origin and increasing outwards.

The first of these increasing contours of the $\ell_1$ norm that contacts our quarter-circle has value 1 and intersects our quarter-circle simultaneously at the two points $(0,1)$ and $(1,0)$ where the quarter-circle contacts the coordinate axes. This tells us that $$D_2(\boldsymbol{\lambda}) \ge 1,$$ with equality achieved at points along a single coordinate axis. The points on single coordinate axes correspond to datasets where only one of the variances is non-zero. These are datasets that lie along a line, and $D_2$ assigns them all a dimensionality of 1, in agreement with our intuition.

As we continue to move along increasing contours of the $\ell_1$ norm, their intersection with our quarter-circle traces out the points with the correspondingly increasing $D_2$. Eventually, the contours contact our quarter-circle at just one point, before breaking contact altogether. This final point is $(1/\sqrt{2}, 1/\sqrt{2})$, with $\ell_1$ norm $2/\sqrt{2} = \sqrt{2}$. Generalizing to $N$ dimensions, we get $$D_2(\boldsymbol{\lambda}) \le \sqrt{N},$$ with equality achieved at $(1/\sqrt{N}, \dots, 1/\sqrt{N})$ and all of its scalings. In other words, the datasets with the highest $D_2$ have equal variance along all $N$ directions. This again agrees with our intuition about intrinsic dimensionality.

In $D_2$ we thus have a measure of dimensionality that is (1) not over-sensitive like $D_0$, (2) scale-invariant, (3) assigns its lowest value, 1, to datasets whose variance lies entirely along a single direction, and (4) assigns its highest value to datasets whose variance is spread uniformly along all directions. The one small problem is that the maximum dimensionality it assigns is $\sqrt{N}$ instead of $N$. Therefore, we square $D_2$ and notice that we have derived the participation ratio,
$$ D_2(\boldsymbol{\lambda})^2 = \left({\|\boldsymbol{\lambda}\|_1 \over \|\boldsymbol{\lambda}\|_2}\right)^2 = {\|\boldsymbol{\lambda}\|^2_1 \over \|\boldsymbol{\lambda}\|^2_2} = {(\lambda_1 + \dots + \lambda_N)^2 \over \lambda_1^2 + \dots + \lambda_N^2} = \text{PR}(\boldsymbol{\lambda}).$$
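A quick numerical check of this identity (illustrative variances):

```python
import numpy as np

lam = np.array([4.0, 2.0, 1.0, 0.0])

d2 = np.linalg.norm(lam, 1) / np.linalg.norm(lam, 2)
pr = lam.sum() ** 2 / np.sum(lam ** 2)

print(np.isclose(d2 ** 2, pr))    # True: squaring D_2 recovers the PR
```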

Pitfalls of the participation ratio

The participation ratio gives reasonable answers when the data are roughly convex. The dimensionality it assigns to datasets with other shapes can be counterintuitive. For example, the intrinsic dimensionality of the sine wave $(t, \sin(t))$ can reasonably be thought to be 1, since the data points are determined by a single parameter $t$. However, the participation ratio of a set of points sampled from this sine wave will almost certainly be greater than one, since the data will have variance along both the horizontal and vertical directions. Similarly, the circle $(\cos(t), \sin(t))$ has intrinsic dimension one because its points are determined by a single parameter. Yet data sampled randomly from the circle will have a PR of approximately two, since any pair of orthogonal directions will have similar variance. An X-shaped dataset will also have a PR of approximately two because of the similar variance along its diagonals, despite looking very different to the circle.
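A sketch illustrating these pitfalls (assuming NumPy; the sampling choices and helper name are ours): the circle and the X-shape both come out with PR close to 2, and the sine wave's PR is above 1, even though all three look very different.

```python
import numpy as np

def pr_of_data(X):
    """Participation ratio computed from the PCA variances of the data."""
    lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    return lam.sum() ** 2 / np.sum(lam ** 2)

rng = np.random.default_rng(3)
n = 10_000
t = rng.uniform(0.0, 2.0 * np.pi, n)

circle = np.column_stack([np.cos(t), np.sin(t)])           # PR close to 2
sine = np.column_stack([t, np.sin(t)])                      # PR above 1

s = rng.uniform(-1.0, 1.0, n)
branch = rng.integers(0, 2, n)                              # pick one of the two diagonals
x_shape = np.column_stack([s, np.where(branch, s, -s)])     # PR close to 2

for name, data in [("circle", circle), ("sine", sine), ("X", x_shape)]:
    print(name, pr_of_data(data))
```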

Summary

The participation ratio is a simple real-valued measure of the intrinsic dimensionality of a dataset. It is defined from the principal component variances as the ratio of the square of their sum to the sum of their squares. Equivalently, it is the squared $\ell_1$ norm of the length-normalized vector of principal component variances, and can therefore be seen as measuring the (lack of) sparsity of this vector. It has the nice properties of being bounded below by 1, attained for datasets along a line, and above by the ambient dimensionality, attained for datasets that fill space uniformly in all directions, while also being invariant to the overall scale of the data. It works best when the data are convex, and care must be taken when computing it for non-convex data as it can give unexpected results.

