Category: Blog
-
Notes on Kernel PCA
Following Bishop, I show how to express the eigenvectors of the feature-space covariance in terms of the eigenvectors of the kernel matrix, and how to compute the kernel matrix of the centered features from the uncentered one.
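As a quick illustration of the second point, a minimal NumPy sketch (mine, not the post's code) of the standard centering identity $\tilde{K} = K - \mathbf{1}_N K - K \mathbf{1}_N + \mathbf{1}_N K \mathbf{1}_N$, where every entry of $\mathbf{1}_N$ is $1/N$:

```python
import numpy as np

def center_kernel(K):
    """Kernel matrix of the centered features, from the uncentered N x N kernel K."""
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)   # matrix with every entry equal to 1/N
    return K - one_N @ K - K @ one_N + one_N @ K @ one_N
```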
-
Dimension reduction of vector fields
I discuss two notions of dimension reduction of vector fields from the “low-rank hypothesis” paper, and which of them might be the ‘correct’ one.
-
The low-rank hypothesis of complex systems
In this post I will summarize the paper “The low-rank hypothesis of complex systems” by Thibeault et al.
-
An iterative reweighted least squares miracle
I show what’s really happening in the iterative reweighted least squares updates for logistic regression.
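For context, the update being unpacked: Newton's method on the logistic-regression log likelihood can be rearranged into a weighted least-squares solve (standard form, e.g. Bishop §4.3.3; the post's notation may differ). With $R = \mathrm{diag}\big(y_n(1 - y_n)\big)$, $$ w^{\text{new}} = (\Phi^T R \Phi)^{-1} \Phi^T R z, \qquad z = \Phi w^{\text{old}} - R^{-1}(y - t), $$ i.e. each iteration is ordinary least squares on a working response $z$ with weights $R$.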
-
Computing with Line Attractors
These notes are based on Seung’s “How the Brain Keeps the Eyes Still”, where he discusses how a line attractor network may implement a memory of the desired fixation angle that ultimately drives the muscles in the eye. A Jupyter notebook with related code showing how to implement a line attractor is provided here. Background…
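Since the notebook can't be reproduced here, a one-unit caricature of the idea (my sketch, with made-up parameter values): recurrent feedback that exactly cancels the leak turns a leaky neuron into an integrator, so its activity persists at whatever value the input drove it to, the essence of a line attractor.

```python
import numpy as np

# One-unit caricature of a line attractor (illustrative sketch, not the notebook's code):
# recurrent weight w = 1 exactly cancels the leak, so the unit integrates its input
# and then holds the result, like a stored eye position.
tau, dt, T = 0.1, 0.001, 2.0           # time constant, step size, duration (s)
steps = int(T / dt)
w = 1.0                                 # try w = 0.95 (decaying memory) or 1.05 (unstable)
u = np.zeros(steps)
u[200:300] = 1.0                        # brief "velocity" input pulse
r, trace = 0.0, np.empty(steps)
for i in range(steps):
    r += (dt / tau) * (-r + w * r + u[i])   # tau * dr/dt = -r + w*r + u
    trace[i] = r
# With w = 1 the trace ramps up during the pulse and then stays flat afterwards.
```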
-
A Free Connectivity Non-Solution
In this post I explore one possible unconstrained connectivity solution that turns out not to work. As before, the loss function we’re optimizing is $$ L(\ZZ) = {1 \over 2} \|\XX^T \ZZ^T \ZZ \XX - \CC\|_F^2 + {\lambda \over 2}\|\ZZ - \II\|_F^2.$$ The gradient of this loss is $$ \nabla_\ZZ L = \ZZ (2 \XX \bE \XX^T)…
-
Interaction of Feature Size and Regularization
In this post we’re going to explore how estimated feature size is affected by regularization. The intuition is that the shrinkage applied by regularization means low-amplitude features get (even more) swamped by additive noise: their estimated values are dominated by the noise and so fail to generalize. So…
-
The Logic of Free Connectivity
When we fit connectivity to map the input representations to the output representations, constraining it only by regularizing its implied feedforward circuit to be approximately the identity, we find, first, that we can fit the representation data reasonably well. Below I show the observed representation of some held-out data on the left,…
-
Quantization
We’re trying to understand the solutions when minimizing $$L(\zz) = {1 \over 2 M^2} \|\XX^T \ZZ^2 \XX - \SS\|_F^2 + {\lambda \over 2 N}\|\zz - \bone\|_2^2.$$ We can write this as $$L(\zz) = {1 \over 2 M^2} \|\RR \zz^2 - \ss \|_2^2 + {\lambda \over 2 N}\|\zz - \bone\|_2^2,$$ where the $i$’th column of $\RR$…
-
The joint distribution of two iid random variables is spherically symmetric iff the marginal distribution is Gaussian.
Proof: If the marginal distribution is a zero-mean Gaussian, the joint distribution is clearly spherical. Below we show the converse: if the joint distribution is spherical, the marginal distribution is Gaussian. Let the marginal distribution be $g(x)$, so $p(x,y) = g(x)g(y)$. Spherical symmetry means that the gradient of the joint distribution at $(x,y)$ points along $(x,y)$. That is $$\nabla p(x,y)…
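A sketch of how the argument can be finished (standard separation-of-variables reasoning, not necessarily the post's exact steps): the gradient $\big(g'(x)g(y),\, g(x)g'(y)\big)$ being parallel to $(x,y)$ means $y\,g'(x)g(y) = x\,g(x)g'(y)$, so $$ {g'(x) \over x\, g(x)} = {g'(y) \over y\, g(y)} \quad \text{for all } x, y. $$ A function of $x$ alone that equals a function of $y$ alone must be a constant; normalizability forces it to be negative, say $-1/\sigma^2$, and integrating $(\ln g)'(x) = -x/\sigma^2$ gives $g(x) \propto e^{-x^2 / 2\sigma^2}$.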
-
Decorrelation through gain control
Decorrelation is typically thought to require lateral interactions. But how much can we gain just by gain control? The setting as usual is $N$-dimensional glomerular inputs $\xx$, driving projection neuron activity according to $\dot \yy \propto -\sigma^2 \yy + \xx - \WW \yy$, which at steady state gives an input-output transformation $$(\II \sigma^2…
-
EM for Factor Analysis
In this note I work out the EM updates for factor analysis, following the presentation in PRML 12.2.4. In factor analysis our model of the observations in terms of latents is $$ p(\xx_n|\zz_n, \WW, \bmu, \bPsi) = \mathcal{N}(\xx_n;\WW \zz_n + \bmu, \bPsi).$$ Here $\bPsi$ is a diagonal matrix used to capture the variances of the…
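For reference, the E-step that comes out of this model (cf. Bishop §12.2.4; stated here as a sketch in the excerpt's notation): the posterior over each latent is Gaussian with $$ \mathbb{E}[\zz_n] = (\II + \WW^T \bPsi^{-1} \WW)^{-1} \WW^T \bPsi^{-1} (\xx_n - \bmu), \qquad \mathrm{cov}[\zz_n] = (\II + \WW^T \bPsi^{-1} \WW)^{-1}, $$ and the M-step re-estimates $\WW$ and $\bPsi$ from these sufficient statistics.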
-
Automatic Relevance Determination for Probabilistic PCA
In this note I flesh out the computations for Section 12.2.3 of Bishop’s Pattern Recognition and Machine Learning, where he uses automatic relevance determination to choose the dimensionality of the principal subspace in probabilistic PCA. The principal subspace describing the data is spanned by the columns $\ww_1, \dots, \ww_M$ of $\WW$. The proper Bayesian way to…
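For orientation, the mechanism, roughly and in Bishop's notation (§12.2.3): each column of $\WW$ gets its own Gaussian prior precision, $$ p(\WW|\boldsymbol{\alpha}) = \prod_{i=1}^{M} \left({\alpha_i \over 2\pi}\right)^{D/2} \exp\!\left(-{\alpha_i \over 2}\, \ww_i^T \ww_i\right), $$ and maximizing the (approximate) evidence gives the re-estimate $\alpha_i \leftarrow D / (\ww_i^T \ww_i)$. Columns whose $\alpha_i$ diverges are driven to zero and drop out of the principal subspace, which is how the effective dimensionality is determined.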
-
A simple property of sparse vectors
This came up in Chapter 7 of Wainwright’s “High-Dimensional Statistics”. In that chapter we’re interested in determining how close solutions $\hat \theta$ to different flavours of the Lasso problem come to the true, $S$-sparse vector $\theta^*$. A useful notion is the set of $S$-dominant vectors (my terminology): $$ C(S) = \{x: \|x_{S^c}\|_1 \le \|x_S\|_1\},$$ i.e.…
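One property of this flavour that falls straight out of the definition (stated here only as an illustration; it may not be the post's exact point): any $x \in C(S)$ obeys $$ \|x\|_1 = \|x_S\|_1 + \|x_{S^c}\|_1 \le 2\|x_S\|_1 \le 2\sqrt{|S|}\,\|x_S\|_2 \le 2\sqrt{|S|}\,\|x\|_2, $$ so members of $C(S)$ behave like genuinely $S$-sparse vectors, up to a factor of $2$.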
-
Understanding Expectation Maximization as Coordinate Ascent
These notes are based on what I learned from my first postdoc advisor, who learned it (I believe) from (Neal and Hinton 1998). See also Section 4 of (Roweis and Ghahramani 1999) for a short derivation, and the broader discussion in Chapter 9 of Bishop, in particular Section 9.4. When performing maximum likelihood estimation…
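The identity at the heart of the coordinate-ascent view (Bishop §9.4): for any distribution $q(Z)$ over the latents, $$ \ln p(X|\theta) = \mathcal{L}(q, \theta) + \mathrm{KL}\big(q \,\|\, p(Z|X,\theta)\big), \qquad \mathcal{L}(q, \theta) = \sum_Z q(Z) \ln {p(X, Z|\theta) \over q(Z)}. $$ The E-step maximizes $\mathcal{L}$ over $q$ with $\theta$ fixed (setting $q$ to the posterior, so the KL term vanishes), and the M-step maximizes $\mathcal{L}$ over $\theta$ with $q$ fixed: coordinate ascent on $\mathcal{L}$.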
-
Maximum likelihood PCA
These are my derivations of the maximum likelihood estimates of the parameters of probabilistic PCA as described in Section 12.2.1 of Bishop, and with some hints from (Tipping and Bishop 1999). Once we have determined the maximum likelihood estimate of $\mu$ and plugged it in, we have (Bishop 12.44) $$ L = \ln p(X|W, \mu, \sigma^2)…
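For orientation, the destination of the derivation (cf. Bishop §12.2.1; Tipping and Bishop 1999): $$ W_{\mathrm{ML}} = U_M (L_M - \sigma^2 I)^{1/2} R, \qquad \sigma^2_{\mathrm{ML}} = {1 \over D - M} \sum_{i=M+1}^{D} \lambda_i, $$ where the columns of $U_M$ are the top $M$ eigenvectors of the data covariance, $L_M$ is the diagonal matrix of the corresponding eigenvalues $\lambda_i$, and $R$ is an arbitrary orthogonal matrix.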
-
Reaction rate inference
Consider the toy example set of reactions\begin{align*}S_0 &\xrightarrow{k_1} S_1 + S_2\\S_2 &\xrightarrow{k_2}S_3 + S_4\\S_1 + S_3 &\xrightarrow{k_3} S_5\end{align*}We have (noisy) data on the concentrations of the species as a function of time. We want to infer the rates $k_1$ to $k_3$. Let’s write the derivatives:\begin{align*}\dot S_0 &=- k_1 S_0\\\dot S_1 &= k_1 S_0 -k_3 S_1…
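Because the mass-action rate laws are linear in the rate constants, one simple baseline estimator (a sketch of my own, with hypothetical arrays t and S, and not necessarily the post's method) replaces the derivatives with finite differences and solves a linear least-squares problem for $(k_1, k_2, k_3)$:

```python
import numpy as np

# Assumed (hypothetical) inputs: t, a length-T array of time points, and S, a (T, 6)
# array of noisy concentrations [S0, S1, S2, S3, S4, S5].

def estimate_rates(t, S):
    dS = np.gradient(S, t, axis=0)                                # (T, 6) derivative estimates
    # Propensities of the three reactions: k1*S0, k2*S2, k3*S1*S3
    R = np.column_stack([S[:, 0], S[:, 2], S[:, 1] * S[:, 3]])    # (T, 3)
    # Stoichiometry matrix: entry [i, j] is the net change of species i in reaction j
    N = np.array([[-1,  0,  0],   # S0
                  [ 1,  0, -1],   # S1
                  [ 1, -1,  0],   # S2
                  [ 0,  1, -1],   # S3
                  [ 0,  1,  0],   # S4
                  [ 0,  0,  1]])  # S5
    # Stack dS_i/dt = sum_j N[i, j] * k_j * R[:, j] over species and solve for k
    A = np.concatenate([R * N[i] for i in range(6)], axis=0)      # (6T, 3)
    b = dS.T.reshape(-1)                                          # (6T,)
    k, *_ = np.linalg.lstsq(A, b, rcond=None)
    return k   # estimates of (k1, k2, k3)
```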
-
Natural parameterization of the Gaussian distribution
The Gaussian distribution in one dimension is often parameterized using the mean $\mu$ and the variance $\sigma^2$, in terms of which $$ p(x|\mu, \sigma^2) = {1 \over \sqrt{2\pi \sigma^2}} \exp\left(-{(x - \mu)^2 \over 2 \sigma^2} \right).$$ The Gaussian distribution is in the exponential family. For distributions in this…
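For reference, the destination (modulo notational choices that may differ from the post's): writing the density in exponential-family form gives natural parameters $\eta_1 = \mu/\sigma^2$ and $\eta_2 = -1/(2\sigma^2)$ with sufficient statistics $(x, x^2)$, $$ p(x|\boldsymbol{\eta}) = {1 \over \sqrt{2\pi}} \exp\!\left( \eta_1 x + \eta_2 x^2 - A(\boldsymbol{\eta}) \right), \qquad A(\boldsymbol{\eta}) = -{\eta_1^2 \over 4 \eta_2} - {1 \over 2}\ln(-2\eta_2). $$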
-
The inference model when missing observations
The inference model isn’t giving good performance. But is this because we’re missing data? In the inference model, the recorded output activity is related to the input according to $$ (\sigma^2 \II + \AA \AA^T) \bLa = \YY,$$ where we’ve absorbed $\gamma$ into $\AA$. We can model this as $N$ observations of $\yy$ given $\bla$, where $$…
-
A noob’s-eye view of reinforcement learning
I recently completed the Coursera Reinforcement Learning Specialization. These are my notes, still under construction, on some of what I learned. The course was based on Sutton and Barto’s freely available reinforcement learning book, so images will be from there unless otherwise stated. All errors are mine, so please let me know about any in…