Category: Blog

  • A simple property of sparse vectors

    This came up in Chapter 7 of Wainwright’s “High-dimensional Statistics”. In that chapter we’re interested in determining how close solutions $\hat \theta$ to different flavours of the Lasso problem come to the true, $S$-sparse vector $\theta^*$. A useful notion is the set of $S$-dominant vectors (my terminology): $$ C(S) = \{x: \|x_{S^c}\|_1 \le \|x_S\|_1\},$$ i.e.…
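
    As a concrete illustration (a minimal sketch, not from the post; the helper name in_dominant_set and the example vectors are my own), membership in $C(S)$ just compares the $\ell_1$ mass off the support with the $\ell_1$ mass on it:

    ```python
    import numpy as np

    def in_dominant_set(x, S):
        """Return True if x lies in C(S): the l1 norm off the support S
        does not exceed the l1 norm on S. (Illustrative helper.)"""
        mask = np.zeros(len(x), dtype=bool)
        mask[np.asarray(S)] = True
        return np.abs(x[~mask]).sum() <= np.abs(x[mask]).sum()

    # An S-sparse vector is trivially in C(S): its off-support l1 norm is zero.
    print(in_dominant_set(np.array([1.5, 0.0, -2.0, 0.0]), S=[0, 2]))  # True
    # A vector with most of its l1 mass off the support is not.
    print(in_dominant_set(np.array([0.1, 3.0, 0.1, 3.0]), S=[0, 2]))   # False
    ```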

  • Understanding Expectation Maximization as Coordinate Ascent

    These notes are based on what I learned from my first postdoc advisor, who learned it (I believe) from (Neal and Hinton 1998). See also section 4 of (Roweis and Ghahramani 1999) for a short derivation, and the broader discussion in Chapter 9 of Bishop, in particular Section 9.4. Introduction: When performing maximum likelihood estimation…
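
    For orientation, and presumably where the post is headed, the decomposition behind the coordinate-ascent view (Bishop Section 9.4, following Neal and Hinton) is, for any distribution $q(Z)$ over the latent variables, $$\ln p(X|\theta) = \mathcal{L}(q, \theta) + \mathrm{KL}\!\left(q \,\|\, p(Z|X,\theta)\right), \qquad \mathcal{L}(q, \theta) = \sum_Z q(Z) \ln {p(X, Z|\theta) \over q(Z)}.$$ The E-step maximizes $\mathcal{L}$ over $q$ with $\theta$ fixed (which sets $q(Z) = p(Z|X,\theta)$), and the M-step maximizes $\mathcal{L}$ over $\theta$ with $q$ fixed, so EM is coordinate ascent on $\mathcal{L}$.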

  • Maximum likelihood PCA

    These are my derivations of the maximum likelihood estimates of the parameters of probabilistic PCA as described in section 12.2.1 of Bishop, and with some hints from (Tipping and Bishop 1999). Once we have determined the maximum likelihood estimate of $\mu$ and plugged it in, we have (Bishop 12.44) $$ L = \ln p(X|W, \mu, \sigma^2)…
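
    For reference, the derivation ends in closed-form estimates (Bishop eqs. 12.45 and 12.46). A minimal sketch of where it lands, with ppca_ml an illustrative helper name of my own and the arbitrary rotation taken as $R = I$:

    ```python
    import numpy as np

    def ppca_ml(X, M):
        """Maximum likelihood estimates for probabilistic PCA
        (Tipping & Bishop; Bishop eqs. 12.45-12.46), with R = I."""
        N, D = X.shape
        mu = X.mean(axis=0)
        S = (X - mu).T @ (X - mu) / N          # sample covariance
        lam, U = np.linalg.eigh(S)             # eigenvalues in ascending order
        lam, U = lam[::-1], U[:, ::-1]         # sort descending
        sigma2 = lam[M:].mean()                # average of the discarded eigenvalues
        # Generically the top M eigenvalues exceed sigma2, so the sqrt is real.
        W = U[:, :M] @ np.diag(np.sqrt(lam[:M] - sigma2))
        return mu, W, sigma2

    # Example: 3-dimensional latent structure embedded in 10 dimensions.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 10)) \
        + 0.1 * rng.standard_normal((500, 10))
    mu_ml, W_ml, sigma2_ml = ppca_ml(X, M=3)
    ```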

  • Reaction rate inference

    Consider the toy example set of reactions \begin{align*}S_0 &\xrightarrow{k_1} S_1 + S_2\\S_2 &\xrightarrow{k_2} S_3 + S_4\\S_1 + S_3 &\xrightarrow{k_3} S_5\end{align*} We have (noisy) data on the concentrations of the species as a function of time. We want to infer the rates $k_1$ to $k_3$. Let’s write the derivatives: \begin{align*}\dot S_0 &= -k_1 S_0\\\dot S_1 &= k_1 S_0 - k_3 S_1…
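
    A minimal simulate-and-fit sketch, not necessarily how the post does the inference: assuming mass-action kinetics, I complete the truncated derivative list as $\dot S_2 = k_1 S_0 - k_2 S_2$, $\dot S_3 = k_2 S_2 - k_3 S_1 S_3$, $\dot S_4 = k_2 S_2$, $\dot S_5 = k_3 S_1 S_3$ (my assumption, read off the reactions), and fit the rates to noisy trajectories by nonlinear least squares:

    ```python
    import numpy as np
    from scipy.integrate import solve_ivp
    from scipy.optimize import least_squares

    def rhs(t, s, k1, k2, k3):
        # Mass-action rates for the three reactions (assumed completion
        # of the truncated derivative list above).
        S0, S1, S2, S3, S4, S5 = s
        r1 = k1 * S0         # S0 -> S1 + S2
        r2 = k2 * S2         # S2 -> S3 + S4
        r3 = k3 * S1 * S3    # S1 + S3 -> S5
        return [-r1, r1 - r3, r1 - r2, r2 - r3, r2, r3]

    def simulate(k, t_eval, s0):
        sol = solve_ivp(rhs, (t_eval[0], t_eval[-1]), s0, args=tuple(k),
                        t_eval=t_eval, rtol=1e-8)
        return sol.y.T

    # Synthetic noisy concentration data from illustrative "true" rates.
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 10.0, 50)
    s0 = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    k_true = [0.8, 0.5, 1.2]
    data = simulate(k_true, t, s0) + 0.01 * rng.standard_normal((len(t), 6))

    # Infer the rates by matching simulated to observed trajectories.
    fit = least_squares(lambda k: (simulate(k, t, s0) - data).ravel(),
                        x0=[1.0, 1.0, 1.0], bounds=(0.0, np.inf))
    print(fit.x)  # close to k_true
    ```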

  • Natural parameterization of the Gaussian distribution

    The Gaussian distribution in the usual parameters: The Gaussian distribution in one dimension is often parameterized using the mean $\mu$ and the variance $\sigma^2$, in terms of which $$ p(x|\mu, \sigma^2) = {1 \over \sqrt{2\pi \sigma^2}} \exp\left(-{(x - \mu)^2 \over 2 \sigma^2} \right).$$ The Gaussian distribution is in the exponential family. For distributions in this…
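
    Presumably the post then completes the square in the exponent to reach the standard exponential-family form; for reference (this is the textbook result, included here only as a pointer): $$ p(x|\eta) = h(x)\, g(\eta) \exp\left(\eta^T u(x)\right), \qquad u(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix}, \qquad \eta = \begin{pmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{pmatrix},$$ with $h(x) = (2\pi)^{-1/2}$ and $g(\eta) = (-2\eta_2)^{1/2} \exp\left(\eta_1^2 / (4\eta_2)\right)$.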

  • The inference model when missing observations

    The inference model isn’t giving good performance. But is this because we’re missing data? In the inference model, the recorded output activity is related to the input according to $$ (\sigma^2 \II + \AA \AA^T) \bLa = \YY,$$ where we’ve absorbed $\gamma$ into $\AA$. We can model this as $N$ observations of $\yy$ given $\bla$, where $$…

  • A noob’s-eye view of reinforcement learning

    I recently completed the Coursera Reinforcement Learning Specialization. These are my notes, still under construction, on some of what I learned. The course was based on Sutton and Barto’s freely available reinforcement learning book, so images will be from there unless otherwise stated. All errors are mine, so please let me know about any in…

  • Notes on the evidence approximation

    These notes closely follow section 3.5 of Bishop on the Evidence Approximation, much of which is based on this paper on Bayesian interpolation by David MacKay, to which I refer below. Motivation: We have some dataset of inputs $\XX = \{\xx_1, \dots, \xx_N\}$ and corresponding outputs $\tt = \{t_1, \dots, t_N\}$, and we’re…

  • The equivalent kernel for non-zero prior mean

    This note is a brief addendum to Section 3.3 of Bishop on Bayesian Linear Regression. Some of the derivations in that section assume, for simplicity, that the prior mean on the weights is zero. Here we’ll relax this assumption and see what happens to the equivalent kernel. Background: The setting in that section is that,…

  • RL produces more brain-like representations for motor learning than supervised learning

    These are my rapidly scribbled notes on Codol et al.’s “Brain-like neural dynamics for behavioral control develop through reinforcement learning” (and likely contain errors). What learning algorithm does the baby’s brain use to learn motor tasks? We have at least two candidates: supervised learning (SL), which measures and minimizes discrepancies between desired and actual states…

  • Notes on the Geometry of Least Squares

    In this post I expand on the details of section 3.1.2 in Pattern Recognition and Machine Learning. We found that maximum likelihood estimation requires minimizing $$E(\mathbf w) = {1 \over 2} \sum_{n=1}^N (t_n - \ww^T \bphi(\xx_n))^2.$$ Here the vector $\bphi(\xx_n)$ contains each of our features evaluated on the single input datapoint $\xx_n$, $$\bphi(\xx_n) = [\phi_0(\xx_n),…
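
    A minimal numerical sketch of the geometric picture (the polynomial basis and variable names here are my own, not from the post): minimizing $E(\ww)$ gives the normal equations, and the fitted vector is the orthogonal projection of the targets onto the column space of the design matrix, so the residual is orthogonal to every feature column.

    ```python
    import numpy as np

    # Toy data and a toy polynomial basis phi_j(x) = x^j (illustrative choice).
    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, size=20)
    t = np.sin(np.pi * x) + 0.1 * rng.standard_normal(20)
    Phi = np.column_stack([x**j for j in range(4)])   # design matrix, rows phi(x_n)

    # Normal equations: Phi^T Phi w = Phi^T t.
    w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

    # Geometric reading: y = Phi w_ml is the projection of t onto span(Phi),
    # hence the residual t - y is orthogonal to the columns of Phi.
    y = Phi @ w_ml
    print(np.allclose(Phi.T @ (t - y), 0.0, atol=1e-10))  # True
    ```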

  • Inference by decorrelation

    We frequently observe decorrelation in projection neuron responses. This has often been linked to either redundancy reduction, or pattern separation. Can we make an explicit link to inference? A simple case to consider is $\ell_2$ regularized MAP inference, where $$ -\log p(x|y) = L(x,y) = {1 \over 2\sigma^2} \|y - A x\|_2^2 + {\gamma \over…
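
    For concreteness, a small sketch of the MAP estimate in this setting. I’m assuming the truncated regularizer above is ${\gamma \over 2}\|x\|_2^2$ (an assumption; the post may write it differently), in which case minimizing $L$ gives the closed form $\hat x = (A^T A + \gamma \sigma^2 I)^{-1} A^T y$:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((30, 10))
    x_true = rng.standard_normal(10)
    sigma, gamma = 0.1, 0.5
    y = A @ x_true + sigma * rng.standard_normal(30)

    # Minimizer of (1/(2 sigma^2)) ||y - A x||^2 + (gamma/2) ||x||^2,
    # i.e. the l2-regularized (ridge) MAP estimate under the assumed prior.
    x_map = np.linalg.solve(A.T @ A + gamma * sigma**2 * np.eye(10), A.T @ y)
    print(np.linalg.norm(x_map - x_true))  # small for this well-conditioned toy problem
    ```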

  • Background on Koopman Theory

    These notes fill in some of the details of Section 2.1 of Kamb et al.’s “Time-Delay Observables for Koopman: Theory and Applications”. They were made by relying heavily on finite-dimensional intuition (operators as infinite-dimensional matrices), and by talking with ChatGPT, so likely contain errors. We are interested in understanding the time evolution of a dynamical…

  • Notes on the Recognition-Parameterized Model

    Recently, William Walker and colleagues proposed the Recognition Parameterized Model (RPM) to perform unsupervised learning of the causes behind observations, but without the need to reconstruct those observations. This post summarizes my (incomplete) understanding of the model. One popular approach to unsupervised learning is autoencoding, where we learn a low-dimensional representation of our data that…

  • Between geometry and topology

    At one of the journal clubs I recently attended, we discussed “The Topology and Geometry of Neural Representations”. The motivation for the paper is that procedures like RSA, which capture the overlap of population representations of different stimuli, can be overly sensitive to some geometrical features of the representation the brain might not care about.…

  • Notes on Atick and Redlich 1993

    In their 1993 paper Atick and Redlich consider the problem of learning receptive fields that optimize information transmission. They consider a linear transformation of a vector of retinal inputs $s$ to ganglion cell outputs of the same dimension $$y = Ks.$$ They aim to find a biologically plausible learning rule that will use the input…

  • Changing regularization, II

    Today I went back to trying to understand the solution when using the original regularization. While doing so it occurred to me that if I use a slightly different regularization, I can get a closed-form solution for the feedforward connectivity $Z$, and without most (though not all) of the problems I was having in my…

  • Notes on Multiresolution Matrix Factorization

    These are my notes from early January on Kondor et al.’s Multiresolution Matrix Factorization from 2014. This was a conference paper and the exposition was a bit terse in places, so below I try to fill in some of the details I thought were either missing or confusing. Motivating MMF We will be interested in…

  • Changing regularization

    This morning it occurred to me that the problems we’re having with our equation \begin{align}S^2 Z^2 S^2 - S C S = \lambda (Z^{-1} - I)\label{main}\tag{1}\end{align} are due to the regularizer we use, $\|Z - I\|_F^2$. With this regularizer, the default behavior of the feedforward connections is to pass the input directly to the output. But it’s…

  • Wrangling quartics, V

    Yesterday I went to discuss the problem with one of my colleagues. He had the interesting idea of modelling $S$, and especially $S^2$, as low rank, in particular as $S = s_1 e_1 e_1^T$. That is, shifting the focus from $Z$ to $S$. I tried this out today, and although it didn’t quite pan out,…