Tag: notes

  • Converting joint distributions to Bayesian networks

    In these notes we discuss how to convert a joint distribution into a graph called a Bayesian network, and how the structure of the graph suggests ways to reduce the number of parameters required to specify the joint.
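
    For reference, the construction rests on the chain rule, which factorizes any joint as $$ p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}), $$ and the conditional independencies encoded by the graph let each factor condition only on a node's parents, $p(x_i \mid \mathrm{pa}(x_i))$. With binary variables, for example, a full table needs $2^n - 1$ numbers while each pruned factor needs only $2^{|\mathrm{pa}(x_i)|}$. (This is my one-line summary of the standard result, not a quote from the notes.)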

  • Jensen-Shannon Divergence

    I discuss how the Jensen-Shannon divergence is a smoothed symmetrization of the KL divergence comparing two distributions, and connect it to the performance of their optimal binary discriminator.
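
    For reference, the quantity in question is $$ \mathrm{JSD}(P \| Q) = \tfrac{1}{2}\,\mathrm{KL}(P \| M) + \tfrac{1}{2}\,\mathrm{KL}(Q \| M), \qquad M = \tfrac{1}{2}(P + Q), $$ and one way to state the discriminator connection is that the Bayes-optimal classifier between equally weighted samples from $P$ and $Q$ achieves expected log loss $\log 2 - \mathrm{JSD}(P \| Q)$. (My summary of the standard facts; the note's notation may differ.)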

  • Notes on Kernel PCA

    Following Bishop, I show how to express the eigenvectors of the covariance of the feature projections in terms of the eigenvectors of the kernel matrix, and how to compute the kernel matrix of centered features from the uncentered one.
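
    For context, the centering identity has the familiar form $$ \tilde{K} = K - \mathbf{1}_N K - K \mathbf{1}_N + \mathbf{1}_N K \mathbf{1}_N, $$ where $K$ is the Gram matrix of the uncentered features and $\mathbf{1}_N$ denotes the $N \times N$ matrix with every entry equal to $1/N$. (Stated here from Bishop for orientation; the note derives it.)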

  • EM for Factor Analysis

    In this note I work out the EM updates for factor analysis, following the presentation in PRML 12.2.4.
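
    As a compressed reminder of where that derivation lands (stated from my memory of PRML, so details may differ slightly from the note): with $\mathbf{G} = (\mathbf{I} + \mathbf{W}^{\mathsf T} \boldsymbol{\Psi}^{-1} \mathbf{W})^{-1}$, the E-step sets $\mathbb{E}[\mathbf{z}_n] = \mathbf{G} \mathbf{W}^{\mathsf T} \boldsymbol{\Psi}^{-1} (\mathbf{x}_n - \bar{\mathbf{x}})$ and $\mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\mathsf T}] = \mathbf{G} + \mathbb{E}[\mathbf{z}_n]\,\mathbb{E}[\mathbf{z}_n]^{\mathsf T}$, and the M-step updates $$ \mathbf{W}^{\text{new}} = \Big[ \sum_n (\mathbf{x}_n - \bar{\mathbf{x}})\, \mathbb{E}[\mathbf{z}_n]^{\mathsf T} \Big] \Big[ \sum_n \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\mathsf T}] \Big]^{-1}, \qquad \boldsymbol{\Psi}^{\text{new}} = \mathrm{diag}\Big\{ \mathbf{S} - \mathbf{W}^{\text{new}} \tfrac{1}{N} \sum_n \mathbb{E}[\mathbf{z}_n] (\mathbf{x}_n - \bar{\mathbf{x}})^{\mathsf T} \Big\}, $$ where $\mathbf{S}$ is the sample covariance of the data.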

  • Understanding Expectation Maximization as Coordinate Ascent

    In these notes we follow Neal and Hinton (1998) and show how to view EM as coordinate ascent on the negative variational free energy.
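
    For orientation, the central quantity is $$ \mathcal{F}(q, \theta) = \mathbb{E}_{q(\mathbf{z})}\big[ \log p(\mathbf{x}, \mathbf{z} \mid \theta) \big] + H[q], $$ which satisfies $\log p(\mathbf{x} \mid \theta) = \mathcal{F}(q, \theta) + \mathrm{KL}\big(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x}, \theta)\big)$; the E-step maximizes $\mathcal{F}$ over $q$ (setting $q = p(\mathbf{z} \mid \mathbf{x}, \theta)$) and the M-step maximizes it over $\theta$, which is the coordinate-ascent view. (My summary; the notes spell out the argument.)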

  • Maximum likelihood PCA

    These are my derivations of the maximum likelihood estimates of the parameters of probabilistic PCA as described in section 12.2.1 of Bishop, with some hints from Tipping and Bishop (1999).
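
    The headline results, for context (as stated in Bishop; the note fills in the algebra): $$ \mathbf{W}_{\mathrm{ML}} = \mathbf{U}_M (\mathbf{L}_M - \sigma^2 \mathbf{I})^{1/2} \mathbf{R}, \qquad \sigma^2_{\mathrm{ML}} = \frac{1}{D - M} \sum_{i = M+1}^{D} \lambda_i, $$ where $\mathbf{U}_M$ holds the top $M$ eigenvectors of the data covariance, $\mathbf{L}_M$ is the diagonal matrix of the corresponding eigenvalues $\lambda_i$, and $\mathbf{R}$ is an arbitrary orthogonal matrix.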

  • Natural parameterization of the Gaussian distribution

    The Gaussian distribution in the usual parameters: The Gaussian distribution in one dimension is often parameterized using the mean $\mu$ and the variance $\sigma^2$, in terms of which $$ p(x|\mu, \sigma^2) = {1 \over \sqrt{2\pi \sigma^2}} \exp\left(-{(x - \mu)^2 \over 2 \sigma^2} \right).$$ The Gaussian distribution is in the exponential family. For distributions in this…
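
    For context, the destination of that calculation is the exponential-family form $$ p(x \mid \boldsymbol{\eta}) = h(x)\, g(\boldsymbol{\eta}) \exp\big( \boldsymbol{\eta}^{\mathsf T} \mathbf{u}(x) \big), \qquad \mathbf{u}(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix}, \quad \boldsymbol{\eta} = \begin{pmatrix} \mu / \sigma^2 \\ -1/(2\sigma^2) \end{pmatrix}, $$ so the natural parameters are $\mu/\sigma^2$ and $-1/(2\sigma^2)$. (A summary of the standard result, not a quote from the post.)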

  • A noob’s-eye view of reinforcement learning

    I recently completed the Coursera Reinforcement Learning Specialization. These are my notes, still under construction, on some of what I learned. The course was based on Sutton and Barto’s freely available reinforcement learning book, so images will be from there unless otherwise stated. All errors are mine, so please let me know about any in…

  • Notes on the evidence approximation

    These notes closely follow section 3.5 of Bishop on the Evidence Approximation, much of which is based on this paper on Bayesian interpolation by David MacKay, to which I refer below. Motivation: We have some dataset of inputs $\XX = \{\xx_1, \dots, \xx_N\}$ and corresponding outputs $\tt = \{t_1, \dots, t_N\}$, and we’re…
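
    For orientation, the approximation leads to the familiar fixed-point updates (in Bishop's notation, where $\mathbf{m}_N$ is the posterior mean of the weights and $\lambda_i$ are the eigenvalues of $\beta \boldsymbol{\Phi}^{\mathsf T} \boldsymbol{\Phi}$): $$ \gamma = \sum_i \frac{\lambda_i}{\alpha + \lambda_i}, \qquad \alpha \leftarrow \frac{\gamma}{\mathbf{m}_N^{\mathsf T} \mathbf{m}_N}, \qquad \frac{1}{\beta} \leftarrow \frac{1}{N - \gamma} \sum_{n=1}^{N} \big( t_n - \mathbf{m}_N^{\mathsf T} \bphi(\xx_n) \big)^2. $$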

  • The equivalent kernel for non-zero prior mean

    This note is a brief addendum to Section 3.3 of Bishop on Bayesian Linear Regression. Some of the derivations in that section assume, for simplicity, that the prior mean on the weights is zero. Here we’ll relax this assumption and see what happens to the equivalent kernel. Background: The setting in that section is that,…
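
    As a reminder of the zero-mean case being generalized: with posterior weight covariance $\mathbf{S}_N$, the predictive mean can be written as a weighted sum of the training targets, $$ y(\xx, \mathbf{m}_N) = \sum_{n=1}^{N} k(\xx, \xx_n)\, t_n, \qquad k(\xx, \xx') = \beta\, \bphi(\xx)^{\mathsf T} \mathbf{S}_N\, \bphi(\xx'), $$ where $k$ is the equivalent kernel. (Standard statement for context; the note examines how this changes with a non-zero prior mean.)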

  • Notes on the Geometry of Least Squares

    In this post I expand on the details of section 3.1.2 in Pattern Recognition and Machine Learning. We found that maximum likelihood estimation requires minimizing $$E(\mathbf w) = {1 \over 2} \sum_{n=1}^N (t_n - \ww^T \bphi(\xx_n))^2.$$ Here the vector $\bphi(\xx_n)$ contains each of our features evaluated on the single input datapoint $\xx_n$, $$\bphi(\xx_n) = [\phi_0(\xx_n),…
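
    For reference, stacking the feature vectors $\bphi(\xx_n)^{\mathsf T}$ as the rows of the design matrix $\boldsymbol{\Phi}$, the minimizer is $$ \ww_{\mathrm{ML}} = (\boldsymbol{\Phi}^{\mathsf T} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^{\mathsf T} \tt, $$ and the fitted values $\boldsymbol{\Phi} \ww_{\mathrm{ML}}$ are the orthogonal projection of the target vector $\tt$ onto the column space of $\boldsymbol{\Phi}$, which is the geometric picture the post unpacks. (My summary for context.)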

  • Notes on Multiresolution Matrix Factorization

    These are my notes from early January on Kondor et al.’s Multiresolution Matrix Factorization from 2014. This was a conference paper and the exposition was a bit terse in places, so below I try to fill in some of the details I thought were either missing or confusing. Motivating MMF: We will be interested in…

  • How many neurons or trials to recover signal geometry?

    This is my transcription of notes on a VVTNS talk by Itamar Landau about recovering the geometry of high-dimensional neural signals corrupted by noise. Caveat emptor: These notes are based on what I remember or hastily wrote down during the presentation, so they likely contain errors and omissions. Motivation: The broad question is then: Under what…

  • Decomposing connectivity

    While working on optimizing connectivity for whitening (see below) I remembered that it can be useful to decompose connectivity matrices relating neurons into components relating pseudo-neurons. In this post, I’ll show how this can be done, and highlight its application to the whitening problem. I will assume that our $N \times N$ connectivity matrix $W$…
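
    One concrete decomposition of this flavor, purely as an illustration (the post may use a different construction), is the singular value decomposition $$ W = \sum_{k} s_k\, \mathbf{u}_k \mathbf{v}_k^{\mathsf T}, $$ in which each rank-one term reads activity out along an input pattern $\mathbf{v}_k$, scales it by $s_k$, and writes it back in along an output pattern $\mathbf{u}_k$, so the $\mathbf{u}_k$ and $\mathbf{v}_k$ play the role of pseudo-neurons.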