{"id":723,"date":"2024-04-01T21:31:42","date_gmt":"2024-04-01T20:31:42","guid":{"rendered":"https:\/\/sinatootoonian.com\/?p=723"},"modified":"2025-12-27T16:05:07","modified_gmt":"2025-12-27T16:05:07","slug":"notes-on-multiresolution-matrix-factorization","status":"publish","type":"post","link":"https:\/\/sinatootoonian.com\/index.php\/2024\/04\/01\/notes-on-multiresolution-matrix-factorization\/","title":{"rendered":"Notes on Multiresolution Matrix Factorization"},"content":{"rendered":"\n<p>These are my notes from early January on Kondor et al.&#8217;s <a href=\"https:\/\/proceedings.mlr.press\/v32\/kondor14.html\">Multiresolution Matrix Factorization<\/a> from 2014. This was a conference paper and the exposition was a bit terse in places, so below I try to fill in some of the details I thought were either missing or confusing. <\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Motivating MMF<\/h3>\n\n\n\n<p>We will be interested in factorizing $n \\times n$ symmetric matrices $A$. MMF can be motivated through PCA, which gives the eigendecomposition $A = U^T D U,$ where $U$ is orthonormal and $D$ is diagonal. We can think of this as a compression if $U$ is known in advance. We can then encode $A$ (and other such matrices) by their projections onto $U$. For $A$, this is simply $$ U A U^T  = D.$$ But we can imagine dealing with families of matrices and storing only their projections onto a fixed $U$. If $U$ is adapted to the family then these projections will be sparse, just like the diagonal matrix $D$ was (at most $n$ of the $n^2$ elements non-zero), and we will have achieved compression. <\/p>\n\n\n\n<p>The problem with PCA is that the basis elements $U$ are non-local. That is, a given basis element will have lots of non-zero entries, implying that the corresponding projection requires information from many of the $n$ locations. This can make the projections hard to interpret.<\/p>\n\n\n\n<p>MMF takes ideas from multiresolution analysis of functions on the real line and applies them to the discrete setting of $n$-dimensional vectors. It performs the decomposition of $A$ sequentially using a series of sparse orthonormal matrices. So our decomposition of $A$ will look like <br>$$ U_L U_{L-1} \\dots U_{1} A U_1^T \\dots U_{L-1}^T U_{L} = H,$$ where $H$ will also be diagonal except possibly on some central block. That is, we decompose the global rotation $U$ that a procedure like PCA might produce into a product of a sequence of local rotations $U_1, U_2, \\dots U_L$.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The MMF Objective<\/h3>\n\n\n\n<p>To find the decomposition above we solve the optimization problem \\begin{align} \\min_{     \\substack{[n]\\supseteq S_1\\dots S_L \\\\ U_1,\\dots,U_L \\in O \\\\ H \\in H^n_{S_L}}} \\|U_L \\dots U_1 A U_1^T \\dots U_L^T &#8211; H  \\|_F^2. \\label{opt}\\tag{1}\\end{align}<\/p>\n\n\n\n<p>Here the $[n]=S_0 \\supseteq S_1 \\dots S_{L-1} \\supseteq S_L$ are nested subsets of the integers $1 \\dots n$ that indicate the coordinates that can still be involved in rotations. All remaining coordinates are held fixed. $H_{S_L}^n$ is the set of matrices that are non-zero on the $S_L \\times S_L$ block and also on the diagonal. <\/p>\n\n\n\n<p>The first step is to eliminate the last constraint in the minimization ($H \\in H_{S_L}^n$) by moving it to the norm itself. 
The first step is to eliminate the last constraint in the minimization ($H \in H_{S_L}^n$) by moving it into the norm itself. We do this by defining a 'residual' norm that only counts elements outside of the block and the diagonal: \begin{align} \|X\|^2_\text{resi} = \sum_{\substack{(i,j) \not\in S_L \times S_L, \\ i \neq j}} X_{ij}^2. \label{resi}\tag{2}\end{align} So our problem becomes \begin{align} \min_{\substack{[n] \supseteq S_1 \supseteq \dots \supseteq S_L \\ U_1,\dots,U_L \in O(n)}} \|U_L \dots U_1 A U_1^T \dots U_L^T\|_\text{resi}^2. \label{resi2}\tag{3}\end{align} In other words, we're looking for a sequence of rotations that diagonalizes $A$, except possibly on the $S_L \times S_L$ block.
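As a concrete reading of \eqref{resi}, here is a small numpy sketch (again mine, not the paper's) that computes the residual norm by masking out the entries the norm ignores:

```python
import numpy as np

def resi_norm_sq(X, S_L):
    """Residual norm of eq. (2): sum of squares of the entries outside
    the S_L x S_L block and off the diagonal."""
    n = X.shape[0]
    mask = np.ones((n, n), dtype=bool)
    mask[np.ix_(S_L, S_L)] = False   # ignore the S_L x S_L block
    np.fill_diagonal(mask, False)    # ignore the diagonal
    return np.sum(X[mask] ** 2)

# For a diagonal matrix the residual is zero, whatever S_L is.
assert resi_norm_sq(np.diag([1.0, 2.0, 3.0]), [0, 1]) == 0.0
```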
### Multiresolution analysis of $\mathbb{R}^n$

We now need to be careful about what the $U_i$ are. We first express $A$ as $$ A = \sum_i \lambda_i \mathbf{x}_i \mathbf{x}_i^T. $$ Each of the $\mathbf{x}_i \in \mathbb{R}^n = V_0$. The idea is now to split the base space $V_0$ into a direct sum of a 'smooth' subspace $V_1$ and a 'detail' subspace $W_1$: $$ V_0 = V_1 \oplus W_1. $$

Let a basis for $V_1$ be the rows of $\Phi_1$ and a basis for $W_1$ the rows of $\Psi_1$. We can express these basis elements in terms of the basis $\Phi_0$ for $V_0$ using the matrix $U_1$: $$ \begin{bmatrix} \Phi_1 \\ \Psi_1 \end{bmatrix} = U_1 \Phi_0. \label{basis0}\tag{4} $$ Since $\dim(\Phi_1) = \dim(V_1)$ and $\dim(\Psi_1) = \dim(W_1)$ and $\dim(V_1) + \dim(W_1) = \dim(V_0)$, $U_1$ has dimensions $\dim(V_0) \times \dim(V_0) = n \times n$. We additionally require that $U_1$ is orthonormal, i.e. $U_1^T U_1 = I$.

We can now express $A$ in the $\Phi_1 \cup \Psi_1$ basis by first using \eqref{basis0} to express $\Phi_0$ in this basis: \begin{align} U_1^T \begin{bmatrix} \Phi_1 \\ \Psi_1 \end{bmatrix} = \Phi_0 \implies \Phi_0^T = \begin{bmatrix} \Phi_1^T \,|\, \Psi_1^T \end{bmatrix} U_1. \end{align} Then $$ \mathbf{x}_i = \Phi_0^T \mathbf{x}_i = \begin{bmatrix} \Phi_1^T \,|\, \Psi_1^T \end{bmatrix} U_1 \mathbf{x}_i. $$ The first $\dim(V_1)$ rows of $U_1$, whose projections onto $\mathbf{x}_i$ are read out through $\Phi_1^T$, are then the scaling functions, and the remaining rows are the wavelets.

So the coefficients expressed in the new basis are $U_1 \mathbf{x}_i$. Therefore,
$$ A_1 = \sum_i \lambda_i U_1 \mathbf{x}_i (U_1 \mathbf{x}_i)^T = U_1 \left( \sum_i \lambda_i \mathbf{x}_i \mathbf{x}_i^T \right) U_1^T = U_1 A U_1^T. $$

To continue with the multiresolution analysis, we split $V_1 = V_2 \oplus W_2$ and define $U_2$ as the matrix that expresses $\Phi_2$ and $\Psi_2$ in terms of $\Phi_1$: $$ \begin{bmatrix} \Phi_2 \\ \Psi_2 \end{bmatrix} = U_2 \Phi_1. $$ The dimensions of $U_2$ are $\dim(V_1) \times \dim(V_1)$, and as before we also require that $U_2$ is orthogonal, i.e. $U_2^T U_2 = I$. To make it compatible with $A_1$, we extend it with the identity matrix, so that it leaves the $W_1$ subspace unchanged: $$ U_2 \leftarrow U_2 \oplus I_{\dim(W_1)}. $$ Then, following the same reasoning as before, we can express $A$ in the $\Phi_2 \cup \Psi_2 \cup \Psi_1$ basis as $$ A_2 = U_2 A_1 U_2^T = U_2 U_1 A U_1^T U_2^T. $$ The first $\dim(V_2)$ rows of $U_2 U_1$ are the scaling functions, the next $\dim(W_2)$ rows are the wavelets from $W_2$, and the remaining rows are the wavelets from $W_1$. Continuing in this way for $L$ steps, we get the optimization problem \eqref{opt}.
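A quick numpy illustration (a sketch of my own, not the paper's code) of the extension step $U_2 \leftarrow U_2 \oplus I_{\dim(W_1)}$: embed the smaller rotation as the leading block of a larger identity and check that the result is still orthonormal.

```python
import numpy as np
from scipy.linalg import block_diag

def extend_with_identity(U_small, n):
    """Embed a rotation acting on the first dim(V_1) coordinates into an
    n x n matrix that fixes the remaining (W_1) coordinates."""
    return block_diag(U_small, np.eye(n - U_small.shape[0]))

# A 2x2 rotation extended to act on R^4.
theta = 0.3
U2_small = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
U2 = extend_with_identity(U2_small, 4)
assert np.allclose(U2 @ U2.T, np.eye(4))  # still orthonormal
```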
### Recursive decomposition

To solve our optimization problem \eqref{opt} we work with its residual-norm formulation \eqref{resi2} and expand it recursively. First, let's apply a recursive decomposition to the ordinary Frobenius norm. Since $S_0 = [n]$, we start with $$ \|H\|^2 = \|H_{S_0, S_0}\|^2. $$ This is the sum of squares of all the elements of $H$. The next set of indices, $S_1$, is a proper subset of $S_0$, so we can decompose our sum as
$$ \|H_{S_0, S_0}\|^2 = \|H_{S_1, S_1}\|^2 + \|H_{S_1, S_0/S_1}\|^2 + \|H_{S_0/S_1, S_1}\|^2 + \|H_{S_0/S_1, S_0/S_1}\|^2. $$ Letting $J_i = S_{i-1}/S_i$ denote the indices that $S_{i-1}$ contains but $S_i$ doesn't, we can write the last expression as $$ \|H_{S_0, S_0}\|^2 = \|H_{S_1, S_1}\|^2 + \|H_{S_1, J_1}\|^2 + \|H_{J_1, S_1}\|^2 + \|H_{J_1, J_1}\|^2. $$ We can pack all the terms that index $J_1$ into $\mathcal{E}_1$ and get $$ \|H_{S_0, S_0}\|^2 = \|H_{S_1, S_1}\|^2 + \mathcal{E}_1. $$

We can similarly decompose $\|H_{S_1, S_1}\|^2$ to get \begin{align} \|H_{S_0, S_0}\|^2 &= (\|H_{S_2, S_2}\|^2 + \|H_{S_2, J_2}\|^2 + \|H_{J_2, S_2}\|^2 + \|H_{J_2, J_2}\|^2) + \mathcal{E}_1 \\ &= \|H_{S_2, S_2}\|^2 + \mathcal{E}_2 + \mathcal{E}_1. \end{align} Continuing in this way, we get $$ \|H\|^2 = \|H_{S_0, S_0}\|^2 = \|H_{S_L, S_L}\|^2 + \sum_{\ell=1}^L \mathcal{E}_\ell. $$

Therefore, if we redefine the $\mathcal{E}_\ell$ to not include the diagonal terms, $$ \mathcal{E}_\ell = 2\|H_{J_\ell, S_\ell}\|_F^2 + \|H_{\substack{J_\ell, J_\ell \\ \text{off diag}}}\|^2, \label{Eell}\tag{5} $$ we arrive at $$ \|U_L \dots U_1 A U_1^T \dots U_L^T\|_\text{resi}^2 = \|H\|_\text{resi}^2 = \sum_{\ell=1}^L \mathcal{E}_\ell. $$

We'd now like to express \eqref{Eell} succinctly in terms of $A$ and the $U$'s. To do so, notice that $U_{\ell+1}$ is the identity on its lower $S_0/S_\ell$ block. So, with some abuse of notation, \begin{align} A_{\ell+1} = U_{\ell+1} A_\ell U_{\ell+1}^T &= \begin{bmatrix} U_{\ell+1} & 0 \\ 0 & I_{S_0/S_\ell} \end{bmatrix} \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} U_{\ell+1}^T & 0 \\ 0 & I_{S_0/S_\ell} \end{bmatrix} \\ &= \begin{bmatrix} U_{\ell+1} A_{11} & U_{\ell+1} A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} U_{\ell+1}^T & 0 \\ 0 & I_{S_0/S_\ell} \end{bmatrix} \\ &= \begin{bmatrix} U_{\ell+1} A_{11} U_{\ell+1}^T & U_{\ell+1} A_{12} \\ A_{21} U_{\ell+1}^T & A_{22} \end{bmatrix}. \end{align} Then $$ \|[A_\ell]_{J_\ell, S_\ell}\|_F^2 = \|[A_\ell]_{S_\ell, J_\ell}\|_F^2 = \|[A_{12}]_{:, J_\ell}\|_F^2, $$ where we've abused notation on the final set of indices and used the absolute indices $J_\ell$ instead of those relative to $A_{12}$. Next, $$ \|[A_{\ell+1}]_{J_\ell, S_\ell}\|_F^2 = \|[A_{\ell+1}]_{S_\ell, J_\ell}\|_F^2 = \|[U_{\ell+1} A_{12}]_{:, J_\ell}\|_F^2 = \|[A_{12}]_{:, J_\ell}\|_F^2 = \|[A_\ell]_{S_\ell, J_\ell}\|_F^2, $$ where the third equality holds because the Frobenius norm is invariant to rotation. But this will then hold for all $\ell + k$, because $U_{\ell+k}$ is a rotation on a subspace of that on which $U_{\ell+1}$ acts. Therefore $$ \|H_{J_\ell, S_\ell}\|_F^2 = \|[A_L]_{J_\ell, S_\ell}\|_F^2 = \|[A_\ell]_{J_\ell, S_\ell}\|_F^2. $$

The only term that remains in \eqref{Eell} is $\|H_{J_\ell, J_\ell}\|_F^2$ (without the diagonal terms), but these entries are part of $A_{22}$, which is left unchanged in $A_{\ell+1}$. Therefore, we can write $$ \mathcal{E}_\ell = 2\|[A_\ell]_{J_\ell, S_\ell}\|_F^2 + \|[A_\ell]_{\substack{J_\ell, J_\ell \\ \text{off diag}}}\|_F^2. $$

Notice that $S_\ell$ is potentially a large set, while the rotations might operate on small subsets of it. The rotation $U_\ell$ can operate on any subset of $S_{\ell-1}$. Let's suppose that:

- It operates on a subset $I$ of size $k = |I|$, so that $U_\ell = I_{n-k} \oplus_I O$ for some $O \in SO(k)$. Then $S_{\ell-1} = I \cup \overline{S_{\ell-1}}$, where $\overline{S_{\ell-1}}$ denotes the rest of $S_{\ell-1}$;
- $i_k \in I$ is the single index removed at this step, so $S_\ell = (I/\{i_k\}) \cup \overline{S_\ell}$, and $\overline{S_\ell} = \overline{S_{\ell-1}}$.

In that case, $J_\ell = S_{\ell-1}/S_\ell = \{i_k\}$, and we can write the above as \begin{align*} \mathcal{E}_\ell &= 2\|[A_\ell]_{i_k, S_\ell}\|_F^2 \\ &= 2\|[A_\ell]_{i_k, I/i_k}\|_F^2 + 2\|[A_\ell]_{i_k, \overline{S_\ell}}\|_F^2 \\ &= 2\sum_{i=1}^{k-1} [O A^{\ell-1}_{I,I} O^T]^2_{k,i} + 2[O A^{\ell-1}_{I,\overline{S_\ell}} (A^{\ell-1}_{I,\overline{S_\ell}})^T O^T]_{k,k}. \end{align*} We now have to minimize this over $O \in SO(k)$.
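To check the algebra, here is a quick numpy verification (a sketch under my own naming; it assumes for convenience that $I$ comprises the first $k$ coordinates, that $i_k$ is the last of them, and that $S_{\ell-1} = [n]$) that the direct and expanded expressions for $\mathcal{E}_\ell$ agree:

```python
import numpy as np
from scipy.stats import ortho_group

rng = np.random.default_rng(1)
n, k = 6, 3
X = rng.standard_normal((n, n))
A_prev = X + X.T                          # stands in for A_{l-1}
O = ortho_group.rvs(k, random_state=2)    # local rotation on I = {0, ..., k-1}
U = np.eye(n); U[:k, :k] = O              # U_l = O on I, identity elsewhere
A_next = U @ A_prev @ U.T                 # A_l

i_k = k - 1                               # index removed at this step
S_next = [i for i in range(n) if i != i_k]

# Direct form: E_l = 2 * || [A_l]_{i_k, S_l} ||^2 (J_l = {i_k} is a singleton).
E_direct = 2 * np.sum(A_next[i_k, S_next] ** 2)

# Expanded form: row-k terms of O A_{I,I} O^T plus the energy outside I.
B = O @ A_prev[:k, :k] @ O.T
C = O @ A_prev[:k, k:] @ A_prev[:k, k:].T @ O.T
E_expanded = 2 * np.sum(B[k - 1, :k - 1] ** 2) + 2 * C[k - 1, k - 1]
assert np.isclose(E_direct, E_expanded)
```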
**Remark:** It seems that the problem is a bit easier than this. We can write \begin{align*} \mathcal{E}_\ell &= -2[O A^{\ell-1}_{I,I} O^T]^2_{k,k} + 2\sum_{i=1}^{k} [O A^{\ell-1}_{I,I} O^T]^2_{k,i} + 2[O A^{\ell-1}_{I,\overline{S_\ell}} (A^{\ell-1}_{I,\overline{S_\ell}})^T O^T]_{k,k} \\ &= -2[O A^{\ell-1}_{I,I} O^T]^2_{k,k} + 2[O A^{\ell-1}_{I,I} A^{\ell-1,T}_{I,I} O^T]_{k,k} + 2[O A^{\ell-1}_{I,\overline{S_\ell}} (A^{\ell-1}_{I,\overline{S_\ell}})^T O^T]_{k,k}. \end{align*} But this only depends on the $k$'th row of $O$, $o_k$: \begin{align} \mathcal{E}_\ell &= -2[o_k A^{\ell-1}_{I,I} o_k^T]^2 + 2 o_k A^{\ell-1}_{I,I} A^{\ell-1,T}_{I,I} o_k^T + 2 o_k A^{\ell-1}_{I,\overline{S_\ell}} (A^{\ell-1}_{I,\overline{S_\ell}})^T o_k^T \\ &= -2[o_k A^{\ell-1}_{I,I} o_k^T]^2 + 2 o_k \left[A^{\ell-1}_{I,I} A^{\ell-1,T}_{I,I} + A^{\ell-1}_{I,\overline{S_\ell}} (A^{\ell-1}_{I,\overline{S_\ell}})^T\right] o_k^T \\ &= -2[o_k F o_k^T]^2 + 2 o_k G o_k^T. \end{align} We now want to minimize this subject to $\|o_k\| = 1$. Renaming $o_k$ to $x^T$, we get $$ \min_{x^T x = 1} -(x^T F x)^2 + x^T G x. $$ The Lagrangian is $$ L(x, \lambda) = -(x^T F x)^2 + x^T G x + \lambda(1 - x^T x). $$ The gradient is $$ L_x \propto -2(x^T F x) F x + G x - \lambda x. $$ So at the optimum, \begin{align*} x &= \lambda^{-1}(G x - 2(x^T F x) F x) \\ \lambda &= x^T G x - 2(x^T F x)^2. \end{align*}

This is an eigenvalue equation for $(G - 2(x^T F x) F)$, but since the coefficient scaling $F$ itself depends on $x$, we can't solve it straightforwardly that way. Instead, we can treat it as a fixed-point update and iterate. This is how I do it in the function [`solve`](https://github.com/stootoon/kondor-mmf/blob/823be7bbb4e6a39d3b11a88b337fb26e93cec4de/mmf.py#L17) in my implementation of MMF.
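A minimal sketch of that fixed-point iteration (a simplified version of the idea behind `solve`, not the implementation itself), normalizing the iterate at each step to stay on the unit sphere:

```python
import numpy as np

def solve_fixed_point(F, G, n_iter=200, seed=0):
    """Approximately minimize -(x^T F x)^2 + x^T G x over unit vectors x
    by iterating x <- normalize(G x - 2 (x^T F x) F x).
    A simplified sketch; not guaranteed to converge for every F, G."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(F.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(n_iter):
        y = G @ x - 2 * (x @ F @ x) * (F @ x)
        x = y / np.linalg.norm(y)
    return x
```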
## Code

I have written a Python implementation of the algorithm described above in [this repo](https://github.com/stootoon/kondor-mmf). The notebook contained there applies MMF to a matrix of odour-odour correlations, as shown below.

*[Figure: MMF on real data. **Top left:** Matrix of odour response correlations. **Top right:** MMF reconstruction. **Bottom left:** MMF compression. I ran MMF to $L = 40$, so the last 8 x 8 block remains undiagonalized. **Bottom right:** The rotation matrix to reconstruct $A$ from the compression. The rows correspond to features extracted from $A$. I've reordered the rows by the order in which indices were removed from the active set.]*

Notice how, in the bottom right panel, early features are 'high-frequency', consisting of large rotations of small numbers of coordinates. Later features tend to be smoother, integrating over more indices. Perhaps this is how the "multiresolution" aspect manifests: just as when analyzing functions on the real line, the subspaces later in the analysis capture smoother features.

$$\square$$