Transformers – sinatootoonian.com

We’ll follow the presentation in Chapter 12 of Bishop.

A transformer is so called because it transforms an input set of tokens into an output set of the same size: $$ \XX \to \wt \XX,$$ where $\XX$ and $\wt{\XX}$ each have $N$ rows, one for each $D$-dimensional input or output token.

The tokens are transformed using queries to access values by key.

The keys, queries, and values are each linear transformations of the input tokens.

The keys are $$ \KK = \XX \WW_K.$$

The queries are $$\QQ = \XX \WW_Q.$$

The values are $$\VV = \XX \WW_V.$$
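As a minimal sketch of these three projections, the NumPy snippet below builds $\KK$, $\QQ$ and $\VV$ from a random token matrix. The sizes `N` and `D`, the random weights, and the choice of square $D \times D$ weight matrices are assumptions made for illustration; in a real model the weights are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 8  # assumed sizes: 4 tokens, each 8-dimensional

X = rng.normal(size=(N, D))    # one token per row

# Assumed square weight matrices, filled with random values for illustration.
W_K = rng.normal(size=(D, D))
W_Q = rng.normal(size=(D, D))
W_V = rng.normal(size=(D, D))

K = X @ W_K  # keys
Q = X @ W_Q  # queries
V = X @ W_V  # values

print(K.shape, Q.shape, V.shape)  # (4, 8) (4, 8) (4, 8)
```

Because each projection is a right-multiplication, row $n$ of $\KK$, $\QQ$ and $\VV$ depends only on token $n$; tokens are not mixed until the attention step.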

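Each query is matched against every key using a dot product. The resulting scores are scaled by $\sqrt{D}$ and passed through a softmax applied to each row, and the resulting weights are used to form a weighted sum of the values: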

\begin{align*} \wt{\XX} &= \text{Softmax}\left({\QQ \KK^T \over \sqrt{D}}\right) \VV\\ &= \text{Softmax}\left({\XX \WW_Q \WW_K^T \XX^T\over \sqrt{D}}\right)\XX \WW_V.\end{align*}
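The formula above can be sketched in a few lines of NumPy. The function names `softmax_rows` and `attention_head`, the sizes, and the random square weights are assumptions for illustration, not part of Bishop's presentation. The softmax is applied to each row, so every output token is a convex combination of the value rows.

```python
import numpy as np

def softmax_rows(A):
    """Softmax applied independently to each row of A."""
    A = A - A.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    """Single-head attention: Softmax(Q K^T / sqrt(D)) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    D = X.shape[1]                        # token dimension, as in the formula
    weights = softmax_rows(Q @ K.T / np.sqrt(D))  # N x N attention weights
    return weights @ V, weights

rng = np.random.default_rng(0)
N, D = 4, 8                               # assumed sizes
X = rng.normal(size=(N, D))
W_Q, W_K, W_V = (rng.normal(size=(D, D)) for _ in range(3))

X_tilde, weights = attention_head(X, W_Q, W_K, W_V)
print(X_tilde.shape)  # (4, 8)
```

Row $n$ of `weights` holds the attention that token $n$ pays to every token, including itself, and sums to one.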

The output above corresponds to a single attention head. Multi-head attention runs several such heads, each with its own $\WW_Q$, $\WW_K$ and $\WW_V$, and concatenates their outputs $\HH_1, \HH_2, \dots$ along the feature dimension before applying a final linear map: $$ \YY = [\HH_1, \HH_2,\dots] \WW_o.$$
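A minimal sketch of the multi-head combination follows. It assumes each head projects to a smaller dimension `D_h = D / n_heads`, so that the concatenated heads have width $D$ and $\WW_o$ is $D \times D$. It also scales the scores by the square root of the key dimension, which reduces to $\sqrt{D}$ when `D_h == D`. These choices, along with the function names and random weights, are illustrative assumptions.

```python
import numpy as np

def softmax_rows(A):
    """Softmax applied independently to each row of A."""
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    """One head; scores are scaled by sqrt of the key dimension (an assumption)."""
    scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(W_K.shape[1])
    return softmax_rows(scores) @ (X @ W_V)

def multi_head_attention(X, heads, W_o):
    """Concatenate the head outputs column-wise, then mix them with W_o."""
    H = np.concatenate([attention_head(X, *w) for w in heads], axis=1)
    return H @ W_o

rng = np.random.default_rng(0)
N, D, n_heads = 4, 8, 2                  # assumed sizes
D_h = D // n_heads                       # assumed per-head dimension

X = rng.normal(size=(N, D))
# Each head gets its own (W_Q, W_K, W_V) triple of D x D_h matrices.
heads = [tuple(rng.normal(size=(D, D_h)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * D_h, D))

Y = multi_head_attention(X, heads, W_o)
print(Y.shape)  # (4, 8)
```

With this choice of `D_h`, the output $\YY$ has the same shape as the input $\XX$, so layers of this kind can be stacked.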