We’ll follow the presentation in Chapter 12 of Bishop.
A transformer is so called because it transforms an input set of tokens into an output set of the same size: $$ \XX \to \wt \XX,$$ where $\XX$ and $\wt{\XX}$ each have $N$ rows, one per $D$-dimensional token.
The tokens are transformed using queries to access values by key.
The queries, keys, and values are all linear transformations of the input tokens.
The keys are $$ \KK = \XX \WW_K.$$
The queries are $$\QQ = \XX \WW_Q.$$
The values are $$\VV = \XX \WW_V.$$
The queries are matched against the keys using dot products, and a softmax is applied to each row of the resulting $N \times N$ matrix. Each row then gives the weights used to average the values:
\begin{align*} \wt{\XX} &= \text{Softmax}\left({\QQ \KK^T \over \sqrt{D}}\right) \VV\\ &= \text{Softmax}\left({\XX \WW_Q \WW_K^T \XX^T\over \sqrt{D}}\right)\XX \WW_V.\end{align*}
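As a rough illustration of the formula above, here is a minimal NumPy sketch of a single attention head. The function names `softmax` and `attention_head`, the random inputs, and the choice of square $D \times D$ weight matrices are all assumptions made for this example, not something fixed by the text.

```python
import numpy as np

def softmax(a, axis=-1):
    # Subtract the row maximum before exponentiating, for numerical stability.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    """Single-head self-attention on an (N, D) token matrix X."""
    Q = X @ W_Q  # queries
    K = X @ W_K  # keys
    V = X @ W_V  # values
    # Scale by the square root of the key dimension, which equals D
    # when the weight matrices are D x D (the assumption made here).
    scale = np.sqrt(K.shape[1])
    A = softmax(Q @ K.T / scale, axis=-1)  # (N, N) attention weights
    return A @ V, A

# Hypothetical sizes and random data, purely for illustration.
rng = np.random.default_rng(0)
N, D = 5, 8
X = rng.normal(size=(N, D))
W_Q, W_K, W_V = (rng.normal(size=(D, D)) for _ in range(3))

X_tilde, A = attention_head(X, W_Q, W_K, W_V)
print(X_tilde.shape)  # one output token per input token
print(A.sum(axis=1))  # each row of attention weights sums to 1
```

Because the softmax is taken along each row, every output token is a convex combination of the value vectors.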
The output above corresponds to a single attention head. Multi-head attention runs several heads in parallel, each with its own query, key, and value weight matrices, and concatenates their outputs $\HH_1, \HH_2, \dots$ column-wise before applying a final linear map: $$ \YY = [\HH_1, \HH_2,\dots] \WW_o.$$
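The concatenation step can be sketched in NumPy as follows. This is only an illustration under assumptions the text does not fix: each head projects to a reduced dimension `D_h = D // n_heads`, the scores are scaled by $\sqrt{D_h}$, and the output matrix has shape `(n_heads * D_h, D)` so that the result has the same shape as the input. The names `multi_head_attention`, `heads`, and `W_o` are hypothetical.

```python
import numpy as np

def softmax(a, axis=-1):
    # Row-wise softmax with the usual max-subtraction for stability.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_Q, W_K, W_V) triples, one per attention head."""
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        # Scale by the per-head key dimension (an assumption of this sketch).
        A = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1)
        outputs.append(A @ V)  # H_i, shape (N, D_h)
    H_cat = np.concatenate(outputs, axis=1)  # [H_1, H_2, ...]
    return H_cat @ W_o  # Y, shape (N, D)

# Hypothetical sizes and random weights, purely for illustration.
rng = np.random.default_rng(0)
N, D, n_heads = 5, 8, 2
D_h = D // n_heads
X = rng.normal(size=(N, D))
heads = [
    tuple(rng.normal(size=(D, D_h)) for _ in range(3))
    for _ in range(n_heads)
]
W_o = rng.normal(size=(n_heads * D_h, D))

Y = multi_head_attention(X, heads, W_o)
print(Y.shape)  # same shape as X
```

Splitting $D$ across the heads keeps the concatenated width equal to $D$, but other head sizes would work as long as `W_o` has a matching number of rows.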