Why mean, median, mode?

Recently, while thinking about covariances, I started wondering why we define an even simpler statistic, the mean, the way we do. That’s what this post is about.

Suppose we have a dataset $X$ consisting of $N$ numbers $x_1 \dots x_N$. Their mean, $\overline x,$ is of course $$ \overline{x} = {1 \over N} \sum_{i=1}^N x_i.$$ This seems like an intuitive descriptor of our data, but how can we motivate it?

One way is to view our problem as summarizing our dataset $X$ with a single number. How should we choose this number, $\hat x$? We usually have some notion of what a good summary would be, and we can quantify that notion with a loss function $L(X, \hat x)$ that measures how well $\hat x$ summarizes $X$. Our summary of a given dataset is then whatever value of $\hat x$ minimizes this loss.

It’s reasonable to assume that for any such loss function, every datapoint will contribute equally, and the overall loss will be the average of the contributions from each datapoint. So the loss functions we’ll consider will look like $$ L(X, \hat x) = {1 \over N} \sum_{i=1}^N \ell(x_i, \hat x), $$ where $\ell(x_i, \hat x)$ quantifies how well $\hat x$ describes a single datapoint $x_i$. But then, how do we choose the function $\ell$?
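To make this recipe concrete, here is a minimal Python sketch (the helper name summarize and the toy dataset are my own inventions, not anything from a standard library): it evaluates the averaged loss over a grid of candidate summaries and returns the candidate with the smallest loss. The sections below amount to plugging different choices of $\ell$ into this one function.

```python
import numpy as np

def summarize(x, pointwise_loss, candidates):
    """Return the candidate x_hat that minimizes L(X, x_hat) = (1/N) * sum_i ell(x_i, x_hat)."""
    losses = [np.mean(pointwise_loss(x, x_hat)) for x_hat in candidates]
    return candidates[int(np.argmin(losses))]

# A made-up dataset (with one outlier at 10) and a dense grid of candidate summaries.
x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])
candidates = np.linspace(x.min(), x.max(), 10_001)

# e.g. summarize(x, lambda x, x_hat: (x_hat - x) ** 2, candidates) recovers the mean,
# as derived in the next few paragraphs.
```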

Since we want to measure the deviation of $\hat x$ from $x_i$, letting $\ell(x_i,\hat x) = \hat x - x_i$ seems a good start. This gives a positive loss whenever $\hat x$ is larger than $x_i$, which is helpful because decreasing the loss will shrink $\hat x$ towards $x_i$. However, it also assigns a negative loss to $\hat x$ whenever it’s smaller than $x_i$, and this isn’t helpful, because minimizing the loss will push $\hat x$ even farther below $x_i$, all the way to $-\infty$!

We want to treat positive and negative deviations from $x_i$ equally. Instead of using the raw deviations $\hat x - x_i$, we can try using the squared deviations $(\hat x - x_i)^2$. This assigns the same value to positive and negative deviations of the same magnitude. Using the squared deviations and averaging the contributions from all the datapoints we arrive at our overall loss $$L(X, \hat x) = {1 \over N} \sum_{i=1}^N (\hat x - x_i)^2.$$ To find our summary value we minimize the loss with respect to $\hat x$. The derivative of the loss with respect to $\hat x$ is $$ {d L \over d \hat x}= {2 \over N} \sum_i (\hat x - x_i),$$ and we find the summary by setting this derivative to zero.

Before we do so it’s worth considering the form of the derivative. We can see that each datapoint contributes a term proportional to its deviation from our summary $\hat x$. Our aim is to adjust $\hat x$ until all of these terms cancel out. This is just like a force balance in mechanics. We can think of each datapoint as pulling the summary towards it, and the summary ending up where all these forces cancel out.

In fact, we can make the analogy precise. Recalling that the force a spring with spring constant $k$ stretched a distance $x$ exerts is $kx$, we can think of each datapoint as being fixed in place on an axis and connected to our summary $\hat x$, which is free to move on the same axis, by a spring with constant $2/N$. Each datapoint exerts a force on our summary $\hat x$ proportional to how much its spring is stretched, and we’re looking for the location where all these forces cancel out. Note also that the loss function $L(X, \hat x)$ encodes the potential energy of this system, so we are, equivalently, looking for the location $\hat x$ that minimizes this energy.
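As a quick numerical sanity check of this picture (a sketch with a made-up dataset, not code from the post): placing the summary at the mean makes the spring forces $\tfrac{2}{N}(x_i - \hat x)$ cancel, and the loss plays the role of the system’s potential energy.

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])   # datapoints, fixed in place on the axis
N = len(x)
x_hat = x.mean()                            # place the free summary point at the mean

# Force each spring (constant 2/N) exerts on x_hat, pulling it towards its datapoint.
forces = (2.0 / N) * (x - x_hat)
print(forces.sum())                 # ~0 (up to floating point): the forces balance

# The loss is the total potential energy of the springs, minimized at the same point.
print(np.mean((x_hat - x) ** 2))
```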

Continuing with our differentiation, we have $${dL \over d\hat x} = {2 \over N} \left(\sum_i \hat x \right) - {2 \over N} \left(\sum_i x_i\right).$$ Since our estimate $\hat x$ doesn’t vary with the datapoint index $i$, we can take it outside its sum,
$$ {d L \over d \hat x}= {2 \over N} \hat x \left( \sum_i 1 \right) - {2 \over N} \sum_i x_i = {2 \over N} N \hat x - 2{\sum_i x_i \over N} =2 \hat x - 2 \overline{x}.$$ Setting this to zero, we have
$$ \hat x = \overline x.$$ So the mean is the summary we’d get if we penalized by the square of the deviations. Or, returning to our physical analogy, forces from the different springs cancel out when we place our summary at the mean.
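As a numerical check (only a sketch on a made-up dataset, using a brute-force grid search instead of the closed-form answer): the $\hat x$ minimizing the average squared deviation does indeed coincide with the mean.

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])
candidates = np.linspace(x.min(), x.max(), 10_001)

# L(X, x_hat) = (1/N) * sum_i (x_hat - x_i)^2, evaluated on the grid of candidates.
losses = [np.mean((x_hat - x) ** 2) for x_hat in candidates]
best = candidates[int(np.argmin(losses))]

print(best, x.mean())   # both ~3.6: the squared loss is minimized at the mean
```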

From mean to median

The squared loss penalizes positive and negative deviations equally, and as we saw above, is mathematically easy to work with while also having a direct physical analogy. However, it can be sensitive to outliers because deviations contribute according to their squared values, so large deviations are penalized disproportionately. Or, returning to our physical analogy of the spring system, distant datapoints pull much harder on the summary point than nearby ones.

To see this another way we rewrite the loss slightly as $$ L(X, \hat x) = {1 \over N} \sum_i (\hat x - x_i)^2 = {1 \over N} \sum_i \underbrace{|\hat x - x_i|}_{\text{deviation}}\underbrace{|\hat x - x_i|}_{\text{weight}}.$$ We now see that rather than weighting each deviation equally, the squared loss weights a deviation according to its magnitude. Therefore, larger deviations will dominate the overall loss, and the summary $\hat x$ will work harder to reduce those large deviations, rather than the smaller deviations from the other datapoints.

A natural solution is to weight each deviation equally, $$ L(X, \hat x) = {1 \over N} \sum_i \underbrace{|\hat x - x_i|}_{\text{deviation}}\cdot \underbrace{1}_{\text{weight}} = {1 \over N} \sum_i |\hat x - x_i|.$$ As before, we minimize this loss by setting the derivative to zero. Since the derivative of $|x|$ is $\text{sign}(x)$ (and ignoring the non-differentiability that occurs when $\hat x$ is exactly equal to a datapoint), we get $$ {d L \over d\hat x} = {1 \over N} \sum_i \text{sign}(\hat x - x_i).$$ Unless $\hat x$ matches one of the datapoints exactly (which will almost never occur), the $\text{sign}$ function will only yield the values -1 or 1. Therefore, every datapoint, regardless of its distance from the summary, pulls it with the same unit force. It’s then clear that the summary will rest wherever the number of datapoints pulling it in the two directions is the same. This occurs at any value greater than half of the datapoints, and less than the other half, in other words, at a median (which need not be unique when $N$ is even).
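Repeating the brute-force check with the absolute deviation (again only a sketch on the same made-up dataset) lands on the median rather than the mean, and is visibly less bothered by the outlier at 10:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])
candidates = np.linspace(x.min(), x.max(), 10_001)

# L(X, x_hat) = (1/N) * sum_i |x_hat - x_i|, evaluated on the grid of candidates.
losses = [np.mean(np.abs(x_hat - x)) for x_hat in candidates]
best = candidates[int(np.argmin(losses))]

print(best, np.median(x))   # both ~2.0: the absolute loss is minimized at the median
```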

The mode

Our previous measures of the quality $\ell(x_i, \hat x)$ of a summary $\hat x$ in describing a datapoint $x_i$ have been lax in that being close to the datapoint produced a small loss, very close to the zero loss of matching it exactly. Sometimes we need a stricter measure of quality that penalizes any deviation from the target datapoint. We then might use the so-called ‘0-1’ loss,$$ \ell(x_i, \hat x) = \begin{cases} 0 & \text{if } \hat x = x_i, \\ 1 & \text{otherwise.} \end{cases}$$ The reason behind the naming should be obvious: the loss is 0 only if the summary matches the target exactly, otherwise, it’s 1.

As before, the loss we incur in summarizing the whole dataset $X$ with a single value $\hat x$ is the average of the contributions from each datapoint, so $$ L(X, \hat x) = {1 \over N} \sum_i \ell(x_i, \hat x).$$ What summary of the data do we get when we minimize this loss? Following the recipe of computing the derivative and setting it to zero doesn’t help much because the derivative of the 0-1 loss is zero everywhere, except at zero itself (where $x_i = \hat x$), where it’s undefined! This reflects the fact that the 0-1 loss views all deviations from its central value as equally bad.

Instead, we proceed in cases. If we set $\hat x$ to a value that doesn’t match any value in the dataset, then every single datapoint will register a mismatch, and our overall loss will be 1:
$$ L(X, \hat x) = {1 \over N} \sum_i 1 = {1 \over N} N = 1 \quad \text{when} \quad \hat x \not\in X.$$

On the other hand, if we set $\hat x$ to a value that does occur in the dataset, then all instances of that value will register a match and contribute zero to the loss. The remaining datapoints will still contribute a 1 each. We can split up our datapoints into those that match $\hat x$ and those that don’t. We then have $$ L(X, \hat x) = {1 \over N} \sum_{i: x_i = \hat x} 0 + {1 \over N} \sum_{i: x_i \neq \hat x} 1 = {1 \over N}|\{i: x_i \neq \hat x\}|= 1 - {1 \over N}|\{i: x_i = \hat x\}| $$ where by $|\cdot|$ we mean the number of elements in the set. We see that the loss is reduced by $1/N$ for every datapoint that matches $\hat x$. Therefore, to minimize the loss we should set $\hat x$ to the value that occurs most often, i.e. the mode.
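Here is the corresponding check for the 0-1 loss (still just a sketch on a made-up dataset). Since any $\hat x$ outside the dataset incurs a loss of 1, only the values that actually occur in $X$ are worth trying:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])

def zero_one_loss(x, x_hat):
    """L(X, x_hat) = fraction of datapoints that differ from x_hat."""
    return np.mean(x != x_hat)

candidates = np.unique(x)           # only values occurring in X can beat a loss of 1
losses = [zero_one_loss(x, c) for c in candidates]
best = candidates[int(np.argmin(losses))]

print(best)   # 2.0, the most frequent value, i.e. the mode
```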

Summary

To conclude, we’ve found that the summary value we use to describe a dataset depends on how we score deviations from the summary. If we use the squared deviation, we get the mean. If we use the absolute deviation, we get the median. And if we’re extra stringent and use a 0-1 loss, we get the mode.

I still wonder why we naturally use the mean to summarize data. I expect we would use the mode when dealing with categorical data, e.g. imagine summarizing a bag of 10 apples and one orange. One reason we use the mean when summarizing numerical data might be that it seems to weight all datapoints equally, but as we saw above, it actually weights outliers most heavily, which is often not what we want. To truly weight all datapoints equally, we should use the median. Naively, the median seems harder to compute, as it requires some sorting of the data, so it might have $O(N \log N)$ complexity vs. the $O(N)$ complexity of the mean. However, quickselect can solve the problem in $O(N)$ time on average, the same order as computing the mean. But to do so, quickselect has to choose pivots and shuffle elements around, which seems a lot more work, at least for a human brain, than simply adding the latest item to a running total and dividing by the count at the end. So perhaps it’s ultimately a limitation of the human mind: we use a fast procedure that works in most situations and brave the occasional outlier.
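For what it’s worth, here is a small quickselect-style sketch of an average-case $O(N)$ median (in practice one would just call numpy.median or statistics.median); it’s only meant to illustrate the bookkeeping alluded to above.

```python
import random

def quickselect(values, k):
    """Return the k-th smallest element (0-indexed) in average O(N) time."""
    values = list(values)
    while True:
        pivot = random.choice(values)
        below = [v for v in values if v < pivot]
        equal = [v for v in values if v == pivot]
        above = [v for v in values if v > pivot]
        if k < len(below):
            values = below                  # the answer lies among the smaller values
        elif k < len(below) + len(equal):
            return pivot                    # the pivot is the k-th smallest
        else:
            k -= len(below) + len(equal)    # discard everything up to and including the pivot
            values = above

def median(x):
    """Median via quickselect; averages the two middle values when N is even."""
    n = len(x)
    if n % 2 == 1:
        return quickselect(x, n // 2)
    return 0.5 * (quickselect(x, n // 2 - 1) + quickselect(x, n // 2))

print(median([1.0, 2.0, 2.0, 3.0, 10.0]))   # 2.0
```

This version copies sublists rather than swapping elements in place, trading some memory for readability; the in-place pivot-and-swap variant achieves the same average $O(N)$ time without the copies. Either way, it’s clearly more bookkeeping than keeping a running sum and a count.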


Comments

BLS: It strikes me that we use the mean because most data is normally distributed in nature?

Sina: Yes, perhaps because many of the datasets we encounter naturally are approximately normally distributed and thereby don’t suffer from outliers, the mean is usually fine. But many datasets aren’t normally distributed, even though we may pretend they are – see e.g. Taleb’s various works on the problems this can cause. The median is the more robust measure, giving the same value as the mean when the data are normally distributed, while protecting against outliers when they’re not.
