This is a tale of statistics and probability, gaming, the pursuit of truth, and lockdown boredom.
In late March 2020, as lockdowns were being imposed by governments across the world and travel was no longer an option for many, gamers in their millions, from school-age children to millennials, were still able to fly away to the parallel universe that is the latest instalment of Nintendo’s Animal Crossing (full disclosure: this article is unfortunately not a paid partnership between Golden Square Tutors and Nintendo!).
If you – like yours truly – have never had a chance to play, the game is about turning a deserted island into a thriving village inhabited by cute humanoid animals. How the player does this, however, is left entirely up to them, and the lack of defined objectives and the focus on day-to-day activities and simple tasks (things like fishing, taking walks on the beach, picking flowers, visiting a shop), many of which had to be given up during lockdown, must have played an important part in the intergenerational success of the game.
One way to find new recruits to populate the village on your own island is to visit entrepreneurial raccoon dog Tom Nook’s shop and purchase a return ticket to an uncharted, randomly generated island (known as a Mystery Island). On each Mystery Island, the player encounters a non-playable character who, when asked, will agree to return with them to their village. Animal Crossing players started sharing their gaming experiences online, and soon enough they realised that the new villager they met on a Mystery Island could be any of the 391 available characters in the game. They then started wondering whether each character was equally likely. This is where statistics and probability finally come into play in our story.
Some of the most committed players were keeping a record of which characters they had encountered during their Mystery Island travels. At this point, if you are not an Animal Crossing player yourself, you might be pausing and wondering why they were doing something like that in the first place. My sister Chiara, who, unlike me, is an Animal Crossing veteran, and who first told me about all of this, put it like this: ‘If you are really into Animal Crossing, chances are that you’re a bit of a compulsive collector. A lot of the game dynamics involve collecting items, and keeping records of everything is a natural part of being a collector.’
In any case, as they started pooling their tidily kept records, they realised that some characters seemed to make an appearance a lot more often than others. The picture that emerged seemed to point away from the original assumption that each character was equally likely. However, not everyone was convinced, and they needed more than a hunch to say that this was the case.
Animal Crossing villagers go binomial
The problem was something like this: I am given a coin and I want to know whether it’s fair (which would mean that the probabilities of my coin landing on heads or tails are both 50% – or, as a mathematician would say, the probability of each of the two events equals 0.5). When I flip the coin once and get tails, you will agree that this isn’t enough for me to say anything about whether the coin is fair. If I then flip it 5 times and get only tails, I might start to get suspicious; once I have flipped my coin 30 times and got only tails, I will be very, very suspicious. However, it’s still not completely impossible for me to get tails 30 flips out of 30, even if the coin is fair: it’s just extremely unlikely. The sad truth of probability is that I can never be absolutely, completely certain that my coin is fair, no matter how many times I flip it: even a very unlikely result (like 30 tails out of 30) is never completely impossible. What I can say is that a result like getting tails 30 flips out of 30 is extremely unlikely under the assumption that my coin is a fair one (how unlikely? The probability is $0.5^{30} \approx 0.0000000009$, or less than one in a billion).
In general, the probability of getting a certain number of tails (or heads) is given by the binomial distribution. The probability of getting exactly $k$ times tail if I flip the coin $n$ times is given by:
$$P(K = k) = \binom{n}{k}p^k(1-p)^{n-k}$$
Where $\binom{n}{k}$ is the binomial coefficient ‘$n$ choose $k$’, which can be calculated as $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ (or read off Pascal’s triangle), and $n!$ (the factorial of $n$) is the product of all whole numbers up to $n$: $n! = n \times (n-1) \times \dots \times 2 \times 1$ (for instance $\binom{7}{3} = \frac{7!}{3!4!} = 35$). In the coin example $p$ (the probability of success) and $1-p$ (the probability of failure) are both equal to $0.5$, but in the die example (see the questions above) we had $p = \frac{1}{6}$ and $1 - p = \frac{5}{6}$.
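In case you want to play with the numbers yourself, here’s a minimal Python sketch of that formula (nothing beyond the standard library; `math.comb` computes the binomial coefficient):

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of getting exactly k successes in n trials, P(K = k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(comb(7, 3))                 # 35, the binomial coefficient from the example
print(binomial_pmf(30, 30, 0.5))  # ~9.3e-10: 30 tails out of 30 fair-coin flips
print(binomial_pmf(5, 30, 1/6))   # chance of exactly five 6s in 30 die rolls
```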
Before I flip my coin, I do not know what the number of tails $K$ is going to be; all I know are the probabilities of the different possible outcomes $k$: that’s why mathematicians would say that the number $K$ is a random variable.
You can see that if the random variable $K$ follows a binomial distribution we can write it as a sum of random variables $B_1, B_2, \dots, B_n$:
$$K = B_1 + B_2 + \dots + B_n$$
where each $B_m$ can either be $1$ (with probability $p$) or $0$ (with probability $1-p$), and represents the number of tails I get in flip number $m$ (in other words the $B$s follow the simplest possible binomial distribution, the one with $n = 1$, also known as the Bernoulli distribution). This fact might seem obvious now, but it will become important in a bit.
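You can see this sum-of-Bernoullis picture in action with a short simulation – a sketch using Python’s standard `random` module, where each `bernoulli(p)` call plays the role of one coin flip:

```python
import random

def bernoulli(p):
    """One flip: 1 (success) with probability p, 0 otherwise."""
    return 1 if random.random() < p else 0

def binomial_sample(n, p):
    """One binomial draw, built literally as a sum of n Bernoulli draws."""
    return sum(bernoulli(p) for _ in range(n))

# A few totals from flipping a fair coin 100 times:
print([binomial_sample(100, 0.5) for _ in range(5)])  # e.g. [48, 53, 51, 47, 55]
```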
If I flip a fair coin 100 times, I expect to get around 50 tails. The expectation (also known as the mean) $\mathrm{E}[K] = \sum_{i=0}^n k_i \times P(K = k_i)$ is the number of successes I expect to have, and for a binomial distribution it’s (as you would expect) $\mathrm{E}[K] = np$. Although this makes intuitive sense, it can also be shown properly without long calculations, like this: the expectation of a sum of random variables is just the sum of their expectations, and each Bernoulli random variable has expectation $\mathrm{E}[B] = 0 \times P(B = 0) + 1 \times P(B = 1) = 0 + 1 \times p = p$.
The variance $\mathrm{Var}[K]$ (which can be calculated in terms of expectations as $\mathrm{Var}[K] = \mathrm{E}[K^2] - \mathrm{E}[K]^2$) and its square root, the standard deviation $\sigma = \sqrt{\mathrm{Var}[K]}$, tell us how spread out around the mean we expect our results to be (a higher value of $\sigma$ means that the data is more spread out). For a binomial distribution the variance is given by $\mathrm{Var}[K] = np(1-p)$ (how do we know? For each Bernoulli random variable $\mathrm{E}[B]^2 = p^2$ and $\mathrm{E}[B^2] = 0^2 \times P(B = 0) + 1^2 \times P(B = 1) = p$, so $\mathrm{Var}[B] = \mathrm{E}[B^2] - \mathrm{E}[B]^2 = p - p^2 = p(1-p)$; and since the variance of a sum of independent random variables is the sum of their variances, $\mathrm{Var}[K] = n \times p(1-p)$).
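As a sanity check, here’s a little sketch that computes the mean and variance directly from their definitions and compares them with the $np$ and $np(1-p)$ formulas (it reuses the `binomial_pmf` helper from above):

```python
from math import comb

def binomial_pmf(k, n, p):  # same helper as before
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 100, 0.5
mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
variance = sum(k**2 * binomial_pmf(k, n, p) for k in range(n + 1)) - mean**2

print(mean, n * p)                # both 50.0 (up to rounding)
print(variance, n * p * (1 - p))  # both 25.0 (up to rounding)
```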
Below you can find an interactive applet (my own work, written in Python and JavaScript) of the binomial distribution for different values of the number of trials $n$ and the probability of success $p$ (to view the probability of a given number of successes, hover your cursor over the corresponding bar). You can also see that as the number of trials $n$ becomes larger, the probability distribution looks more and more like a certain famous curve you are very likely already familiar with.
In the questions below, you see that the probability of meeting a given character a certain number of times $k$ out of a total of $n$ journeys to Mystery Island also follows a binomial distribution (we’re making the important assumption that which character I meet on a given journey does not affect which character I’ll meet in the next one: we are assuming that the characters I meet on each journey are independent events).
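For instance, under the ‘each character is equally likely’ hypothesis, every journey is an independent trial with probability $p = \frac{1}{391}$ of meeting one particular character, so the binomial machinery applies directly (the number of journeys below is just a made-up example):

```python
from math import comb

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

p = 1 / 391   # one particular character, if all 391 are equally likely
n = 200       # a hypothetical number of Mystery Island journeys

never = binomial_pmf(0, n, p)   # probability of never meeting that character
print(never)      # ~0.60
print(1 - never)  # ~0.40: the chance of meeting them at least once
```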
Many of those Animal Crossing players who believed that the 'each character is equally likely' hypothesis was wrong subscribed to an alternative of their own. They thought that the data they had collected matched a probability distribution where each species (instead of each character) was equally likely. If only there were some precise way to check how well (or how badly) the data they had collected matched either of those two hypotheses, so that the matter could be settled once and for all! (If you want to find out what that way is, you're going to have to keep on reading.)
The most famous probability distribution in the world
The curve that you might already be familiar with is, of course, the probability density function of the normal distribution. The normal distribution is possibly the most important probability distribution in statistics and probability. Unlike discrete probability distributions like the binomial distribution, it is a continuous distribution: this means that the possible outcomes can be any real number ($0.5$, $0.0157$, $\sqrt{2}$, $\pi$) instead of just whole numbers ($0$, $1$, $2$, $3$, $4$). Whilst for a discrete probability distribution there's a function that gives us the probability of every outcome, for a continuous probability distribution probabilities are calculated using a function called the probability density function.
I can find the probability that my outcome $X$ has a value between any two numbers $a$ and $b$ by looking at the area (the blue area in the visualisation below) between the probability density function (the black curve in the visualisation below) and the x-axis, from $x = a$ to $x = b$. That is, we need to evaluate an integral of the probability density function:
$$P(a \leq X \leq b) = \int_{a}^{b} f(x) \, dx$$
The probability density function for the normal distribution is:
$$f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-(x - \mu)^2 / (2 \sigma^2)}$$
Where $e$ is a very famous number, Euler's number $e \approx 2.718$, $\mu = \mathrm{E}[X]$ is the mean, and $\sigma = \sqrt{\mathrm{Var}[X]}$ is the standard deviation of the normal distribution I’m looking at. For a normal distribution, I need to know the values of $\mu$ and $\sigma$ in order to be able to calculate probabilities: they are the parameters of my distribution (just like the number of trials $n$ and the probability of success $p$ were the parameters of the binomial distribution). We put the constant $\frac{1}{\sigma \sqrt{2 \pi}}$ in front of the probability density function so that the probability of $X$ taking any possible value equals 1, that is, to make $\int_{- \infty}^{+\infty} f(x) \, dx = 1$ (we say that it’s a normalisation constant).
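In practice this integral gets handed to a computer. Here’s a rough sketch of what such software might do behind the scenes – approximating the area under the curve with the midpoint rule (a deliberately naive method; real libraries use the error function or cleverer quadrature):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """The probability density function of the normal distribution."""
    return exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def probability_between(a, b, mu, sigma, steps=10_000):
    """Approximate P(a <= X <= b) with the midpoint rule."""
    width = (b - a) / steps
    return sum(normal_pdf(a + (i + 0.5) * width, mu, sigma)
               for i in range(steps)) * width

# About 68% of the probability lies within one standard deviation of the mean:
print(probability_between(-1, 1, mu=0, sigma=1))  # ~0.6827
```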
Luckily, you don't need to calculate this integral by hand. Even if you are very good at calculating integrals, this is not one that can be solved exactly: there is no elementary antiderivative, which is why an approximation method like the one sketched above (or, more likely, a scientific calculator or computer software) is the way to go. To save you the bother, here's a visualisation of normal probability density functions for different values of $\mu$ and $\sigma$, where you can see the approximate value of the probability for $X$ to be found between two values of your choice.
As you can see from the visualisation, although different values of $\mu$ and $\sigma$ correspond to different curves, these curves all look very much alike: in fact they’re all exactly the same curve, just translated (shifted to the right or left) as I increase or decrease the mean $\mu$, and rescaled (stretched out or squashed down) as I increase or decrease the standard deviation $\sigma$. That’s why, in order to calculate probabilities, you might not want to bother with different curves at all, and just look at one of them: we can choose the one with $\mu = 0$ and $\sigma = 1$. This distribution is called the standard normal distribution.
If you give me any random variable $X$ that follows a normal distribution, the number $Z = \frac{X - \mu}{\sigma}$ will follow a standard normal distribution. If, for example, I wanted to find the probability that $X$ is larger than some number $x$, this will be the same as the probability that $Z$ is larger than the number $z = \frac{x - \mu}{\sigma}$ (we say that $z$ is the standard score of $x$: it tells me how far $x$ is from the mean, in units of standard deviations).
In the days of old, people would have used a standard normal table (also called a Z-table) to find the value of $P(Z < z)$. Alternatively, you can use the applet below.
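These days a couple of lines of Python will also do the job: the standard library’s `statistics.NormalDist` knows the cumulative distribution function $P(Z < z)$, so a Z-table lookup becomes a one-liner.

```python
from statistics import NormalDist

Z = NormalDist()  # the standard normal distribution: mu = 0, sigma = 1

# The classic Z-table lookup, P(Z < 1.96):
print(Z.cdf(1.96))  # ~0.975

# Via the standard score: if X ~ Normal(mu = 50, sigma = 5), then
# P(X < 60) = P(Z < (60 - 50) / 5) = P(Z < 2):
print(Z.cdf((60 - 50) / 5))  # ~0.977
```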
The central limit theorem: everything eventually turns out normal
One reason why the normal distribution is so important (and the reason why it comes into our story) is something called the central limit theorem. The central limit theorem (in one of its simplest versions) says that if I take the sum of a large number of independent random variables that all have the same distribution, the sum will approximately follow a normal distribution (as long as that distribution is not some weird one that doesn't even have a finite mean or variance).
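You can watch the theorem at work with a small simulation. The sketch below sums 50 die rolls many times over and checks what fraction of the sums land within one standard deviation of the mean – for a normal distribution that fraction should be about 68% (the die is just an arbitrary choice of a decidedly non-normal starting distribution):

```python
import random

def dice_sum(rolls=50):
    """Sum of 50 independent rolls of a fair six-sided die."""
    return sum(random.randint(1, 6) for _ in range(rolls))

# One die roll has mean 3.5 and variance 35/12; both quantities
# add up over independent rolls, giving the mean and standard
# deviation of the sum:
mu = 50 * 3.5
sigma = (50 * 35 / 12) ** 0.5

samples = [dice_sum() for _ in range(100_000)]
within_one_sigma = sum(mu - sigma <= s <= mu + sigma for s in samples)
print(within_one_sigma / len(samples))  # ~0.68, as the normal curve predicts
```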
Remember how we said earlier that a binomial random variable is a sum of $n$ Bernoulli random variables? The central limit theorem tells us that we can approximate it with a normal distribution (with the same mean $\mu = np$ and standard deviation $\sigma = \sqrt{np(1-p)}$), and that this approximation works best when $n$ is large.
And here's a visualisation of the normal approximation (blue area) to the binomial (red area): you can see that as the number of trials $n$ gets larger, the match between the binomial distribution and the corresponding normal distribution improves, and the probabilities given by the normal distribution closely approximate the ones given by the binomial.
Why might we want to use the normal approximation anyway? In the questions below you can see an example of a case when it's quicker to make calculations using the normal approximation than it would be calculating the binomial probabilities directly.
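Here's a flavour of that (the numbers are again made-up for illustration). Suppose I got 60 or more tails in 100 flips and want to know how likely that is for a fair coin. The exact answer needs a sum of 41 binomial terms; the normal approximation is a single cumulative-distribution lookup (with the customary continuity correction of half a unit):

```python
from math import comb
from statistics import NormalDist

n, p = 100, 0.5

# Exact: sum the binomial probabilities P(K = 60) + ... + P(K = 100).
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(60, n + 1))

# Normal approximation with mu = np, sigma = sqrt(np(1-p)),
# and a continuity correction: P(K >= 60) ~ P(X > 59.5).
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5
approx = 1 - NormalDist(mu, sigma).cdf(59.5)

print(exact)   # ~0.0284
print(approx)  # ~0.0287
```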
One way of showing why the central limit theorem is true is by looking at what happens to a function related to the probability density function called the characteristic function (if you know what a Fourier transform is: it's the Fourier transform of the probability density function) for a sum of independent random variables as the number of terms in the sum becomes large (spoiler alert: it gets closer and closer to the characteristic function of the corresponding normal distribution, meaning that the two distributions get more and more alike). If you're interested in cutting your teeth on slightly more involved maths, you can check out a more detailed outline of this story here.
Finally, the chi-square distribution
However, the reason why the normal approximation and the central limit theorem come into the Mystery Island mystery has to do with yet another distribution, the chi-square distribution. If I have a bunch of $k$ random variables $Z_1, Z_2, \dots, Z_k$ that each follow a standard normal distribution and are also independent of each other, then the sum of their squares $Y$ will follow a chi-square distribution with $k$ degrees of freedom:
$$Y = Z_1^2 + Z_2^2 + \dots + Z_k^2$$
The probability density function for a chi-square random variable with $k$ degrees of freedom can be written as:
$$f(x) = \frac {1}{2^{k/2}\Gamma (k/2)} x^{k/2-1}e^{-x/2}$$
where $\Gamma(z)$ in the normalisation constant is a special function called the gamma function, which generalises the factorial to the case where $z$ is not an integer (if $n$ is a positive integer, the gamma function is just a good old factorial: $\Gamma(n) = (n - 1)! = (n-1) \times (n-2) \times \dots \times 2 \times 1$).
For instance, the probability density function of a chi-square distribution with two degrees of freedom ($k = 2$) is simply an exponential:
$$f(x) = \frac{1}{2} e^{-x/2}$$
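To tie the pieces together, here is one last Python sketch: it evaluates the chi-square density with `math.gamma` (checking that for $k = 2$ it really does reduce to the exponential above) and simulates the definition by summing squares of standard normal draws, whose average should come out close to $k$ (the mean of a chi-square distribution with $k$ degrees of freedom):

```python
import random
from math import exp, gamma

def chi_square_pdf(x, k):
    """Probability density function of the chi-square distribution."""
    return x**(k / 2 - 1) * exp(-x / 2) / (2**(k / 2) * gamma(k / 2))

# For k = 2 the formula collapses to the simple exponential (1/2) e^(-x/2):
print(chi_square_pdf(1.0, 2), 0.5 * exp(-0.5))  # both ~0.3033

# Simulate Y = Z_1^2 + ... + Z_k^2 with k = 5 standard normals:
k = 5
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(k))
           for _ in range(100_000)]
print(sum(samples) / len(samples))  # ~5, the mean of a chi-square with k = 5
```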
You can see what this probability density function for different numbers of degrees of freedom looks like here: