Chi-square in Mystery Island

This is a tale of statistics and probability, gaming, the pursuit of truth, and lockdown boredom.

In late March 2020, as lockdowns were being imposed by governments across the world and travel was no longer an option for many, gamers in their millions, from school-age children to millennials, were still able to fly away to the parallel universe that is the latest instalment of Nintendo’s Animal Crossing (full disclosure: this article is unfortunately not a paid partnership between Golden Square Tutors and Nintendo!).

If you – like yours truly – have never had a chance to play, the game is about turning a deserted island into a thriving village inhabited by cute humanoid animals. However, how the player does this is left entirely up to them, and the lack of defined objectives and the focus on day-to-day activities and simple tasks (things like fishing, taking walks on the beach, picking up a flower, visiting a shop), many of which had to be given up during lockdown, must have played an important part in the intergenerational success of the game.

One way to find new recruits to populate the village on your own island is to visit entrepreneurial raccoon dog Tom Nook's shop and purchase a return ticket to an uncharted, randomly-generated island (these are called Mystery Islands). Each time they land on a Mystery Island, players encounter a non-playable character who, when asked, will agree to return with them to their village. Animal Crossing players started sharing their gaming experiences online, and soon enough they realised that the new villager they met on a Mystery Island could be any of the 391 available characters in the game. However, they started wondering whether each character was equally likely. This is where statistics and probability finally come into play in our story.

Some of the most committed players were keeping a record of which characters they had encountered during their Mystery Island travels. At this point, if you are not an Animal Crossing player yourself, you might be pausing and wondering why they were doing something like that in the first place. My sister Chiara, who, unlike me, is an Animal Crossing veteran, and who first told me about all of this, put it like this: ‘If you are really into Animal Crossing, chances are that you’re a bit of a compulsive collector. A lot of the game dynamics involve collecting items, and keeping records of everything is a natural part of being a collector.’

In any case, as they started pooling together their tidily kept records, they realised that some characters seemed to make an appearance a lot more often than others. The picture that emerged seemed to them to point in a different direction than the original assumption that each character was equally likely. However, not everyone was convinced, and they needed more than a hunch to say that this was the case.

Animal Crossing villagers go binomial

The problem was something like this: I am given a coin and I want to know whether it's fair (which would mean that the probabilities of my coin landing on heads or tails are both 50% – or, as a mathematician would say, the probability of each of the two events equals 0.5). When I flip the coin once and get tails, you will agree with me that that isn't enough for me to say anything about whether the coin is fair. I then flip it 5 times and get only tails, and I might start to get suspicious; once I have flipped my coin 30 times and got only tails, I will be very, very suspicious. However, it's still not completely impossible for me to get tails 30 flips out of 30, even if the coin is fair: it's just extremely unlikely. The sad truth of probability is that I can never be absolutely, completely certain that my coin is fair, no matter how many times I flip it: even a very unlikely result (like 30 tails out of 30) will always remain unlikely, but never completely impossible. What I can still say is that a result like getting tails 30 flips out of 30 is extremely unlikely under the assumption that my coin is a fair one (how unlikely? The probability is $0.5^{30} \approx 0.0000000009$, or less than one in a billion).

I flip a coin five times. What is the probability of getting tails every time?
I flip a coin three times. What is the probability of getting tails exactly once?
I flip a coin four times. What is the probability of getting tails exactly twice?
I throw a die five times. What is the probability of getting a six exactly twice?

In general, the probability of getting a certain number of tails (or heads) is given by the binomial distribution. The probability of getting exactly $k$ tails if I flip the coin $n$ times is given by:

$$P(K = k) = \binom{n}{k}p^k(1-p)^{n-k}$$

Where $\binom{n}{k}$ is the binomial coefficient '$n$ choose $k$', which can be calculated as $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ (or read off Pascal's triangle), and $n!$ is the product of the whole numbers from $1$ up to $n$: $n! = n \times (n-1) \times \dots \times 2 \times 1$ (for instance $\binom{7}{3} = \frac{7!}{3!4!} = 35$). In the coin example $p$ (the probability of success) and $1-p$ (the probability of failure) are both equal to $0.5$, but in the die example (see the questions above) we had $p = \frac{1}{6}$ and $1 - p = \frac{5}{6}$.
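If you'd rather let a computer do the counting, here's a minimal sketch of the formula above in Python (assuming Python 3.8 or later, where the math.comb function is available), which you can also use to check your answers to the questions above:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(K = k): probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# The worked example above: 7 choose 3
print(comb(7, 3))               # 35

# A fair coin flipped 5 times, all 5 tails:
print(binomial_pmf(5, 5, 0.5))  # 0.03125

# A die thrown 5 times, exactly two sixes:
print(binomial_pmf(2, 5, 1/6))  # ≈ 0.161
```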

Before I flip my coin, I do not know what the number of tails $K$ is going to be, all I know are the probabilities of the different possible outcomes $k$: that’s why mathematicians would say that the number $K$ is a random variable.

You can see that if the random variable $K$ follows a binomial distribution we can write it as a sum of random variables $B_1, B_2, \dots, B_n$:

$$K = B_1 + B_2 + \dots + B_n$$

where each $B_m$ can either be $1$ (with probability $p$) or $0$ (with probability $1-p$), and represents the number of tails I get in flip number $m$ (in other words the $B$s follow the simplest possible binomial distribution, the one with $n = 1$, also known as the Bernoulli distribution). This fact might seem obvious now, but it will become important in a bit.

If I flip a fair coin 100 times, I expect to get around 50 tails. The expectation (also known as the mean) $\mathrm{E}[K] = \sum_{k=0}^n k \times P(K = k)$ of a binomial distribution is the number of successes I expect to have, and for a binomial distribution it's (as you would expect) $\mathrm{E}[K] = np$ (although this makes intuitive sense, it can also be shown properly, without long calculations, like this: the expectation of a sum of independent random variables is just the sum of their expectations, and each Bernoulli random variable has expectation $\mathrm{E}[B] = 0 \times P(B = 0) + 1 \times P(B = 1) = 0 + 1 \times p = p$).

The variance $\mathrm{Var}[K]$ (which can be calculated in terms of expectations as $\mathrm{Var}[K] = \mathrm{E}[K^2] - \mathrm{E}[K]^2$) and its square root, the standard deviation $\sigma = \sqrt{\mathrm{Var}[K]}$, tell us how spread out around the mean we expect our results to be (a higher value of $\sigma$ means that the data is more spread out). For a binomial distribution the variance is given by $\mathrm{Var}[K] = np(1-p)$ (how do we know? For each Bernoulli random variable $\mathrm{E}[B]^2 = p^2$ and $\mathrm{E}[B^2] = 0^2 \times P(B = 0) + 1^2 \times P(B = 1) = p$, so $\mathrm{Var}[B] = \mathrm{E}[B^2] - \mathrm{E}[B]^2 = p - p^2 = p(1 - p)$).

Below you can find an interactive applet (my own work, written in Python and JavaScript) of the binomial distribution for different values of the number of trials $n$ and the probability of success $p$ (to view the probability of a given number of successes, hover your cursor over the corresponding bar). You can also see that as the number of trials $n$ becomes larger, the probability distribution looks more and more like a certain famous curve you are very likely already familiar with.

[Interactive applet: the binomial distribution for different values of the number of trials $n$ and the probability of success $p$]

In the questions below, you can see that the probability of meeting a given character a certain number of times $k$ out of a total of $n$ journeys to Mystery Island also follows a binomial distribution (we're making the important assumption that which character I meet on a given journey does not affect which character I'll meet on the next one: we are assuming that the characters I meet on each journey are independent events). A short code sketch after the questions shows one way to check your answers.

In Animal Crossing: New Horizons you can meet one of 391 characters on a journey to Mystery Island. What is the probability of finding the cat Raymond on your first journey if each character is equally likely? (this is called a uniform distribution)
Of those 391 characters, 23 are cats. What is the probability of meeting a cat on my first journey if each character is equally likely?
You make 100 journeys to Mystery Island. What is the probability that you’re going to meet the cat Raymond exactly once if each character is equally likely?
You make 100 journeys to Mystery Island. What is the probability that you’re going to meet the cat Raymond (at least once) if each character is equally likely?
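As promised, here's a minimal sketch (plain Python, no libraries needed) for the last two questions, under the each-character-is-equally-likely hypothesis:

```python
# Each of the 391 characters is assumed equally likely on every journey.
p = 1 / 391          # probability of meeting Raymond on a single journey
n = 100              # number of journeys

p_exactly_once = n * p * (1 - p)**(n - 1)   # binomial probability with k = 1
p_at_least_once = 1 - (1 - p)**n            # 1 minus P(never meeting Raymond)

print(p_exactly_once)    # ≈ 0.199
print(p_at_least_once)   # ≈ 0.226
```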

Many of those Animal Crossing players who believed that the 'each character is equally likely' hypothesis was wrong subscribed to an alternative of their own. They thought that the data they had collected matched a probability distribution where each species (instead of each character) was equally likely. If only there were some precise way to check how well (or how badly) the data they had collected matched either of those two hypotheses, so that the matter could be settled once and for all! (If you want to find out what that way is, you're going to have to keep on reading.)

In Animal Crossing: New Horizons there are in total 35 character species, and 23 characters are cats. Assuming each species (instead of each character) is equally likely, what is now the probability of meeting the cat Raymond on your first journey to Mystery Island?
You make 100 journeys to Mystery Island. What is the probability that you're going to meet the cat Raymond (at least once) if each species is equally likely?

The most famous probability distribution in the world

The curve that you might already be familiar with is, of course, the probability density function of the normal distribution. The normal distribution is possibly the most important probability distribution in statistics and probability. Unlike discrete probability distributions like the binomial distribution, it is a continuous distribution: this means that the possible outcomes are taken to be any real number ($0.5$, $0.0157$, $\sqrt{2}$, $\pi$) instead of simply whole numbers ($0$, $1$, $2$, $3$, $4$). Whilst in a discrete probability distribution there's a function that gives us the probability of every outcome, for a continuous probability distribution the probability can be calculated using a function called the probability density function.

I can find the probability of my outcome $X$ to have a value between any two numbers $a$ and $b$ by looking at the area (the blue area in the visualisation below) between the probability density function (the black curve in the visualisation below) and the x-axis from $x = a$ to $x = b$. That is, we need to evaluate an integral of the probability density function:

$$P(a \leq X \leq b) = \int_{a}^{b} f(x) \, dx$$

The probability density function for the normal distribution is:

$$f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-(x - \mu)^2 / 2 \sigma^2}$$

Where $e$ is a very famous number, Euler's number $e \approx 2.718$, $\mu = \mathrm{E}[X]$ is the mean and $\sigma = \sqrt{\mathrm{Var}[X]}$ is the standard deviation of the normal distribution I'm looking for. For a normal distribution, I need to know the values of $\mu$ and $\sigma$ in order to be able to calculate probabilities: they are the parameters of my distribution (just like the number of trials $n$ and the probability of success $p$ were the parameters of the binomial distribution). We put the constant $\frac{1}{\sigma \sqrt{2 \pi}}$ in front of the probability density function so that the probability of $X$ taking any possible value is equal to 1, that is, to make $\int_{- \infty}^{+\infty} f(x) \, dx = 1$ (we say that it's a normalisation constant).

Luckily, you don't need to calculate this integral yourself. Even if you are very good at calculating integrals, this is not one that can be solved exactly. You'd normally need to use some approximation method, or, more likely, a scientific calculator or computer software. To save you the bother, here's a visualisation of normal probability density functions for different values of $\mu$ and $\sigma$, where you can see the approximate value of the probability for $X$ to be found between two values of your choice.
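Here's a small illustration of that last point: a sketch (assuming scipy is installed) that evaluates the integral numerically for the standard normal distribution:

```python
import math
from scipy import integrate

def normal_pdf(x, mu=0.0, sigma=1.0):
    """The probability density function of the normal distribution."""
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# P(-1 <= X <= 1) for the standard normal distribution (mu = 0, sigma = 1):
area, error_estimate = integrate.quad(normal_pdf, -1, 1)
print(area)   # ≈ 0.6827: about 68% of the probability lies within one sigma of the mean
```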

[Interactive applet: normal probability density functions for different values of $\mu$ and $\sigma$, showing the probability of $X$ falling between two values of your choice]

As you can see from the visualisation, although different values of $\mu$ and $\sigma$ correspond to different curves, these curves all look very much alike: in fact they’re all exactly the same curve, just translated (shifted to the right or left) as I increase or decrease the mean $\mu$, and rescaled (squashed down or squeezed up) as I increase or decrease the standard deviation $\sigma$. That’s why in order to calculate probabilities you might not want to bother with different curves at all, and just look at one of them: we can choose the one with $\mu = 0$ and $\sigma = 1$. This distribution is called the standard normal distribution.

If you give me any number $X$ that follows a normal distribution, the number $Z = \frac{X - \mu}{\sigma}$ will follow a standard normal distribution. If, for example, I wanted to find the probability that $X$ is larger than some number $x$, this will be the same as the probability that $Z$ is larger than the number $z = \frac{x - \mu}{\sigma}$ (we say that $z$ is the standard score of $x$: it tells me how far away $x$ is from the mean in units of standard deviations).

In the days of old people would have used a standard normal table (also called a Z-table) to find the value of $P(Z < z)$. Alternatively, you can use the applet below:

[Interactive applet: the standard normal distribution and $P(Z < z)$]

The average height in the UK is around 5’ 7’’ (170 cm), with a standard deviation of around 3’’ (8 cm). What percentage of the UK population is 5’ 11’’ (180 cm) tall or more?
The average height for a male in the age group 18-25 in the UK is around 5’ 10’’ (178 cm), with a standard deviation of around 2’’ (5 cm). What percentage of males aged 18-25 are 5’ 11’’ (180 cm) tall or more?
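If you'd like to check your answers with code rather than a Z-table, here's one way (again assuming scipy):

```python
from scipy.stats import norm

# First question: mu = 170 cm, sigma = 8 cm. What fraction is 180 cm or taller?
z = (180 - 170) / 8       # the standard score, z = 1.25
print(1 - norm.cdf(z))    # P(Z >= 1.25) ≈ 0.106, or about 11%

# Or let scipy do the standardising for us (sf is the "survival function", 1 - cdf):
print(norm.sf(180, loc=170, scale=8))
```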

The central limit theorem: everything eventually turns out normal

One reason why the normal distribution is so important (and the reason why it comes into our story) is something called the central limit theorem. The central limit theorem (in one of its simplest versions) says that if I take the sum of a large number of independent random variables that all have the same distribution, the sum will approximately follow a normal distribution (as long as your distribution is not some weird distribution that doesn't even have a mean or a variance).

Remember how we said earlier that a binomial random variable is a sum of $n$ Bernoulli random variables? The central limit theorem tells us that we can approximate it as a normal distribution (with the same mean $\mu = np$ and standard deviation $\sigma = \sqrt{np(1-p)}$), and that this works best when $n$ is large.

And here's a visualisation of the normal approximation (blue area) to the binomial (red area): you can see that as the number of trials $n$ gets larger you get a better match between the binomial distribution and the corresponding normal distribution, and that the probabilities given by the normal distribution approximate well the ones given by the binomial.

[Interactive applet: the normal approximation (blue area) to the binomial distribution (red area) for different values of $n$]

Why might we want to use the normal approximation anyway? In the questions below you can see an example of a case where it's quicker to make calculations using the normal approximation than it would be to calculate the binomial probabilities directly.

In Animal Crossing: New Horizons you start off with only two villagers living on your island: one of them is always an "uchi" personality type, and the other one a "jock" personality type. There are only 24 characters with "uchi" personality type, and one of them is a pink rhino called Renée. If you restart the game 30 times in the hope of getting Renée as one of your first villagers, what's the probability of that happening at least 3 times out of 30? (hint: for this question you can use the binomial distribution applet above)
What is the probability of getting Renée as one of your first villagers at least 7 times out of 100? (hint: for this question you can use the standard normal distribution applet)
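For comparison, here's a sketch (assuming scipy) that computes the second question both ways: exactly, with the binomial distribution, and approximately, with the corresponding normal distribution:

```python
from math import sqrt
from scipy.stats import binom, norm

n, p = 100, 1 / 24                     # 100 restarts; Renée is 1 of 24 uchi characters
mu, sigma = n * p, sqrt(n * p * (1 - p))

# Exact binomial answer: P(K >= 7) = P(K > 6)
print(binom.sf(6, n, p))

# Normal approximation (with a continuity correction at 6.5):
print(norm.sf(6.5, loc=mu, scale=sigma))
```

The continuity correction (using $6.5$ rather than $7$) accounts for the fact that we're approximating a discrete distribution with a continuous one.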

One way of showing why the central limit theorem is true is by looking at what happens to a function related to the probability density function called the characteristic function (if you know what a Fourier transform is: it's the Fourier transform of the probability density function) for a sum of independent random variables as the number of terms in the sum becomes large (spoiler alert: it gets closer and closer to the characteristic function of the corresponding normal distribution, meaning that the two distributions get more and more alike). If you're interested in cutting your teeth on slightly more involved maths, you can check out a more detailed outline of this story here.

Finally, the chi-square distribution

However, the reason why the normal approximation and the central limit theorem come into the Mystery Island mystery has to do with yet another distribution, the chi-square distribution. If I have a bunch of $k$ random variables $Z_1, Z_2, \dots, Z_k$ that each follow a standard normal distribution and are also independent of each other, then the sum of their squares $Y$ will follow a chi-square distribution with $k$ degrees of freedom:

$$Y = Z_1^2 + Z_2^2 + \dots + Z_k^2$$

The probability density function for a chi-square random variable with $k$ degrees of freedom can be written as:

$$f(x) = \frac {1}{2^{k/2}\Gamma (k/2)} x^{k/2-1}e^{-x/2}$$

where $\Gamma(z)$ in the normalisation constant is a special function called the gamma function, which generalises the factorial to the case where $z$ is not an integer (if $n$ is an integer the gamma function is just a good old factorial: $\Gamma(n) = (n - 1)! = (n-1) \times (n-2) \times \dots \times 2 \times 1$).

For instance, the probability density function of a chi-square distribution with two degrees of freedom ($k = 2$) is simply an exponential:

$$f(x) = \frac{1}{2} e^{-x/2}$$
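As a quick sanity check (assuming scipy), we can compare this formula against scipy's own chi-square density:

```python
import math
from scipy.stats import chi2

x = 3.0
print(0.5 * math.exp(-x / 2))   # the k = 2 formula above
print(chi2.pdf(x, df=2))        # scipy's chi-square density gives the same value
```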

You can see what this probability density function for different numbers of degrees of freedom looks like here:

[Interactive applet: the chi-square probability density function for different numbers of degrees of freedom $k$]

Since it's a sum of squares, $Y$ can never be a negative number: we see that the possible outcomes for a chi-square random variable are values greater than or equal to 0 ($Y \geq 0$).

In the case where $k = 1$, $Y$ is just the square of a single standard normal random variable, $Y = Z^2$. As you'd expect, the most likely outcomes for $k = 1$ are near $Y = 0$ (since the most likely outcomes of $Z$ are near $Z = 0$), but this changes as the number of degrees of freedom increases.

The mean of the chi-square distribution is just the number of degrees of freedom, $\mathrm{E}[Y] = k$ (each standard normal by definition has mean $\mu = \mathrm{E}[Z] = 0$ and variance $\sigma^2 = \mathrm{Var}[Z] = \mathrm{E}[Z^2] - \mathrm{E}[Z]^2 = 1$; putting the two together we get that $\mathrm{E}[Z^2] = 1$, which means that the mean of $Z^2$ is $1$; the mean of $Y$ is the sum of the means of each $Z^2$).

The variance of the chi-square distribution is twice the number of degrees of freedom, $\mathrm{Var}[Y] = 2k$ (to prove that this is the case we need to know that $\mathrm{E}[Z^4] = 3$, which can be shown by doing a Gaussian integral; but then it's easy to show that $\mathrm{Var}[Z^2] = \mathrm{E}[Z^4] - \mathrm{E}[Z^2]^2 = 3 - 1 = 2$, and the variance of $Y$ is the sum of the variances of each $Z^2$).
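Both facts are easy to check numerically. Here's a minimal Monte Carlo sketch (assuming numpy) that builds chi-square samples directly from their definition as sums of squared standard normals:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
k = 5                                    # degrees of freedom
z = rng.standard_normal((100_000, k))    # 100,000 draws of k standard normals each
y = (z**2).sum(axis=1)                   # each row becomes one chi-square sample

print(y.mean())   # ≈ k = 5
print(y.var())    # ≈ 2k = 10
```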

By now you know that as the number of degrees of freedom $k$ becomes large, the central limit theorem tells us that the chi-square distribution will itself also be approximated by a normal distribution. After all, it's a sum of $k$ independent random variables (each of which follows a chi-square distribution with one degree of freedom). Just for fun, you can see that this is actually what happens here:

[Interactive applet: the normal approximation to the chi-square distribution as the number of degrees of freedom $k$ grows]

The chi-square test

The chi-square distribution gets its fame because it can be used to do exactly what the Animal Crossing players wanted to do: if you have some real-world data and a hypothesis of what probability distribution might have generated it, the chi-square test allows you (as long as you have collected enough data for it to work) to figure out whether your hypothesis is likely to be right.

Let's assume that we know what the underlying distribution is. Then we do not expect our data to match the theoretical expected values exactly (for instance, if we flip a coin 100 times, we do not expect it to land on tails exactly 50 times), but we expect it to be close enough. The chi-square test gives us a way to put a number on how close our data is to the underlying distribution.

Here's how it works: we have a certain number of possible outcomes (a mathematician would call them the sample space) and we want to find out if our hypothesis for their probability distribution is any good. For instance, there are only two possible outcomes (heads or tails) if I flip a coin once. In the Mystery Island mystery I can take the possible outcomes to be either the characters that I meet (in which case we'll have 391 of them) or the species of the characters I meet (in which case we'll have 35 of them). For each outcome our hypothesis gives us a certain probability $p_i$. This probability times the number of trials $n$ (the total number of coin flips, or the total number of Mystery Island journeys) is the expected value $E_i = np_i$ for that outcome given by our hypothesis. The number of times the corresponding outcome actually happens is the observed value $O_i$.

The chi-square statistic $\chi^2$ is a number that measures how different the real-life observed values $O_i$ are from the expected values given by our hypothesis $E_i$. It's calculated by taking the difference between the observed values $O_i$ and the expected values $E_i$, squaring them and dividing the result by the expected values $E_i$, and then summing them all together:

$$\chi^2 = \sum_{i=1}^N \frac{(O_i - E_i)^2}{E_i}$$

Once we have calculated $\chi^2$ we just look up the corresponding probability for a chi-square random variable $Y$ with $N-1$ degrees of freedom to be greater than or equal to $\chi^2$, $P(Y \geq \chi^2)$. We call this number the p-value. If the p-value is large it means that there's a good fit between the observed values and the hypothesis: data as far or further from the theoretical expected values is likely to have been generated by the probability distribution in our hypothesis. If the p-value is small, then data as far or further from the theoretical expected values is unlikely to have been generated by the probability distribution in our hypothesis. If the p-value is small enough, that gives us reason to believe that our original hypothesis was wrong, and we can rule it out (we say that we have rejected the null hypothesis). How small is small enough? In science, we often assume a significance level of 0.05 (or sometimes 0.01), so that any p-value smaller than that would allow us to rule out our original hypothesis.
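Here's what the whole procedure looks like in code (a minimal sketch, assuming scipy; the fair-die data at the end is a made-up example, and scipy's built-in scipy.stats.chisquare function would do the same job in one call):

```python
import numpy as np
from scipy.stats import chi2

def chi_square_p_value(observed, expected):
    """Pearson's chi-square statistic and its p-value with N - 1 degrees of freedom."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    statistic = ((observed - expected)**2 / expected).sum()
    p_value = chi2.sf(statistic, df=len(observed) - 1)   # P(Y >= chi-square statistic)
    return statistic, p_value

# A made-up example: a die thrown 60 times, expecting 10 of each face if it's fair
print(chi_square_p_value([13, 8, 9, 12, 11, 7], [10] * 6))
```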

We have said that in a case where we only know the total number of trials $n$ the number of degrees of freedom is $N-1$. However, if our data comes in a contingency table like this:

                                     Drug    Placebo    Total
Recovered within two weeks            418        226      644
Did not recover within two weeks       82        274      356
Total                                 500        500    1,000

Then the number of degrees of freedom will be $k = (c - 1) \times (r - 1)$, where $c$ is the number of columns and $r$ the number of rows in the observed data (in this example $k = (2-1) \times (2-1) = 1$).

This last formula was the subject of some kerfuffle in the early days of statistics, and in 1922 it led Karl Pearson (who himself had discovered, among other things, the chi-square distribution) to call another famed statistician, Ronald Fisher, 'a Don Quixote' (Don Quixote being the insane knight who fights windmills in Cervantes's eponymous masterpiece, not usually meant as a compliment) that 'must either destroy himself or the theory of probable errors'. Fisher's crime had been saying that the number of degrees of freedom should in this case be calculated as $k = (c - 1) \times (r - 1)$. Needless to say, Fisher was eventually shown to be right.

Where does the formula come from? Intuitively, in a $2 \times 2$ table I have 4 random variables $a, b, c, d$. But I already know how many patients are in the treatment group (the total number of patients who get the drug $N$) and how many patients are in the control group (the total number of patients who get the placebo $M$) and this takes away 2 degrees of freedom ($a + c = N$ and $b + d = M$). When my null hypothesis is that the drug is just as effective as a placebo, I also usually do not know the probability of a patient recovering within two weeks, and I need to estimate it from the available data: this takes away another degree of freedom ($p = \frac{a + b}{M + N}$), leaving $k = 1$. During my experiment I then just need to write down the number of patients who recovered within two weeks, and I can calculate the other three values: my four random variables are not independent, they're actually completely dependent, so that it's enough to know the value of one of them to know the values of all four. Similarly, if I decide I am going to make $n$ journeys to Mystery Island to collect data on the probability distribution of characters, I already know what $n$ is going to be before I collect any data. If I write down the number of characters I have met for each species except for one, I will always be able to calculate that number from the values for the other species and the total number of characters I've met $n$. That's why the number of degrees of freedom in that case is $k = N - 1$ where $N$ is the number of classes.

You are a scientist who has developed a new drug, and you want to check if your drug works better than a placebo (a sugar pill). What is the chi-square statistic for the data in the contingency table above under the null hypothesis that your drug works just as well as a placebo?
On the basis of the chi-square statistic you have calculated, are you able to reject the null hypothesis?
Before starting the experiment, you make the assumption that the probability of a patient recovering within two weeks if they have taken the drug is $p = 0.85$ and if they take the placebo it’s $q = 0.45$. How many degrees of freedom should you use for your chi-square test this time?
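For the first two questions, here's a quick cross-check (assuming scipy; passing correction=False gives the plain Pearson statistic, without the Yates continuity correction that scipy applies to $2 \times 2$ tables by default):

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[418, 226],     # recovered within two weeks
                     [82, 274]])     # did not recover within two weeks

stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(stat, p_value, dof)            # dof = (2 - 1) x (2 - 1) = 1
```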

Chi-square testing in Mystery Island

And this is exactly what the Animal Crossing players did. They collected enough data (remember that the chi-square test relies on the central limit theorem to justify the normal approximation to the binomial, and the central limit theorem relies on the number of trials being large), and once they had it, they were able to run a chi-square test on the each-character-is-equally-likely hypothesis.

My sister Chiara, who is a moderator on the largest Animal Crossing forum in Italy (yes, I have those kinds of connections), has collected a sample herself from a grand total of $n = 350$ (the sample size) Italian players' Mystery Island trips.

Here's the data she collected, grouped into species (click on a bar to see the data):

[Interactive applet: the number of characters of each species met over Chiara's 350 Mystery Island journeys]

Chiara actually collected data for each character, but the sample size was too small (only 350 journeys!) to run the chi-square test on the number of times each character was observed (if $p$ is too small you need a larger $n$ to let the central limit theorem do its magic), a problem easily solved by looking at the number of times a character of each species was met instead. Grouping data into larger groups when the sample size would otherwise be too small (or when you have continuous data) is a common trick in statistical tests, commonly referred to as data binning.

The number of times characters of each species were met, shown above, can be turned into a relative frequency by dividing it by the total number of journeys. Relative frequencies are very much like probabilities: in fact, for a large sample we would expect them to be close to the actual probabilities of the underlying probability distribution that generated them.

What should my expectation for the number of times I will meet characters of a given species be if I make 350 Mystery Island journeys and I assume that each character is equally likely?
What should my expectation be if instead I assume that each species is equally likely?

The number of times each species was met out of those $350$ Mystery Island journeys, together with the expected number of times they would be met under each of the two hypotheses (and the corresponding relative frequencies and probabilities), are all shown in the last graph of the article:

[Interactive applet: observed counts for each species, together with the expected counts under each of the two hypotheses]

You might agree that it's not immediately obvious, just by looking at the graph above, which of the two hypotheses is more likely given our data. However, just like those Animal Crossing players bent on settling the dispute on how likely it is to meet each character on Mystery Island, we are now in a position to run the chi-square test and find out the chi-square statistics that correspond to each of the two hypotheses. This requires a few calculations (nothing fancy: just multiplications, divisions, subtractions and additions, or, if you're lazy like me, a few lines of code in Python), but in the end we get that the chi-square statistic corresponding to the each-character-is-equally-likely hypothesis is $\chi^2_A \approx 107$, and that the chi-square statistic corresponding to the each-species-is-equally-likely hypothesis is $\chi^2_B \approx 29$.

What p-value does a chi-square statistic of 107 correspond to? Does it allow you to reject the null hypothesis? (hint: remember what the number of degrees of freedom is if there are 35 species, and use the applet for the chi-square distribution above)
What p-value does a chi-square statistic of 29 correspond to? Does this allow you to reject the null hypothesis?
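To check your answers (assuming scipy), the two p-values are one line each:

```python
from scipy.stats import chi2

dof = 35 - 1                 # 35 species, so N - 1 = 34 degrees of freedom
print(chi2.sf(107, dof))     # p-value for the each-character-is-equally-likely hypothesis
print(chi2.sf(29, dof))      # p-value for the each-species-is-equally-likely hypothesis
```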

The chi-square test told the Animal Crossing players that there was a clear winner: those who still believed that each character was equally likely finally had to give up and admit that they were wrong, as the test showed that their hypothesis disagreed with the observed data.

Shortly afterwards, players were able to go through the game's code (a practice commonly referred to in the community as data mining, although the pedant in me wants to point out that that's not actually what data mining means), and check directly that what the algorithm that decides which character you're going to meet next on a Mystery Island actually does is select each species with the same probability. And although the chi-square test is never able to prove something to be the case for certain, it had already shown that the competing theory was very likely to be wrong.

In science, we often need to find out if a hypothesis we made is likely or unlikely to be true. Unfortunately there's no API code the world is programmed in that we can look at directly in order to find out how it works. The next best thing we can do is run experiments and check if our hypothesis agrees with them. That's why statistical tests like the chi-square test are so often used, for instance in fields like medicine and economics.

If you're feeling brave: proving Pearson's theorem

If you have been paying attention, you might have noticed that I have not yet explained why the chi-square test works in the first place.

Let's recap what we know so far: first, that the observed values $O_1, O_2, \dots, O_N$ (each representing the number of characters of each species met over $n$ journeys to Mystery Island, where $N$ is the total number of possible species), before I observe them, are random variables that each follow a binomial distribution with probabilities of success $p_i$ and the same number of trials $n$.

The central limit theorem tells us that if $n$ is large each $O_i$ can also be approximated as a normally distributed random variable that has mean $\mu_i = np_i$ and standard deviation $\sigma_i = \sqrt{np_i(1-p_i)}$.

If I have some hypothesis for what I believe the value of $p_i$ to be, before I make any observation at all I expect $O_i$ to be around $E_i = \mathrm{E}[O_i] = \mu_i = np_i$.

Finally, we have said that the chi-square distribution is the distribution of a sum of the squares of $k$ independent standard normal random variables.

The chi-square test then works because we are saying that if our hypotheses for the values of the $p_i$ are the right ones, then the random variable $\chi^2$ (the chi-square statistic) defined by:

$$\chi^2 = \sum_{i = 1}^N \frac{(O_i - E_i)^2}{E_i}$$

will follow a chi-square distribution with $k = N - 1$ degrees of freedom. The fact that it does is known as Pearson's theorem (named after the Don Quixote guy).

At first glance, the chi-square statistic looks like it's the right thing to follow a chi-square distribution. It is, after all, the sum of the squares of (approximately) normal random variables.

However, thinking about it a bit more, there seems to be not just one problem with this, but actually three different problems:

  • In the first place, we are summing the squares of $N$ normal random variables $X_i = \frac{O_i - E_i}{\sqrt{E_i}}$, but the number of degrees of freedom of $\chi^2$ is not $N$, as we would expect, but $N - 1$.
  • Secondly, the random variables $O_i$, and hence the $X_i = \frac{O_i - E_i}{\sqrt{E_i}}$ are not independent random variables: knowing something about the value of one of them tells me something about the value of the others. After all, in the extreme example where I find that $O_i = n$ (all the characters I've met are cats), all the other $O_j = 0$ (the number of characters of any other species I've met is $0$). And in general, if I know that $O_i = x$ (I know that I've met $x$ cats), then surely $O_j \leq n - x$ (I already know that I cannot possibly have met more than $n - x$ rhinos).
  • And lastly, my normal variables $X_i$s are not even standardised: yes, their mean is equal to zero:
    $$\mathrm{E}[X_i] = \frac{(\mathrm{E}[O_i] - E_i)}{\sqrt{E_i}} = 0$$
    but their variance isn't equal to one:
    $$\mathrm{Var}[X_i] = \frac{\mathrm{Var}[O_i]}{E_i} = \frac{np_i(1-p_i)}{np_i} = 1-p_i$$
    (where I have used that $\mathrm{Var}[aX] = a^2 \mathrm{Var}[X]$ and $\mathrm{Var}[a] = 0$ if $X$ is a random variable and $a$ a constant).

Is Pearson's theorem, and hence the chi-square test, a lie? A reptilian conspiracy to take over the world?

It turns out that these three problems in a sense cancel each other out, and that $\chi^2$ actually is a sum of independent standard normal random variables, although those variables are not the $X_i$s.

However, understanding this will require a bit of geometry, as well as some further probability.

If we see the random variables $O_i$ as the components of an $N$-dimensional (in our example a $35$-dimensional) vector $\mathbf{O} = (O_1, O_2, \dots, O_N)$, we say that $\mathbf{O}$ is a random vector and that it follows a multinomial distribution. The $N$-dimensional version of the central limit theorem tells us that if $n$ is large this is approximated by a multivariate normal distribution. We can then look at the vector $\mathbf{X}$, whose components $X_i = \frac{O_i - E_i}{\sqrt{E_i}}$ are normal random variables. So far, nothing new is happening.

The length of the vector $\mathbf{X}$ is given by good old Pythagoras' theorem (in $N$ dimensions): $\left| \mathbf{X} \right| = \sqrt{X_1^2 + X_2^2 + \dots + X_N^2}$. You can see that the square of the length of $\mathbf{X}$ is then just the chi-square statistic we care so much about: $\chi^2 = \left| \mathbf{X} \right|^2 = X_1^2 + X_2^2 + \dots + X_N^2$.

The covariance $\mathrm{Cov}(X_i, X_j) = \mathrm{E}[X_iX_j] - \mathrm{E}[X_i]\mathrm{E}[X_j]$ is a measure of how much knowing the value of the random variable $X_i$ tells me about the value of another random variable $X_j$. For instance, if $X_i$ and $X_j$ are independent, then their covariance is zero $\mathrm{Cov}(X_i, X_j) = 0$. You've probably already realised from that definition that the variance of $X_i$ is just the covariance of $X_i$ with itself $\mathrm{Var}[X_i] = \mathrm{Cov}(X_i, X_i)$. We can calculate the covariances for our multinomial variables $O_i$: for $i = j$ the covariance is just the variance $\mathrm{Cov}(O_i, O_i) = \mathrm{Var}[O_i] = np_i(1 - p_i)$, but for $i \neq j$ it's $\mathrm{Cov}(O_i, O_j) = -np_ip_j$. For $\mathbf{X}$ this corresponds to $\mathrm{Cov}(X_i, X_i) = 1 - p_i$ and $\mathrm{Cov}(X_i, X_j) = -\sqrt{p_ip_j}$.
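Here's a quick numerical check of these covariances (a sketch, assuming numpy; the three probabilities are a made-up example with $N = 3$ outcomes):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 1_000                                 # trials per simulated experiment
p = np.array([0.5, 0.3, 0.2])             # made-up outcome probabilities, N = 3

O = rng.multinomial(n, p, size=100_000)   # repeat the whole experiment many times
E = n * p
X = (O - E) / np.sqrt(E)

print(np.cov(X, rowvar=False).round(3))      # empirical covariances of the X_i
print(np.eye(3) - np.sqrt(np.outer(p, p)))   # theory: 1 - p_i on the diagonal, -sqrt(p_i p_j) off it
```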

We can see the covariance as an $N \times N$ matrix $\Sigma$ with $\Sigma_{ij} = \mathrm{Cov}(X_i, X_j)$. It's a symmetric matrix (meaning it doesn't change when we swap its rows with its columns: in symbols, $\Sigma = \Sigma^T$) because $\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_j, X_i)$.

A diagonal matrix is a matrix whose elements are all zeroes except for the ones on the diagonal, which are its eigenvalues. An important theorem from linear algebra with a spooky name, the spectral theorem, says that any symmetric matrix (like our covariance matrix $\Sigma$) can be turned by rotating it into a diagonal matrix with real (and not complex) eigenvalues (to rotate a matrix you need to conjugate it by a rotation matrix $A$ so that you get $A \Sigma A^T = D$, where $D$ is a diagonal matrix: in what follows we won't actually need to know exactly which rotation matrix $A$ will do the job, it's enough to know that because of the spectral theorem there is one that does).

Moreover, our covariance matrix $\Sigma$ can be written as $\Sigma = I - \mathbf{v}\mathbf{v}^T$, where $I$ is the identity matrix that leaves any vector as it is, and the vector $\mathbf{v}$ has coordinates $\mathbf{v} = (\sqrt{p_1}, \sqrt{p_2}, \dots, \sqrt{p_N})$. The fact that it can be written like that means that $\Sigma$ is a projection matrix onto the $N-1$ dimensional subspace perpendicular to the vector $\mathbf{v}$: if I feed it the vector $\mathbf{v}$ it turns it into the zero vector ($\Sigma \mathbf{v} = 0$), but if I feed it any vector $\mathbf{u}$ that is perpendicular to $\mathbf{v}$ it leaves it untouched ($\Sigma \mathbf{u} = \mathbf{u}$). This in turn tells us that the eigenvalues of the matrix $\Sigma$ are $0$ (repeated once, corresponding to the vector $\mathbf{v}$) and $1$ (repeated $N-1$ times, corresponding to the vectors perpendicular to $\mathbf{v}$).

If we rotate the vector $\mathbf{X}$ we will get shiny new coordinates $Z_i$ (which we could calculate, if we knew the rotation matrix $A$, as $Z_i = \sum_j A_{ij}X_j$); however, importantly, the length of the vector $\mathbf{X}$ will stay the same, and thus also its square $\chi^2 = X_1^2 + X_2^2 + \dots + X_N^2 = Z_1^2 + Z_2^2 + \dots + Z_N^2$ (a rotation, unlike a stretch or a shear, leaves the lengths of vectors untouched). In this case the new coordinates are new random variables. Because each of them is a sum of random variables of mean $0$, they also each have mean $0$. But the covariance matrix of the $Z_i$ is $D$, which means that the $Z_i$s are also independently distributed random variables (because the off-diagonal elements of their covariance matrix $D$ are all $0$s) and all have variance equal to $1$, except for one of them, $Z_N = \mathbf{v}^T \mathbf{X} = \sqrt{p_1} X_1 + \sqrt{p_2} X_2 + \dots + \sqrt{p_N} X_N$, which as we know has variance equal to $0$. Now, a normally distributed random variable with variance $0$ is not really random at all: it's just a constant. And because the mean of $Z_N$ is also $0$, we know that $Z_N$ must be the number zero, $Z_N = \mathbf{v}^T \mathbf{X} = 0$ (this makes sense when you rewrite $Z_N$ back in terms of the binomial random variables $O_i$: $Z_N = \frac{1}{\sqrt{n}}(\underbrace{O_1 + O_2 + \dots + O_N}_{= n} - n)$). The other $Z_i$s (that is $Z_1, Z_2, \dots, Z_{N-1}$) all have mean $0$ and variance $1$: they are independent standard normal random variables, and what's more, there are exactly $k = N - 1$ of them.

We have shown that the square of the length of $\mathbf{X}$ is $\chi^2 = \left| \mathbf{X} \right|^2 = X_1^2 + X_2^2 + \dots + X_N^2 = Z_1^2 + Z_2^2 + \dots + Z_{N-1}^2 + 0^2$, and hence follows a chi-square distribution with $k = N - 1$ degrees of freedom, which finally completes our proof of Pearson's theorem.
