March 1, 2013
Half of this post will be the theoretical basis for what I will discuss, and the other half will be running some trials via Python and plotting some "examples." I am not a probabilist, nor am I a statistician; consequently, there may be errors in this post. The main point I want to make is this: the law of large numbers only works when you have a large number of trials. As obvious as this sounds, I feel there are a number of places where this idea is misused, especially places where expected value or standard deviation is used in an essential way when the number of trials is small.
The Law Of Large Numbers.
Loosely, the law of large numbers states that the average outcome of an experiment only approaches the expected value as the number of times you do the experiment gets "large" (remember these quotes, they will be important!). Specifically, the (strong) law of large numbers states:
Theorem (Strong Law of Large Numbers). Let $X_1, X_2, X_3, \dots$ be independent and identically distributed integrable random variables (if this doesn't mean anything to you, think about die rolls) with $E(X_i) = \mu$, and define $S_n = X_1 + X_2 + \cdots + X_n$. Then $P\left(\lim_{n\to\infty} \frac{S_n}{n} = \mu\right) = 1$.
This last line means: the probability that the average value $\frac{S_n}{n}$ will approach the expected value $\mu$ as $n \to \infty$ is equal to 1. This last part is important to this post, so we put it in italics.
This is, of course, widely applied. When you flip a fair coin, you expect the probability of getting heads to be $\frac{1}{2}$; indeed, this is why coin flipping is so popular: each side has an equal chance of coming up. This idea, combined with human psychology, can get a bit sticky: if heads has come up 20 times in a row, would you expect it to come up again? Would you expect that it was "time for tails to come up" after that long string? This is illustrated in the Gambler's Fallacy, an interesting topic which we will not get into in this post, but which is important in its own right. The problem is that, with a small number of flips, the observed probability of heads will not equal $\frac{1}{2}$ and, indeed, can be quite far away from $\frac{1}{2}$.
Here is an illustration. I've let Python generate 0's and 1's at random with equal probability (meaning "tails" and "heads" respectively). If we allow for 5 flips in one "game", the probability of heads (total number of heads over total number of flips) is essentially random; I've played ten games and graphed their probabilities below:
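A trial like this can be sketched with Python's standard `random` module; the seed and the game counts here are arbitrary choices for reproducibility, not anything canonical:

```python
import random

random.seed(1)  # fixed seed so the runs are reproducible

def play_game(flips):
    """Flip a fair coin `flips` times; return the proportion of heads."""
    heads = sum(random.randint(0, 1) for _ in range(flips))
    return heads / flips

short_games = [play_game(5) for _ in range(10)]    # 5 flips per game
long_games = [play_game(100) for _ in range(10)]   # 100 flips per game

print(short_games)
print(long_games)
```

Running this, the 5-flip proportions jump all over $[0, 1]$, while the 100-flip proportions tend to cluster much closer to 0.5.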
As we see, we don’t really get anywhere near one half in some of these games. If we allow for 10 flips for each game, then we get the following:
This is nicer, because it shows that, in each game, we get, on average, pretty close to 0.5. Only a few games are "far away" from 0.5; indeed, we expect such deviation. Let’s look at 100 coin flips for each game:
This shows that most of our games will hover around having an average of 0.5 — what we would expect.
How Large is Large?
There is a subtle (well, maybe not so subtle) point in the law of large numbers: we need a large number of trials to get close to the true expected value. But… how large is large?
Here, we can use a number of inequalities to see what an "optimal" number of trials would be, but I think it would be nice to use some programs and so forth to show the reader what Large looks like in some cases.
Here is what we will do. I will set up an experiment consisting of a number of coin flips. Then we will measure how long it takes the running average to get within a certain range of the expected value. That is, we will flip a number of times, count the number of heads, and see how long it takes us before we get within some $\epsilon$ of 0.5.
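One way to phrase this experiment in code is the sketch below; the seed, the $\epsilon$, and the cutoff of 50 "warm-up" flips are all assumptions of mine, chosen only to avoid the accidental early crossings discussed next:

```python
import random

random.seed(7)  # arbitrary seed, for reproducibility

def flips_until_close(eps, min_flips=50, max_flips=100_000):
    """Flip a fair coin until the running proportion of heads first comes
    within `eps` of 0.5. We skip the first `min_flips` flips, since the
    running average can cross 0.5 "by accident" very early on."""
    heads = 0
    for n in range(1, max_flips + 1):
        heads += random.randint(0, 1)
        if n >= min_flips and abs(heads / n - 0.5) < eps:
            return n
    return max_flips

n_needed = flips_until_close(0.01)
print(n_needed)
```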
We need to be careful here. For example, if we flip heads the first time and tails the second time, this will give us an average of 0.5; but, of course, this is not what we want — we know that this will quickly deviate from 0.5 and then slowly come back to the expected value. For example, here are the first twenty flips for a particular game:
Note that this wildly fluctuates, from nearly 100% down to around 40% and back up to around 57%, before decreasing towards 45% at the end. Very few times do we get to our expected probability of 0.5. As we increase the number of flips (using the same data as the previous plot),
We see that this goes down for a while, then begins to go back up again. Note that, though we’ve done this experiment 50 times, we still have a number more "tails" than "heads", though we’d expect them to be equal "after a while." Here is a chart for the first 1000 flips.
Notice that at the very beginning we have the erratic behavior we've seen before: this is, of course, because the law of large numbers hasn't "kicked in" yet; the average value will fluctuate radically with each coin flip. As we go onwards we see that the behavior stops being so erratic (look how, for example, the first part looks "sharp"; this is because it goes up and down quickly) and begins to smooth out. Not only this, but it begins to approach its limiting probability of 0.5. Notice that after 200 flips we begin to get exceptionally close to the limiting probability: we stay within a small band around 0.5 after 200 flips. This is what we'd expect.
A similar thing happens when we roll dice, though the experiment is slightly more interesting. For a 4-sided die (known henceforth as a "D4") the expected value is equal to $\frac{1+2+3+4}{4} = 2.5$. We do the exact same experiment as before, and record the values we obtain for the first 10 rolls:
We see that, despite eventually "coming down" at the end, the rolls spend a lot of time above the expected value. Similarly, other iterations of this experiment have given other wild behaviors: staying near 1, fluctuating between 2 and 4 wildly, staying at 3, and so forth. Let’s see what happens as we continue rolling 40 more times:
One notices that we are now "closer" to the expected value, and the rolls are "around" 2.5, within 0.1 on either side. Let’s add some more rolls until we get to 200:
Whoops. We're still not quite at 2.5, though we are consistently within 0.1 of it. Maybe if we add some more rolls?
We’re getting closer. In the last graph, we were within 0.1 of the expected value. Now we are within 0.05 of the expected value. It looks like it took around 600 or so rolls to consistently start lying within this range and, in general, it looks like around 100 rolls are "good enough" to lie within 0.5 of the expected value (though, the figure above this one shows that it is still a bit out of range…). At last, look at this last figure:
We see that after a significant number of rolls (around 6500) we finally are consistently within 0.004 or so of our expected value. This is pretty close if you only need an error of 0.004, but 6500 is a pretty large number — if this was an experiment where we needed to hit a small particle with another small particle, for example, it may take significantly more than 6500 tries to get within a proper range since an error of 0.004 may be extremely large in this case.
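The D4 experiment above can be sketched the same way: track the running average of the rolls and watch it drift toward 2.5. The seed and the number of rolls are, again, arbitrary choices:

```python
import random

random.seed(3)  # arbitrary seed

rolls = [random.randint(1, 4) for _ in range(10_000)]  # fair D4 rolls

# Running average after each roll; it should drift toward E[X] = 2.5.
running = []
total = 0
for i, r in enumerate(rolls, start=1):
    total += r
    running.append(total / i)

print(running[9], running[99], running[-1])  # after 10, 100, and 10000 rolls
```

Typically the average after 10 rolls is still far from 2.5, while the average after 10,000 rolls is quite close.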
Rolling two D4s does a similar thing, though the sum seems to converge to its expected value (of 5) more quickly. For example, here are the first 100 rolls:
Notice, though, before 20 rolls, the average is erratic and strange. Even after 20 rolls, the rolls are only within 0.1 of the expected value. For 1000 rolls,
we notice that we get somewhat closer to the expected value, getting within about 0.1 after around 600 rolls. If we take this to the extreme 10,000 rolls, we obtain a much nicer plot:
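The two-D4 version is a one-line change to the experiment: roll two dice and average their sums. A minimal sketch (seed and roll count are arbitrary):

```python
import random

random.seed(11)  # arbitrary seed

# Average of many two-D4 sums; the expected value of the sum is 2 * 2.5 = 5.
n = 10_000
total = sum(random.randint(1, 4) + random.randint(1, 4) for _ in range(n))
avg = total / n
print(avg)
```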
It takes us around 6000 rolls to get within 0.01 of the expected value, it seems. Of course, we expect a large number like this. We have "precise" ways of telling "how much error" is going to happen on average, as well. These estimates are based on something called z- or t-values and the standard deviation. In particular, a lot of emphasis is placed on standard deviation. How well does standard deviation measure the average deviation of our experiments?
The standard deviation is a measure which tells us, on average, how much our data deviates from the expected value. There are a number of ways to compute standard deviation, but generally one computes two different kinds: the population standard deviation and the sample standard deviation. Here's how to compute these:
Population Standard Deviation: If we have data for an entire population, we will also have the exact expected value $\mu$ for the population; because it is a population, nothing is being approximated. The standard deviation is given by $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$, the square root of the variance.
Sample Standard Deviation: For this one we do not know what the expected value of the population is, and, hence, we have only an approximation of that average given by the sample mean $\bar{x}$. We calculate the sample standard deviation in the following way: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$.
Note that we've used an $n - 1$ instead of $n$ because of Bessel's Correction.
Suppose we have a 6-sided die (henceforth, D6). The expected value is easily calculated: $\mu = \frac{1+2+\cdots+6}{6} = 3.5$; the value $E(X^2) = \frac{91}{6}$ can be calculated in a similar way. These two values give us that the population standard deviation is equal to $\sigma = \sqrt{E(X^2) - \mu^2} = \sqrt{\frac{35}{12}} \approx 1.71$.
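This calculation can be checked with exact arithmetic; a small sketch using Python's `fractions` module to keep every intermediate value exact:

```python
from fractions import Fraction
from math import sqrt

faces = range(1, 7)
mu = Fraction(sum(faces), 6)                  # E[X] = 7/2
ex2 = Fraction(sum(x * x for x in faces), 6)  # E[X^2] = 91/6
variance = ex2 - mu * mu                      # 91/6 - 49/4 = 35/12
sigma = sqrt(variance)                        # approximately 1.7078

print(mu, variance, sigma)
```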
In a real test, how long does it take until we get close to this standard deviation? We do some trial rolls and plot the sample standard deviation each time. For 20 rolls,
Notice that our sample standard deviation spends a lot of time meandering back and forth, and gets only to within about 0.2 of the population standard deviation. Let's do the same for 100 rolls,
We see that this eventually gets to within about 0.1 of the population standard deviation after 80 or so rolls. After 250 rolls, below we see that the sample standard deviation gets quite close to the population standard deviation (within about 0.01):
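This convergence is easy to reproduce: Python's `statistics.stdev` computes exactly the sample standard deviation above (with Bessel's correction built in). A sketch, with an arbitrary seed:

```python
import random
from math import sqrt
from statistics import stdev

random.seed(5)  # arbitrary seed
sigma = sqrt(35 / 12)  # population standard deviation of a single D6

rolls = [random.randint(1, 6) for _ in range(250)]

# Sample standard deviation after n rolls, compared against the
# population value: the gap tends to shrink as n grows.
for n in (20, 100, 250):
    s = stdev(rolls[:n])
    print(n, round(s, 4), round(abs(s - sigma), 4))
```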
Let's do another example which isn't so dull. This time, we will roll two D6's. One can calculate the expected value (it is 7) and so forth, and from this we find that the standard deviation is $\sigma = \sqrt{\frac{35}{6}} \approx 2.42$. As usual, we expect the first 10 rolls to not give too nice of a graph:
What about after 100 rolls?
We see that the standard deviation is still quite far away from the population standard deviation even after 100 rolls, though it has stabilized within about 0.5 above it. What about 300 rolls?
We are much closer now: it is within 0.2 after some 250 or so rolls. We see that, eventually, the sample standard deviation will become closer and closer to the population standard deviation.
One common rule that is used frequently is the Empirical Rule, which says that, in a normal distribution, about 68% of the data lies within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations. These are common "rules of thumb" which are approximately true. When we want to know approximately where most of our data will lie, we build something called a confidence interval. The "rule of thumb" equation for a 95% confidence interval when we have $n$ points of data is: $\bar{x} \pm 2\frac{s}{\sqrt{n}}$.
This "2" comes from the empirical rule (though it is not actually correct to use when $n$ is small; one must use something called a t-value). The 95% confidence interval says that if we were to repeat the experiment 100 times, then about 95 of those times the expected value (the one we calculated above) will lie in the confidence interval. The accuracy of the interval depends on $\bar{x}$ and $s$ being close to the true mean and true standard deviation; as we've seen, this is often not the case for small samples. Moreover, the "2" in the formula is not always a good estimate when $n$ is small: what is usually done is to replace it with a correction term (the t-value, which is larger than the corresponding z-value that the "2" approximates at 95% confidence), giving a bigger interval so that we can be more confident that the true mean lies within it. Unfortunately, this larger interval is of little use when we attempt to "guess" the true mean from the data; as before, the data simply isn't a good enough estimate.
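The rule-of-thumb interval itself is a one-liner; here is a sketch for D6 rolls (the seed and sample size are arbitrary, and note the hard-coded "2" rather than a proper t-value):

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(9)  # arbitrary seed

def ci_rule_of_thumb(data):
    """The "rule of thumb" 95% interval: xbar +/- 2 * s / sqrt(n).
    (For small n, the 2 should really be replaced by a t-value.)"""
    xbar, s, n = mean(data), stdev(data), len(data)
    half = 2 * s / sqrt(n)
    return xbar - half, xbar + half

rolls = [random.randint(1, 6) for _ in range(100)]
low, high = ci_rule_of_thumb(rolls)
print(low, high)  # the true mean, 3.5, should usually land inside
```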
The point is this: when you need to apply the law of large numbers make sure that you have a large enough number of trials. In particular, if you are using expected value and standard deviation make sure that the number of trials is large enough for these to be accurate representations of your samples.
It's a bit of a strange feeling to go in reverse here (usually one uses a sample's data and attempts to reconstruct the population, but here we are using the population's data and noting that the sample may not always represent it accurately), but it is important to note, especially because I've seen (e.g., in various game-making and game-analyzing forums) these concepts used time and time again without any reference to the law of large numbers and the necessity for a large number of trials.
In The Next Post…
In the next post, we will go over simple games (like die rolling and coin flipping) and discuss which of these games can benefit from using expected value and which games (which I call "Luck Based") are, regardless of strategy, essentially decided by the "luck" of the roller: that is, games where one particularly "good" roll gives a player a large advantage or disadvantage from which it is difficult to recover. For example, suppose we had a game where each player rolls a D6 in turn and sums up the numbers they have rolled, racing to 7; then the first player to roll a 1 is at a severe disadvantage, from which it is difficult to recover. On the other hand, if they raced to something like 1000, each roll would have less of an effect and, in general, each player's total would eventually approximate the number of turns times the expected value; this type of game is "less" based on luck. We will discuss some of these things in the next post.