Half of this post will be the theoretical basis for what I will discuss, and the other half will be running some trials via Python and plotting some "examples."  I am not a probability, nor am I a statistician — there, consequently, may be errors in this post.  The main point I want to make is: the law of large numbers only works when you have a large number of trials.  As obvious as this is, I feel that there are a number of places where this idea is misused — especially in using expected value or standard deviation in an essential way when the number of trials is small.


The Law Of Large Numbers.

Loosely, the law of large numbers states that probability only works as the number of times you do an experiment gets "large" (remember these quotes, they will be important!).  Specifically, the (strong) law of large numbers states


Theorem (Strong Law of Large Numbers).  Let \{X_{i}\}_{i\in {\mathbb N}} be independent and identically distributed integrable random variables (if this doesn’t mean anything to you, think about die rolls) with E(X_{i}) = \mu, and define \overline{X_{n}} = \dfrac{1}{n}(X_{1}+\cdots + X_{n}).  Then \mbox{P}(\lim_{n\to\infty}\overline{X_{n}} = \mu) = 1


This last line means: the probability that the average value of X_{1} + \dots + X_{n} will approach the expected value of each X_{i} as n\to \inftyThis last part is important to this post, so we put it in italics. 


This is, of course, widely applied.  When you flip a fair coin, you expect the probability to get heads to be \frac{1}{2}; indeed, this is why coin flipping is so popular: each side has an equal chance of coming up.  This idea, combined with human psychology, can get a bit sticky: if heads has come up 20 times in a row, would you expect it to come up again?  Would you expect that it was "time for tails to come up" after that long string?  This is illustrated in the Gambler’s Fallacy, an interesting topic which we will not get into in this post, but which is important in its own right.  The problem is that, with a small number of flips, the probability \mbox{P}(X = \mbox{Heads}) = \frac{\mbox{Number of Heads We Get}}{\mbox{Total Coin Flips}} will not equal \frac{1}{2} and, indeed, can be quite far away from \frac{1}{2}

Here is an illustration.  I’ve let Mathematica generate 0’s and 1’s at random with equal probability (meaning "tails" and "heads" respectively).  If we allow for 5 flips in one "game", the probability of heads (total number of heads over total number of flips) is essentially random; I’ve played ten games and graphed their probabilities below:




As we see, we don’t really get anywhere near one half in some of these games.  If we allow for 10 flips for each game, then we get the following:




This is nicer, because it shows that, in each game, we get, on average, pretty close to 0.5.  Only a few games are "far away" from 0.5; indeed, we expect such deviation.  Let’s look at 100 coin flips for each game:


This shows that most of our games will hover around having an average of 0.5 — what we would expect.


How Large is Large?

There is a subtle (well, maybe not so subtle — ) point in the law of large numbers: we need a large number of trials to get close to the true expected value.  But…how large is large?

Here, we can use a number of inequalities to see what an "optimal" number of trials would be, but I think it would be nice to use some programs and so forth to show the reader what Large looks like in some cases.


Here is what we will do.  I will construct an experiment to construct a few coin flips.  Then we will measure how long it takes to flip and get within a certain range of the expected value.  That is, we will flip a few times, count the number of heads, and see how long it takes us before we get within some \epsilon of 0.5. 

We need to be careful here.  For example, if we flip heads the first time and tails the second time, this will give us an average of 0.5; but, of course, this is not what we want — we know that this will quickly deviate from 0.5 and then slowly come back to the expected value.    For example, here are the first twenty flips for a particular game:



Note that this wildly fluctuates from nearly 100% down to around 40% and back up to around 57% before decreasing towards 45% at the end.  Very few times do we get to our expected probability of 0.5.  As we increase the number of rolls (using the same data as the previous plot),



We see that this goes down for a while, then begins to go back up again.  Note that, though we’ve done this experiment 50 times, we still have a number more "tails" than "heads", though we’d expect them to be equal "after a while."  Here is a chart for the first 1000 flips.



Notice that at the very beginning we have the erratic behavior we’ve seen before: this is, of course, because the law of large numbers hasn’t "kicked in" yet; the average value will fluctuate radically with each coin flip.  As we go onwards we see that the behavior stops being so erratic (look how, for example, the first part looks "sharp"; this is because it goes up and down quickly) and begins to smooth out.  Not only this, but it begins to approach its limiting probability of 0.5.  Notice that after 200 flips we begin to get exceptionally close to the limiting probability: we stay around \pm 0.02 of 0.5 after 200 flips.  This is what we’d expect.


A similar thing happens when we roll dice, though the experiment is slightly more interesting.  For a 4-sided die (known henceforth as a "D4") the expected value is equal to \frac{(1 + 2 + 3 + 4)}{4} = 2.5.  We do the exact same experiment as before, and record the values we obtain for the first 10 rolls:


We see that, despite eventually "coming down" at the end, the rolls spend a lot of time above the expected value.  Similarly, other iterations of this experiment have given other wild behaviors: staying near 1, fluctuating between 2 and 4 wildly, staying at 3, and so forth.  Let’s see what happens as we continue rolling 40 more times:



One notices that we are now "closer" to the expected value, and the rolls are "around" 2.5, within 0.1 on either side.  Let’s add some more rolls until we get to 200:


Whoops.  We’re still not quite at 2.5, though we are consistently with 0.1 of it.  Maybe if we add some more rolls? 




We’re getting closer.  In the last graph, we were within 0.1 of the expected value.  Now we are within 0.05 of the expected value.  It looks like it took around 600 or so rolls to consistently start lying within this range and, in general, it looks like around 100 rolls are "good enough" to lie within 0.5 of the expected value (though, the figure above this one shows that it is still a bit out of range…).  At last, look at this last figure:


We see that after a significant number of rolls (around 6500) we finally are consistently within 0.004 or so of our expected value.  This is pretty close if you only need an error of 0.004, but 6500 is a pretty large number — if this was an experiment where we needed to hit a small particle with another small particle, for example, it may take significantly more than 6500 tries to get within a proper range since an error of 0.004 may be extremely large in this case.

Rolling two D4 do a similar thing, though it seems that they converge to their expected value (of 5) quicker.  For example, here is the first 100 rolls:


Notice, though, before 20 rolls, the average is erratic and strange.  Even after 20 rolls, the rolls are only within 0.1 of the expected value.  For 1000 rolls,




we notice that we get somewhat closer to the expected value, getting within about 0.1 after around 600 rolls.  If we take this to the extreme 10,000 rolls, we obtain a much nicer plot:


It takes us around 6000 rolls to get within 0.01 of the expected value, it seems.  Of course, we expect a large number like this.  We have "precise" ways of telling "how much error" is going to happen on average, as well.  These estimates are based on something called z- or t-values and the standard deviation.  In particular, a lot of emphasis is placed on standard deviation.  How well does standard deviation measure the average deviation of our experiments?


Standard Deviation.

The standard deviation is a measure which tells us, on average, how much our data deviates from the expected value.  There are a number of ways to do standard deviation, but generally one computes two different kinds of standard deviations: the population and the sample standard deviation.  Here’s how to compute these:


Population Standard Deviation:  If we have data for a population, we will also have \mbox{E}(X) = \mu the expected value for the population.  Because it is a population, this \mbox{E}(X) will be exact: it is not approximating anything.  The standard deviation is given by \sqrt{\mbox{E}(X^{2}) - \mbox{E}(X)^{2}}, the square root of the variance. 


Sample Standard Deviation:  For this one we do not know what our expected value of the population is, and, hence, we have only an approximation of that average given by the average \bar{x}.  We calculate the sample standard deviation in the following way:

s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}}


Note that we’ve used an n - 1 instead of n because of Bessel’s Correction


Suppose we have a 6-sided die (henceforth, D6).  The expected value \mbox{E}(X) = 3.5 is easily calculated; the value \mbox{E}(X^{2}) = 15.1667 can be calculated in a similar way.  These two values give us that the population standard deviation is equal to \sigma = \sqrt{2.91667} = 1.70783

In a real test, how long does it take until we get close to this standard deviation?  We do some trial rolls and plot the sample standard deviation each time.  For 20 rolls,


Notice that our sample standard deviation spends a lot of time meandering back and forth, and gets to only about 0.2 or so within the population standard deviation.  Let’s do the same for 100 rolls,


We see that this eventually gets to within about 0.1 of the population standard deviation after 80 or so rolls.  After 250 rolls, below we see that the sample standard deviation gets quite close to the population standard deviation (within about 0.01):




Let’s do another example which isn’t so dull.  This time, we will roll two D6’s.  One can calculate the expected value and so forth, and from this we find that the standard deviation \sigma = 2.412.  As usual, we expect the first 10 rolls to not give too nice of a graph:



What about after 100 rolls?


We see that the standard deviation is still quite far away from the population standard deviation even after 100 rolls, though it has stabilized within about 0.5 above it.  What about 300 rolls?


We are much closer now: it is within 0.2 after some 250 or so rolls.  We see that, eventually, the sample standard deviation will become closer and closer to the population standard deviation.


One common rule that is used frequently is the Empirical Rule saying that, in a normal distribution, 68% of data is one standard deviation from the mean, 95% is between two standard deviations, and 99.7% is between three standard deviations.  These are common "rules of thumb" which are approximately true.  When we want to know approximately where most of our data will lie, we build something called a confidence interval.  The "rule of thumb" equation for a 95% confidence interval when we have n points of data is: \bar{x} \pm 2\dfrac{s}{\sqrt{n}}.


This "2" comes from the empirical rule (though it is not actually correct to use in the case where n is small; one must use something called a t-value).  The 95% confidence interval says that, if we were to do the experiment 100 times, 95 of those times we will have the expected value (the one we calculated above) to be in the confidence interval.  The accuracy of the interval is dependent on  s and \bar{x} being close to the true mean and true standard deviation; as we’ve seen, this is often not the case for small samples.  Moveover, the "2" in the formula, as we’ve noted, is not always a good estimate when n is small — what is usually done is, there is a correction term (called a t-value, which is a larger number than the corresponding z-value, which the "2" is approximating at 95% confidence) will give us a bigger interval so that we can be more confident that our true mean is within these ranges.  Unfortunately, this larger interval is basically useless when we attempt to "guess" the true mean from the data — that is, as before, the data simply isn’t a good enough estimate.


The Point?

The point is this: when you need to apply the law of large numbers make sure that you have a large enough number of trials.  In particular, if you are using expected value and standard deviation make sure that the number of trials is large enough for these to be accurate representations of your samples. 

It’s a bit of a strange feeling to go in reverse here — usually one uses a sample’s data and attempts to reconstruct the population, but here we are using the population’s data and noting that the sample may not always represent it accurately — but it is important to note.  Especially because I’ve seen (e.g., in various game-making and game-analyzing forums) these concepts used time and time again without any reference to the law of large numbers and the necessity for a large number of trials.


In The Next Post…

In the next post, we will go over simple games (like die rolling and coin flipping) and discuss which of these games can benefit from using expected value and which games (which I call "Luck Based") are games which, regardless of strategy, are essentially based on the "luck" of the roller — that is, one where one particularly "good" roll will give a large advantage/disadvantage to a player from which it is difficult to recover from. For example, if we had a game where each player rolls a die in turns and sum up the numbers they have rolled, if they use a D6 and race to 7 then the first player rolling a 1 is a severe disadvantage, from which it is difficult to recover from.  On the other hand, if they raced to something like 1000, each roll would have less of an effect and, in general, the rolls would eventually approximate the number of turns times the expected value — this type of game is "less" based on luck.  We will discuss some of these things in the next post.


While I’m finishing up the Baire Category post, let’s have a little fun.  Here’s a game I like to play when I’m in my office hours and have literally nothing else to do.  I call it, “Two Turtles Playing With A Hammer.” 

The game is simple.


  1. Pick a number (or, have a friend pick one for you!).  It’s more fun if you pick a number that’s a well-known irrational or transcendental number.  Let’s call this number \alpha
  2. Pick a natural number (for example, 4 or 12).  Usually, 2, 3, 4, or 6 give the nicest looking results.  Let’s call your number n.
  3. The point of the game is to approximate \alpha using a sequence that looks like this:


\displaystyle\frac{\pi^{n}}{b_{1}} + \frac{\pi^{n}}{b_{2}^{2}} + \frac{\pi^{n}}{b_{3}^{3}} + \dots


where all the b_{i}’s are natural numbers.  If this is not possible to do, you lose.  Else, you win!  [Of course, you can feel free to make your own game where this is an alternating sequence.]


An example.  Let’s just do one that I played around with a bit:


(\pi^4)/57 + (\pi^4)/67^2 + (\pi^4)/37^3 \approx \sqrt{3}.


Of course, the game is rather pointless, but you can say to your friends, “Look at this!  This crazy sum of stuff is pretty dang close to \frac{\sqrt{2}}{2}.”  They’ll be impressed.


[Note: It may be a good (undergraduate-level) exercise to think about the following: given that we only add terms, and given that the denominators must be elements of the naturals, then (given some n) what numbers can we make from these sequences?  What if we also set one of the b_{i}’s in the assumption (for example, making b_{3} = 5)?]


Edited: I fixed the note above.  Another good undergraduate-level question is: what if we allow only one term to be subtracted; can we make any real number?  If they’re particularly good at programming, you could ask them to write a program that approximates any number to some number of places with such a sequence (either from the game or using a single subtracted term).

When you’re on facebook, you’ll see a lot of people posting a lot of silly things.  It would be interesting to see what kinds of things are posted when, but this is a subject for some kind of cybersociologists.  Nonetheless, if you are awake at around 11pm and on facebook (and, I’ll admit it, I’m one of these people), you might see a string of comments that essentially say: "omg, it’s 11:11, make a wish."

If you’re superstitious or obsessive compulsive, this might be a lucky reminder to make a wish at this time, but besides its symmetry and visually appealing form, what’s so special about 1111? 

Read the rest of this entry »

Here is what I came up with for the last post on the Plus One game, and a new game on graphs.

Read the rest of this entry »

Algorithms: Plus One Game.

November 12, 2010

I woke up today with a game stuck in my head.


The Game: Player A picks a number from 1 to 10.  Player B guesses a number.  If Player B’s guess matches Player A’s number, then Player B wins.  If not, then Player A adds 1 to his number and the game continues.


This is not a difficult game to play.  The interesting thing here is that even though there is a winning strategy (Player B simply guesses “10” every time; eventually he will win.) the game could theoretically go on forever — say Player A picked “2” but Player B always guesses “1.”

The question is, how quick can we guarantee a win for Player B?


The first solution I thought of was the following: have Player B guess “5” for five turns.  If he doesn’t get it after five turns, then have him guess “15” for the next five turns.  After a bit of thinking, though, this is not actually any better than Player B just guessing “10” every time.


First Question: Is this the best we can do?

Now let’s make the game a bit more interesting, because (at this point) the game has a max value and a winning strategy for player B which is obvious.


The (Harder) Game: Player A guesses any natural number.  Player B guesses a natural number, and if it matches Player A’s, then player B wins.  If not, Player A adds 1 to his number and the game continues.


This game has a similar feel, but an infinite twist.  There is no longer a maximum value, so our original winning strategy does not work.  Nonetheless, is there a finite or countably infinite “optimal” strategy for player B?


If we were able to prove that the average number of turns for each finite game (the first game, and ones with a maximum value M) is \displaystyle \frac{M-1}{2}, then it would be: just take a limit.  I have a feeling that this first game can be proved to have such an optimal strategy using some kind of discrete math proof, but it’s been a while.  Anyone want to take a swing at this?


(Note: Brooke came up with a winning strategy for player B for the infinite case, and the solution is posted on the next post.  It turns out that even in the infinite case this is not a very interesting game and there is actually a finite winning strategy for player B.  It’s probably exactly what you think it is: go up in multiples of 2’s starting at the beginning.)