Half of this post will be the theoretical basis for what I will discuss, and the other half will be running some trials via Python and plotting some "examples."  I am not a probability, nor am I a statistician — there, consequently, may be errors in this post.  The main point I want to make is: the law of large numbers only works when you have a large number of trials.  As obvious as this is, I feel that there are a number of places where this idea is misused — especially in using expected value or standard deviation in an essential way when the number of trials is small.


The Law Of Large Numbers.

Loosely, the law of large numbers states that probability only works as the number of times you do an experiment gets "large" (remember these quotes, they will be important!).  Specifically, the (strong) law of large numbers states


Theorem (Strong Law of Large Numbers).  Let \{X_{i}\}_{i\in {\mathbb N}} be independent and identically distributed integrable random variables (if this doesn’t mean anything to you, think about die rolls) with E(X_{i}) = \mu, and define \overline{X_{n}} = \dfrac{1}{n}(X_{1}+\cdots + X_{n}).  Then \mbox{P}(\lim_{n\to\infty}\overline{X_{n}} = \mu) = 1


This last line means: the probability that the average value of X_{1} + \dots + X_{n} will approach the expected value of each X_{i} as n\to \inftyThis last part is important to this post, so we put it in italics. 


This is, of course, widely applied.  When you flip a fair coin, you expect the probability to get heads to be \frac{1}{2}; indeed, this is why coin flipping is so popular: each side has an equal chance of coming up.  This idea, combined with human psychology, can get a bit sticky: if heads has come up 20 times in a row, would you expect it to come up again?  Would you expect that it was "time for tails to come up" after that long string?  This is illustrated in the Gambler’s Fallacy, an interesting topic which we will not get into in this post, but which is important in its own right.  The problem is that, with a small number of flips, the probability \mbox{P}(X = \mbox{Heads}) = \frac{\mbox{Number of Heads We Get}}{\mbox{Total Coin Flips}} will not equal \frac{1}{2} and, indeed, can be quite far away from \frac{1}{2}

Here is an illustration.  I’ve let Mathematica generate 0’s and 1’s at random with equal probability (meaning "tails" and "heads" respectively).  If we allow for 5 flips in one "game", the probability of heads (total number of heads over total number of flips) is essentially random; I’ve played ten games and graphed their probabilities below:




As we see, we don’t really get anywhere near one half in some of these games.  If we allow for 10 flips for each game, then we get the following:




This is nicer, because it shows that, in each game, we get, on average, pretty close to 0.5.  Only a few games are "far away" from 0.5; indeed, we expect such deviation.  Let’s look at 100 coin flips for each game:


This shows that most of our games will hover around having an average of 0.5 — what we would expect.


How Large is Large?

There is a subtle (well, maybe not so subtle — ) point in the law of large numbers: we need a large number of trials to get close to the true expected value.  But…how large is large?

Here, we can use a number of inequalities to see what an "optimal" number of trials would be, but I think it would be nice to use some programs and so forth to show the reader what Large looks like in some cases.


Here is what we will do.  I will construct an experiment to construct a few coin flips.  Then we will measure how long it takes to flip and get within a certain range of the expected value.  That is, we will flip a few times, count the number of heads, and see how long it takes us before we get within some \epsilon of 0.5. 

We need to be careful here.  For example, if we flip heads the first time and tails the second time, this will give us an average of 0.5; but, of course, this is not what we want — we know that this will quickly deviate from 0.5 and then slowly come back to the expected value.    For example, here are the first twenty flips for a particular game:



Note that this wildly fluctuates from nearly 100% down to around 40% and back up to around 57% before decreasing towards 45% at the end.  Very few times do we get to our expected probability of 0.5.  As we increase the number of rolls (using the same data as the previous plot),



We see that this goes down for a while, then begins to go back up again.  Note that, though we’ve done this experiment 50 times, we still have a number more "tails" than "heads", though we’d expect them to be equal "after a while."  Here is a chart for the first 1000 flips.



Notice that at the very beginning we have the erratic behavior we’ve seen before: this is, of course, because the law of large numbers hasn’t "kicked in" yet; the average value will fluctuate radically with each coin flip.  As we go onwards we see that the behavior stops being so erratic (look how, for example, the first part looks "sharp"; this is because it goes up and down quickly) and begins to smooth out.  Not only this, but it begins to approach its limiting probability of 0.5.  Notice that after 200 flips we begin to get exceptionally close to the limiting probability: we stay around \pm 0.02 of 0.5 after 200 flips.  This is what we’d expect.


A similar thing happens when we roll dice, though the experiment is slightly more interesting.  For a 4-sided die (known henceforth as a "D4") the expected value is equal to \frac{(1 + 2 + 3 + 4)}{4} = 2.5.  We do the exact same experiment as before, and record the values we obtain for the first 10 rolls:


We see that, despite eventually "coming down" at the end, the rolls spend a lot of time above the expected value.  Similarly, other iterations of this experiment have given other wild behaviors: staying near 1, fluctuating between 2 and 4 wildly, staying at 3, and so forth.  Let’s see what happens as we continue rolling 40 more times:



One notices that we are now "closer" to the expected value, and the rolls are "around" 2.5, within 0.1 on either side.  Let’s add some more rolls until we get to 200:


Whoops.  We’re still not quite at 2.5, though we are consistently with 0.1 of it.  Maybe if we add some more rolls? 




We’re getting closer.  In the last graph, we were within 0.1 of the expected value.  Now we are within 0.05 of the expected value.  It looks like it took around 600 or so rolls to consistently start lying within this range and, in general, it looks like around 100 rolls are "good enough" to lie within 0.5 of the expected value (though, the figure above this one shows that it is still a bit out of range…).  At last, look at this last figure:


We see that after a significant number of rolls (around 6500) we finally are consistently within 0.004 or so of our expected value.  This is pretty close if you only need an error of 0.004, but 6500 is a pretty large number — if this was an experiment where we needed to hit a small particle with another small particle, for example, it may take significantly more than 6500 tries to get within a proper range since an error of 0.004 may be extremely large in this case.

Rolling two D4 do a similar thing, though it seems that they converge to their expected value (of 5) quicker.  For example, here is the first 100 rolls:


Notice, though, before 20 rolls, the average is erratic and strange.  Even after 20 rolls, the rolls are only within 0.1 of the expected value.  For 1000 rolls,




we notice that we get somewhat closer to the expected value, getting within about 0.1 after around 600 rolls.  If we take this to the extreme 10,000 rolls, we obtain a much nicer plot:


It takes us around 6000 rolls to get within 0.01 of the expected value, it seems.  Of course, we expect a large number like this.  We have "precise" ways of telling "how much error" is going to happen on average, as well.  These estimates are based on something called z- or t-values and the standard deviation.  In particular, a lot of emphasis is placed on standard deviation.  How well does standard deviation measure the average deviation of our experiments?


Standard Deviation.

The standard deviation is a measure which tells us, on average, how much our data deviates from the expected value.  There are a number of ways to do standard deviation, but generally one computes two different kinds of standard deviations: the population and the sample standard deviation.  Here’s how to compute these:


Population Standard Deviation:  If we have data for a population, we will also have \mbox{E}(X) = \mu the expected value for the population.  Because it is a population, this \mbox{E}(X) will be exact: it is not approximating anything.  The standard deviation is given by \sqrt{\mbox{E}(X^{2}) - \mbox{E}(X)^{2}}, the square root of the variance. 


Sample Standard Deviation:  For this one we do not know what our expected value of the population is, and, hence, we have only an approximation of that average given by the average \bar{x}.  We calculate the sample standard deviation in the following way:

s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}}


Note that we’ve used an n - 1 instead of n because of Bessel’s Correction


Suppose we have a 6-sided die (henceforth, D6).  The expected value \mbox{E}(X) = 3.5 is easily calculated; the value \mbox{E}(X^{2}) = 15.1667 can be calculated in a similar way.  These two values give us that the population standard deviation is equal to \sigma = \sqrt{2.91667} = 1.70783

In a real test, how long does it take until we get close to this standard deviation?  We do some trial rolls and plot the sample standard deviation each time.  For 20 rolls,


Notice that our sample standard deviation spends a lot of time meandering back and forth, and gets to only about 0.2 or so within the population standard deviation.  Let’s do the same for 100 rolls,


We see that this eventually gets to within about 0.1 of the population standard deviation after 80 or so rolls.  After 250 rolls, below we see that the sample standard deviation gets quite close to the population standard deviation (within about 0.01):




Let’s do another example which isn’t so dull.  This time, we will roll two D6’s.  One can calculate the expected value and so forth, and from this we find that the standard deviation \sigma = 2.412.  As usual, we expect the first 10 rolls to not give too nice of a graph:



What about after 100 rolls?


We see that the standard deviation is still quite far away from the population standard deviation even after 100 rolls, though it has stabilized within about 0.5 above it.  What about 300 rolls?


We are much closer now: it is within 0.2 after some 250 or so rolls.  We see that, eventually, the sample standard deviation will become closer and closer to the population standard deviation.


One common rule that is used frequently is the Empirical Rule saying that, in a normal distribution, 68% of data is one standard deviation from the mean, 95% is between two standard deviations, and 99.7% is between three standard deviations.  These are common "rules of thumb" which are approximately true.  When we want to know approximately where most of our data will lie, we build something called a confidence interval.  The "rule of thumb" equation for a 95% confidence interval when we have n points of data is: \bar{x} \pm 2\dfrac{s}{\sqrt{n}}.


This "2" comes from the empirical rule (though it is not actually correct to use in the case where n is small; one must use something called a t-value).  The 95% confidence interval says that, if we were to do the experiment 100 times, 95 of those times we will have the expected value (the one we calculated above) to be in the confidence interval.  The accuracy of the interval is dependent on  s and \bar{x} being close to the true mean and true standard deviation; as we’ve seen, this is often not the case for small samples.  Moveover, the "2" in the formula, as we’ve noted, is not always a good estimate when n is small — what is usually done is, there is a correction term (called a t-value, which is a larger number than the corresponding z-value, which the "2" is approximating at 95% confidence) will give us a bigger interval so that we can be more confident that our true mean is within these ranges.  Unfortunately, this larger interval is basically useless when we attempt to "guess" the true mean from the data — that is, as before, the data simply isn’t a good enough estimate.


The Point?

The point is this: when you need to apply the law of large numbers make sure that you have a large enough number of trials.  In particular, if you are using expected value and standard deviation make sure that the number of trials is large enough for these to be accurate representations of your samples. 

It’s a bit of a strange feeling to go in reverse here — usually one uses a sample’s data and attempts to reconstruct the population, but here we are using the population’s data and noting that the sample may not always represent it accurately — but it is important to note.  Especially because I’ve seen (e.g., in various game-making and game-analyzing forums) these concepts used time and time again without any reference to the law of large numbers and the necessity for a large number of trials.


In The Next Post…

In the next post, we will go over simple games (like die rolling and coin flipping) and discuss which of these games can benefit from using expected value and which games (which I call "Luck Based") are games which, regardless of strategy, are essentially based on the "luck" of the roller — that is, one where one particularly "good" roll will give a large advantage/disadvantage to a player from which it is difficult to recover from. For example, if we had a game where each player rolls a die in turns and sum up the numbers they have rolled, if they use a D6 and race to 7 then the first player rolling a 1 is a severe disadvantage, from which it is difficult to recover from.  On the other hand, if they raced to something like 1000, each roll would have less of an effect and, in general, the rolls would eventually approximate the number of turns times the expected value — this type of game is "less" based on luck.  We will discuss some of these things in the next post.


Recently, I’ve been learning to program in a new language and have been doing Project Euler problems for practice — of course, as the name suggests, most of these problems deal explicitly with problems which must be solved (efficiently) with mathematical techniques.  Two of the most common algorithms I’ve used are: Prime Testing, GCD Finder.  I’ll post about the former later, but the latter is an interesting problem in its own right:


Initial Problem.  Given two natural (positive whole) numbers called m, n, can we find some other natural number that divides both of them?


This problem is a first step.  It’s nice to be able to write numbers as a multiple of some other number; for example, if we have 16 and 18, we may write them as 8\times 2 and 9\times 2, thus giving us some insight as to the relationship between these two numbers. In that case it may be easy to see, but perhaps if you’re given the numbers 46629 and 47100, you may not realize right away that these numbers are 99\times 471 and 100\times 471 respectively.  This kind of factorization will reveal "hidden" relationships between numbers.  

So, given two numbers, how do we find if something divides both of them — in other words, how do we find the common divisors of two numbers?  If we think back to when we first began working with numbers (in elementary school, perhaps) the first thing to do would be to note that 1 divides every number.  But that doesn’t help us all that much, as it turns out, so we go to the next number: if both numbers are even, then they have 2 as a common factor.  Then we "factor" both numbers by writing them as 2\times\mbox{ something} and then attempt to keep dividing things out of the something.  We then move onto 3, skip 4 (since this would just be divisible by 2 twice), go onto 5, then 7, then…and continue for the primes.  This gives a prime factorization, but we have to note that if, say, 2 and 5 divide some number, then so does 10.  These latter divisors are the composite factors.

This seems excessive, but it is sometimes the only way one can do it. 

Anecdote!: On my algebra qualifying exam, there was a question regarding a group of order 289 which required us to see if 289 was prime or not; if not, we were to factor it.  We were not allowed calculators, so what could we do?  Try everything.  Note that we only need to try up to the square root of the number (which we could estimate in other ways), but it’s still a number of cases.  If you check, none of the following numbers divide into 289: 2, 3, 5, 7, 11, 13.  At this point, I was about to give up and call it a prime, but, for whatever reason, I decided to try 17.  Of course, as the clever reader will have pointed out, 289 = 17\times 17.  It is not prime.  There was, luckily, only one student who thought it was prime, but it points out how the algorithm above is not entirely trivial if one does not have access to a computer or calculator. 


Once we have a common divisor, or a set of common divisors, a natural thing to want to do is to find the biggest (we already have the smallest, 1) since in this way we can write our numbers with the largest common factor multiplied by some other number.  It will, in effect, make things prettier.


Real Problem.  Find the greatest divisor which is common to two natural numbers, m, n.


If you were just learning about this kind of thing, you may spout out the following solution: find all of the common divisors, then pick the greatest.  While this is not especially efficient, it is a solution.  Unfortunately, even for small numbers, this gets out of hand quickly.  For example, 60 and  420 have the following common divisors: 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, 60.  This takes a while to compute by hand. 

Even if we were to find prime factors, this would be 60 = 2^2 \times 3\times 5 and 420 = 2^2 \times 3 \times 5\times 7, which gives us that they share a number of prime factors.  A bit of thinking gives us that we take all of the prime factors they "share" and multiply them together to get the greatest common divisor.  This is another potential solution which is much faster than simply listing out all of the common divisors.  Unfortunately, this falls prey to the same kind of trap that other prime-related problems do: it is, at times, especially difficult to factor large composite numbers.  For example, the "reasonably small" number 49740376105597 has a prime factorization of 2741 \times 37813 \times 479909; this is not at all efficient to factor if one does not have a computer or a specialized calculator with a factoring algorithm on it.  As a mean joke, you may ask your friend to factor something like 1689259081189, which is actually the product of the 100,000th and 100,001st prime — that is, they would need to test 99,999 primes before getting to the one which divides the number.  If they divided by one prime per second (which is quite fast!) this would take them 1 day, 3 hours, and 46 minutes.  Not especially effective, but it will eventually get the job done.


Real Problem, With Efficiency: Find the greatest divisor which is common to two natural numbers, m,n, but do so in an efficient manner (we’ve all got deadlines!).


We need to sit down and think about this now.  We need an entirely new idea.  We note, at least, that for the two numbers m,n that one of them must be larger than the other (or else the problem is trivial).  One thing to try would be to see if the smaller one goes into the larger one (for example, above we had 60 going into 420, which gave us the easy solution that 60 must be the greatest common divisor).  If not, maybe we can see how much is left over.  That is, if m is the larger number,

m = a_{1}n + r_{1}

where here a_{1} is the number of times n goes into m without exceeding it, and r_{1} is the "remainder"; if it’s equal to 0, then n evenly divides into m, and otherwise it is less than n (or else we could divide an additional n into m). 

Using this, if r_{1}\neq 0, we may write m - a_{1}n = r_{1}; this means that, in particular, r_{1} divides m and a_{1}n, so it is a factor of m and of a_{1}n.  But it may not actually be a factor of n; so let’s see how many times it goes into n.  Using the same process…

n = a_{2}r_{1} + r_{2}

and by rearranging, we have that n - a_{2}r_{1} is divisible by r_{2}.  So, n is divisible by r_{2}, but we aren’t sure if r_{1} is divisible by r_{2}…if it were, we would be able to say that r_{2} was a common divisor of m and n (why?).  That’s something at least. 

The cool thing about our algorithm here is that, because a_{1}n + r_{1} = m we have that either r_{1} = 0 and we’re done with the algorithm, or r_{1} > 0 and we may form a new equation n = a_{2}r_{1} + r_{2}; this equation has, on the left-hand side, the number n which is less than the previous equation’s left-hand side, which was m.  Continuing this process, we will have r_{1}, r_{2}, \dots on the left-hand side, each of which is less than the one which came before it.  Because r_{i} \geq 0 for any of the remainders, eventually it will become 0 (why?) and this algorithm will terminate.  That is, we will have found some r_{i} which is a common divisor for both n, m; specifically, it will be the r_{i}\neq 0 such that r_{i+1} = 0 (or, it may simply be n if n divides m).

This algorithm, called the Euclidean Algorithm, actually does more "automatically": it not only finds a common divisor, but actually finds the greatest common divisor of m,n, which, from now on, we will denote \gcd(m,n).  The "proof" of this is simply noting that \gcd(m,n) = \gcd(n,r_{1}) = \gcd(r_{1},r_{2}) = \cdots = \gcd(r_{i-1},r_{i}) = r_{i} (we noted this above without making reference to the gcd, but the reader should attempt to go through all the same steps using the idea of the gcd). 

So.  If you have two natural numbers, n,m, you divide them, find the remainder, write the equation, then continue as above until you get a 0 remainder.  Then you pick the remainder directly before you got 0 as your gcd (or, you pick the smaller number if one number divides the other).  Pretty simple algorithm, but is it efficient?

Without going into formal "efficiency" definitions, "yes", it is quite efficient.  To prove it, let’s take an "average" example using the "large" numbers 1337944608 and 4216212.  We note that (by pen and paper, or by using a standard calculator) that

1337944608 = 317(4216212) + 1405404.

Next, we note that

4216212 = 3(1405404) + 0

which instantly gives us the solution \gcd(4216212, 1337944608) = 1405404.  That’s pretty awesome.  Note that this was an especially quick trial, but even the "worst" ones are relatively quick. 


Unexpected Corollary!:  For n,m natural numbers, if \gcd(n,m) = k then there exists integers a,b such that an + bm = k.


This is more useful than you might think at first glance, and we’ll get into why in a later post, but what’s nice about this corollary is that it comes "for free" from the Euclidean algorithm.  Note that, since k divides n, m, it suffices to prove this corollary for an + bm = 1 where n, m have \gcd(n,m) = 1.  The proof uses induction on the number of steps of the Euclidean algorithm for those numbers, but for those of you who are more experienced and know modular arithmetic, you may enjoy the following simple proof:


"Clever" Proof of the Corollary: Let m > n (for equality, the proof is easy).  We will only care about remainders in this proof, so we will look at some numbers modulo m.  Consider


r_{1} = n\mod m

r_{2} = 2n\mod m


r_{m-1} = (m-1)n\mod m


Note there are exactly m-1 remainders here and that the remainder 0 never occurs (since m,n are relatively prime).  Suppose that r_{i} \neq 1 for each of the i; that is, the remainder 1 does not ever show up in this list.  By the pigeon-hole principle (as there are m - 1 remainders but only m -2 possible values for the remainders) we must have that r_{i} = r_{j} for some i\neq j.  That is, we have

in \mod m = jn \mod m

which implies

(i-j)n\mod m = 0

but this is impossible, since it implies that either m = 0 or n is some integer multiple of m, but m > 0 and we have assumes m,n are relatively prime.  Hence, the remainder 1 must occur.  That is, r_{c} = 1 for some c and

cn \mod m = 1.

But what does this mean?  It means that there is some integer a such that am - cn = 1.  To make this prettier, let b = -c and we find that there exists a,b integers such that am + bn = 1, as required.  \Box


Pretty slick, no?

This post has a pretty weird title, but the problem is easy to state and uses a few interesting mathematical concepts.  It’s worth going through.  Let’s start with the basics.


Problem 1.  Let S_{\alpha}(n) = 1^{\alpha} + 2^{\alpha} + \dots + n^{\alpha}.  Show that S_{\alpha}(n) is a polynomial for each \alpha\in {\mathbb N} and that the degree of the polynomial is \alpha + 1.


Indeed, for example, we have  that 1 + 2 + \dots + n = \frac{n(n+1)}{2}, as we learned in Calculus, and this is a polynomial of degree 2.  Similarly, 1^{2} + 2^{2} + \dots + n^{2} = \frac{n(n+1)(2n+1)}{6}, which is a polynomial of degree 3.  In the same respect, 1^{3} + 2^{3} + \dots + n^{3} = \frac{(n(n+1))^{2}}{4}, which is a polynomial of degree 4.

The associated polynomials in this case are given by Faulhaber’s formula:


Theorem (Faulhaber).  For \alpha\in {\mathbb N} we have \displaystyle \sum_{k=1}^{n}k^{\alpha} = \frac{1}{1 + \alpha}\sum_{j=0}^{\alpha}(-1)^{j}\binom{\alpha + 1}{j}B_{j}n^{\alpha + 1 - j}.


This formula looks terrifying, but it is not hard to apply in practice.  You may be wondering, though, what the B_{j}‘s in this formula stand for.  These are the strange and wonderful Bernoulli numbers, of course!  I always enjoy seeing these creatures, because they unexpectedly pop up in the strangest problems.  There are a number of ways to define these numbers, one of which is to just write them out sequentially, starting with B_{0}:


1, -\frac{1}{2}, \frac{1}{6}, 0, -\frac{1}{30}, 0, \frac{1}{42}, 0, -\frac{1}{30}, 0, \dots


But in this case it is not so easy to guess the next value.  The clever reader will notice that all of the odd numbered Bernoulli numbers (except the first) are zero, but other than that there does not seem to be a clear pattern.  Fortunately, we can construct a function which generates the values as coefficients; we’ll call this function (surprise!) a generating function.


Definition.  We define the sequence \{B_{j}\}_{j=0}^{\infty} by

\displaystyle \frac{t}{e^{t} - 1} = \sum_{j=0}^{\infty}B_{j}\frac{t^j}{j!}.


Notice that this will, in fact, generate the B_{j} as coefficients times \frac{1}{j!}.  Neat.  In practice, you can use a program like Mathematica to compute B_{j} for pretty large values of j; but, of course, there are lists available.  We can now use Faulhaber’s formula above, which gives us (assuming we have proven that the formula holds!) that the sums of powers of natural numbers form polynomials of degree \alpha + 1.


But something else happens that’s pretty interesting.  Let’s look at some of the functions.


\begin{array}{|c|c|}\hline \alpha & \mbox{Polynomial}\\\hline &\\ 1 & \frac{1}{2}n^{2} + \frac{1}{2}n\\ &\\ 2 & \frac{1}{3}n^{3} + \frac{1}{2}n^{2} + \frac{1}{6}n\\ & \\ 3 & \frac{1}{4}n^{4} + \frac{1}{2}n^{3} + \frac{1}{4}n^{2} \\ & \\ 4 & \frac{1}{5}n^{5} + \frac{1}{2}n^{4} + \frac{1}{3}n^{3} - \frac{1}{30}n\\ & \\ \hline \end{array}


Look at the coefficients in each of these polynomials.  Anything strange about them?  Consider them for a bit.


Problem.  Look at the coefficients.  What do you find interesting about them?  Note that, in particular, for a fixed \alpha, the coefficients of the associated polynomial sum to 1.  Convince yourself that this is probably true (do some examples!) and then prove that it is true.  Do this before reading the statements below.


Anecdote.  I spent quite a while trying to write down the "general form" of a polynomial with elementary symmetric polynomials and roots to try to see if I could prove this fact using some complex analysis and a lot of terms canceling out.  This morning, I went into the office of the professor to ask him about what it means that these coefficients sum up to 1.  He then gave me a one-line (maybe a half-line) proof of why this is the case. 


Hint.  What value would we plug in to a polynomial to find the sum of the coefficients?  What does plugging in this value mean in terms of the sum?

While I’m finishing up the Baire Category post, let’s have a little fun.  Here’s a game I like to play when I’m in my office hours and have literally nothing else to do.  I call it, “Two Turtles Playing With A Hammer.” 

The game is simple.


  1. Pick a number (or, have a friend pick one for you!).  It’s more fun if you pick a number that’s a well-known irrational or transcendental number.  Let’s call this number \alpha
  2. Pick a natural number (for example, 4 or 12).  Usually, 2, 3, 4, or 6 give the nicest looking results.  Let’s call your number n.
  3. The point of the game is to approximate \alpha using a sequence that looks like this:


\displaystyle\frac{\pi^{n}}{b_{1}} + \frac{\pi^{n}}{b_{2}^{2}} + \frac{\pi^{n}}{b_{3}^{3}} + \dots


where all the b_{i}’s are natural numbers.  If this is not possible to do, you lose.  Else, you win!  [Of course, you can feel free to make your own game where this is an alternating sequence.]


An example.  Let’s just do one that I played around with a bit:


(\pi^4)/57 + (\pi^4)/67^2 + (\pi^4)/37^3 \approx \sqrt{3}.


Of course, the game is rather pointless, but you can say to your friends, “Look at this!  This crazy sum of stuff is pretty dang close to \frac{\sqrt{2}}{2}.”  They’ll be impressed.


[Note: It may be a good (undergraduate-level) exercise to think about the following: given that we only add terms, and given that the denominators must be elements of the naturals, then (given some n) what numbers can we make from these sequences?  What if we also set one of the b_{i}’s in the assumption (for example, making b_{3} = 5)?]


Edited: I fixed the note above.  Another good undergraduate-level question is: what if we allow only one term to be subtracted; can we make any real number?  If they’re particularly good at programming, you could ask them to write a program that approximates any number to some number of places with such a sequence (either from the game or using a single subtracted term).

I just saw this problem in a book of algebra problems, and I thought it was nice given one of the previous posts I had up here.


Problem.  Show that 11, 111, 1111, 11111, … are not squares.


You ought to think about this for a bit.  I started by supposing they were squares and attempting to work it out like that; unfortunately, there are some strange things that happen when we get bigger numbers.  But.  You should see a nice pattern with the last two digits.  Click below for the solution, but only after you’ve tried it!

Read the rest of this entry »

After proving this "Deep Result" [cf. Munkres] a professor will (hopefully) say something like: "Yes, the proof is important, but what does this theorem mean?  And what does it mean for spaces which are sufficiently nice, like metric spaces?"

Let’s state the result just so we’re all on the same page.


Theorem.   The topological space X is normal (that is, every pair of disjoint closed subsets A, B can separated by disjoint neighborhoods) if and only if any two disjoint closed sets can be separated by a function.


The proof of this is not obvious.  In fact, it is quite involved.  But if X happens to be a metric space, we can make the proof significantly easier by using a function that is similar to a distance function with a few minor modifications.  I call this function the standard Urysohn function for metric spaces for lack of a better (or shorter) name.  The function is as follows:

\displaystyle f(x) = \frac{d(x,A)}{d(x,A) + d(x,B)}

but before proving this theorem in the metric space setting, let’s look at this function.  If A,B are disjoint, then it is clear that the denominator is non-zero (why?) so this is defined everywhere.  If x\in A we have that this function evaluates to 0.  If x\in B then we have that this function evaluates to 1.  And the function (since the distance is always non-negative) achieves values in (0,1) for every point not in A or B (convince yourself that this function is continuous and only achieves the values 0 and 1 if the point is in either A or B respectively).  This function, therefore, separates A and B

The next thing you should think about is: what do the preimages of f look like?  Draw some pictures!  Here’s some pictures to start you off to get the idea.  The first is just in {\mathbb R} and the second is in the real plane.  Both have the standard topology.



This one is potentially not to scale, but you’ll see that we essentially have a linear relation (since the sum in the denominator will always be 1 in this case) except when we start looking at points that are less than all the points in A or greater than all the points in B.  What happens in those cases?



I’ve left this "face" as an exercise because when I did it I was kind of excited about the result.  What do the preimages look like?  Does this look like something you’ve seen before; maybe in physics?  Could you modify the Urysohn equation above to make it seem MORE like something in physics? 


So now that you’ve seen these pre-images, it should be relatively clear how to create neighborhoods around each of the open sets.  It still takes a bit of proof, but it’s nowhere near the difficulty of the standard Urysohn’s.


I encourage you, further, to draw some pictures.  My rule of thumb is: draw at least five different pictures; four easy ones and a hard one.  Find out where the preimages are in each of these.  Remember, too, that not every metric space looks the same; what would this look like, for example, in the taxi-cab space?  Or in some product space?  Or in the discrete space…?

Nested Kind of Topology.

October 16, 2011

In the last post, I posted a topology on \{1,2,3,\dots, k-1\} where the open sets were

\{\emptyset , \{1\}, \{1,2\}, \{1,2,3\},\dots, \{1,2,3,\dots, k-1\}\}

You can check that this is, in fact, a topology and it, in fact, has exactly k elements.  But this is also a really nice topology for intro-topology students.  Why?  Because it has some strange properties which are easy to get your hands on and prove!  Let’s show some of these.  For the sake of brevity, let’s call this topology P.


P is not Hausdorff.

This is easy to see.  Pick the elements 1 and 2.  The only set which contains 1 but not 2 is \{1\}, but there is no set which contains 2 but not 1. 


P is T_{0}.

Two points are topologically indistinguishable if they have exactly the same neighborhoods; a T_{0} space is one where each point is topologically distinguishable.  It is trivial to check that this holds.


P is connected.  (!)

This is somewhat surprising to me (especially because of how disjoint it "seems"!) but if we attempt to find a separation we find that it is impossible: every open set contains 1 except the empty set, so, in particular, no two open sets have trivial intersection unless one is empty.


A function f:{\mathbb R}\rightarrow P is continuous if and only if the sets of all x which map to n\in \{1,2,3,\dots, k-1\} are open (in the standard topology on {\mathbb R}

This is simply taking preimages.  I mention it because actually DRAWING the function on the "plane" (where the "x-axis" is {\mathbb R} and the "y-axis" is \{1,2,3,\dots, k-1\} we get something which does not at all look continuous (unless it is constant —).  This is perhaps a good exercise for intro topology students to work out so that they can see that not all continuous maps "look" continuous.  Of course, there are many other ways to see this using other spaces.


Limits?  What about the sequences in this set?  Let’s remind ourselves what convergence is:  \{x_{n}\}\rightarrow x if for every open set U containing x there is some N such that for all n\geq N we have x_{n} \in U.  Note then that:

\{1,1,1,1,1,\dots\} converges to 1.

\{2,2,2,2,2,\dots\} converges to 2.

and so on.  But then we have

\{1,2,1,2,1,\dots\} which converges to 2, but not to 1 (why?).

\{1,2,3,1,2,3,\dots\} which converges to 3, but not to 2 or 1 (why?).

and so forth.  In fact, if we have an infinite number of the numbers n_{1}, n_{2}, \dots, n_{r}\in \{1,2,\dots, k-1\} then this sequence will only converge to \max(n_{1}, n_{2}, \dots, n_{r}) which is kind of a neat characterization.  In fact, this takes care of most (all?) of the sequences. 


Limit Points?  A limit point p of a subset A is a point such that for every open U containing p we have A \cap (U-\{p\}) \neq \emptyset

Consider the subset A = \{1\} and let’s see if there are any limit points.  What about 2?  Well, we have that \{1,2\} - \{2\} = \{1\} so the intersection of this with A is nontrivial.  Similarly, the other open sets containing 2 have nontrivial intersection with A which includes more than the element 2.  What about 3?  Same deal.  And 4?  Yes.  So we can now ask: what’s the closure of \{1\}

Consider A = \{1,2\} and we will start looking at the element 3.  Is 3 a limit point?  For the same reasons above, it is.  So what is the closure of this set? 

Let’s be a little adventurous and ask what about A = \{1, 3\}.  For the same reasons as above, 4,5,6\dots, k-1 are all limit points, but what about 2?  In fact, it is.

Last, (and this will complete our "characterization" of the closures of our subsets) what about A = \{2,3\}.  Clearly we will have 4, 5, 6, \dots, k-1 as limit points, but the odd one out is 1.  And 1 is not a limit point (why not?).  That’s a bit sad.  So, taking the least element in the subset allows us to describe the closure: it’s the set of every element greater than or equal to it.


Closed sets.  It’s a bit interesting to think up what sets we got from taking the closures of each subset, and then to just start with our open sets and take compliments.  Do we get the same thing?  Neato.


Compactness?  A bit trivial in this case.  Why?  (Maybe extending to infinite things would make this cooler…)


Here’s something weird, too: what is a path in this space?  What is a loop?  How could we construct something like the fundamental group for this?  Homology groups?  To those with more experience, these questions might be easy; but it is not obvious to me (at least at this point!) how these constructions should go. 

The following question came up today after a brief and somewhat unsatisfying talk I gave on the intuition behind topologies:


Question: Given n\in {\mathbb N}, can we construct a topology \tau such that \tau has exactly n elements?


The solution wasn’t obvious to me until I looked at the pattern of numbers on the back of a thing of Dried Papaya Chunks.  For n = 1, 2 we can simply take \emptyset and \{1\} to be our sets, but for n = k we need to be a bit more clever.  But not much more.  Consider:


\{\emptyset ,\{1\},\{1,2\},\{1,2,3\},\dots \{1,2,3,\dots, k-1\}\}.

This is a topology on \{1,2,\dots, k-1\} with exactly k elements. 


In these examples, the elements which are nested inside each other and is somewhat reminiscent of the "particular point" topology — the topology in which a set is open if and only if it contains some predetermined point (or is empty).  In the finite case, how many elements are in the particular point topology?  Let’s do a few examples. 


If our universe is \{1\}, there are two open sets in the particular point topology (empty set and the whole set).

If our universe is \{1,2\}, then we have three open sets in the particular point topology.

If our universe is \{1,2, 3\}, then we have five open sets in the particular point topology.

If our universe is \{1,2,3,4\}, then we have nine open sets in the particular point topology.

If our universe is \{1,2,3,4,5\}, then we have seventeen open sets in the particular point topology.


If we exclude the empty set, it seems like we will have 2^{k-1} elements in our particular point topology.  Why is this? 

(Hint: Take the last example above with seventeen open sets and suppose the element 5 is your particular point.  Write down all the open sets containing 5.  Now take your pen and scribble out the 5’s in each of those open sets — do the resulting sets look familiar?  Hm.)


Neat.  If you take other finite topologies, are there any neat patterns that arise when you increase the number of digits?  Is it possible to associate some sort of polynomial or invariant with finite topologies?  With infinite topologies?  Would this even be useful?

This post was inspired by a question on the comment second of my last post about the prime-ness of 111…111.  The question is: what’s the sum of 1 + 11 + 111 + … + 111…111?  This seemed kind of silly to me (especially if the sum has less than 10 terms) but I quickly realized that I couldn’t figure out a "closed form" without appealing to weird "carry" tricks.  So I tried to figure something else out. 

Because my final solution to this question was somewhat unsatisfying, I’ve made this a special kind of post!  This post will have two main purposes: to try to answer this question and to show what kinds of things I do when I don’t know the answer to a problem.  So let’s begin.


First, I scribbled out lots of examples to see if there were patterns in the numbers.  Indeed, there are, but there was no nice way to describe them in general without creating a whole lot of variables and expanding the number of terms out into a base-ten sum.  This was frustrating.  Especially because I knew there might be an easy answer looming out there that I wasn’t clever enough to see. 

Second, I noted that 11 and 111 and 1111 are kind of weird numbers; written in base 10 they look pretty, but they don’t have much else in common.  As we saw in the last post, a few are prime but many are not, and many have very strange prime factorizations — these factorizations do group together, but that seemed to be too far away from this problem to use.  So, maybe another base would make them better formed?  I plugged away in wolfram alpha.  The first two bases I tried were base 2 and base 11.  Neither one worked nicely.  When I tried base 5, I found that the numbers followed an interesting pattern:

1, 11, 111, 1111, 11111, 111111, \dots

were mapped to

1, 21, 421, 13421, 323421, 12023421, \dots

Note that the last few digital always stay the same here.  Because the decimal form of these numbers is 1 + 10 + 100 + 10000 + … this is not all that surprising.  We’re essentially adding the base-5 versions of these together.  Note that 400_{5} means the number 400 in base 5; in decimal this is the number 100.  So 400_{5} = 100_{10}.  We drop the _{10} in our base 10 numbers and note the following:

1 = 1_{5}

10 = 20_{5}

100 = 400_{5}

1000 = 13000_{5}

10000 = 310000_{5}

100000 = 11200000_{5}

and while this seems to have a bit of a pattern, it grows too fast to be of use for adding.  On the other hand, I found that base 5 can be fairly neat when you input numbers that have "nice forms" in base 10.  Try it out!


Third, I was a bit upset, so I went back to thinking that I could write the number of terms as n = a_{0} + a_{1}10 + a_{2}10^{2} + \dots + a_{k}10^{k} and some some mods to figure out the digits.  The first digit will be easy: it’s just a_{0}.  But then what’s the carry?  After a bit, I worked out that it will just be \frac{n - a_{0}}{10}.  This did not seem to be going anywhere, since this number might be somewhat large too; we would need to add this number n - 1 times, and then there’s another carry that will look significantly more complex than the one I just wrote.  After trying this for five carries, I started to lose track of what I was doing; the formula was getting much more difficult than just doing "normal" addition.  This clearly wasn’t going to go anywhere.  At this point, a flash of insight hit me.  Of course.  This is just like adding 1 + 10 + 100 + 1000 + …, but only a little bit off.  But how much off? 


Fourth, I thought that, okay, 1 + 11 + 111 + 1111 +… is the same sum as 1 + 10 + 100 + 1000 + … with a difference of exactly 1 + 11 + 111 + … with one less term.  If there are n terms in the original, then it is equal to 1 + 10 + 100 + 1000 + \dots with n terms plus 1 + 11 + 111 + \dots with n -1 terms.  Note, then, that if we define the term ONE_{k} to be the sum of k terms that look like 1 + 11 + 111 + 1111\dots, then we have ONE_{k} = 111\dots 111 + ONE_{k-1}.  At this point, I began kicking myself, because this tells me nothing new about the problem; in fact, it is essentially a restatement of the problem!  But because of this, I thought about what happens when we do addition in 10’s.  I realized that it is exactly the same sort of thing we are doing when we add 11’s, but each successive digit will raise by 1.  So for example, while

1 + 10 + 100 + 1000 + 10000 = 11111

we have

1 + 11 + 111 + 1111 + 11111 = 12345


Each digit, starting at the left, is one plus the one before it.  Of course, this suggests a pattern.  But then what happens when you get far enough along? 


1 + 11 + 111 + 11111 + \dots + 111111111111 = 123456790122


which does not look as pretty.  But look; if we add in the following way, this becomes easy!  (I’ve attempted to use a chart to display place values.)


1 2 3 4 5 6 7 8 9      
                1 0    
                  1 1  
                    1 2
1 2 3 4 5 6 7 9 0 1 1 2


Which is exactly what we got above!  Note here that I’ve made a "10" that sits inside the tenth decimal place (and so is implicitly a carry of 1 in the ninth place) and a "11" that sits in the tenth place (and so is a carry of 1 in the tenth place and 1 in the eleventh place) and so on.  I’ve found that this method of computing is slightly faster than computing the original sum by hand and much less space consuming.  Unfortunately, the clever (and even not-so-clever) reader will notice that this is still the sum of n - 8 terms.  Though I find it faster because you are only adding at most three things in each column (carries included).  I tried to speed-test myself by adding together 20 terms this way and by doing 1 + 11 + 111 + … etc. straight out.  My times are as follows:


Original Way (Adding 1 + 11 + 111…): 1 minute. 

New Way (Described above): 40 seconds.


I also was much less confident doing this the original way (try it; you’ll have to keep track of the carry and the number of ones you need to add up and that gets confusing).  The most time consuming part about the "new way" was writing out the problem; I did not count this in the "original way" because it’s just as easy to visualize the problem as it is to write it out — it would take significantly longer.  If I were to only time the "calculation" part, the new way took me about 15 seconds.  Pretty good for a human.


But, this solution was the best I could do.  I do not have a closed form.  If anyone has a different or more interesting solution, let me know.  I didn’t want to resort to asking google, but maybe there’s a cute trick that makes this into a nice closed form.  I’m not sure.


Edit: Of course, there is an easier solution which was provided to me by Josh "The Widowmaker" Silverman over at his blog.  The key was to represent this as a geometric series (something that I seem to have completely missed out on).  It turns out that the closed form expression is

\frac{1}{81}(10^{n+1} - 10 - 9n)

for the sum of the first n numbers of this form.  To see the derivation, go to his blog and read it over.

To celebrate the 15k people who have, for whatever reason, happened upon this blog, I thought I’d post about something kind of fun.  I was going through some of my older books to try to see what I could give away, and I happened upon the sequence


1 + \cfrac{1}{1+\cfrac{1}{1+\cfrac{1}{1+\cfrac{1}{1 \cdots}}}}


which should look familiar to you.  Something kind of fun and mean to do is to tell someone who doesn’t know much math to evaluate this.  After they’ve had some time to get frustrated, you note that if we let


y = 1 + \cfrac{1}{1+\cfrac{1}{1+\cfrac{1}{1+\cfrac{1}{1 \cdots}}}}


then then the original fraction is just y = 1 + \dfrac{1}{y}.  Of course, this has an easy solution; y^2 - y - 1 = 0, or \dfrac{1\pm \sqrt{5}}{2} which is known as the golden ratio.  Similar looking "regular" continued fractions can be computed similarly.

For ease of notation, I will say [a_{0}; a_{1}, a_{2}, \dots ] when I mean


a_{0} + \cfrac{1}{a_{1}+\cfrac{1}{a_{2}+\cfrac{1}{a_{3}+\cfrac{1}{a_{4} \cdots}}}}


which is standard notation.  Note the semi-colon.  This means the golden ratio is [1;1,1,1,1,1,\dots]

It’s kind of neat to see what kinds of things you get with things of the form [z;z,z,z,z,z,z,\dots] where I have purposely used z here to hint that even complex values may be used.  Try out a few values.  Here’s a list of things to try:


  1. Try to get a rational number.
  2. Try to get a purely imaginary number.  Is it possible?
  3. What kinds of numbers will give you rational numbers?  (There is actually a theorem about this!)
  4. Can you ever obtain 0?  Why not?  Is it obvious from how we construct [z;z,z,z,z,z\dots]?  It it obvious from the solution to the quadratic?
  5. (Something I haven’t tried yet!) Do any complex numbers give you real numbers?  Do any complex numbers give you rational numbers?  (After reading below, is CF_{+} analytic?  What does it look like in the complex plane?)


The nice thing about this problem is that it is actually accessible for a student in high school who has seen the quadratic formula! 

A little edit below, WITH PICTURES!

Read the rest of this entry »