New Blog.

April 24, 2013

I love wordpress and how awesome their stuff is, but I needed a redesign.  The new blog [now not hosted on wordpress!] is at


Enjoy!  Leave comments! If I make mistakes [which will be often], let me know!




Half of this post will be the theoretical basis for what I will discuss, and the other half will be running some trials via Python and plotting some "examples."  I am not a probability, nor am I a statistician — there, consequently, may be errors in this post.  The main point I want to make is: the law of large numbers only works when you have a large number of trials.  As obvious as this is, I feel that there are a number of places where this idea is misused — especially in using expected value or standard deviation in an essential way when the number of trials is small.


The Law Of Large Numbers.

Loosely, the law of large numbers states that probability only works as the number of times you do an experiment gets "large" (remember these quotes, they will be important!).  Specifically, the (strong) law of large numbers states


Theorem (Strong Law of Large Numbers).  Let \{X_{i}\}_{i\in {\mathbb N}} be independent and identically distributed integrable random variables (if this doesn’t mean anything to you, think about die rolls) with E(X_{i}) = \mu, and define \overline{X_{n}} = \dfrac{1}{n}(X_{1}+\cdots + X_{n}).  Then \mbox{P}(\lim_{n\to\infty}\overline{X_{n}} = \mu) = 1


This last line means: the probability that the average value of X_{1} + \dots + X_{n} will approach the expected value of each X_{i} as n\to \inftyThis last part is important to this post, so we put it in italics. 


This is, of course, widely applied.  When you flip a fair coin, you expect the probability to get heads to be \frac{1}{2}; indeed, this is why coin flipping is so popular: each side has an equal chance of coming up.  This idea, combined with human psychology, can get a bit sticky: if heads has come up 20 times in a row, would you expect it to come up again?  Would you expect that it was "time for tails to come up" after that long string?  This is illustrated in the Gambler’s Fallacy, an interesting topic which we will not get into in this post, but which is important in its own right.  The problem is that, with a small number of flips, the probability \mbox{P}(X = \mbox{Heads}) = \frac{\mbox{Number of Heads We Get}}{\mbox{Total Coin Flips}} will not equal \frac{1}{2} and, indeed, can be quite far away from \frac{1}{2}

Here is an illustration.  I’ve let Mathematica generate 0’s and 1’s at random with equal probability (meaning "tails" and "heads" respectively).  If we allow for 5 flips in one "game", the probability of heads (total number of heads over total number of flips) is essentially random; I’ve played ten games and graphed their probabilities below:




As we see, we don’t really get anywhere near one half in some of these games.  If we allow for 10 flips for each game, then we get the following:




This is nicer, because it shows that, in each game, we get, on average, pretty close to 0.5.  Only a few games are "far away" from 0.5; indeed, we expect such deviation.  Let’s look at 100 coin flips for each game:


This shows that most of our games will hover around having an average of 0.5 — what we would expect.


How Large is Large?

There is a subtle (well, maybe not so subtle — ) point in the law of large numbers: we need a large number of trials to get close to the true expected value.  But…how large is large?

Here, we can use a number of inequalities to see what an "optimal" number of trials would be, but I think it would be nice to use some programs and so forth to show the reader what Large looks like in some cases.


Here is what we will do.  I will construct an experiment to construct a few coin flips.  Then we will measure how long it takes to flip and get within a certain range of the expected value.  That is, we will flip a few times, count the number of heads, and see how long it takes us before we get within some \epsilon of 0.5. 

We need to be careful here.  For example, if we flip heads the first time and tails the second time, this will give us an average of 0.5; but, of course, this is not what we want — we know that this will quickly deviate from 0.5 and then slowly come back to the expected value.    For example, here are the first twenty flips for a particular game:



Note that this wildly fluctuates from nearly 100% down to around 40% and back up to around 57% before decreasing towards 45% at the end.  Very few times do we get to our expected probability of 0.5.  As we increase the number of rolls (using the same data as the previous plot),



We see that this goes down for a while, then begins to go back up again.  Note that, though we’ve done this experiment 50 times, we still have a number more "tails" than "heads", though we’d expect them to be equal "after a while."  Here is a chart for the first 1000 flips.



Notice that at the very beginning we have the erratic behavior we’ve seen before: this is, of course, because the law of large numbers hasn’t "kicked in" yet; the average value will fluctuate radically with each coin flip.  As we go onwards we see that the behavior stops being so erratic (look how, for example, the first part looks "sharp"; this is because it goes up and down quickly) and begins to smooth out.  Not only this, but it begins to approach its limiting probability of 0.5.  Notice that after 200 flips we begin to get exceptionally close to the limiting probability: we stay around \pm 0.02 of 0.5 after 200 flips.  This is what we’d expect.


A similar thing happens when we roll dice, though the experiment is slightly more interesting.  For a 4-sided die (known henceforth as a "D4") the expected value is equal to \frac{(1 + 2 + 3 + 4)}{4} = 2.5.  We do the exact same experiment as before, and record the values we obtain for the first 10 rolls:


We see that, despite eventually "coming down" at the end, the rolls spend a lot of time above the expected value.  Similarly, other iterations of this experiment have given other wild behaviors: staying near 1, fluctuating between 2 and 4 wildly, staying at 3, and so forth.  Let’s see what happens as we continue rolling 40 more times:



One notices that we are now "closer" to the expected value, and the rolls are "around" 2.5, within 0.1 on either side.  Let’s add some more rolls until we get to 200:


Whoops.  We’re still not quite at 2.5, though we are consistently with 0.1 of it.  Maybe if we add some more rolls? 




We’re getting closer.  In the last graph, we were within 0.1 of the expected value.  Now we are within 0.05 of the expected value.  It looks like it took around 600 or so rolls to consistently start lying within this range and, in general, it looks like around 100 rolls are "good enough" to lie within 0.5 of the expected value (though, the figure above this one shows that it is still a bit out of range…).  At last, look at this last figure:


We see that after a significant number of rolls (around 6500) we finally are consistently within 0.004 or so of our expected value.  This is pretty close if you only need an error of 0.004, but 6500 is a pretty large number — if this was an experiment where we needed to hit a small particle with another small particle, for example, it may take significantly more than 6500 tries to get within a proper range since an error of 0.004 may be extremely large in this case.

Rolling two D4 do a similar thing, though it seems that they converge to their expected value (of 5) quicker.  For example, here is the first 100 rolls:


Notice, though, before 20 rolls, the average is erratic and strange.  Even after 20 rolls, the rolls are only within 0.1 of the expected value.  For 1000 rolls,




we notice that we get somewhat closer to the expected value, getting within about 0.1 after around 600 rolls.  If we take this to the extreme 10,000 rolls, we obtain a much nicer plot:


It takes us around 6000 rolls to get within 0.01 of the expected value, it seems.  Of course, we expect a large number like this.  We have "precise" ways of telling "how much error" is going to happen on average, as well.  These estimates are based on something called z- or t-values and the standard deviation.  In particular, a lot of emphasis is placed on standard deviation.  How well does standard deviation measure the average deviation of our experiments?


Standard Deviation.

The standard deviation is a measure which tells us, on average, how much our data deviates from the expected value.  There are a number of ways to do standard deviation, but generally one computes two different kinds of standard deviations: the population and the sample standard deviation.  Here’s how to compute these:


Population Standard Deviation:  If we have data for a population, we will also have \mbox{E}(X) = \mu the expected value for the population.  Because it is a population, this \mbox{E}(X) will be exact: it is not approximating anything.  The standard deviation is given by \sqrt{\mbox{E}(X^{2}) - \mbox{E}(X)^{2}}, the square root of the variance. 


Sample Standard Deviation:  For this one we do not know what our expected value of the population is, and, hence, we have only an approximation of that average given by the average \bar{x}.  We calculate the sample standard deviation in the following way:

s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}}


Note that we’ve used an n - 1 instead of n because of Bessel’s Correction


Suppose we have a 6-sided die (henceforth, D6).  The expected value \mbox{E}(X) = 3.5 is easily calculated; the value \mbox{E}(X^{2}) = 15.1667 can be calculated in a similar way.  These two values give us that the population standard deviation is equal to \sigma = \sqrt{2.91667} = 1.70783

In a real test, how long does it take until we get close to this standard deviation?  We do some trial rolls and plot the sample standard deviation each time.  For 20 rolls,


Notice that our sample standard deviation spends a lot of time meandering back and forth, and gets to only about 0.2 or so within the population standard deviation.  Let’s do the same for 100 rolls,


We see that this eventually gets to within about 0.1 of the population standard deviation after 80 or so rolls.  After 250 rolls, below we see that the sample standard deviation gets quite close to the population standard deviation (within about 0.01):




Let’s do another example which isn’t so dull.  This time, we will roll two D6’s.  One can calculate the expected value and so forth, and from this we find that the standard deviation \sigma = 2.412.  As usual, we expect the first 10 rolls to not give too nice of a graph:



What about after 100 rolls?


We see that the standard deviation is still quite far away from the population standard deviation even after 100 rolls, though it has stabilized within about 0.5 above it.  What about 300 rolls?


We are much closer now: it is within 0.2 after some 250 or so rolls.  We see that, eventually, the sample standard deviation will become closer and closer to the population standard deviation.


One common rule that is used frequently is the Empirical Rule saying that, in a normal distribution, 68% of data is one standard deviation from the mean, 95% is between two standard deviations, and 99.7% is between three standard deviations.  These are common "rules of thumb" which are approximately true.  When we want to know approximately where most of our data will lie, we build something called a confidence interval.  The "rule of thumb" equation for a 95% confidence interval when we have n points of data is: \bar{x} \pm 2\dfrac{s}{\sqrt{n}}.


This "2" comes from the empirical rule (though it is not actually correct to use in the case where n is small; one must use something called a t-value).  The 95% confidence interval says that, if we were to do the experiment 100 times, 95 of those times we will have the expected value (the one we calculated above) to be in the confidence interval.  The accuracy of the interval is dependent on  s and \bar{x} being close to the true mean and true standard deviation; as we’ve seen, this is often not the case for small samples.  Moveover, the "2" in the formula, as we’ve noted, is not always a good estimate when n is small — what is usually done is, there is a correction term (called a t-value, which is a larger number than the corresponding z-value, which the "2" is approximating at 95% confidence) will give us a bigger interval so that we can be more confident that our true mean is within these ranges.  Unfortunately, this larger interval is basically useless when we attempt to "guess" the true mean from the data — that is, as before, the data simply isn’t a good enough estimate.


The Point?

The point is this: when you need to apply the law of large numbers make sure that you have a large enough number of trials.  In particular, if you are using expected value and standard deviation make sure that the number of trials is large enough for these to be accurate representations of your samples. 

It’s a bit of a strange feeling to go in reverse here — usually one uses a sample’s data and attempts to reconstruct the population, but here we are using the population’s data and noting that the sample may not always represent it accurately — but it is important to note.  Especially because I’ve seen (e.g., in various game-making and game-analyzing forums) these concepts used time and time again without any reference to the law of large numbers and the necessity for a large number of trials.


In The Next Post…

In the next post, we will go over simple games (like die rolling and coin flipping) and discuss which of these games can benefit from using expected value and which games (which I call "Luck Based") are games which, regardless of strategy, are essentially based on the "luck" of the roller — that is, one where one particularly "good" roll will give a large advantage/disadvantage to a player from which it is difficult to recover from. For example, if we had a game where each player rolls a die in turns and sum up the numbers they have rolled, if they use a D6 and race to 7 then the first player rolling a 1 is a severe disadvantage, from which it is difficult to recover from.  On the other hand, if they raced to something like 1000, each roll would have less of an effect and, in general, the rolls would eventually approximate the number of turns times the expected value — this type of game is "less" based on luck.  We will discuss some of these things in the next post.

Recently, I’ve been learning to program in a new language and have been doing Project Euler problems for practice — of course, as the name suggests, most of these problems deal explicitly with problems which must be solved (efficiently) with mathematical techniques.  Two of the most common algorithms I’ve used are: Prime Testing, GCD Finder.  I’ll post about the former later, but the latter is an interesting problem in its own right:


Initial Problem.  Given two natural (positive whole) numbers called m, n, can we find some other natural number that divides both of them?


This problem is a first step.  It’s nice to be able to write numbers as a multiple of some other number; for example, if we have 16 and 18, we may write them as 8\times 2 and 9\times 2, thus giving us some insight as to the relationship between these two numbers. In that case it may be easy to see, but perhaps if you’re given the numbers 46629 and 47100, you may not realize right away that these numbers are 99\times 471 and 100\times 471 respectively.  This kind of factorization will reveal "hidden" relationships between numbers.  

So, given two numbers, how do we find if something divides both of them — in other words, how do we find the common divisors of two numbers?  If we think back to when we first began working with numbers (in elementary school, perhaps) the first thing to do would be to note that 1 divides every number.  But that doesn’t help us all that much, as it turns out, so we go to the next number: if both numbers are even, then they have 2 as a common factor.  Then we "factor" both numbers by writing them as 2\times\mbox{ something} and then attempt to keep dividing things out of the something.  We then move onto 3, skip 4 (since this would just be divisible by 2 twice), go onto 5, then 7, then…and continue for the primes.  This gives a prime factorization, but we have to note that if, say, 2 and 5 divide some number, then so does 10.  These latter divisors are the composite factors.

This seems excessive, but it is sometimes the only way one can do it. 

Anecdote!: On my algebra qualifying exam, there was a question regarding a group of order 289 which required us to see if 289 was prime or not; if not, we were to factor it.  We were not allowed calculators, so what could we do?  Try everything.  Note that we only need to try up to the square root of the number (which we could estimate in other ways), but it’s still a number of cases.  If you check, none of the following numbers divide into 289: 2, 3, 5, 7, 11, 13.  At this point, I was about to give up and call it a prime, but, for whatever reason, I decided to try 17.  Of course, as the clever reader will have pointed out, 289 = 17\times 17.  It is not prime.  There was, luckily, only one student who thought it was prime, but it points out how the algorithm above is not entirely trivial if one does not have access to a computer or calculator. 


Once we have a common divisor, or a set of common divisors, a natural thing to want to do is to find the biggest (we already have the smallest, 1) since in this way we can write our numbers with the largest common factor multiplied by some other number.  It will, in effect, make things prettier.


Real Problem.  Find the greatest divisor which is common to two natural numbers, m, n.


If you were just learning about this kind of thing, you may spout out the following solution: find all of the common divisors, then pick the greatest.  While this is not especially efficient, it is a solution.  Unfortunately, even for small numbers, this gets out of hand quickly.  For example, 60 and  420 have the following common divisors: 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, 60.  This takes a while to compute by hand. 

Even if we were to find prime factors, this would be 60 = 2^2 \times 3\times 5 and 420 = 2^2 \times 3 \times 5\times 7, which gives us that they share a number of prime factors.  A bit of thinking gives us that we take all of the prime factors they "share" and multiply them together to get the greatest common divisor.  This is another potential solution which is much faster than simply listing out all of the common divisors.  Unfortunately, this falls prey to the same kind of trap that other prime-related problems do: it is, at times, especially difficult to factor large composite numbers.  For example, the "reasonably small" number 49740376105597 has a prime factorization of 2741 \times 37813 \times 479909; this is not at all efficient to factor if one does not have a computer or a specialized calculator with a factoring algorithm on it.  As a mean joke, you may ask your friend to factor something like 1689259081189, which is actually the product of the 100,000th and 100,001st prime — that is, they would need to test 99,999 primes before getting to the one which divides the number.  If they divided by one prime per second (which is quite fast!) this would take them 1 day, 3 hours, and 46 minutes.  Not especially effective, but it will eventually get the job done.


Real Problem, With Efficiency: Find the greatest divisor which is common to two natural numbers, m,n, but do so in an efficient manner (we’ve all got deadlines!).


We need to sit down and think about this now.  We need an entirely new idea.  We note, at least, that for the two numbers m,n that one of them must be larger than the other (or else the problem is trivial).  One thing to try would be to see if the smaller one goes into the larger one (for example, above we had 60 going into 420, which gave us the easy solution that 60 must be the greatest common divisor).  If not, maybe we can see how much is left over.  That is, if m is the larger number,

m = a_{1}n + r_{1}

where here a_{1} is the number of times n goes into m without exceeding it, and r_{1} is the "remainder"; if it’s equal to 0, then n evenly divides into m, and otherwise it is less than n (or else we could divide an additional n into m). 

Using this, if r_{1}\neq 0, we may write m - a_{1}n = r_{1}; this means that, in particular, r_{1} divides m and a_{1}n, so it is a factor of m and of a_{1}n.  But it may not actually be a factor of n; so let’s see how many times it goes into n.  Using the same process…

n = a_{2}r_{1} + r_{2}

and by rearranging, we have that n - a_{2}r_{1} is divisible by r_{2}.  So, n is divisible by r_{2}, but we aren’t sure if r_{1} is divisible by r_{2}…if it were, we would be able to say that r_{2} was a common divisor of m and n (why?).  That’s something at least. 

The cool thing about our algorithm here is that, because a_{1}n + r_{1} = m we have that either r_{1} = 0 and we’re done with the algorithm, or r_{1} > 0 and we may form a new equation n = a_{2}r_{1} + r_{2}; this equation has, on the left-hand side, the number n which is less than the previous equation’s left-hand side, which was m.  Continuing this process, we will have r_{1}, r_{2}, \dots on the left-hand side, each of which is less than the one which came before it.  Because r_{i} \geq 0 for any of the remainders, eventually it will become 0 (why?) and this algorithm will terminate.  That is, we will have found some r_{i} which is a common divisor for both n, m; specifically, it will be the r_{i}\neq 0 such that r_{i+1} = 0 (or, it may simply be n if n divides m).

This algorithm, called the Euclidean Algorithm, actually does more "automatically": it not only finds a common divisor, but actually finds the greatest common divisor of m,n, which, from now on, we will denote \gcd(m,n).  The "proof" of this is simply noting that \gcd(m,n) = \gcd(n,r_{1}) = \gcd(r_{1},r_{2}) = \cdots = \gcd(r_{i-1},r_{i}) = r_{i} (we noted this above without making reference to the gcd, but the reader should attempt to go through all the same steps using the idea of the gcd). 

So.  If you have two natural numbers, n,m, you divide them, find the remainder, write the equation, then continue as above until you get a 0 remainder.  Then you pick the remainder directly before you got 0 as your gcd (or, you pick the smaller number if one number divides the other).  Pretty simple algorithm, but is it efficient?

Without going into formal "efficiency" definitions, "yes", it is quite efficient.  To prove it, let’s take an "average" example using the "large" numbers 1337944608 and 4216212.  We note that (by pen and paper, or by using a standard calculator) that

1337944608 = 317(4216212) + 1405404.

Next, we note that

4216212 = 3(1405404) + 0

which instantly gives us the solution \gcd(4216212, 1337944608) = 1405404.  That’s pretty awesome.  Note that this was an especially quick trial, but even the "worst" ones are relatively quick. 


Unexpected Corollary!:  For n,m natural numbers, if \gcd(n,m) = k then there exists integers a,b such that an + bm = k.


This is more useful than you might think at first glance, and we’ll get into why in a later post, but what’s nice about this corollary is that it comes "for free" from the Euclidean algorithm.  Note that, since k divides n, m, it suffices to prove this corollary for an + bm = 1 where n, m have \gcd(n,m) = 1.  The proof uses induction on the number of steps of the Euclidean algorithm for those numbers, but for those of you who are more experienced and know modular arithmetic, you may enjoy the following simple proof:


"Clever" Proof of the Corollary: Let m > n (for equality, the proof is easy).  We will only care about remainders in this proof, so we will look at some numbers modulo m.  Consider


r_{1} = n\mod m

r_{2} = 2n\mod m


r_{m-1} = (m-1)n\mod m


Note there are exactly m-1 remainders here and that the remainder 0 never occurs (since m,n are relatively prime).  Suppose that r_{i} \neq 1 for each of the i; that is, the remainder 1 does not ever show up in this list.  By the pigeon-hole principle (as there are m - 1 remainders but only m -2 possible values for the remainders) we must have that r_{i} = r_{j} for some i\neq j.  That is, we have

in \mod m = jn \mod m

which implies

(i-j)n\mod m = 0

but this is impossible, since it implies that either m = 0 or n is some integer multiple of m, but m > 0 and we have assumes m,n are relatively prime.  Hence, the remainder 1 must occur.  That is, r_{c} = 1 for some c and

cn \mod m = 1.

But what does this mean?  It means that there is some integer a such that am - cn = 1.  To make this prettier, let b = -c and we find that there exists a,b integers such that am + bn = 1, as required.  \Box


Pretty slick, no?

This post will require some very basic knowledge of category theory (like, what a category is, and how to make a poset into a category).  For everything below, I will be a bit informal, but I will essentially mean that A, B are objects in a category, and f:A\to B is some morphism between them which is also in the category.


The "natural" extension of the notion of a surjective map (in, say, the category of sets) is


Definition.  A map f:A\to B is an epimorphism if, for each object Z and map g,g':B\to Z we have that if g\circ f = g'\circ f then g = g'.


You should prove for yourself that this is, in fact, what a surjective map "does" in the category of sets.  Pretty neat.  Similarly, for injective maps (in, say, the category of sets) we have the more general notion:


Definition. A map f:A\to B is a monomorphism if, for each object Z and map g,g':Z\to A we have that if f\circ g = f\circ g' then g = g'.


Again, you should prove for yourself that this is the property that injective mappings have in the category of sets.  Double neat.  There is also a relatively nice way to define an isomorphism categorically — which is somewhat obvious if you’ve seen some algebraic topology before.


Definition. A map f:A\to B is an isomorphism if there is some mapping g:B\to A such that f\circ g = 1_{B} and g\circ f = 1_{A}, where 1_{A},1_{B} denote the identity morphism from the subscripted object to itself.


Now, naively, one might think, "Okay, if I have some certain kind of morphism in my category (set-maps, homomorphisms, homeomorphisms, poset relations, …) then if it is an epimorphism and a monomorphism, it should automatically be an isomorphism."  Unfortunately, this is not the case.  Here’s two simple examples.


Example (Mono, Epi, but not Iso).  The most simple category for which this works is the category 2, which I’ve drawn below:


There are two objects, a,b and three morphisms, the identites and the morphism f:a\to b.  First, prove to yourself that this is actually a category.  Second, we note that f:a\to b is an epimorphism: the only map from A\to A is the identity, and there is no mapping from b\to a, so the property trivially holds.  Third, we note that f:a\to b is a monomorphism for the exact same reason as before.  Last, we note that f is not an isomorphism: we would need some g: b\to a which satisfied the properties in the definition above…but, there is no map from b\to a.  Upsetting!  From this, we must conclude that f cannot be an isomorphism despite being a mono- and epimorphism.



Similar Example (Mono, Epi, but not Iso).  Take the category ({\mathbb N}, \leq), the natural numbers with morphisms as the relation \leq.  Which morphisms are the monomorphisms?  Which morphisms are the epimorphisms?  Prove that the only isomorphisms are the identity morphisms.  Conclude that there are a whole bunch of morphisms which are mono- and epimorphisms but which are not isomorphisms.

Let’s talk about Burnside’s theorem.  First, let me note that there are two results commonly called "Burnside’s Theorem."  The first one that I learned (which we won’t be discussing in this post) was:


Theorem (Burnside).  If G is a finite group of order p^{a}q^{b} where a,b are non-negative integers and where p,q are primes, then G is solvable.


The second one is also a group theoretical result, but a bit more combinatorial-feeling.  In some books (and, apparently, Wikipedia) this second result is called Burnside’s Lemma.  As noted in the Wikipedia article, this theorem was not even due to Burnside, who quoted the result from Frobenius, who probably got it from Cauchy. 

Let’s get some definitions down.  As usual, we’ll denote the order of the group by |G|, and our groups will all be finite in this post.  If we have a group G which acts on a set X, then given some fixed g\in G we define the set Fix_{X}(g) = \{x\in X\,|\, g\cdot x = x\}; this is, of course, the fixed points in X when acted on by the element g; for the remainder of the post, we will simply write Fix(g) with the set X implied.  Remember, when a group acts on X, the "product" will sit inside of X, and we write the action as g\cdot x for an element g\in G acting on an element x\in X.  The orbit of a point x\in X when acted on by G is given by \{g\cdot x\,|\, g\in G\}; we’ll denote this Orb(x), though this is not standard notation.  The orbit, essentially, is all of the possible values y\in X that you can get to by acting on x with elements in your group.

One thing to note, also, is that orbits are pairwise disjoint.  You should prove this to yourself if you haven’t already, but the idea is like this: if A,B are orbits of elements in X then suppose z\in A\cap B; then there is some x\in A, y\in B, g,h\in G such that g\cdot x = h\cdot y = z, but this implies (h^{-1}g)\cdot x = 1\cdot y = y which implies the orbits are identical (why?).  Hence, each element of X is in exactly one orbit


We need one more result before we can sink our teeth into Burnside.  Remember the fixed point set above?  This was all of the elements x such that g\cdot x = x for some g.  There’s a similar notion called a Stabilizer, denoted G_{x} = \{g\in G\,|\, g\cdot x = x\}; this is saying that we first fix x\in X, and then look at all the elements g\in G which stabilize it.  These definitions are pretty similar feeling (almost like family!) and, in fact, there is a nice relation between the two:


Notation.  Let X/G denote the set of orbits of X when acted on by G; when X is a group and G is a subgroup this is the same as a quotient.

Theorem (Orbit-Stabilizer Theorem).  There is a bijection between G/G_{x} and Orb(x)


That is, if we act on G by elements of g which fix x, then we will have the same number of elements as the orbit of x.  This might seem a little confusing at first, but if you work through it, it’s not so weird. 


Sketch of the Proof.  (Skip this if you’re not comfortable with all this notation above; just go down to the next theorem.)  Here, we want to show a bijection.  Notice that $G/G_{x}$ is the set of cosets hG_{x} for h\in G.  We claim that the mapping \phi which sends \phi: hG_{x} \mapsto h\cdot x is well-defined, injective and surjective (but not a homomorphism).  First, well defined: if hG_{x} = kG_{x} then $latex $hk^{-1}\in G_{x}$ which means that (hk^{-1})\cdot x = x.  This implies, after some manipulation, that h\cdot x = k\cdot x, which means these elements are identical in Orb(x).  Second, surjectivity is clear.  Last, if h\cdot x = g\cdot x in the orbit, then (g^{-1}h)\cdot x = x which implies g^{-1}h\in G_{x} which gives gG_{x} = hG_{x}; hence this map is injective.  This gives us that our map \phi is bijective.  \diamond


One immediate corollary is that |Orb(x)| = \dfrac{|G|}{|G_{x}|}; that is, the number of elements in the orbit of x is the same as the number of elements in G divided by the number of elements in g which fix x.  Think about this for a minute. 



Road to the Proof.

Okay.  Now, let’s think about something for a second.  What is the sum

\displaystyle \sum_{g\in G}|Fix(g)|

telling us?  This is the number of elements in x which are fixed by some g\in G; but there might be some overlap, since if g\cdot x = x and h\cdot x = x, then x will be counted twice: one as an element of Fix(g) and once as an element of Fix(h).  But how much overlap is there?  This is an innocent seeming question, and you might think something like, "Well, depends on how much stuff stabilizes each x.", and this is pretty close to the point. 

First, note that

\displaystyle \sum_{g\in G}|Fix(g)| = |\{(g,x)\in G\times X\,|\,g\cdot x = x\}|

which is just the long way to write out this sum; but, the nice part about that is, we can now think about this as all of the elements of G which are stabilized by some x\in X (why?).  Then,

\displaystyle \sum_{g\in G}|Fix(g)| = \sum_{x\in X}|G_{x}|.

If you don’t see this, you should prove to yourself why they’re the same sum (why is each element counted in the left-hand side also counted in the right-hand side?).  Now, by the Orbit-Stabilizer theorem above, this right-hand sum becomes pretty nice.  Specifically,

\displaystyle \sum_{x\in X}|G_{x}| = \sum_{x\in X}\frac{|G|}{|Orb(x)|} = |G|\sum_{x\in X}\frac{1}{|Orb(x)|}

where we noted in the last equality that |G| is a constant, so we may pull it out of the sum. 

Recalling that X/G denotes the number of orbits, we have that if we take a single orbit (call it A) we will be adding \frac{1}{|A|} up exactly |A| times (since the sum is taken over each x\in X so, in particular, over each $x\in A$); hence, we will add \frac{1}{|A|}\cdot |A| = 1 for each orbit we have in X/G.  That is;

\displaystyle = |G|\sum_{A\in X/G}1 = |G||X/G|.

Putting this all together, we have

\displaystyle \sum_{g\in G}|Fix(g)| = |G||X/G|.


We clean it up a bit, and state the following:


Theorem (Burnside’s).  For a finite group G and a set X, with notation as above, we have \displaystyle |X/G| = \frac{1}{|G|}\sum_{g\in G}|Fix(g)|


That is, the number of orbits is equal to the sum, over g\in G, of the elements of x fixed under g, averaged by the number of elements in G.  Kind of neat.


Next time, we’ll talk about applications!

Pick’s Theorem.

November 22, 2012


In order to teach kids about the concept of area, math teachers sometimes showed pictures which looked like



and we would be asked to "guess the area."  If this was being taught to younger students, the way to do it would be to draw little boxes inside and then try to see how many "half-boxes" were left, how many "quarter-boxes" were left, and so forth.  If this was a more sophisticated lesson, our teacher would instruct us to cut up the picture into rectangles and triangles — the one above, for example, has a right triangle on the left-hand side and a 4×2 rectangle on the right-hand side.  From this, we could easily use the formula for area for each, and we could deduce the total area from these. 

So far nothing is too difficult to solve.  But, if we adjust the picture slightly (still using only straight lines to construct our figure) we may get something like




Here, it is difficult to tell what to do.  We could try to find "easier" triangles to break this into, but none are immediately obvious.  It’s not easy to tell what the angles are (except by using some algebraic methods, finding slopes, etc.).  The best one could do would be to "guess" what the lengths of the sides were (or the lengths of the base and height of the triangle). 


Motivating Pick’s Theorem.

Before we begin, let me define some of the things we’ll be working with.   A polygon is a familiar concept, but we ought to say some things about it:

A lattice polygon is a figure on the two-dimension lattice (as above) such that all of its edges begin and end on a lattice point, no edges overlap except at a lattice point, and its boundary is a closed path (informally, this means that if we started at a lattice point and "traveled" along an edge and kept going in the same direction, we’d eventually come back to where we started).  Moreover, the polygon has to have a distinct interior and exterior (as in, it cannot look like a donut with a hole in the middle).  [Topologists, note: this means that the boundary of our figure is homeomorphic to a circle and the interior is simply connected.]


Easy Polygons.

The "easiest" polygons to work with in terms of area are rectangles: once we find the side lengths, we’re done.  Let’s look at the rectangle below, ignoring the stuff in the lower-left for now.




We can easily find the area here: it is a 7×3 rectangle, giving us an area of 21.  Notice something kind of neat, though; if we look at each interior point of the rectangle, we can associate a unit square with it (In the picture above, I’ve identified the point "a" with the first unit square "a", and the lattice point "b" with the square "b").  Using just the interior points, we can fill up most of this rectangle with unit squares, as below:




We see that we get "almost" all of the area accounted for if we only identify interior points like this; we miss the "¬" shape on the top.  But besides this weird shape, we learned that if we have a rectangle with n interior points, we will be able to fill up n units squared of area in our rectangle. 

If we do the same process (which I mark with a different color) for the boundary points on the top of the rectangle by identifying them with a unit square to their lower-left, we notice that we have to "skip" one on the far-left.




[For clarification, the second lattice point on the top of the rectangle corresponds to the first box, the third point corresponds with the second box, and so forth.  We are forced to skip one lattice point (the first one) on the top.]

Notice that if we have \ell interior points and k points on the top boundary of the rectangle, then there must be k-2 points in each row of interior points; hence, there must be \dfrac{\ell}{k-2} points in each column of interior points. 

Notice, last, that in our picture we need only fill in the remaining few parts on the right side.  We notice that the number of squares needed to fill in this side is exactly the same number of points in each column of interior points  (check this out on a few different rectangles if you don’t believe it). 


Whew.  Okay.  So.  We have \ell interior points, each corresponding to a unit square.  All but one of the top boundary points corresponds to a square; this is k - 1 unit squares.  Last, we have the number of points in each columns of the interior points corresponding to the remaining unit squares; this is \dfrac{\ell}{k-2} unit squares. 

At this point we want to write this nicely, so we’ll denote the TOTAL number of boundary points b, and note that b = 2k + 2\dfrac{\ell}{k-2}.

The total area we found was \ell + k-1 + \dfrac{\ell}{k-2}.  If we rearrange this a bit, we notice that 

\mbox{AREA} = \ell + \left(k + \dfrac{\ell}{k-2}\right) - 1 = \ell + \frac{1}{2}b - 1.

This is exactly the statement of Pick’s theorem, and it is true more generally, as we’ll see.



We’ll briefly cover one more example.  Triangles are a natural figure to think about, so let’s see if Pick’s formula works here.  First, right triangles are easy to work with, so let’s look at this one (ignore the blue part for now):




Notice that this triangle is exactly half of a rectangle (this is completed by the blue part), and it just so happens to have no lattice points on the diagonal.  This last part is important so look again: none of the dots touch the diagonal of the red triangle (here, we are excluding the vertices of the triangle, which we don’t count as being on the diagonal).  Some come close, but none lie on it.  Of course, this is not true in general, but for now let’s just look at this triangle. 

If we use the formula above for the rectangle, we get that the area is \ell + \dfrac{b}{2} - 1 for the rectangle (in this specific case, \ell = 30 and b = 26), and half of this will be \dfrac{\ell}{2} + \dfrac{b}{4} - \dfrac{1}{2}

On the other hand, if we look at the interior points of the triangle, if none of them lie on the diagonal (like above) then we have exactly half of what the rectangle had, so our triangle has \dfrac{\ell}{2} interior points; the number of boundary points will be half of the number of boundary points of the rectangle, plus 1.  This can be seen as follows: if we consider all the points on the bottom boundary and all the points on the left boundary except for the top-most point, then this is exactly half of the boundary points of the rectangle.  Hence, the number of boundary points we have for our triangle is \dfrac{b}{2} + 1

Plugging this information into Pick’s formula (which, at this point, we only know is valid for the rectangle!) we obtain: \frac{\ell}{2} + \dfrac{b}{4} - \frac{1}{2}.  This is exactly the area we calculated before, giving us a verification that Pick’s formula works for right triangles with no lattice points on the diagonal. 


How do we get around the condition that no lattice points should be on the diagonal?  There is a relatively easy way to break up right triangles into other right triangles, none of which will have points on their diagonals.  I’ll demonstrate with this picture below:


The idea is to just take the big triangle, draw some vertical and horizontal lines from the lattice points which lie on the diagonal, until you get smaller triangles (which will have no lattice points on the diagonal) and a bunch of rectangles.  In this case, I first got two triangles (a small one on top, and a small one on the bottom right) and one little 4×3 rectangle in the lower-left.  You then split the rectangle in half, which gives you some more triangles; if these triangles had lattice points on the diagonal, I would have repeated the process, getting even smaller triangles and rectangles.  Because everything is nice and finite here, we don’t get infinite numbers of rectangles and triangles and such: this process will eventually stop.  We apply the above to each and note that this verifies Pick’s formula for any right triangle.


But even right triangles are a bit restrictive.  It would be nice if it were true for any triangle.  Indeed, there is a similar process which decomposes triangles into right triangles.  In fact, there are a number of such processes: for example, prove to yourself that every triangle has at least one altitude which is entirely contained in the triangle, and note that this splits the triangle into two right triangles.  However you show it, this verifies Pick’s formula for any triangle.



The Theorem.

Theorem (Pick).  Given a lattice polygon with \ell interior points and b boundary points, the total area enclosed by the figure is given by \ell + \dfrac{b}{2} - 1


The proof of this is done by induction.  This is a more unusual type of induction since it requires us to induct on the number of triangles a figure is made up of.  The difficult part has already been completed: we have shown that the formula holds for any triangle.  We need a fact which I will not prove here:


Fact: Every polygon can be decomposed into a collection of triangles which only intersect at their boundaries.  Moreover, lattice polygons can be decomposed into a collection of lattice triangles (triangles whose vertices are lattice points) which only intersect at their boundaries.


This process is called polygon triangulation.  In layman’s terms, it means that you can take a polygon and cut it up into a bunch of triangles.  Try it yourself for a few polygons!


Given all of this, let’s jump into the proof.


Proof.  By induction.  We have already proved the formula holds for triangles, so suppose the formula holds for all polygons which are able to be decomposed into k or fewer triangles. Take such a polygon P which is able to be decomposed into exactly k triangles and "attach" a triangle T to the boundary of P such that the resulting figure is still a lattice polygon; call this new polygon Q.





For the triangle, denote its boundary points by b_{T} and its interior points by \ell_{T}; similarly, for P denote its boundary points by b_{P} and its interior points by \ell_{P}.  Denote the common points that T and P share by c.

For interior points, note that we have added the interior points of T and P together, but we also obtain those points which they share on their boundary, except for the two points on the vertex of the triangle which are shared; that is,  \ell_{Q} = \ell_{T} + \ell_{P} + (c-2)


For boundary points, we have to subtract c-2 points from P‘s boundary and  c - 2 points from T‘s boundary (for the same reason as in the previous paragraph).  If we add together the boundary points of T minus the common points and the boundary points of P minus the common points, we will be counting those two points on the vertex of the triangle which are shared two times (why?), so we need to subtract 2 so that we only count these points once.  Hence, b_{Q} = b_{P} + b_{T} - 2(c - 2) - 2 = b_{P} + b_{T} - 2c + 2


At this point, let A_{P} be the area of P and A_{T} be the area of T; we have that:


\displaystyle A_{P} + A_{T} = \left(\ell_{P} + \frac{b_{P}}{2} - 1\right) + \left(\ell_{T} + \frac{b_{T}}{2} - 1\right)

\displaystyle = (\ell_{P} + \ell_{T}) + \frac{b_{P} + b_{T}}{2} - 2


We note now that \ell_{P} + \ell_{T} = \ell_{Q} - (c - 2) and b_{P} + b_{T} = b_{Q} + 2c - 2 from above, which gives us

\displaystyle = \ell_{Q} - (c - 2) + \frac{b_{Q}}{2} + (c - 1) - 2 = \ell_{Q} + \frac{b_{Q}}{2} - 1.

This verifies Pick’s formula for our lattice polygon Q, and since any lattice polygon can be constructed this way (from finitely many triangles) this shows that Pick’s formula holds for any lattice polygon.  \Box

This post has a pretty weird title, but the problem is easy to state and uses a few interesting mathematical concepts.  It’s worth going through.  Let’s start with the basics.


Problem 1.  Let S_{\alpha}(n) = 1^{\alpha} + 2^{\alpha} + \dots + n^{\alpha}.  Show that S_{\alpha}(n) is a polynomial for each \alpha\in {\mathbb N} and that the degree of the polynomial is \alpha + 1.


Indeed, for example, we have  that 1 + 2 + \dots + n = \frac{n(n+1)}{2}, as we learned in Calculus, and this is a polynomial of degree 2.  Similarly, 1^{2} + 2^{2} + \dots + n^{2} = \frac{n(n+1)(2n+1)}{6}, which is a polynomial of degree 3.  In the same respect, 1^{3} + 2^{3} + \dots + n^{3} = \frac{(n(n+1))^{2}}{4}, which is a polynomial of degree 4.

The associated polynomials in this case are given by Faulhaber’s formula:


Theorem (Faulhaber).  For \alpha\in {\mathbb N} we have \displaystyle \sum_{k=1}^{n}k^{\alpha} = \frac{1}{1 + \alpha}\sum_{j=0}^{\alpha}(-1)^{j}\binom{\alpha + 1}{j}B_{j}n^{\alpha + 1 - j}.


This formula looks terrifying, but it is not hard to apply in practice.  You may be wondering, though, what the B_{j}‘s in this formula stand for.  These are the strange and wonderful Bernoulli numbers, of course!  I always enjoy seeing these creatures, because they unexpectedly pop up in the strangest problems.  There are a number of ways to define these numbers, one of which is to just write them out sequentially, starting with B_{0}:


1, -\frac{1}{2}, \frac{1}{6}, 0, -\frac{1}{30}, 0, \frac{1}{42}, 0, -\frac{1}{30}, 0, \dots


But in this case it is not so easy to guess the next value.  The clever reader will notice that all of the odd numbered Bernoulli numbers (except the first) are zero, but other than that there does not seem to be a clear pattern.  Fortunately, we can construct a function which generates the values as coefficients; we’ll call this function (surprise!) a generating function.


Definition.  We define the sequence \{B_{j}\}_{j=0}^{\infty} by

\displaystyle \frac{t}{e^{t} - 1} = \sum_{j=0}^{\infty}B_{j}\frac{t^j}{j!}.


Notice that this will, in fact, generate the B_{j} as coefficients times \frac{1}{j!}.  Neat.  In practice, you can use a program like Mathematica to compute B_{j} for pretty large values of j; but, of course, there are lists available.  We can now use Faulhaber’s formula above, which gives us (assuming we have proven that the formula holds!) that the sums of powers of natural numbers form polynomials of degree \alpha + 1.


But something else happens that’s pretty interesting.  Let’s look at some of the functions.


\begin{array}{|c|c|}\hline \alpha & \mbox{Polynomial}\\\hline &\\ 1 & \frac{1}{2}n^{2} + \frac{1}{2}n\\ &\\ 2 & \frac{1}{3}n^{3} + \frac{1}{2}n^{2} + \frac{1}{6}n\\ & \\ 3 & \frac{1}{4}n^{4} + \frac{1}{2}n^{3} + \frac{1}{4}n^{2} \\ & \\ 4 & \frac{1}{5}n^{5} + \frac{1}{2}n^{4} + \frac{1}{3}n^{3} - \frac{1}{30}n\\ & \\ \hline \end{array}


Look at the coefficients in each of these polynomials.  Anything strange about them?  Consider them for a bit.


Problem.  Look at the coefficients.  What do you find interesting about them?  Note that, in particular, for a fixed \alpha, the coefficients of the associated polynomial sum to 1.  Convince yourself that this is probably true (do some examples!) and then prove that it is true.  Do this before reading the statements below.


Anecdote.  I spent quite a while trying to write down the "general form" of a polynomial with elementary symmetric polynomials and roots to try to see if I could prove this fact using some complex analysis and a lot of terms canceling out.  This morning, I went into the office of the professor to ask him about what it means that these coefficients sum up to 1.  He then gave me a one-line (maybe a half-line) proof of why this is the case. 


Hint.  What value would we plug in to a polynomial to find the sum of the coefficients?  What does plugging in this value mean in terms of the sum?

An interesting question came up during my studying today which made me think about some of the different ways we can think about polynomials.  The solution to this problem was not immediately obvious to me, and, in fact, it wasn’t until I looked up a completely unrelated problem (in a numerical methods book!) that some solution became clear.

Question:  Suppose that \{f_{n}\} is a sequence of polynomials with f_{n}:{\mathbb R}\to {\mathbb R} each of degree m > 1 and suppose f_{n}\to f pointwise.  Show that f is also a polynomial of degree no more than m


Some interesting points come up here.  First is that we only have pointwise convergence — it wasn’t even immediately obvious to me how to prove the resulting limit was continuous, let alone a polynomial of some degree.  Second, we know very little about the polynomials except for what degree they are.  This should be an indication that we need to characterize them with respect to something degree-related.

Indeed, polynomials can be represented in a few nice ways.  Among these are:


  • In the form f(x) = a_{0} + a_{1}x + \cdots + a_{n}x^{n} where it is usually stated that a_{n}\neq 0.
  • In terms of their coefficients.  That is, if we have a list of polynomials of degree 3 to store on a computer, we could create an array where the first column is the constant, the second is the linear term, and so forth.  This is sort of like decimal expansion.
  • Where they send each point.  That is, if we know what f(x) is equal to for each x, we could recover it.
  • If a polynomial is of degree m then, somewhat surprisingly, we can improve upon the previous statement: if we know the value of f for m + 1 distinct points, then we can find f, and f is the unique such polynomial of that degree which has those values.  (Note that if we were to have m+1 points and a polynomial of degree k > m, then many polynomials of this degree could fit the points.  Consider, for example, m = 0 and k = 1.  Then we have one points and we want to fit line through it.  Clearly this can be done in infinitely many ways.)


  This last one is going to be useful for us.  So much so that it might be a good idea to prove it.


Lemma.  Let x_{1}, \dots, x_{m+1} be m + 1 distinct points in {\mathbb R}, and let y_{1}, \dots, y_{m+1} be distinct points in {\mathbb R}.  Then there is a unique polynomial of degree at most m such that f(x_{i}) = y_{i} for each i considered here.


Proof.  This is an exercise in linear algebra.  We need to solve the system of linear equations

\displaystyle \sum_{n = 1}^{m} a_{n}x_{i}^{n} = y_{i}

where i spans 1, \dots, m+1, for the constants a_{n}\in {\mathbb R}.  Notice that this is simply plugging x_{i} into a general polynomial of degree m.  Notice that the matrix that this forms will be a Vandermonde matrix.  Since each x_{i} is distinct, the determinant of this matrix is nonzero, which implies that there is a unique solution.  This gives us our coefficients, and note that this is a polynomial not necessarily of degree exactly m, since some coefficients may be 0, but it is at most m\diamond


[Note: For those of you who forgot your linear algebra, the end of this goes like this: if we let our coefficients be denoted by the column matrix b and our Vandermonde matrix is denoted by A, then we want to solve Ab = Y where Y is the column vector with entries y_{i}.  If A has non-zero determinant, then it is invertible, and so we have that b = A^{-1}y gives us our coefficients.]


Neato.  But now we need to specialize this somewhat for our proof. 


Corollary.  Let the notation and assumptions be as in the last lemma.  For I\in \{1, \dots, m+1\}, let g_{i} be the unique polynomial of degree at most m with g_{i}(x_{j}) = \delta_{i,j} (where \delta_{i,j} = 0 if i\neq j and \delta_{i,i} = 1).  Then every polynomial f of degree at most m is of the form \displaystyle f(x) = \sum_{i = 1}^{m+1}f(x_{i})g_{i}(x) for each x\in {\mathbb R}.


This might be a bit more cryptic, so let’s do an example.  Let’s let m = 1 so that we have two points.  Let’s say x_{1} = 0 and x_{2} = 1.  Then we have g_{1} is the unique polynomial of degree at most 1 such that g_{1}(0) = 1 and g_{2}(1) = 0.  Of course, this function will be g_{1}(x) = -x + 1.  Now g_{2}(0) = 0 and g_{2}(1) = 1; this gives us that g_{2}(x) = x.  The theorem now states that any polynomial of degree at most 1 can be written in the form

f(x) = f(0)g_{1}(x) + f(1)g_{2}(x)

= f(0)(-x+1)+ f(1)x = (f(1) - f(0))x + f(0).

For example, let f(x) = 2x + 1.  Then the lemma says f(x) = (3(1) - 1)x + 1 = 2x + 1, as we’d expect.  The power of this lemma will become clear when we use this in the solution.  The proof of this corollary is just a specialization of the previous lemma, so we exclude it.


Solution.  Recall, just for notation, that our sequence \{f_{k}\}\to f pointwise.  Let’s let x_{1}, \dots, x_{m+1} be our distinct points, as usual.  In addition, let’s let g_{1}, \dots, g_{m+1} be defined as in the corollary above. Represent each f_{k} as follows:

\displaystyle f_{k}(x) = \sum_{j = 1}^{m+1}f_{k}(x_{j})g_{j}(x)

for each x\in {\mathbb R}.  Here comes the magic: let k\to\infty and note that f_{n} \to f at every point, so, in particular, on each x_{i} and on x.  We obtain

\displaystyle f(x) = \sum_{j = 1}^{m+1}f(x_{j})g_{j}(x)

for each x\in {\mathbb R}.  But this is the sum of polynomials of degrees at most m, which gives us that f is itself a polynomial of degree at most m\Box


I’ll admit, I did a bit of digging around after finding the lemma above; in particular, this corollary representation of polynomials seems to be a nice way to represent a polynomial if we do not know the coefficients but do know the values at a certain number of points and have that its degree is bounded below that number of points. 


Exercise: Try to write out this representation for m = 2 and m = 3.  If you’re a programmer, why not make a program that allows you to input some points and some values and spits out the polynomial from the corollary above?

Question:  Consider the subspace E of L^{2}([-1,1]) consisting of even functions (that is, functions with f(x) = f(-x)).  Find the orthogonal complement of E.


One Solution.  It’s easy to prove E is a subspace.  Then, there is a representation of any function in this space by adding odd and even functions together; more precisely, given f\in L^{2}([-1,1]) we have that \frac{f(x) + f(-x)}{2} is even and \frac{f(x) - f(-x)}{2} is odd and f(x) = \frac{f(x) + f(-x)}{2} + \frac{f(x) - f(-x)}{2}.  For uniqueness, note that if f(x) = f(-x) = -f(x), then f(x) = -f(x) for each x, giving us that f(x) = 0.  Hence, the orthogonal complement of E is the set of odd functions.  \diamond


Here’s another solution that "gets your hands dirty" by manipulating the integral.


Another Solution.  We want to find all g\in L^{2}([-1,1]) such that \langle f, g \rangle_{2} = 0 for every even function f\in L^{2}([0,1]).  This is equivalent to wanting to find all such g with \displaystyle \int_{-1}^{1}f(x)g(x)\, dx = 0.  Assume g is in the orthogonal complement.  That is,

\displaystyle 0 = \int_{-1}^{1}f(x)g(x)\, dx = \int_{-1}^{0}f(x)g(x)\, dx + \int_{0}^{1}f(x)g(x)\, dx

\displaystyle = -\int_{1}^{0}f(-x)g(-x)\, dx + \int_{0}^{1}f(x)g(x)\, dx

The last equality here re-parameterizes the first integral by letting x\mapsto -x, but note that our new dx gives us the negative sign.

\displaystyle = \int_{0}^{1}f(-x)g(-x)\, dx + \int_{0}^{1}f(x)g(x)\, dx

\displaystyle = \int_{0}^{1}f(x)g(-x)\, dx + \int_{0}^{1}f(x)g(x)\, dx

\displaystyle = \int_{0}^{1}f(x)g(-x) + f(x)g(x)\, dx = \int_{0}^{1}f(x)(g(x) + g(-x))\, dx.

We may choose f(x) = g(x) + g(-x) since this is an even function, and we note that this gives us

\displaystyle 0 = \int_{0}^{1}(g(x) + g(-x))^{2}\, dx.

Since (g(x) + g(-x))^{2}\geq 0, it must be the case that (g(x) + g(-x))^{2} = 0.  [Note: The fact that this is only true "almost everywhere" is implicit in the definition of L^{2}.]  Hence, g(x) + g(-x) = 0, giving us that -g(x) = g(-x)

We now have one direction: that if g is in the orthogonal complement, then it will be odd.  Now we need to show that if g is any odd function, it is in the orthogonal complement.  To this end, suppose g is an odd function.  Then by the above, we have

\displaystyle \langle f, g \rangle_{2} = \int_{-1}^{1}f(x)g(x)\, dx = \int_{0}^{1}f(x)(g(x) + g(-x))\, dx = 0

where the last equality comes from the fact that g is odd.  \diamond

Seemingly unrelated is Gauss’ Mean Value Theorem, which is significantly cooler (in my opinion) than the standard mean value theorem of the reals.  We will define it formally below, but it says the follwing: if f is analytic (equivalent to complex differentiable) on some disk D and a is the center point of this disk, then the average of the values about the boundary of D is equal to f(a).  That is, to find the value of f(a), it suffices to integrate around a circle centered at a and divide by 2\pi (the amount of radians we pass through while integrating).  This is really neat to think about since this tells us not only that, given f there exists some point whose value is equal to the average of the sum of the values of f lying on a circle, but, moreover, that this point is actually the center of the circle.  This is intense stuff.


Theorem (Gauss’ Mean Value Theorem).  Let f(z) be analytic on some closed disk D which has center a and radius r.  Let C denote the boundary of the disk (that is, C is the circle bounding D).  Then we have that \displaystyle f(a) = \frac{1}{2\pi}\int_{0}^{2\pi}f(a+re^{i\theta})\,d\theta.


The proof of this theorem is pretty straight forward and uses the Cauchy integral formula and some easy substitution.


Proof.  Note that we have \displaystyle f(a) = \frac{1}{2\pi i}\oint_{C}\frac{f(z)}{z-a}\,dz.  The equation of a circle with radius r and center a is given by z = a + re^{i\theta} where \theta runs from 0 to 2\pi (if you don’t believe me, plot some points!).  Substituting this value into the integral and noting that dz = ire^{i\theta} we have that

\displaystyle f(a) = \frac{1}{2\pi}\int_{0}^{2\pi}\frac{f(a + re^{i\theta})ire^{i\theta}}{re^{i\theta}}\,d\theta = \frac{1}{2\pi}\int_{0}^{2\pi}f(a+re^{i\theta})\,d\theta

as required.  \diamond


Why bring up this neat little theorem?  Well, by itself it doesn’t seem to be all that useful — when would we be able to calculate and sum up a whole ton of values of an analytic function surrounding a point, but not be able to find the point itself?  But this little theorem packs some punch as a way of bounding certain values.  In particular, it gives a neat proof of the Maximum Modulus Theorem.  You might have guessed this from the title of this post.


First, let’s note something quickly.


Lemma.  Given the assumptions in Gauss’ MVT, we have \displaystyle |f(a)|\leq \frac{1}{2\pi}\int_{0}^{2\pi}|f(a+re^{i\theta})|\,d\theta


Be careful here in thinking that this should be an equality; we are now looking at the modulus of our value, and the modulus of each point on the circle.  But this lemma comes almost for free:


Proof.  We have |f(a)| =\left|\frac{1}{2\pi}\int_{0}^{2\pi}f(a+re^{i\theta})\,d\theta\right| by using Gauss’ MVT and simply taking the norm of both sides.  Note that

\displaystyle\left|\frac{1}{2\pi}\int_{0}^{2\pi}f(a+re^{i\theta})\,d\theta\right| = \frac{1}{2\pi}\left|\int_{0}^{2\pi}f(a+re^{i\theta})\,d\theta\right|

\displaystyle \leq \frac{1}{2\pi}\int_{0}^{2\pi}|f(a+re^{i\theta})|\,d\theta

whence the inequality above.  \diamond


This lemma tells us that the value of the center of any circle is bounded by the sum of the modulus of the values of the points of that circle.  We’ll see why this is the crucial bound we’ll need in the MMT’s proof below. 


Theorem (Maximum Modulus Theorem).  Given f analytic on some domain D, if f is non-constant on D then the maximum value of |f(z)| for z\in D will occur on the boundary of D.  (Alternatively, if |f(z)| is maximized by some value not on the boundary of D, then f is constant on D.)


Proof.  We’ll split this into two steps.  The first step is for the specific case that D is a closed disk and our maximum modulus occurs at the center of this disk.  The second step will be to get some arbitrary space D and construct some closed disks in the interior of D and "piece these together" to show that f is constant on all of D.

Step 1: Let’s suppose that our maximum modulus is at the center point of D, which we will call z_{0}; that is, we are supposing that |f(z)|\leq |f(z_{0})| for every z\in D.  Since z_{0} is an interior point, we have that there is some r-ball about z_{0} (that is, a ball of radius r) which is completely contained in D.  Let the C denote the circle of radius r centered at the point z_{0}.  By our second lemma above we have that

\displaystyle|f(z_{0})|\leq \frac{1}{2\pi}\int_{0}^{2\pi}|f(z_{0}+re^{i\theta})|\,d\theta

BUT, using that |f(z)|\leq |f(z_{0})| for every z\in D we have that

\displaystyle \frac{1}{2\pi}\int_{0}^{2\pi}|f(z_{0}+re^{i\theta})|\,d\theta \leq \frac{1}{2\pi}\int_{0}^{2\pi}|f(z_{0})|\,d\theta = f(z_{0}).

Stringing these inequalities together and suggestively re-writing |f(z_{0})| = \frac{1}{2\pi}\int_{0}^{2\pi}|f(z_{0})|\,dz, we have that

\displaystyle \frac{1}{2\pi}\int_{0}^{2\pi}|f(z_{0})|\,d\theta = \frac{1}{2\pi}\int_{0}^{2\pi}|f(z_{0}+re^{i\theta})|\,d\theta

and by subtracting,

\displaystyle 0 = \frac{1}{2\pi}\int_{0}^{2\pi}|f(z_{0})| -|f(z_{0}+re^{i\theta})| \,dz

but since the integrand is always positive or zero (why?) it must be the case that

\displaystyle |f(z_{0})| -|f(z_{0}+re^{i\theta})| = 0

or, in other words, |f(z_{0})| =|f(z_{0}+re^{i\theta})|.  Since r was arbitrary, we conclude that |f(z_{0})| = |f(z)| for every z\in D.


Step 2: Now suppose we have some arbitrary domain D and f is analytic on all of D.  I will hand-wave a bit here, but you can fill in the details.  Note that a domain (in this context) necessarily means open and path-connected (and, in fact, it usually denotes a simply connected open subset of {\mathbb C}).   Suppose that our maximum modulus occurs at some point on the interior of D which we will call z_{0}.  Now, given any other point w\in D we have some path from z_{0} to w which is completely contained in D.  In fact, we can make this path a finite polygonal path; that is, a path made out of a finite number of straight lines piecewise-connected together; we will denote this z_{0}L_{0}z_{1}L_{1}\cdots L_{n-1}w, where the L_{i} is the line with endpoints z_{i-1} and z_{i}.  I will let you work the details out here, but it can be done. 




Now, the polygonal line might be right next to a boundary, and we don’t want to accidentally hit it when we start making balls around points, so let \epsilon denote whichever is smaller: the distance from the polygonal line to the boundary, or 1.  So, if your polygonal line is right next to the boundary, we might need to make \epsilon pretty small; but if not, we can just let it be whatever we want, so we might as well make it 1.  Note that since D is open, no point on the polygonal path should be on the boundary.  Now, let’s break up our polygonal path into another polygonal path z_{0}L_{0}z_{1}L_{1}\cdots L_{n-1}w where each L_{i} has length less than \frac{\epsilon}{4}.  It is clear we can do this just by partitioning each straight line in our original path so that their lengths are appropriately small; note, we still only have a finite number of endpoints z_{i}.  That’s important.



(In the picture above, I’ve made the original endpoints blue and then partitioned our polygonal path with the new red endpoints to make each line segment less than \frac{\epsilon}{4}.)

Now everything is going to fall pretty quickly, so keep on your toes.  First, make a disk of radius \epsilon (as defined above) around each z_{i} and call it D_{i}.  Now note that, by our previous step, since our maximum modulus occurs at z_{0}, we have f(z) = f(z_{0}) for every point z\in D_{0}.  But z_{1} is in D_{0}


(This picture is not drawn to scale because I am not a good artist; this is illustrating z_{1} being inside the circle D_{0}.)

So now z_{1} is also of maximum modulus (since z_{0} was) and so f(z_{1}) = f(z) for every point in D_{1}.  Continue this and we will obtain f(z_{0}) = f(w).  Since w was an arbitrary point, it follows that f(z_{0}) = f(w)  for every w\in D.  Hence, if f attains a maximum modulus on the interior of some set D, then it is constant.  This implies directly that any non-constant analytic function achieves its maximum modulus on the boundary.  \Box