In practice, the 2SLS estimator has a number of important limitations. For one, it's biased in small samples. And since applied econometricians live in a world of small sample sizes, this can be important in practice. Second, if your instruments aren't sufficiently correlated with your regressors -- a failure of instrument "relevance" -- the small-sample bias can be magnified dramatically. In the extreme case of zero instrument relevance, the math completely falls apart and the 2SLS estimator isn't even identified.

Below are some notes summarizing the problem. The first 3 pages show the math for the simple case of one regressor and one instrument. The last 3 pages show the general case of multiple regressors and instruments.

Small Sample Bias of the 2SLS IV Estimator

Of his many mathematical contributions, Archimedes was most proud of this result, even going so far as to ask that the method he used to work out the formula -- a diagram of a sphere inscribed in a cylinder -- be imprinted on his gravestone.

Archimedes' formula may have been a stroke of scientific genius in 250 B.C., but with the help of modern calculus the derivation is extremely simple. In this post I'll explain one way to derive the famous formula, and explain how it can be done in dimensions other than the usual three.

Consider the diagram below. It's a sphere with radius r. The goal is to find the volume, and here's how we do that.

Notice that one thing we can easily find is the area of a single horizontal slice of the ball. This is the shaded disk at the top of the diagram, which is drawn at height z. The disk has a radius of x, which we'll need to find the area of the disk. To find x, we can form a right triangle with sides z and x, and hypotenuse r. This is drawn in the figure. Then we can easily solve for x.

By the Pythagorean theorem, we know that $x^2 + z^2 = r^2$, so solving for x we have $x = \sqrt{r^2 - z^2}$. Then the area of the shaded disk is simply pi times the radius squared, or $A(z) = \pi (r^2 - z^2)$.

Now that we have the area of one horizontal disk, we want to add up the areas of all such disks as z runs from the bottom of the sphere to the top. Integrating, we have:

$$V = \int_{-r}^{r} \pi (r^2 - z^2) \, dz = \pi \left[ r^2 z - \frac{z^3}{3} \right]_{-r}^{r} = \frac{4}{3} \pi r^3$$

Which is the volume formula we were looking for.
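As a quick sanity check, here's the same slicing argument done numerically (a short script of my own, summing thin disk slices as a Riemann sum):

```python
import math

def sphere_volume_by_slices(r, n=100_000):
    """Approximate the sphere's volume by summing thin disk slices
    of area pi*(r^2 - z^2) from z = -r to z = r."""
    dz = 2 * r / n
    total = 0.0
    for i in range(n):
        z = -r + (i + 0.5) * dz          # midpoint of each slice
        total += math.pi * (r * r - z * z) * dz
    return total

r = 2.0
approx = sphere_volume_by_slices(r)
exact = 4 / 3 * math.pi * r ** 3
print(approx, exact)   # the two agree to many decimal places
```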

This same logic can be used to derive formulas for the volume of a "ball" in 4, 5, and higher dimensions as well. Doing so, you can show that the volume of a unit ball in one dimension (a line) is just 2; the volume in two dimensions (a disk) is $\pi$, and -- as we've just shown -- the volume in three dimensions (a sphere) is $\frac{4}{3}\pi$. Continuing on to four, five, and ultimately n dimensions, a surprising result appears.

It turns out the volume of a unit ball peaks at five dimensions, and then proceeds to shrink thereafter, ultimately approaching zero as the dimension n goes to infinity. You can read more about this beautiful mathematical result here.
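You can verify the peak-at-five result numerically using the general formula $V_n = \pi^{n/2} / \Gamma(n/2 + 1)$ for the volume of the unit n-ball (a quick script of my own):

```python
import math

def unit_ball_volume(n):
    """Volume of the unit ball in n dimensions: pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

# Tabulate the first several dimensions and find where the volume peaks.
volumes = {n: unit_ball_volume(n) for n in range(1, 16)}
peak = max(volumes, key=volumes.get)
print(peak)   # 5 -- the volume is largest in five dimensions
```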

First we'll prove a lemma that shows for any sequence we can always find a monotone subsequence -- that is, a subsequence that's always increasing or decreasing. Call a term of the sequence a "peak" if it's greater than or equal to every term that comes after it.

If $(a_n)$ has infinitely many peaks, then collect those peaks into a subsequence $(a_{n_k})$. Since each peak is greater than or equal to every term that comes after it -- including all the later peaks -- this is a monotone decreasing subsequence, as required.

If $(a_n)$ has finitely many peaks, take $n_1$ to be an index beyond the last peak. Since $a_{n_1}$ is not a peak, there must be some later term $a_{n_2}$ with $a_{n_2} > a_{n_1}$. And since $a_{n_2}$ is not a peak either, there must be some still later term $a_{n_3}$ with $a_{n_3} > a_{n_2}$.

Continuing, we can create a subsequence that is monotone increasing. In either case -- whether our sequence has infinitely or finitely many peaks -- we can always find a monotone subsequence, as required.

Now that we've proved the above lemma, the proof of the Bolzano-Weierstrass theorem follows easily: given any bounded sequence, the lemma guarantees a monotone subsequence, and a monotone bounded sequence converges. So every bounded sequence has a convergent subsequence, which is exactly what the theorem claims.

In this post I'll explain a powerful and surprising application of linear algebra to another field of mathematics -- calculus. I'll explain how the fundamental calculus operations of differentiation and integration can be understood instead as a linear transformation -- a matrix multiplication acting on a vector space of functions.

In linear algebra, the concept of a vector space is very general. Anything can be a vector space as long as it follows two rules.

The first rule is that if u and v are in the space, then u + v must also be in the space. Mathematicians call this "closed under addition." Second, if u is in the space and c is a constant, then cu must also be in the space. This is known as "closed under scalar multiplication." Any collection of objects that follows those two rules -- they can be vectors, functions, matrices and more -- qualifies as a vector space.

One of the more interesting vector spaces is the set of all polynomials of degree at most n. These are the functions of the form:

$$p(t) = a_0 + a_1 t + a_2 t^2 + \cdots + a_n t^n$$

where $a_0 \ldots a_n$ are constants.

Is this really a vector space? To check, we can verify that it follows our two rules from above. First, if p(t) and q(t) are both polynomials, then p(t) + q(t) is also a polynomial. That shows it's closed under addition. Second, if p(t) is a polynomial, so is c times p(t), where c is a constant. That shows it's closed under scalar multiplication. So the set of polynomials of degree at most n is indeed a vector space.

Now let's think about calculus. One of the first methods we learn is taking derivatives of polynomials. It's easy. If our polynomial is $a t^2 + 3t$, then its first derivative is $2at + 3$. This works the same way for all polynomials. So the general first derivative of an nth degree polynomial is given by:

$$p'(t) = a_1 + 2 a_2 t + 3 a_3 t^2 + \cdots + n a_n t^{n-1}$$

The question is: is this also a vector space? To answer that, we check to see that it follows our two rules above. First, if we add any two derivatives together, the result will still be the derivative of some polynomial. Second, if we multiply any derivative by a constant c, this will still be the derivative of some polynomial. So the set of first derivatives of polynomials is also a vector space.

Now that we know polynomials and their derivatives each form a vector space, we can think of differentiation as a mapping from one space to the other.

If we call the set of polynomials of degree at most n $P_n$, then the set of first derivatives of these polynomials is $P_{n-1}$, since taking the first derivative will reduce the degree of each polynomial term by 1. Thus, the operation "take the derivative" is just a function that maps $P_n \to P_{n-1}$. A similar argument shows that "taking the integral" is also a linear transformation in the opposite direction, from $P_{n-1} \to P_n$.

Once we realize that differentiation and integration from calculus are really just linear transformations, we can describe them using the tools of linear algebra.

Here's how we do that. To fully describe any linear transformation as a matrix multiplication in linear algebra, we follow three steps.

First, we find a basis for the subspace in the domain of the transformation. That is, if our transformation is from $P_n$ to $P_{n-1}$, we first write down a basis for $P_n$.

Next, we feed each element of this basis through the linear transformation, and see what comes out the other side. That is, we apply the transformation to each element of the basis, which gives the "image" of each element under the transformation. Since every element of the domain is some combination of those basis elements, by running them through the transformation we can see the impact the transformation will have on any element of the domain.

Finally, we collect each of those resulting images into the columns of a matrix. That is, each time we run an element of the basis through the linear transformation, the output will be a vector (the "image" of the basis element). We then place these vectors into a matrix D, one in each column from left to right. That matrix D will fully represent our linear transformation.

Here's an example of how to do this for $P_3$, the set of all polynomials of at most degree 3. This is the set of all functions of the following form:

$$p(t) = a_0 + a_1 t + a_2 t^2 + a_3 t^3$$

where $a_0 \ldots a_3$ are constants. When we apply our transformation, "take the derivative of this polynomial," it will reduce the degree of each term in our polynomial by one. Thus, the transformation D will be a linear mapping from $P_3$ to $P_2$, which we write as $D : P_3 \to P_2$.

To find the matrix representation for our transformation, we follow our three steps above: find a basis for the domain, apply the transformation to each basis element, and compile the resulting images into columns of a matrix.

First we find a basis for $P_3$. The simplest basis is the following: 1, t, t^2, and t^3. All third-degree polynomials will be some linear combination of these four elements. In vector notation, we say that a basis for $P_3$ is given by:

$$1 = (1, 0, 0, 0), \quad t = (0, 1, 0, 0), \quad t^2 = (0, 0, 1, 0), \quad t^3 = (0, 0, 0, 1)$$

Now that we have a basis for our domain $P_3$, the next step is to feed the elements of it into the linear transformation to see what it does to them. Our linear transformation is, "take the first derivative of the element." So to find the "image" of each element, we just take the first derivative.

The first element of the basis is 1. The derivative of this is just zero. That is, the transformation D maps the vector (1, 0, 0, 0) to (0, 0, 0). Our second element is t. The derivative of this is just one. So the transformation D maps our second basis vector (0, 1, 0, 0) to (1, 0, 0). Similarly for our third and fourth basis vectors: since the derivative of $t^2$ is $2t$, the transformation maps (0, 0, 1, 0) to (0, 2, 0), and since the derivative of $t^3$ is $3t^2$, it maps (0, 0, 0, 1) to (0, 0, 3).

Applying our transformation to the four basis vectors, we get the following four images under D:

$$(0, 0, 0), \quad (1, 0, 0), \quad (0, 2, 0), \quad (0, 0, 3)$$

Now that we've applied our linear transformation to each of our four basis vectors, we next collect the resulting images into the columns of a matrix. This is the matrix we're looking for -- it fully describes the action of differentiation for any third-degree polynomial in one simple matrix.

Collecting our four image vectors into a matrix, we have:

$$D = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 3 \end{bmatrix}$$

This matrix gives the linear algebra view of differentiation from calculus. Using it, we can find the derivative of any polynomial of degree 3 or less with nothing more than a matrix multiplication.

For example, consider the polynomial $p(t) = 3 - 5t + t^2 + 2t^3$. Note that the first derivative of this polynomial is $p'(t) = -5 + 2t + 6t^2$; we'll use this in a minute. In vector form, this polynomial can be written as:

$$p = (3, \; -5, \; 1, \; 2)$$

To find its derivative, we simply multiply this vector by our D matrix from above:

$$D p = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 3 \end{bmatrix} \begin{bmatrix} 3 \\ -5 \\ 1 \\ 2 \end{bmatrix} = \begin{bmatrix} -5 \\ 2 \\ 6 \end{bmatrix}$$

which is exactly the first derivative of our polynomial function!
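Here's the same computation in code (a small pure-Python sketch of my own, using the basis ordering $1, t, t^2, t^3$ and a sample polynomial):

```python
# Differentiation matrix D for P3 -> P2, in the basis {1, t, t^2, t^3}.
D = [
    [0, 1, 0, 0],   # derivative of t   contributes 1 to the constant slot
    [0, 0, 2, 0],   # derivative of t^2 contributes 2 to the t slot
    [0, 0, 0, 3],   # derivative of t^3 contributes 3 to the t^2 slot
]

def mat_vec(M, v):
    """Plain matrix-vector multiply."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

p = [3, -5, 1, 2]        # p(t) = 3 - 5t + t^2 + 2t^3
print(mat_vec(D, p))     # [-5, 2, 6], i.e. p'(t) = -5 + 2t + 6t^2
```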

This is a powerful tool. By recognizing that differentiation is just a linear transformation -- as is integration, which follows a similar argument that I'll leave as an exercise -- we can see it's really just a rule that linearly maps functions in $P_3$ to functions in $P_2$.

In fact, all m x n matrices can be understood in this way. That is, an m x n matrix is just a linear mapping that sends vectors in $\mathbb{R}^n$ into $\mathbb{R}^m$. In the case of the example above, we have a 3 x 4 matrix that sends polynomials in $P_3$ (such as $a t^3 + b t^2 + c t + d$, which has four coefficients) into the space of first derivatives in $P_2$ (in this case, $3a t^2 + 2b t + c$, which has three coefficients).

For more on linear transformations, here's a useful lecture from MIT's Gilbert Strang.

Thankfully, there's an easier way to understand exact differential equations. Years ago, I tried to come up with the simplest possible way of explaining the method. Here's what I came up with.

The entire method of solving exact differential equations can be boiled down to the diagram below: "Exact ODEs in a Nutshell."

Recall that exact ODEs are ones that we can write as M(x,y) + N(x,y)*y' = 0, where M and N are continuous functions, and y' is dy/dx. Here is how to read the diagram.

Starting with an exact ODE, we're on the second line labeled "starting point." We have functions M and N, and our goal is to move upward toward the top line labeled "goal." That is, given an exact ODE, we want to find a solution F(x,y) = c whose first partial derivatives are Fx (which is just the function M) and Fy (which is the function N).

Before we do anything, we check that our equation is really exact. To do this, we move to the bottom line labeled "test for exactness." That is, we take the derivative of Fx = M with respect to y (giving us Fxy = My). And we take the derivative of Fy = N with respect to x (which gives us Fyx = Nx). Set these equal to each other. A basic theorem from calculus says that the mixed partial derivatives Fxy and Fyx will be the same for any function F(x,y). If they're equal, F(x,y) on the top line is guaranteed to exist.

Now we can solve for the function F(x,y). The diagram makes it easy to see how. We know M(x,y) is just the first partial derivative of F with respect to x. So we can move upward toward F(x,y) by integrating M with respect to x. Similarly, we know the function N(x,y) is just the first partial derivative of F(x,y) with respect to y, so we can find another candidate for F by integrating N with respect to y.

In the end, we'll have two candidates for F(x,y). Sometimes they're the same, in which case we're done. Sometimes they're different, as one will have a term the other won't have -- a term that got dropped to zero as we differentiated from F(x,y) to either Fx or Fy, since it's a function of only one of x or y. This is easy to solve: just combine all the terms from both candidates for F(x,y), omitting any duplicate terms. This will be our solution F(x,y) = c.
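To make the diagram concrete, here's a small worked example (an exercise of my own, not one from the text). Consider the equation $2xy + (x^2 + 3y^2)\,y' = 0$, so that $M = 2xy$ and $N = x^2 + 3y^2$.

Test for exactness: $M_y = 2x$ and $N_x = 2x$, so the equation is exact. Moving up the diagram, integrating M with respect to x gives the candidate $F = x^2 y$, while integrating N with respect to y gives the candidate $F = x^2 y + y^3$. Combining the terms from both candidates and omitting the duplicate $x^2 y$, the solution is:

$$F(x, y) = x^2 y + y^3 = c$$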

Try using this method on a few examples here. I think you'll find it's much simpler -- and easier to remember years later -- than the round-about method used in most textbooks.

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \, \sqrt{\sum_i (y_i - \bar{y})^2}}$$

where x and y are the data sets we're trying to measure the correlation of.

This formula can be useful, but also has some major disadvantages. It's complex, hard to remember, and gives students almost no insight into what the correlation coefficient is really measuring. In this post I'll explain an alternative way of thinking about "r" as the cosine of the angle between two vectors.

The idea behind the correlation coefficient is that we want a standard measure of how "related" two data sets x and y are. Rather than thinking about data sets, imagine instead that we place our x and y data into two vectors u and v. These will be two n-dimensional arrows pointing through space. The question is: how "similar" are these arrows to each other? As we'll see below, the answer is given by the correlation coefficient between them.

The figure below illustrates the idea of measuring the "similarity" of two vectors v1 and v2. In the figure, the vectors are separated by an angle theta. A pretty good measure of how "similar" they are is the cosine of theta. Think about what cosine is doing. If both v1 and v2 point in roughly the same direction, the cosine of theta will be about 1. If they point in opposite directions, it will be -1. And if they are perpendicular or orthogonal, it will be 0. In this way, the cosine of theta fits our intuition pretty well about what it means for two vectors to be "correlated" with each other.

What is the cosine of theta in the figure? From the geometry of right triangles, recall that the cosine of an angle is the ratio of the length of the adjacent side to the length of the hypotenuse. In the figure, we form a right triangle by projecting the vector v1 down onto v2. This gives us a new vector p. The cosine of theta is then given by:

$$\cos \theta = \frac{\|p\|}{\|v_1\|} = \frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}$$

Now suppose we're interested in the correlation between two data sets x and y. Imagine we normalize x and y by subtracting from each data point the mean of the data set. Let's call these new normalized data sets u and v. So we have:

$$u = (x_1 - \bar{x}, \; x_2 - \bar{x}, \; \ldots, \; x_n - \bar{x})$$
$$v = (y_1 - \bar{y}, \; y_2 - \bar{y}, \; \ldots, \; y_n - \bar{y})$$

The question is, how "correlated" or "similar" are these vectors u and v to each other in space? That is, what is the cosine of the angle between them? Using the formula above, we have:

$$\cos \theta = \frac{u \cdot v}{\|u\| \, \|v\|}$$

But since $u \cdot v = \sum_i (x_i - \bar{x})(y_i - \bar{y})$ and $\|u\| \, \|v\| = \sqrt{\sum_i (x_i - \bar{x})^2} \, \sqrt{\sum_i (y_i - \bar{y})^2}$, this means the cosine of theta is just the correlation coefficient between the two vectors u and v, or:

$$r = \cos \theta$$

From this perspective, the correlation coefficient has an elegant geometric interpretation. If two data sets are positively correlated, they should roughly "point in the same direction" when placed into n-dimensional vectors. If they're uncorrelated, they should point in directions that are orthogonal to each other. And if they're negatively correlated, they should point in roughly opposite directions.
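Here's a quick numeric check of that claim (a short script of my own; the data values are made up, and any two data sets will do):

```python
import math

x = [1.0, 2.0, 4.0, 5.0, 7.0]   # made-up sample data
y = [2.0, 3.0, 5.0, 4.0, 8.0]

# Center each data set by subtracting its mean (the u and v vectors above).
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
u = [xi - mean_x for xi in x]
v = [yi - mean_y for yi in y]

# The cosine of the angle between the centered vectors...
dot = sum(a * b for a, b in zip(u, v))
norm_u = math.sqrt(sum(a * a for a in u))
norm_v = math.sqrt(sum(b * b for b in v))
cos_theta = dot / (norm_u * norm_v)

# ...matches the textbook Pearson formula computed from the raw data.
num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
den = math.sqrt(sum((xi - mean_x) ** 2 for xi in x)) * \
      math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
r = num / den
print(cos_theta, r)   # identical values
```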

The cosine of the angle between two vectors nicely fits that intuition about correlation. So it's no surprise the two ideas are ultimately the same thing -- a much simpler interpretation of "r" than the usual textbook formulas.

The simplest way of modeling population is to assume "exponential" growth. That is, just assume population grows by some annual rate, forever. If we let "y" be a city's population and "k" be the annual growth rate, the exponential growth model is given by

$$\frac{dy}{dt} = k y$$

This is a simple first-order differential equation. We can solve this for "y" by using a technique called "separation of variables". First, we separate variables like this:

$$\frac{dy}{y} = k \, dt$$

Then we integrate both sides and solve for y, as follows:

$$\int \frac{dy}{y} = \int k \, dt \quad \implies \quad \ln y = kt + C \quad \implies \quad y = e^{kt + C} = e^{C} e^{kt}$$

Since C is just an arbitrary constant, we can let e^C just equal C, which gives us

$$y = C e^{kt}$$

where k is the annual growth rate, t is the number of years from today, and C is the population at time t=0. This is the famous "exponential growth" model.

While the exponential model is useful for short-term forecasts, it gives unrealistic estimates for long time periods. After just a few decades, population would rapidly grow toward infinity in this model. A more realistic model should capture the idea that population does not grow forever, but instead levels off around some long-term level. This leads us to our second model.

We can improve the above model by making a simple adjustment. Let "A" be the maximum long-term population a city can reasonably sustain. Then multiply the model above by a factor (1 - y/A), giving us

$$\frac{dy}{dt} = k y \left(1 - \frac{y}{A}\right)$$

In this model, the population starts out growing exponentially. But as "y" approaches the maximum level "A", the term (1 - y/A) approaches zero, slowing down the growth rate. In the long run, growth will slow to a crawl as cities approach their maximum sustainable size -- a much more reasonable way to model population growth. This is known as the "logistic" model.

To solve for "y," we can again use separation of variables. However, we'll first need to use a trick from algebra known as the "partial fractions decomposition."

The partial fractions decomposition is a theorem about rational functions of the form P(x)/Q(x). Here is what it says. If P(x) and Q(x) are polynomials, P(x)/Q(x) is "proper" -- that is, the order of P(x) is less than the order of Q(x) -- and the roots of Q(x) are distinct, then we can "decompose" P(x)/Q(x) as follows:

$$\frac{P(x)}{Q(x)} = \frac{C_1}{x - a_1} + \frac{C_2}{x - a_2} + \cdots + \frac{C_n}{x - a_n}$$

where a1...an are the n roots of the polynomial Q(x), and C1...Cn are constants. Using this theorem, we can decompose hard-to-handle rational functions into much simpler pieces -- something we'll need to do to solve the logistic population model above.

Recall that the logistic population model is given by:

$$\frac{dy}{dt} = k y \left(1 - \frac{y}{A}\right)$$

Separating variables, we have:

$$\frac{dy}{y \left(1 - y/A\right)} = k \, dt$$

The term on the left-hand side is hard to integrate as written. Since it's a proper rational polynomial function, we can now use the partial fractions decomposition to simplify it. By the theorem above, we can rewrite it as:

$$\frac{1}{y \left(1 - y/A\right)} = \frac{C_1}{y} + \frac{C_2}{1 - y/A}$$

To solve for C1 and C2, first multiply both sides by y(1 - y/A) to clear the denominators, like this:

$$1 = C_1 \left(1 - \frac{y}{A}\right) + C_2 \, y$$

This equation is true for all values of y. To solve for C1 and C2, simply plug in values for y that allow us to solve for them. To solve for C1, let y = 0. This "zeros out" C2 in the equation and lets us solve for C1, as follows:

$$1 = C_1 (1 - 0) + C_2 \cdot 0 \quad \implies \quad C_1 = 1$$

To solve for C2 we repeat the process, plugging in a value for y that "zeros out" C1. To do this, let y = A, and solve for C2 as follows:

$$1 = C_1 (1 - 1) + C_2 A \quad \implies \quad C_2 = \frac{1}{A}$$

Using these constants, now we can rewrite our original function using the partial fractions decomposition as follows:

$$\frac{1}{y \left(1 - y/A\right)} = \frac{1}{y} + \frac{1/A}{1 - y/A}$$

This simpler function can then be plugged into our integration problem above, allowing us to integrate the logistic model and solve for y. Returning to our problem, we have:

$$\int \left( \frac{1}{y} + \frac{1/A}{1 - y/A} \right) dy = \int k \, dt$$

Integrating both sides, we have:

$$\ln y - \ln\left(1 - \frac{y}{A}\right) = kt + C_0 \quad \implies \quad \frac{y}{1 - y/A} = C e^{kt}$$

where $C = e^{C_0}$.

To solve for "C" in the equation, note that if we let t=0, then $C = \frac{y_0}{1 - y_0/A}$, where y0 is the beginning population. Plugging in and solving for y, we then have,

$$y(t) = \frac{A}{1 + \left( \frac{A - y_0}{y_0} \right) e^{-kt}}$$
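Here's the closed-form solution as a small function (a sketch of my own; the parameter values below are made up purely for illustration):

```python
import math

def logistic(t, y0, k, A):
    """Logistic population model: y(t) = A / (1 + ((A - y0)/y0) * e^(-k*t))."""
    return A / (1 + ((A - y0) / y0) * math.exp(-k * t))

# A city of 100,000 growing at 3% a year toward a 1,000,000 cap (made-up numbers).
y0, k, A = 100_000, 0.03, 1_000_000
print(logistic(0, y0, k, A))     # starts at the initial population y0
print(logistic(500, y0, k, A))   # levels off near the cap A in the long run
```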

This is the famous "logistic model" of population growth. This basic model can then be used to develop pretty reasonable long-term forecasts for city populations.

A common problem with time-series data is getting them into the right time interval. Some data are daily or weekly, while others are in monthly, quarterly or annual intervals. Since most regression models require consistent time intervals, an econometrician's first job is usually getting the data into the same frequency.

In this post I'll explain how to solve a common problem I've run into: how to divide quarterly data into monthly data. To do so, we'll use a method known as "cubic spline interpolation." In the example below we use Matlab and Excel. For Stata users, I've posted a Stata do file that illustrates how to work through the below example in Stata.

One of the most widely used data sources in economics is the National Income and Product Accounts (NIPAs) from the U.S. Bureau of Economic Analysis. They're the official source for U.S. GDP, personal income, trade flows and more. Unfortunately, most data are published only quarterly or annually. So if you're hoping to run a regression using monthly observations -- for example, this simple estimate of the price elasticity of demand for gasoline -- you'll need to split these quarterly data into monthly ones.

A common way to do this is by "cubic spline interpolation." Here's how it works. We start with n quarterly data points. That means we have n-1 spaces between them. Across each space, we draw a unique 3rd-degree (or "cubic") polynomial connecting the two points. This is called a "piecewise polynomial" function.

To make sure our piecewise cubics join up into a smooth curve, we force all our first and second derivatives to be continuous; that is, at each connecting point, the derivatives of the cubics on either side must be equal. When all these requirements are met -- along with a couple of end-point conditions you can read about here -- we have a (4n-4) x (4n-4) linear system that can be solved for the coefficients of all n-1 cubic polynomials.

Once we have these n-1 piecewise polynomials, we can plug in x values for whatever time intervals we want: monthly, weekly or even daily. The polynomials will give us a pretty good interpolation between our known quarterly data points.

While the above method seems simple, doing cubic splines by hand is not. A spline for just four data points requires setting up and solving a 12 x 12 linear system, then manually evaluating three different polynomials at the desired x values. That's a lot of work. To get a sense of how hard this is, here's my own Excel file showing what's involved in fitting a cubic spline to four data points by hand.

In practice, the best way to do a cubic spline is to use MATLAB. It takes about five minutes. Here's how to do it.

MATLAB has a built-in "spline()" function that does the dirty work of cubic spline interpolation for you. It requires three inputs: a list of x values from the quarterly data you want to split; a list of y values from the quarterly data; and a list of x values for the monthly time intervals you want. The spline() function formulates the n-1 cubic polynomials, evaluates them at your desired x values, and gives you a list of interpolated monthly y values.

Here's an Excel file showing how to use MATLAB to split quarterly data into monthly. In the file, the first two columns are quarterly values from BEA's Personal Income series. Our goal is to convert these into monthly values. The next three columns (highlighted in yellow) are the three inputs MATLAB needs: the original quarterly x values (x); the original quarterly y values (y); and the desired monthly x values (xx).

In the Excel file, note that the first quarter is listed as month 2, the second quarter as month 5, and so on. Why is this? BEA's quarterly data represent an average value over the three-month quarter. That means they should be treated as a mid-point of the quarter. For Q1 that's month 2, for Q2 that's month 5, and so on.

The next step is to open MATLAB and paste in these three columns of data. In MATLAB, type " x = [ ", cut and paste the column of x values in from Excel, type " ] " and hit return. This creates an n x 1 vector with the x values. Repeat this for the y, and xx values in the Excel file.

Once you have x, y, and xx defined in MATLAB, type "yy = spline(x,y,xx)" and hit return. This will create a new vector yy with the interpolated monthly y values we're looking for. Each entry in yy will correspond to one of the x values you specified in the xx vector.

Copy these yy values from MATLAB, paste them into Excel, and we're done. We now have an estimated monthly Personal Income series.

Here's an Excel file summarizing the above example for splitting quarterly Personal Income data into monthly using MATLAB. Also, here's a MATLAB file with the x, y, xx, and yy vectors from the above exercise.
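If you don't have MATLAB handy, the same interpolation can be hand-rolled. Below is a minimal pure-Python sketch of my own implementing a natural cubic spline (note that MATLAB's spline() uses "not-a-knot" end conditions rather than the natural ones here, so values near the endpoints will differ slightly; the sample y values are made up):

```python
def natural_cubic_spline(xs, ys):
    """Fit a natural cubic spline through the points (xs[i], ys[i]).
    Returns a function S(x) that evaluates the spline (and extrapolates
    with the end cubics outside the data range)."""
    n = len(xs) - 1                       # number of intervals
    h = [xs[i + 1] - xs[i] for i in range(n)]
    a = list(ys)

    # Right-hand side of the tridiagonal system for the c coefficients.
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3 * (a[i + 1] - a[i]) / h[i] - 3 * (a[i] - a[i - 1]) / h[i - 1]

    # Forward sweep; "natural" end conditions force c[0] = c[n] = 0.
    l = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]

    # Back substitution for c, then recover b and d on each interval.
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (a[j + 1] - a[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])

    def S(x):
        # Find the interval containing x, then evaluate that cubic.
        j = 0
        while j < n - 1 and x > xs[j + 1]:
            j += 1
        dx = x - xs[j]
        return a[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return S


# Example: quarterly values placed at mid-quarter months (2, 5, 8, 11),
# interpolated at every month from 1 to 12 (sample y values are made up).
S = natural_cubic_spline([2, 5, 8, 11], [100.0, 104.0, 103.0, 108.0])
monthly = [round(S(m), 2) for m in range(1, 13)]
print(monthly)
```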

In a previous post, I explained how simple linear models can help us understand where systems like this are headed in the long run. In this post, I'll explain how the same tools can help us see the "long run" values of the Fibonacci sequence. The result is an elegant model that illustrates the connection between the Fibonacci sequence and the so-called "golden ratio", an aesthetic principle appearing throughout art, architecture, music, book design and more.

Each term in the Fibonacci sequence equals the sum of the previous two. That is, if we let $F_n$ denote the nth term in the sequence, we can write $F_{n+2} = F_{n+1} + F_n$. To make this into a linear system, we need at least one more equation. Let's use an easy one: $F_{n+1} = F_{n+1}$. Putting these together, here's our system of linear equations for the Fibonacci sequence:

$$F_{n+2} = F_{n+1} + F_n$$
$$F_{n+1} = F_{n+1}$$

We can write this in matrix notation as the following:

$$\begin{bmatrix} F_{n+2} \\ F_{n+1} \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} F_{n+1} \\ F_n \end{bmatrix}$$

Here's what this is saying. If we start with the vector on the right and multiply it by this 2 x 2 matrix, the result is the vector on the left. That is, multiplying our starting vector by the matrix above gives us the next pair of terms in the sequence.

As explained here, the general form for dynamic linear models like this is:

$$u_1 = A u_0$$

where u0 is the initial state, u1 is the final state, and A is the transformation matrix that moves us from one state to the next. As explained in the earlier post, we can use eigenvalues and eigenvectors to solve for the long run values of systems like this.

Here's how we do that. Let S be a 2 x 2 matrix where the columns are the two eigenvectors of our transformation matrix A. And let $\Lambda$ be a 2 x 2 matrix with the two eigenvalues of A along the diagonal and zeros elsewhere. By the definition of eigenvalues and eigenvectors, we have the following identity:

$$A S = S \Lambda$$

Solving for A, we have:

$$A = S \Lambda S^{-1}$$

Plugging this into our above equation relating u0 and u1, we have the following system:

$$u_1 = S \Lambda S^{-1} u_0$$

This equation relates the initial state vector u0 to the next state u1. But each time we multiply by A it moves us to the next state, and the next state, and so on. Imagine we multiply k times. We get the following general form:

$$u_k = A^k u_0 = S \Lambda^k S^{-1} u_0$$

This equation is what we're looking for. It allows us to quickly find the kth term in the Fibonacci sequence with a simple calculation. It relies only on the initial state vector u0 and the eigenvalues and eigenvectors of the transformation matrix A. That's our linear model.

Now that we have a model, let's find the S, $\Lambda$, and other parts that make it up.

The first step is to find the eigenvalues of the A matrix. This is given by:

$$A = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}$$

We find the eigenvalues by solving $Ax = \lambda x$ for the values of lambda that make the equation true (that's the definition of an eigenvalue). That's the same as solving $Ax - \lambda x = 0$, or $(A - \lambda I)x = 0$. For this to have a nonzero solution x, the matrix $(A - \lambda I)$ must be singular, or not invertible. So the determinant of $(A - \lambda I)$ must be zero. That means we can find the eigenvalues by setting the determinant of $(A - \lambda I)$ equal to zero and solving for lambda (this is called the "characteristic polynomial"). Here's how we do that:

$$\det(A - \lambda I) = \begin{vmatrix} 1 - \lambda & 1 \\ 1 & -\lambda \end{vmatrix} = \lambda^2 - \lambda - 1 = 0$$

Plugging this into the quadratic formula with a = 1, b = -1 and c = -1, you'll get the following two solutions, which are our two eigenvalues:

$$\lambda_1 = \frac{1 + \sqrt{5}}{2}, \qquad \lambda_2 = \frac{1 - \sqrt{5}}{2}$$

For a quick check that these are right, note that the trace of A is 1. The sum of the two eigenvalues is also 1. Since the eigenvalues must always sum to the trace, we've got the right values here.

The next step is to find the two eigenvectors that correspond to the eigenvalues. Let's start with $\lambda_1 = \frac{1 + \sqrt{5}}{2}$. To do this, we write the following:

$$(A - \lambda_1 I)x = \begin{bmatrix} 1 - \lambda_1 & 1 \\ 1 & -\lambda_1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

This implies that $(1 - \lambda_1) x_1 + x_2 = 0$. Our "free" variable here is x2, so we'll let that equal 1. Then we can solve for x1. We get the following:

$$x_1 = \frac{1}{\lambda_1 - 1} = \frac{2}{\sqrt{5} - 1}$$

Using the old algebra trick for the difference of squares -- that is, $(a - b)(a + b) = a^2 - b^2$ -- we can simplify this by multiplying both numerator and denominator by $\sqrt{5} + 1$ as follows:

$$x_1 = \frac{2}{\sqrt{5} - 1} \cdot \frac{\sqrt{5} + 1}{\sqrt{5} + 1} = \frac{2 (\sqrt{5} + 1)}{4} = \frac{1 + \sqrt{5}}{2}$$

So our first eigenvector v1 is equal to:

$$v_1 = \begin{bmatrix} \frac{1 + \sqrt{5}}{2} \\ 1 \end{bmatrix}$$

Following the same process for $\lambda_2 = \frac{1 - \sqrt{5}}{2}$ gives us the other eigenvector:

$$v_2 = \begin{bmatrix} \frac{1 - \sqrt{5}}{2} \\ 1 \end{bmatrix}$$

Note that the vectors v1 and v2 are orthogonal to each other. That is, the dot product between them is zero. This comes from a basic theorem of linear algebra: every symmetric matrix will have orthogonal eigenvectors. Since A is symmetric, our two eigenvectors are perpendicular to each other.

Now that we have the eigenvalues and eigenvectors, we can write the S and $\Lambda$ matrices as follows:

$$S = \begin{bmatrix} \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \\ 1 & 1 \end{bmatrix}, \qquad \Lambda = \begin{bmatrix} \frac{1 + \sqrt{5}}{2} & 0 \\ 0 & \frac{1 - \sqrt{5}}{2} \end{bmatrix}$$

To complete our model, we also need to know the inverse of the S matrix. Thankfully, there's a simple formula for the inverse of a 2 x 2 matrix. If A is given by:

$$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$$

then the inverse of A is found by swapping the diagonal terms, putting a negative sign on the off-diagonal terms, and multiplying by 1/(determinant of A), or

$$A^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$$

Using that formula, and noting that the determinant of S is $\lambda_1 - \lambda_2 = \sqrt{5}$, we get:

$$S^{-1} = \frac{1}{\sqrt{5}} \begin{bmatrix} 1 & \frac{\sqrt{5} - 1}{2} \\ -1 & \frac{1 + \sqrt{5}}{2} \end{bmatrix}$$

As explained in our earlier post, our final step is to write our initial state vector u0 as a combination of the columns of S. That is, we can write:

$$u_0 = S c$$

where c is a 2 x 1 vector of scalars. Solving for c, we get:

$$c = S^{-1} u_0$$

For this example, let's let our initial state vector u0 be (1,1). These are the second and third terms of the Fibonacci sequence. Note that you can use any two subsequent terms for this step -- I'm just using (1,1) because I like the way the math works for it. So we have:

$$c = S^{-1} u_0 = \frac{1}{\sqrt{5}} \begin{bmatrix} 1 & \frac{\sqrt{5} - 1}{2} \\ -1 & \frac{1 + \sqrt{5}}{2} \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \frac{1}{\sqrt{5}} \begin{bmatrix} \frac{1 + \sqrt{5}}{2} \\ \frac{\sqrt{5} - 1}{2} \end{bmatrix}$$

Since $\frac{1 + \sqrt{5}}{2} = \lambda_1$ and $\frac{\sqrt{5} - 1}{2} = -\lambda_2$, that means

$$c = \frac{1}{\sqrt{5}} \begin{bmatrix} \lambda_1 \\ -\lambda_2 \end{bmatrix}$$

Putting everything together, we can write our final model for the Fibonacci sequence:

$$u_k = S \Lambda^k c = \frac{1}{\sqrt{5}} \begin{bmatrix} \lambda_1 & \lambda_2 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} \lambda_1^k & 0 \\ 0 & \lambda_2^k \end{bmatrix} \begin{bmatrix} \lambda_1 \\ -\lambda_2 \end{bmatrix}$$

Multiplying this out, we get the following extensive form of the model:

$$\begin{bmatrix} F_{k+3} \\ F_{k+2} \end{bmatrix} = \frac{1}{\sqrt{5}} \begin{bmatrix} \lambda_1^{k+2} - \lambda_2^{k+2} \\ \lambda_1^{k+1} - \lambda_2^{k+1} \end{bmatrix}$$

This equation gives the k+3 and k+2 terms of the Fibonacci sequence as a function of just one variable: k. This allows us to easily find any term we'd like -- just plug in k. For example, imagine we want the 100th term. Simply let k = 98, and solve the above for F101 and F100. This is a huge number, by the way -- 218,922,995,834,555,000,000 -- something you can easily verify in this Excel spreadsheet. (Note that whether this is the 99th or 100th term depends on whether you label 0 or 1 to be the first term of the sequence; here I've made zero the first term, but many others use 1.)
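You can check the eigenvalue formula against a direct iterative computation (a quick script of my own; the floating-point version stays exact until rounding error bites at large k):

```python
import math

sqrt5 = math.sqrt(5)
l1 = (1 + sqrt5) / 2      # first eigenvalue
l2 = (1 - sqrt5) / 2      # second eigenvalue

def fib_model(k):
    """F(k+3) from the eigenvalue model: (l1^(k+2) - l2^(k+2)) / sqrt(5)."""
    return round((l1 ** (k + 2) - l2 ** (k + 2)) / sqrt5)

def fib_iter(m):
    """The m-th Fibonacci term with F1 = 0, F2 = 1, computed directly."""
    a, b = 0, 1
    for _ in range(m - 1):
        a, b = b, a + b
    return a

# The model's F(k+3) should match the directly computed (k+3)-th term.
print(all(fib_model(k) == fib_iter(k + 3) for k in range(1, 40)))
```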

So what happens to the above system as k grows very large? Do the terms in the Fibonacci sequence display some regular pattern as we move outward?

All changes from one term to the next are determined by k in the above model. Imagine k grows very large. In the second term in the equation, we can see that since $|\lambda_2| = \left|\frac{1 - \sqrt{5}}{2}\right| \approx 0.618 < 1$, raising it to higher and higher powers drives it toward zero:

$$\lambda_2^k \to 0 \quad \text{as} \quad k \to \infty$$

That means that as k gets bigger, the second term in the equation goes to zero. That leaves only the first term. That means as k grows large,

$$\begin{bmatrix} F_{k+3} \\ F_{k+2} \end{bmatrix} \approx \frac{1}{\sqrt{5}} \begin{bmatrix} \lambda_1^{k+2} \\ \lambda_1^{k+1} \end{bmatrix}$$

So as k grows large, the ratio of two subsequent terms approaches:

$$\frac{F_{k+3}}{F_{k+2}} \to \frac{\lambda_1^{k+2}}{\lambda_1^{k+1}} = \lambda_1 = \frac{1 + \sqrt{5}}{2} \approx 1.618$$

This is a pretty amazing result. As we move far out in the Fibonacci sequence, the ratio of two subsequent terms approaches a constant. And that constant is equal to the first eigenvalue of our linear system above.

More importantly, this value is also equal to the famous "golden ratio", which appears in myriad forms throughout Western art, architecture, music and more. In the limit, the ratio of subsequent terms in the Fibonacci sequence equals the golden ratio -- something that's not obvious at all, but which we can verify analytically using our model above.
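And here's the limit in action (a quick check of my own): the ratio of consecutive Fibonacci terms converges to $(1 + \sqrt{5})/2 \approx 1.618$.

```python
import math

phi = (1 + math.sqrt(5)) / 2   # the golden ratio, our first eigenvalue

a, b = 0, 1                    # start of the Fibonacci sequence
for _ in range(50):
    a, b = b, a + b

print(b / a, phi)              # the ratio of consecutive terms ~ 1.618...
```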

If you'd like to see how this works in a spreadsheet, here's an Excel file where you can plug in values for k and find the k+2 and k+1 terms in the sequence.

In this post I'll walk through a simple proof showing that the Poisson distribution is really just the binomial with n approaching infinity and p approaching zero.

The binomial distribution works when we have a fixed number of trials n, each with a constant probability of success p. But imagine we don't know the number of trials that will happen. Instead, we only know the average number of successes per time period. So we know the rate of successes per day, but not the number of trials n or the probability of success p that led to that rate.

Define a number $\lambda$. Let this be the rate of successes per day. It's equal to np. That's the number of trials n -- however many there are -- times the chance of success p for each of those trials. Think of it like this: if the chance of success is p and we run n trials per day, we'll observe np successes per day on average. That's our observed success rate lambda.

Recall that the binomial distribution looks like this:

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$

As mentioned above, let's define lambda as follows:

$$\lambda = n p$$

Solving for p, we get:

$$p = \frac{\lambda}{n}$$

What we're going to do here is substitute this expression for p into the binomial distribution above, and take the limit as n goes to infinity, and try to come up with something useful. That is,

$$\lim_{n \to \infty} P(X = k) = \lim_{n \to \infty} \binom{n}{k} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n - k}$$

Pulling out the constants $\lambda^k$ and $\frac{1}{k!}$, and splitting the term on the right that's to the power of (n-k) into a term to the power of n and one to the power of -k, we get

$$\frac{\lambda^k}{k!} \lim_{n \to \infty} \frac{n!}{(n - k)! \, n^k} \left(1 - \frac{\lambda}{n}\right)^{n} \left(1 - \frac{\lambda}{n}\right)^{-k}$$

Now let's take the limit of this right-hand side one term at a time. We'll do this in three steps. The first step is to find the limit of

$$\lim_{n \to \infty} \frac{n!}{(n - k)! \, n^k}$$

In the numerator, we can expand n! into n terms of (n)(n-1)(n-2)...(1). And in the denominator, we can expand (n-k)! into n-k terms of (n-k)(n-k-1)(n-k-2)...(1). That is,

$$\frac{n!}{(n - k)! \, n^k} = \frac{n (n-1)(n-2) \cdots (1)}{(n-k)(n-k-1)(n-k-2) \cdots (1) \; n^k}$$

Written this way, it's clear that many of the terms on the top and bottom cancel out. The (n-k)(n-k-1)...(1) terms cancel from both the numerator and denominator, leaving the following:

$$\frac{n (n-1)(n-2) \cdots (n-k+1)}{n^k}$$

Since we canceled out n-k terms, the numerator here is left with k terms, from n down to n-k+1. So this has k terms in the numerator, and k terms in the denominator since n is raised to the power of k. Expanding out the numerator and denominator we can rewrite this as:

$$\left(\frac{n}{n}\right) \left(\frac{n-1}{n}\right) \left(\frac{n-2}{n}\right) \cdots \left(\frac{n-k+1}{n}\right)$$

This has k terms. Clearly, every one of these k terms approaches 1 as n approaches infinity. So we know this portion of the problem just simplifies to one. So we're done with the first step.

The second step is to find the limit of the term in the middle of our equation, which is

$$\lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^{n}$$

Recall that the definition of e = 2.718... is given by the following:

$$e = \lim_{x \to \infty} \left(1 + \frac{1}{x}\right)^{x}$$

Our goal here is to find a way to manipulate our expression to look more like the definition of e, which we know the limit of. Let's define a number x as $x = -\frac{n}{\lambda}$, so that $1 - \frac{\lambda}{n} = 1 + \frac{1}{x}$ and $n = -\lambda x$. (As n goes to infinity, x goes to negative infinity, where the limit defining e holds as well.) Now let's substitute this into our expression and take the limit as follows:

$$\lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^{n} = \lim_{x \to -\infty} \left[\left(1 + \frac{1}{x}\right)^{x}\right]^{-\lambda} = e^{-\lambda}$$

This term just simplifies to $e^{-\lambda}$. So we're done with our second step. That leaves only one more term for us to find the limit of. Our third and final step is to find the limit of the last term on the right, which is

$$\lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^{-k}$$

This is pretty simple. As n approaches infinity, this term becomes 1^(-k) which is equal to one. And that takes care of our last term.

Putting these three results together, we can rewrite our original limit as

$$\lim_{n \to \infty} \binom{n}{k} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n - k} = \frac{\lambda^k}{k!} \cdot 1 \cdot e^{-\lambda} \cdot 1$$

This just simplifies to the following:

$$P(k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

This is equal to the familiar probability mass function for the Poisson distribution, which gives us the probability of k successes per period given our parameter lambda. So we've shown that the Poisson distribution is just a special case of the binomial, in which the number of trials n grows to infinity and the chance of success in any particular trial approaches zero. And that completes the proof.
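Here's a numeric illustration of the convergence (a short script of my own): hold $\lambda = np$ fixed, let n grow, and watch the binomial probabilities approach the Poisson ones.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson probability of k successes with rate lambda."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam, k = 3.0, 4
for n in (10, 100, 10_000):
    print(n, binom_pmf(k, n, lam / n), poisson_pmf(k, lam))
# As n grows, the binomial column converges to the Poisson value.
```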