Tuesday, 12 December 2006

Benford's Law

Start with a group of numbers - say, the lengths of all the rivers in the UK, the numbers mentioned on the front page of a newspaper, or the populations of villages and towns in England. Then take the first digit of each of these numbers. You would probably assume that each of the digits 1 to 9 would appear with equal frequency in the resulting list. (Although that wasn't true of the three people in my family that I asked about this!)

What actually happens is that the digit 1 occurs much more frequently than the others; in fact about 30% of these initial digits will be 1s. The digit 2 appears less often, at about 18%, and so on down to 9, which accounts for only about 5% of the numbers.
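To try the tallying step yourself, here is a minimal Python sketch. The data set is an assumption chosen for convenience, not one of the examples above: the first 100 powers of 2, which are known to follow this pattern closely. The names `leading_digit`, `numbers` and `counts` are just illustrative.

```python
from collections import Counter


def leading_digit(n: int) -> int:
    """Return the first decimal digit of a positive whole number."""
    return int(str(n)[0])


# Stand-in data set (an assumption, not from the post): 2**0 up to 2**99.
numbers = [2 ** k for k in range(100)]

# Count how many of the numbers start with each digit.
counts = Counter(leading_digit(n) for n in numbers)

for digit in range(1, 10):
    share = counts[digit] / len(numbers)
    print(f"{digit}: {share:.0%}")
```

With this particular stand-in list, the digit 1 leads 30 of the 100 values, and the counts tail off towards 9.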

Even stranger, perhaps, is that these results are "scale invariant": it doesn't matter what units are used in the initial sample. For example, you would see the same results whether your river lengths were in miles, kilometres or cubits.
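A rough way to see the scale invariance is to convert a list of lengths from miles to kilometres and compare the first-digit shares before and after. The sketch below uses made-up lengths spread evenly on a logarithmic scale from 1 to a million (an assumption standing in for real river data), and the helper names `first_digit` and `digit_shares` are hypothetical.

```python
from collections import Counter


def first_digit(x: float) -> int:
    """First significant digit of a positive number, read off its scientific notation."""
    return int(f"{x:.15e}"[0])


def digit_shares(values):
    """Share of the values starting with each digit 1 to 9."""
    counts = Counter(first_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}


# Synthetic "lengths in miles" (assumption): evenly spaced on a log scale, 1 to 10**6.
miles = [10 ** (k / 1000) for k in range(6000)]

KM_PER_MILE = 1.609344
kilometres = [m * KM_PER_MILE for m in miles]

in_miles = digit_shares(miles)
in_km = digit_shares(kilometres)

for d in range(1, 10):
    print(f"{d}: miles {in_miles[d]:.1%}   km {in_km[d]:.1%}")
```

For this synthetic list the two columns should agree to within a fraction of a percentage point. Real data will be noisier, but the overall shape should survive the change of units.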

In fact, the expected proportion of numbers starting with the digit n is ln(1 + 1/n) / ln(10), which is the same as the base-10 logarithm of (1 + 1/n). There's a pretty good description of how this formula is derived in Plus magazine.
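The formula is easy to tabulate. This short Python sketch evaluates it for each digit from 1 to 9; the function name `benford_share` is just an illustrative choice.

```python
import math


def benford_share(n: int) -> float:
    """Expected share of numbers whose first digit is n (1 to 9)."""
    return math.log(1 + 1 / n) / math.log(10)


expected = {n: benford_share(n) for n in range(1, 10)}

for n, share in expected.items():
    print(f"{n}: {share:.1%}")
```

The nine shares add up to 1, because the sum of the logarithms collapses to ln(10) / ln(10).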

I tried it out on a couple of sets of data. Here are the results for the total amounts of all the transactions in my church's accounts in 2005:

A pretty close fit. Even better when the amounts are converted from pounds to euros:

Another example. Here are the results from the numbers of people in each UK area who declared that their religion was Jedi in the 2001 census:

This effect was apparently first noticed in 1881, by the astronomer Simon Newcomb, but the law is named after Frank Benford, who stated it in 1938.

Note that you have to pick your data set correctly. The law doesn't apply to numbers spread evenly over a fixed range (say, picked at random between 1 and 999), for which each digit has the same probability of occurring first. Nor does it apply when the data set is highly constrained. For example, if the height of hills is defined to be between 300 and 999 feet then the initial digits 1 and 2 are excluded by definition.
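Both exceptions can be checked with a few lines of Python. As a simplifying assumption, the sketch below uses every whole number in a range as a stand-in for evenly spread values; `leading_digit_counts` is a hypothetical helper name.

```python
from collections import Counter


def leading_digit_counts(numbers):
    """Count how many of the positive whole numbers start with each digit."""
    return Counter(int(str(n)[0]) for n in numbers)


# Evenly spread values: every whole number from 1 to 999.
uniform = leading_digit_counts(range(1, 1000))

# Constrained values: "hill heights" limited to 300 to 999 feet.
hills = leading_digit_counts(range(300, 1000))

for d in range(1, 10):
    print(f"{d}: uniform {uniform[d]}   hills {hills[d]}")
```

In the first list every digit leads exactly 111 numbers, and in the second the digits 1 and 2 never appear at all, so neither looks anything like the 30%-for-1 pattern.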