If you ever plan to cheat on your taxes, here’s something to consider (besides prison): Make sure that most of the numbers you fabricate start with the digit 1 (one). The second-most common leading digit should be 2, then 3, continuing on that pattern to leave 9 as the least common leading digit. This distribution is called *Benford’s Law*, and it’s a lot more straightforward than tax law… though *why* it exists is nearly as mysterious.

In a highly variable set of numbers such as those found in taxes, one would think that the leading digits would all be equally common. One would expect to find roughly the same amount of numbers starting with a 1 as, say, an 8. In a set of *totally* random numbers such as the lottery, that is exactly what one would discover; but when it comes to non-random real-life numbers, unless the data set is too constrained, a lot more numbers start with a one than any other digit. This can be useful in many ways.

The Internal Revenue Service runs our tax returns through software that checks whether the numbers follow Benford’s Law, and any time a return wanders too far from that number distribution, they know there’s a pretty good chance somebody’s pulling a fast one. The software raises flags to indicate that the return should be scrutinized further.
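The article does not say how such software measures "too far", so the following is only a minimal sketch of one possible check. It assumes the standard Benford formula, log10(1 + 1/d), and uses a mean absolute deviation with a hypothetical cutoff of 0.015. The function names and the threshold are illustrative assumptions, not the IRS's method.

```python
import math

# Expected share of each leading digit under Benford's Law: log10(1 + 1/d).
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def mean_absolute_deviation(counts):
    """Average gap between observed and expected leading-digit shares.

    `counts` maps each digit 1-9 to how many values start with that digit.
    """
    total = sum(counts.get(d, 0) for d in range(1, 10))
    if total == 0:
        raise ValueError("no values to test")
    gaps = (abs(counts.get(d, 0) / total - BENFORD[d]) for d in range(1, 10))
    return sum(gaps) / 9


def looks_suspicious(counts, threshold=0.015):
    # The threshold is an illustrative assumption, not an official cutoff.
    return mean_absolute_deviation(counts) > threshold


# Made-up figures whose leading digits are spread evenly across 1-9.
evenly_spread = {d: 100 for d in range(1, 10)}
print(looks_suspicious(evenly_spread))
```

A real audit tool would presumably also account for sample size, since small sets of numbers stray from the prediction by chance alone.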

You can test Benford’s Law yourself by finding any non-random list of numbers that isn’t too specific a set. A good example would be the lengths of all of the major rivers in the U.S., or the sizes of all the files on your computer (but make sure you use *actual* file sizes, not Windows’ rounded-off values). An example of a set that is too constrained would be the ages of all of your friends. The threshold where a set becomes too constrained is a bit fuzzy, but until it is crossed, Benford’s Law predicts the frequency of leading numbers with very reasonable accuracy. The larger the data set, the more closely it should match.
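For readers who want to try the file-size test, here is a rough sketch in Python. It walks a directory, reads exact sizes in bytes, and tallies leading digits. The helper names are made up for this example, and skipping empty files is a choice of this sketch, since a size of zero has no leading digit from 1 to 9.

```python
import os
from collections import Counter


def leading_digit(n):
    """First decimal digit of a positive integer, e.g. 1945344 -> 1."""
    return int(str(n)[0])


def file_sizes(root):
    """Yield the exact size in bytes of every non-empty file under `root`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # skip files that vanish or cannot be read
            if size > 0:
                yield size


def leading_digit_shares(values):
    """Map each digit 1-9 to the fraction of values that start with it."""
    counts = Counter(leading_digit(v) for v in values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {d: counts[d] / total for d in range(1, 10)}
```

Calling `leading_digit_shares(file_sizes(some_directory))`, where `some_directory` is whatever path you want to test, returns the observed share for each digit.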

As a test, I checked the sizes of all 3,124 items in a particular directory on my computer. Here is the distribution I found, along with the percentages of each that Benford’s Law predicts:

Leading Digit | Occurrences | Frequency | Benford’s Law |
---|---|---|---|
1 | 854 | 27.3% | 30.1% |
2 | 619 | 19.8% | 17.6% |
3 | 417 | 13.3% | 12.5% |
4 | 324 | 10.4% | 9.7% |
5 | 261 | 8.4% | 7.9% |
6 | 195 | 6.2% | 6.7% |
7 | 158 | 5% | 5.8% |
8 | 154 | 4.9% | 5.1% |
9 | 142 | 4.5% | 4.6% |
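The article gives the predicted percentages without saying where they come from. They match the usual statement of Benford's Law, in which the share of values with leading digit d is log10(1 + 1/d). The sketch below recomputes that column next to the observed counts from the table; the variable names are illustrative.

```python
import math


def benford_percent(digit):
    """Predicted share (in percent) of values whose first digit is `digit`."""
    return 100 * math.log10(1 + 1 / digit)


# Observed counts copied from the table above (3,124 items in total).
observed = {1: 854, 2: 619, 3: 417, 4: 324, 5: 261, 6: 195, 7: 158, 8: 154, 9: 142}
total = sum(observed.values())

for digit, count in observed.items():
    print(f"{digit} | {count} | {100 * count / total:.1f}% | {benford_percent(digit):.1f}% |")
```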

Clearly these findings follow Benford’s Law very closely, as will any large set of real-life numbers. The phenomenon was first noticed in 1881 by mathematician and astronomer Simon Newcomb. While thumbing through some logarithm books to perform calculations, he noticed that the pages for numbers that began with the digit 1 were much more worn than the others. It seemed that people had been doing more calculations with numbers that started with a 1. He examined the number distribution and found it interestingly weighted towards smaller numbers as leading digits. He wrote about this pattern as a curiosity, but it was soon forgotten.

The phenomenon was re-discovered in 1938 by Frank Benford, a physicist at the General Electric company. He was fascinated by it, and tested many data sets for the patterns, including baseball statistics, areas of river catchments, and the addresses of the first 342 people listed in the book American Men of Science, and found most to follow the distribution closely. Because of the huge menagerie of data he tested, he is often credited for the law, as its name indicates.

Benford’s Law is proving quite useful for businesses and government agencies as a way of detecting fraud in taxes, accounting, expenses, and insurance. It was also used to help check for Y2K compliance back when that was a problem. No doubt there are many other uses that no one has thought of just yet.

So how is Benford’s Law helpful to you? For an individual, its uses are not many. If you ever see a question about the length of a river on a multiple-choice test, and you haven’t got a clue, you might lean toward the answer that has a 1 as the first digit. But I don’t recommend using it to cheat anyone, especially the IRS.

Unless I’m missing something, this makes perfect sense and is not that mysterious. When you’re talking about something like street addresses, they are linear and go up in value, so they may end at 5500 or 375, but 1 will always be the first number you come to, then 2, then 3, etc. —

Renny

You’re missing something. Benford’s Law says that in many sets of real-life numbers, those numbers that start with the digit 1 are most common, whether they’re one digit long or ten. “A number that starts with one” is not a reference to the counting order, but the digits themselves. For example, 1,945,344 starts with one, and 8,654 does not.

Okay, yeah, I rethought it and found my logic to be way off. I’ll have to think about this one some more. Once I understood the Monty Hall problem it was like a light bulb went off in my head. Hopefully I’ll understand this one.

Renny

In the big world of gathering numbers there could be a colluding factor with Benford’s Law attributed to a little-known trade secret among those who apply recognition software; specifically Intelligent Character Recognition (ICR) software used to interpret human handwriting.

ICR is best used to extract numerical data from scanned and imaged documents such as forms because numbers have a smaller set of handwritten patterns than letters, people naturally write numbers more clearly and they habitually space numbers thus avoiding collisions giving recognition algorithms a better chance of success. But that’s not good enough for a programmer if his or her job is measured by the total percentage of correct characters captured from a test deck before acceptance by a customer. No, they play Benford’s Law to skew success rates and enhance processing performance in their favor. As an example, the numbers “1” and “7” could be interpreted as near-similar and flag for verification. But if the system is weighted to the lower-valued digit when this happens, then the vote goes to “1” instead of “7” which follows the probability outcome of Benford’s Law. This conveniently happens with other numbers with equipotent patterns such as “0” and “6”; “3” and “8”; “4” and “9”; “5” and “6”; “7” and “9”. Data checking software identifies aberrations through statistical methods, like the number distribution of an IRS form, after extraction. By that time numbers may have already been vetted with Benford’s heavy dice.

I understand what Benford’s Law is saying, and Google turns up many confirming sites. Here’s what I find puzzling: suppose you take a table of data (such as the lengths of rivers in miles). Assume Benford’s Law applies and the leading digit is 1 30% of the time. If you then convert the table to some other unit of measurement (km, feet…), it would seem to me that the leading digit could not be 1 30% of the time. What do you think?

Benford’s Law is scale invariant, meaning that the distribution of leading digits is independent of the measuring system used. Switching between miles, kilometers, fathoms, etc. will not alter the ratio. If it did, it wouldn’t be a true law of digit frequency.

I believe, however, that using extremes as measuring units can alter the ratio… for instance, if you measured the lengths of all rivers in the U.S. in, say, lightyears, they would ALL have a zero for the leading digit.

Well, that’s one way to put it… nice, Alan.

Alan – Even in the cases of extremes, the numbers would follow Benford’s Law if you used scientific notation. Instead of 0.00000164 light years it would be 1.64*(10^-6).

Charlie – While changing the scale would make a lot of numbers that had begun with 1 start with a new digit, there would be just as many numbers that had started with a different digit that would then lead with a 1 after the scale change.
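The scale-invariance claim and the scientific-notation point can both be checked with a small simulation. This sketch uses synthetic data rather than real river lengths: the "lengths" are spread evenly on a logarithmic scale across six orders of magnitude, which is an assumption chosen because such data follow Benford's Law. The leading digit is read from scientific notation, so tiny values in light years still have a first digit from 1 to 9.

```python
import random
from collections import Counter


def first_digit(x):
    """First significant digit of a positive number, read from scientific notation."""
    return int(f"{x:.16e}"[0])


def digit_shares(values):
    counts = Counter(first_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}


rng = random.Random(0)  # fixed seed so the run is repeatable
# Synthetic "river lengths" in miles, spread evenly on a log scale from 1 to 1,000,000.
miles = [10 ** rng.uniform(0, 6) for _ in range(100_000)]
kilometers = [m * 1.609344 for m in miles]
light_years = [m / 5.879e12 for m in miles]  # approximate miles per light year

for name, data in [("miles", miles), ("km", kilometers), ("light years", light_years)]:
    print(name, round(digit_shares(data)[1], 3))
```

In all three units the share of values starting with 1 should land near 30%, which is the behavior Charlie describes: numbers that leave the "starts with 1" group are replaced by others entering it.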

Multiple3 said: “Alan – Even in the cases of extremes, the numbers would follow Benford’s Law if you used scientific notation. Instead of 0.00000164 light years it would be 1.64*(10^-6).”

Good point.

0.00000164 lightyears = 9,640,739.7 miles… that’s one LONG river.

I believe it is because in any series of whole numbers, 1 comes first. Hence a series from 1 to 20 will contain 11 numbers that start with 1 (1 & 10-19) or 55%. 1 to 50 would be about 22%. Averaged out over series ending 1-100 I get about 25%. I can’t explain the difference right off hand, but my guesstimation is close.
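The counting in this comment is easy to reproduce. The sketch below computes the share of whole numbers from 1 to some upper limit that start with 1, then averages that share over every series ending between 1 and a chosen cutoff. The function names are illustrative, and the averaged figure depends on the cutoff you pick.

```python
def starts_with_one(n):
    return str(n)[0] == "1"


def share_starting_with_one(upper):
    """Fraction of the whole numbers 1..upper whose first digit is 1."""
    hits = sum(1 for n in range(1, upper + 1) if starts_with_one(n))
    return hits / upper


def average_share(max_upper):
    """Average that fraction over the series 1..1, 1..2, ..., 1..max_upper."""
    shares = (share_starting_with_one(u) for u in range(1, max_upper + 1))
    return sum(shares) / max_upper


print(share_starting_with_one(20))  # 11 of 20 numbers
print(share_starting_with_one(50))  # 11 of 50 numbers
print(average_share(100))
```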

Maybe with Binford’s law.

There are 12 numbers between 1 and 100 that begin with the number 1. That is slightly more than average, but there are only eleven when we are talking about 1 to 99: eleven occurrences, or 11%. The same goes for 2, 3, 4, and 5. No digit is represented more or less than 11%.

two pair said: No digit is represented more or less than 11%.

The difference is that a lot of these things are happening incrementally. Just because ‘tens’ have come into play does not mean that all tens are as likely. There could just as easily be 1-2 or 1-10 or 1-20 distributions as there are 1-99 distributions.

Ok, so I understand the concept and everything, but I do have one thing that confuses me. You are all using rivers as examples of a set of data. Well, it seems to me the lengths of rivers would be in the same boat as a set of totally random numbers. It seems that the law only applies to “man-made” sets and not things of nature. I’m not saying that some “god” created them so they are random, but rather that they occur randomly. Hmm, chaos theory… I was just kind of blabbering there, but someone else tell me what you think.

This makes a lot more sense if you don’t think of the numbers, in this case, as being symbolic. The ‘real world’ is the important part of the pattern because the numbers to the left are place holders of increasingly large value. So it makes sense that the units used to measure things would tend to bucket them with a number that has a lower number on the left side. I would guess that if you measured the size of rivers in inches, rather than miles, you would find the distribution got less extreme. I don’t have the data to do it here, but it would be an interesting experiment.

The reason it works for something natural like rivers (or mountains), but not for truly random numbers like all the numbers from 1 to 100, is that Benford’s Law only applies to data sets that have a normal “bell curve” distribution. In other words, if higher numbers are more rare than lower numbers, you can apply the law.

Think about rivers as an example. The odds of a river stopping for some reason (say running into a hill, a large body of water, an underground aquifer, etc.) at any given point are more or less equal. Therefore, we can think of a river as rolling a pair of dice until we get a snake eyes (two ones), with the length of time we roll being equivalent to the length of the river.

Every time we roll, there is an equal opportunity to stop. We start with 1 roll, and need to roll 100% more times in order to get to the next number. If we get to 2 rolls, we only need 50% more rolls to get to the next number. By the time we get to 9 rolls, we only need to get 10% more rolls to get to the next number, and since there is a 90% chance that we aren’t in the last 10% of our rolls, there is a 90% chance that we get to the next number, which starts with a 1. Now that we are at 10, we need 100% more rolls again to get to the next starting digit.

Benford’s law comes from the strange fact that, statistically speaking, it is very unlikely that any given position in space or time (or money) is special. That is, using time as an example, it is very unlikely to be at the very end or the very beginning of something. This is quite obvious when you realize that the odds of not being in the first 5% or the last 5% of something are 90%.

If we are in the last 50% of rolls, we will stop on a number with 1 as the first digit, and the odds of that happening are 50%. Once we get to 20, we would need to be in the last 33% of rolls to stop on a number starting with two, so the odds of stopping on a number starting with two are 33%. Once we get to 30, the odds of stopping with a 3 are 25%. And so on.

I found a great explanation I’ll copy here (link below):

Dow Illustrates Benford’s Law

To illustrate Benford’s Law, Dr. Mark J. Nigrini offered this example:

“If we think of the Dow Jones stock average as 1,000, our first digit would be 1.

“To get to a Dow Jones average with a first digit of 2, the average must increase to 2,000, and getting from 1,000 to 2,000 is a 100 percent increase.

“Let’s say that the Dow goes up at a rate of about 20 percent a year. That means that it would take five years to get from 1 to 2 as a first digit.

“But suppose we start with a first digit 5. It only requires a 20 percent increase to get from 5,000 to 6,000, and that is achieved in one year.

“When the Dow reaches 9,000, it takes only an 11 percent increase and just seven months to reach the 10,000 mark, which starts with the number 1. At that point you start over with the first digit a 1, once again. Once again, you must double the number — 10,000 — to 20,000 before reaching 2 as the first digit.

“As you can see, the number 1 predominates at every step of the progression, as it does in logarithmic sequences.”

——————————

-from http://www.rexswain.com/benford.html
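The Dow illustration can be turned into a small calculation. This sketch assumes steady compound growth, so the year counts come out somewhat shorter than the round figures in the quote (20 percent compounding doubles a value in under four years, not five). Under that assumption, the fraction of time spent with each leading digit works out to log10(1 + 1/d), whatever the growth rate. The function name and default rate are illustrative.

```python
import math


def years_with_leading_digit(digit, annual_growth=0.20):
    """Years a steadily compounding value takes to go from digit*10^k to (digit+1)*10^k."""
    return math.log((digit + 1) / digit) / math.log(1 + annual_growth)


years = {d: years_with_leading_digit(d) for d in range(1, 10)}
decade = sum(years.values())  # time to grow tenfold, e.g. from 1,000 to 10,000

for d, y in years.items():
    print(f"first digit {d}: {y:.2f} years, {y / decade:.1%} of the time")
```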

This has been a very interesting discussion. I wonder what would happen if a different number base was used, say octal or hexadecimal. Obviously the distribution would have to shift since there would be fewer (or more) first digits to choose from. Suppose everything was converted into binary: would the odds converge on 50/50, or would 1 as a leading digit still be more common?

hehe. with binary everything other than ‘0’ leads with a 1.

The way I heard it explained was this. You start with a number starting with one, say 100. To become a number starting with 2 it has to increase by 100%, which makes it 200. But for 200 to reach 300 it only has to increase by 50%. For 300 to reach 400 it increases by 33%, and so on until you get to 900. Then 900 has to increase by only a ninth and you’re back to 1000, which has to double to get to 2000. So probability dictates that one will be the digit that most numbers start with.

kRANs80 said: “This has been a very interesting discussion. I wonder what would happen if a different number base was used, say octal or hexadecimal. Obviously the distribution would have to shift since there would be fewer (or more) first digits to choose from. Suppose everything was converted into binary: would the odds converge on 50/50, or would 1 as a leading digit still be more common?”

I think this will work with any number base: the first number will always be the most common number to start a number with. I don’t think the binary question is valid. Binary is not a number system; it is just a way of representing opened or closed valves, and if you converted numbers into binary it would just be a representation of a number system.

What I’d like to know is: is this still true if you multiply the results? Say I take the lengths of all the rivers and then multiply them by two. I know that all the numbers that begin with one would change, but I also know that all the numbers beginning with 5, 6, 7, 8 and 9 would become numbers beginning with one again. Would this law still be maintained?
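The doubling question has a tidy arithmetic answer if the data already follow the standard Benford shares, log10(1 + 1/d). Doubling turns every value that starts with 5 through 9 into one that starts with 1, and turns every value that starts with 1 into one that starts with 2 or 3. The sketch below checks that the shares moving in and out balance; the variable names are illustrative.

```python
import math


def benford(d):
    """Benford's predicted share for leading digit d (standard base-10 formula)."""
    return math.log10(1 + 1 / d)


# Doubling sends values that start with 5-9 into the range that starts with 1...
gained = sum(benford(d) for d in range(5, 10))
# ...while values that started with 1 now start with 2 or 3.
moved_out_to = benford(2) + benford(3)

print(round(gained, 4), round(benford(1), 4), round(moved_out_to, 4))
```

All three printed values agree, so the share of leading 1s is unchanged after doubling and the old leading 1s exactly fill the 2 and 3 slots.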

This makes perfect sense. It’s talking about non-controlled sets of numbers. Obviously 1-100 is a controlled set and there are 11% of each starting digit.

But the set could be 1-452, in which case you have 111 numbers starting with each of 1, 2, and 3 (about 24.6% each), 64 starting with 4, and only 11 starting with each of 5 through 9.

If the set was 1-200 you’d obviously have a very high (over 50%) number of numbers starting with 1.

If you average out all possible sets (1-1 through 1-infinity) the number of numbers starting with 1 will average out to 30%…

so 1-1: 1 number starting with 1, 0 with anything else

1-2: 1 number starting with 1, 1 starting with 2

….

1-99: 11 each

…

1-199: 111 1’s, 11 of each other

…

1-999: 111 each

…

etc

add all those up and the odds will work out.

For example the sum of all numbers in all sets from 1-1 to 1-999 is:

498,501 numbers

1 = 95,793, 19.21%

2 = 85,692, 17.18%

3 = 75,591, 15.16%

4 = 65,490, 13.13%

5 = 55,389, 11.11%

6 = 45,288, 9.08%

7 = 35,187, 7.05%

8 = 25,086, 5.03%

9 = 14,985, 3.00%

you can see that even with a controlled set of sets (1-999) the numbers start to approach the predictions.
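The tally in this comment can be recomputed with a short loop. One detail worth noting: the quoted totals (498,501 numbers, 95,793 starting with 1, and so on) are reproduced when the upper bound of the sets runs from 1 to 998, so the sketch uses 998. Running it to 999 gives slightly larger counts. Printed percentages are rounded and may differ from the quoted ones in the last decimal place.

```python
def leading_digit(n):
    return int(str(n)[0])


def totals_over_sets(max_upper):
    """Count leading digits across the sets 1..1, 1..2, ..., 1..max_upper combined."""
    running = {d: 0 for d in range(1, 10)}  # counts within the current set 1..upper
    totals = {d: 0 for d in range(1, 10)}   # counts summed over every set so far
    for upper in range(1, max_upper + 1):
        running[leading_digit(upper)] += 1
        for d in running:
            totals[d] += running[d]
    return totals


totals = totals_over_sets(998)
grand_total = sum(totals.values())
for d, count in totals.items():
    print(f"{d} = {count:,}, {count / grand_total:.2%}")
```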

i dare not comment about the article whilst all these people are arguing its logic. damn interesting though

Binary is an interesting limit case. ALL binary numbers, except 0, start with 1. We are so used to thinking about binary in connection with computers, that we think binary comes in bytes. The binary number 1 is 1, not 00000001.

When you shift the numeric base, the percentages change, but 1 is still the most common. Of course, if you mix things up, say represent hex using binary (0000,0001,etc.), you get 50% 0’s and 50% 1’s.
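The base question can be made concrete with the generalized form of the formula, which replaces log10 with a logarithm in the base being used: the share for leading digit d in base b is log_b(1 + 1/d). That generalization is standard but is not stated in the article, and the function name here is illustrative.

```python
import math


def benford_in_base(base):
    """Predicted leading-digit shares when numbers are written in `base`."""
    return {d: math.log(1 + 1 / d, base) for d in range(1, base)}


for base in (2, 8, 10, 16):
    shares = benford_in_base(base)
    print(base, {d: round(p, 3) for d, p in shares.items()})
```

In base 2 the only possible leading digit is 1, so its share is 100%, which matches the observation that every nonzero binary number starts with 1. In the other bases, 1 remains the most common leading digit.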

Look, this Law is easy to explain. Have you ever seen a Log plot? Check out http://www.science-projects.com/SemiLogUse.htm

Hmm. . notice how the coverage between 1 and 2 is a lot bigger than between 8 and 9? In fact, it’s about 30% of the area from 1 to 10.

You see, this is just a matter of things GROWING, natural or artificial. When a river is forming, it might grow 10% longer in a century. So if it’s 100 miles long, it could easily grow by 10 miles. But a ONE mile long river is not likely to grow by ten miles. So we get a bunch of short rivers and a few long ones, which looks weird on a plot, but nice and uniform on a LOG plot! The log plot has wider 1’s, so the (now uniform) data is more likely to be from 1 to 2 than from 8 to 9.

Am I right here??

chswartz said: “Look, this Law is easy to explain. Have you ever seen a Log plot? Check out http://www.science-projects.com/SemiLogUse.htm

Hmm. . notice how the coverage between 1 and 2 is a lot bigger than between 8 and 9? In fact, it’s about 30% of the area from 1 to 10.

You see, this is just a matter of things GROWING, natural or artificial. When a river is forming, it might grow 10% longer in a century. So if it’s 100 miles long, it could easily grow by 10 miles. But a ONE mile long river is not likely to grow by ten miles. So we get a bunch of short rivers and a few long ones, which looks weird on a plot, but nice and uniform on a LOG plot! The log plot has wider 1’s, so the (now uniform) data is more likely to be from 1 to 2 than from 8 to 9.

Am I right here??”

No!

Feel free to correct me if I’m mistaken.

1) The law does not just apply to rivers. The length of rivers was used as an example; I believe the author started with tax returns…

However, on the subject of rivers…

2) Although I agree that it takes longer to move from 1 to 2 exponentially (or on a log plot) than from 8 to 9, you must remember that our system of measurement is essentially arbitrary. If a river is 1.1 miles long, it will take a long time to reach 2 miles exponentially. But 1.1 miles is also 1.76 km, and it will take a much shorter time to reach “2” by exponential growth. It doesn’t mean that the river is any longer or growing any faster.

In answer to the other person’s comment above, I suspect that if measurements were converted from miles to km or vice versa, the law would still hold.

There are a number of articles on the use of Benford’s Law, in addition to software and tutorials on the use of the law. It has extensive applicability in auditing and forensic accounting.

0.00000164 lightyears is about 1/10 (0.1) of the approximate distance from the Earth to the Sun.

Actually, this Law does NOT work for bell curve distributions. In a perfect bell curve, the median and the mode are the same value. This means high numbers and low numbers are equally rare, with numbers somewhere in between being most common. For instance, look at average height (in feet) for 20-year-old males. It’s a bell curve distribution (or close to one) with 5 as the most common first digit. Sure, you could convert this to meters, and the most common digit would be 1. But that’s just a matter of shifting the curve. If you shifted it again into inches, the most common first digit is 6.

As long as the curve is bell-shaped, the most common first digit will be whatever the first digit of the median is.
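The height example is simple to simulate. The sketch below draws heights from a normal distribution and reports the most common first digit in feet, inches, and meters. The mean of 69 inches and the standard deviation of 3 inches are assumed, illustrative figures, not measured data.

```python
import random
from collections import Counter


def first_digit(x):
    """First significant digit of a positive number."""
    return int(f"{x:.16e}"[0])


def most_common_first_digit(values):
    return Counter(first_digit(v) for v in values).most_common(1)[0][0]


rng = random.Random(1)  # fixed seed so the run is repeatable
# Assumed figures for illustration: mean 69 inches, standard deviation 3 inches.
inches = [rng.gauss(69, 3) for _ in range(10_000)]
feet = [h / 12 for h in inches]
meters = [h * 0.0254 for h in inches]

print(most_common_first_digit(feet),
      most_common_first_digit(inches),
      most_common_first_digit(meters))
```

The most common digit follows the center of the curve in each unit, as the comment argues, instead of settling on the Benford pattern.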

But river length is not a bell curve distribution. Neither are the amounts on a tax form. In both of those distributions, the median and the mode are not the same. If the distribution were unimodal, which it may not be, the mode would not be the same as the median. There would be a large number of rivers a little bit shorter than the median, and a few rivers far longer than the median. The distribution is much more likely to look like this: http://www.gatsby.ucl.ac.uk/~turner/Benford's%20law/Benford%20Tea%20Talk_files/Benfords%20law3.png

This is because short rivers are much more common than long ones. And, similarly, small dollar amounts are much more likely than large ones.

Take a look at that distribution and adjust the scale however you like. The most common data points will begin with 1 in every scale.

I think the problem with your logic here is that your data set is too constrained, which will, as Alan pointed out at the beginning of the article, cause it to fail Benford’s Law. It is not too constrained because it is too small a data set, but because the range is too small. That is, the range between the smallest number and the largest number is too small. According to Wikipedia, the shortest fully grown man is 2’11” while the tallest man in recorded history was 8’11”. Thus the tallest is only about three times the height of the shortest (roughly a 200% increase), regardless of what unit system you use. Compare this to the range in river lengths or yearly incomes, to go back to the two examples so heavily cited above. To run the full range of starting-digit-1 to starting-digit-9 requires a 900% increase in value, again regardless of what unit system is used. Benford’s Law requires that the numbers involved traverse this range at least once; otherwise at least one digit will always have a 0% occurrence in the opening position and the number set will thus immediately fail the test.

So, I think the bell curve distributions will still follow Benford’s Law, but only when the distribution covers a sufficiently wide range of numbers, not just a sufficiently large set. That, it appears, is a key distinction to “too constrained” which it does not seem was made above.

Many universities accept term papers only electronically, and share them with other universities to pattern-match for cheaters.

Also, anyone forced to give a number between 1 and 10 often gives 7, with 3 being the 2nd most likely statistic.

In other news, I had 1,234 scoops of ice cream for dinner last night.

A1c: You had ice-cream for dinner?

You’re my hero

Has anyone heard of the law of least action? It’s a principle that’s seen when calculating various natural phenomena such as the trajectory of light through a multi-substance plane. It’s also a common example problem in calculus. If you have two planes, and you travel at 10 meters per second in plane 1 and 20 in plane 2, what is the trajectory that requires the least time? It’s a minimization problem, and is quite simple. You’ll notice the trend that as the velocities in the two planes become more disparate, the distance traveled in the slower plane becomes smaller (closer to perpendicular with plane 2). And light refracts in a path that allows for the least amount of action. The equation for action is complicated. I don’t remember it, but it takes into account potential and kinetic energy. Anyway, this may be the reason for Benford’s law. Imagine the possible path of light through a substance being quantified in terms of action, with a possible range of 1,000-10,000. Based on the principle of least action, one may say it is most probable that the light will travel with an action of 1,000-1,999, and least probable with 9,000-9,999, with a sliding scale (down in probability) through the digits 1-9. This lines up perfectly with Benford’s law.

It is clever and DI but not so mysterious.

Imagine designing an electronic number display to display some “real life quantity”. If the display currently has 3 digits, you can display from 0 to 999, which is 1000 numbers.

If you now add the ability to display a preceding “1”, you have doubled the range of numbers (0-1999, which is 2000 numbers). If you enable that extra digit to optionally be a “2”, you have only increased your range of numbers by 50%. For 3 it is 33%, for 4 it is 25%, etc. All of this holds no matter how many digits the display had originally. In general, incrementing the upper bound of the first digit yields diminishing returns compared to the range that could already be expressed before the increment.

Since “real life numbers” follow exponential distributions (as opposed to “completely random numbers” which follow uniform distributions over an artificial interval), the percentage by which a digit extends the range is an indication of how likely the digit is to be needed.

Seemed nonsensical until I thought in terms of something no one seems to have mentioned: the range of the data set, the ballpark estimate of your highest and lowest values:

If your numbers seem to be randomly distributed between 10,000 and 100,000 we have approximately 10k numbers that start with each digit. As such we shouldn’t see bias one way or another.

If your numbers are randomly distributed between 10k and 200k, you have around 110k that start with 1 and 10k that start with each of the other digits.
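Both counts can be confirmed by brute force over the whole numbers in each range. This is a minimal sketch with an illustrative function name; it treats each range as running up to, but not including, its upper end.

```python
from collections import Counter


def leading_digit_counts(start, stop):
    """Count leading digits of the whole numbers from start up to stop - 1."""
    return Counter(int(str(n)[0]) for n in range(start, stop))


print(leading_digit_counts(10_000, 100_000))  # every digit appears 10,000 times
print(leading_digit_counts(10_000, 200_000))  # 1 appears 110,000 times, the rest 10,000 each
```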

In practice we’re probably going to see a heavy concentration in some ranges while others become a lot rarer.

I don’t feel inclined to double check file sizes or anything. I will point out that if you have something like a middle school student who’s required to religiously follow a five paragraph essay format, all those essays will have an approximately equal file size, and as such are likely to share a common first digit. The same is true of pictures loaded off a specific digital camera. Or any number of specialized cases.

We could easily end up with several files that are say, 500 megabytes or whatever.

Everything depends on what kind of trends are prevalent in the dataset.