Tuesday, July 22, 2008

One in a Million

I remember discussing multi-threaded programming with a senior engineer the better part of a decade ago. I was reacting rather strongly to his cavalier attitude towards multi-threaded code, deadlocks and race conditions.

"That is such a small hole, it's one in a million that it will happen."

At the time, we were a small company, and I was a young engineer. One in a million, that doesn't sound that big. Of course, one in a million happens more frequently than you would expect. It didn't take very long for our systems to be processing a million transactions a day, and then millions per hour. Every time the holes had to get smaller and smaller.

They still have a fault in there where the entire system crashes when the system goes completely idle for a couple of minutes. It doesn't happen all the time, it's a "one in about 20 million", but it only has to happen once. So, it crashed about once a week. That one took forever to track down. Still haven't been able to fix it entirely, but they know that it's there, and the hole is smaller, probably "1 in a couple billion" now, below the rate that it's restarted for maintenance. Fixed, for now.

It's interesting see that the DNS engineers are learning the same lesson. Compute power has now shifted in favour of the attackers. As we saw with Captcha, it doesn't matter that you only get through rarely - you only have to get through once in a while. Even just guessing a DNS TXID (2^16), you've got a 1 in a 65536 chance of guessing the right answer.

In pure mathematical terms, the likelihood of getting the TXID wrong 400,000 times in a row is 0.2%. In other words, pretty certain.

(1-(1/2^16)))^400000)

Let's look at the 50/50 point, where the likelihood of seeing X number of failures is 0.5.

That turns out to be somewhere around 45000 attempts, simple for loop range. Since I am spoofing packets, I don't care about a response, even better, filling up the pipe will expand the race situation, giving me a larger chance to get through.

Still, it sounds like the attacker has to also be requesting lookups to poison the server. Let's say they take 2ms each, and they're being nice by only doing them one at a time. Even using the large 400k count, it will take at most 15 minutes of concerted effort to poison the cache. The 50/50 point? One minute, 20 seconds.

Look out for those one in a million situations, they happen more often than you'd like.

No comments: