Based on title alone most of you will never read this and think “wow that guy is a moron”. Which is perfectly alright, you are allowed to do that.  But don’t fear, after you are done reading it you will think the same thing.

Most websites out there are using some simple hashing algorithm that is quick for their server to calculate, and is “small” and of non variable size in a database. In may cases this is some derivation of MD5.  Which that model has a completely different problem that I am not going to cover, Hash Cracking.

That’s right folk’s I am going to talk about the issues with Hash Collision!  And now you are closing your tabs, which I don’t blame you for, because lets face it the issue I am talking about is really rare.  Mostly you think I should shut up because MD5 has already been ‘proven’ broken (http://en.wikipedia.org/wiki/MD5#Collision_vulnerabilities). I am not denying that their research doesn’t work, or that they can’t create a colliding set of data, they do and they can. But that issue has nothing to do with passwords, particularly when it comes to MD5.

Keyspace, MD5 has a keyspace of 128 bits. The examples in the link above are 1024 bits long, hence why I say they don’t matter in the case of passwords.  The assumption that you make with a password hashing algorithm is that the algorithm is 1-1 and onto inside the size of the keyspace. Meaning that passwords that are 128 bits or smaller should never have a collision with another password that is 128 bits or smaller.  Which assuming ASCII allows for 18 characters in a password, UTF-8 allows for 16.  I am going to work with UTF-8 because lets face it, it’s gained some popularity in the past couple years online (http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html).

Now if you are still reading you are thinking “Integgroll is an ignoramus passwords only have 95 possible characters that they can use!” And you would be right, there are only 95 of the possible 256 possible characters that appear in UTF-8 that are used for passwords. This means that an amazingly small portion of each character is used, about 37%, which leaves about 63% of the character to never change.  If you get where I am going with this, that means there is a HUGE area of the keyspace that is never used by passwords.

Alright now it is time for some imagery to help sooth some of the math that is coming up. Imagine that you are above the Milky Way, and you can somehow throw half football fields at that area. Each of those spots where a half field lands, that is one password.  If you were to do this for the 16 UTF-8 characters worth of keyspace that you can use in MD5 you would have 37% of the milky way covered in football fields.  Wrap a stadium around THAT Jerry Jones!

If you were to toss another half field at the Milky Way after it had all those fields on it already, there is a 37% chance that it will land in the same place as another field already on the Milky Way.  That is your chances of collision with a password that is 16 characters or less in the MD5 keyspace.  Most of that 37% is wrapped up in passwords that are of good length.  At 11 characters or less there is only about a 1% chance of a collision.  When you get down to the shorter lengths, like say 8 or under, you are looking at 0.144% chance.

Yes, the numbers here are tiny, and the chance that a password and an account will match up with a collision that is much shorter than its length is very very rare in the password space.  But this is a legitimate reason to limit password length, or at least it will be once you can brute force 11-16 character passwords relatively quickly.

My response to this is to not use a single hashing algorithm for authentication. For example, don’t use MD5 or MD5(SHA(BLOWFISH))) use MD5 and also use SHA store them in different fields in that user database, and only authenticate when the hashed password is matched for both algorithms.

 

TLDR: Use more than one form of hash algorithm with the results stored in different fields for authentication. Also Integgroll is a scrub.