Keep From Giving Away Passwords Like Gawker

Some time ago I made a comment on Lifehacker.com. In order to comment I had to create a login.  I used an email account that I mainly use for junk email and a simple password.  In the past week, some hackers got into Gawker’s servers and downloaded among other things the user database.

When you store a password in a database, you are supposed to hash it.  In theory this turns your password into a random string so no one who has access to the database, can actually see the password.  To verify your login, the server takes the password you give it and runs it through the same hash software.  If it matches what is stored in the database then you are given access.

The hash function is one way.  You can easily create the hashed value from the password, but it is difficult or impossible to create the password from the hashed value.  However, if you have a list of the hashes from a bunch of passwords, it is easy to search for a hash and then find its corresponding password.

To prove how great they were, the hackers did this to nearly 200,000 accounts and posted it online.  These accounts were listed with their username, password and email address.  Mine was one of them.  Fortunately, the password is something I haven’t used for several years and I haven’t used it for anything important for about a decade.  Still it is a bit unnerving to see my password out there for everyone to read.

I traced down a few places where the password was still being used–on a bookmark site and another comment account and changed those.  Fortunately, I have been using 1Password for a few years so it was easy to do a search and find the accounts that were still using this password.

More recently I have been using 1Password to create a random password for each site I visit.  This is ideal because a long random password is going to be much more difficult to search for unless they have an enumerated list of every possible combination of keyboard characters of any length. Further, even if someone was able to find the random password that corresponds to the hash, it would only give them access to the account on one website–which they probably had to have access to in order to get to the database in the first place.

Salting Passwords

One thing Gawker/Lifehacker should have done is “salted” the passwords.  This would simply be adding some characters to the beginning or end of each password before running it through the hash function.  When you go to login again, the server simply takes the password you give and adds the same characters to it before running the hash on it again.  This makes the hashes much more difficult to find because it helps make sure that they aren’t common words.  If the hackers don’t know what the salt is or how it is applied, this should make it pretty much impossible to figure out the passwords.

However, in this case, the hackers got access to the Gawker source code as well, so they would have known exactly how any salt was applied before making the hash. The hackers could go through and create hashes for a bunch of common words with the salt characters added to it, but that means they can’t rely on existing databases of hash to password mappings–they would have to calculate everything from scratch.  Still this is possible and not unreasonable–particularly because they would only need to run through a bunch of words once in order to see if any user had used that as a password.

So if you have 1,000,000 users, running such a process against 1,000,000 common passwords will probably give you access to a number of the accounts.  That would require 1,000,000 hash operations using the dictionary file. I did a test and found that 1,000,000 MD5 hashes takes about 25 minutes on my i7 MacBook Pro.  This is just using the command line and a shell script.  I’m guessing that it could be done much faster using a different method, running it on a bigger machine or dividing it up among different computers.

Things can be made much more secure by setting things up so the password to hash mappings can’t be used to search the whole dataset. You want the hackers to need to recalculate the hashes for the entire dictionary file for each user.  This can be done if each password in hashed using a different salt value for each user’s password.

Different Salt Values

This could be as simple as merging the username and passwords together and then creating the password hash from that. This means that in order to compromise the same number of accounts as could be done with the same salt value would now require 1,000,000 X 1,000,000 hash operations.  This would mean significantly more time and/or more machines to run it against all of the passwords. It would take my computer somewhere in the 50 year range to do this.  Still if you got a cluster of 100 computers that are all 10x faster than my laptop, it could be done in around 400 hours.  So it is definitely doable, but it is starting to get much more expensive. Amazon’s high end EC2 instances cost $0.50 to $2.10 per hour so we are looking at costs in the $20,000 or higher range.

Still this isn’t prohibitive–particularly if the hackers are willing to use a smaller dictionary of passwords.  With at 100,000 dictionary of potential passwords the cost on Amazon falls into the $2,000 or higher range.

Hashing the Username or Email Address

Another idea would be to hash the username or email address. Usernames are going to be a bit more difficult to attack using a dictionary because they tend to be more unique.  In fact, if you were to combine the username and the password into the same hash it would be significantly more difficult because you couldn’t try to break the username or password by itself.  So something like:

hash(username + password)

Of course this would only work if you didn’t need the username anywhere else.  For example, you couldn’t use it to show the name of the person who left a comment.  This might work out ok if your application uses a username to login and a “screen name” to show to other users.

The problem with this is that you couldn’t look up a user without having their correct username and password.  This could be problematic for doing any type of password recovery. You might be able to work around this by using a third field that can uniquely identify the user–like an email address.

If your system doesn’t need to send emails, hashing the email address might be a better solution because these tend to be much longer and are automatically unique.  You could still send emails for password recovery because when given an email address you could find the row with a matching hash value.

If the average email address is 20 characters and can contain A-Z, @ and a period, then you’d be looking at 20^28 different combinations for a dictionary attack.  Actually the numbers would be significantly reduced if you took into consideration the fact that most email addresses end in .com, there is only one @ and other common patterns.  This is partially offset by the fact that some email addresses are longer than 20 characters and the fact that many people have username portions of their email that are more complicated than their passwords.

Something like the hash shown below would be impossible to break on any large scale, but wouldn’t allow traditional password recovery using email.

hash(email address + password)

If someone forgot their password, it would be impossible to lookup their account via email using this approach.  Still, in some situations that might be acceptable–particularly if an alternative form of password recovery could be done either in person, SMS or some other method.

Can You Measure It?

For years there have been two elementary schools in Fort Scott. Eugene Ware Elementary School on one side of town and Winfield Scott Elementary School on the other. This year the school board voted to make a change and all of the same grades at the same school. This means regardless of where you live, you have to go to the same school for your grade. This was presented as a way to save money.

I can see some ways that it might save on the number of teachers. If your maximum class size is 20 and you have 30 students in a particular class at each school, then it will require four teachers with two at each school. Each would have an average of 15 students. If all of these students are at the same school, you could have classes with 20 students and only need 3 teachers. So the combination may be able to reduce the number of teachers required–but only at the expense of larger class sizes.

While I applaud the idea of trying something new, I was very disappointed that there was no talk about it being done as an experiment. These type of changes have a way of being much more complicated than they look on the surface and it is impossible to tell if it is really going to save money until you actually try it.

This isn’t just a problem with school systems. All kinds of organizations make big changes with no clear plan for how they are going to measure the results. Continual improvement is the process of changing and then measuring. You keep the changes that are beneficial and jettison the ones that are not. This only works if the measurement aspect is “built-in” to the change process. With good measurements and a true commitment to roll back anything that doesn’t work, you will make progress even if you just make random changes.

Obviously, well planned changes with some intelligence behind them will move you forward much faster than just making random decisions. However, it is a mistake to assume that intelligent decisions can substitute for identifying and rolling back the things that don’t work. The power is in getting rid of the things that don’t perform as expected.