We all know passwords suck
If you don’t, google “why passwords suck” and I’ll spare you a few pages.
Over the years, websites have required users to create increasingly complex passwords in an attempt to thwart attackers from brute forcing their way into victim accounts. Every website has its own ridiculous set of rules, resulting in passwords that are difficult to remember.
Many opt to create one “good” password and use it for all of their online accounts. What “good” looks like depends on your stance regarding entropy vs. complexity, but in any case the result is the same — people are using one password for all of their accounts.
Password theft aside, this might be okay if every web application could be trusted to transmit and store your password securely, but the fact of the matter is that they often can’t. And even if they’re keeping up with the secure storage standards of the moment, there’s never any telling how long before researchers find a fatal flaw within the hash functions themselves, or when hardware improvements will make brute-forcing a given algorithm feasible.
Enter Credential Stuffing
Hackers eventually came to realize two things:
- Every web application stores passwords differently, and some of them do it in a way which makes it trivial to recover their plaintext.
- People tend to reuse the same password for all of their online accounts.
In short, if an attacker can steal your password from one website, that same password will likely work elsewhere. Worse yet, most websites accept an email address as a valid login name, thus simplifying the attack even further — one email, one password, all of your accounts.
The act of stealing credentials from one website and attempting to reuse them on other websites would eventually come to be known as credential stuffing. Attacks like these are hindered by protections such as two-factor or multi-factor authentication (2FA / MFA), geolocation tracking, etc., but even those have their shortcomings (assuming they’re enabled to begin with).
According to a report published by Akamai, they observed nearly 30 billion credential stuffing attempts in 2018 alone. It’s difficult to nail down credible success rates, but if even 1% of those attempts were successful, we’d be looking at a potential 300 million compromised accounts. Combine that with the fact that Akamai states they only observe between 15% and 30% of all web traffic, and we’ve got a serious problem.
How the bad guys do it
The process for executing these attacks is simple:
- The attacker obtains (or compiles their own) list of valid email/password combinations leaked during a data breach. Precompiled lists are often shared amongst underground communities.
- The attacker then utilizes a script or application which automates the process of attempting each credential pair against the login page of various popular websites (Gmail, Facebook, Instagram, etc.). Sometimes they even automate the process of posting malicious content and/or changing the victim’s password to make it more difficult for the victim to recover their account.
These attacks are generally not personal; the attacker just wants valid accounts and they don’t care to whom they belong. Stolen accounts will often be sold on a darknet market to be used for nefarious purposes.
Protecting the customer
While Datto has offered 2FA as an opt-in feature for several years, the growing popularity of credential stuffing attacks made it clear that we have a responsibility to enforce secure policies on behalf of our customers and end users. Earlier this year we announced that we’d be transitioning to a mandatory 2FA policy for all Partner Portal accounts (our customer platform). This would ensure that even if a customer’s password were out in the wild, the attacker would still need to get past the second authentication factor before they could cause any harm.
Of course, defense in depth is the name of the game here. What if we could prevent credential stuffing attacks from being successful in the first place?
We brainstormed a few ideas:
Option #1: We could use the Have I Been Pwned (HIBP) API to search for our customer emails and/or passwords.
This was quickly ruled out for several reasons:
- A match on an email address doesn’t necessarily mean the customer’s Datto password is out in the wild.
- The HIBP API allows you to search for compromised passwords if they’re in plaintext, SHA1, or NTLM form, but Datto only stores them as bcrypt hashes (a sketch of that API follows this list).
- We’re uncomfortable with the idea of transmitting our customers’ sensitive credential information offsite.
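For context, here’s a minimal sketch of how the HIBP password search works, using the real `range` endpoint of the Pwned Passwords API (the surrounding function is my own illustration). It requires the SHA1 of the plaintext password, which is exactly why bcrypt-only storage rules this out:

```python
# Sketch: querying HIBP's k-anonymity Pwned Passwords API.
# Only the first five hex characters of the SHA1 ever leave your machine.
import hashlib
import requests

def pwned_count(password: str) -> int:
    sha1 = hashlib.sha1(password.encode('utf-8')).hexdigest().upper()
    prefix, suffix = sha1[:5], sha1[5:]
    resp = requests.get(f'https://api.pwnedpasswords.com/range/{prefix}')
    resp.raise_for_status()
    # Response lines look like "<35-char suffix>:<count>"
    for line in resp.text.splitlines():
        candidate, count = line.split(':')
        if candidate == suffix:
            return int(count)
    return 0

print(pwned_count('password123'))  # Nonzero for a well-known weak password
```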
Option #2: We could fire up an Amazon EC2 instance in an effort to brute force each customer’s password with a list of common and weak passwords.
This is somewhat ineffective against bcrypt, but more importantly it does little for our specific goal of preventing credential stuffing. Just because someone utilizes a weak password doesn’t necessarily mean they’ve been involved in a data breach.
Option #3: We could just do what the bad guys do by obtaining some data breach credentials and trying them against our own login page.
This would be very slow and would generate a lot of unnecessary traffic.
Taking a hybrid approach
After evaluating all of the tools at our disposal, we realized we could leverage both the data breach credentials and the customer’s bcrypt password hash to assist us:
- Create our own credentials database from popular data breaches. Use the same lists that the bad guys use so that we can prevent them from being used against us.
- Search the credentials database for customer email addresses. Searching for an email address may yield zero or more plaintext passwords associated with that address.
- If we find a plaintext password, hash it and compare it to the customer’s Datto hash. If they match, we know the customer is at risk of being credential stuffed. Each bcrypt hash includes its own salt, which simplifies this part a bit (see the sketch after this list).
- Once vulnerable customers have been identified, notify them and force a password reset after a grace period. Ensure they don’t attempt to reuse the same password.
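As a quick illustration of that salt point, here’s a minimal sketch using Python’s `bcrypt` library, showing that the salt lives inside the hash string itself, so no separate salt bookkeeping is needed:

```python
# Sketch: bcrypt embeds its cost factor and salt in the hash string,
# so checkpw can verify a password with nothing but the stored hash.
import bcrypt

password = b'password123'  # Illustrative value only
stored = bcrypt.hashpw(password, bcrypt.gensalt())
print(stored)  # e.g. b'$2b$12$' + 22-char salt + 31-char digest
print(bcrypt.checkpw(password, stored))       # True
print(bcrypt.checkpw(b'wrong-guess', stored)) # False
```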
Obtaining and unpacking popular data breaches
The first step was to obtain copies of the data breaches known as Collections #1-5, amongst a few others. For obvious reasons, I’m not going to detail how to obtain these (and please don’t ask). The combined lists are said to total nearly 28 billion stolen credentials (most of which, as I later discovered, were duplicates).
The lists were packed deep within a recursive maze of compression by various means (`tar.gz`, `zip`, `rar`, etc.). I didn’t keep any notes during this phase, but here’s what I remember:
- Many of the compressed file names contained spaces, quotes, or odd character encodings, which made it difficult for tools like `tar`/`unzip`/etc. to do their job. I couldn’t find a one-size-fits-all solution and had to rename/re-encode some of the more problematic files before I could extract them (a rough sketch of this renaming pass follows the list below).
- A working directory with about 2TB of free space was required for frustration-free unpacking.
- The actual credentials were held within many different file types, including `txt`, `csv`, `sql`, `sqlite`, `doc`, `docx`, `pdf`, `xls`, `xlsx`, etc. There were also a lot of random junk files included (`exe`, `dll`, `com`, `jpg`, etc.). Fortunately, the majority of files were `txt` and `csv`.
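Since I didn’t keep the original one-off commands, here’s a rough reconstruction of the renaming pass mentioned above, assuming a simple character whitelist and a hypothetical `./breach-data` working directory:

```python
# Sketch: replace awkward characters in archive filenames so tar/unzip
# stop choking on them. Whitelist approach; adjust to taste.
import os
import re

for root, dirs, files in os.walk('./breach-data'):
    for name in files:
        safe = re.sub(r'[^A-Za-z0-9._-]', '_', name)
        if safe != name:
            os.rename(os.path.join(root, name), os.path.join(root, safe))
```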
Extracting the credentials
Since I only had a few days to finish this project, I didn’t have time to analyze and digest every individual list. I opted to take an 80/20 approach.
Since the majority of files were `txt` and `csv`, I figured they would likely contain credentials in the pattern of `<email><delimiter><password or hash>`.
This would likely be incorrect for things like SQL dumps, but I randomly sampled enough files which adhered to this form that I felt comfortable moving forward with the assumption.
Using `grep` and some loose regex, I recursively extracted anything matching the above pattern:
grep -arihE -o '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}[,:;\s].*' ./ > list-1
- `-a`: Treat binary files as text (helped suppress error output)
- `-r`: Recursively read all files under each directory
- `-i`: Ignore case
- `-h`: Suppress the filenames on output
- `-E`: Use extended regex
- `-o`: Only output matching segments, not the entire line
The regex pattern itself is a simplistic way to match any string which follows the form `<email><comma|colon|semicolon|space><any number of characters>`.
Doing it this way also yielded irregular and non-printable characters which I’d have to parse out later.
Normalizing the data
Next, I used `sed` to accomplish a few transformations:
sed -e "s/[\:\;[[:space:]]/\,/g" -e "s/\,\{2,\}/\,/g" -e "s/\"//g" -e "s/'//g" -e "s/[^[:print:]\n]//g" list-1 > list-2
- Convert all colons, semicolons, and spaces into commas for uniformity. This also preps the list for import into a SQLite database.
- Collapse multiple sequential commas into a single comma. Sequential commas were a major artifact of Step 1. Note that this could also be accomplished with `tr -s ','`.
- Remove double and single quotation marks. Don’t ask why I separated these into two expressions; I don’t know.
- Remove non-printable and newline characters. This is important because otherwise there would be thousands of duplicate entries that differ only by invisible control characters.
At this point the list contained rows in the form of `<email><comma><a string of text>`. Two problems remained, both of which were solved with `awk`:
awk -F ',' '{ split($0, a, ","); print tolower(a[1])","a[2] }' list-2 > list-3
- Split the row into an array (using a comma as the delimiter) and convert the email address to lowercase. This helped collapse duplicate entries by preventing `fake@fake.com,password123` and `FAKE@fake.com,password123` from being treated differently.
- Only save the first and second array entries, discarding the rest. This prevented “too many columns” errors when importing the list into a SQLite database. (Both normalization passes are illustrated on a sample row below.)
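To make the `sed` and `awk` passes concrete, here’s an illustrative Python equivalent run against a single fabricated row:

```python
# Sketch: the same normalization steps applied to one fabricated sample row.
import re

raw = 'FAKE@fake.com:password123;;extra  junk\x07'
row = re.sub(r'[:;\s]', ',', raw)   # Delimiters -> commas
row = re.sub(r',{2,}', ',', row)    # Collapse sequential commas
row = re.sub(r'["\']', '', row)     # Strip quotation marks
row = re.sub(r'[^ -~]', '', row)    # Drop non-printable characters
fields = row.split(',')
print(f'{fields[0].lower()},{fields[1]}')  # fake@fake.com,password123
```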
After this, any rows which did not follow the format of `<email><comma><password>` became mangled and useless to me, but at least they wouldn’t cause any issues during the SQLite import. I also assumed that the emails and passwords themselves didn’t contain commas, though I’ve found it very rare for websites to allow such a thing.
Next, I sorted and deduplicated the output list. I used the `-T` flag to set a custom working directory, which I recommend any time you think you might run out of disk space (the default is `/tmp` on the OS drive). This operation can take days or weeks to complete, so it’s best to run it inside `screen`/`tmux`; GNU sort’s `--parallel` flag can also help spread the work across cores:
sort -u -T ./tmp/ list-3 > credentials.csv
Total size before normalization: 816GB
Total size after normalization: 103GB
Total credentials after normalization: 3.4 billion
Creating the SQLite database
I wanted to create a simple schema that would accomplish three things:
- Allow me to query an email address as fast as possible.
- Prevent blank/null emails or passwords from being inserted.
- Prevent duplicate email/password combinations from being inserted.
In the end, I landed on the following:
create table credentials (
email text not null,
password text not null,
constraint uc_credential unique(email, password) on conflict ignore
);
create index credentials_email_index on credentials (email);
Importing the credentials list was as simple as:
sqlite> .mode csv
sqlite> .import ./credentials.csv credentials
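As a sanity check, you can ask SQLite for the query plan to confirm that email lookups hit the index instead of scanning all 3.4 billion rows. A minimal sketch:

```python
# Sketch: verify the email index is actually used for lookups.
import sqlite3

db = sqlite3.connect('credentials.sqlite3')
plan = db.execute(
    'EXPLAIN QUERY PLAN SELECT password FROM credentials WHERE email = ?',
    ('fake@fake.com',)
).fetchall()
print(plan)  # Should mention credentials_email_index
db.close()
```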
Although `credentials.csv` was only 103GB, the SQLite database itself was 396GB. This was largely the result of indexing the email column.
Cracking the hashes
All that’s left to do is query the credentials database for each customer email and check whether any of the resulting passwords match the customer’s when hashed. The `bcrypt` library in Python makes this trivial by providing a `checkpw` function, which lets you test a plaintext password against a given bcrypt hash.
The entire cracking process was accomplished with a simple Python script:
#!/usr/bin/env python3
import bcrypt
import csv
import sqlite3

db = sqlite3.connect('credentials.sqlite3')
c = db.cursor()

# Assumes csv file formatted as <email>,<hash>
with open('customer-emails-and-hashes.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    next(csv_reader)  # Skip header row
    for row in csv_reader:
        customer_email = row[0].lower()
        if customer_email:
            # Gather every leaked plaintext password tied to this address
            potential_passwords = []
            for result in c.execute('SELECT password FROM credentials WHERE email = ?', (customer_email,)):
                if result:
                    potential_passwords.append(result[0])
            if potential_passwords:
                customer_hash = row[1]
                # Hash each candidate and compare it to the customer's bcrypt hash
                for plaintext_password in potential_passwords:
                    if bcrypt.checkpw(plaintext_password.encode('utf-8'), customer_hash.encode('utf-8')):
                        print(f'{customer_email} : {customer_hash} : {plaintext_password}')
                        break
db.close()
Final results, afterthoughts, and future improvements
We can obtain worst-case statistics by leveraging `@example.com` email addresses. These addresses have dozens of corresponding plaintext passwords in the database, which should significantly hinder our cracking speed.
Cracking these example bcrypt hashes on an Intel i5-4460 took about two seconds per hash, giving us a theoretical lower limit of around 43,000 hashes per day. In reality, most email addresses only have a few plaintext passwords associated with them (if any), meaning the process will be much faster against real data.
Running this script against tens of thousands of our customers only took about an hour. Our total success rate was about 0.56% — several hundred customers in our case. We alerted the vulnerable customers about our research efforts and forced a password reset after a grace period.
Although this was initially intended to be a quick and scrappy one-time ordeal, I later realized the long term potential of building a professional internal service out of this concept. A fully-featured application might include:
- An authenticated web API: Allow your company’s internal services to identify customers attempting to utilize a compromised password (a rough sketch follows this list). If your database is kept up to date, this would help nip credential stuffing in the bud.
- Breach list imports, exports, and deletions: Provide data management functionality for a variety of database and file types. You could include metadata and a changelog for each import, allowing you to track the origin of credentials.
- Support for common hash types and salting methods: Allow users to specify the hash function (bcrypt, md5, etc.), as well as the position of the salt during the cracking process.
- Parallelized cracking and efficiency improvements: The cracking process doesn’t take too much time as it stands, but it could certainly be improved by utilizing GPUs, multi-threading, or hashcat integration.
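To sketch the first idea, here’s what a bare-bones version of such an internal endpoint might look like in Flask. Everything here (the route, field names, the lack of authentication) is illustrative, not a description of an actual Datto service:

```python
# Sketch: a hypothetical internal endpoint that checks a login or signup
# attempt against the breach database. A real service would sit behind
# authentication and rate limiting.
import sqlite3
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/v1/check', methods=['POST'])
def check():
    body = request.get_json(force=True)
    email = body.get('email', '').lower()
    candidate = body.get('password', '')
    db = sqlite3.connect('credentials.sqlite3')
    rows = db.execute(
        'SELECT password FROM credentials WHERE email = ?', (email,)
    ).fetchall()
    db.close()
    return jsonify({'compromised': any(candidate == row[0] for row in rows)})
```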
Prevention
As of today, it’s arguable that the best prevention for credential stuffing is to utilize:
- A password manager, with a unique password for every site.
- 2FA/MFA, for every site as well as the password manager itself.
While this is a highly effective strategy, it does bring about the risk of having all your eggs in one basket. Taking the time to audit each password manager’s history of secure practices would be a worthwhile endeavor before making any commitments.
In addition I would recommend keeping tabs on your email addresses by subscribing to Have I Been Pwned’s “Notify me” feature (link at the top of the front page). They’ll email you whenever your address is involved with a data dump, and they’ll even include details about the kind of information that was exposed. It’s worth noting that many password managers include this feature and can query the HIBP API automatically on your behalf.