Why It Matters Whether Hashed Passwords Are Personal Information Under U.S. Law (jdsupra.com)
93 points by fortran77 on Feb 3, 2021 | 56 comments


> As long as the salt value remains secret, this is a very effective method against rainbow attacks and other current methods of attack

But salts aren't meant to be secret: they're there to make existing rainbow tables useless, since the salt wasn't part of the input when those tables were precomputed.

In the example they use "2@3" prepended and appended to the password. OK, so any rainbow table that wasn't built with 2@3 is useless. But "password" will always be a lousy password, and 6 characters of salt isn't that great.


If they use a unique salt for each password you are right; if they use the same salt for every password then the article is correct(ish).

I should say correct enough.


Even then it's better than nothing. If every company used a different salt, rainbow tables would have to be regenerated from scratch for each company.

(This is still awful, because then two users with the same password would have the same digest. If you know that joe@example.com has password "badpass", and jane@example.com has the same hashed password, then hers is also "badpass". So use random per-user salts, gang! Even better, pick a tested implementation of PBKDF2 or Argon2 and use that. Don't roll your own crypto!)

But still, it's better than nothing.
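
For concreteness, here's a minimal sketch of the "random per-user salt plus a tested KDF" approach, using Python's standard library (the iteration count and salt length are illustrative assumptions, not recommendations):

    import hashlib, hmac, os

    def hash_password(password: str) -> tuple[bytes, bytes]:
        salt = os.urandom(16)  # random per-user salt, stored next to the digest
        digest = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 600_000)
        return salt, digest

    def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
        candidate = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 600_000)
        return hmac.compare_digest(candidate, digest)  # constant-time compare

Because the salt is random per user, two users with the same password get different digests, and a precomputed table is useless.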


Oh, yeah, I agree, it is better than nothing.

Just pointing out that I have seen conditions where the article is not terribly wrong in its statement.

Things like bcrypt are nice because they take care of that for you.


Why not just use joe@example.com as part of the salt? So every user has a different salt.


Usernames and email addresses change.


What about taking results from one leak's reversed passwords and applying it to all future hash leaks?

It's the same logic as applying well-known target data like birthdays, anniversaries, etc.


Salting makes that nearly impossible.


Salting means each crack needs either its own rainbow table or some other method.

I was talking about _reversed_ passwords - guesses that satisfy the hash. Bonus points if they look like names, words, etc., but even random junk might just be a pwgen password.


Does anyone actually use rainbow tables to crack password hashes anymore with GPUs being so fast?


An O(1) "is this hash known?" lookup is going to be faster than any GPU code.


In a world with unlimited IO throughput and very few hashes to crack, yeah.

Discussion from 2017: https://hashcat.net/forum/thread-6219.html

From 2018: https://security.stackexchange.com/questions/186953/cracking...

GPUs are much faster now. I doubt you’ll find many people using rainbowcrack or similar tools in 2021.

OTOH https://www.rainbowcrackalack.com/


Huh, TIL. Thanks for the links.


>if they use the same salt for each password then the article is correct(ish).

Wouldn't it be a pepper then?

https://en.wikipedia.org/wiki/Pepper_(cryptography)

edit: nvm, it seems the main difference is a pepper being secret.


Yep - a pepper is an app-wide value that forces an attacker to get both a configuration value (from the source, or env vars) and also the hashes of the passwords (a db dump). Just another incremental thing for an attacker to overcome that is cheap & easy to implement.

Salts must be unique per password as far as I know? Otherwise they don't achieve their main goal, which is to make every guess useful against only one hash. (i.e.: unsalted, you guess 'hunter2' and check whether any user had hunter2 as a password; salted, you'd only be testing whether cschneid had hunter2, since you'd be hashing hunter2-sssssaaaaallllltttt.)
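
A sketch of how the two combine (the env-var name and KDF parameters are illustrative assumptions): the pepper comes from configuration, the per-user salt lives in the database row.

    import hashlib, hmac, os

    PEPPER = os.environ['PASSWORD_PEPPER'].encode()  # app-wide secret, not in the DB

    def hash_password(password: str, salt: bytes) -> bytes:
        # one common construction: HMAC the password with the pepper first,
        # then run the salted, stretched hash as usual
        peppered = hmac.new(PEPPER, password.encode(), hashlib.sha256).digest()
        return hashlib.pbkdf2_hmac('sha256', peppered, salt, 600_000)

An attacker with only the database dump then has nothing useful to brute-force against; they need the pepper too.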


Yeah, I was trying to say it's correct-ish for a general audience.

If right or wrong is a spectrum.


It's maybe also worth pointing out that if the same salt is used for every password, then an attacker who can leak the hashes after creating their own account (with a known password) can try brute-forcing the salt.

If the attacker hasn't seen the code, they won't know whether the salt is prepended, appended, or both (or whether there are two salts), but the computation needed to try all possible salts is roughly equivalent to what's needed to crack a single well-chosen (unsalted) password.
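
A toy version of that attack, assuming a prepended salt over a small alphabet (both assumptions, since the attacker hasn't seen the code); the "leaked" digest is fabricated here so the sketch runs end to end:

    import hashlib
    from itertools import product

    KNOWN_PW = b'attacker-chosen-password'            # the attacker's own account
    LEAKED = hashlib.sha256(b'2@3' + KNOWN_PW).digest()  # really from the DB dump

    ALPHABET = b'abcdefghijklmnopqrstuvwxyz0123456789@'
    for length in range(1, 5):                 # short salts only, for the demo
        for combo in product(ALPHABET, repeat=length):
            salt = bytes(combo)
            if hashlib.sha256(salt + KNOWN_PW).digest() == LEAKED:
                print('recovered app-wide salt:', salt)

Once the shared salt falls, every other hash in the dump is effectively unsalted.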


Please do not use the SHA family of hashes to store passwords. They are designed to be fast. This makes brute forcing them relatively easy. There are hashes designed for passwords that are not fast.


As others have said, use scrypt, but...

PBKDF2, obviously a less marketable name than scrypt, was a pretty standard tool for dealing with fast hash functions. What turned out to be a bigger problem than speed is memory use. Now that GPUs are widespread, it's a lot easier to parallelize password cracking, and most cryptographic hash functions use negligible memory, so they parallelize very well. scrypt improves on this by being memory-hard.

I actually saw scrypt take down a website. It got hit with brute-force password attempts (<100k, so not a serious attack), and the scrypt calls exhausted the CPU, amounting to a DoS. We ended up wrapping the call in a semaphore to prevent this. You don't normally think about it, but login might be the most expensive operation your webserver does.
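
The fix amounted to something like this (the slot count and scrypt parameters here are illustrative assumptions):

    import hashlib, hmac, threading

    KDF_SLOTS = threading.BoundedSemaphore(4)  # at most 4 concurrent scrypt calls

    def check_password(password: bytes, salt: bytes, stored: bytes) -> bool:
        with KDF_SLOTS:                        # excess logins queue up instead
            candidate = hashlib.scrypt(        # of exhausting every core
                password, salt=salt, n=2**14, r=8, p=1)
        return hmac.compare_digest(candidate, stored)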


I completely agree. If you're directly using a SHA hash for password storage, even with salt, you're doing it so incompetently that you should probably be fined.

At the very least use a well-studied iterated salted hash algorithm for storing passwords (if you must authorize an incoming request later). That is a minimum bar. I generally recommend argon2. Reasonable alternatives are bcrypt, scrypt, and PBKDF2. PBKDF2 is weak against hardware, but it's still better than using a hash directly.

I teach a class on developing secure software, and students have to do a final project. I take off points if they directly use SHA for password storage; that's just not acceptable.


Yes, it is called stretching.

Cryptographic hash functions are typically designed to be computed quickly, so it is possible to try guessed passwords at high rates - billions of possible passwords each second. This is why we have password hash functions that perform key stretching, such as PBKDF2, scrypt, or Argon2. They increase the time (and sometimes the memory) required to brute-force stored password hash digests. Use a large (256-bit is fine) random salt value. Salts are random values that do not have to be secret; since each user gets a different salt, the attacker has to compute the stretching function once for each password/salt combination (i.e., per user), rather than once per guessed password across all users, which is far more work for the attacker.

If you only hash with a salt, passwords are still exposed to brute-force and dictionary attacks that attackers can easily run on GPUs. Key stretching is important! This is why we have dedicated password hashing functions that are slow enough to mitigate brute-force and dictionary attacks.

TL;DR: salt and stretch passwords using a dedicated (slow) password hashing function: PBKDF2, bcrypt, scrypt, Argon2, Balloon, or some recent modes of Unix crypt. Do not use fast cryptographic hash functions on passwords, as that defeats the purpose of stretching. Salting alone is not enough.
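
To make the fast-vs-slow point concrete, a rough timing sketch (the iteration count is an assumption, and absolute times vary by machine):

    import hashlib, os, time

    pw, salt = b'correct horse', os.urandom(32)       # 256-bit random salt

    t0 = time.perf_counter()
    hashlib.sha256(salt + pw).digest()                # fast hash: microseconds
    t1 = time.perf_counter()
    hashlib.pbkdf2_hmac('sha256', pw, salt, 600_000)  # stretched: ~hundreds of ms
    t2 = time.perf_counter()
    print(f'sha256: {t1 - t0:.6f}s  pbkdf2: {t2 - t1:.3f}s')

That factor of several hundred thousand applies to every single guess the attacker makes.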


Do not, ideally, roll your own out of anything.


Further, a plain hash is the wrong primitive for storing passwords, just as you wouldn't attempt to invent a cryptographic hash function yourself.

The correct primitive to use is a key derivation function. Examples are PBKDF2, bcrypt, scrypt and Argon2.


bcrypt is NOT a key derivation function. It IS a "password hashing function". PBKDF2, Argon2, and scrypt are each both.


Can you say which? Blowfish?


Use argon2 or scrypt for password hashing. PBKDF2 or bcrypt are also acceptable, but prefer the first two in new systems. Make sure to use appropriate complexity factors.

A good starting point when trying to decide what crypto algorithm to use is https://latacora.micro.blog/2018/04/03/cryptographic-right-a...


I absolutely love that right answers document. Thank you.


People used to say “Just use bcrypt.”, but this might have changed since. A cursory look seems to indicate that Argon2 is the current preferred choice.


bcrypt is fine. It has held up quite well. Really, when picking among the mainstream password hashes, just throw a dart if you need to.


pbkdf2: https://en.wikipedia.org/wiki/PBKDF2

Your standard crypto library probably implements it (Rfc2898DeriveBytes for .NET). Under the covers you should tell it to use SHA512. If you can choose something better, go ahead, but this is often good enough.
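
For reference, roughly the Python equivalent of that .NET call is one line (the iteration count and output length here are illustrative):

    import hashlib, os

    salt = os.urandom(16)
    key = hashlib.pbkdf2_hmac('sha512', b'hunter2', salt, 210_000, dklen=64)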


I've never really loved this advice.

If people use weak passwords, the hash algorithm doesn't matter; they're going to get cracked. If you use strong passwords, the hash algorithm doesn't matter; they're not going to get cracked. This sort of thing only matters for medium passwords, and can be avoided by requiring strong passwords.

It also does very little against precomputation attacks (e.g. rainbow tables), because it makes it take longer to compute the tables but it's still efficient to use them.

Meanwhile it adds a user-perceptible amount of latency to logging in, because the hash algorithm is so slow. That causes users to take steps to avoid it, e.g. by staying logged in when they don't need to or preventing their machine from locking, which reduces security.


Not sure what you're suggesting, but I'll give you the benefit of the doubt and guess you advocate giving users randomly generated 64-character hex strings, with password resets simply issuing a newly generated password.

At that point it may be suitable to store the password as plaintext, though a layer of hashing could still slow an attacker down from authenticating if they get read access to the database.


I mean, the "reality" is that if a user reuses the same password across multiple sites, they are almost certainly screwed: the premise that we can guilt all the big websites into storing hashed passwords doesn't work for the long tail of websites, each of which has a high likelihood of being run by a scammer anyway. Users really just need to not reuse passwords across websites, and each password needs to be high entropy.

The recommendation to "hash your stored passwords" keeps you from being part of the problem, but it is a solution to the wrong problem and fails to protect either the user or the website from getting hacked. At least if you insist on implementing login with the classic scheme - "user sends a plaintext copy of their secret to the server on each login, which the server (somehow) verifies" - the recommendation should also include "don't let the user generate their password", as anything else does a disservice to the user and doesn't actually help the website.

If the user is sane, they won't care if you generate the password. If the user isn't sane, then either 1) they are using some kind of weird deterministic scheme to generate passwords for all websites (which is frankly not optimal); 2) they intend to use their password on another, less secure website (and so should be prevented from using that same password on yours); or 3) their password is something stupid (like their mother's maiden name concatenated with their date of birth... but with the i's changed to exclamation points to get around the symbol requirement).


Never store the password in plaintext. Even with a fast hash algorithm, a strong password can't be cracked in a feasible amount of time.

It also doesn't have to be 64 characters. 16 characters of random base64 is 96 bits of entropy. Even at the hash rate of a Bitcoin ASIC drawing hundreds of watts, that's millions of years of hashing for one password.


Nearly everything you've ever heard about "strong passwords" is irrelevant and obsolete when the cracking method is simply "hash every alphanumeric and punctuation string starting at MIN_LENGTH and incrementing". The only way to defeat this is by making the hash computation expensive.


Making the hash computation expensive does nothing when your password is "Password1" because it will be cracked in seconds even using an expensive hash algorithm. Making the hash computation inexpensive does very little (and has other costs) when your password is "Tgu8dVThkweVqa6P" because that has enough entropy to require in excess of $100,000 in electricity to brute force even with a fast hash algorithm, which is uneconomical in ordinary situations.
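
For the curious, the arithmetic behind that claim, with assumed (illustrative) rate figures:

    import math

    space = 62 ** 16                 # random alphanumeric, 16 characters
    print(f'search space: 2^{math.log2(space):.0f}')   # about 2^95

    rate = 1e12                      # guesses/sec on a fast unsalted-hash rig
    years = space / rate / 3.15e7    # seconds per year
    print(f'worst case: {years:.1e} years')            # ~1.5e9 years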


In many privacy regulations, any unique identifier is treated as Personally Identifying Information (PII). If the hashed password is used as a lookup key, that makes it a unique identifier, and it's PII. I work in environments where agencies are not allowed to share user PII with each other, and therefore you can't create single sign-on (SSO) identities to log in to different services. The principle is that individuals should be able to compartmentalize their business and service relationships, without their service providers colluding or conspiring against them for leverage.


In typical username+password systems, the passwords are not unique, and thus the password hash would be an inappropriate lookup key.


It wouldn't be - but if you were sharing an unsalted hash database across organizations, where it was aggregated with others, then that hash is a unique identifier. Uniqueness becomes an issue in things like de-identification, which is a legal concept, and pseudonymization, which is basically tokenization - not an information-theoretic notion like, say, k-anonymity, or a cryptographic one like entropy.

If these distinctions are new to people on this thread, I recommend https://www.oreilly.com/library/view/building-an-anonymizati... for some background.


So, correct me if I am wrong, but you are saying that uniqueness when it comes to data regulation is different than uniqueness in information theory.

Even though I generate a random salt when hashing passwords, I wouldn't use the hash to select a user out of a database, because I don't check for collisions or ensure a salt has never been used before. The chance of a collision is tiny, but I still don't want to rely on it. But would this still qualify as unique under data regulations?


The fact of a collision does not generally have regulatory meaning, unless your hashing scheme was for de-identification and your re-identification risk assessment showed it was trivial to re-identify someone because of said collisions. Also, logically, the lower the collision rate, the more unique the identifier. I recommend reading the book I linked above.

The policy view of what these things mean vs. what we talk about in terms of entropy, collisions, confusion, etc., is subtly different. It depends on the regulatory regime, of course, but here's an example from what I do:

A business wants to share 10 million personal profiles keyed on SIN with another business or agency for some research. Privacy law says they can't share the data unless the personally identifying information is removed. The business tells their DBA to "encrypt" the SINs; the DBA says "sure", SHA256's them all, and shares the data set, because to him the SINs are now not being shared(!).

A privacy policy analyst freaks out, because hashing the SINs has done nothing to protect the identities of the people in the profiles. The DBA can't figure out why, because he used a NIST-approved 256-bit hash, and he tells his boss "it's fine, they're encrypted."
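
The analyst is right to freak out. A sketch of why, assuming 9-digit SINs: the whole input space is only 10^9 values, which a laptop can enumerate in minutes.

    import hashlib
    from typing import Optional

    def reidentify(leaked_digest: bytes) -> Optional[str]:
        for n in range(1_000_000_000):         # every possible 9-digit SIN
            sin = f'{n:09d}'.encode()
            if hashlib.sha256(sin).digest() == leaked_digest:
                return sin.decode()
        return None

Unsalted hashing of a small input space is a lookup table, not encryption.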

To your case with the salt: if you salt the hashes, the agency receiving the file calls back and says they can't use the data, because the hashes don't match the profiles in their database - they have been using the hashes as unique identifiers to re-identify people in the data set. If you salt them for your initial data sharing and then re-hash with a new salt for an update, that's better, but it is still a transferable identifier for that cut of the data.

Even if you tokenize each profile with a UUID and transfer that UUID between agencies, you are transferring a unique identifier about the person. The right way to do this is to have a tokenization service broker that takes records and synthesizes new record keys (an MBUN, "meaningless but unique number") for each destination counterparty you share the data set with. Hardly anyone does this; they just take the privacy risk instead.

The policy vs. info theory difference is that policy can conflate hashing and encryption depending on the purpose of the use, where in tech and security, they are totally different things.


In the theoretical sense what you care about is whether you can safely assume those hashes will always be unique.

In the practical sense, what matters is whether the hash for a particular user in a particular system is, in effect, unique to that user.

E.g., if I had a snapshot of a database mapping identities to password hashes and a log of all the hashes that had been computed by the auth server, I could make very reasonable guesses about who logged in at what time.

With that said, I am not a lawyer and I have no idea what the legal significance of all that might be.


The passwords should be salted, and therefore unique.


Thanks for sharing! I'm curious about the exact dividing line between PII and non-PII. So if you know, could you clarify a detail?

Your phrasing was: "If the hashed password *is used as* a lookup key, that makes it a unique identifier, and it's PII."

If I take your wording literally, you're saying that I could store everyone's SSN in a database, but as long as my system didn't use the SSN as a unique identifier, the SSN wouldn't count as PII.

But it seems unlikely that that's how you meant it.


"If A, then B" does not necessarily mean "If Not A, then Not B". Now if the first logical statement was "If, and only if A, then B", then the second logical statement would follow.


Wow. I had never considered that. I always wondered why some companies don't let me SSO into different parts of their apps.


It's an interesting issue now because norms have changed. That specific privacy view is a very 90's-00's worldview that assumed silos and didn't anticipate using your webmail login for literally everything. Examples include legal firewalls between divisions of banks where they can't use your transaction history to make car insurance decisions. Should an airline know your income before quoting you a fare? (which is why some people use Tor and proxies to search for flights)

Privacy is anti-discrimination, and the reason social media companies are so rich is because they sell micro-discrimination as a service. It's so valuable because when people see how it works they ask, "how is this even legal?"


Okay maybe not PII, but can they share PIII or PIV?


Not a huge fan of offline-attackable password hashes in datastores.

I want to put in a vote here for anchoring. There are lots of methods:

1) Code secret. This is OK, but it loses all strength if disclosed.

2) KMS wrapping of hashes. This is better; attacker needs to be in your environment, actively unwrapping all your stuff.

3) Controlled hashing step (best). Have one step of your password hashing incorporate a one-way operation using a "normally non-extractable" hardware secret. This could be in CloudHSM if you're in AWS; see the sketch after this list.

This completely prevents offline attacks unless the attacker compromises your HSM.

4) It would be nice if cloud providers implemented "password hashing as a service". They could easily and cheaply prevent offline attacks.

5) Of course strong key auth instead of passwords for clients would be nicer.
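
As promised above, a sketch of option 3. Here hsm_hmac() is a hypothetical stand-in for whatever keyed one-way operation your HSM exposes; the key never leaves the hardware. The scrypt parameters are illustrative.

    import hashlib

    def hsm_hmac(data: bytes) -> bytes:
        # hypothetical: performed inside the HSM with a non-extractable key
        raise NotImplementedError('call your HSM / CloudHSM client here')

    def hash_password(password: bytes, salt: bytes) -> bytes:
        stretched = hashlib.scrypt(password, salt=salt, n=2**14, r=8, p=1)
        return hsm_hmac(stretched)   # offline cracking now requires the HSM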


Several technical mistakes in this piece (which I assume was not primarily for a technical audience, but mistakes they are nonetheless):

1. This article repeats the misunderstanding that Rainbow Tables are the same thing as a Dictionary:

> Given the processing power available, hackers have and continue to generate enormous tables that contain anything from every possible combination of values for shorter passwords to lists including variants of common and known passwords. These are called rainbow tables, and if the password used is included in one of these tables, then the cleartext password is known to the hacker.

The Rainbow Table is actually an interesting improvement on a pre-existing idea (by Hellman - yes, as in Diffie-Hellman) for a time-space trade-off attack.

So let's describe the most basic idea: we have a list of potential passwords (a "Dictionary"). It doesn't matter whether it's an actual written list or a procedure for picking possible passwords, but it must be strictly finite. Just trying them all is a Dictionary attack. For an online service, this is one thing "rate limiting" defends against.

Now, if you have the hashes, you could do your dictionary attack except instead of trying to log in you just calculate each hash and compare it. There are automated tools that can do this for you e.g. John The Ripper.

The next cleverest idea is: let's calculate hashes from our dictionary and store them, indexed (or, e.g., sorted) with the original word, so we can just look up a hash and get back the password whenever it's one we precomputed.

Hellman's idea was: don't store all these hashes - storage is expensive. Imagine a function F() that converts a hash back into a password from the dictionary, and call the hash function H(). Instead of calculating and storing H(password) -> password, we iterate H(F(H(F(...)))) some fixed number of times, effectively chaining many solutions together in each stored entry, then do extra work at lookup time to recompute the intermediates. We're trading less space for more time.

Hellman's idea is clever, but it's a bit wasteful, because sometimes there will be (at random) collisions in either H() or F(), and this means you're wasting space / losing recall in your system.

So, Rainbow Tables are Philippe Oechslin's improvement: Oechslin deliberately perturbs F() differently for each iteration, so that if a collision does occur it goes away again unless it happened at the same iteration of the chains.

This produces a "rainbow" picture in Oechslin's slides which named the resulting technique.
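
A toy illustration of the scheme, with a deliberately tiny password space (4 lowercase letters) and short chains; mixing the step number into the reduction function is the "rainbow" part:

    import hashlib, string

    ALPHA, PW_LEN, CHAIN_LEN = string.ascii_lowercase, 4, 100

    def H(pw: str) -> bytes:
        return hashlib.sha256(pw.encode()).digest()

    def R(digest: bytes, step: int) -> str:
        # reduce a digest back into the password space; the per-step offset
        # is what distinguishes this from Hellman's original chains
        n = int.from_bytes(digest, 'big') + step
        return ''.join(ALPHA[(n // 26**i) % 26] for i in range(PW_LEN))

    def chain(start: str) -> tuple[str, str]:
        pw = start
        for step in range(CHAIN_LEN):
            pw = R(H(pw), step)
        return start, pw               # only the two endpoints get stored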

Rainbow Tables are useless against a properly salted hash, because all this up-front work must be redone for each different salt, or else the size of the tables is inflated by the size of the salt space. So e.g. a 32-bit salt (many today are far larger) makes your Rainbow Table cost four billion times more to calculate and need four billion times as much space.

They got famous because some key Microsoft password technologies from the turn of the century do not use salt, and so you can attack those with Rainbow Tables. In particular the LM Hash can 100% reliably be reversed by a suitable Rainbow Table that would be affordable to store in that era.

2. It gets salt wrong:

> As long as the salt value remains secret, this is a very effective method against rainbow attacks and other current methods of attack.

Nope. The salt doesn't need to be secret; it should be random for each record. Our goal isn't that an attacker doesn't know each salt - in most of these cases the salt will have been stored alongside the hashed passwords - but to prevent them from amortizing work across a large number of hashes; see the discussion of Rainbow Tables above.

Overall I think for lawyers or managers looking for legal advice, which I assume is the target of this article, the focus ought to be on getting away from passwords. They were already a bad idea, a last resort, last century, and we've doubled down rather than seeking every alternative.

If your system is authenticated with WebAuthn, you do not store any secrets; you do not store anything that would "permit access" to the system, because it's all driven by public key cryptography. You store things that can be used to verify that the client is the same as before, but they can't be used to impersonate that client, or even to unmask the client's identity. You could (but probably shouldn't) paint the contents of a WebAuthn authentication database on the side of your building, and that'd be fine.


By keeping the salt secret, they are using it as an encryption key that they hope will prevent any sort of brute-force attack - and thus the hash would no longer be covered by privacy legislation. But even a separate salt per record is still brute-forceable; it just takes a lot longer across a lot of records.


If you have a secret place that you can hide data from attackers in, just store the hashes there too.


A hash (especially a properly salted slowed hash) of a good password doesn't permit access to anything, because it's beyond cracking even with every dictionary and optimization you might have. A hash of a bad password does permit access. The law is going to have to wrap its head around that distinction somehow.

Unfortunately "good" passwords have other problems, like being hard to type and remember... so there are practically no good passwords out there...


Would that also mean that encrypted (not hashed) PII is also personal information? It should not be retrievable (in theory and practice) without the key.


I have it from different DPAs that encrypted data are not PII under the GDPR. The case might be different for passwords, but you have to think around the corner a bit: when you check a password, you compare two hashes - you don't compare the password itself (the actual string typed in the form) - so what identifies you as a person to the computer is not your password but the hash. So the hash itself is PII. Why isn't the same true for encryption? Good encryption should in principle be indistinguishable from random data. Will the same file encrypted with the same key produce the same output? No - at least it shouldn't. So it's not possible to identify someone by it the way you can with a hash value. I can only speculate about the rationale, but this is the only explanation that comes to mind.


> The question of whether a hashed password “permits access” to an online account is a complex question that has not been fully addressed from a legal standpoint.

Isn't this already answered by existing law? Wasn't Kevin Mitnick charged and prosecuted under laws covering unauthorized access? Obviously, if you have a hashed password that isn't yours and you're trying to access an account that isn't yours, I'm sure that qualifies as unauthorized access. What am I missing here?

It sounds like an argument is being put forth to make hashed/salted passwords PII for, I can only guess, the sole purpose of litigation against businesses that get hacked. Otherwise, does anyone think hackers care about PII?

I can understand prosecuting for security lapses that are related to a data breach. But within reason. You're getting close to charging the victim for the crime here.

I know it's fashionable here on HN to argue against storing any PII. But that's not reasonable for our society, because, much like salting+hashing, when you implement security measure X, all the future hacker has to do is X+1 to breach the wall. This game will never end. If we go down this route, you know what's going to happen? Only Amazon and Google and Facebook will control your PII. With CCPA and GDPR you're already seeing how this plays out: the businesses with deep pockets win, because they can afford to deal with the legal realities of doing business all day long. An entire cottage industry has popped up just to put those stupid cookie banners on websites. Does anyone really want to live in a world where Google gets to dictate whether your business can exist, while Amazon steps on your neck demanding rent?

CCPA/GDPR are basically laws that punish the entire world for the sins of three or four monopolies.



