How difficult is it to get into a position at a top company where you can do ML research, without having a PhD?
And I'm also wondering about going back to university: Is it difficult to get into a good ML PhD program if you've been in the industry for some years?
Very difficult, at least for research. Top companies with ML research labs (FB, Google, Uber) are basically externalized academic labs. The heads of these labs are simply bringing talent over from their former departments (NYU, Cambridge/UBC, CMU).
It is not impossible to work at these labs as an engineer without a PhD, but I don't have a deep understanding of these roles.
"All the mathematics you missed but need to know for graduate school"[1] helped me a lot (and, in fact, I had and did).
Once I had finished that, Cullen's "Matrices and linear transformations"[2] was really helpful too. But I wouldn't do Cullen if you're still, as I was, floundering with the concepts of why you're doing this in the first place. It's great once you have those concepts down.
Garrity's book looks like exactly what I've been looking for to brush up on basic math skills after a decade of fairly low-brow work following my MSc. Thanks for the reference!
Let's say you have a system with N possible states, evolving in discrete steps. At each step, the system has some probability of switching from any state to any other state. That gives you an NxN matrix of switching probabilities. For example, if the system always stays in the same state as it started, the switching probabilities are an identity matrix (1 on the diagonal and 0 everywhere else).
Now let's see what happens after two steps. If the system started out in state i, what's the probability that after two steps it will end up in state k? Well, it's the sum over all possible paths. In other words, the sum of probabilities of i->j->k for all possible j. In other words, the sum of p_{ij} times p_{jk} for j from 1 to N. But that's exactly the definition of multiplying a matrix by itself.
Now it should be easy to understand that whenever you have matrices that represent transformations of some object, composing transformations will correspond to multiplying matrices.
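To make the two-step argument concrete, here is a minimal sketch in plain Python. The 3-state matrix P is made up purely for illustration; the point is only that summing p_{ij} times p_{jk} over every intermediate state j gives the same numbers as the (i, k) entries of P multiplied by itself.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][j] * b[j][k] for j in range(n)) for k in range(n)]
            for i in range(n)]


def two_step_probability(p, i, k):
    """Sum over all paths i -> j -> k, one term per intermediate state j."""
    return sum(p[i][j] * p[j][k] for j in range(len(p)))


# Hypothetical switching probabilities; each row sums to 1.
P = [
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.3],
    [0.0, 0.4, 0.6],
]

# A system that never changes state: the identity matrix.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

P2 = mat_mul(P, P)

for i in range(3):
    for k in range(3):
        # The path sum and the matrix product agree entry by entry.
        assert abs(P2[i][k] - two_step_probability(P, i, k)) < 1e-12

print(P2)
```

Multiplying IDENTITY by itself returns IDENTITY again, which matches the intuition that a system which never switches state still hasn't switched after two steps.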
Check out this series of videos "The Essence of Linear Algebra"[1] for a really powerful visual and intuitive explanation. It starts with vectors and builds to matrix multiplication and further to several other topics.
Axler's book, "Linear Algebra Done Right", is widely considered to be one of the best texts on linear algebra that focuses less on the mechanics and more on the structure of linear operators on vector spaces.
Yes, and the abridged version (no proofs, examples, and exercises) is downloadable for free. He avoids determinants altogether (well, until the 10th chapter).
How is this a failure? That's correct. (If your password hashing library doesn't handle this automatically, of course. But it does the same internally.)
The big thing about this is that it is perfectly "OK" to store the algorithm, cost, and salt alongside the hash.
Most people seem to think, myself included when I was new to it, that storing all those things together would compromise the security. The point of the hash is that it is (almost) impossible to get to the hash without the user's password, and there is no way to get to the password from the entire string you posted.
I'm naive about these things, but I was under the impression that salt just thwarted pre-computed hash tables? I guess that should be "just" in quotes.
So somebody with resources and motive could still brute-force that string. It seems that storing the salt somewhere else would add a comparable amount of security as the salt itself. It seems prudent along the lines of "don't put all your eggs in one basket."
> but I was under the impression that salt just thwarted pre-computed hash tables?
Yes. Because without a salt, if you had two users with the password 'dadada' they would hash to the same value.
Now 1234:dadada hashes differently than 1326:dadada, hence preventing the use of a prehashed table (you could go through all salts for common passwords, but that usually takes a while as well).
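A quick way to see this for yourself, as a rough sketch: SHA-256 is used here only as a stand-in to show the effect of the salt (a real system should use a deliberately slow password hash), and the `salt:password` layout simply mirrors the notation above.

```python
import hashlib


def salted_hash(salt: str, password: str) -> str:
    """Hash "salt:password". SHA-256 is illustration only, not a password hash."""
    return hashlib.sha256(f"{salt}:{password}".encode("utf-8")).hexdigest()


# Without a salt, two users with the same password get the same hash.
unsalted_a = hashlib.sha256(b"dadada").hexdigest()
unsalted_b = hashlib.sha256(b"dadada").hexdigest()

# With different salts, the same password hashes to different values.
hash_a = salted_hash("1234", "dadada")
hash_b = salted_hash("1326", "dadada")

print(unsalted_a == unsalted_b)  # identical hashes leak identical passwords
print(hash_a == hash_b)          # the salts make the two hashes differ
```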
Rather than expecting the password hash library to store something into your application DB, you should be managing the access to that DB yourself.
In our case, we use an immutable attribute of each user as their hash. This might be an internal identifier, or the timestamp on which their account was created, or something like that.
> Rather than expecting the password hash library to store something into your application DB, you should be managing the access to that DB yourself.
You do manage it yourself. The password hashing library doesn't access your database; it produces a string that you store, which includes the salt and the password hash.
> In our case, we use an immutable attribute of each user as their hash
What? You really need to talk to security-competent people.
> In our case, we use an immutable attribute of each user as their hash.
I assume you mean "as their salt". And even then, why the half-measure? Just laziness? Sure, a guessable/computable salt is better than no salt, but it's not nearly as good as a random salt.
> why the half-measure? Just laziness? Sure, a guessable/computable salt is better than no salt, but it's not nearly as good as a random salt.
But isn't the salt essentially safe to make public anyway? That being the case, how does it matter what value you use, so long as it differs between users?
The ideal salt is a large (e.g. 16 bytes or more) random byte string generated for each password.
If there's a reason for it (in most cases, there is none), some trade-offs are possible, e.g.:
Salt is a large random string unique per user, not per password.
Given two hashes of passwords for the same user it reveals whether passwords are the same.
Salt is a small random string or some predictable value.
Attackers can precompute guesses and then look them up.
If you use some immutable identifier per user as salt, both of these attacks are possible. Is there a reason for this? I'm 100% certain there isn't: since you already store the password hash in your database, you can generate a large random salt for each password hash and store it alongside.
As for "safe to make public": there are many things in crypto called "public" where "public" doesn't mean that the whole world is free to get it, but instead means the opposite of "private", or, as I like to call them, "non-secret". Yes, a salt can be made public, but to avoid precomputation it shouldn't be unless there's a reason for it (as in a kind of client-side crypto where the server stores the salt and sends it to clients).
> Salt is a large random string unique per user, not per password.
Of course it's per user.
But "large" makes some sense. My current implementation has maybe 20-22 bits of uniqueness in the salt, certainly less than 16 bytes.
I don't think 16 bytes is necessary even as insurance against the future. Rainbow tables are still expensive to build.
On the other hand, maybe to build just a small table addressing the stupidest passwords ("password","12345678",etc.) it's worth making it more difficult.
> I don't think 16 bytes is necessary even as insurance against the future.
The birthday problem comes into play here.
If you have 22 bits of entropy in your salt, then after about 2048 users (2^11) the chance that two of them share a salt is already around 40%, and it passes 50% at roughly 2,400 users. If they also use the same password, this makes attacking your users much easier.
Don't make it easy for attackers. Use 16 bytes from a CSPRNG. Better yet: Use a password hashing library that takes care of this for you.
If you use a 128-bit (16-byte) salt, you only reach a roughly 50% chance of a collision after about 2^64 passwords.
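For anyone who wants to check the birthday numbers, here is a small sketch using the standard approximation P ≈ 1 - exp(-n(n-1) / 2N) for n uniformly random salts drawn from N = 2^bits possible values. The function names are my own, and the approximation assumes the salts really are uniform and independent.

```python
import math


def collision_probability(num_salts: int, salt_bits: int) -> float:
    """Approximate chance that at least two of num_salts random salts match."""
    space = 2.0 ** salt_bits
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-num_salts * (num_salts - 1) / (2.0 * space))


def salts_for_even_odds(salt_bits: int) -> float:
    """Approximate number of salts at which a collision becomes a coin flip."""
    return math.sqrt(2.0 * math.log(2.0) * 2.0 ** salt_bits)


print(collision_probability(2048, 22))  # 22-bit salt, 2^11 users
print(salts_for_even_odds(22))          # users needed for ~50% with 22 bits
print(salts_for_even_odds(128) / 2.0 ** 64)  # same threshold, in units of 2^64
```

The 50% threshold grows like the square root of the salt space, which is why going from 22 bits to 128 bits moves it from a few thousand users to around 2^64.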
It being unique goes most of the way, you're right (though hopefully it actually is unique!). I was being dramatic when I said "not nearly as good". But making the salt easily guessable does allow an attacker to precompute rainbow tables, etc. So if there was a breach and an attacker got a dump of your password hashes, it might mean the difference between you having time to invalidate those passwords or not.
I'm not too familiar with this library, but on inspection this approach seems to have a couple of drawbacks that libraries like bcrypt solve for you:
1) You need to store the salt alongside the password.
2) If you want to futureproof the stretching factor (e.g. change from 100000 to 1000000), you need to store that alongside the password hash as well.
3) If you want to futureproof the hashing algorithm, you need to store that alongside the password hash.
The value of the *crypt solutions is that they store the input parameters as part of the stored secret. So you can make adjustments later on without invalidating existing stored passwords, or having to resort to annoying "double-hashes" to migrate to a new approach.
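As a rough illustration of what "storing the input parameters as part of the stored secret" buys you, here is a sketch built on PBKDF2 from Python's standard library. The `algorithm$iterations$salt$hash` layout and the function names are invented for this example; this is not bcrypt's actual format, and in practice you would let a maintained password hashing library do all of this for you.

```python
import hashlib
import hmac
import secrets

DEFAULT_ITERATIONS = 100_000  # the "stretching factor"; raise it over time


def hash_password(password: str, iterations: int = DEFAULT_ITERATIONS) -> str:
    """Return one self-describing string: algorithm$iterations$salt$hash."""
    salt = secrets.token_bytes(16)  # 16 random bytes from a CSPRNG
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return f"pbkdf2_sha256${iterations}${salt.hex()}${digest.hex()}"


def verify_password(password: str, stored: str) -> bool:
    """Recompute the hash using the parameters embedded in the stored string."""
    algorithm, iterations, salt_hex, digest_hex = stored.split("$")
    if algorithm != "pbkdf2_sha256":
        raise ValueError(f"unsupported algorithm: {algorithm}")
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), bytes.fromhex(salt_hex), int(iterations)
    )
    # Constant-time comparison to avoid leaking how many bytes matched.
    return hmac.compare_digest(candidate, bytes.fromhex(digest_hex))


def needs_rehash(stored: str, iterations: int = DEFAULT_ITERATIONS) -> bool:
    """True if the stored hash used fewer iterations than the current default."""
    return int(stored.split("$")[1]) < iterations


stored = hash_password("dadada", iterations=1_000)  # an "old" weak setting
print(verify_password("dadada", stored))
print(needs_rehash(stored))
```

Because the old cost travels with the hash, a password stored at 1,000 iterations still verifies after the default is raised, and `needs_rehash` tells you when to re-hash it at the new setting the next time the user logs in successfully.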
I don't understand your comment about the ORM needing to handle passwords. It's a simple fetch of a field from the DB, which you then pass as an input to your password validator. How is that any harder than fetching a salt and a hash and passing those to your validator?
Yeah, I'm currently taking a computer security class for master's students and we spend quite some time on passwords. It made me think that this should be in an obligatory course for bachelor's students (or some stripped-down version that teaches the basics).
It's mind-boggling how many well-known companies stored passwords badly.
As another comment mentioned, I think the issue is far less that computer scientists don't know good PW practice and far more that management doesn't care to give them time to work on it.
I think OP is rather talking about window.{alert,prompt,confirm}; at least "ie, you can't interact with the page without answering the dialog" hints towards that.
That's tricky too. You can't just remove window.prompt, or users won't be able to use pages that rely on it for critical input. So what do you do? window.prompt is a synchronous API; you have to return something to the code that called it. You might say "well, just suspend that function and let the user interact with the rest of the page". But by doing that you've introduced coroutines to JavaScript (since "interacting with the rest of the page" means "running JS"), which is a huge change that comes with a mountain of tricky interactions. Even if it could be made to work (which most browser vendors think is impossible), it would still break pages that didn't expect random state to change across the window.prompt call.