
> It’s a 64bit random number so it’ll never have unintentional collisions.

It'll have unintentional collisions if you ever generate more than 4 billion of these random numbers. That's not inconceivable.



Yes it is. Message schemas are made by humans. Most of these messages will be extended in a backwards compatible manner over the life of a project rather than replaced entirely so their IDs don’t change. That’s kinda the point of protobufs and its successors.

I’ve probably generated 100 IDs over my lifetime.


Which puts it on the same order of magnitude as the number of people on the planet. If every person alive generated a schema (or if 1/100th of all people generate 100 IDs each like you) then we'd have a small number of collisions. More likely you'd get large numbers of schema like that if there's a widespread application of a protocol compiler that generates new schema programmatically, e.g. to achieve domain separation, and then is applied at scale. I'm not saying that's likely, just that it is not, as is claimed, inconceivable.
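A rough check on the "small number of collisions" estimate. This is only a sketch: it assumes IDs are drawn uniformly at random from the 2^64 possible values and takes about 8 billion as the world population (my figure, not from the thread). The function name is made up for illustration. The expected number of colliding pairs among n IDs is the number of pairs, n(n-1)/2, divided by the size of the ID space.

```python
ID_SPACE = 2 ** 64  # number of possible 64-bit IDs


def expected_colliding_pairs(n: int, space: int = ID_SPACE) -> float:
    """Expected number of pairs that share an ID among n uniform random IDs."""
    # Each of the n*(n-1)/2 pairs matches with probability 1/space.
    return n * (n - 1) / 2 / space


if __name__ == "__main__":
    # Assumed scenario: one ID per person, roughly 8 billion people.
    print(expected_colliding_pairs(8_000_000_000))
    # One developer's lifetime output of about 100 IDs.
    print(expected_colliding_pairs(100))
```

Under those assumptions the first figure comes out a little under 2 colliding pairs, and the second is far below one in a quadrillion.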


It's only really a problem if you use the IDs in the same system. It's highly unlikely that you'd link 4B schemas into a single binary. And anyway, if you do have a conflict, you'll get a linker error.

Cap'n Proto type IDs are not really intended to be used in any sort of global database where you look up types by ID. Luckily no one really wants to do that anyway. In practice you always have a more restricted set of schemas you're interested in for your particular project.

(Plus if you actually created a global database, then you'd find out if there were any collisions...)


If you have already generated 4 billion of them, the next one you generate has roughly a 1 in 4 billion chance of being a duplicate.

On top of that, you would not only need to generate the same ID, you would need to USE it in the same system, in a place where the duplicate is silently accepted instead of causing an error.


>It'll have unintentional collisions if you ever generate more than 4 billion of these random numbers.

If it's 64 bit, doesn't that mean you'd need to generate ~1.8 × 10^19 (2^64) of those numbers to have a collision, not 2^32?


If you generate randomly then, due to the birthday paradox, after generating sqrt(N) values you have a reasonable chance of collision.
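To put a number on "reasonable chance", here is a minimal sketch using the standard birthday approximation, 1 - exp(-n(n-1)/(2N)), and assuming the IDs are uniform random 64-bit values. The function name and defaults are mine, purely for illustration.

```python
import math


def collision_probability(n: int, bits: int = 64) -> float:
    """Approximate chance of at least one collision among n uniform random IDs."""
    space = 2 ** bits
    # Birthday approximation: 1 - exp(-n(n-1) / (2 * space)).
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-n * (n - 1) / (2 * space))


if __name__ == "__main__":
    print(collision_probability(2 ** 32))  # sqrt(N) IDs generated
    print(collision_probability(100))      # a handful of IDs
```

At n = 2^32 (about 4.3 billion) this gives roughly 39%, and the 50% mark is reached at around 5 billion IDs. So "4 billion" is the point where a collision becomes plausible, not where it becomes certain.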

The birthday paradox is named after the non-intuitive fact that with just 32 people in a room you have a > 50% chance of 2 people having a birthday on the same day of the year.


I think it's 23 people in a room. The canonical example is people on a football (soccer) pitch. With 11 per side plus the referee there's a 50% chance that two will share the same birthday.


> 32 people

Slight correction: only 23 people, actually. So in every second football ("soccer") game, you have two people on the field with the same birthday.
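The 23 figure is easy to verify with the exact product formula. A small sketch, assuming 365 equally likely birthdays and ignoring leap years (the function name is just for illustration):

```python
def shared_birthday_probability(people: int, days: int = 365) -> float:
    """Exact chance that at least two of `people` share a birthday."""
    p_all_distinct = 1.0
    for i in range(people):
        # Person i must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for group in (22, 23, 32):
        print(group, shared_birthday_probability(group))
```

With these assumptions, 22 people land just under 50%, 23 people just over (about 50.7%), and 32 people are already at about 75%.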


Does birthday paradox apply here? It’s about any pair of people having the same birthday, whereas in this case you need someone else with a specific birthday.

For example, if you generate 2 numbers and they are the same, but are different to the capnproto number, that’s a collision but doesn’t actually matter.

EDIT: It does apply, I misunderstood what the number was being used for.



You’re right, I misunderstood what the magic number was being used for.


But if my application only uses 100 schemas, I only care about a collision if it's with one of those 100.


You have a collision if any two schemas share the id, not if a specific schema collides with any of the others. So it is exactly like the birthday paradox.
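The difference between "any two" and "one specific" comes down to how many pairs can match. A minimal sketch that just counts pairs (helper names are hypothetical):

```python
from math import comb


def pairs_with_one_fixed_member(n: int) -> int:
    """Pairs that include one particular item out of n."""
    return n - 1


def all_pairs(n: int) -> int:
    """Every unordered pair among n items."""
    return comb(n, 2)


if __name__ == "__main__":
    for n in (23, 100, 2 ** 32):
        print(n, pairs_with_one_fixed_member(n), all_pairs(n))
```

For 23 people that is 22 pairs involving a given person versus 253 pairs overall. For 2^32 IDs the number of pairs is just under 2^63, which is why the any-pair collision chance becomes significant there even though each single pair matches with probability only 1/2^64.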


Yeah, but that collision probably doesn’t matter because there’s a bunch of other variables that need to come together for it to be an issue at all.


If the schema id is the message id, in principle it could be an issue, as the protocol on the wire would be ambiguous. Then again, you should be able to detect any collisions when you register a schema with the schema repo and deal with it at that time.


I don’t understand your maths here: how is generating 4 billion of them any different from generating 3 billion, apart from a slight rise in the probability?


When you reach the 4 billionth version of your protocol?


All versions of the same protocol have the same ID. That is the point of IDs -- to link together different versions of the protocol.


You're right! That makes collisions even less likely then.



