
I wonder if they have a major Postgres database that hit transaction ID wraparound? Postgres uses 32-bit transaction IDs, and IDs are only reclaimed by the vacuum maintenance process, which can fall behind if the DB is under heavy write load. Other companies have been bitten by this before, e.g. Sentry in 2015 (https://blog.sentry.io/2015/07/23/transaction-id-wraparound-...). Depending on the size of the database, you could be down for several days waiting for Postgres to clean things up.
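For the curious, the "wraparound" isn't the counter simply hitting a ceiling: XID comparison is circular over the 32-bit space, so for any given XID roughly two billion IDs count as "past" and two billion as "future". A rough Python sketch of that circular comparison (function and constant names are mine, not Postgres's, though the logic mirrors the signed-difference trick Postgres uses in `TransactionIdPrecedes`):

```python
# Sketch: circular comparison over a 32-bit transaction ID space.
# All names here are illustrative, not Postgres internals.

XID_SPACE = 2**32  # XIDs are unsigned 32-bit integers

def xid_precedes(a: int, b: int) -> bool:
    """True if XID a is logically older than XID b (modulo-2^32 comparison)."""
    # Interpret the wrapped difference (a - b) as a signed 32-bit number:
    # a is older than b exactly when that difference is negative, i.e.
    # when it lands in the upper half of the unsigned range.
    diff = (a - b) & (XID_SPACE - 1)
    return diff >= XID_SPACE // 2

# Near wraparound, a numerically huge XID is logically *older* than a tiny one:
print(xid_precedes(4_294_967_290, 5))  # True: 4294967290 is in 5's "past"
print(xid_precedes(5, 4_294_967_290))  # False: 5 is in 4294967290's "future"
```

This is why falling behind on vacuum is dangerous: once the oldest unfrozen XID gets about two billion IDs behind the current one, old rows would start to look like they're in the future, so Postgres forces a blocking vacuum first.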

Even though it’s a well-documented issue with Postgres and you have an experienced team keeping an eye on it, a new write pattern could accelerate things into the danger zone quite quickly. At Notion we had a scary close call with this about a year ago that led to us splitting a production DB over the weekend to avoid hard downtime.

Whatever the issue is, I’m wishing the engineers working on it all the best.



I find it pretty odd to speculate that they are experiencing a very specific failure mode of a particular database. Do you even know whether they use Postgres?


Maybe their load balancer got hit by a 2009 Toyota Camry with a sticky accelerator pedal?


I know we're already in the weeds, but Malcolm Gladwell’s “Revisionist History” did an excellent episode on the stuck accelerator problem, basically showing it almost certainly didn't happen.

https://www.pushkin.fm/episode/blame-game/


It is a very unlikely Y2K bug


It's not odd at all!

Speculation is a useful intellectual exercise, and the sign of a healthy, intelligent and curious mind!

Speculation is also fun!


No, it's just FUD directed at Postgres.


OP specifically pointed out that it's well documented. Every program makes tradeoffs, and I don't think anyone in this thread implied that because Postgres made this one it's unstable or bad.


The specific behavior of Postgres isn't the FUD, the FUD is the speculation that this is what brought down Roblox for multiple days. As others have mentioned, we don't even know if Roblox uses Postgres, and yet we're diving deep on how an edge case of Postgres brought down Roblox. It's the speculation that I think is FUD.


I did too but it's pretty easy to find out which database(s) a company uses:

https://corp.roblox.com/careers/listing/?gh_jid=3363389


Somebody mentioned a secret store issue that affected all of their services.

https://news.ycombinator.com/item?id=29044500


Roblox has 43+ million daily active users. The issue you are wildly speculating about is a vanishingly small risk at their size and scale. I guarantee they dealt with that potential issue (if they even use Postgres) years ago.


Do you have any reason to believe this is the case?


I'd speculate that it's more likely a data corruption problem. A system that was overwhelmed or misconfigured corrupted critical configuration data, and that corruption then propagated to a large number of dependencies. Roblox tried to restore its data from backup, a process that was not necessarily rehearsed regularly or rigorously and therefore took longer than expected. All other services would then have to restore their systems in a cascaded fashion while sorting out complex dependencies and constraints, which could take days.


If it's a known issue, is there no way to increase the transaction ID size?

Quite surprising a seemingly battle-tested database can choke in such a manner.


It's an edge case: it only happens _if_ you have very long-running transactions holding xids open, or you haven't vacuumed your database enough and it has to do a blocking vacuum to freeze old rows before the xid counter wraps around.

Most postgres databases wrap around with no issues.

The problem with increasing the size is that a pair of xids (xmin/xmax) is stored in every row header, so you're doubling that overhead if you go to 64 bits.
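To put a rough number on that doubling (assuming, as above, that xmin and xmax each occupy 4 bytes in every row header today, and would occupy 8 bytes each at 64 bits; the table size is a made-up example):

```python
# Back-of-envelope: extra per-row overhead from widening xids to 64 bits.
# Assumption: two xids (xmin + xmax) per row, 4 bytes each -> 8 bytes each.

ROWS = 10_000_000_000            # hypothetical 10-billion-row table
XID_BYTES_32 = 4
XID_BYTES_64 = 8

overhead_now  = ROWS * 2 * XID_BYTES_32   # xmin + xmax at 32 bits
overhead_wide = ROWS * 2 * XID_BYTES_64   # xmin + xmax at 64 bits

extra_gib = (overhead_wide - overhead_now) / 2**30
print(f"extra per-row overhead: {extra_gib:.1f} GiB")
```

That comes out to roughly 75 GiB of additional header bytes for such a table, before counting any knock-on effects on cache hit rates and I/O.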


IDK about Postgres internals, but typically switching to int64 means recompiling all your binaries, plus your existing data format on disk needs to be converted.


That's assuming a lot, including that the binaries aren't 64 bit already (a bit unlikely nowadays), and the database wouldn't just use a 32 bit datatype for this specific purpose in this specific configuration. (If this issue has anything to do with transaction IDs at all, as covered elsewhere.)


They're not wrong. It's not the software that's the issue, it's the data stored on disk: the transaction ID is stored in the row data for various reasons.

In theory nowadays it wouldn't be too hard to change if you use logical replication to upgrade the database, but it'd be a huge undertaking for a lot of companies.


Right, but the "you need to recompile binaries if you switch something to 64 bit" part was a bit too general.


You seem to confuse 64 bit integers with 64 bit programs


I didn't.


But you wrote "that's assuming a lot, including that the binaries aren't 64 bit already" as a response to "typically switching to int64 means recompiling all your binaries, plus your existing data format on disk needs to be converted". Personally I definitely didn't read the latter as "recompile binaries from 32 bits to 64 bits". (I'm assuming that's what you meant by "not 64 bit already"?) More like "recompile 64 bit binaries to use different internal and external data structures".


No, I can't be positive, but I'm pretty sure they meant recompiling binaries to be x86-64 instead of 32 bit x86, or armv8/arm64 instead of armv7... At least that's how I took it, because not many things are compile-time constants nowadays, especially for databases. It doesn't distribute well.


The size of Postgres XIDs is in fact a compile-time constant.

If you’re wondering why it’s one: 1. it affects the size of a core data structure (that also gets serialized to disk and read back directly into said structure), and 2. basically nobody who has this problem (already a vanishingly small number of people) solves it by changing this constant, since you only have this problem at scale, and changing it would make your scale problems worse (and also make all your existing data unreadable by the new daemon.)

Also, FYI, Postgres uses compile-time constants for things that one might want to change much more frequently (though still in the “vanishingly unlikely” realm), e.g. WAL segment sizes. When you change internal constants like this, it’s expected that you’re changing them for every member of the cluster, or more likely building a new cluster from the ground up that is tuned in this strange way, and importing your data into it. “Doesn’t distribute well” never really enters into it.


Well, I stand corrected. I thought OP did not know that (given they said they didn't know anything about postgres and were talking about "recompiling all binaries"), but reading their posting history they probably do.

With "doesn't distribute well" by the way I meant that it doesn't distribute well as program binaries, not across a cluster. It used to be extremely common to recompile e.g. your Linux kernel, nowadays almost nobody does that unless there are some very specific needs. Of course, building a specialized postgres cluster for exceptional scale would easily qualify.


Well, I'm pretty sure of the exact opposite. Maybe we'll get adjudicated by semi-extrinsic if we're lucky. :)


Yes, you are right. I was not talking about switching to x86-64, but the fact that most code on 64 bit platforms still uses 32 bit integers and that you'll have to recompile to get 64 bit ints.

This shows up sometimes also in numerical programming, e.g. when your meshing ends up producing more than ~4.2 billion grid points. But it is quite rare, precisely because UINT32_MAX is a fairly huge number.
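A toy illustration of that failure mode (made-up numbers; the point is that an unsigned 32-bit counter doesn't error out, it silently aliases a much smaller value):

```python
# Sketch: what an unsigned 32-bit counter actually stores once a count
# exceeds UINT32_MAX (4,294,967,295). Names are illustrative.

UINT32_MAX = 2**32 - 1

def as_uint32(n: int) -> int:
    """Value an unsigned 32-bit integer would hold for n (wraps, no error)."""
    return n & UINT32_MAX

print(as_uint32(4_294_967_295))  # 4294967295: still fits
print(as_uint32(5_000_000_000))  # 705032704: wrapped, silently wrong
```

In a mesh with 5 billion points, index 5,000,000,000 would collide with index 705,032,704, which typically surfaces as baffling corruption rather than a clean overflow error.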


> I'm pretty sure they meant recompiling binaries to be x86-64 instead of 32 bit x86

I don't see why you should think that given that the discussion has been pretty much about the uint32 nature of TransactionId.


Because OP said they didn't know anything about Postgres and were speaking in generic terms, and the way they said "switching to int64" and recompiling all binaries just struck me as "if you want a 64-bit system, you need to recompile it" (which is true). I'm surprised it's a compile-time constant, as I just learned.

Reading their post history now they probably do know what they're talking about, though.


There's a reason for even database servers with 32-bit transaction IDs to be 64-bit processes - performance and memory. AFAIK until Firebird 3.0, Firebird also had 64-bit server builds with 32-bit transaction IDs - it would have broken many external tools had the old on-disk format been changed.


Offtopic: why does something that everyone here recommends all the time need something like vacuum, which is heavy and can fail? Why do people not cry about that as much as the Python GIL? It felt like a hack 15+ years ago and it feels just as weird now. I'm curious why it was never changed; it's obviously hard, but is it not regarded as a great pain and a priority to resolve?


This is a hell of a comparison to try to make. How many people do you think regularly run into this kind of issue in Postgres? How many people do you think regularly run into the Python GIL?

Even if we assume your premise that vacuum is at least as bad as the GIL is correct, this would easily explain it.

Add in the sheer difference in expectations between a programming language and a database... and well, I don't think there's any mystery here at all. There's easily multiple great explanations, even if your premise is correct.


> why does something that everyone here recommends all the time need something like vacuum which is heavy and can fail?

Because it's better than the alternatives? What would you suggest?


I thought it was a certificate issue? I am looking at robox.com and the issuing CA is GoDaddy...


Certificate issues don't take that long to resolve.


They do if your entire PKI infra is down too


A company's internal PKI infrastructure wouldn't be responsible for issuing a public-facing certificate. They literally can't sign those -- a real CA has to do it.


You are of course correct, but usually public and private PKI reuse some core components of the infra (e.g. you still need to store signed key pairs somewhere safe). I'm speculating here, but given how long it's been down, some very core and very difficult-to-recover service must have failed. Security infra tends to have those properties.


Downtime is expensive. You could just bypass your infra and manually get it working so that you can fix your infra while production is up instead of when it's down.


That's in fact how most high-impact events should be handled: mitigate the issue with a potentially short-term solution, once things are back up find the root cause, fix the root cause, and perform a thorough analysis of events to ensure it won't happen again.


Depending on the level of automation that may not be possible. That's like saying that if a factory-line robot fails, "you just bypass the line and manually weld those car bodies".


Wait. You can sign your own. They are just not trusted by the wider world. Your devices have an OS provided set of trusted root-CA.


I’m just speculating. I didn’t do any in depth research - none of the articles or tweets by Roblox I saw offered anything more than “an internal issue”.


Yeah but wouldn’t the impact be more widespread then?


robox or roblox?


I think he must have meant robox right? Cmon man get outta here.. :)


Roblox.com


It amazes me that anything in common use has these kinds of absurd issues. Postgres has a powerful query language and a lot of great features but the engine itself is clunky and hairy and feels like something from the early 90s. The whole idea of a big expensive vacuum process is crazy in 2021, as is the difficulty of clustering and failover in the damn thing.

CockroachDB is an example of what a modern database should be like.


Funnily enough it looks like Roblox actually uses CockroachDB https://corp.roblox.com/careers/listing/?gh_jid=3363389


Maybe so, but my comment was tangential about Postgres not Roblox. The parent was speculating. We don’t know what their problem is.


The latest version of Postgres addresses this issue in various ways; it's not entirely solved, but it should be significantly mitigated.


That’s not a bad theory. Even the homepage is down so that suggests their entire database was taken down.


I doubt their homepage is being served from the same DB as their game platform.

I think throwing together a static page better than "we're making the game more awesome" would be simple. It kinda makes me wonder if it's an internal auth/secret issue as has been speculated. That could theoretically make it harder to update the website, especially if it's deployed by CI/CD.


Do any other databases have similar issues to Postgres? Or this is specific to Postgres?


Years ago I ran into this issue with MySQL, storing four billion rows with a unique ID.


This is a different issue. You ran out of IDs for your row identifier. The PostgreSQL issue is that every transaction has an ID; normally this is a fairly invisible internal thing you don't need to worry about, but in some edge cases this transaction ID can cause problems. You could, in theory, run into this issue with just one row.

I don't know how MySQL or MariaDB handles this; AFAIK it doesn't have this issue.


Was it because your primary key was an unsigned 32-bit integer? That's not an "issue".


True. Not the same thing but an equally disastrous outcome.



