
I wonder if they have a major Postgres database that hit transaction ID wraparound? Postgres uses 32-bit transaction IDs, and IDs are only reclaimed by the vacuum maintenance process, which can fall behind if the DB is under heavy write load. Other companies have been bitten by this before, e.g. Sentry in 2015 (https://blog.sentry.io/2015/07/23/transaction-id-wraparound-...). Depending on the size of the database, you could be down for several days waiting for Postgres to clean things up.
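For the curious, the "wraparound" isn't the counter simply hitting a ceiling: XID comparison is circular over the 32-bit space, so for any given XID roughly two billion IDs count as "past" and two billion as "future". A rough Python sketch of that circular comparison (function and constant names are mine, not Postgres's, though the logic mirrors the signed-difference trick Postgres uses in `TransactionIdPrecedes`):

```python
# Sketch: circular comparison over a 32-bit transaction ID space.
# All names here are illustrative, not Postgres internals.

XID_SPACE = 2**32  # XIDs are unsigned 32-bit integers

def xid_precedes(a: int, b: int) -> bool:
    """True if XID a is logically older than XID b (modulo-2^32 comparison)."""
    # Interpret the wrapped difference (a - b) as a signed 32-bit number:
    # a is older than b exactly when that difference is negative, i.e.
    # when it lands in the upper half of the unsigned range.
    diff = (a - b) & (XID_SPACE - 1)
    return diff >= XID_SPACE // 2

# Near wraparound, a numerically huge XID is logically *older* than a tiny one:
print(xid_precedes(4_294_967_290, 5))  # True: 4294967290 is in 5's "past"
print(xid_precedes(5, 4_294_967_290))  # False: 5 is in 4294967290's "future"
```

This is why falling behind on vacuum is dangerous: once the oldest unfrozen XID gets about two billion IDs behind the current one, old rows would start to look like they're in the future, so Postgres forces a blocking vacuum first.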

Even though it’s a well-documented issue with Postgres and you have an experienced team keeping an eye on it, a new write pattern could accelerate things into the danger zone quite quickly. At Notion we had a scary close call with this about a year ago that led to us splitting a production DB over the weekend to avoid hard downtime.

Whatever the issue is, I’m wishing the engineers working on it all the best.



I find it pretty odd to speculate that they are experiencing a very specific failure mode of a particular database. Do you even know whether they use Postgres?


Maybe their load balancer got hit by a 2009 Toyota Camry with a sticky accelerator pedal?


I know we're already in the weeds, but Malcolm Gladwell’s “Revisionist History” did an excellent episode on the stuck accelerator problem, basically showing it almost certainly didn't happen.

https://www.pushkin.fm/episode/blame-game/


It is a very unlikely Y2K bug


It's not odd at all!

Speculation is a useful intellectual exercise, and the sign of a healthy, intelligent and curious mind!

Speculation is also fun!


No, it's just FUD directed at Postgres.


OP specifically pointed out that it's well documented. Every program makes tradeoffs, and I don't think anyone in this thread implied that because Postgres made this one it's unstable or bad.


The specific behavior of Postgres isn't the FUD, the FUD is the speculation that this is what brought down Roblox for multiple days. As others have mentioned, we don't even know if Roblox uses Postgres, and yet we're diving deep on how an edge case of Postgres brought down Roblox. It's the speculation that I think is FUD.


I did too but it's pretty easy to find out which database(s) a company uses:

https://corp.roblox.com/careers/listing/?gh_jid=3363389


Somebody mentioned a secret store issue that affected all of their services.

https://news.ycombinator.com/item?id=29044500


Roblox has 43+ million daily active users. The issue you are wildly speculating about is a vanishingly small risk at their size and scale. I guarantee they dealt with that potential issue (if they even use Postgres) years ago.


Do you have any reason to believe this is the case?


I'd speculate that it's more likely a data corruption problem. A system that was overwhelmed or misconfigured corrupted critical configuration data, and that corruption then propagated to a large number of dependencies. Roblox tried to restore its data from backup, a process that was not necessarily rehearsed regularly or rigorously and therefore took longer than expected. All other services would then have to restore their systems in a cascaded fashion while sorting out complex dependencies and constraints, which could take days.


If it's a known issue, is there no way to increase the transaction ID size?

Quite surprising a seemingly battle-tested database can choke in such a manner.


It's an edge case: it only happens _if_ you have very long-running transactions holding xids open, or you haven't vacuumed your database enough and it has to do a blocking vacuum to freeze old rows before the xid counter wraps around.

Most postgres databases wrap around with no issues.

The problem with increasing the size is that a pair of xids (xmin/xmax) is stored in every row header, so you're doubling that overhead if you go to 64 bits.
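To put a rough number on that doubling (assuming, as above, that xmin and xmax each occupy 4 bytes in every row header today, and would occupy 8 bytes each at 64 bits; the table size is a made-up example):

```python
# Back-of-envelope: extra per-row overhead from widening xids to 64 bits.
# Assumption: two xids (xmin + xmax) per row, 4 bytes each -> 8 bytes each.

ROWS = 10_000_000_000            # hypothetical 10-billion-row table
XID_BYTES_32 = 4
XID_BYTES_64 = 8

overhead_now  = ROWS * 2 * XID_BYTES_32   # xmin + xmax at 32 bits
overhead_wide = ROWS * 2 * XID_BYTES_64   # xmin + xmax at 64 bits

extra_gib = (overhead_wide - overhead_now) / 2**30
print(f"extra per-row overhead: {extra_gib:.1f} GiB")
```

That comes out to roughly 75 GiB of additional header bytes for such a table, before counting any knock-on effects on cache hit rates and I/O.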


IDK about Postgres internals, but typically switching to int64 means recompiling all your binaries, plus your existing data format on disk needs to be converted.


That's assuming a lot, including that the binaries aren't 64 bit already (a bit unlikely nowadays), and the database wouldn't just use a 32 bit datatype for this specific purpose in this specific configuration. (If this issue has anything to do with transaction IDs at all, as covered elsewhere.)


They're not wrong. It's not the software that's the issue, it's the data stored on disk: the transaction ID is stored in the row data for various reasons.

In theory nowadays it wouldn't be too hard to change if you use logical replication to upgrade the database, but it'd be a huge undertaking for a lot of companies.


Right, but the "you need to recompile binaries if you switch something to 64 bit" part was a bit too general.


You seem to confuse 64 bit integers with 64 bit programs


I didn't.


But you wrote "that's assuming a lot, including that the binaries aren't 64 bit already" as a response to "typically switching to int64 means recompiling all your binaries, plus your existing data format on disk needs to be converted". Personally I definitely didn't read the latter as "recompile binaries from 32 bits to 64 bits". (I'm assuming that's what you meant by "not 64 bit already"?) More like "recompile 64 bit binaries to use different internal and external data structures".


No, I can't be positive, but I'm pretty sure they meant recompiling binaries to be x86-64 instead of 32 bit x86, or armv8/arm64 instead of armv7... At least that's how I took it, because not many things are compile-time constants nowadays, especially for databases. It doesn't distribute well.


The size of Postgres XIDs is in fact a compile-time constant.

If you’re wondering why it’s one: 1. it affects the size of a core data structure (that also gets serialized to disk and read back directly into said structure), and 2. basically nobody who has this problem (already a vanishingly small number of people) solves it by changing this constant, since you only have this problem at scale, and changing it would make your scale problems worse (and also make all your existing data unreadable by the new daemon.)

Also, FYI, Postgres uses compile-time constants for things that one might want to change much more frequently (though still in the “vanishingly unlikely” realm), e.g. WAL segment sizes. When you change internal constants like this, it’s expected that you’re changing them for every member of the cluster, or more likely building a new cluster from the ground up that is tuned in this strange way, and importing your data into it. “Doesn’t distribute well” never really enters into it.


Well, I stand corrected. I thought OP did not know that (given they said they didn't know anything about postgres and were talking about "recompiling all binaries"), but reading their posting history they probably do.

With "doesn't distribute well" by the way I meant that it doesn't distribute well as program binaries, not across a cluster. It used to be extremely common to recompile e.g. your Linux kernel, nowadays almost nobody does that unless there are some very specific needs. Of course, building a specialized postgres cluster for exceptional scale would easily qualify.


Well, I'm pretty sure of the exact opposite. Maybe we'll get adjudicated by semi-extrinsic if we're lucky. :)


Yes, you are right. I was not talking about switching to x86-64, but the fact that most code on 64 bit platforms still uses 32 bit integers and that you'll have to recompile to get 64 bit ints.

This shows up sometimes also in numerical programming, e.g. when your meshing ends up producing more than ~4.2 billion grid points. But it is quite rare, precisely because UINT32_MAX is a fairly huge number.
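A toy illustration of that failure mode (made-up numbers; the point is that an unsigned 32-bit counter doesn't error out, it silently aliases a much smaller value):

```python
# Sketch: what an unsigned 32-bit counter actually stores once a count
# exceeds UINT32_MAX (4,294,967,295). Names are illustrative.

UINT32_MAX = 2**32 - 1

def as_uint32(n: int) -> int:
    """Value an unsigned 32-bit integer would hold for n (wraps, no error)."""
    return n & UINT32_MAX

print(as_uint32(4_294_967_295))  # 4294967295: still fits
print(as_uint32(5_000_000_000))  # 705032704: wrapped, silently wrong
```

In a mesh with 5 billion points, index 5,000,000,000 would collide with index 705,032,704, which typically surfaces as baffling corruption rather than a clean overflow error.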


> I'm pretty sure they meant recompiling binaries to be x86-64 instead of 32 bit x86

I don't see why you should think that given that the discussion has been pretty much about the uint32 nature of TransactionId.


Because OP said they didn't know anything about Postgres and were speaking in generic terms, and the way they said "switching to int64" and recompiling all binaries just struck me as "if you want a 64-bit system, you need to recompile it" (which is true). I'm surprised it's a compile-time constant, as I just learned.

Reading their post history now they probably do know what they're talking about, though.


There's a reason for even database servers with 32-bit transaction IDs to be 64-bit processes - performance and memory. AFAIK until Firebird 3.0, Firebird also had 64-bit server builds with 32-bit transaction IDs - it would have broken many external tools had the old on-disk format been changed.


Offtopic: why does something that everyone here recommends all the time need something like vacuum, which is heavy and can fail? Why do people not cry about that as much as the Python GIL? It felt like a hack 15+ years ago and it feels just as weird now. I'm curious why it was never changed; it's obviously hard, but is it not regarded as a great pain and a priority to resolve?


This is a hell of a comparison to try to make. How many people do you think regularly run into this kind of issue in Postgres? How many people do you think regularly run into the Python GIL?

Even if we assume your premise that vacuum is at least as bad as the GIL is correct, this would easily explain it.

Add in the sheer difference in expectations between a programming language and a database... and well, I don't think there's any mystery here at all. There's easily multiple great explanations, even if your premise is correct.


> why does something that everyone here recommends all the time need something like vacuum which is heavy and can fail?

Because it's better than the alternatives? What would you suggest?


I thought it was a certificate issue? I am looking at robox.com and the issuing CA is GoDaddy...


Certificate issues don't take that long to resolve.


They do if your entire PKI infra is down too


A company's internal PKI infrastructure wouldn't be responsible for issuing a public-facing certificate. They literally can't sign those -- a real CA has to do it.


You are of course correct, but usually public and private PKI reuse some core components of the infra (e.g. you still need to store signed key pairs somewhere safe). I'm speculating here, but given how long it's been down, some very core and very difficult-to-recover service must have failed. Security infra tends to have those properties.


Downtime is expensive. You could just bypass your infra and manually get it working so that you can fix your infra while production is up instead of when it's down.


That's in fact how most high-impact events should be handled: mitigate the issue with a potentially short-term solution, once things are back up find the root cause, fix the root cause, and perform a thorough analysis of events to ensure it won't happen again.


Depending on the level of automation that may not be possible. That's like saying that if a factory-line robot fails, "you just bypass the line and manually weld those car bodies".


Wait. You can sign your own. They are just not trusted by the wider world. Your devices have an OS provided set of trusted root-CA.


I’m just speculating. I didn’t do any in depth research - none of the articles or tweets by Roblox I saw offered anything more than “an internal issue”.


Yeah but wouldn’t the impact be more widespread then?


robox or roblox?


I think he must have meant robox right? Cmon man get outta here.. :)


Roblox.com


It amazes me that anything in common use has these kinds of absurd issues. Postgres has a powerful query language and a lot of great features but the engine itself is clunky and hairy and feels like something from the early 90s. The whole idea of a big expensive vacuum process is crazy in 2021, as is the difficulty of clustering and failover in the damn thing.

CockroachDB is an example of what a modern database should be like.


Funnily enough it looks like Roblox actually uses CockroachDB https://corp.roblox.com/careers/listing/?gh_jid=3363389


Maybe so, but my comment was tangential about Postgres not Roblox. The parent was speculating. We don’t know what their problem is.


The latest version of Postgres addresses this issue in various ways; it's not entirely solved, but it should be significantly mitigated.


That’s not a bad theory. Even the homepage is down so that suggests their entire database was taken down.


I doubt their homepage is being served from the same DB as their game platform.

I think throwing together a static page better than "we're making the game more awesome" would be simple. It kinda makes me wonder if it's an internal auth/secret issue as has been speculated. That could theoretically make it harder to update the website, especially if it's deployed by CI/CD.


Do any other databases have similar issues to Postgres? Or this is specific to Postgres?


Years ago I ran into this issue with MySQL, storing four billion rows with a unique ID.


This is a different issue. You ran out of IDs for your row identifier. The PostgreSQL issue is that every transaction has an ID; normally this is a fairly invisible internal thing you don't need to worry about, but in some edge cases this transaction ID can cause problems. You could, in theory, run into this issue with just one row.

I don't know how MySQL or MariaDB handles this; AFAIK it doesn't have this issue.


Was it because your primary key was an unsigned 32-bit integer? That's not an "issue".


True. Not the same thing but an equally disastrous outcome.



