
Twitter used to experience significant downtime compared to all other major platforms, and one of the reasons was its lack of redundancy across everything. Headcount is one such thing, and it takes manpower to automate infrastructure, as discussed in the post.

Sure, you can run the platform with 1/10 the headcount and a significantly degraded user experience (say ~98%). This is not a problem for startups, but people usually have higher expectations for established companies. As always, the last 2% is a hard problem, and businesses don't really want to deal with such an unreliable platform. You wanna onboard big advertisers that might spend $100M a year? Then you need to assign a dedicated account manager to handle all their escalations. PMs then triage and plan their feature requests, and later engineers implement them. It all adds up.

And they also use your competitors' products: Google, FB, TikTok, etc. Twitter is a severe underdog here, so you need to support at least a minimal, essential subset of the features in those products to convince them to spend their money on Twitter. That alone takes hundreds of engineers, data scientists and PMs, thanks to the massive complexity of modern ad-serving stacks.

Yeah, it ultimately boils down to a simple fact: it's really hard to take other folks' money. You need to earn their trust first. They want to see whether your product can meet the modern standard of digital ad serving, now and for the foreseeable future. Twitter has spent a lot of time earning that trust, and the original post is one piece of evidence of those efforts. That usually takes more manpower. You might be able to do it more efficiently, but I don't think it's as simple as firing 75% of your entire headcount.



> Sure, you can run the platform with 1/10 the headcount and a significantly degraded user experience (say ~98%). This is not a problem for startups, but people usually have higher expectations for established companies.

This exactly. During the recent WhatsApp outage, many threads popped up on HN about how big of an issue this is in Europe, since WhatsApp is the main messaging platform there. Thankfully, these outages are short and few and far between, so they never actually cause real issues. This obviously costs Meta/Facebook a lot of money, but it allows them to be an essential service. So essential, in fact, that every major news outlet in my country sends a push message as soon as WhatsApp is down.

If Twitter wants to be a comparably important platform, it needs that same stability. And Twitter, for me, is very much the best place to stay up to date on any current event (in near real-time). Reddit used to be pretty good with Live, but that's pretty much dead now (and was mostly a summary of tweets anyway). I really hope Twitter survives Elon, because I don't know of an alternative right now that has the same value for this use case.


I don't remember WhatsApp being less stable before it was bought by Meta. And it was just as essential back then too.


Yes, it felt more reliable when it was on its own infra, before the migration to FB's internal infrastructure.


WhatsApp is even more essential in most of APAC and South America.


I think the opposite. A lot of software was at its best when the team was small. Software companies have to hire many people because they need to report growth to investors; headcount is one measure of growth. That's not necessarily good for the product (actually, many times it hurts the product), but overall it's good for the company: the company can enter new areas and explore new things.

What Twitter is doing is scaling down first to focus on the product, and once it regains traction, it can definitely scale up again. I don't think it will hurt the product very much.


> Software companies have to hire many people because they need to report growth to investors; headcount is one measure of growth.

I don't think you have a good understanding of how those companies grow and scale out. Don't take growth for granted. The "right product" or "right technology" won't give you that. It only comes from solving thousands of very specific, never-ending customer problems. If you do B2B, you need to spend most of your time on very specific requests from priority customers. And there isn't just one of them; there are hundreds if you're targeting a $xB business. It's just physically impossible to keep up with a small team, even with very aggressive prioritization.

Still not convinced? Google has a notoriously bad reputation for its customer support, primarily because of its tendency to keep "inessential headcount" as low as possible. And think about how many cloud customers they lost to AWS and Azure. TK (Thomas Kurian) came to GCP, and his first move was adding an army of sales and account managers. That almost immediately yielded a rapid acceleration of the platform, although it was too late to catch up.


Scaling back up is really hard though. We had a de facto freeze on hiring (not exactly hiring freeze; more of a headcount cap) just shy of a decade ago to focus on our product. During that time, some of our best recruiters left because they basically had nothing to do anymore.

The freeze worked: we got rid of some products that weren't getting traction and were able to improve the products that did have traction. But the cost of the freeze lingered for at least a year; it reset the hiring pipeline, we couldn't grow fast when we needed to because the limited number of recruiters we had were already overworked, and the limited number of engineers had to balance interviewing needs with their real work. This all happened when my employer was <10% of its current size and pre-IPO, and we didn't even take a headcount reduction.

Twitter is simply at a different scale. 7500 -> 2500 employees is a 66% reduction. Going 2500 -> 7500 is a 200% increase. Recruiting is likely totally gutted, and the current 2500 employees have to support systems previously maintained by a 7500-person company. If they decide they need to grow, it'll have to restart at a snail's pace, and they'll have to make sacrifices on feature development or stability along the way.

Edit: for what it's worth, the fastest way to regrow back 200% is to rehire the people laid off. But, given that I happened to interview earlier today an ex-Twitter candidate who didn't make it through the Elon snap, that route is rapidly closing up.


It's not only hard, it also may or may not work. It's the same process Twitter already went through years ago. I simplified the issue and talked only about the product. I don't disagree with you.


> What Twitter is doing is to scale down first

This is not a trivial task. With such a heavy reduction, and with entire teams completely gutted, there will be a lot of lost knowledge. I'm sure there will even be cases where the people who stay don't know what knowledge was lost.

All that puts Twitter in a very risky position, especially for a product of such complexity that does so much in-house. It shouldn't be underestimated.


> Software companies have to hire many people because they need to report growth to investors; headcount is one measure of growth.

I mean, this is just wrong. Companies are always under pressure to cut costs (employees), and it always comes up when quarterly results are posted. Look at how the market reacted to Facebook's latest results, and then again at what happened when they laid off thousands of staff.


The headcount at WhatsApp in 2013 was somewhere between 50 and 100, at which time they were servicing approx. 400M MAU, which is more users than Twitter has been able to boast for most of its existence.

Coincidentally, in 2013 SpaceX was just starting to provide commercial launch capacity, at which point I think they too had < 100 software engineers. A few short years later they were re-using rockets, a feat many people had thought unlikely or impossible, and one that requires some hardcore software engineering.

Not surprised Elon Musk thinks he can run Twitter with a skeleton crew.


1) And what was their uptime in 2013? How did uptime change as the service grew in popularity?

2) WhatsApp does not support the type of public broadcasts done at Twitter, and due to its e2ee doesn't require much human moderation.


1. WhatsApp was more reliable before migration to Meta's infrastructure.

2. WhatsApp didn't have e2ee back then. The broadcasting is important, yes, but it is very heavily biased towards reads over writes, so something like Cloudflare would solve 99% of the load.


My point 1 was specifically phrased to ask whether it only seemed more reliable before the migration to FB because it didn't have to deal with the same load back then; you've said nothing to show it wasn't just a correlation.

Your point 2 suggests there were additional factors besides load that could have influenced reliability (they didn't simply migrate to FB infra, they also switched to the Signal protocol).

> The broadcasting is important, yes, but it is very heavily biased towards reads over writes, so something like Cloudflare would solve 99% of the load.

There's push-notifying millions of devices within seconds after a celeb or a major news source tweets. There's tracking view and engagement stats on that in real time. There's making sure a tweet is no longer available to any of those devices within seconds after it's been deleted or moderated. There are separate back-office apps for moderating that firehose of content. And that's just what I can see from the outside. An e2ee instant messenger with size-limited chat groups doesn't even come close.

Please don't say "just stick a CDN on top of it and you are 99% there"; it's embarrassing (and not to Twitter). That will maybe get you 80% there if your goal is "a microblogging platform", but not even 20% if your goal is being both the go-to news source and the shitpost forum for people worldwide, reliably working even in sensitive times and emergencies. Twitter used to be a microblogging platform back when it had far fewer employees, and you'd see a fail whale regularly even though it had far fewer active users. In recent years it's a completely different beast, and saying the increased headcount is unrelated is amusing.


WhatsApp has scaled less than 10x since acquisition. They used to handle ~3M open TCP connections per server, and as a result could run their entire operation with under 300 servers.
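A quick back-of-envelope using the figures quoted in this thread (the peak-concurrency ratio below is my own assumption for illustration, not a published number):

```python
# Rough server-count estimate from the thread's figures.
# Assumptions: ~450M MAU around the acquisition era, the widely
# reported ~3M concurrent TCP connections per server, and a guessed
# 50% peak concurrency (illustrative only).
users = 450_000_000
conns_per_server = 3_000_000
peak_concurrency = 0.5

servers_needed = users * peak_concurrency / conns_per_server
print(servers_needed)  # 75.0 — comfortably "under 300 servers"
```

Even with a generous concurrency guess, the connection-handling side fits the "under 300 servers" claim with room to spare.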

The push notification argument is also overstated. Sharding and fan-out solve the burstiness. And people overall receive a similar number of messages (and thus push notifications) from WhatsApp as from Twitter. Besides, these days the push notifications go through Google/Apple servers anyway, to reduce the number of open connections needed on the phone side.
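For what it's worth, the shard-and-fan-out idea can be sketched in a few lines (a toy illustration; the shard count and function names are made up, not anyone's real design):

```python
import hashlib

NUM_SHARDS = 16  # toy value; a real deployment would use far more

def shard_for(user_id: str) -> int:
    """Deterministically map a follower id to a shard."""
    return hashlib.sha256(user_id.encode()).digest()[0] % NUM_SHARDS

def fan_out(followers: list[str]) -> dict[int, list[str]]:
    """Split a huge follower list into per-shard batches, so each
    shard worker pushes notifications only to its own slice."""
    batches: dict[int, list[str]] = {i: [] for i in range(NUM_SHARDS)}
    for follower in followers:
        batches[shard_for(follower)].append(follower)
    return batches

batches = fan_out([f"user{i}" for i in range(100_000)])
assert sum(len(b) for b in batches.values()) == 100_000  # nobody dropped
```

A celebrity tweet then becomes NUM_SHARDS independent, roughly equal-sized push jobs instead of one giant burst.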

Then there are DMs. They are per person, so CDNs don't help much (just static assets), but they also shard basically perfectly. So, shard them.

Which in the end leaves the user feeds. Designed correctly, sharding would work extremely well, and what doesn't shard could be handled by caching closer to users for the ~1k most popular accounts.

Honestly, with the correct architecture, languages and tooling, it could be handled by an experienced 50-person dev team plus another hundred in ops. Obviously Twitter doesn't have the perfect setup, so maybe an order of magnitude more? And if you throw a bunch of subpar engineers and tooling at the problem, nothing can dig you out of inefficiencies at this scale anyway.

And no, I'm not wildly optimistic here. StackOverflow still runs off of 9 on-prem servers [0]. I've seen message queues that can push 200M notifications per second on a single machine (written in C++, for HFT). This stuff is hard, yes, but throwing more bodies at it doesn't help once your fundamentals are solved.

0. https://www.datacenterdynamics.com/en/news/stack-overflow-st...


> WhatsApp has scaled less than 10x since acquisition. They used to handle ~3M open TCP connections per server, and as a result could run their entire operation with under 300 servers.

They switched to a new protocol and grew from 200 million to, I guess, about a billion users since 2013. If you believe a team of 50 developers could deal with that without causing extensive downtime and service disruption along the way, I pray you never manage software engineers.

> Sharding and fan-out solves the burstiness.

Great, at least it's no longer "just add a CDN to solve 99%" here ;)

> And people overall receive a similar number of messages (and thus push notifications) from WhatsApp as Twitter.

Yeah, again: WhatsApp has many users, but as an engineer you just never have to worry about delivering a message instantly to more than 32 people (512 as of this year), and you never have to moderate any of it, because it's e2ee and there are no adverts next to the messages. It's basically dumb pipes terminated by one native client. Twitter has to maintain a mix of automated and human review of all UGC, and it's accessible via extensive APIs and a search-engine-indexed web app in addition to the native client.

> Then there are DMs

Let's ignore Twitter's DMs; even without them it's far more complex and demanding than an IM app.

> StackOverflow still runs off of 9 on-prem servers [0].

Yeah, and SO's maintenance page or read-only mode is up about once a month and lasts dozens of minutes. What are you even doing bringing up a niche programmer-oriented help forum for comparison here?

You may be stuck in the times when Twitter was a RoR-based microblogging platform. It hasn't been that for years.


I do manage software engineers, focusing on HPC (image processing, primarily), and one thing I consistently see from people who work with 'classic' web tech is that they underestimate what modern hardware can do.

This isn't 2005 anymore. We have multiple parallel 40Gb LANs, 64 cores per socket with 2MB of L2(!!!) cache per core, and a full terabyte of RAM (!!!) per server. If you program with anything that allows cache-aware data structures and can avoid pointer chasing, your throughput will be astounding and your latency sub-millisecond. How else do you think WhatsApp managed ~3M clients connected per server without the in-flight messages alone overflowing memory, on top of all the TCP connection state?
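The pointer-chasing point is about memory layout: an array of heap objects scatters the data, while a contiguous array keeps it on adjacent cache lines. Python can only illustrate the shape of the difference (the actual cache win needs C/C++/Rust), but as a sketch:

```python
from array import array

N = 10_000

# Array-of-structs: every access chases a pointer to a separate heap
# object — the layout that thrashes caches in systems languages.
class Point:
    __slots__ = ("x", "y")
    def __init__(self, x: float, y: float):
        self.x, self.y = x, y

points = [Point(float(i), float(i)) for i in range(N)]
aos_sum = sum(p.x for p in points)

# Struct-of-arrays: all x coordinates stored contiguously, the layout a
# cache-aware design prefers for sequential scans.
xs = array("d", (float(i) for i in range(N)))
soa_sum = sum(xs)

assert aos_sum == soa_sum  # same result, very different memory layout
```

In a compiled language the contiguous version streams through the prefetcher at memory bandwidth, while the pointer-chasing version stalls on a cache miss per element.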

Things only get slow when scripting languages, serialisation, network calls and neural networks get involved. (AKA "I don't care if you want Docker; a function call is 10000x faster than getting a response over gRPC, and putting that in the hot loop will increase our hardware requirements by 20x.")

The more distributed your architecture, the more network overhead you introduce and the more machines you need. Running it the WhatsApp way, with fewer, higher-performance servers, simply scales better. Given the hardware improvements since 2013 alone, there was no reason for WhatsApp to change their architecture as they grew.

And if you think rolling out a new protocol while maintaining backwards compatibility is hard and that somehow adding more people will help, I have a team of engineers from Accenture to sell you. I did this straight out of university: to thousands of remote devices, over 2G networks, with many of the devices being offline for months between connections. You just need a solid architecture, competent people and (I can't stress this enough) excellent testing, both automated and manual. The team that did this was 6 engineers, and it wasn't their only responsibility.


1) Their uptime was great. Regardless of what happened to their uptime after that (when they got to >1B MAU), they were already bigger than Twitter at that point, so...

2) Public broadcasts make a lot of things easier because that means more of your workload is relatively straightforward caching (as evidenced by this blog post).


> great.

So, you don't have the numbers.

> easier

Have you even thought about moderation and all the other concerns that go with this? And how does instantly notifying millions of devices help with caching, for example?



