podc's comments

I'm watching fly.io with interest, but before I trust them with a production site I want to see how they handle their first major incidents: response time, lessons learnt, transparency. Most SRE skills related to your own operations are learnt on the battlefield, not from some cliché must-read book by Google engineers, afaic.

If it's Linode style, meaning status page updates delayed by as much as 15 minutes, zero-detail post-mortems ("this problem has been fixed by our engineers, thank you", yada yada), and the same issues repeating six months down the line, then I will be understandably disappointed.


I've only been with them through one major incident so far, and I recall them handling it reasonably well.

You can see them responding to customers and providing updates in real time here: https://community.fly.io/t/there-seems-to-be-an-outage-with-...

And a detailed postmortem here: https://community.fly.io/t/major-outage-portmortem-2021-10-1...

They also update their status page pretty diligently whenever something goes wrong, even for things that don't necessarily impact all customers (from what I can remember, the only recent item on there that affected my app directly was the Oct 13 one): https://status.flyio.net/history

