DNS Outage at DigitalOcean (digitalocean.com)
120 points by finne on March 24, 2016 | 121 comments


People hating on DO: "I'm losing thousands every hour." Well, then you should have had some failover in place if it's that valuable.

[1] https://twitter.com/rodrigoespinosa/status/71303563702097100...


For an example of this taken to a ludicrous extreme, several years ago an AWS user complained that downtime of a few EC2 instances was putting lives at risk. They were hosting cardiac monitoring services on single EC2 instances with no multi-AZ or multi-region capability.

AWS forum post: https://forums.aws.amazon.com/thread.jspa?threadID=65649&tst...

Previous HN discussion: https://news.ycombinator.com/item?id=2477345


I've been reading all the comments on Twitter too... like "err mai gawd I'm switching to AWS because of this". That's your failure to have a secondary DNS provider, and I highly doubt you'd actually switch.

Then another... "Today's @digitalocean DNS #outage is a reminder to not trust your entire business to one provider. Spread the love around!"

If your company is e-commerce and makes money by being 99.99% available, it's your own fault for having no fail-over.

another... ".@digitalocean that's two hours without DNS now...my company's websites could be losing thousands of £ in e-commerce! Please, an update!"


Times like this make you realize the difference between good clients and bad clients. Yes, they have a right to be upset, but claims like "could be losing thousands of dollars" are mostly exaggerated out of frustration.


heh yeah, I just laugh at all the tweets saying they're losing billions of dollars every minute their site/app is unavailable. All I can think is... if you're the next Amazon.com, I'm pretty sure you'd have some type of disaster plan in place should something like this happen.


I can't disagree with what you're saying, but I think we are all guilty of this. We expect more out of big-name services (100% uptime) than might be reasonable.

How many of us here have failover email services in case Gmail goes down? I think many companies would say they'd lose thousands in productivity if Google Apps suffers an outage yet I'd hazard that very few have failover plans.


That's because people like to complain... the reality is stuff happens, systems go down, and life tends to go on.

Yeah, if your building full of employees can't work because the internet is down, and the secondary is also down, then that's kind of crappy, and you may be paying people to twiddle their thumbs... anything short of that is kind of the cost of doing business...

There are redundancy options for a lot of things... If you're using only a single hosting provider for your infrastructure, and management scoffs at creating redundant, under-utilized systems... it's not as "mission critical" as people think/say.


I use dns.he.net for my DNS hosting. It's free up to 50 zones and has been rock solid. The other day I started having trouble accessing some of my domain names. It turned out that all of their DNS was down and was returning NXDOMAIN for pretty much any request, including their own domains. Oops. So I emailed their support (which is usually very quick to respond and is better than I have seen with lots of paid products). Well, it then occurred to me that I would not get a response, since the MX records for my domain were also hosted with them. Double oops.

On the plus side, in the past 4 years that I've used them this was the very first issue, and they fixed it within a couple of hours.

Anyone have any good recommendations on cheap or free backup DNS hosting?


Use www.dnsmadeeasy.com and then dns.he.net as your secondary DNS service, or vice versa. They will do zone transfers/updates from each other and work just fine.
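A quick way to sanity-check that the two providers are actually staying in sync is to compare the SOA serial each one serves. A minimal sketch, assuming the dnspython package; the nameserver IPs are placeholders you'd replace with your providers' actual servers:

    import dns.message
    import dns.query

    ZONE = "example.com."
    NAMESERVERS = {
        "dnsmadeeasy": "203.0.113.1",  # placeholder IP for your DNS Made Easy NS
        "he.net": "203.0.113.2",       # placeholder IP for your dns.he.net NS
    }

    for provider, ip in NAMESERVERS.items():
        response = dns.query.udp(dns.message.make_query(ZONE, "SOA"), ip, timeout=3)
        for rrset in response.answer:
            for rdata in rrset:
                # Matching serials suggest zone transfers are keeping up
                print(f"{provider}: serial {rdata.serial}")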


I had one customer on DO DNS and it was a "good enough" solution. Unfortunately, this came right in the middle of a marketing push for last-minute registrations. An annoyance, but not a major financial impact. (Maybe it will give the impression of excess demand. :)

I understand that things break and I should be ready for it. What I found unacceptable were the status updates. Basically, "we're working on it". No clue as to what was going on. A DDoS? Not a DDoS? Routing issues? Corrupt zone files? No clue? Any of those would be helpful as I needed to figure out if I should wait it out, or switch to Route 53.

In the information vacuum, I switched to Route 53. It works.


Route53 is a wonderful service, which has never had a global outage (so far).

I've wrapped it in git, to allow very quick updates to be made in a simple fashion:

https://dns-api.com/


How do you fail over your SOA on .com when the minimum TTL is 1 day?


The answer probably is not very helpful but... change the TTL to a lower value? Is it essential to have such a big TTL of 1 day?


Sure, have fun convincing Verisign. They're the ones controlling the .com registry. Your TTL values are meaningless to them when it comes to SOA. Minimum 1 day TTL.


Sorry, I meant NS glue changes on .com, not SOA.


DNS is hard. Very hard.

It may seem trivial when it works (hint: it's not), but some of the biggest fuck-ups I've seen in my professional life were caused by strange DNS behavior or DNS servers going kaboom.

I feel the pain of the DO engineers trying to mitigate this issue. I really do.


BS. DNS is a trivial thing to scale, compared to most other web-scale efforts.

Things break when people don't use 20 year old best practices. There is no defense against inexperience and ignorance.


I took the OP's comment as "it's hard to understand DNS, and the biggest fuck-ups happen because people think they understand DNS when they actually don't".

The problem with DNS is that it can work even when it is configured incorrectly. This makes people who have no idea what they are doing believe that they actually understand it. The strange issues with DNS only happen with strange configurations. When you follow best practices, everything is predictable.


All right. This I can agree with.


Please help the ignorant and provide a link to a description of those best practices.


> I feel the pain of the DO engineers trying to mitigate this issue. I really do.

Me too. Just last week they had another problem with DNS on the client side of things: resolving with Google Public DNS, which most droplets use by default, didn't work reliably. I hope they post a combined post-mortem for both of those incidents.


It's not hard; the problem is that everything relies on DNS, so when DNS goes down or has problems you get cascading failures.


That's why you use multiple providers.


I'll bite. You can have multiple NS records but only a single SOA. The .com registry minimum TTL for SOA is a day.

How in the world would "multiple providers" help you in a 6 hour outage?


You can have multiple NS records. You should have NS records that point to different companies' DNS servers, preferably on different continents.


That's great for NS records. What about SOA?


SOA isn't used for resolving names, AFAIK.


You're right. Sorry, I meant the NS glue records. Those have a 1 day minimum TTL. If you're not already running multiple providers, switching after one is down won't help much.


Unless you use a better resolver than the standard glibc resolver on Linux (e.g. dnsmasq, BIND, or similar running locally, with resolv.conf pointing at it), you appear doomed to slow lookups etc. if your first resolv.conf entry fails. Most of the resolv.conf options that might have helped (if you'd set them) simply don't work or don't do anything particularly useful in the versions used in the Linux distros I've tested.
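For reference, the resolv.conf options in question look like this. A sketch only, since (as noted above) their behavior varies by glibc version and distro, so test before relying on them:

    # /etc/resolv.conf
    nameserver 8.8.8.8
    nameserver 8.8.4.4
    # rotate spreads queries across servers; timeout/attempts shorten the
    # stall when the first server is dead -- where they work at all
    options rotate timeout:1 attempts:2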


Unbound is great for this.


Even with multiple providers, DNS issues are a cluster fuck.


Only if you don't know what you're doing. The problem with DNS is that it might work even when it is misconfigured, and misconfiguration is the source of strange issues.


I think that we all have areas where we don't know what we're doing. This is one of mine. With all the talk of how obvious/important/easy it is to have a failover in place in case this happens, I'm having trouble finding a good resource about setting up a redundant DNS. Running a droplet on Digital Ocean with Debian and Nginx.


Sounds like we're begging for someone to write a nice blog post on how they set up redundant DNS across multiple providers "the right way"... It would hit the front page in short order if anyone is willing to share how they think this should be mitigated, and specifically how common clients can be expected to behave when faced with the various types of outages that may occur!


Yeah I mean it takes work to do it correctly. I wouldn't call it a cluster fuck.


Suppose you have multiple providers, but one of them screws up and authoritatively denies the existence of all of your hosts?


That's what you keep an extremely low TTL for.


Which doesn't mean much when a nontrivial number of ISPs out there don't respect the TTL settings.

Source: days-long service degradation caused by customer ISPs caching bad DNS information well beyond the 10-minute TTL we had set.


One thing hosting providers could do better would be to split the risk a little by not handing the same DNS server name to every client that chooses to have the hosting provider supply DNS services.

The reason this might have some upside is that DDoS attacks against a specific DNS server are often intended to target one specific customer of a hosting provider. The attacker doesn't care about the side effects... just the original target.

Say, for example "controversialblog.com" is hosted on DO, and uses DO dns servers. The person attacking "controversialblog.com" looks up the NS records for the domain, and attacks that DNS server. The fact that it's one hostname that serves all of DO is of little interest to the attacker.

So, if DO would come up with, say, 10 separate hostnames they could hand out, then this sort of thing would take down 10% of their customers instead of 100%.
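A rough sketch of the idea in Python; the pool size and hostnames are hypothetical, and the point is just that the customer-to-shard mapping must be stable:

    import hashlib

    def nameservers_for(customer_domain: str, pools: int = 10) -> list:
        # Stable hash so a customer always lands on the same NS shard;
        # a DDoS on one shard's hostname then only hits ~1/pools of customers
        shard = int(hashlib.sha256(customer_domain.encode()).hexdigest(), 16) % pools
        return [f"ns{shard}a.example-dns.com", f"ns{shard}b.example-dns.com"]

    print(nameservers_for("controversialblog.com"))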


Yeah this is pretty unfortunate. We have some big investor meetings today and this unfortunately took our marketing site offline. Hopefully they resolve this soon - it's the first time we've ever experienced an issue with their service.

We really need fail-overs in place...small team problems.


If you are showing a demo or something, you can still navigate to your DO IP address. Of course, I don't know if other things (images, etc.) on your website also rely on their DNS.


For that kind of emergency, you can point the domain name to the IP address via /etc/hosts (there's a similar file in Windows as well).


c:\windows\system32\drivers\etc\hosts
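The entry format is the same on both systems; 203.0.113.10 here is a placeholder for your droplet's actual IP:

    # /etc/hosts (Linux/OS X) or the Windows path above
    203.0.113.10    example.com www.example.com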


We feel the pain as well, as our platform is unreachable. I'm now using another DNS server and changed the nameservers in the domain record. However, DNS propagation is taking some time. What are you doing at the moment as a fail-over?


Sorry if I'm trying to "teach grandma to suck eggs" but can't you just enter the domain in your local hosts file. If it's a network that needs access then presumably you have some sort of proxy/cache that could be seeded with the necessary domain+IP pairing? I suppose these aren't possible if you're trying to demo on someone else's network or in a public space or such.


We just switched it over to Route53 and set up some fail-overs there. Took us 5 mins and we're back online.

Looks like DO is still offline so it seems to have been a good call...


You're back online from your perspective. What about all the name servers that have your SOA cached still looking at DO? You're still down for them.


Tough shit, lol. We now have reduced TTL times for future occurrences, but there's nothing we can do for those users who are still experiencing an outage.


Minimum TTL for SOA records on .com is a day. Lowering your A record TTL isn't buying you much in case of DNS hosting failure. It's helpful if your site (not DNS) needs to move providers, though.


My rule: each provider should do a single thing:

- Hosting provider - hosts sites

- VPS/cloud provider - provides VMs

- Domain registrar - handles domain-related stuff, but not DNS

- DNS provider - hosts DNS

- Second DNS provider - hosts DNS in case the first DNS provider fails

So many DNS outages recently and all my projects are up.


Does Amazon's Route 53 count as a DNS provider, or do you treat it as a hosting provider?


For me - neither.

But if I were tied into Amazon's cloud infrastructure, I would have to use many of their features, going against my rules above.


How do you apply your rules considering what's available today? Which services are you using? It sounds like it would be a big headache to orchestrate the automation among all these different providers.


Have multiple VMs @ DigitalOcean (TOR1); we use CloudFlare for DNS... All sites have remained available and are successfully fulfilling requests.


Every site where I was using external DNS stayed up.


I don't have anything more important than a small personal website but now I'm curious. If you set up a system to handle your main DNS provider failing, how do you test it? Is there a good reference where I can find some best practices on this?


Here's my testing recommendation:

1. Pick some subset of your DNS records to monitor, or all of them if you want to be extra thorough. If you are picking a subset, then I'd pick whatever records are most critical to your business.

2. Set up monitoring that queries each of your authoritative name servers for each of the records that you identified in the previous step (see the sketch below). The monitoring should notify you if any of the name servers are unresponsive, or return a different response than what's expected.
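Here's a minimal sketch of step 2, assuming the dnspython package (2.x; older versions spell resolve() as query()): discover the zone's authoritative name servers, query each one for every monitored record, and flag timeouts or mismatched answers. In a real monitor you'd alert instead of print:

    import dns.message
    import dns.query
    import dns.resolver

    ZONE = "example.com."
    RECORDS = [("www.example.com.", "A")]  # the subset chosen in step 1

    # Resolve the zone's NS names to IPs
    ns_ips = []
    for ns in dns.resolver.resolve(ZONE, "NS"):
        for a in dns.resolver.resolve(ns.target, "A"):
            ns_ips.append((str(ns.target), a.address))

    for name, rtype in RECORDS:
        answers = {}
        for ns_name, ip in ns_ips:
            try:
                resp = dns.query.udp(dns.message.make_query(name, rtype), ip, timeout=3)
                answers[ns_name] = sorted(r.to_text() for rr in resp.answer for r in rr)
            except Exception:
                answers[ns_name] = "UNRESPONSIVE"  # alert on this
        if len(set(map(str, answers.values()))) > 1:
            print(f"MISMATCH for {name}/{rtype}: {answers}")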

If you'd like to dig into the details of DNS, then O'Reilly's "DNS and BIND" is highly recommended, even if you're not using BIND.

There are a number of quality DNS hosting providers out there. A rule of thumb that I use is this: if a DNS hosting provider doesn't eat their own dog food, don't trust them to handle your DNS. Digital Ocean doesn't use their own name servers for their main website's domain. Neither does Amazon.

Shameless plug: I created a DNS monitoring service that can be used for monitoring each of your name servers: https://www.dnscheck.co/


The flipside to the dog food point: if DigitalOcean did use their own nameservers for their main site, then we wouldn't have been able to see the status page.


Good point!


Their status page at https://status.digitalocean.com is also now giving an intermittent "500 Internal Server Error" nginx error, probably from the load. That's why you should use a service like https://www.statuspage.io for your important stuff, even though creating a status page is a fun side-project for a dev team.


So what you're saying is that instead of running their own status page on their own infrastructure, they should outsource it to statuspage.io and pay another company to do it?


Yes, that's pretty standard. Availability monitoring and status reporting should be external and separate from your own infrastructure, otherwise neither may be available when you need it the most.


And don't use statuspage.io if your host is AWS, because theirs is too.


They do have geo-region (not just AZ) redundancy and failover, which puts them quite a bit above most home-grown company status pages in my experience. But yes, if the problem were, for example, in R53, it would indeed be better to have a solution without that AWS dependency.


Eh, as I remember it they have stuff in multiple AWS regions. It would take a global AWS failure to bring them down. This was my justification for going with them while working somewhere that used AWS.

That said, while it is extremely unlikely it shouldn't be discounted as impossible.


But who monitors the status of statuspage.io?


I know you're joking, but I do: Check out https://StatusGator.com. StatusPage.io has a status page at metastatuspage.com which my company monitors.


And who monitors your company's monitors? :)


Are you also on AWS? Apparently you're not on DO...


This is the first DNS outage I've experienced with them in 3+ years, then again I host everything in their NY regions.


I've never experienced an outage of any kind with DO, so this is a first for me too. I also host all my droplets in the NYC regions.


Same. They've been pretty reliable. Hoping this doesn't last long or we'll be looking to move off.


DigitalOcean uses CloudFlare for DNS - https://www.cloudflare.com/case-studies-digital-ocean/


This statement can be misleading. If you read the article, they don't use CloudFlare's DNS servers per se. They use CloudFlare's DNS proxy which acts as a DNS firewall between the DigitalOcean DNS servers and the world.


I think the most annoying aspect of this outage is the updates. Three updates, and they all say the same thing, with no meaningful information as to what's causing this. They may not have much information, but you'd think there would be something more than what they've been posting for the past hour. Good times!


Does anyone know of a good strategy for DNS failover?


That's one of the easiest things to do. Just add multiple NS records to the domain. As long as they are configured correctly and at least one is up (and, for a limited time, even if none are up), the service is available.

With DNS you actually can achieve 100% uptime.


Well, there are a couple of strategies:

- IP-diverse nameservers

- TLD-diverse nameservers

- BGP anycast

IP-diverse nameservers require you to assume that your DNS servers will go down rather than start returning bad results - I highly recommend having some sort of mechanism to hard-terminate access to those machines.

TLD-diverse nameservers is just an extra strategy for reducing the risk that an upstream TLD issue will blow up your spot.

And then BGP anycast is the expensive, complicated piece of this - it requires a high level of technical sophistication, lots of moving parts, and the QA/validation piece of it is tricky.

When I built an anycast DNS system, we ended up resorting to tricks like having the DNS servers publish routes to the router for redistribution, so that a down or unresponsive server automatically withdrew the routes. Then you do things like TXT records for your zone that respond with which POP you're hitting in some sort of hashed/obfuscated fashion.

It's hard and complicated, and unnecessary for most folks. Better to outsource to Route 53 or someone similar.
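As a small illustration of the validation piece: many authoritative servers (BIND and NSD, among others) answer a conventional CHAOS-class TXT query for "hostname.bind" that identifies the individual instance, which is handy for checking which anycast node you actually hit. A sketch assuming dnspython, with a placeholder server IP:

    import dns.message
    import dns.query
    import dns.rdataclass
    import dns.rdatatype

    # CHAOS-class "hostname.bind" query; some servers use "id.server" instead
    q = dns.message.make_query("hostname.bind.", dns.rdatatype.TXT,
                               rdclass=dns.rdataclass.CH)
    resp = dns.query.udp(q, "203.0.113.53", timeout=2)
    for rrset in resp.answer:
        print(rrset.to_text())  # reveals which instance/POP served the reply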


- Implementation-diverse nameservers

Use multiple implementations, e.g. NSD/BIND for authoritative servers and Unbound/BIND for resolvers, to mitigate implementation-specific bugs and vulnerabilities.


A fair number of providers support zone transfers (AXFR requests) from master to slave name servers. The slaves can be operated by a different entity.

Here's DNSimple's implementation: https://support.dnsimple.com/articles/secondary-dns/

I wrote about moving from 1 to 2+ authoritative DNS providers: http://blog.papertrailapp.com/dns-outage-on-monday-december-.... I think this is just as true today:

> For .. maintainers of mission-critical DNS zones, the solution is to not depend on any single DNS infrastructure for functioning authoritative DNS


I don't know if DigitalOcean's DNS servers allow AXFR; if they do, you can use a secondary DNS service to automatically replicate the DNS. You then list the secondary DNS as an NS for your domain.

If they don't allow AXFR -- and after this, they should! -- you can still have a secondary DNS provider, but you'd have to duplicate any changes by hand. Not ideal, but still doable.
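For the curious, the transfer itself is simple. Here's a sketch, assuming dnspython and a placeholder primary IP, of what a secondary does when it pulls your zone:

    import dns.query
    import dns.zone

    # The primary must allow AXFR from this machine's address
    zone = dns.zone.from_xfr(dns.query.xfr("203.0.113.10", "example.com"))
    for name, rdataset in zone.iterate_rdatasets():
        print(name, rdataset)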


DNS has failover built into the protocol, just have another server listed at your registrar.


I would be interested in the post-mortem from this. While DigitalOcean operate their own DNS, it is only made publicly available through CloudFlare's DNS proxying service.


From @DOStatus a minute ago:

"Our engineering team has identified the issue, and are working to resolve connectivity issues to our DNS servers.... http://do.co/status"

https://twitter.com/DOStatus/status/713043871559655424


Recommend AWS Route53 very highly. Route53 also allows you to buy domain names and do lots of fancy fail-over, geolocation, and CNAME-alias-at-the-apex magic.
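For instance, the fail-over feature ties a record to a health check. A minimal sketch assuming boto3; the zone ID, health check ID, and IPs are placeholders:

    import boto3

    r53 = boto3.client("route53")
    # PRIMARY is served while its health check passes; otherwise Route53
    # automatically answers with the SECONDARY record instead
    r53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",
        ChangeBatch={"Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "www.example.com.", "Type": "A", "TTL": 60,
                "SetIdentifier": "primary", "Failover": "PRIMARY",
                "HealthCheckId": "11111111-2222-3333-4444-555555555555",
                "ResourceRecords": [{"Value": "203.0.113.10"}]}},
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "www.example.com.", "Type": "A", "TTL": 60,
                "SetIdentifier": "secondary", "Failover": "SECONDARY",
                "ResourceRecords": [{"Value": "203.0.113.20"}]}},
        ]},
    )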


A few years ago Slicehost had a DNS outage, and the webscrapers I had running were falling over because they couldn't resolve DNS. I had to SSH into 8 boxes and update resolv.conf to add Google DNS and OpenDNS as backups. (Yes, I should've had centralized config management with Chef or Puppet or Ansible.)


That's crazy... I think by default DO droplets use Google DNS for resolving.


No offense to anyone here, but what is DO's SLA? Last time I looked, they did not have one.

DO is cheap for a reason, and that's the same reason I don't host with them: I can get SLA-backed infrastructure for a reasonable price and would have no excuse to my customers or cofounders.


Looks like they do have one: https://www.digitalocean.com/help/policy/


We have seamless DNS "failover" by running dnsmasq with the all-servers option on all our servers. It causes dnsmasq to query all upstreams at once, so if any go down it's transparent to our apps. Works perfectly on our 1500 EC2 instances.
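The relevant dnsmasq configuration is tiny. A sketch, with Google and OpenDNS as example upstreams:

    # /etc/dnsmasq.conf
    # all-servers: query every upstream in parallel and use the first reply,
    # so one dead resolver never stalls lookups
    all-servers
    server=8.8.8.8
    server=8.8.4.4
    server=208.67.222.222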


I thought their DNS was supposed to be rock solid since they use CloudFlare Virtual DNS. Oh well, lesson learned. Back to running my own DNS servers on each droplet; if the DNS is down, the droplet is likely down regardless :).


Feeling the pain here too. What DNS providers do others use and like? Route53?


The sites I have that are actually up right now are those routed through CloudFlare.


Route53 is hard not to love; simple to develop against and very very reliable.

I wrap it in git to make updates more straightforward for people unfamiliar with AWS, but even using it directly is very simple from multiple languages. (https://dns-api.com/)


This has always been a good resource to see who the front runners are: http://www.solvedns.com/dns-comparison/


Bind, running on VMs. Not hard.


You run your own authoritative DNS servers?


I ran my non-authoritative DNS server [bind] on a droplet for about a year, but the server crashed every few months. Why? Never figured it out. A restart always fixed it.

Later shifted to DO's DNS servers.

Now that that one is down too, I just shifted back to my domain registrar's DNS.

Everything is working now.


Where I work we use Route53. For my personal domains I just use my registrar, Namecheap.


For us, Route53 is painful. We host a few thousand zones, and due to rate limiting on the APIs, doing something like "show a list of domain names" or "give all the domains matching some pattern" was particularly painful. Upwards of 30 seconds to produce a simple list of domains meant we were forced to cache locally. A local cache, combined with the fact that zone names are not unique in their system (it's possible to create multiple abc.com entries, which differ only in an internal id and the list of NS entries), made it hard to ensure that our internal systems matched "reality". Then the administrative nightmare of 3-4 different NS entries for each zone means customized, rather than generic, instructions for validating NS settings at the individual registrars.

All in all, it was not a fun experience with such a large volume of zones, but we knew we were an edge case.


You can contact Amazon to get rate limits increased if you have the use case.


dnsmadeeasy for around 8 years now


This is the second time recently their AMS region has gone down, which is where I host my email. What a pain.


That awkward moment when it shows the status page


This is causing huge huge pain for us, Digital Ocean.


If you want to get alerted when it comes back up, or you wish you had been alerted when it went down, check out my project: https://StatusGator.com.

StatusGator monitors status pages and sends notifications via email, Slack, and others. You can get alerted to status changes inside Slack and you can ask it the status of a service with a /statuscheck command.


A bit of feedback: you should have a link back to your dashboard on every page. That seems like the most important page to me as a user, but if I am changing my notification or account settings, there is no way back to that page.


Great feedback, thank you! Added that.


Yeah, tried to access our site and it was down. Really was expecting more out of Digital Ocean than to fuck up such an integral part of their infrastructure. In the future we'll be transitioning away from their DNS solution because this is unacceptable.


I hope your clients/users are as understanding and civil as you are.

In the meantime, I'm going to wait for the post-mortem before deciding whether I should continue using them for DNS. Looking back over their status history, 1-2 incidents a year isn't that bad for my needs, but it might be too much for you, which is fine (I'm only hosting a couple of small side projects with them).


The problem is, for an early stage startup incidents like this are deadly. Especially since we just applied to a bunch of accelerators.


I get that this is a pain in the ass. I've got a significant chunk of infrastructure on DO, I've got work to do today that depends on those machines. I learned about this simultaneously when a deployment failed and I got a text from an engineer at a company I consult with. Not a great way to start the day, for sure.

Know what I'm going to do? I'm going to have a cup of coffee and play with my dogs for a bit. It's inconvenient, it's going to delay things, and I'm a bit choked about it. But it's not worth getting angry over, because there's nothing I can do about it today.


The reality is, for any app/startup/business, everything is a risk. If you didn't take the edge case of "what happens if my domain's primary DNS nameserver goes down?" into account, is blaming DO all you can do?

If your app goes down do you have failover for that? Or do you blame your devops team?


Any small business has to manage which risks it accepts. OP got the Digital Ocean service to try to mitigate this risk to a degree. Beyond that, it becomes a question of whether to accept risk in other aspects of the product, or the risk that your primary DNS provider will fail.

The reality is, you have limited development and financial resources, so you simply can't do everything. Sure, in a vacuum, or in a larger enterprise, we'd love to manage every risk. But when we're just starting out that's not realistic, and we do have a right to be upset at Digital Ocean's service going down while at the same time realizing that yes, ideally we would/should have had redundancy in place already.

Your question on blaming the devops team is exactly the mindset of someone who has a lot more resources than a brand new startup. In a brand new startup there IS NO devops team. If you're lucky, there is one person who does the devops work part time, balanced with a bunch of other development work he/she also does.


We have auto failover for server, app, and database failures -- this can be easily managed. DNS nameserver failover should have been built in; after all, there is a reason we specify 3 DNS nameservers in the domain configuration. Since Digital Ocean took on this task (we used Gandi's DNS before that), we expected it to perform as advertised -- so yes, blame lies with DO.


It's still entirely your fault. Something like Route53/Cloudflare is dirt cheap and crazy redundant. Don't risk your business on free/side services.


You can go to your domain registrar and switch to another DNS provider (GoDaddy has their own DNS service).


Be happy that this happened early. Now you know that you should never ever have a single point of failure.


Just add a second DNS provider.


If your site is so critical that it can't suffer any downtime then why is it not provisioned across multiple independent platforms?


I also hope the users of your site understand that shit happens. And, as another user said... if DNS is so critical for you, then why don't you have proper failover in place?



