DNS Outage at DigitalOcean (digitalocean.com)
120 points by finne on March 24, 2016 | 121 comments


People hating on DO: "I'm losing thousands every hour." Well, then you should have had some failover in place if it's that valuable.

[1] https://twitter.com/rodrigoespinosa/status/71303563702097100...


For an example of this taken to a ludicrous extreme, several years ago an AWS user complained that downtime of a few EC2 instances was putting lives at risk. They were hosting cardiac monitoring services on single EC2 instances with no multi-AZ or multi-region capability.

AWS forum post: https://forums.aws.amazon.com/thread.jspa?threadID=65649&tst...

Previous HN discussion: https://news.ycombinator.com/item?id=2477345


I've been reading all the comments on Twitter too... like "err mai gawd I'm switching to AWS because of this". That's your failure to have a secondary DNS provider, and I highly doubt you'd actually switch.

Then another... "Today's @digitalocean DNS #outage is a reminder to not trust your entire business to one provider. Spread the love around!"

If your company is e-commerce and makes money by being 99.99% available, it's your own fault for having no fail-over.

another... ".@digitalocean that's two hours without DNS now...my company's websites could be losing thousands of £ in e-commerce! Please, an update!"


Times like this make you realize the difference between good clients and bad clients. Yes, they have a right to be upset, but claims like "could be losing thousands of dollars" are mostly exaggerated out of frustration.


heh yeah, I just laugh at all the tweets saying they're losing billions of dollars every minute their site/app is unavailable. All I can think is... if you're the next Amazon.com, I'm pretty sure you'd have some type of disaster plan in place should something like this happen.


I can't disagree with what you're saying, but I think we are all guilty of this. We expect more out of big-name services (100% uptime) than might be reasonable.

How many of us here have failover email services in case Gmail goes down? I think many companies would say they'd lose thousands in productivity if Google Apps suffers an outage yet I'd hazard that very few have failover plans.


That's because people like to complain... the reality is stuff happens, systems go down, and life tends to go on.

Yeah, if your building full of employees can't work because the internet is down, and the secondary is also down, then that's kind of crappy, and you may be paying people to twiddle their thumbs... anything short of that is kind of the cost of doing business...

There are redundancy options for a lot of things... If you're using only a single hosting provider for your infrastructure, and management scoffs at creating redundant, under-utilized systems... it's not as "mission critical" as people think/say.


I use dns.he.net for my DNS hosting. It's free up to 50 zones and has been rock solid. The other day I started having trouble accessing some of my domain names. It turned out that all of their DNS was down and was returning NXDOMAIN for pretty much any request, including their own domains. Oops. So I emailed their support (which is usually very quick to respond and is better than I have seen with lots of paid products). Well, it then occurred to me that I would not get a response, since the MX records for my domain were also hosted with them. Double oops.

On the plus side, in the past 4 years that I've used them this was the very first issue, and they fixed it within a couple of hours.

Anyone have any good recommendations on cheap or free backup DNS hosting?


Use www.dnsmadeeasy.com and then dns.he.net as your secondary DNS service, or vice versa. They will do zone transfers/updates from each other and work just fine.
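A quick way to sanity-check that the two providers are actually staying in sync is to compare the SOA serial each one serves. A minimal sketch, assuming the dnspython package; the nameserver IPs are placeholders you'd replace with your providers' actual servers:

    import dns.message
    import dns.query

    ZONE = "example.com."
    NAMESERVERS = {
        "dnsmadeeasy": "203.0.113.1",  # placeholder IP for your DNS Made Easy NS
        "he.net": "203.0.113.2",       # placeholder IP for your dns.he.net NS
    }

    for provider, ip in NAMESERVERS.items():
        response = dns.query.udp(dns.message.make_query(ZONE, "SOA"), ip, timeout=3)
        for rrset in response.answer:
            for rdata in rrset:
                # Matching serials suggest zone transfers are keeping up
                print(f"{provider}: serial {rdata.serial}")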


I had one customer on DO DNS and it was a "good enough" solution. Unfortunately, this came right in the middle of a marketing push for last-minute registrations. An annoyance, but not a major financial impact. (Maybe it will give the impression of excess demand. :)

I understand that things break and I should be ready for it. What I found unacceptable were the status updates. Basically, "we're working on it". No clue as to what was going on. A DDoS? Not a DDoS? Routing issues? Corrupt zone files? No clue? Any of those would be helpful as I needed to figure out if I should wait it out, or switch to Route 53.

In the information vacuum, I switched to Route 53. It works.


Route53 is a wonderful service, which has never had a global outage (so far).

I've wrapped it in git, to allow very quick updates to be made in a simple fashion:

https://dns-api.com/


How do you fail over your SOA on .com when the minimum TTL is 1 day?


The answer probably is not very helpful but... change the TTL to a lower value? Is it essential to have such a big TTL of 1 day?


Sure, have fun convincing Verisign. They're the ones controlling the .com registry. Your TTL values are meaningless to them when it comes to SOA. Minimum 1 day TTL.


Sorry, I meant NS glue changes on .com, not SOA.


DNS is hard. Very hard.

It may seem trivial when it works (hint: it's not), but some of the biggest fuck-ups I've seen in my professional life were caused by strange DNS behavior or DNS servers going kaboom.

I feel the pain of the DO engineers trying to mitigate this issue. I really do.


BS. DNS is a trivial thing to scale, compared to most other web-scale efforts.

Things break when people don't use 20 year old best practices. There is no defense against inexperience and ignorance.


I took the OP's comment as "it's hard to understand DNS, and the biggest fuck-ups happen because people think they understand DNS when they actually don't".

The problem with DNS is that it can work even when it is configured incorrectly. This makes people who have no idea what they are doing believe that they actually understand it. The strange issues with DNS only happen with strange configurations. When you follow best practices, everything is predictable.


All right. This I can agree with.


Please help the ignorant and provide a link to a description of those best practices.


> I feel the pain of the DO engineers trying to mitigate this issue. I really do.

Me too. Just last week they had another problem with DNS on the client side of things: resolving with Google Public DNS, which most droplets use by default, didn't work reliably. I hope they post a combined post-mortem for both of those incidents.


It's not hard; the problem is that everything relies on DNS, so when DNS goes down or has problems you get cascading failures.


That's why you use multiple providers.


I'll bite. You can have multiple NS records but only a single SOA. The .com registry minimum TTL for SOA is a day.

How in the world would "multiple providers" help you in a 6 hour outage?


You can have multiple NS records. You should have NS records that point to different companies' DNS servers, preferably on different continents.


That's great for NS records. What about SOA?


SOA isn't used for resolving names, AFAIK.


You're right. Sorry, I meant the NS glue records. Those have a 1 day minimum TTL. If you're not already running multiple providers, switching after one is down won't help much.


Unless you use a better resolver than the standard glibc resolver on Linux (e.g. dnsmasq, BIND, or similar running locally, with resolv.conf pointing at it), you appear doomed to slow lookups etc. if your first resolv.conf entry fails. Most of the resolv.conf options that might have helped (if you'd set them) simply don't work or don't do anything particularly useful in the versions used in the Linux distros I've tested.
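For reference, the resolv.conf options in question look like this. A sketch only, since (as noted above) their behavior varies by glibc version and distro, so test before relying on them:

    # /etc/resolv.conf
    nameserver 8.8.8.8
    nameserver 8.8.4.4
    # rotate spreads queries across servers; timeout/attempts shorten the
    # stall when the first server is dead -- where they work at all
    options rotate timeout:1 attempts:2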


Unbound is great for this.


Even with multiple providers, DNS issues are a cluster fuck.


Only if you don't know what you're doing. The problem with DNS is that it might work even when it is misconfigured, and misconfiguration is the source of strange issues.


I think that we all have areas where we don't know what we're doing. This is one of mine. With all the talk of how obvious/important/easy it is to have a failover in place in case this happens, I'm having trouble finding a good resource about setting up a redundant DNS. Running a droplet on Digital Ocean with Debian and Nginx.


Sounds like we're begging for someone to write a nice blog post on how they set up redundant DNS across multiple providers "the right way"... It would hit the front page in short order if anyone is willing to share how they think this should be mitigated, and specifically how common clients can be expected to behave when faced with the various types of outages that may occur!


Yeah I mean it takes work to do it correctly. I wouldn't call it a cluster fuck.


Suppose you have multiple providers, but one of them screws up and authoritatively denies the existence of all of your hosts?


That's what you keep an extremely low TTL for.


Which doesn't mean much when a nontrivial number of ISPs out there don't respect the TTL settings.

Source: days-long service degradation caused by customer ISPs caching bad DNS information well beyond the 10-minute TTL we had set.


One thing hosting providers could do better would be to split the risk a little by not handing the same DNS server name to every client that chooses to have the hosting provider supply DNS services.

The reason this might have some upside is that DDoS attacks against a specific DNS server are often intended to target one specific customer of a hosting provider. The attacker doesn't care about the side effects... just the original target.

Say, for example "controversialblog.com" is hosted on DO, and uses DO dns servers. The person attacking "controversialblog.com" looks up the NS records for the domain, and attacks that DNS server. The fact that it's one hostname that serves all of DO is of little interest to the attacker.

So, if DO would come up with, say, 10 separate hostnames they could hand out, then this sort of thing would take down 10% of their customers instead of 100%.
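A rough sketch of the idea in Python; the pool size and hostnames are hypothetical, and the point is just that the customer-to-shard mapping must be stable:

    import hashlib

    def nameservers_for(customer_domain: str, pools: int = 10) -> list:
        # Stable hash so a customer always lands on the same NS shard;
        # a DDoS on one shard's hostname then only hits ~1/pools of customers
        shard = int(hashlib.sha256(customer_domain.encode()).hexdigest(), 16) % pools
        return [f"ns{shard}a.example-dns.com", f"ns{shard}b.example-dns.com"]

    print(nameservers_for("controversialblog.com"))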


Yeah this is pretty unfortunate. We have some big investor meetings today and this unfortunately took our marketing site offline. Hopefully they resolve this soon - it's the first time we've ever experienced an issue with their service.

We really need fail-overs in place...small team problems.


If you are showing a demo or something, you can still navigate to your DO IP address. Of course, I don't know if other things (images, etc.) on your website also rely on their DNS.


For that kind of emergency, you can point the domain name to the IP address via /etc/hosts (there's a similar file in Windows as well).


c:\windows\system32\drivers\etc\hosts
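The entry format is the same on both systems; 203.0.113.10 here is a placeholder for your droplet's actual IP:

    # /etc/hosts (Linux/OS X) or the Windows path above
    203.0.113.10    example.com www.example.com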


We feel the pain as well, as our platform is unreachable. I'm now using another DNS server and changed the nameservers in the domain record. However, DNS propagation is taking some time. What are you doing at the moment as a fail-over?


Sorry if I'm trying to "teach grandma to suck eggs" but can't you just enter the domain in your local hosts file. If it's a network that needs access then presumably you have some sort of proxy/cache that could be seeded with the necessary domain+IP pairing? I suppose these aren't possible if you're trying to demo on someone else's network or in a public space or such.


We just switched it over to Route53 and set up some fail-overs there. Took us 5 mins and we're back online.

Looks like DO is still offline so it seems to have been a good call...


You're back online from your perspective. What about all the name servers that have your SOA cached still looking at DO? You're still down for them.


Tough shit, lol. We now have reduced TTL times for future occurrences, but there's nothing we can do for those users who are still experiencing an outage.


Minimum TTL for SOA records on .com is a day. Lowering your A record TTL isn't buying you much in case of DNS hosting failure. It's helpful if your site (not DNS) needs to move providers, though.


My rule: each provider should do a single thing:

- Hosting provider - hosts sites

- VPS/cloud provider - provides VMs

- Domain registrar - handles domain-related stuff, but not DNS

- DNS provider - hosts DNS

- Second DNS provider - hosts DNS in case the first DNS provider fails

So many DNS outages recently and all my projects are up.


Does Amazon's Route 53 count as a DNS provider, or do you treat it as a hosting provider?


For me - neither.

But if I were tied into Amazon's cloud infrastructure, I would have to use many of their features, going against my rules above.


How do you apply your rules considering what's available today? Which services are you using? It sounds like it would be a big headache to orchestrate the automation among all these different providers.


Have multiple VMs @ DigitalOcean (TOR1); we use CloudFlare for DNS... All sites have remained available and are successfully fulfilling requests.


Every site where I was using external DNS stayed up.


I don't have anything more important than a small personal website but now I'm curious. If you set up a system to handle your main DNS provider failing, how do you test it? Is there a good reference where I can find some best practices on this?


Here's my testing recommendation:

1. Pick some subset of your DNS records to monitor, or all of them if you want to be extra thorough. If you are picking a subset, then I'd pick whatever records are most critical to your business.

2. Set up monitoring that queries each of your authoritative name servers for each of the records that you identified in the previous step (see the sketch below). The monitoring should notify you if any of the name servers are unresponsive, or return a different response than what's expected.
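Here's a minimal sketch of step 2, assuming the dnspython package (2.x; older versions spell resolve() as query()): discover the zone's authoritative name servers, query each one for every monitored record, and flag timeouts or mismatched answers. In a real monitor you'd alert instead of print:

    import dns.message
    import dns.query
    import dns.resolver

    ZONE = "example.com."
    RECORDS = [("www.example.com.", "A")]  # the subset chosen in step 1

    # Resolve the zone's NS names to IPs
    ns_ips = []
    for ns in dns.resolver.resolve(ZONE, "NS"):
        for a in dns.resolver.resolve(ns.target, "A"):
            ns_ips.append((str(ns.target), a.address))

    for name, rtype in RECORDS:
        answers = {}
        for ns_name, ip in ns_ips:
            try:
                resp = dns.query.udp(dns.message.make_query(name, rtype), ip, timeout=3)
                answers[ns_name] = sorted(r.to_text() for rr in resp.answer for r in rr)
            except Exception:
                answers[ns_name] = "UNRESPONSIVE"  # alert on this
        if len(set(map(str, answers.values()))) > 1:
            print(f"MISMATCH for {name}/{rtype}: {answers}")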

If you'd like to dig into the details of DNS, then O'Reilly's "DNS and BIND" is highly recommended, even if you're not using BIND.

There are a number of quality DNS hosting providers out there. A rule of thumb that I use is this: if a DNS hosting provider doesn't eat their own dog food, don't trust them to handle your DNS. Digital Ocean doesn't use their own name servers for their main website's domain. Neither does Amazon.

Shameless plug: I created a DNS monitoring service that can be used for monitoring each of your name servers: https://www.dnscheck.co/


The flipside to the dog food point: if DigitalOcean did use their own nameservers for their main site, then we wouldn't have been able to see the status page.


Good point!


Their status page at https://status.digitalocean.com is also now giving an intermittent "500 Internal Server Error" nginx error, probably from the load. That's why you should use a service like https://www.statuspage.io for your important stuff, even though creating a status page is a fun side-project for a dev team.


So what you're saying is that instead of running their own status page on their own infrastructure, they should outsource it to statuspage.io and pay another company to do it?


Yes, that's pretty standard. Availability monitoring and status reporting should be external and separate from your own infrastructure, otherwise neither may be available when you need it the most.


And don't use statuspage.io if your host is AWS, because theirs is too.


They do have geo-region (not just AZ) redundancy and failover, which puts them quite a bit above most home-grown company status pages in my experience. But yes, if the problem were, for example, in R53, it would indeed be better to have a solution without that AWS dependency.


Eh, as I remember it they have stuff in multiple AWS regions. It would take a global AWS failure to bring them down. This was my justification for going with them while working somewhere that used AWS.

That said, while it is extremely unlikely it shouldn't be discounted as impossible.


But who monitors the status of statuspage.io?


I know you're joking, but I do: Check out https://StatusGator.com. StatusPage.io has a status page at metastatuspage.com which my company monitors.


And who monitors your company's monitors? :)


Are you also on AWS? Apparently you're not on DO...


This is the first DNS outage I've experienced with them in 3+ years, then again I host everything in their NY regions.


I've never experienced an outage of any kind with DO, so this is a first for me too. I also host all my droplets in the NYC regions.


Same. They've been pretty reliable. Hoping this doesn't last long or we'll be looking to move off.


DigitalOcean uses CloudFlare for DNS - https://www.cloudflare.com/case-studies-digital-ocean/


This statement can be misleading. If you read the article, they don't use CloudFlare's DNS servers per se. They use CloudFlare's DNS proxy which acts as a DNS firewall between the DigitalOcean DNS servers and the world.


I think the most annoying aspect of this outage is the updates. Three updates, and they all say the same thing, with no meaningful information as to what's causing this. They may not have much information, but you'd think there would be something more than what they've been posting for the past hour. Good times!


Does anyone know of a good strategy for DNS failover?


That's one of the easiest things to do. Just add multiple NS records to the domain. As long as they are configured correctly and at least one is up (and, for a limited time, even if none are up), the service is available.

With DNS you actually can achieve 100% uptime.


Well, there are a couple of strategies:

- IP-diverse nameservers

- TLD-diverse nameservers

- BGP anycast

IP-diverse nameservers require you to assume that your DNS servers will go down rather than start returning bad results - I highly recommend having some sort of mechanism to hard-terminate access to those machines.

TLD-diverse nameservers is just an extra strategy for reducing the risk that an upstream TLD issue will blow up your spot.

And then BGP anycast is the expensive, complicated piece of this - it requires a high level of technical sophistication, lots of moving parts, and the QA/validation piece of it is tricky.

When I built an anycast DNS system, we ended up resorting to tricks like having the DNS servers publish routes to the router for redistribution, so that a down or unresponsive server automatically withdrew the routes. Then you do things like TXT records for your zone that respond with which POP you're hitting in some sort of hashed/obfuscated fashion.

It's hard and complicated, and unnecessary for most folks. Better to outsource to Route 53 or someone similar.
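As a small illustration of the validation piece: many authoritative servers (BIND and NSD, among others) answer a conventional CHAOS-class TXT query for "hostname.bind" that identifies the individual instance, which is handy for checking which anycast node you actually hit. A sketch assuming dnspython, with a placeholder server IP:

    import dns.message
    import dns.query
    import dns.rdataclass
    import dns.rdatatype

    # CHAOS-class "hostname.bind" query; some servers use "id.server" instead
    q = dns.message.make_query("hostname.bind.", dns.rdatatype.TXT,
                               rdclass=dns.rdataclass.CH)
    resp = dns.query.udp(q, "203.0.113.53", timeout=2)
    for rrset in resp.answer:
        print(rrset.to_text())  # reveals which instance/POP served the reply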


- Implementation-diverse nameservers

Use multiple implementations, e.g. NSD/BIND for authoritative servers and Unbound/BIND for resolvers, to mitigate implementation-specific bugs and vulnerabilities.


A fair number of providers support zone transfers (AXFR requests) from master to slave name servers. The slaves can be operated by a different entity.

Here's DNSimple's implementation: https://support.dnsimple.com/articles/secondary-dns/

I wrote about moving from 1 to 2+ authoritative DNS providers: http://blog.papertrailapp.com/dns-outage-on-monday-december-.... I think this is just as true today:

> For .. maintainers of mission-critical DNS zones, the solution is to not depend on any single DNS infrastructure for functioning authoritative DNS


I don't know if DigitalOcean's DNS servers allow AXFR; if they do, you can use a secondary DNS service to automatically replicate the DNS. You then list the secondary DNS as an NS for your domain.

If they don't allow AXFR -- and after this, they should! -- you can still have a secondary DNS provider, but you'd have to duplicate any changes by hand. Not ideal, but still doable.
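For the curious, the transfer itself is simple. Here's a sketch, assuming dnspython and a placeholder primary IP, of what a secondary does when it pulls your zone:

    import dns.query
    import dns.zone

    # The primary must allow AXFR from this machine's address
    zone = dns.zone.from_xfr(dns.query.xfr("203.0.113.10", "example.com"))
    for name, rdataset in zone.iterate_rdatasets():
        print(name, rdataset)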


DNS has failover built into the protocol, just have another server listed at your registrar.


I would be interested in the post-mortem from this. While DigitalOcean operate their own DNS, it is only made publicly available through CloudFlare's DNS proxying service.


From @DOStatus a minute ago:

"Our engineering team has identified the issue, and are working to resolve connectivity issues to our DNS servers.... http://do.co/status"

https://twitter.com/DOStatus/status/713043871559655424


Recommend AWS Route53 very highly. Route53 also allows you to buy domain names and do lots of fancy fail-over, geolocation, and CNAME-alias-at-the-apex magic.
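For instance, the fail-over feature ties a record to a health check. A minimal sketch assuming boto3; the zone ID, health check ID, and IPs are placeholders:

    import boto3

    r53 = boto3.client("route53")
    # PRIMARY is served while its health check passes; otherwise Route53
    # automatically answers with the SECONDARY record instead
    r53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",
        ChangeBatch={"Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "www.example.com.", "Type": "A", "TTL": 60,
                "SetIdentifier": "primary", "Failover": "PRIMARY",
                "HealthCheckId": "11111111-2222-3333-4444-555555555555",
                "ResourceRecords": [{"Value": "203.0.113.10"}]}},
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "www.example.com.", "Type": "A", "TTL": 60,
                "SetIdentifier": "secondary", "Failover": "SECONDARY",
                "ResourceRecords": [{"Value": "203.0.113.20"}]}},
        ]},
    )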


A few years ago Slicehost had a DNS outage, and the webscrapers I had running were falling over because they couldn't resolve DNS. I had to SSH into 8 boxes and update resolv.conf to add Google DNS and OpenDNS as backups. (Yes, I should've had centralized config management with Chef or Puppet or Ansible.)


That's crazy... I think by default DO droplets use Google DNS for resolving.


No offense to anyone here, but what is DO's SLA? Last time I looked, they did not have one.

DO is cheap for a reason, and that's the same reason I don't host with them: I can get SLA-backed infrastructure for a reasonable price and would have no excuse to my customers or cofounders.


Looks like they do have one: https://www.digitalocean.com/help/policy/


We have seamless DNS "failover" by running dnsmasq with the all-servers option on all our servers. It causes dnsmasq to query all upstreams at once, so if any go down it's transparent to our apps. Works perfectly on our 1500 EC2 instances.
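The relevant dnsmasq configuration is tiny. A sketch, with Google and OpenDNS as example upstreams:

    # /etc/dnsmasq.conf
    # all-servers: query every upstream in parallel and use the first reply,
    # so one dead resolver never stalls lookups
    all-servers
    server=8.8.8.8
    server=8.8.4.4
    server=208.67.222.222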


I thought their DNS was supposed to be rock solid since they use CloudFlare Virtual DNS. Oh well, lesson learned. Back to running my own DNS servers on each droplet; if the DNS is down, the droplet is likely down regardless :).


Feeling the pain here too. What DNS providers do others use and like? Route53?


The sites I have that are actually up right now are those routed through CloudFlare.


Route53 is hard not to love; simple to develop against and very very reliable.

I wrap it in git to make updates more straightforward for people unfamiliar with AWS, but even using it directly is very simple from multiple languages. (https://dns-api.com/)


This has always been a good resource to see who the front runners are: http://www.solvedns.com/dns-comparison/


Bind, running on VMs. Not hard.


You run your own authoritative DNS servers?


I ran my non-authoritative DNS server [bind] on a droplet for about a year, but the server crashed every few months. Why? Never figured it out. A restart always fixed it.

Later shifted to DO's DNS servers.

Now that that one is down too, I just shifted back to my domain registrar's DNS.

Everything is working now.


Where I work we use Route53. For my personal domains I just use my registrar, Namecheap.


For us, Route53 is painful. We host a few thousand zones, and due to rate limiting on the APIs, doing something like "show a list of domain names" or "give all the domains matching some pattern" was particularly painful. Upwards of 30 seconds to produce a simple list of domains meant we were forced to cache locally. A local cache, combined with the fact that zone names are not unique in their system (it's possible to create multiple abc.com entries, which differ only in an internal id and the list of NS entries), made it hard to ensure that our internal systems matched "reality". Then the administrative nightmare of 3-4 different NS entries for each zone means customized, rather than generic, instructions for validating NS settings at the individual registrars.

All in all, it was not a fun experience with such a large volume of zones, but we knew we were an edge case.


You can contact Amazon to get rate limits increased if you have the use case.


dnsmadeeasy for around 8 years now


This is the second time recently their AMS region has gone down, which is where I host my email. What a pain.


That awkward moment when it shows the status page


This is causing huge huge pain for us, Digital Ocean.


If you want to get alerted when it comes back up, or you wish you had been alerted when it went down, check out my project: https://StatusGator.com.

StatusGator monitors status pages and sends notifications via email, Slack, and others. You can get alerted to status changes inside Slack and you can ask it the status of a service with a /statuscheck command.


A bit of feedback: you should have a link back to your dashboard on every page. That seems like the most important page to me as a user, but if I am changing my notification or account settings, there is no way back to that page.


Great feedback, thank you! Added that.


Yeah, tried to access our site and it was down. Really was expecting more out of Digital Ocean than to fuck up such an integral part of their infrastructure. In the future we'll be transitioning away from their DNS solution because this is unacceptable.


I hope your clients/users are as understanding and civil as you are.

In the meantime, I'm going to wait for the post-mortem before deciding whether I should continue using them for DNS. Looking back over their status history, 1-2 incidents a year isn't that bad for my needs, but it might be too much for you, which is fine (I'm only hosting a couple of small side projects with them).


The problem is, for an early stage startup incidents like this are deadly. Especially since we just applied to a bunch of accelerators.


I get that this is a pain in the ass. I've got a significant chunk of infrastructure on DO, I've got work to do today that depends on those machines. I learned about this simultaneously when a deployment failed and I got a text from an engineer at a company I consult with. Not a great way to start the day, for sure.

Know what I'm going to do? I'm going to have a cup of coffee and play with my dogs for a bit. It's inconvenient, it's going to delay things, and I'm a bit choked about it. But it's not worth getting angry over, because there's nothing I can do about it today.


The reality is, for any app/startup/business, everything is a risk. If you didn't take the edge case of "what happens if my domain's primary DNS nameserver goes down?" into account, is blaming DO all you can do?

If your app goes down do you have failover for that? Or do you blame your devops team?


Any small business has to manage which risks it accepts. OP got the Digital Ocean service to try to mitigate this risk to a degree. Beyond that, it becomes a question of whether to accept risk in other aspects of the product, or the risk that your primary DNS provider will fail.

The reality is, you have limited development and financial resources, so you simply can't do everything. Sure, in a vacuum, or in a larger enterprise, we'd love to manage every risk. But when we're just starting out that's not realistic, and we do have a right to be upset at Digital Ocean's service going down while at the same time realizing that yes, ideally we would/should have had redundancy in place already.

Your question on blaming the devops team is exactly the mindset of someone who has a lot more resources than a brand new startup. In a brand new startup there IS NO devops team. If you're lucky, there is one person who does the devops work part time, balanced with a bunch of other development work he/she also does.


We have auto failover for server, app, and database failures -- this can be easily managed. DNS nameserver failover should have been built in; after all, there is a reason we specify 3 DNS nameservers in the domain configuration. Since Digital Ocean took on this task (we used Gandi's DNS before that), we expected it to perform as advertised -- so yes, blame lies with DO.


It's still entirely your fault. Something like Route53/Cloudflare is dirt cheap and crazy redundant. Don't risk your business on free/side services.


You can go to your domain registrar and switch to another DNS provider (GoDaddy has their own DNS service).


Be happy that this happened early. Now you know that you should never ever have a single point of failure.


Just add a second DNS provider.


If your site is so critical that it can't suffer any downtime then why is it not provisioned across multiple independent platforms?


I also hope the users of your site understand that shit happens. And, as another user said... if DNS is so critical for you, then why don't you have proper failover in place?



