For an example of this taken to a ludicrous extreme, several years ago an AWS user complained that downtime of a few EC2 instances was putting lives at risk. They were hosting cardiac monitoring services on single EC2 instances with no multi-AZ or multi-region capability.
I've been reading all the comments on Twitter also... like "err mai gawd I'm switching to AWS because of this" and your failure to have a secondary DNS provider, but I highly doubt you'd switch.
Then another...
"Today's @digitalocean DNS #outage is a reminder to not trust your entire business to one provider. Spread the love around!"
If your company is e-commerce and makes money by being 99.99% available, it's your own fault for having no fail-over.
another... ".@digitalocean that's two hours without DNS now...my company's websites could be losing thousands of £ in e-commerce! Please, an update!"
Times like this make you realize the difference between good clients and bad clients. Yes, they have a right to be upset, but claims like "could be losing thousands of dollars" are mostly exaggerated due to their frustration.
heh yeah, I just laugh at all the tweets saying they're losing billions of dollars every minute their site/app is unavailable. All I can think is... if you're the next Amazon.com I'm pretty sure you'd have some type of disaster plan in place should something like this happen.
I can't disagree with what you're saying, but I think we are all guilty of this. We expect more out of big name services than might be reasonable. (100% uptime)
How many of us here have failover email services in case Gmail goes down? I think many companies would say they'd lose thousands in productivity if Google Apps suffers an outage yet I'd hazard that very few have failover plans.
That's because people like to complain... the reality is stuff happens, systems go down, and life tends to go on.
Yeah, if your building full of employees can't work because the internet is down, and the secondary is also down, then that's kind of crappy, and you may be paying people to twiddle their thumbs... anything short of that, though, is kind of the cost of doing business...
There are redundancy options for a lot of things... If you're using only a single host provider for your infrastructure, and management scoffs at creating redundant, under-utilized systems... it's not as "mission critical" as people think/say.
I use dns.he.net for my DNS hosting. It's free up to 50 zones and has been rock solid. The other day I started having some trouble with accessing some of my domain names. Turned out that all of their DNS was down and was returning NXDOMAIN for pretty much any request, including their own domains. Oops. So I emailed their support (which is usually very quick to respond and is better than I have seen with lots of paid products). Well, it then occurred to me that I will not get a response since the MX records for my domain were also hosted with them. Double oops.
On the plus side, in the past 4 years that I've used them this was the very first issue, and they fixed it within a couple of hours.
Anyone have any good recommendations on cheap or free backup DNS hosting?
Use www.dnsmadeeasy.com as your primary and dns.he.net as your secondary DNS service, or vice versa. They will do transfers/updates from each other and work just fine.
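If you go that route, it's worth confirming that zone transfers from your primary actually work before you need them. A rough sketch with dnspython (the nameserver IP and zone are placeholders, untested):

    import dns.query
    import dns.zone

    PRIMARY_NS = "203.0.113.10"   # placeholder primary nameserver IP
    ZONE = "example.com"          # placeholder zone

    try:
        # Pull the zone exactly the way a secondary would.
        zone = dns.zone.from_xfr(dns.query.xfr(PRIMARY_NS, ZONE))
        print("AXFR ok,", len(zone.nodes), "names transferred")
    except Exception as exc:
        # Most providers refuse AXFR unless the secondary's IPs are whitelisted.
        print("AXFR refused or failed:", exc)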
I had one customer on DO DNS and it was a "good enough" solution. Unfortunately, this came right in the middle of a marketing push for last-minute registrations. An annoyance, but not a major financial impact. (Maybe it will give the impression of excess demand. :)
I understand that things break and I should be ready for it. What I found unacceptable were the status updates. Basically, "we're working on it". No clue as to what was going on. A DDoS? Not a DDoS? Routing issues? Corrupt zone files? No clue? Any of those would be helpful as I needed to figure out if I should wait it out, or switch to Route 53.
In the information vacuum, I switched to Route 53. It works.
Sure, have fun convincing Verisign. They're the ones controlling the .com registry. Your TTL values are meaningless to them when it comes to SOA. Minimum 1 day TTL.
It may seem trivial when it works (hint: it's not), but some of the biggest fuck ups I've seen in my professional life were caused by strange DNS things happening or DNS servers going kaboom.
I feel the pain of the DO engineers trying to mitigate this issue. I really do.
I took the OP's comment as "it's hard to understand DNS, and the biggest fuck ups happen because people think they understand DNS when they actually don't".
The problem with DNS is that it can work even when it is configured incorrectly. This makes people who have no idea what they are doing believe that they actually understand it. The strange issues with DNS only happen with strange configurations. When you follow best practices, everything is predictable.
> I feel the pain of the DO engineers trying to mitigate this issue. I really do.
Me too. Just last week they had another problem with DNS on the client side of things: Resolving with the Google Public DNS, which most droplets use by default, didn't work reliably. I hope that they post a combined post mortem for both of those incidents.
You're right. Sorry I meant the NS glue records. Those have a 1 day minimum TTL. If you're not already running multiple providers, switching after one is down won't help much.
Unless you use a better resolver than the standard glibc resolver on Linux (e.g. dnsmasq, BIND, or similar running locally, with resolv.conf pointed at it), you appear doomed to slow lookups if your first resolv.conf entry fails: most of the resolv.conf options that might have helped (had you set them) simply don't work, or don't do anything particularly useful, in the versions shipped with the Linux distros I've tested.
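If you want to see the symptom for yourself, timing a few lookups while the first nameserver in resolv.conf is unreachable makes it obvious. Quick Python sketch (hostnames are just examples):

    import socket
    import time

    # Time a handful of lookups; with a dead first resolver and default glibc
    # settings you'll typically see each one stall for several seconds.
    for host in ("example.com", "news.ycombinator.com"):
        start = time.monotonic()
        socket.getaddrinfo(host, 80)
        print(host, round(time.monotonic() - start, 2), "seconds")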
Only if you don't know what you're doing. The problem with DNS is that it might work even when it is misconfigured, and misconfiguration is the source of strange issues.
I think that we all have areas where we don't know what we're doing. This is one of mine. With all the talk of how obvious/important/easy it is to have a failover in place in case this happens, I'm having trouble finding a good resource about setting up a redundant DNS.
Running a droplet on Digital Ocean with Debian and Nginx.
Sounds like we're begging for someone to write a nice blog post on how they set up redundant DNS across multiple providers "the right way"... It would hit the front page in short order if anyone is willing to share how they think this should be mitigated, and specifically how common clients can be expected to behave when faced with the various types of outages that may occur!
One thing hosting providers could do better would be to split the risk a little by not handing the same dns server name to every client that chooses to have the hosting provider supply dns services.
The reason this might have some upside is that DDOS attacks against a specific DNS server are often intended to target one specific customer of a hosting provider. The attacker doesn't care about the side effects...just the original target.
Say, for example "controversialblog.com" is hosted on DO, and uses DO dns servers. The person attacking "controversialblog.com" looks up the NS records for the domain, and attacks that DNS server. The fact that it's one hostname that serves all of DO is of little interest to the attacker.
So, if DO would come up with say, 10 separate hostnames they could hand out, then this sort of thing would take down 10% of their customers instead of 100%.
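Something as simple as hashing each customer's domain into one of N nameserver pools would do it. Toy sketch (the hostnames are made up):

    import hashlib

    # Deterministically spread customers across 10 nameserver pools so a DDoS
    # aimed at one customer's NS hostname only takes out that pool.
    NS_POOLS = [("ns%d-a.example-host.com" % i, "ns%d-b.example-host.com" % i)
                for i in range(10)]

    def ns_pair_for(customer_domain):
        bucket = int(hashlib.sha256(customer_domain.encode()).hexdigest(), 16) % len(NS_POOLS)
        return NS_POOLS[bucket]

    print(ns_pair_for("controversialblog.com"))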
Yeah this is pretty unfortunate. We have some big investor meetings today and this unfortunately took our marketing site offline. Hopefully they resolve this soon - it's the first time we've ever experienced an issue with their service.
We really need fail-overs in place...small team problems.
If you are showing a demo or something, you can still navigate to your DO IP address. Of course, I don't know if other things (images, etc.) on your website also rely on their DNS.
We feel the pain as well, as our platform is unreachable. I'm now using another DNS provider and have changed the nameservers in the domain record. However, the DNS propagation is taking some time. What are you doing at the moment as a fail-over?
Sorry if I'm trying to "teach grandma to suck eggs", but can't you just enter the domain in your local hosts file? If it's a network that needs access then presumably you have some sort of proxy/cache that could be seeded with the necessary domain+IP pairing? I suppose these aren't possible if you're trying to demo on someone else's network or in a public space or such.
Tough shit, lol. We now have reduced TTL times for future occurrences, but there's nothing we can do for those users who are still experiencing an outage.
Minimum TTL for SOA records on .com is a day. Lowering your A record TTL isn't buying you much in case of DNS hosting failure. Helpful if your site (not dns) needs to move providers though.
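You can see what the registry is actually serving by asking a .com gTLD server directly; the delegation NS records come back with the registry's TTL, not whatever you configured. Rough dnspython sketch (the domain is a placeholder):

    import dns.message
    import dns.query
    import dns.resolver

    DOMAIN = "example.com"  # placeholder domain

    # Ask a .com gTLD server directly; the delegation NS records usually show up
    # in the authority section of the referral, carrying the registry's TTL.
    gtld_ip = dns.resolver.resolve("a.gtld-servers.net", "A")[0].to_text()
    response = dns.query.udp(dns.message.make_query(DOMAIN, "NS"), gtld_ip, timeout=5)
    for rrset in response.answer + response.authority:
        print(rrset.name, rrset.ttl, rrset.to_text())

    # Compare with the TTL you actually control on your own records:
    print(dns.resolver.resolve(DOMAIN, "A").rrset.ttl)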
How do you apply your rules considering what's available today? Which services are you using? It sounds like it would be a big headache to orchestrate the automation among all these different providers.
I don't have anything more important than a small personal website but now I'm curious. If you set up a system to handle your main DNS provider failing, how do you test it? Is there a good reference where I can find some best practices on this?
1. Pick some subset of your DNS records to monitor, or all of them if you want to be extra thorough. If you are picking a subset, then I'd pick whatever records are most critical to your business.
2. Set up monitoring that queries each of your authoritative name servers for each of the records that you identified in the previous step. The monitoring should notify you if any of the name servers are unresponsive, or return a different response than what's expected (rough sketch below).
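A bare-bones version of step 2, assuming dnspython and placeholder record names (an untested sketch, not production monitoring):

    import dns.exception
    import dns.resolver

    ZONE = "example.com"                                         # placeholder zone
    RECORDS = [("www.example.com", "A"), ("example.com", "MX")]  # your critical subset

    # Resolve the delegated nameservers, then ask each one directly and flag
    # any server that times out or disagrees with the others.
    ns_ips = [dns.resolver.resolve(ns.target, "A")[0].to_text()
              for ns in dns.resolver.resolve(ZONE, "NS")]

    for name, rtype in RECORDS:
        seen = {}
        for ip in ns_ips:
            resolver = dns.resolver.Resolver(configure=False)
            resolver.nameservers = [ip]
            resolver.lifetime = 3
            try:
                seen[ip] = sorted(rr.to_text() for rr in resolver.resolve(name, rtype))
            except dns.exception.DNSException:
                print("ALERT: %s unresponsive or failing for %s %s" % (ip, name, rtype))
        if len({tuple(v) for v in seen.values()}) > 1:
            print("ALERT: nameservers disagree on %s %s: %s" % (name, rtype, seen))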
If you'd like to dig into the details of DNS, then O'Reilly's "DNS and BIND" is highly recommended, even if you're not using BIND.
There are a number of quality hosting providers out there. A rule of thumb that I use is this: If a DNS hosting provider doesn't eat their own dog food, don't trust them to handle your DNS. Digital Ocean doesn't use their own name servers for their main website's domain. Neither does Amazon.
Shameless plug: I created a DNS monitoring service that can be used for monitoring each of your name servers: https://www.dnscheck.co/
The flipside to the dog food point: if DigitalOcean did use their own nameservers for their main site, then we wouldn't have been able to see the status page.
Their status page at https://status.digitalocean.com is also now giving an intermittent "500 Internal Server Error" nginx error, probably from the load. That's why you should use a service like https://www.statuspage.io for your important stuff, even though creating a status page is a fun side-project for a dev team.
So what you're saying is that instead of running their own status page on their own infrastructure that's reachable, they should outsource it to statuspage.io and pay another company to do it?
Yes, that's pretty standard. Availability monitoring and status reporting should be external and separate from your own infrastructure, otherwise neither may be available when you need it the most.
They do have geo-region (not just AZ) redundancy and failover, which puts them quite a bit above most home-grown company status pages in my experience. But yes, if the problem was for example in R53 it would indeed be better to have a solution without that AWS dependency.
Eh, as I remember it they have stuff in multiple AWS regions. Would take a global AWS failure to bring them down. This was my justification for going with them while working somewhere that used AWS.
That said, while it is extremely unlikely it shouldn't be discounted as impossible.
This statement can be misleading. If you read the article, they don't use CloudFlare's DNS servers per se. They use CloudFlare's DNS proxy which acts as a DNS firewall between the DigitalOcean DNS servers and the world.
I think the most annoying aspect of this outage is their updates. Three updates and they all say the same thing, with no meaningful information as to what's causing this. They likely don't have much information, but you'd think there would be something more than what they've been posting for the past hour. Good times!
That's one of the easiest things to do. Just add multiple NS records to the domain. As long as they are configured correctly and at least one is up (and for a limited time even if none are up) the service is available.
Having IP-diverse nameservers assumes that your DNS servers will go down cleanly rather than start returning bad results - I highly recommend having some sort of mechanism to hard-terminate access to those machines.
TLD-diverse nameservers is just an extra strategy for reducing the risk that an upstream TLD issue will blow up your spot.
And then BGP anycast is the expensive, complicated piece of this - it requires a high level of technical sophistication, lots of moving parts, and the QA/validation piece of it is tricky.
When I built an anycast DNS system, we ended up resorting to tricks like having the DNS servers publish routes to the router for redistribution, so that a down or unresponsive server automatically withdrew the routes. Then you do things like TXT records for your zone that respond with which POP you're hitting in some sort of hashed/obfuscated fashion.
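For the curious, the "which POP am I hitting?" check can be as small as a TXT lookup against the anycast address. Hypothetical dnspython sketch (the record name and IP are made up):

    import dns.resolver

    # Query the anycast nameserver for a special TXT record that each POP
    # answers with its own (possibly obfuscated) identifier.
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = ["203.0.113.53"]        # the anycast address (placeholder)
    for rr in resolver.resolve("_pop.example.com", "TXT"):
        print("answered by POP:", rr.to_text())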
It's hard and complicated, and unnecessary for most folks. Better to outsource to Route 53 or someone similar.
Use multiple implementations, e.g. NSD/BIND for authoritative servers and Unbound/BIND for resolvers, to mitigate against implementation-specific bugs and vulnerabilities.
I don't know if DigitalOcean's DNS servers allow AXFR; if they do, you can use a secondary DNS service to automatically replicate the DNS. You then list the secondary DNS as an NS for your domain.
If they don't allow AXFR -- and after this, they should! -- you can still have a secondary DNS provider but you'd have to duplicate any changes by hand. Not ideal but still doable.
I would be interested in the post-mortem from this. While DigitalOcean operates its own DNS, it is only made publicly available through CloudFlare's DNS proxying service.
I recommend AWS Route 53 very highly. Route 53 also allows you to buy domain names and do lots of fancy fail-over, geolocation, and CNAME-style alias-at-the-apex magic.
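For anyone curious what that looks like in practice, here's a hedged boto3 sketch of one way to set up a failover alias at the apex; the hosted zone id, ELB name, and ELB zone id are all placeholders:

    import boto3

    route53 = boto3.client("route53")

    # UPSERT a PRIMARY failover alias record at the zone apex, pointing at an ELB.
    route53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",                      # placeholder zone id
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "example.com.",
                "Type": "A",
                "SetIdentifier": "primary",
                "Failover": "PRIMARY",
                "AliasTarget": {
                    "HostedZoneId": "Z0000000000000",    # the ELB's hosted zone id (placeholder)
                    "DNSName": "my-elb-123456.us-east-1.elb.amazonaws.com.",
                    "EvaluateTargetHealth": True,
                },
            },
        }]},
    )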
A few years ago Slicehost had a DNS outage and the web scrapers I had running were falling over because they couldn't resolve DNS. I had to SSH into 8 boxes and update resolv.conf to add Google DNS and OpenDNS as a backup. (Yes, I should've had centralized config management with Chef or Puppet or Ansible.)
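Scripted, that manual fix is only a few lines; rough Python sketch (the hostnames are placeholders, and it assumes SSH keys plus passwordless sudo):

    import subprocess

    HOSTS = ["scraper%d.example.net" % i for i in range(1, 9)]
    CMD = ("echo 'nameserver 8.8.8.8' | sudo tee -a /etc/resolv.conf && "
           "echo 'nameserver 208.67.222.222' | sudo tee -a /etc/resolv.conf")

    # Append Google DNS and OpenDNS as fallback resolvers on every box.
    for host in HOSTS:
        subprocess.run(["ssh", host, CMD], check=True)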
No offense to anyone here, but what is DO's SLA? Last time I looked, they did not have one.
DO is cheap for a reason. And that's the same reason I don't host with them, I can get SLA-backed infrastructure for a reasonable price and would have no excuse to my customers or cofounders.
We have seamless DNS "failover" by running dnsmasq with the all-servers option on all our servers. It causes dnsmasq to query all upstreams at once, so if any go down it's transparent to our apps. Works perfectly on our 1500 EC2 instances.
I thought their DNS was supposed to be rock solid since they use Cloudflare Virtual DNS. Oh well, lesson learned. Back to running my own DNS servers on each droplet - if the DNS is down, the droplet is likely down regardless :).
Route53 is hard not to love; simple to develop against and very very reliable.
I wrap it in git to make updates more straightforward for people unfamiliar with AWS, but even using it directly is very simple from multiple languages. (https://dns-api.com/)
I ran my non-authoritative DNS server [bind] on a droplet for about a year. But the server crashed every few months. Why? Never figured out. A restart always fixed it.
Later shifted to DO's DNS servers.
Now that that one is down too, I've just shifted back to the domain registrar's DNS.
For us, Route53 is painful. We host a few thousand zones, and due to rate limiting on the APIs, doing something like "show a list of domain names" or "give all the domains matching some pattern" was particularly painful. Upwards of 30 seconds to do a simple listing of domains meant we were forced to cache locally. A local cache, combined with the fact that zone names are not unique in their system (it's possible to create multiple abc.com entries, which differ only in an internal id and the list of NS entries), made it hard to ensure that our internal systems matched "reality". Then the administrative nightmare of 3-4 different NS entries for each zone meant customized, rather than generic, instructions for validating NS settings at the individual registrars.
All in all, it was not a fun experience with such a large volume of zones, but we knew we were an edge case.
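Something like this, roughly, is what the local cache amounts to (hedged boto3 sketch, not necessarily how we wired it up internally):

    import boto3

    route53 = boto3.client("route53")

    # Pull every hosted zone once via the paginator instead of hammering the
    # rate-limited API per lookup, then cache by zone Id (names are not unique,
    # so the Id has to be the key).
    zones = []
    for page in route53.get_paginator("list_hosted_zones").paginate():
        zones.extend(page["HostedZones"])

    cache = {zone["Id"]: zone for zone in zones}
    matching = [z for z in zones if z["Name"].rstrip(".").endswith("abc.com")]
    print(len(cache), len(matching))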
If you want to get alerted when it comes back up, or you wish you had been alerted when it went down, check out my project: https://StatusGator.com.
StatusGator monitors status pages and sends notifications via email, Slack, and others. You can get alerted to status changes inside Slack and you can ask it the status of a service with a /statuscheck command.
A bit of feedback: you should have a link back to your dashboard on every page. That seems like the most important page to me as a user, but if I am changing my notification or account settings, there is no way back to that page.
Yeah, tried to access our site and it was down.
Really was expecting more out of Digital Ocean than to fuck up such an integral part of their infrastructure. In the future we'll be transitioning away from their DNS solution because this is unacceptable.
I hope your clients/users are as understanding and civil as you are.
In the meantime, I'm going to wait for the post-mortem before deciding if I should continue using them for DNS. Looking back over the status history, 1-2 incidents a year isn't that bad for my needs, but it might be too much for you, which is fine (since I'm only hosting a couple of small side projects with them).
I get that this is a pain in the ass. I've got a significant chunk of infrastructure on DO, I've got work to do today that depends on those machines. I learned about this simultaneously when a deployment failed and I got a text from an engineer at a company I consult with. Not a great way to start the day, for sure.
Know what I'm going to do? I'm going to have a cup of coffee and play with my dogs for a bit. It's inconvenient, it's going to delay things, and I'm a bit choked about it. But it's not worth getting angry over, because there's nothing I can do about it today.
The bottom line is: for any app/startup/business, everything is a risk, and if you didn't take the edge case of "what happens if my primary DNS nameserver goes down for my domain?" into account, is blaming DO all you can do?
If your app goes down do you have failover for that? Or do you blame your devops team?
Any small business has to manage which risks it accepts. OP got the Digital Ocean service to try and mitigate this risk to a degree. Beyond that, it becomes a question of accepting risk in other aspects of the product versus the risk that your primary DNS provider will fail.
The reality is, you have limited development and financial resources, so you simply can't do everything. Sure, in a vacuum, or in a larger enterprise, we'd love to manage every risk. But when we're just starting out that's not realistic, and we do have a right to be upset at Digital Ocean's service going down while at the same time realizing that yes, ideally we would / should have had redundancy in place already.
Your question on blaming the devops team is exactly the mindset of someone who has a lot more resources than a brand new startup. In a brand new startup there IS NO devops team. If you're lucky, there is one person who does the devops work part time, balanced with a bunch of other development work he/she also does.
We have auto failover for server, app, and database failures -- this can be easily managed.
DNS nameserver failover should have been built in; after all, there is a reason we specify three DNS nameservers in the domain configuration. Since Digital Ocean took on this task (we used Gandi's DNS before that), we expected it to perform as advertised -- so yes, blame lies with DO.
I also hope the users of your site understand that shit happens. Also as another user said... if DNS is so critical for you then why don't you have proper failover in place?
[1]https://twitter.com/rodrigoespinosa/status/71303563702097100...