I have a site and API which rates other sites, checking them for ad links and trying to match them to real world business records. Inspecting a site takes 10 seconds to 2 minutes. The API, which is usually used from browser add-ons which tag web search results, lets anyone request the rating info for a site.
Most requests are already cached and return immediately. Requests for unknown sites are queued up for the rating engine, which runs in processes separate from the web-facing side, and return "wait" to the requestor.
So there's a work queue to manage.
It's managed using fair queuing. If there is no pending request from the requesting IP address, the new request goes into the queue. If there's a request from the same IP address already in the queue, the new request is held until the first one completes, and then added to the end of the main queue. No request is rejected until there are 100 requests from a single IP address. So each IP address competes against itself, and no one IP can hog the system.
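Roughly, the admission policy looks like this (a minimal in-memory sketch; the real queue lives in a database and the names here are illustrative):

    from collections import deque

    MAX_PENDING_PER_IP = 100   # beyond this many from one IP, new requests are rejected

    main_queue = deque()       # requests eligible for the rating engine
    held = {}                  # ip -> deque of requests waiting behind an earlier one
    in_queue = set()           # IPs that currently have a request in main_queue

    def submit(ip, request):
        """Admit a request under the per-IP fairness policy."""
        pending = len(held.get(ip, ())) + (1 if ip in in_queue else 0)
        if pending >= MAX_PENDING_PER_IP:
            return "rejected"
        if ip in in_queue:
            # Same IP already has a queued request: hold this one behind it.
            held.setdefault(ip, deque()).append(request)
            return "held"
        main_queue.append((ip, request))
        in_queue.add(ip)
        return "queued"

    def complete(ip):
        """When this IP's queued request finishes, promote its next held request."""
        in_queue.discard(ip)
        if held.get(ip):
            main_queue.append((ip, held[ip].popleft()))
            in_queue.add(ip)
            if not held[ip]:
                del held[ip]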
A typical overload comes when someone searches for an unusual topic and flips through many pages of search results quickly. This can result in the rating engine having fifty or so new sites to examine within a few seconds. Those requests will be processed one at a time until the rating engine catches up.
This queuing system held up well when someone tried a test where they fed a huge list of sites into the API without waiting for completion of any of their requests. The rating engine ran busily for a week, but they were only tying up one rating process, and it didn't affect other users at all.
There are no adjustment parameters. It just runs. It's not perfect, but it deals well with legitimate transients and with abuse from a small number of IP addresses.
Fair queuing is an interesting and complementary approach that I hadn't considered. Thanks for the suggestion.
I usually work with "internal" systems that don't always have an obvious "user" identifier, but it would be fairly easy to add one of some sort (e.g. an application id or similar). However, for that to be useful, there still has to be some sort of limit. In your case, the limit seems to be 1 concurrent request per user, up to a maximum of 100. In the case of some internal system, like say a metrics aggregator, what is the correct limit? I'm not sure, and I think it would require tuning.
That said, given a reasonable limit, this would definitely ensure that the "misbehaving" application (e.g. the one flooding the metrics aggregator with packets) is the one that gets punished, rather than everyone else. I'll have to think about this more. Thanks!
Python and MySQL/MariaDB. The queue is an in-memory table. Fair queuing is implemented as a big SQL SELECT statement.
The time scale is slow enough that polling it once a second for new work is sufficient. If it had to go fast, it would require something like a standard fair queuing implementation, where you hash the IP address and put the request onto 1 of N queues, serviced round-robin.
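For illustration, a sketch of that hashed variant in Python (the bucket count and names are arbitrary, not the actual implementation):

    import hashlib
    import itertools
    from collections import deque

    N_QUEUES = 16                                  # number of buckets; illustrative
    queues = [deque() for _ in range(N_QUEUES)]
    _rr = itertools.cycle(range(N_QUEUES))

    def enqueue(ip, request):
        # Hash the IP so every request from one client lands in the same bucket.
        bucket = int(hashlib.sha1(ip.encode()).hexdigest(), 16) % N_QUEUES
        queues[bucket].append(request)

    def next_request():
        # Service the buckets round-robin; a single busy client only ever
        # monopolizes its own bucket.
        for _ in range(N_QUEUES):
            q = queues[next(_rr)]
            if q:
                return q.popleft()
        return None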
Doesn't matter all that much. For ingestion, you might have one VM running one Java process and one DB process as a microservice (I like Dropwizard and PostgreSQL for this). The worker itself would be another Java process with a thread pool capped at 100 threads that either periodically checks for new work, or uses the LISTEN/NOTIFY support in Postgres 9.6. In this case the "secret sauce" is that the worker has some extra smarts to reorder the queue to avoid resource hogs, which is the cool part.
I use App Engine's static file serving to serve my site. It has a 1 GB/day bandwidth limit. It turns out Hacker News still moves a lot of traffic, so this is the first time I've crossed it! Oops. Billing is now enabled on this project.
The GCP team did a blog post similar to this about using load shedding to survive a spike[0]. It's definitely a good way to survive a sudden spike in traffic, but it's not optimal, as serving errors is also bad, unless your clients have some sort of retry mechanism in place. If they do (especially one with a backoff on retries), it can be a great way to handle high loads.
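For example, a client-side retry with exponential backoff and jitter might look something like this (a sketch in Python using the third-party requests library; the URL and limits are placeholders):

    import random
    import time

    import requests  # third-party HTTP client, assumed here for brevity

    def fetch_with_backoff(url, attempts=5, base_delay=0.5, max_delay=30.0):
        """Retry on 5xx or network errors with exponential backoff plus jitter."""
        for attempt in range(attempts):
            try:
                resp = requests.get(url, timeout=5)
                if resp.status_code < 500:
                    return resp          # success, or a client error we shouldn't retry
            except requests.RequestException:
                pass                     # treat network errors like a shed request
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.5))   # jitter avoids synchronized retries
        raise RuntimeError("gave up after %d attempts" % attempts)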
My team moved our entire platform to App Engine to deal with the traffic spikes we get (among other reasons). It's worked out quite well, I must say. It's also cool that if you know you have a spike coming that even AE can't handle on its own, you can tell the service to scale itself up a few minutes beforehand in anticipation and then scale back down after it's passed.
Retries aren't the only option if you control (or can at least influence) the client. Fallback sources and sane (or customer-preferential) defaults are some additional options when the primary source of truth is unavailable. Both of these bypass thundering herd situations, where recovery can be difficult even once the application returns to full health.
As a concrete example, when I worked at a large online retailer, we had to check certain customer properties during checkout. In the optimal case, we went to a service which fronted a consistent, transactional database. If that service failed or timed out, we had the option to hit a fast cache which generally was only a few milliseconds out of date. Finally, we could use less authoritative data that came from the application earlier, and if present, would always be in the customer's favor. That made it unlikely that our service ever appeared "down" regardless of the state of the dependencies.
Based on previous decisions made by customers that were processed within the lifetime of the same request, there was some probability that the cache could be out of date.
Typically, there was enough context within the request to make assumptions about the state of the authoritative store if data hadn't propagated to the cache in time. We eventually moved to checking that first.
edit: I should add, though, the order of the datasources wasn't critical to the strategy: have a primary data source, backup data source, use any data available in the request context, and finally err in the customer's favor. That's 4 chances to get things right, none of which include retrying.
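A rough sketch of that chain (purely illustrative; the callables stand in for the real service, cache, and request-context lookups):

    def customer_property(sources, default):
        """Try data sources in order of authority; if all fail, err in the customer's favor.

        `sources` is a list of zero-argument callables, most authoritative first,
        e.g. [service_lookup, cache_lookup, lambda: request_ctx.get("flag")].
        """
        for source in sources:
            try:
                value = source()
                if value is not None:
                    return value
            except Exception:
                continue           # timeout or failure: fall through to the next source
        return default             # the final fallback errs toward the customer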
Any time I'm posting something that may endorse "the company" I tend to put that as a clear marker that I am likely biased in what I'm saying.
Also, most large companies have something in the employment contract that talks about what you are allowed to post online (especially when it relates to the company).
The thing that trips me up along with request caps a lot is timeouts (briefly mentioned). There are so many network / API / retry timeouts set to arbitrarily large amounts (e.g. 30 seconds) that it's hard to predict how they'll interact, and often you don't know that some of these exist and at which layers they exist (application layer, network layer, DNS, etc).
The way you often see this is that your request limiter keeps booting new requests (FIFO/drop tail), and then when you check your currently processing requests, they all seem to be stuck in some busy loop, and hopefully you have some means of checking why. You check, and you find that they're all querying a service that accepts the request but never responds (e.g. a server down behind a proxy), but for some reason the timeout on the client side is set to 30 seconds, so things keep hanging until they're killed.
Anyway, this is to say that in any sufficiently complex networked system scenario, it helps to review all these timeouts holistically and in the context of your application.
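As a small example, every outbound call should carry an explicit, deliberately short timeout rather than inheriting some 30-second default buried in another layer (a sketch using the third-party requests library; the endpoint and values are placeholders):

    import requests  # third-party HTTP client, assumed here; the point is explicit timeouts

    # Separate connect and read timeouts, both deliberately short.
    resp = requests.get(
        "https://api.example.com/rating",   # hypothetical endpoint
        timeout=(1.0, 3.0),                 # (connect timeout, read timeout) in seconds
    )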
Heh, timeouts are the bane of my life as well as my most amusing source of extra income.
Client: I'm getting 504 Gateway Timeout errors on my site. Can you help?
Me: After how many seconds?
Client: About 30.
Me: Reduce the timeout to 20 seconds in your PHP-FPM pool configuration.
Client: But then I'll get even more timeouts!
Me: No you won't. While you're at it, reduce it even further to 10 seconds.
Client: What? That doesn't make any sense!
Me: Let's try 5 seconds then.
... Silence ...
Client: Hey, no more timeouts! But the widget on the sidebar is broken.
Me: That widget was trying to access an API that is currently down. Due to the long timeout and poor caching, it also took down your whole website. I heard you paid $cheapCompetitor to make the widget. If you'd like me to make you a better version by Saturday, that would be $XXX.
You also have the issue that you might be dropping new requests, but there's so much backlog that old requests eventually time out in droves, too. You might end up spending a lot of CPU cycles doing partial work on some requests, only to eventually see them hit their deadline.
That's one of the reasons why Facebook has a mechanism that switches their FIFO queues to LIFO, once the server has passed a certain load/latency threshold.
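A minimal sketch of that adaptive-LIFO idea (not Facebook's implementation; the thresholds here are illustrative):

    import time
    from collections import deque

    class AdaptiveQueue:
        """Serve FIFO normally, LIFO once the backlog gets too deep or too old.

        Under overload, the newest requests are the ones whose clients are still
        waiting; the oldest have likely timed out on the client side already.
        """

        def __init__(self, max_backlog=1000, max_age_s=5.0):
            self.items = deque()          # (enqueue_time, request)
            self.max_backlog = max_backlog
            self.max_age_s = max_age_s

        def put(self, request):
            self.items.append((time.monotonic(), request))

        def get(self):
            if not self.items:
                return None
            oldest_age = time.monotonic() - self.items[0][0]
            overloaded = len(self.items) > self.max_backlog or oldest_age > self.max_age_s
            # LIFO under overload (newest first), FIFO otherwise (oldest first).
            _, request = self.items.pop() if overloaded else self.items.popleft()
            return request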
But mashing reload is what users are supposed to do by design of the web.
That is, errors in internetworked systems can happen at a level high enough that machines can't decide whether it is sane to retry or not; they should simply bubble the error all the way back to the user and let her press F5 if she wants to.
Modern UIs might hide this, but users well know the web is flaky and understand it in their bones. So when they press reload they are sending important information.
A classical solution is to use a control algorithm (for example PI or PID) to avoid having to estimate the overload threshold: the server will just start shedding a fraction of the requests to maintain some metric (CPU utilization, latency, queue length, ...) below some target.
See for example [1]. This is about background jobs, but the same principles apply to load shedding.
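A minimal sketch of the idea, using just a proportional term and an illustrative latency target (a real deployment would likely want the integral term and per-service tuning):

    import random

    class LatencyShedder:
        """Adjust the fraction of dropped requests to hold latency near a target."""

        def __init__(self, target_latency_s=0.2, gain=0.1):
            self.target = target_latency_s
            self.gain = gain              # proportional gain; needs tuning per service
            self.drop_fraction = 0.0

        def update(self, observed_latency_s):
            # Too slow -> raise the drop fraction; faster than target -> lower it.
            error = observed_latency_s - self.target
            self.drop_fraction = min(0.9, max(0.0, self.drop_fraction + self.gain * error))

        def should_shed(self):
            return random.random() < self.drop_fraction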
Thanks for the link; the abstract seems relevant. I agree: a control algorithm seems like it should work. The challenge is figuring out the metrics and parameters to make something that works "well enough" for most applications, the way TCP does for networks. I would really love for someone to figure that out, so we stop running into this problem so often.
Yeah, it's something that needs to be tuned per-service. It is quite easy if the service is X-bound for some local resource X (CPU, disk, flash, network card), but if it is bottlenecked on external service calls it can be quite challenging to define a representative stress metric.
I've seen similar issues occur, but based around GC behavior.
As in the author's example, when requests start stacking up in a process, we also potentially increase the memory load on the GC. It's also possible to hit a sort of pathological case, where allocated objects live long enough to get promoted to the older generations and then die, requiring almost constant major GCs to free the promoted objects.
I worked on one system where all performance testing was done in a lab environment under perfect conditions, with external resources all taking less than 1 ms to respond. However, when deployed to a production network, it consistently performed abysmally, losing an order of magnitude of performance to the larger number of outstanding requests being processed and the associated GC load.
Newer GCs do appear to handle object promotion much better, but it's one more thing to consider, where some GC tuning and a reduction in garbage can yield an order of magnitude increase in performance.
Logged-in users should take priority; otherwise, depending on the number of page requests, all users will see an error at some point, breaking the site for everyone.
It's worth checking out, and it helps you achieve this goal without necessarily having to come up with a req/s magic number; instead you measure actual event loop latency to detect a slowdown.
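The same idea sketched in Python's asyncio, purely as an illustration (the original here is about Node; the lag threshold is arbitrary):

    import asyncio
    import time

    EVENT_LOOP_LAG = 0.0     # latest measured lag, read by request handlers
    CHECK_INTERVAL = 0.5     # seconds between samples
    MAX_LAG = 0.07           # shed new work above ~70 ms of loop lag

    async def monitor_loop_lag():
        """Measure how late the loop wakes us up; lateness means the loop is overloaded."""
        global EVENT_LOOP_LAG
        while True:
            start = time.monotonic()
            await asyncio.sleep(CHECK_INTERVAL)
            EVENT_LOOP_LAG = max(0.0, time.monotonic() - start - CHECK_INTERVAL)

    def too_busy():
        return EVENT_LOOP_LAG > MAX_LAG    # handlers return 503 when this is True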
A non-general solution that might work for most of us often goes by the name "shopper prioritization".
Basically if you have a site that does lots of processing (search, orders, etc) then a pretty effective strategy is often to show a generic static "down for maintenance, try again" page. This requires vastly fewer resources to serve since the page can be sitting in web server ram and doesn't result in rpc calls throughout the whole stack.
You can also do something like prioritize logged in users over casual browsers.
Since most http clients are smart clients you can even do some stuff with rngs and timeouts to selectively let people who are willing to wait around back into the site.
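A sketch of that rng-and-timeout idea (all the numbers are placeholders):

    import random

    ADMIT_FRACTION = 0.2    # fraction of casual traffic let in while overloaded
    RETRY_AFTER_S = 30      # patient clients come back after this delay

    def admit(overloaded, logged_in):
        """Decide whether to serve, favoring logged-in users during overload."""
        if not overloaded or logged_in:
            return ("serve", {})
        if random.random() < ADMIT_FRACTION:
            return ("serve", {})
        # Static maintenance page plus a Retry-After header; clients willing to
        # wait get another roll of the dice on their next attempt.
        return ("maintenance_page", {"Retry-After": str(RETRY_AFTER_S)})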
Of course if you're an ad-tech company it's likely that serving an error is about as fast as serving something successfully.
Haproxy is very flexible for this sort of thing. You can, for example, limit concurrent connections to backends, and place requests in a queue with a specific timeout. Not at all automatic, but at least all the right knobs are there.
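For example, something along these lines in a backend section (server names, addresses, and limits are illustrative):

    backend app_servers
        # Don't let requests wait in the queue forever once all slots are busy.
        timeout queue 5s
        # Cap concurrent connections per backend server; excess requests queue.
        server app1 10.0.0.11:8080 maxconn 50
        server app2 10.0.0.12:8080 maxconn 50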
Edit: Related, Fedex's tracking is down on their website right now. It's returning quickly, though, with this in the JSON: "Service com.fedex.nxgen.trck.v8.ientities.TrckInterface is busy, max invoke limit reached"
In AWS, achieving peak theoretical performance requires queueing and pooling jobs of specific sizes. You will pay for it in both cost and performance if you don't implement backlogs and limit concurrency.
This is basically just "stacks and queues" in a general sense, most often seen as wait-and-retry in an application, and rate controls in a service.
To do this and survive cascading failures you have to know the limits of each part of your stack and implement rate controls for each. An easy way to do this is to design network services that serve requests for applications via an API, rather than giving apps unfettered access to network resources. You can allow applications access to your network service and change rate controls for each app as needed. Alternately, you can add limits to your apps' layers at design time, rather than having to discover the limits by performance testing or trial-and-error.
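A per-application token bucket at the service boundary is one minimal way to sketch this (app names and rates are illustrative):

    import time

    class TokenBucket:
        """Classic token bucket: about `rate` requests/second, bursts up to `burst`."""

        def __init__(self, rate, burst):
            self.rate = rate
            self.burst = burst
            self.tokens = burst
            self.last = time.monotonic()

        def allow(self):
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    # One bucket per calling application; limits can be changed without touching the apps.
    limits = {"metrics-aggregator": TokenBucket(rate=100, burst=200),
              "billing-batch": TokenBucket(rate=10, burst=20)}

    def handle(app_id):
        bucket = limits.get(app_id)
        if bucket is None or not bucket.allow():
            return "429 Too Many Requests"   # unknown or over-limit apps are throttled
        return "200 OK"                      # the real handler would do the work here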
The author wants a universal solution, which is a bit like saying, please solve "traffic" for me for every mode of transportation. Airplanes have different traffic than cars, so their designs will differ, although they all involve an agent that coordinates traffic flows according to a set of rules.
I agree that a universal solution sounds difficult. However, it seems like it should be possible to have something that works "well enough" for some (broad) class of servers. My example is TCP, which does a "good enough" job of controlling the flow of packets over a wide range of networks.
1) Sign session ids that you issue and reject requests that don't have a valid signature. This can be done entirely in software at the router level without any I/O
2) For each session, do authentication. Unauthenticated sessions get lower caps.
3) For authenticated sessions, have various schemes of the form X units of Y in Z seconds, and consult that quota database, sharded by Y, before each expensive operation on Y (see the sketch after this list).
4) Y should be prefixed with a user role or payment plan or whatever, followed by the actual resource id. So people can buy another payment plan.
5) Possibly let people pay to access a resource beyond the quota.
6) Make clients recognize quota errors and retry with exponential backoff.
7) Host static resources in the app, fall back to a CDN, and only host dynamic resources on your own source servers.
8) Wherever possible, cache things in the client, keeping in mind that you have to evict the least recently used items eventually.
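A sketch of the "X units of Y in Z seconds" check from item 3, kept in memory for illustration (in practice this would live in a datastore sharded by Y):

    import time
    from collections import defaultdict, deque

    class SlidingWindowQuota:
        """Allow at most `limit` units per `window_s` seconds for each key.

        A key might look like "plan:free:resource:report-42", so the quota can
        differ by role or payment plan as in item 4.
        """

        def __init__(self, limit, window_s):
            self.limit = limit
            self.window_s = window_s
            self.events = defaultdict(deque)    # key -> timestamps of recent uses

        def consume(self, key, units=1):
            now = time.monotonic()
            q = self.events[key]
            while q and q[0] <= now - self.window_s:
                q.popleft()                     # drop uses that fell out of the window
            if len(q) + units > self.limit:
                return False                    # caller should return a quota error (item 6)
            q.extend([now] * units)
            return True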
The main problem with this article is that it assumes that the only resource consumption is CPU/memory on the server. With that assumption, it makes complete sense to limit the number of concurrent connections.
However, if you do so, then your server can be brought to its knees by a single client opening that many connections and then being really slow about reading data off the network. Then the limited connections on the server are all sitting there doing nothing, and no new connections are allowed.
This is a good point, although I don't think I said "connections"; I said "requests". The definition of a request is going to vary significantly. Yes, very slow clients are yet another problem that extremely robust systems have to handle.
In my personal case, nearly all of my work has always been on "internal" services that are being used inside a single organization inside a data center, where this is rarely a problem. This is much more of an issue for services that are accessed over the public Internet, where you need to handle malicious attackers as well as users that have horrible connections.
In any case, I believe what I wrote still applies, but you do need to set appropriate timeouts (yet another parameter to tune; yuck!)
For those asking, this seems to be hosted on App Engine and he's gone over his quota. App Engine's throttling is both a blessing and a curse (presumably he's using the free tier and never noticed this). You'll note that it's explicitly saying he's over quota not that it can't handle it ;).
If you have multiple workers each managing their own throttling, there is a particular failure mode you may run into: everything is operating normally, then requests increase to the point that one of your servers starts throttling requests. Your load balancer redistributes traffic to the remaining machines, which see even greater load and in turn start throttling requests as well. At some point you have no working servers, and all requests are failing. You start adding servers, but as soon as a server enters the load balancer pool it is bombarded with all the traffic and shuts off. At this point you have more than enough servers to handle requests, but all of them are disabled.
Most solutions to the thundering herd problem involve rate limiting at the load balancer, or a finer-grained heuristic to ensure that each server does as much work as it can, even under load.
But by this point the request has already been accepted and processed. It's also using a rolling average, which means you'll be "killing" connections that otherwise wouldn't need to be killed.
Well, there is no exact threshold for what "needs to be killed". The higher the load, the longer it takes to process requests. At some point you would rather drop additional requests than make things even slower for everybody. This is what happens here.
Think of it like this: When the server is already darn slow at a load of 10 we don't want to process more requests because we know at values over 10 it gets unbearable.
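The approach being discussed boils down to something like this (a sketch; the threshold is illustrative, and the check runs before any expensive work):

    import os

    MAX_LOAD = 10.0   # illustrative: above this 1-minute load average, things get unbearable

    def maybe_shed():
        """Bail out with a 503 before doing any expensive work when the host is overloaded."""
        load_1min, _, _ = os.getloadavg()    # available on Unix-like systems
        if load_1min > MAX_LOAD:
            return (503, "Service temporarily overloaded, please retry later")
        return None                          # otherwise handle the request normally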
My point is it's still processing the request. You're just shielding the web server from whatever the application holds.
In other words, let's find the theoretical limit to what your web server can handle in terms of concurrent requests. Let's throw them all at your application. In this scenario, the check-load-then-die code is worthless because the request will never get there. It doesn't solve the problem as shown in the OP. What it does is mitigate the effect of the application on the total load of the web server. Which is another problem altogether.
Also:
> drop additional requests instead of making things even slower for everybody. This is what happens here.
Not exactly. The request completes like any HTTP request would. That you're die()ing out doesn't change that, at least from the perspective of the web server.
> find the theoretical limit to what your web server can handle in terms of concurrent requests. Let's throw them all at your application. In this scenario, the check-load-then-die code is worthless because the request will never get there.
If there is such a limit, then why not mitigate that by setting a max number of connections for the webserver?
> You're just shielding the web server from whatever the application holds.
True. But as I said: it works. We haven't seen our server get slow since we implemented this a few years ago. It kicks in when we get about 25x our normal traffic, which has happened less than once a year so far.
This is basically how Hapi works in the NodeJS world.
If the event loop lag is too long (you configure the threshold), it means you're doing too much! So it issues 503s to all new traffic until the server can cope with the requests it's receiving and the loop time comes back down.
Works very well and you can configure it to do the same with memory limits as well.
Doesn't fix all issues, but it's a good way to keep your site usable by some users at least.
Not usable by some users, though; usable for some requests. Users will still be annoyed when they can't complete actions, because some of their requests fail randomly.