The core of the problem from my angle is that rust’s stdlib is not async safe. I...

brundolf · on Feb 15, 2023

> rust’s stdlib is not async safe. It’s literally incorrect to call parts of the Rust stdlib from async code and the language tells you nothing about this.

Can you get more specific about what goes wrong when you call certain parts of the standard lib from async code? This is the first I'm hearing about this, and I'm not fully clear on what you mean by "not async safe"

herbstein · on Feb 15, 2023

Not OP, but it won't yield to the runtime. This blocks the executing thread until the operation completes. That's the only sense in which the standard library isn't async safe.

Additionally, according to a core Tokio developer, the time between .await calls should be in the order of 100s of microseconds at most to ensure the best throughput of the runtime.

jamincan · on Feb 15, 2023

Isn't that a bit different from it not being safe, though? Ill-advised perhaps, but not necessary unsafe.

dcow · on Feb 15, 2023

If you make too many of these calls your program will stop executing until I/O becomes available because you've exhausted your execution threads. If I/O never comes, which is very possible especially over a network connection, then you're stalled. Same outcome as being deadlocked, essentially.

I'd argue this is very much unsafe because it's not correct async programming and can cause really difficult to debug problems (which is the same reason we care about memory safety--technically I can access any part of the pages mapped to my process, it's just not advisable to make assumptions about the contents of regions that other parts of the program might be concurrently modifying).

The fact that people have never heard of this or don't understand the nuance kinda proves the point (=. You should, ideally, have to opt in to making blocking calls from async contexts. Right now it's all too easy to stumble into doing it and the consequences are gnarly in the edge cases, just like memory corruption.

brundolf · on Feb 15, 2023

Hmm. In the worst-case though, wouldn't it just be equivalent to a non-async version of the program? I.e. "blocking" is what every normal function already does. Blocking for IO might take longer, but again, that's what normal Rust programs already do. So to me it sounds like "not async-optimized" rather than "not async-safe"

But none of that sounds anything like a deadlock, so maybe I'm missing some other aspect here

As an aside: "safe" means a very specific thing in Rust contexts, so even if this is analogous to that (preventing a footgun that could have catastrophic consequences), it might be best to use a different word or at least disambiguate. Otherwise people will argue with you about it :)

hdevalence · on Feb 15, 2023

Correct, there’s no actual problem here, just a potential performance pitfall, but that pitfall is no worse than a naive synchronous implementation would be.

There’s no big problem here, just write code and run it. If it is a really big problem it will show up in a profiler (if you care enough about performance to care about this issue, you are using a profiler, right… right?)

dcow · on Feb 15, 2023

I guess you've never written async code. You can irreparably stall the entire program. I've done it. The person you're responding to is not correct in their analysis.

brundolf · on Feb 15, 2023

Can you explain such a case in detail? Genuinely curious to get to the bottom of this

dcow · on Feb 15, 2023

I replied directly to you.

dcow · on Feb 15, 2023

> In the worst-case though, wouldn't it just be equivalent to a non-async version of the program?

No. You can absolutely stall (I'm hesitant to use the phrase deadlock because you're not locking, specifically) an async program. More specifically, you can block all of the available scheduler threads. To see for yourself, spawn a bunch of async tasks in a loop that all increment and print a shared counter and then sleep forever. The counter will stop incrementing at some point. Compare that number to the number of scheduler threads your runtime uses. It should look similar.

I think maybe you're conceptually comparing a single threaded Rust program to an async program, which isn't quite accurate. A "normal" program in my experience has many threads. All of these threads can block. One of them blocking does not inherently slow down the rest of the program. Try the above example but spawn a new thread instead of a new async task during each loop iteration. The number will increment much higher, at least until you've exhausted some system limit on the number of threads a process can have.

Maybe there are fancy async runtimes that dynamically expand their threadpool as the number of blocked tasks increases. But naively they all use a threadpool sized at some multiple of the number of real cores your system has, and so will eventually stall if starved.

In any case, it's not generally async-safe to block any thread, ever (obviously if you could prove that the number of async tasks that block is always less than the number of threads the runtime uses, this doesn't hold, but that's a rather wild assumption). And it's impossible for say a library to know the context in which it's being used, so it's never correct for a library that is to be called in an async context to block. It must always yield.

You're right that in the common case things will just appear to slow down since most things don't block for a long time and eventually continue. But there are gnarly edge cases where the entire program stalls. Hence my comparison to memory safety. In the common case memory corruption and dangling pointers aren't horrible they just sometimes cause spurious errors which most programs can recover from. But every once in awhile they cause really bad problems which programs can't recover. Hence why we care.

For a real world example, imagine a server that processes requests. In the async handler block, for some requests you access the filesystem and the device starts experiencing degraded performance due to media errors and read calls never complete. Should the server stop processing healthchecks because some of the requests are having trouble? The answer is almost certainly no. Instead the server should report in its healthcheck that one of its block devices has an increased error rate and tasks are piling up.

Another case would be one involving locks. If you are dealing with a program that involves many workers/consumers waiting for a producer to produce a value and the workers use a blocking api to wait on a semaphore before reading a next value from memory and continuing, the program will almost certainly deadlock not because of a logic error leading to an imbalanced semaphore but instead because the async runtime will stall and the producer will never produce a value. If instead the tasks yield as they should, the program would progress normally.

brundolf · on Feb 16, 2023

> I think maybe you're conceptually comparing a single threaded Rust program to an async program

Ok yeah, I think this was the disconnect

> spawn a bunch of async tasks in a loop that all increment and print a shared counter and then sleep forever

My instinct is "duh, of course sleeping/looping forever will prevent progress". I can see how in a multithreaded program it might not (if it's written a certain way), but that feels like the exception, not a norm/intuitive assumption. Of course these things are highly subjective

In my view it's an exaggeration to call functions that might block for a long time "not async safe". Especially since it's in no way worse than the single-threaded case (are they also "not sync safe"?). I think for most people the blocking behavior will be obvious/intuitive, even if it has the potential to cause undesirable behavior (just like any program logic could)

It may be a matter of changing expectations- async programming is mainly about interlacing tasks on a thread, not multithreading; it's useful even when there's only one single thread available. It just so happens that you can also spread it across multiple threads, but I don't think it should be expected to live up to every expectation we might have around fully, manually multi-threaded logic

dcow · on Feb 16, 2023

I edited my comment and added a few real world scenarios. In short, a "many consumers single producer" program where the consumers use a blocking API to lock on a semaphore/latch gating access to shared memory updated by the producer is bound to stall. The workers will all block waiting for the producer to produce a value but it never will because the executor is starved. This setup is 100% correct in a traditional program where each worker gets a thread. But it's 100% incorrect when using async/await. The async-safe variant of this setup is to use async-aware semaphore/latch APIs which yield instead of block.

I've seen this happen in non-contrived scenarios in the real world. It's not common, I'll give you that, but definitely not contrived.

Edit: I disagree about expectations. Async programs should absolutely work correctly in all scenarios that aren't programmer/logic errors. In my example(s) the solution is not "rewrite the program the program is wrong". The solution is "use the correct async-aware function at the call site and the issue goes away". The fact that Rust is being used to write web servers means that async/await is up to the task of handling real world highly parallel programs. You just have to be careful to avoid some nuanced traps, which the Rust compiler currently doesn't assist you with. It totally should!

brundolf · on Feb 16, 2023

Then I would ask, where does it end?

Consider a function that isn't doing IO, but it takes a long, potentially variable amount of time (i.e. it "blocks" as long as a file system request might block)

Is it even possible for that function to be "async-safe"? If you have it spin off a separate async task and yield to the original one, that's still filling up your thread pool. The only full solution then would be to manually create a separate thread outside of the async workers (until the OS hits the thread limit, of course)

This too is a little contrived, but I don't see a hard distinction between this and the IO case

At some point: the computer has finite resources, and those can get overloaded. There's something here around how async runtimes might not be making full use of the machine, which is an interesting thread (no pun intended) to pull on. But I don't think we can expect the language (at least as currently designed) to fully prevent this class of problem. And I also don't think the standard library should orient itself around async use-cases, which are still a minority of Rust uses, especially for the sake of an edge-case like this one

dcow · on Feb 16, 2023

> Then I would ask, where does it end?

Slippery slope fallacy.

> If you have it spin off a separate async task and yield to the original one, that's still filling up your thread pool.

No it's not because the original task yielded so it's just sitting there waiting not taking up resources other than memory. It's not "filling up your thread pool".

> This too is a little contrived, but I don't see a hard distinction between this and the IO case

This is correct. A long running function might as well be blocking. The difference, generally, is that a long running function is presumably doing work whilst a function that blocks is waiting consuming compute resources literally doing nothing. The first case is generally considered acceptable because the function needs the processor/resources. The 2nd case is not because it's a waste. Surely you can see that distinction.

In fact, this is the whole reason async/await exists. It's first class syntax to express a scenario that shows up commonly in application machinery: thread pool executing tasks to avoid the massive numbers of blocked threads which traditionally happens in a thread model. And you should know that by definition async/await is not a good fit for tasks that take a long time to compute exactly for this reason we've been discussing. So much so that async runtimes provide an escape hatch for when your async task needs to do something that blocks the task execution threads: spawn_blocking[1].

As a programmer using async/await, you're supposed to know this to write correct async programs... just like you were supposed to know how to manage your pointers. Hardly any, especially those coming from JS, do. So the result is super sloppy and not very performant async code. And sadly people think it's the runtime's fault for not being good at scheduling or just not being performant or something. It's hilariously sad.

1: https://docs.rs/tokio/latest/tokio/task/fn.spawn_blocking.ht..., https://docs.rs/async-std/1.2.0/async_std/task/fn.spawn_bloc...

> At some point: the computer has finite resources, and those can get overloaded.

Yes, as is the case for any problem of computer engineering, one doesn't run code on an infinitely perfect machine. This isn't news.

> There's something here around how async runtimes might not be making full use of the machine, which is an interesting thread (no pun intended) to pull on.

If you didn't before when claiming that yielded tasks fill up a threadpool, here is where you betray your lack of experience on this topic. async/await (and generally the task/executor paradigm) is actually a solution put forth to maximize the use of available compute resources. That doesn't mean every runtime, or more likely the programer's application code, is good at doing that, but that's besides the point. The problem that tons of threads causes is that historically machine threads (as opposed to green threads like Go) require context switching. Context switching is expensive so if you let tons of them pile up you eventually spend more time context switching than actually computing, thus waste resources. In reality this is only a problem for incredibly highly parallel high throughput systems. Sometimes this means a web server.

> But I don't think we can expect the language (at least as currently designed) to fully prevent this class of problem.

Of course not. The language doesn't fully prevent all instances of memory un-safety either, but it gets pretty darn far.

> And I also don't think the standard library should orient itself around async use-cases, which are still a minority of Rust uses, especially for the sake of an edge-case like this one

Nobody is asking this. It's just not how this does or would work.

You're taking a pretty weak position. Just because we can't make something perfect in all cases is not justification for not improving it in most cases. If it were, we'd not have put any effort towards making a memory safe language in the first place and Rust wouldn't exist (contrary to what you might think, Rust programs don't prevent all memory safety issues). It is entirely possible to inadvertently stall your async program under normal usage and you don't need to delve into endlessly computing tasks to do so. As I've explained this can happen almost comically easily as soon as you introduce any type of waiting e.g. semaphores, sockets that a client keeps open, etc. which are all vastly more common than endless computation and entirely normal.

There are two classes of problems here: (1) programmer errors by using the incorrect blocking version of a call, and (2) trying to run super compute heavy workloads. I'm arguing that steps be taken to vastly reduce the possibility of (1) despite the fact that (2) can still happen. Who cares about (2) if we all but eliminate (1)? (Or, once we do we can take a shot at addressing (2).) And the solution is not black magic or something and it actually exists in some other async/await languages. This isn't fiction. It's 100% possible to mark blocking calls and have the complier tell async callers that this function needs special care, such as use of spawn_blocking, if you want to call it from async code. Let the caller disable it with an attribute if they want to shoot themselves in the foot who cares. Just make it better in the general case of "users don't know that they shouldn't call this blocking thing in an async task". For "normal" Rust usage absolutely nothing changes at all.

But even if it cost semantics, Rust's entire shtick is that it helps users write correct programs at the cost of more annoying semantics. I'm not arguing for polluting the stdlib to "orient it around async use cases". I'm arguing to add semantics so that it works normally in normal use cases and prevents unsuspecting users from shooting themselves in the foot when used in async use cases. async/await is not some "minority use case". It's been in Rust for over 4 years now and is remarkably common in my day to day usage of Rust.

pdimitar · on Feb 16, 2023

> In my view it's an exaggeration to call functions that might block for a long time "not async safe".

"Not async safe" is a loaded term here that means "not complying with requirements of being a good citizen in the entire forest of async workers". As another poster said: it's expected most workers to return in 0.1 - 1.0 ms. After that the guarantees of async start to fall apart.

It's the same with Erlang's BEAM btw: it has an amazing scheduling primitives and you can literally spawn 50k green threads and you'd barely see a blip of lag anywhere BUT that's because the scheduler is preemptive and aggressively switches off a worker if it hasn't yielded in a while (this is not 100% the truth, I am simplifying it here). But if you start calling native functions that don't respond in 1ms or less, the scheduler is starting to struggle and you do start seeing lags that you normally never would.

So yeah, "not async safe" here means more or less "not being a good citizen".

No abstraction is free. Until we have something like the BEAM on a hardware level and anything and everything is preempted without you ever having a say on it, then such async runtime if-s and but-s will keep existing. It's a fundamental limitation of our hardware and OS-es.

dcow · on Feb 16, 2023

Just to be crystal clear: you can deadlock an otherwise logically sound async program by using a blocking lock instead of a yielding lock.

I’d call that an unforced programmer error.

jamincan · on Feb 16, 2023

Would a normal async task block indefinitely like that though? Under non-contrived circumstances, I can see the scheduler becoming starved of threads where all the tasks are blocked, but presumably they will stop being blocked eventually, wouldn't they?

dcow · on Feb 16, 2023

See sibling.