Futurelock: A subtle risk in async Rust

[+] hitekker|4 months ago|reply

Skimming through, this document feels thorough and transparent. Clearly, a hard lesson learned. The footnotes, in particular, caught my eye https://rfd.shared.oxide.computer/rfd/397#_external_referenc...

> Why does this situation suck? It’s clear that many of us haven’t been aware of cancellation safety and it seems likely there are many cancellation issues all over Omicron. It’s awfully stressful to find out while we’re working so hard to ship a product ASAP that we have some unknown number of arbitrarily bad bugs that we cannot easily even find. It’s also frustrating that this feels just like the memory safety issues in C that we adopted Rust to get away from: there’s some dynamic property that the programmer is responsible for guaranteeing, the compiler is unable to provide any help with it, the failure mode for getting it wrong is often undebuggable (by construction, the program has not done something it should have, so it’s not like there’s a log message or residual state you could see in a debugger or console), and the failure mode for getting it wrong can be arbitrarily damaging (crashes, hangs, data corruption, you name it). Add on that this behavior is apparently mostly undocumented outside of one macro in one (popular) crate in the async/await ecosystem and yeah, this is frustrating. This feels antithetical to what many of us understood to be a core principle of Rust, that we avoid such insidious runtime behavior by forcing the programmer to demonstrate at compile-time that the code is well-formed

[+] csande17|4 months ago|reply

In case anyone else was confused: the link/quote in this comment are from the previous "async cancellation issue" write-up, which describes a situation where you "drop" a future: the code in the async function stops running, and all the destructors on its local variables are executed.

The new write-up from OP is that you can "forget" a future (or just hold onto it longer than you meant to), in which case the code in the async function stops running but the destructors are NOT executed.

Both of these behaviors are allowed by Rust's fairly narrow definition of "safety" (which allows memory leaks, deadlocks, infinite loops, and, obviously, logic bugs), but I can see why you'd be disappointed if you bought into the broader philosophy of Rust making it easier to write correct software. Even the Rust team themselves aren't immune -- see the "leakpocalypse" from before 1.0.

[+] rtpg|4 months ago|reply

I guess one big question here is whether there's a higher layer abstraction that is available to wrap around patterns to avoid this.

It does feel like there's still generally possibilities of deadlocks in Rust concurrency right? I understand the feeling here that it feels like ... uhh... RAII-style _something_ should be preventing this, because it feels like statically we should be able to identify this issue in this simple case.

I still have a hard time understanding how much of this is incidental and how much of this is just downstream of the Rust/Tokio model not having enough to work on here.

[+] Dagonfly|4 months ago|reply

That's a really subtle version of the deadlock described in withoutboats FuturesUnordered post [0]

When using “intra-task” concurrency, you really have to ensure that none of the futures are starving.

Spawning task should probably be the default. For timeouts use tokio::select! but make sure all pending futures are owned by it. I would never recommend FuturesUnordered unless you really test all edge-cases.

[0] https://without.boats/blog/futures-unordered/

[+] singron|4 months ago|reply

This sounds very similar to priority inversion. E.g. if you have Thread T_high running at high priority and thread T_low running at low priority, and T_low holds a lock that T_high wants to acquire, T_high won't get to run until T_low gets scheduled.

The OS can detect this and make T_low "inherit" the priority of T_high. I wonder if there is a similar idea possible with tokio? E.g. if you are awaiting a Mutex held by a future that "can't run", then poll that future instead. I would guess detecting the "can't run" case would require quite a bit of overhead, but maybe it can be done.

I think an especially difficult factor is that you don't even need to use a direct await.

    let future1 = do_async_thing("op1", lock.clone()).boxed();
    tokio::select! {
      _ = &mut future1 => {
        println!("do_stuff: arm1 future finished");
      }
      _ = sleep(Duration::from_millis(500)) => {
        // No .await, but both will futurelock on future1.
        tokio::select! {
          _ = do_async_thing("op2", lock.clone()) => {},
          _ = do_async_thing("op3", lock.clone()) => {},
        };
      }
    };

I.e. so "can't run" detector needs to determine that no other task will run the future, and the future isn't in the current set of things being polled by this task.

[+] oconnor663|4 months ago|reply

> I wonder if there is a similar idea possible with tokio? E.g. if you are awaiting a Mutex held by a future that "can't run", then poll that future instead.

Something like this could make sense for Tokio tasks. (I don't know how complicated their task scheduler is; maybe it already does stuff like this?) But it's not possible for futures within a task, as in this post. This goes all the way back to the "futures are inert" design of async Rust: You don't necessarily need to communicate with the runtime at all to create a future or to poll it or to stop polling it. You only need to talk to the runtime at the task level, either to spawn new tasks, or to wake up your own task. Futures are pretty much just plain old structs, and Tokio doesn't know how many futures my async function creates internally, any more than it knows about my integers or strings or hash maps.

[+] Veserv|4 months ago|reply

I thought Rust async is a colored stackless coroutine model and thus it would be unsafe to continue execution of previously executing async functions.

To explain, generally speaking, stackless coroutine async only need coloring because they are actually “independent stack”less coroutines. What they actually do is that they share the stack for their local state. This forces async function execution to proceed in LIFO order so you do not blow away the stack of the async function executing immediately after which demands state machine transforms to be safe. This is why you need coloring unlike stackful coroutine models which can execute, yield, and complete in arbitrary order since their local state is preserved in a safe location.

[+] johnisgood|4 months ago|reply

Off-topic but that code looks quite... complicated as opposed to what I would write in Erlang, Elixir, Go, or even C. Maybe it is just me.

[+] DevelopingElk|4 months ago|reply

My view of this is that its closer to the basic 2 lock Deadlock.

Thread 1 acquires A. Thread 2 acquires B. Thread 1 tries to acquire B. Thread 2 tries to acquire A.

In this case, the role "A" is being played by the front of the Mutex's lock queue. Role "B" is being played by the Tokio's actively executed task.

Based on this understanding, I agree that the surprising behavior is due to Tokio's Mutex/Lock Queue implementation. If this was an OS Mutex, and a thread waiting for the Mutex can't wake for some reason, the OS can wake a different thread waiting for that Mutex. I think the difficulty in this approach has to do with how Rust's async is implemented. My guess is the algorithm for releasing a lock goes something like this:

1. Pop the head of the wait queue. 2. Poll the top level tokio::spawn'ed task of the Future that is holding the Mutex.

What you want is something like this

For each Future in the wait queue (Front to Back): Poll the Future. If Success - Break ???Something if everything fails???

The reason this doesn't work has to do with how futures compose. Futures compile to states within a state machine. What happens when a future polled within the wait queue completes? How is control flow handed back to the caller?

I guess you might be able to have some fallback that polls the futures independently then polls the top level future to try and get things unstuck. But this could cause confusing behavior where futures are being polled even though no code path within your code is await'ing them. Maybe this is better though?

[+] dboreham|4 months ago|reply

Which is why "async" is a pox on our house. System threads can and do address these edge issues. User level concurrency generally doesn't (perhaps with the exceptions of golang and erlang).

[+] Matthias247|4 months ago|reply

As far as I remember from building these things with others within the async rust ecosystem (hey Eliza!) was that there was a certain tradeoff: if you wouldn’t be able to select on references, you couldn’t run into this issue. However you also wouldn’t be able run use select! in a while loop and try to acquire the same lock (or read from the same channel) without losing your position in the queue.

I fully agree that this and the cancellation issues discussed before can lead to surprising issues even to seasoned Rust experts. But I’m not sure what really can be improved under the main operating model of async rust (every future can be dropped).

But compared to working with callbacks the amount of surprising things is still rather low :)

[+] mycoliza|4 months ago|reply

Indeed, you are correct (and hi Matthias!). After we got to the bottom of this deadlock, my coworkers and I had one of our characteristic "how could we have prevented this?" conversations, and reached the somewhat sad conclusion that actually, there was basically nothing we could easily blame for this. All the Tokio primitives involved were working precisely as they were supposed to. The only thing that would have prevented this without completely re-designing Rust's async from the ground up would be to ban the use of `&mut future`s in `select!`...but that eliminates a lot of correct code, too. Not being able to do that would make it pretty hard to express a lot of things that many applications might reasonably want to express, as you described. I discussed this a bit in this comment[1] as well.

On the other hand, it also wasn't our coworker who had written the code where we found the bug who was to blame, either. It wasn't a case of sloppy programming; he had done everything correctly and put the pieces together the way you were supposed to. All the pieces worked as they were supposed to, and his code seemed to be using them correctly, but the interaction of these pieces resulted in a deadlock that it would have been very difficult for him to anticipate.

So, our conclusion was, wow, this just kind of sucks. Not an indictment of async Rust as a whole, but an unfortunate emergent behavior arising from an interaction of individually well-designed pieces. Just something you gotta watch out for, I guess. And that's pretty sad to have to admit.

[1] https://news.ycombinator.com/item?id=45776868

[+] octoberfranklin|4 months ago|reply

> However you also wouldn’t be able run use select! in a while loop and try to acquire the same lock (or read from the same channel) without losing your position in the queue.

No, just have select!() on a bunch of owned Futures return the futures that weren't selected instead of dropping them. Then you don't lose state. Yes, this is awkward, but it's the only logically coherent way. There is probably some macro voodoo that makes it ergonomic. But even this doesn't fix the root cause because dropping an owned Future isn't guaranteed to cancel it cleanly.

For the real root cause: https://news.ycombinator.com/item?id=45777234

[+] jacquesm|4 months ago|reply

If any rust designers are lurking about here: what made you decide to go for the async design pattern instead of the actor pattern, which - to me at least - seems so much cleaner and so much harder to get wrong?

Ever since I started using Erlang it felt like I finally found 'the right way' when before then I did a lot of work with sockets and asynchronous worker threads. But even though it usually worked as advertised it had a large number of really nasty pitfalls which the actor model seemed to - effortlessy - step aside.

So I'm seriously wondering what the motivation was. I get why JS uses async, there isn't any other way there, by the time they added async it was too late to change the fundamentals of the language to such a degree. But rust was a clean slate.

[+] sunshowers|4 months ago|reply

Not a Rust designer, but a big motivation for Rust's async design was wanting it to work on embedded, meaning no malloc and no threads. This unfortunately precludes the vast majority of the design space here, from active futures as seen in JS/C#/Go to the actor model.

You can write code using the actor model with Tokio. But it's not natural to do so.

[+] the__alchemist|4 months ago|reply

Your application needs concurrency. So, the answer is... switch your entire application, code style, and libraries it uses into a separate domain that is borderline incompatible with normal one? And has its own dialects that have their own compatibility barriers? Doesn't make sense to me.

[+] raggi|4 months ago|reply

_an answer_ is performance - the necessity of creating copyable/copied messages for inter-actor communication everywhere in the program _can be_ expensive.

that said there are a lot of parts of a lot of programs where a fully inlined and shake optimized async state machine isn't so critical.

it's reasonable to want a mix, to use async which can be heavily compiler optimized for performance sensitive paths, and use higher level abstractions like actors, channels, single threaded tasks, etc for less sensitive areas.

[+] mdasen|4 months ago|reply

I'd recommend watching this video: https://www.infoq.com/presentations/rust-2019/; and reading this: https://tokio.rs/blog/2020-04-preemption

I'm not the right person to write a tl;dr, but here goes.

For actors, you're basically talking about green threads. Rust had a hard constraint that calls to C not have overhead and so green threads were out. C is going to expect an actual stack so you have to basically spin up a real stack from your green-thread stack, call the C function, then translate it back. I think Erlang also does some magic where it will move things to a separate thread pool so that the C FFI can block without blocking the rest of your Erlang actors.

Generally, async/await has lower overhead because it gets compiled down to a state machine and event loop. Languages like Go and Erlang are great, but Rust is a systems programming language looking for zero cost abstractions rather than just "it's fast."

To some extent, you can trade overhead for ease. Garbage collectors are easy, but they come with overhead compared to Rust's borrow checker method or malloc/free.

To an extent it's about tradeoffs and what you're trying to make. Erlang and Go were trying to build something different where different tradeoffs made sense.

EDIT: I'd also note that before Go introduced preemption, it too would have "pitfalls". If a goroutine didn't trigger a stack reallocation (like function calls that would make it grow the stack) or do something that would yield (like blocking IO), it could starve other goroutines. Now Go does preemption checks so that the scheduler can interrupt hot loops. I think Erlang works somewhat similarly to Rust in scheduling in that its actors have a certain budget, every function call decrements their budget, and when they run of of budget they have to yield back to the scheduler.

[+] the__alchemist|4 months ago|reply

I'm surprised learning this too. I know the hobby embedded, and HTTP-server OSS ecosystem have committed to Async, but I didn't expect Oxide would.

[+] oconnor663|4 months ago|reply

> FAQ: doesn’t future1 get cancelled?

I guess cancellation is really two different things, which usually happen at the ~same time, but not in this case: 1) the future stops getting polled, and 2) the future gets dropped. In this example the drop is delayed, and because the future is holding a guard,* the delay has side effects. So the future "has been cancelled" in the sense that it will never again make forward progress, but it "hasn't been cancelled yet" in the sense that it's still holding resources. I wonder if it's practical to say "make sure those two things always happen together"?

* Technically a Tokio-internal `Acquire` future that owns a queue position to get a guard, but it sounds like the exact same bug could manifest after it got the guard too, so let's call it a guard.

[+] mleonhard|4 months ago|reply

> // Start a background task that takes the lock and holds it for a few seconds.

Holding a lock while waiting for IO can destroy a system's performance. With async Rust, we can prevent this by making the MutexGuard !Send, so it cannot be held across an await. Specifically, because it is !Send, it cannot be stored in the Future [2], so it must be dropped immediately, freeing the lock. This also prevents Futurelock deadlock.

This is how I wrote safina::sync::Mutex [0]. I did try to make it Send, like Tokio's MutexGuard, but stopped when I realized that it would become very complicated or require unsafe.

> You could imagine an unfair Mutex that always woke up all waiters and let them race to grab the lock again. That would not suffer from risk of futurelock, but it would have the thundering herd problem plus all the liveness issues associated with unfair synchronization primitives.

Thundering herd is when clients overload servers. This simple Mutex has O(n^2) runtime: every task must acquire and release the mutex, which adds all waiting tasks to the scheduler queue. In practice, scheduling a task is very fast (~600ns). As long as polling the lock-mutex-future is fast and you have <500 waiting tasks, then the O(n^2) runtime is fine.

Performance is hard to predict. I wrote Safina using the simplest possible implementations and assumed they would be slow. Then I wrote some micro-benchmarks and found that some parts (like the async Mutex) actually outperform Tokio's complicated versions [1]. I spent days coding optimizations that did not improve performance (work stealing) or even reduced performance (thread affinity). Now I'm hesitant to believe assumptions and predictions about performance, even if they are based on profiling data.

[0] https://docs.rs/safina/latest/safina/sync/struct.MutexGuard....

[1] https://docs.rs/safina/latest/safina/index.html#benchmark

[2] Multi-threaded async executors require futures to be Send.

[+] yencabulator|4 months ago|reply

> With async Rust, we can prevent this by making the MutexGuard !Send,

If I understand this correctly...

That means it's never possible to lock two such mutexes at once. You can't await the second one while holding the first one.

Doesn't that make it impossible to express a bunch of good old CS algorithms?

[+] ufmace|4 months ago|reply

Considering this issue did also make me think - maybe the real footgun here is the async mutex. I think a better "rule" to avoid this issue might be something like, don't just use the tokio async mutex by default just because it's there and you're in an async function; instead default to a sync mutex that errors when held across awaits and think very hard about what you're really doing before you switch to the async one.

[+] dvratil|4 months ago|reply

I would guess this is just to make the explanation of the bug easier.

In real world, the futurelock could occur even with very short locks, it just wouldn't be so deterministic. Having a minimal reproducer that you have to run a thousand times and it will maybe futurelock doesn't really make for a good example :)

[+] yxhuvud|4 months ago|reply

Work stealing is more a technique to function better when architecture is pessimal (think mixing slow and fast tasks in one queue) than something that make things go faster in general. It also tend to shuffle around complexity a bit, in ways that are sometimes nice.

Same thing with task preemption, though that one has less organisatorial impact.

In general, getting something to perform well enough on specific tasks is a lot easier than performing well enough on tasks in general. At the same time, most tasks have kinda specific needs when you start looking at them..

[+] crabmusket|4 months ago|reply

This feels like the sort of thing that has led to the development of deterministic simulation testing (DST) techniques as pioneered by FoundationDB and TigerBeetle.

https://notes.eatonphil.com/2024-08-20-deterministic-simulat...

I hope something like this becomes popular in the Rust/Tokio space. It seems like Turmoil is that?

https://tokio.rs/blog/2023-01-03-announcing-turmoil

[+] forrestthewoods|4 months ago|reply

I feel like I’m pretty good at writing multithreaded code. I’ve done it a lot. As long as you use primitives like Rust Mutex that enforce correctness for data access (ie no accessing data without the lock) it’s pretty simple. Define a clean boundary API and you’re off to the races.

async code is so so so much more complex. It’s so hard to read and rationalize. I could not follow this post. I tried. But it’s just a full extra order of complexity.

Which is a shame because async code is supposed to make code simpler! But I’m increasingly unconfident that’s true.

[+] foota|4 months ago|reply

Async code isn't supposed to be simpler than sync code, it's supposed to be simpler than doing thing like continuation passing.

[+] amelius|4 months ago|reply

Async code is simpler because you're implicitly holding a lock on the CPU. That's also why you should stay away from it: it increases latency. Especially since Rust is about speed and responsiveness. In general, async programming in Rust makes little sense.

[+] orthecreedence|4 months ago|reply

Great read, and the example code makes sense. This stuff can be a nightmare to find, but once you do it's like a giant 1000 piece puzzle just clicks together instantly.

[+] bcantrill|4 months ago|reply

Indeed. One of the interesting side effects of being a remote company that records everything[0] is that we have the instant where the "1000 piece puzzle just clicks together" recorded, and it's honestly pretty wild. In this case, it was very much a shared brainstorming between four engineers (Eliza, Sean, John and Dave) -- and there is almost a passing of the baton where they start to imagine the kind of scenario that could induce this and then realize that those are exactly the conditions that exist in the software.

We are (on brand?) going to do a podcast episode on this on Monday[1]; ahead of that conversation I'm going to get a clip of that video out, just because it's interesting to see the team work together to debug it.

[0] https://rfd.shared.oxide.computer/rfd/0537

[1] https://discord.gg/QrcKGTTPrF?event=1433923627988029462

[+] quietbritishjim|4 months ago|reply

Wow, it is simply outrageous that Rust doesn't just allow all active tasks to make progress. It creates a whole class of incomprehensible bugs, like this one, for no reason. Can any Rust experts explain why it's done this way? It seems like an unforced error.

In Python, I often use the Trio library, which offers "structured, concurrency": tasks are (only) spawned into lexical scopes, and they are all completed (waited for) before that scope is left. That includes waiting for any cancelled tasks (which are allowed to do useful async work, including waiting for any of their own task scopes to complete).

Could Rust do something like that? It's far easier to reason about than traditional async programs, which seems up Rust's street. As a bonus it seems to solve this problem, since a Rust equivalent would presumably have all tasks implicitly polled by their owning scope.

[+] duped|4 months ago|reply

So there's a distinction between a task and a future. A future doesn't do anything until it's polled, and since there's nothing special about async runtimes (it's just user level code), it's always possible to create futures and never poll them, or stop polling them.

A task is a different construct and usually tied to the runtime. If you look at the suggestions in the RFD they call out using a task explicitly instead of polling a future in place.

There's some debate to be had over what constitutes "cancellation." The article and most colloquial definitions I've heard define it as a future being dropped before being polled to completion. Which is very clean - if you want to cancel a future, just drop it. Since Rust strongly encourages RAII, cleanup can go in drop implementations.

A much tougher definition of cancellation is "the future is never polled again" which is what the article hits on. The future isn't dropped but its poll is also unreachable, hence the deadlock.

[+] wrs|4 months ago|reply

It's hard to answer the question because of unclear terminology (tasks vs. futures). There are ways to do structured concurrency in Rust, but they are for tasks, not futures. There's not really a concept of "active futures" (other than calling an "active future" one that returned Pending the last time you polled it).

A task is the thing that drives progress by polling some futures. But one of those futures may want to handle polling for other futures that it made, which is where this arises.

As the article says, one option is to spawn everything as a task, but that doesn't solve all problems, and precludes some useful ways of using futures.

[+] jcalvinowens|4 months ago|reply

I have very little Rust experience... but I'm hung up on this:

> The lock is given to future1

> future1 cannot run (and therefore cannot drop the Mutex) until the task starts running it.

This seems like a contradiction to me. How can future1 acquire the Mutex in the first place, if it cannot run? The word "given" is really odd to me.

Why would do_async_thing() not immediately run the prints, return, and drop the lock after acquiring it? Why does future1 need to be "polled" for that to happen? I get that due to the select! behavior, the result of future1 is not consumed, but I don't understand how that prevents it from releasing the mutex.

It's more typical in my experience that the act of granting the lock to a thread is what makes it runnable, and it runs right then. Having to take some explicit second action to make that happen seems fundamentally broken to me...

EDIT: Rephrased for clarity.

[+] dvt|4 months ago|reply

> &mut future1 is dropped, but this is just a reference and so has no effect. Importantly, the future itself (future1) is not dropped.

There's a lot of talk about Rust's await implementation, but I don't really think that's the issue here. After all, Rust doesn't guarantee convergence. Tokio, on the other hand (being a library that handles multi-threading), should (at least when using its own constructs, e.g. the `select!` macro).

So, since the crux of the problem is the `tokio::select!` macro, it seems like a pretty clear tokio bug. Side note, I never looked at it before, but the macro[1] is absolutely hideous.

[1] https://docs.rs/tokio/1.34.0/src/tokio/macros/select.rs.html

[+] oconnor663|4 months ago|reply

There's nothing `select!` could do here to force `future1` to drop, because it doesn't receive ownership of `future1`. If we wanted to force this, we'd have to forbid `select!` from polling futures by reference, but that's a pretty fundamental capability that we often rely on to `select!` in a loop for example. The blanket `impl<F> Future for &mut F where F: Future ...` isn't a Tokio thing either; that's in the standard library.

[+] mycoliza|4 months ago|reply

What could `tokio::select!` do differently here to prevent bugs like this?

In the case of `select!`, it is a direct consequence of the ability to poll a `&mut` reference to a future in a `select!` arm, where the future is not dropped should another future win the "race" of the select. This is not really a choice Tokio made when designing `select!`, but is instead due to the existence of implementations of `Future` for `&mut T: Future + Unpin`[1] and `Pin<T: Future>`[2] in the standard library.

Tokio's `select!` macro cannot easily stop the user from doing this, and, furthermore, the fact that you can do this is useful --- there are many legitimate reasons you might want to continue polling a future if another branch of the select completes first. It's desirable to be able to express the idea that we want to continually poll drive one asynchronous operation to completion while periodically checking if some other thing has happened and taking action based on that, and then continue driving forward the ongoing operation. That was precisely what the code in which we found the bug was doing, and it is a pretty reasonable thing to want to do; a version of the `select!` macro which disallows that would limit its usefulness. The issue arises specifically from the fact that the `&mut future` has been polled to a state in which it has acquired, but not released, a shared lock or lock-like resource, and then another arm of the `select!` completes first and the body of that branch runs async code that also awaits that shared resource.

If you can think of an API change which Tokio could make that would solve this problem, I'd love to hear it. But, having spent some time trying to think of one myself, I'm not sure how it would be done without limiting the ability to express code that one might reasonably want to be able to write, and without making fundamental changes to the design of Rust async as a whole.

[1] https://doc.rust-lang.org/stable/std/future/trait.Future.htm... [2]: https://doc.rust-lang.org/stable/std/future/trait.Future.htm...

[+] Sytten|4 months ago|reply

I am wondering if there is a larger RFC for Rust to force users to not hold a variable across await points.

In my mind futurelock is similar to keeping a sync lock across an await point. We have nothing right now to force a drop and I think the solution to that problem would help here.

[+] ameliaquining|4 months ago|reply

There's an existing lint that lets you prohibit instances of specific types from being held across await points: https://rust-lang.github.io/rust-clippy/stable/index.html#aw...

[+] amluto|4 months ago|reply

I’m not convinced that this can help in a meaningful way.

Fundamentally, if you have two coroutines (or cooperatively scheduled threads or whatever), and one of them holds a lock, and the other one is awaiting the lock, and you don’t schedule the first one, you’re stuck.

I wonder if there’s a form of structured concurrency that would help. If I create two futures and start both of them (in Rust this means polling each one once) but do not continue to poll both, then I’m sort of making a mistake.

So imagine a world where, to poll a future at all, I need to have a nursery, and the nursery is passed in from my task and down the call stack. When I create a future, I can pass in my nursery, but that future then gets an exclusive reference to my future until it’s complete or cancelled. If I want to create more than one future that are live concurrently, I need to create a FutureGroup (that gets an exclusive reference to my nursery) and that allows me to create multiple sub-nurseries that can be used to make futures but cannot be used to poll them — instead I poll the FutureGroup.

(I have yet to try using an async/await system or a reactor or anything of the sort that is not very easy to screw up. My current pet peeve is this pattern:

    data = await thingy.read()

What if thingy.read() succeeds but I am cancelled? This gets nasty is most programming languages. Python: the docs on when I can get cancelled are almost nonexistent, and it’s not obviously possible to catch the CancelledError such that I still have data and can therefore save it somewhere so it’s not lost. Rust: what if thingy thinks it has returned the data but I’m never polled again? Maybe this can’t happen if I’m careful, but that requires more thought than I’m really happy with.)

[+] sunshowers|4 months ago|reply

Note that forcing a drop of a lock guard has its own issues, particularly around leaving the guarded data in an invalid state. I cover this a bit in my talk that Bryan linked to in the OP [1].

[1] timestamped: https://youtu.be/zrv5Cy1R7r4?t=1067

[+] mechanical_berk|4 months ago|reply

I agree. It seems like this bug arises because one Future is awaited while another is ignored. I have seen this sort of bug a lot.

So maybe all that is needed is a lint that warns if you keep a Future (or a reference to one) across an await point? The Future you are awaiting wouldn't count of course. Is there some case where this doesn't work?

[+] cogman10|4 months ago|reply

The ideas that have been batted around is called "async drop" [1]

And it looks like it's still just an unaddressed well known problem [2].

Honestly, once the Mozilla sackening of rust devs happened it seems like the language has been practically rudderless. The RFC system seems almost dead as a lot of the main contributors are no longer working on rust.

This initiative hasn't had motion since 2021. [3]

[1] https://rust-lang.github.io/async-fundamentals-initiative/ro...

[2] https://rust-lang.github.io/async-fundamentals-initiative/

[3] https://github.com/rust-lang/async-fundamentals-initiative

[+] arjie|4 months ago|reply

Wow, that makes sense afterwards but I would not have guessed at it immediately looking at the code. Very insidious. Great blogpost.

[+] mpeklar|4 months ago|reply

When considering this issue alongside with RFD 397, it seems to me that the problem is actually using future drops as an implicit (!) cancellation signal. This makes drop handlers responsible for handling every cancellation-related task, which they are not very good at. If a future is not immediately dropped after selecting on it, you get futurelock, and if it is, you get an async cancellation correctness problem, where the only way to try and interact with the cancellation execution flow is to use drop handlers (maybe in the form of scope guards).

Sadly, the only solution I know of is to use an explicit cancellation signal, and to modify ~everything to work with it. In that world, almost all async functions would need to accept a cancellation parameter of some sort, like a Go Context or like the tokio-utils CancellationToken, and explicitly check it every time they await a function. The new select!-equivalent would need to signal cancellations and then keep polling all unfinished cancellation-aware futures in a loop until they finished, and maybe immediately drop all non-aware futures to prevent futurelock. The entire Tokio API would need to be wrapped to take into account cancellation tokens, as well as any other async library you would want to use.

A lot of work, and you would need to do something if cancel-aware futures get dropped anyway. What a mess.

[+] ajross|4 months ago|reply

Trying to get my head around this. It seems like the "rootest" cause here is a paradigm clash between lock fairness and async futures.

A fair lock[1] is designed to wake up the longest-waiting task, since it got to the queue first and might otherwise be starved if the algorithm doesn't guarantee it gets to the head of the queue.

BUT CRITICALLY: a future isn't a task. It's not a thread, it's not guaranteed to "run". It's just a flag that gets set somewhere. But it can consume that wakeup nonetheless. So it's possible to "wake up"[2] a future that isn't actually being polled, and won't be, until something else that is waiting on the resource that just tried to wake it up.

I don't see that these concepts are ever going to work together. You can't have locks generating wakeup events that aren't consumed. If you're going to use them with async, you need to do something like a broadcast to guarantee that every waiter sees an event.

Stated differently: the lock is signalling an edge-triggered interrupt, but rust async demands level sensititivity.

[1] In one sense of fair. There are others, like "switch now" vs. "defer context switch", but that's not relevant here.

[2] Which doesn't actually wake anything up, thus the bug.

[+] qouteall|4 months ago|reply

Simplify: tokio::select! will discard other futures when one future progress.

The discarded futures will never be run again.

Normally when a future is discarded it's dropped. When a future holding lock is dropped, lock is released, but it's passing future borrow to select so the discarded future is not dropped while holding lock.

So it leaves a future that holds a lock that will never run again.

[+] octoberfranklin|4 months ago|reply

For anybody who wants to cut to the chase, it's this:

> The behavior of tokio::select! is to poll all branches' futures only until one of them returns `Ready`. At that point, it drops the other branches' futures and only runs the body of the branch that’s ready.

This is, unfortunately, doing what it's supposed to do: acting as a footgun.

The design of tokio::select!() implicitly assumes it can cancel tasks cleanly by simply dropping them. We learned the hard way back in the Java days that you cannot kill threads cleanly all the time. Unsurprisingly, the same thing is true for async tasks. But I guess every generation of programmers has to re-learn this lesson. Because, you know, actually learning from history would be too easy.

Unfortunately there are a bunch of footguns in tokio (and async-std too). The state-machine transformation inside rustc is a thing of beauty, but the libraries and APIs layered on top of that should have been iterated many more times before being rolled out into widespread use.

[+] tick_tock_tick|4 months ago|reply

It seems more and more clear every day that async was rushed out the door way to quickly in Rust.

245 comments