If I'm understanding the suggestion, the proposed python virtual threads are ~= fibers ~= stackful coroutines.
I have this paper saved in my bookmarks as "fibers bad":
https://www.open-std.org/JTC1/SC22/WG21/docs/papers/2018/p13...
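For anyone who hasn't played with the stackful flavor in Python, here's a minimal sketch using the third-party greenlet package (the primitive gevent builds on) -- nothing from any proposal, just to show what a switch without an await keyword looks like:

    from greenlet import greenlet

    def worker():
        print("worker: started")
        # Suspends this whole call stack, however deep we are, and resumes main.
        main_gr.switch("partial result")
        print("worker: resumed right where it left off")
        main_gr.switch("final result")

    main_gr = greenlet.getcurrent()
    worker_gr = greenlet(worker)

    print("main got:", worker_gr.switch())  # runs worker until its first switch
    print("main got:", worker_gr.switch())  # resumes worker after that switch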
AFAIK async/await and stackless coroutines are the most efficient way to do async operations, even if they are unwieldy and complicated. Is there something to be gained here other than usability?
Python is certainly in the business of trading efficiency and optimal solutions for readability and some notion of simplicity, and IMO that has held it back (along with all the programmers who overindex on the pythonic way; it's incredibly sad that essentially all of modern ML is built on python), but the language is certainly easy to write.
[EDIT] - wanted to put here: people spend a lot of characters complaining about tokio in Rust land, but I honestly think it's just fine. It was REALLY rough early on, but at this point the ergonomics are easy to pick up and it's quite performant out of the box. It's not perfect, but it's really quite pleasing to use and understand (i.e. running into a bug/surprising behavior almost always ends in understanding more about the fundamental tradeoffs and system design of async systems).
Swift doing something similar seems to be an endorsement of the approach. In fact, IIRC this might be where I saw that first paper? Maybe it was a HN comment that pointed to it:
https://forums.swift.org/t/why-stackless-async-await-for-swi...
Rust and Swift are the most impressive modern languages IMO, the sheer amount of lessons they've taken from previous generations of PL is encouraging.
That paper is specifically for C++ and even there not everybody agrees (there is still a proposal to add stackful coroutines). One claim in the paper is that stackful coroutines have a higher stack switching cost. I disagree, but also this is completely irrelevant in python, where spending a couple of additional nanoseconds is entirely hidden by the inefficiency of the interpreter.
It is true that stackless coroutines are more memory efficient when you are running millions of lightweight tasks. But you won't be running millions of tasks (or even hundreds of thousands) in python, so it is a non-issue.
There is really no reason for python to have chosen the async/await model.
> Python is certainly in the business of trading efficiency and optimal solutions for readability and some notion of simplicity, and that has held it back
Sometimes a simple, suboptimal solution is faster in clock time than an optimal one. You have to write the code and then execute it after all.
As for why ML is dominated by python, I feel like it just gets to the point quicker than other languages. In Java or typescript or even rust there are just too many other things going on. I feel like writing even a basic training loop in Java would be a nightmare.
I do wonder how much hand wringing there is around tools like uv. The perf of Rust with the usability of python if everyone just ignores the binary blobs and FFI involved.
I personally think uv is a huge bright spot in the python ecosystem and think others will follow suit (i.e. other langs embedding rust in tooling for speedups). Wonder what the rest of the ecosystem thinks.
Maybe the uv approach is really the best of both worlds — python's dx with rust's rigor.
Totally agree, the FFI escape hatch and excellent tooling from rust (maturin, pyo3, etc.) mean so many python problems can just be solved with rust. Which raises the question: has anyone tried doing a greenthread implementation in rust? Maybe offload some of the dynamically eval'd python code to a separate process, maybe with https://github.com/RustPython/RustPython
Is there really something to lose? How often do we see "stackless coroutine" listed as an advantage in Rust vs Go network programming flamewars?
Rust vs Go is IMO not even a reasonable discussion to have -- you really have to be careful about the axes on which you compare them for the thought exercise to make any sense.
If you want a backend language that developers can pick up quickly, be productive in relatively fast, and still write low latency applications with, Go is an easy winner.
If you want a systems language, then Rust is the only choice between those two. Rust is harder to pick up, but produces more correct and faster code, with obviously a much stronger type system.
They could be directly comparable, but usually only if your goal is as abstract as "I need a language to write my backend in". But you can see how that question is a very difficult one to pull meaning from generally.
> How often do we see "stackless coroutine" listed as an advantage in Rust vs Go network programming flamewars?
I'm just going to go out on a limb and say network programming is bar none more efficient in Rust. I could see C/C++ beating out Rust in that domain, but I would not expect Go to do so, especially if you're reaching for unsafe rust.
Again, it also depends on what you mean by "network programming", and whether development speed/ecosystem/etc are a concern.
To avoid the flamewars and stay on topic, it's one of those things that is just a fundamentally limiting choice -- most will never hit the limit (most apps really aren't going to do more than 1000RPS of meaningful requests), but if you do actually need the perf it's quite disappointing.
People are very annoyed with Rust for not having a blessed async solution in-tree, and while it's produced a ton of churn I think it was ultimately beneficial for reasons like this. You can do either one of these in Rust, the choice isn't made for you.
That said, the OP's suggestion seems to be adding virtual threads on top rather than swapping out asyncio for virtual threads, so maybe there's a world where people just use what they want, and python can interop the two as necessary.
Good points.
Personally I'm more annoyed by async-Rust itself than by the lack of a blessed async solution in-tree. Having to just Arc<T> away things here and there because you can't do thread::scope(f) honestly just demonstrates how unreasonably hard stackless coroutines are for everyone.
Back to the original topic, I bring this up because I believe the performance advantages claimed in these "fibers bad" papers are superficial, and the limit is almost the same (think 1.00 vs 1.02 level), even in languages that treat raw performance as a selling point. If you need the absolute lowest overhead and latency, you usually want the timing to be as deterministic as possible too, and that's not a given in async-await solutions either; you still need to be very careful about it.
Let alone Python.
> Personally I'm more annoyed by async-Rust itself than by the lack of a blessed async solution in-tree. Having to just Arc<T> away things here and there because you can't do thread::scope(f) honestly just demonstrates how unreasonably hard stackless coroutines are for everyone.
Yeah, as annoying as this is, I think it actually played out to benefit Rust -- imagine if the churn that we saw in tokio/async-std/smol/etc had played out in tree? I think things might have been even worse.
That said, stackless coroutines are certainly unreasonably hard.
> Back to the original topic, I bring this up because I believe the performance advantages claimed in these "fibers bad" papers are superficial, and the limit is almost the same (think 1.00 vs 1.02 level), even in languages that treat raw performance as a selling point. If you need the absolute lowest overhead and latency, you usually want the timing to be as deterministic as possible too, and that's not a given in async-await solutions either; you still need to be very careful about it.
Yeah, I don't think this is incorrect, and I'd love to see some numbers on it. The only thing that I can say definitively is that there is overhead to doing the literal stack switch. There's a reason async I/O got us past the C10k problem so handily.
One of the nice things about some recent Zig work was how clearly you can see how they do their stack switch -- you can literally jump into the Zig source code (on a branch IIRC) and just read the ASM for various platforms that implements a user space context switch.
Agree with the deterministic timing thing too -- this is one of the big points argued by people who only want to use threads (and are against tokio/etc): the pure control and single-mindedness of a core against a problem is clearly simple and performant. Thread per core is still the top for performance, but IMO the ultimate is an async runtime thread per core, because some (important) problems are embarrassingly concurrent.
> Let alone Python.
Yeah, I'm really trying not to comment much on Python because I'm out of my depth and I think there are...
I mean, I'm of the opinion that JS (really TS) is the better scripting language (better bolt-on type systems, got threads faster, never had a GIL, lucked into being async-forward and getting all its users used to async behavior), but obviously Python is a powerhouse and a crucially important ecosystem (excluding the AI hype).
> The only thing that I can say definitively is that there is overhead to doing the literal stack switch. There's a reason async I/O got us past the C10k problem so handily.
You can also say that not having to constantly allocate & deallocate stuff, and relying on a bump allocator (the stack) most of the time, more than compensates for the stack switch overhead. Depends on workload of course :p
IMO it's more about memory, and nowadays it might just be path dependence. Back in C10k days address spaces were 32-bit (ok, 31-bit really), and 2**31 / 10k ~= 210KiB per connection. That makes static-ish stack management really messy, so you really need to extract the (minimal) state explicitly and pack it on the heap.
Now we happily run ASAN which allocates 1TiB (2**40) address space during startup for a bitmap of the entire AS (2**48) and nobody complains.
Great article, the previous related one goes into a lot more detail on some of python's different concurrency implementation details: https://lucumr.pocoo.org/2024/11/18/threads-beat-async-await...
Something which would be immensely helpful for the community is to create a test suite of pathological problems with existing python concurrency patterns and libraries.
At that point it should just be a matter of time before the right implementation and PEP(s) can be iterated on that solve said problems while maximizing for devex.
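As a sketch of the kind of pathological case I mean (purely illustrative, not from any existing suite): a single accidentally-blocking call in asyncio stalls every other task on the loop, and a test suite could assert on exactly that.

    import asyncio, time

    async def accidental_blocker():
        time.sleep(1)             # blocking call -- the whole event loop stalls here
        # await asyncio.sleep(1)  # the non-blocking version would behave fine

    async def heartbeat():
        for _ in range(5):
            print("tick", round(time.monotonic(), 2))
            await asyncio.sleep(0.1)

    async def main():
        # The ticks stop for a full second while the blocker holds the loop.
        await asyncio.gather(heartbeat(), accidental_blocker())

    asyncio.run(main())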
For anyone interested in learning more about different concurrency models, I can highly recommend Paul Butcher's Seven Concurrency Models in Seven Weeks.
> One key part of how async/await works in Python is that nothing really happens until you call await. You’re guaranteed not to be suspended. Unfortunately, recent changes with free-threading make that guarantee rather pointless. Because you still need to write code to be aware of other threads, and so now we have the complexity of both the async ecosystem and the threading system at all times.
I don’t understand how free-threading changes things here. In a multithreaded Python program without free-threading you might be preempted at any point before calling await at the granularity of the GIL.
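To make the comparison concrete (an illustrative sketch, nothing from the article): the await guarantee only ever covered the asyncio side, while the threaded version has needed a lock all along, GIL or not:

    import asyncio
    import threading

    balance = 100

    async def withdraw_async(amount):
        global balance
        if balance >= amount:       # no await between the check and the update,
            balance -= amount       # so no other asyncio task can interleave here

    lock = threading.Lock()

    def withdraw_threaded(amount):
        global balance
        with lock:                  # needed under the GIL and under free-threading alike
            if balance >= amount:
                balance -= amount

    asyncio.run(withdraw_async(30))
    t = threading.Thread(target=withdraw_threaded, args=(30,))
    t.start(); t.join()
    print(balance)                  # 40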
You are right. It's just that multithreading before free-threading was broken and slow enough that very few people used it for high-traffic production things. And people decided to write it like NodeJS to further handwave the problem.
There are real advantages to the async/await pattern, eloquently expressed in Glyph's Unyielding essay: https://glyph.twistedmatrix.com/2014/02/unyielding.html
Python is a multi-paradigm language so I'm not against making virtual threads (aka green threads) built in. But I don't think they really address the problems that the async/await approach solves. It's really hard to write multi-threaded code correctly without something like Rust's type system.
Love this article. I consider the async/await function coloring to be a lot like aspect oriented programming of the 90s. It requires a bunch of extra compiler features, it's supposed to make a class of problems easier, and it has one or two poster child applications that always get showcased. I cross my fingers that, like AOP, after a bunch of effort/hype is spent on it, it will go away.
On another note, I work in all of Python, embedded C with an RTOS, Swift, Kotlin, and Elixir. I can't say enough how wonderfully simple reasoning about these concurrent kinds of examples is in Elixir. It's like a whole new world (yeah, sing it). My close second favorite was actually GCD for Apple back in the day.
But it's actually useful for scalability.
All modern IO is async. The fact that you get this easy to use API to make an app more scalable is a ridiculously good deal.
Now, if you don't really need that, then it just looks like pointless extra work.
After several years of dealing with concurrency I have come to the same conclusion.
Cooperative multitasking is not worth the pain of manually selecting where your IO hot spots are. It is too easy to block your entire program and not easy to debug where it is getting blocked.
IMO preemptively scheduled tasks like Goroutines and Java virtual threads are the future. As a bonus they play well with parallelism too, because you are already doing all the synchronization.
What do you mean by "IO hot spots"? You really mean IO or rather CPU bound work? Because for IO, the standard library stuff could always yield properly (like Go does).
If you mean CPU-bound-like work, then that's true. But is the Go model really the solution? I don't know. It basically replicates the OS model, where preemption is required. But in a single application, the developer should be able to coordinate...? Maybe the cooperative model just needs more tooling, like clear rules: the main coroutine shouldn't execute for longer than 1 millisecond (or better, a certain number of instructions) between yields. In debug mode, you could then get the exact stack traces where "stuttering" happens. You could then insert yield points, or better, just offload CPU-bound work by spawning a new coroutine on a different OS thread and awaiting it cooperatively.
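FWIW, asyncio already ships a rough version of that rule: in debug mode the loop logs any callback or task step that runs longer than a configurable threshold (it names the offender, though not a full stack trace). A minimal sketch:

    import asyncio, time

    async def stutters():
        time.sleep(0.05)            # simulated accidental blocking/CPU work

    async def main():
        # the "1 ms rule": flag anything that holds the loop longer than this
        asyncio.get_running_loop().slow_callback_duration = 0.001
        await stutters()            # this step should get flagged as slow in the debug log

    asyncio.run(main(), debug=True)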
Both basically. Another beauty of M:N green threads is that you don't need to differentiate between IO and CPU bound tasks.
If you add all those rules to coroutines you are halfway to preemptive scheduling already. And maybe async is not the answer if it still needs more work and tooling after all these years.
No, I meant cooperative green threads, not the stackless async/await model. My model would basically mean: no "function coloring", all functions can be called as usual. IO-related functions will automatically yield, no problem. All CPU-bound work either needs manual yield points (not good, I agree) or should be offloaded to a coroutine on a different thread and then awaited (yes, with an await keyword) cooperatively. If you want to invoke a click handler for a UI, you can launch a coroutine on the same thread (cooperative).
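FWIW, the "offload CPU-bound work to another thread and await it cooperatively" half of that model already exists in today's asyncio (the function-coloring problem aside), something like:

    import asyncio

    def crunch(n):                  # an ordinary blocking/CPU-bound function
        return sum(i * i for i in range(n))

    async def main():
        # Runs crunch() in a worker thread; the event loop stays responsive and this
        # coroutine resumes cooperatively once the result is ready. (On GIL builds this
        # helps more with blocking calls than with pure-Python CPU work.)
        result = await asyncio.to_thread(crunch, 10_000_000)
        print(result)

    asyncio.run(main())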
Go must do all sorts of trickery with the preemption, like inserting yield points or even depending on signals so it can preempt a goroutine which didn't hit a yield point. It basically replicates what the OS does with threads.
So basically like gevent. I agree that is a very good concurrency model. Much better than the current asyncio mess we have.
But if you already have a runtime I don't know why it would be a big deal to make them N:M threads as well. Makes managing CPU bound tasks easy as well as IO bound tasks.
Well I see 2 cases for automatic preemption:
- You are lazy and just don't care, let the runtime do it
- Or you failed to realize that what you do could block
The first case is what annoys me. I think the developer should handle obvious cases manually. The second case would be considered a bug in my model, and the language should help you with that as explained earlier. If that works out (I mean, if mistakes in the second case can be ruled out), then I think this model is superior and more performant.
You could make the same argument for GC, yet, outside of systems languages, it is generally considered a net positive. The reality is that in a large application it is not easy to find all the right preemption points.
I guess you are right. Maybe I see it too much from a system programming language perspective. After all, Go is of a different kind.
I'm still using gevent, it still solves a bunch of problems nicely for me - yeah, the monkey patching is kinda ugly, but yolo.
Same for me. It makes concurrency in Python as easy as Golang.
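For anyone who hasn't seen the gevent style, the whole model is roughly "patch the blocking stdlib, then write plain synchronous-looking code" -- a small sketch with placeholder URLs:

    from gevent import monkey
    monkey.patch_all()                   # must run before other imports that touch sockets

    import gevent
    from urllib.request import urlopen

    def fetch(url):                      # no async/await, no special I/O APIs
        return len(urlopen(url).read())  # yields to other greenlets while waiting

    urls = ["https://example.com"] * 5   # placeholder URLs
    jobs = [gevent.spawn(fetch, u) for u in urls]
    gevent.joinall(jobs, timeout=10)
    print([job.value for job in jobs])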
Java had its world changing moment when Ron Pressler gave this talk at Curry On: https://www.youtube.com/watch?v=449j7oKQVkc&t=1s.
I'm no expert, but I wonder why goroutine-style concurrency isn't more widespread.
I think the keyword is runtime - there must be some higher logic running above your own code that manages these things. Which is what Go is doing, and why "hello world" cannot produce a binary that is only a few bytes in size. If other languages wanted to provide the same support, they would likely have to refactor a chunk of code and maybe change things too much to be worth implementing. Go had this from the get-go, so it is of no concern there. Also, GC likely plays some role as well.