top | item 42270378

How much memory do you need in 2024 to run 1M concurrent tasks?

264 points| neonsunset | 1 year ago |hez2010.github.io

196 comments

[+] AkshitGarg|1 year ago|reply
I feel this benchmark compares apples to oranges in some cases.

For example, for node, the author puts a million promises into the runtime event loop and uses `Promise.all` to wait for them all.

This is very different from, say, the Go version where the author creates a million goroutines and puts `waitgroup.Done` as a defer call.

While this might be the idiomatic way of doing concurrency in the respective languages, it does not account for how goroutines are fundamentally different from promises, and how the runtimes do things differently. For JS, there's a single event loop. Counting the JS execution thread, the event loop thread and whatever else the runtime uses for async I/O, the execution model is fundamentally different from Go's. Go spawns an OS thread for every logical CPU your machine has (that's the default `GOMAXPROCS`), and then uses a userspace scheduler to distribute goroutines across those threads. It may spawn more OS threads to account for threads sleeping in syscalls, although I don't think the runtime will spawn extra threads in this case.

It also depends on what the "concurrent tasks" (I know, concurrency != parallelism) are. Tasks such as reading a file or doing a network call are better done with something like promises, but CPU-bound tasks are better done with goroutines or Node worker_threads. It would be interesting to see how the memory usage changes when doing async I/O vs CPU-bound tasks concurrently in different languages.
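One way to start exploring that question in Go is to read the runtime's own memory statistics around a batch of parked goroutines. A minimal sketch (the `measure` helper, the channel-park as a stand-in for a "waiting task", and the 100k count are all my assumptions, not from the article):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// measure spawns n goroutines that just park on a channel (a stand-in
// for "waiting tasks") and reports how much the runtime's OS memory
// footprint grew, per goroutine.
func measure(n int) uint64 {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)

	var wg sync.WaitGroup
	release := make(chan struct{})
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-release // park until released, like a sleeping task
		}()
	}
	time.Sleep(100 * time.Millisecond) // let the goroutines get going
	runtime.ReadMemStats(&after)
	close(release)
	wg.Wait()
	return (after.Sys - before.Sys) / uint64(n)
}

func main() {
	fmt.Printf("~%d bytes per parked goroutine\n", measure(100000))
}
```

Swapping the channel park for real I/O or a CPU-bound loop in the goroutine body would give the comparison the comment asks for.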

[+] n2d4|1 year ago|reply
Actually, I think this benchmark did the right thing, that I wish more benchmarks would do. I'm much less interested in what the differences between compilers are than in what the actual output will be if I ask a professional Go or Node.js dev to solve the same task. (TBF, it would've been better if the task benchmarked was something useful, eg. handling an HTTP request.)

Go heavily encourages a certain kind of programming; JavaScript heavily encourages a different kind; and the article does a great job at showing what the consequences are.

[+] SPascareli13|1 year ago|reply
As far as I know there is no way to do Promise-like async in Go; you HAVE to create a goroutine for each concurrent async task. If this is really the case then I believe the submission is valid.

But I do think that spawning a goroutine just to do a non-blocking task and get its return is kinda wasteful.
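For what it's worth, the closest thing to a promise in Go is a one-shot channel, and it still costs a goroutine underneath, which supports the point above. A minimal sketch (the `Async` helper is hypothetical, not a standard API):

```go
package main

import "fmt"

// Async runs f in a goroutine and returns a one-shot "future":
// a buffered channel that will receive f's result exactly once.
func Async[T any](f func() T) <-chan T {
	ch := make(chan T, 1)
	go func() { ch <- f() }()
	return ch
}

func main() {
	fut := Async(func() int { return 6 * 7 })
	// ...do other work here, then "await" the result:
	fmt.Println(<-fut) // prints 42
}
```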

[+] threeseed|1 year ago|reply
The requirement is to run 1 million concurrent tasks.

Of course each language will have a different way of achieving this task each of which will have their unique pros/cons. That's why we have these different languages to begin with.

[+] gleenn|1 year ago|reply
Also, for Java, Virtual Threads are a very new feature (Java 21 IIRC or somewhere around there). OS threads have been around for decades. As a heavy JVM user it would have been nice to actually see those both broken out to compare as well!
[+] xargon7|1 year ago|reply
There's a difference between "running a task that waits for 10 seconds" and "scheduling a wakeup in 10 seconds".

The code for several of the low-memory-usage languages does the second, while the high-memory-usage results do the first. For example, on my machine the article's Go code uses 2.5 GB of memory, but the following code uses only 124 MB. That difference is in line with the Rust results.

  package main
  
  import (
    "os"
    "strconv"
    "sync"
    "time"
  )
  
  func main() {
    numRoutines, _ := strconv.Atoi(os.Args[1])
    var wg sync.WaitGroup
    for i := 0; i < numRoutines; i++ {
      wg.Add(1)
      time.AfterFunc(10*time.Second, wg.Done)
    }
    wg.Wait()
  }
[+] mrighele|1 year ago|reply
I agree with you. Even something as simple as a loop like (pseudocode)

    for (n = 0; n < 10; n++) { sleep(1 second); }

Changes the results quite a bit: for some reason Java uses a _lot_ more memory and takes longer (~20 seconds), C# uses more than 1 GB of memory, while Python struggles just scheduling all those tasks and takes more than a minute (besides using more memory). node.js seems unfazed by this change.

I think this would be a more reasonable benchmark

[+] neonsunset|1 year ago|reply
Spawning a periodically waking up Task in .NET (say every 250ms) that performs work like sending out a network request would retain comparable memory usage (in terms of async overhead itself).

Even at 100k tasks the bottleneck is going to be the network stack (sending outgoing 400k RPS takes a lot of CPU and syscall overhead, even with SocketAsyncEngine!).

Doing so in Go would require either spawning Goroutines, or performing scheduling by hand or through some form of aggregation over channel readers. Something that Tasks make immediately available.

The concurrency primitive overhead becomes more important if you want to quickly interleave multiple operations at once. In .NET you simply do not await them at callsite until you need their result later - this post showcases how low the overhead of doing so is.

[+] piterrro|1 year ago|reply
I don't know what's a fair way to do this for all languages listed in the benchmark, but for Go vs Node the only fair way would be to use a single goroutine to schedule timers and another one to pick them up when they tick; this way we don't create a huge number of stacks, and it's much more comparable to what you're really doing in Node.

Consider the following code:

package main

import (
    "os"
    "strconv"
    "time"
)

func main() {

    numTimers, _ := strconv.Atoi(os.Args[1])

    timerChan := make(chan struct{})

    // Goroutine 1: schedule the timers. time.AfterFunc uses the
    // runtime's timer heap, so there is no goroutine (and stack)
    // held alive per timer while it waits.
    go func() {
        for i := 0; i < numTimers; i++ {
            time.AfterFunc(10*time.Second, func() {
                timerChan <- struct{}{}
            })
        }
    }()

    // Goroutine 2: Receive and process timer signals
    for i := 0; i < numTimers; i++ {
        <-timerChan
    }
}

Also for Node it's weird not to have Bun and Deno included. I suppose you can have other runtimes for other languages too.

In the end I think this benchmark is comparing different things and not really useful for anything...

[+] theamk|1 year ago|reply
> high number of concurrent tasks can consume a significant amount of memory

Note the absolute numbers here: in the worst case, 1M tasks consumed 2.7 GB of RAM, with ~2700 bytes of overhead per task. That'd still fit in the cheapest server with room to spare.

My conclusion would be opposite: as long as per-task data is more than a few KB, the memory overhead of task scheduler is negligible.

[+] pkulak|1 year ago|reply
Except it’s more than that. Go and Java maintain a stack for every virtual thread. They are clever about it, but it’s very possible that doing anything more than a sleep would have blown up memory on those two systems.
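That stack behaviour is easy to provoke: any real call depth makes the Go runtime grow a goroutine's stack well past its initial ~2 KiB. A sketch (the 1 KiB-per-frame recursion is an artificial way to force growth, and the checksum only exists so the buffers aren't optimized away):

```go
package main

import "fmt"

// Each frame pins ~1 KiB on the goroutine stack; a chain of 100
// calls needs ~100 KiB, far past the ~2 KiB a goroutine starts
// with. The runtime grows (and copies) the stack transparently,
// but that memory then belongs to the goroutine.
func deep(n int) byte {
	var buf [1024]byte
	buf[0] = byte(n)
	if n == 0 {
		return buf[0]
	}
	return deep(n-1) + buf[0]
}

func main() {
	fmt.Println(deep(100)) // 100 frames deep forces stack growth
}
```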
[+] cperciva|1 year ago|reply
This depends a lot on how you define "concurrent tasks", but the article provides a definition:

Let's launch N concurrent tasks, where each task waits for 10 seconds and then the program exits after all tasks finish. The number of tasks is controlled by the command line argument.

Leaving aside semantics like "since the tasks aren't specified as doing anything with side effects, the compiler can remove them as dead code", all you really need here is a timer and a continuation for each "task" -- i.e. 24 bytes on most platforms. Allowing for allocation overhead and a data structure to manage all the timers efficiently, you might use as much as double that; with some tricks (e.g. function pointer compression) you could get it down to half that.

Eyeballing the graph, it looks like the winner is around 200MB for 1M concurrent tasks, so about 4x worse than a reasonably efficient but not heavily optimized implementation would be.

I have no idea what Go is doing to get 2500 bytes per task.

[+] masklinn|1 year ago|reply
> I have no idea what Go is doing to get 2500 bytes per task.

TFA creates a goroutine (green thread) for each task (using a waitgroup to synchronise them). IIRC goroutines default to 2k stacks, so that’s about right.

One could argue it’s not fair and it should be timers which would be much lighter. There’s no “efficient wait” for them but that’s essentially the same as the appendix rust program.

[+] Mawr|1 year ago|reply
> Now Go loses by over 13 times to the winner. It also loses by over 2 times to Java, which contradicts the general perception of the JVM being a memory hog and Go being lightweight.

Well, if it isn't the classic unwavering confidence that an artificial "hello world"-like benchmark is in any way representative of real world programs.

[+] phillipcarter|1 year ago|reply
Yes, but also, languages like Java and C# have caught up a great deal over the past 10 years and run incredibly smoothly. Most peoples' perception of them being slow is really just from legacy tech that they encountered a long time ago, or (oof) being exposed to some terrible piece of .NET Framework code that's still running on an underprovisioned IIS server.
[+] blixt|1 year ago|reply
While it’s nice to compare languages with simple idiomatic code, I think it’s unfair to developers to show them the performance of an entirely empty function body and graphs with bars that focus on only one variable. It paints a picture that you can safely pick language X because it had the smallest bar.

I urge anyone making decisions from looking at these graphs to run this benchmark themselves and add two things:

- Add at least the most minimal real world task inside of these function bodies to get a better feel for how the languages use memory

- Measure the duration in addition to the memory to get a feel for the difference in scheduling between the languages

[+] tossandthrow|1 year ago|reply
This urge is as old as statistics. And I dare say that most people, after reading the article in question, are well prepared to use the results for what they are.
[+] JyB|1 year ago|reply
I’m still baffled that some people are bold enough to voluntarily post these kinds of most-of-the-time useless “benchmarks” that will inevitably be riddled with errors. I don’t know what pushes them. In the end you look like a clown more often than not.
[+] wiseowise|1 year ago|reply
The fastest way to learn truth is by posting wrong thing on the internet, or something.
[+] enginoid|1 year ago|reply
Trying things casually out of curiosity isn’t harmful. I expect people understand that these kinds of blog posts aren’t rigorous science to draw foundational conclusions from.

And the errors are a feature — I learn the most from the errata!

[+] davidatbu|1 year ago|reply
I write (async) Rust regularly, and I don't understand how the version in the appendix doesn't take 10x1,000,000 seconds to complete. In other words, I'd have expected no concurrency to take place.

Am I wrong?

UPDATE: From the replies below, it looks like I was right about "no concurrency takes place", but I was wrong about how long it takes, because `tokio::time::sleep()` keeps track of when the future was created, (ie when `sleep()` was called) instead of when the future is first `.await`ed (which was my unsaid assumption).

[+] aba_cz|1 year ago|reply
Regarding Java, I'm pretty sure the benchmark is broken at least a little bit and is testing something else: not specifying an initial size for the ArrayList means a list of size 10 that gets resized repeatedly as `add()` is called, leading to a large number of unused objects needing garbage collection.
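The Go analogue of that resizing cost is appending to a slice without a capacity hint: the backing array is reallocated and copied in steps, leaving garbage behind each time, while presizing with `make` allocates once. A sketch (the counts are mine):

```go
package main

import "fmt"

// countReallocs appends n ints to an unsized slice and counts
// how many times the backing array was reallocated (i.e. the
// capacity changed), each time abandoning the old array to GC.
func countReallocs(n int) int {
	s := make([]int, 0)
	reallocs := 0
	lastCap := cap(s)
	for i := 0; i < n; i++ {
		s = append(s, i)
		if cap(s) != lastCap {
			reallocs++
			lastCap = cap(s)
		}
	}
	return reallocs
}

func main() {
	// Unsized: dozens of grow-and-copy cycles, each leaving garbage.
	fmt.Println("reallocations:", countReallocs(1_000_000))

	// Presized: one allocation up front, nothing for the GC to chase.
	presized := make([]int, 0, 1_000_000)
	for i := 0; i < 1_000_000; i++ {
		presized = append(presized, i)
	}
	fmt.Println("presized capacity:", cap(presized))
}
```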
[+] jeswin|1 year ago|reply
Good to see NativeAOT getting positive press.

Go won because it served a need felt by many programmers: a garbage-collected language which compiled to native code, with robust libraries supported by a large corp.

With NativeAOT, C# is walking into the same space, with an arguably better library selection, equivalent performance, and native code compilation. And a much more powerful, well-thought-out language - at a slight complexity cost. If you're starting a project today (with the luxury of choosing a language), you should give C# + NativeAOT serious consideration.

[+] jillesvangurp|1 year ago|reply
Did a similar benchmark in Kotlin using co-routines.

    import kotlin.time.Duration.Companion.milliseconds
    import kotlin.time.measureTime
    import kotlinx.coroutines.async
    import kotlinx.coroutines.awaitAll
    import kotlinx.coroutines.coroutineScope
    import kotlinx.coroutines.delay
    
    suspend fun main() {
        measureTime {
            coroutineScope {
                (0..1000000).map {
                    async {
                        delay(1.milliseconds)
                    }
                }.awaitAll()
            }
        }.let { t ->
            println("Took $t")
            val runtime = Runtime.getRuntime()
    
            val maxHeapSize = runtime.maxMemory() 
            val allocatedHeapSize = runtime.totalMemory()
            val freeHeapSize = runtime.freeMemory()
    
            println("Max Heap: ${maxHeapSize / 1024 / 1024} MB")
            println("Allocated Heap: ${allocatedHeapSize / 1024 / 1024} MB")
            println("Free Heap: ${freeHeapSize / 1024 / 1024} MB")
        }
    }
This produces the following output:

   Took 1.597011084s
   Max Heap: 4096 MB
   Allocated Heap: 2238 MB
   Free Heap: 1548 MB
So whatever is needed to load classes and a million co-routines with some heap state. Of course the whole thing isn't doing any work and this isn't much of a benchmark. And of course if I run it with kotlin-js it actually ends up using promises. So, it's not going to be any better there than on the JVM.
[+] promiseofbeans|1 year ago|reply
It would be nice if the author also compared different runtimes (e.g. NodeJS vs Deno, or cpython vs pypy) and core language engines (e.g. v8 vs spider monkey vs JavaScript core)
[+] polyrand|1 year ago|reply
Out of curiosity, I checked if using uvloop[0] in Python changed the numbers.

This is the code:

  # /// script
  # requires-python = ">=3.12"
  # dependencies = ["uvloop"]
  # ///
  
  import asyncio
  import sys
  
  import uvloop
  
  
  async def main(num_tasks):
      tasks = []
  
      for task_id in range(num_tasks):
          tasks.append(asyncio.sleep(10))
  
      await asyncio.gather(*tasks)
  
  
  if __name__ == "__main__":
      num_tasks = int(sys.argv[1])
      # uvloop.run(main(num_tasks))
      asyncio.run(main(num_tasks))
I ran it with 100k tasks:

  /usr/bin/time -l -p -h uv run async-memory.py 100000
On my M1 MacBook Pro, using asyncio reports (~170MB):

  170835968  maximum resident set size
Using uvloop (~204MB):

  204259328  maximum resident set size

I kept the `import uvloop` statement when just using asyncio so that both cases start in the same conditions.

[0]: https://github.com/MagicStack/uvloop/

[+] octacat|1 year ago|reply
Where is erlang? Sleeping is not running, by the way. If you just sleep, in Erlang you would use a hibernated process.

I feel this is so misleading. For example, by default after spawning, Erlang would have some memory preallocated for each process, so they don't need to ask the operating system for new allocations (and if you want to shrink it, you call hibernate).

Do something more real, like message passing with one million processes or websockets. Or 1M tcp connections. Because, the moment you send messages, here is when the magic happens (and memory would grow, the delay when each message is processed would be different in different languages).

Oh, and btw, if you want to do THAT in erlang, use timer:apply_after(Time, Module, Function, Arguments). Which would not spawn an erlang process, just would put the task to the timer scheduling table.

And Elixir was in the old article, and they implemented it all wrong. Sad.

[+] afavour|1 year ago|reply
Maybe I’m missing something here but surely Node isn’t doing anything concurrently? Promises don’t execute concurrently, they just tidy up async execution. The code as given will just sequentially resolve a million promises. No wonder it looks so good. You’d need to be using workers to actually do anything concurrently.
[+] Izkata|1 year ago|reply
You're thinking of parallelism. Concurrency doesn't require them to actually be running at the same time.
[+] charlotte-fyi|1 year ago|reply
That's not entirely true. There's a thread pool of workers underneath libuv. Tasks that would block do indeed execute concurrently.
[+] citrin_ru|1 year ago|reply
It would be more interesting to see a benchmark where a task will not be empty but would have an open network connection e.g. would make an HTTP request to a test server with 10 seconds response time. Network is a frequent reason real world applications spawn 1M tasks.
[+] pgAdmin4|1 year ago|reply
Why is C with pthreads missing from this benchmark?
[+] throwaway81523|1 year ago|reply
I don't think 1M posix threads is a thing. 1K is no big deal though.
[+] thesnide|1 year ago|reply
that. Or just using a C coroutine lib.
[+] jakobnissen|1 year ago|reply
Just tried this in Julia: 16.0 GB of memory for 1M tasks!

I believe each task in Julia has its own stack, so this makes sense. Still, it does mean you've got to account for ~16 KB of memory per running task, which is not great.

[+] joshka|1 year ago|reply
RUST

The Rust code is really checking how big Tokio's structures that track timers are. Solving the problem in a fully degenerate manner, the following code runs correctly and uses only 35MB peak. 35 bytes per future seems pretty small. 1 billion futures was ~14GB and ran fine.

    use std::{
        future::Future,
        iter,
        pin::Pin,
        task::{Context, Poll},
        time::{Duration, Instant},
    };

    #[tokio::main]
    async fn main() {
        let sleep = SleepUntil {
            end: Instant::now() + Duration::from_secs(10),
        };
        let timers: Vec<_> = iter::repeat_n(sleep, 1_000_000).collect();
        for sleep in timers {
            sleep.await;
        }
    }

    #[derive(Clone)]
    struct SleepUntil {
        end: Instant,
    }

    impl Future for SleepUntil {
        type Output = ();

        fn poll(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Self::Output> {
            if Instant::now() >= self.end {
                Poll::Ready(())
            } else {
                cx.waker().wake_by_ref();
                Poll::Pending
            }
        }
    }
Note: I do understand why this isn't good code, and why it solves a subtly different problem than posed (the sleep is cloned, including the deadline, so every timer is the same).

The point I'm making here is that synthetic benchmarks often measure something which doesn't help much. While the above is really degenerate, it shares the same problems as the article's code (it just leans into problems much harder).