top | item 46846426

kdps | 29 days ago

It's not really surprising given the implementations. The C# stdlib just exposes more low-level levers here (quick look, correct me if I'm wrong):

For one, the C# code explicitly uses SIMD (System.Numerics.Vector) to process blocks, whereas Go does it scalar. It also uses a read-only FrozenDictionary, which is heavily optimized for fast lookups compared to a standard map. Parallel.For runs on pool threads that map onto OS threads, avoiding the Go scheduler's overhead (like preemption every few ms), which is small but still unnecessary for pure number crunching. But a bigger bottleneck is probably synchronization: the Go version writes to a channel on every iteration. Even buffered, that implies internal locking and potential contention. The C# version just writes to pre-allocated, disjoint memory regions, so there's no synchronization in the hot loop at all.

freeopinion | 29 days ago

In other words, the benchmark doesn't even use the same hardware for each run?

kdps | 29 days ago

If you're referring to the SIMD aspect (I assume the other points don't apply here): it depends on your perspective.

You could say yes, because the C# benchmark code is utilizing vector extensions on the CPU while Go's isn't. But you could also say no: Both are running on the same hardware (CPU and RAM). C# is simply using that hardware more efficiently here because the capabilities are exposed via the standard library. There is no magic trick involved. Even cheap consumer CPUs have had vector units for decades.