felixge | 1 year ago
The low-hanging fruit for speeding up stack unwinding in the Go runtime is to switch to frame pointer unwinding in more places. In go1.21 we contributed patches to do this for the execution tracer. For the upcoming go1.23 release, my colleague Nick contributed patches to upgrade the block and mutex profilers. Once the go1.24 tree opens, we're hoping to tackle the memory profiler as well as copystack. The latter would benefit all Go programs, even those not using profiling, but it's likely going to be a relatively small win (<= 1%).
Once all of this is done, shadow stacks have the potential to make things even faster. The problem is that we'll be deep in diminishing-returns territory at that point. Speeding up stack capturing is great when it makes up 80-90% of your overhead (this was the case for the execution tracer before frame pointers). But once we're down to 1-2% (the current situation for the execution tracer), another 8x speedup is not going to buy us much, especially when it has downsides.
The only future in which shadow stacks could speed up real Go programs is one where we decide to drop frame pointer support in the compiler, which could provide a 1-2% speedup for all Go programs. Once hardware shadow stacks become widely available and accessible, I think that would be worth considering. But that's likely to be a few years down the road from now.
aerfio | 1 year ago
felixge | 1 year ago
That being said, I'm sure there are a lot of remaining incremental optimization opportunities that could add up to 10% over time, for example a faster map implementation [1]. I'm sure there is more.
Another recent perf opportunity is using pgo [2] which can get you 10% in some cases. Shameless plug: We recently GA'ed our support for it at Datadog [3].
[1] https://github.com/golang/go/issues/54766 [2] https://go.dev/doc/pgo [3] https://www.datadoghq.com/blog/datadog-pgo-go/
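As a rough illustration of the PGO workflow mentioned above (the pprof endpoint, port, and package path below are placeholders for wherever your service actually exposes profiles):

```shell
# Collect a CPU profile from a production instance
# (assumes the service imports net/http/pprof; host/port are placeholders).
curl -o cpu.pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

# The toolchain looks for a file named default.pgo in the main package
# directory when building with -pgo=auto (the default since Go 1.21).
mv cpu.pprof ./cmd/myservice/default.pgo

# Rebuild; the compiler uses the profile to guide inlining and other
# optimizations.
go build ./cmd/myservice
```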
neonsunset | 1 year ago
Though I find it unfortunate that the industry considers Go a choice for performance-sensitive scenarios when C# exists, which went the above route and sacrifices neither performance nor the ability to offer performance-specific APIs (like cross-platform SIMD), at the price of a higher-effort, more complex compiler implementation. It also does in-runtime PGO (DynamicPGO): since long-running server workloads usually run under a JIT where one is available, you don't need to carefully craft a sample workload and hope it matches production behavior - the JIT does it for you, and it yields anything from 10% to 35% depending on how abstraction-heavy the codebase is.
dolmen | 1 year ago
As a developer I like that approach: it keeps a great developer experience, helps me stay focused, and gives me great productivity.