top | item 46753261

mnahkies | 1 month ago

Something that struck me earlier this week: when profiling certain workloads, I'd really like a flame graph that includes wall time spent waiting on IO, be it a database call, filesystem access, or some other RPC.

For example, our integration test suite on a particular service has become quite slow, but it's not clear where the time is going. I suspect a decent amount of it is spent talking to Postgres, but I'd like a low-touch way to profile this.

6keZbCECT2uB|1 month ago

There's prior work: https://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.h...

There are a few challenges here.
- Off-CPU profiling doesn't get the timer interrupt that on-CPU sampling uses to collect stack traces, so you either instrument a full timeline of when threads move on and off CPU, or periodically walk every thread for its stack trace.
- Applications have many idle threads, and waiting for IO is a common thread pool case, so it's harder to tell a thread that is waiting on a pool doing delegated IO apart from idle worker pool threads.
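To make the "periodically walk every thread" option concrete, here's a minimal sketch for CPython only. It leans on sys._current_frames(), which is a real but underscore-prefixed, implementation-specific API; sample_once and waiting are names made up for illustration. Because it samples on wall-clock time, a thread blocked in a sleep or in IO still shows up.

```python
import sys
import threading
import time
import traceback
from collections import Counter


def sample_once(counts):
    """Walk every live thread's Python stack once and count the folded stack."""
    for _tid, frame in sys._current_frames().items():
        # extract_stack returns frames ordered from outermost to innermost call.
        frames = traceback.extract_stack(frame)
        folded = ";".join(f.name for f in frames)
        counts[folded] += 1


def waiting():
    # Hypothetical stand-in for a blocking IO call.
    time.sleep(0.2)


if __name__ == "__main__":
    counts = Counter()
    worker = threading.Thread(target=waiting)
    worker.start()
    for _ in range(10):
        sample_once(counts)
        time.sleep(0.01)
    worker.join()
    for stack, n in counts.items():
        print(stack, n)
```

Note that it also counts the sampling thread itself and any idle pool threads, which is the idle-thread noise problem described above.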

Some solutions:
- I've used Nsight Systems for non-GPU stuff to visualize off-CPU time on equal footing with on-CPU time.
- gdb thread apply all bt is slow but does full call stack walking. In Python, we have py-spy dump for supported interpreters.
- Remember that anything you can represent as call stacks and integers can be converted easily to a flamegraph, e.g. taking strace durations by tid (and maybe fd) and aggregating them into a flamegraph.
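A rough sketch of the "call stacks and integers" idea: flamegraph.pl reads "folded" input, one line per stack, with frames joined by semicolons and followed by a space and a count. The record shape below (tid, fd, syscall, seconds) and the sample values are invented for illustration; this does not parse real strace output.

```python
from collections import defaultdict


def fold(records):
    """Aggregate (tid, fd, syscall, seconds) records into folded-stack lines."""
    totals = defaultdict(int)
    for tid, fd, syscall, seconds in records:
        # Treat tid / fd / syscall as a three-frame "stack".
        stack = f"tid_{tid};fd_{fd};{syscall}"
        # Use integer microseconds as the flame graph's "sample count".
        totals[stack] += round(seconds * 1_000_000)
    return [f"{stack} {usec}" for stack, usec in sorted(totals.items())]


# Hypothetical durations, as if already extracted from a trace.
SAMPLE = [
    (101, 7, "recvfrom", 0.0125),
    (101, 7, "recvfrom", 0.0075),
    (101, 3, "read", 0.001),
    (102, 7, "sendto", 0.0005),
]

if __name__ == "__main__":
    print("\n".join(fold(SAMPLE)))
```

The printed lines can be saved to a file and passed to flamegraph.pl; the frame widths then represent time waited rather than CPU samples.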

trillic|1 month ago

See if you can wrap the underlying library call (pg.query or whatever it is) in a generic wrapper that logs the time spent in the query function. Should be easy in a dynamic lang.
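Something like this, sketched in Python rather than against any particular Postgres client. timed, STATS, and fake_query are all hypothetical names; in a real app you'd wrap the client's actual query function instead of the sleep-based stand-in.

```python
import functools
import time
from collections import defaultdict

# label -> [call count, total wall-clock seconds]
STATS = defaultdict(lambda: [0, 0.0])


def timed(label):
    """Decorator that records wall time spent inside the wrapped function."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Record the call even if the wrapped function raised.
                entry = STATS[label]
                entry[0] += 1
                entry[1] += time.perf_counter() - start
        return wrapper
    return decorate


@timed("db.query")
def fake_query(sql):
    time.sleep(0.01)  # stand-in for a database round trip
    return [sql]


if __name__ == "__main__":
    fake_query("select 1")
    fake_query("select 2")
    for label, (calls, seconds) in STATS.items():
        print(f"{label}: {calls} calls, {seconds * 1000:.1f} ms")
```

Dumping STATS at the end of the test run gives a first answer to "how much wall time went to the database" without touching the tests themselves.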

Kuinox|1 month ago

A tracing profiler can do exactly that; you don't need a dynamic lang.