An amazing part of sampling-based profilers: they're so low-overhead that you can leave them on permanently in production. And since all they collect is counts, you can combine results from multiple instances very easily. Put these together and they form an easy, highly insightful part of your production monitoring systems. Notice a service being slow? You can instantly see its live flamegraph, check a function's average runtime over time, etc.
Indeed. But be careful there: if anything is slowing down your system, it can slow down the sampling ticks as well and hide performance degradations. It's a good idea to record the time interval between ticks. If ticks are too frequent, this can again get you into trouble, as the interval computation itself can become relatively expensive.
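A sampler that records its own tick intervals takes only a few lines. Here is a minimal sketch (the names are made up, and it's Unix-only) built on an ITIMER_PROF timer; keeping the interval list makes stalls visible in the data instead of silently hidden:

```python
import signal
import time

tick_intervals = []   # observed gaps between consecutive ticks, in seconds
_last_tick = None

def _on_tick(signum, frame):
    global _last_tick
    now = time.monotonic()
    if _last_tick is not None:
        # If the process stalls (GC pause, swapping, scheduler pressure),
        # this gap grows well beyond the requested interval -- record it
        # rather than hide it.
        tick_intervals.append(now - _last_tick)
    _last_tick = now
    # a real profiler would also record the interrupted stack from `frame` here

def start_sampling(interval=0.01):
    signal.signal(signal.SIGPROF, _on_tick)
    signal.setitimer(signal.ITIMER_PROF, interval, interval)

def stop_sampling():
    signal.setitimer(signal.ITIMER_PROF, 0, 0)
```

Checking `tick_intervals` for entries far above the requested interval is then a cheap way to spot exactly the degradation-hiding effect described above.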
It might be interesting to some to know that relatively recently (2016), PEP 523 ( https://www.python.org/dev/peps/pep-0523/ ) was created, which provides a more efficient alternative to PyEval_SetTrace or PyEval_SetProfile for some use cases.
The idea of this PEP is to add a frame evaluation API to CPython. As the PEP says, "For instance, it would not be difficult to implement a tracing or profiling function at the call level with this API."

Elizaveta Shashkova (a PyCharm developer at JetBrains) gave a really good talk on the subject at this year's PyCon ( https://www.youtube.com/watch?v=NdObDUbLjdg ).
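For context, the older hook the PEP contrasts itself with, PyEval_SetProfile, is exposed to Python code as sys.setprofile, which fires on every call and return. A minimal call-counting sketch (the fib function is just a demo workload):

```python
import sys
from collections import Counter

call_counts = Counter()

def _profile(frame, event, arg):
    # sys.setprofile delivers "call"/"return" for Python functions and
    # "c_call"/"c_return" for C functions; we only count Python-level calls.
    if event == "call":
        code = frame.f_code
        call_counts[(code.co_filename, code.co_name)] += 1

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

sys.setprofile(_profile)
fib(10)
sys.setprofile(None)
```

Every one of those hook invocations costs time, which is exactly the overhead PEP 523's frame evaluation API aims to reduce.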
This is a really nice blog post! It covers the ecosystem well and goes into the right level of detail to help someone unfamiliar with the tools understand how they work.
At PyOhio I gave a hastily-prepared lightning talk on building a statistical profiler in Python. I'll link to it here because, although it goes into much less detail than Julia's post, it shows a visualization technique and an example of the sort of payoff you get from profiling.
https://www.polibyte.com/2017/08/05/building-statistical-pro...
> The main disadvantage of tracing profilers implemented in this way is
> that they introduce a fixed amount for every function call / line of code
> executed. This can cause you to make incorrect decisions! For example, if
> you have 2 implementations of something – one with a lot of function calls
> and one without, which take the same amount of time, the one with a lot of
> function calls will appear to be slower when profiled.
I'm not convinced this is the main disadvantage of tracing profilers. The main disadvantage is that the perf impact tends to be much more severe.
It's theoretically possible to mitigate the problem identified (keep track of how much time you spend profiling, and reduce the time tracked spent in the function by that amount). However, it's harder to have a tracing profiler which only profiles some functions, and harder to specify those.
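The skew described in the quote is easy to reproduce with cProfile (this demo is mine, not from the article): two functions do the same arithmetic, but the call-heavy one pays the hook's fixed per-call cost 100,000 extra times:

```python
import cProfile
import pstats

def add(a, b):
    return a + b

def many_calls(n):
    total = 0
    for i in range(n):
        total = add(total, i)   # one profiled call event per iteration
    return total

def no_calls(n):
    total = 0
    for i in range(n):
        total = total + i       # same work, no call events
    return total

profiler = cProfile.Profile()
profiler.enable()
many_calls(100_000)
no_calls(100_000)
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(5)
```

Under the profiler, many_calls is typically charged noticeably more time than no_calls, even though the unprofiled gap between them is far smaller; the call counts, on the other hand, are exact.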
The nice thing about sampling profiles is that it's very easy to configure the performance impact and easy to get a good sample by just running for longer.
> The main disadvantage is that the perf impact tends to be much more severe.
That is an annoyance, but not the real issue. A 3x performance degradation, or even a 10x one, can be very annoying. But the fact that the degradation is uneven means profiling is unreliable, and that's a problem I've had semi-frequently with tracing profilers: the function call counts are reliable, but the timings/percentages aren't always, and you may see a hotspot under the profiler, optimise it to the hilt, run without profiling and… nothing changed. Because the hotspot was mostly a profiler artefact.
I'm surprised that no system-wide tracers were included (unless I missed something?). You can use dtrace / systemtap with both ruby and python (>=3.6). You still have to extract the bits you're interested in, but the hooks for function entry/exit and GC are exposed, so you can effectively add a targeted profiler/tracer at runtime.
I'm sad there are no drawings in the article :( Julia Evans' drawings are amazing. If you've never seen them, you should check out the strace zine: https://jvns.ca/zines/#strace-zine
But it's still a very interesting blog post!
I guess this is as good a time as any to make my post-PyParallel pet project public (holy alliteration Batman): https://github.com/tpn/tracer.
I started it about 2.5 years ago because I wasn't really happy with existing Python tracing tools. I didn't like:
a) the performance hit when tracing non-trivial programs,
b) the inability to say: "hey, trace all code living in modules mymodule1 and mymodule2; ignore everything else", and
c) the inability to say: "screw it, trace everything", and have an efficient means to persist tens of gigabytes worth of data (i.e. don't separate in-memory representation from on-disk representation).
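For what it's worth, point (b) can be approximated (slowly) with the stdlib hooks; a sketch with made-up module names, using the rule that returning None from the global trace function skips a frame entirely:

```python
import sys

TRACE_MODULES = {__name__}   # e.g. {"mymodule1", "mymodule2"}
events = []

def _tracer(frame, event, arg):
    module = frame.f_globals.get("__name__", "")
    if module not in TRACE_MODULES:
        return None                      # don't trace this frame at all
    if event == "line":
        events.append((module, frame.f_code.co_name, frame.f_lineno))
    return _tracer                       # keep tracing lines in this frame

def traced_function():
    x = 1
    y = x + 1
    return y

sys.settrace(_tracer)
traced_function()
sys.settrace(None)
```

The filtering helps, but every call still pays the hook dispatch cost, which is why a dedicated tracer can be worth building.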
So I came up with this tracing project, creatively named: tracer.
It has relatively low overhead, efficiently persists tens of gigabytes of data to disk, and allows module-based tracing directives, so you can capture info about your project and ignore all the other stuff (like stdlib and package code).
Like all fun pet projects, once the easy stuff was done, this one devolved into writing assembly to do remote thread injection (https://github.com/tpn/tracer/blob/master/Asm/InjectionThunk...), coming up with a new "string table" data structure as an excuse to use some AVX2 intrinsics (https://github.com/tpn/tracer/blob/master/StringTable/Prefix...), and writing a sqlite3 virtual table interface to the flat-file memory maps, complete with CUDA kernels (https://github.com/tpn/tracer/blob/master/TraceStore/TraceSt...).
Edit: something pretty fascinating I observed in the wild when running this code: a tracing-enabled Python run of a large project ran faster than a normal, non-traced version of Python in a low-memory environment (a VM with 8GB of RAM). My guess is that splaying the PyCodeObject structures on each frame invocation has the very beneficial side effect of reducing the likelihood the OS swaps the underlying pages out at a later date (https://github.com/tpn/tracer/blob/master/Python/PythonFunct...).
I recently started playing with uftrace, a tracing framework for C++ using instrumented binaries. I got it working on CPython and I like it so far.
The CLI seems to be well done. I haven't looked at your tracer in detail (a README would help), but maybe some ideas could be exchanged.
This thread has a link to the repo and a video:
https://www.reddit.com/r/ProgrammingLanguages/comments/7djre...

Does this mean that PyParallel is dead?
As far as I can tell, there's not much of a reason to use an instrumenting profiler over a sampling one. Is there some niche use case? Maybe to experimentally validate some claim about average case computational complexity?
I use explicit instrumenting APIs to generate flamegraphs a lot - useful for a lot of niches in gamedev - where we care about individual frames exceeding their frame budget. Being able to capture several thousand frames worth of flame graphs and then being able to drill down into the individual outliers is absolutely killer in this context. Sampling profilers are great for getting a broad idea of where problems might be, but not nearly as good as instrumented options when you really want detail. For example:
- Finding where we checked a file for modification on the main thread every few seconds, to consider reloading resources, causing us a 5ms spike that would sometimes cause us to miss vsync.
- Spotting a framerate hitch caused by the main thread stalling out on a contested logging lock under some specific set of circumstances for a frame or two.
- Spotting frames where overly aggressive background jobs simply starved the main thread out of CPU cycles.
- Which specific subanimations of a set (complete with names because I'm using an explicit API that lets me annotate them) are causing our optimization code to exhibit O(scary) behavior.
There's nothing here that you couldn't eventually figure out and solve with a debugger and a sampling profiler, but instrumenting profilers - and profiler APIs that let you add your own annotations - make the job way easier in my experience.
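To make the annotation idea concrete, here's a toy version of such an API (every name here is invented, not taken from any real profiler): explicit named spans, grouped per frame, so outlier frames can be pulled out afterwards with their breakdown intact:

```python
import time
from contextlib import contextmanager

class FrameProfiler:
    def __init__(self):
        self.frames = []          # list of {span_name: elapsed_seconds}
        self._current = None

    def begin_frame(self):
        self._current = {}

    def end_frame(self):
        self.frames.append(self._current)
        self._current = None

    @contextmanager
    def span(self, name):
        # An explicit, annotated span: the caller names the work being timed.
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self._current[name] = self._current.get(name, 0.0) + elapsed

    def outliers(self, budget_seconds):
        # Frames whose total annotated time blew the frame budget.
        return [f for f in self.frames if sum(f.values()) > budget_seconds]

profiler = FrameProfiler()
for frame_index in range(3):
    profiler.begin_frame()
    with profiler.span("animation"):
        time.sleep(0.05 if frame_index == 1 else 0.001)  # frame 1 is the spike
    with profiler.span("render"):
        time.sleep(0.001)
    profiler.end_frame()
```

A sampling profiler averages that one bad frame away; here it stays isolated, with the offending span named.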
> For long-running, batch-oriented programs, though, where there are no pathological corner cases, the usefulness of cProfile diminishes. It becomes hard to distinguish signal from noise with regards to the cumulative call counts and function time. Additionally, cProfile only captures function call entry and exit times -- it doesn't do line profiling, and thus, won't be able to depict abnormally expensive lines within a routine (this can be common with NumPy and Pandas; a seemingly innocuous line will actually have a very high overhead behind the scenes).
From https://github.com/tpn/tracer.
For example, catching the runtime of a function that's executed very infrequently and whose execution time depends heavily on context (which makes an isolated reproducer hard to achieve).
Or finding out about a tiny function which is called thousands of times and adds up over time, but each execution is so fast you rarely ever see it on a sample profile.
I don't use tracing much because it produces such bad results, but I often find myself looking at a profile and thinking "I really wish I knew how many times this method was called".
One of the things that makes Ruby so difficult to profile is that in order to do anything really fun (like get memory usage, for instance), you need to shell out. That means forking, and doubling your memory footprint immediately, which means you probably can't do this in production. This is especially frustrating because by far the biggest issue you'll have with a Ruby (especially Rails) app is memory usage, so pinpointing which of the many things currently running on your box are blowing up your memory can be non-trivial.
Why do you have to fork to get a memory profile? I don't know anything about memory profiling yet (I collected a bunch of links in https://gist.github.com/jvns/b81eb039f6373bc577d7dbdd978581b... but haven't read their source yet) but I'd like to know more. It seems like it should be possible to get a memory profile in production without a prohibitive amount of overhead though.
Also, I believe forking shouldn't actually double memory usage because of copy on write.
On linux you can get a process' memory usage by reading the `/proc` pseudo-filesystem. See [0], which uses that approach when supported.
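The /proc read is short enough to sketch inline (Linux only; the field layout is documented in proc(5)):

```python
def rss_kb(pid="self"):
    """Resident set size in kB, read from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                # the line looks like: "VmRSS:     12345 kB"
                return int(line.split()[1])
    return None
```

No fork, no shelling out: it's just a file read, so it's cheap enough to poll in production.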
You're correct that production ruby applications should try to avoid calling fork() at runtime, which excludes all the standard library mechanisms for shelling out. posix-spawn [1] is an alternative that works well.
[0] https://github.com/schneems/get_process_mem
[1] https://github.com/rtomayko/posix-spawn
I'm currently working on a programming language (Lily; see that name on GitHub), and this is really helpful. I've been thinking about adding debugging support, and this will help a lot toward doing that.