DTrace for Linux 2016

[+] jdesfossez|9 years ago|reply

It would be worthwhile to clarify the term "tracing" to distinguish between live aggregation and post-processing approaches. The general confusion around the "tracing" terminology seems to imply a competition between these two, while they should rather be seen as complementary.

DTrace, SystemTap and eBPF/BCC are designed to aggregate data in the critical path and compute a summary of the activity. Ftrace and LTTng are designed to extract traces of execution for high resolution post-processing with as small overhead as possible.

Aggregation is very powerful and gives a quick overview of the current activity of the system. Tracing extracts the detailed activity at various levels and allows in-depth understanding of a particular behaviour after the fact, since as many analyses as necessary can be run on the captured trace.

In terms of impact on the traced system, trace buffering scales better with the number of cores than aggregation approaches due to its ability to partition the trace data into per-core buffers.

Both approaches have upsides and downsides and should not be seen as being in competition: they address different use cases and can even complement each other.
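
To make the distinction concrete, here is a toy Python sketch. It is not how any of these tracers is implemented; the event tuples and function names are invented for illustration. The aggregating path keeps only a summary that is updated as each event fires, while the tracing path appends raw records to a buffer private to each CPU and leaves all analysis for later.

```python
from collections import Counter, defaultdict

# Synthetic events: (timestamp_ns, cpu, syscall, latency_us). Invented data.
EVENTS = [
    (100, 0, "read", 12),
    (110, 1, "write", 40),
    (120, 0, "read", 7),
    (130, 1, "read", 95),
    (140, 0, "write", 33),
]

def aggregate_in_path(events):
    """Aggregation model: update a summary per event, keep no raw data."""
    counts = Counter()
    for _ts, _cpu, syscall, _lat in events:
        counts[syscall] += 1
    return counts

def capture_per_cpu(events):
    """Tracing model: append raw records to a buffer private to each CPU."""
    buffers = defaultdict(list)
    for event in events:
        buffers[event[1]].append(event)
    return buffers

def merge_buffers(buffers):
    """Post-processing: merge per-CPU buffers back into timestamp order."""
    return sorted(e for buf in buffers.values() for e in buf)

def max_latency_by_syscall(trace):
    """One of many analyses that can be run later on the same capture."""
    worst = {}
    for _ts, _cpu, syscall, lat in trace:
        worst[syscall] = max(lat, worst.get(syscall, 0))
    return worst

summary = aggregate_in_path(EVENTS)
trace = merge_buffers(capture_per_cpu(EVENTS))
print(dict(summary))
print(max_latency_by_syscall(trace))
```

The summary answers only the question chosen up front (here, a count per syscall), whereas the captured trace can be re-read later to answer a different one, such as worst-case latency.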

[+] brendangregg|9 years ago|reply

You're right that a key feature and differentiator of DTrace/stap/BPF is kernel aggregations, but they can do per-event output as well. But I think I know what you mean, especially as I was at the sysdig summit yesterday and could see a major difference.

I think the two models for tracers, playing on their strengths, are: 1. real-time analysis tracers (DTrace/stap/BPF), and 2. offline analysis tracers (LTTng, sysdig). Both can do the other as well, but I'm just pointing out strengths.

sysdig (and I believe LTTng) has done great work at creating capture files that can then be analyzed offline in many many different ways, and they've optimized the way full-event dumps can be captured and saved (which I know LTTng has done as well). DTrace/stap/BPF don't have any offline capture file capabilities -- they could do it, but it's not been their focus.

[+] AceJohnny2|9 years ago|reply

I've only recently tried out DTrace on OS X, and I'll admit to being kinda floored at what it can do. To think I used to be satisfied with strace on Linux!

Seeing the tracing capabilities of Linux expand is exciting indeed.

Edit: the couple of tutorials that finally unlocked DTrace (on OS X) for me are:

https://www.objc.io/issues/19-debugging/dtrace/

https://www.bignerdranch.com/blog/hooked-on-dtrace-part-1/

[+] tkinom|9 years ago|reply

Agreed, DTrace on OS X is super powerful.

I once tried to debug the open source libusb app on Mac OS. With DTrace I could trace the app, the kernel USB API calls, the libusb internal thread in user space, etc.

Much better visibility into system activity compared to simple strace.

Absolutely love the power of what it can do.

BTW, can a DTrace script be used to monitor a system for a potential "Dirty COW" type privilege escalation issue?

[+] helper|9 years ago|reply

The most challenging thing for us is running a new enough kernel to get these features. While upgrading to a newer kernel isn't particularly hard, small companies don't have a lot of engineering resources to run kernels that aren't maintained by their distro of choice (usually on the LTS release).

The good thing is this is solved simply by waiting long enough. The bad thing is most developers can't just pick this up today without a bunch of extra effort.

If you are looking for something you can use with old kernels you should definitely check out Brendan's perf-tools repo[1]. It takes advantage of older kernel features and works with things as old as Ubuntu 12.04.

*Edit: Fixed Brendan's name

[1]: https://github.com/brendangregg/perf-tools

[+] devonkim|9 years ago|reply

On the other side of the spectrum, companies highly averse to technical changes culturally (typical case in the F500) will avoid ever upgrading kernels, libraries, and tooling. It's how I've wound up spending days trying to compile C++11 and C++14 code with toolchains that would run on CentOS / RHEL 5 and 6. Using the JVM lets you side-step the shared library linkage compatibility issues at least, but when you need a new kernel for instrumentation it's an even harder sell to an antagonistic IT department that only wants 2 OSes and corresponding versions to exist in the world ideally - "Linux" and Windows.

[+] brendangregg|9 years ago|reply

Right, thanks, my perf-tools are on the Netflix BaseAMI, and are my go-to tracing tools for 3.x and earlier 4.x kernels.

[+] cthalupa|9 years ago|reply

Amazon enabled BPF flags in the Amazon Linux AMI with 2016.03, and generally seems to move to whatever the latest LTS kernel is when they release a new version.

Since 4.9 is supposed to be the next LTS, if it gets out of RC fast enough, we could see 4.9 in the 2017.03 Amazon Linux AMI, which would be a pretty big win for those of us running workloads in the AWS cloud.

[+] technofiend|9 years ago|reply

This is the same problem shared by Red Hat customers. Although RH is great about backporting features to older kernels, I'm not sure they'll be able to move this to 3.x from 4.9. That's the price we pay for stability.

[+] wyldfire|9 years ago|reply

Congrats, this is good news.

> On Linux, some out-of-tree tracers like SystemTap could serve these needs, but brought their own challenges.

I was pretty happy with stap, it had a really rich feature set.

> DTrace has its own concise language, D, similar to awk, whereas bcc uses existing languages (C and Python or lua) with libraries.

I think we need more creative names for languages. The short and simple ones like "go" and "D" keep on having collisions. :)

>BPF adds ... uprobes

uprobes + all the other stuff is really killer. I like the idea of watching for stuff like "my app has crossed this threshold and then this system condition occurs". At least when I tried it a couple of years ago with stap, my kernel wasn't built with uprobes support and I wasn't inclined to rebuild it. Hopefully it becomes (or has become) more mainstream.

[+] brendangregg|9 years ago|reply

> I was pretty happy with stap, it had a really rich feature set.

So are other companies. I mentioned it in the post, as in a way this hurt BPF development, as companies that normally would have contributed resources said they were satisfied with stap. Exciting times might be ahead for stap, if it continues its BPF backend.

As for naming, yes, we need better names. Maybe the bcc/Python/BPF combination can be named something?

[+] qwertyuiop924|9 years ago|reply

Will there ever be a way to write probes/tracing scripts without dropping into C? I don't mind C in general, but I don't want to have to dig out the documentation for the eBPF C library and start writing hundreds of lines of C every time I want to run a trace.

DTrace made this really nice, because you would write your tracing scripts in a high-level, awk-like language, which is the sort of thing well-suited to the purpose.

[+] brendangregg|9 years ago|reply

Yes, see the section "A higher-level language", which mentions at least two projects: SystemTap+BPF and ply.

Think of the current bcc/Python/C interface as a lightweight skin that was necessary during BPF development to kick the tires on various features, prototype tools, see what else needed to be done, etc. It may be good enough to stay around, as lots of tools have been written for it that will get used and be valuable. But there's room for higher-level languages too.

If Sasha keeps developing his "trace" tool (and its summary counterpart, argdist), that may serve many such custom needs (as another option). See the various examples: https://github.com/iovisor/bcc/blob/master/tools/trace_examp... , like:

    # trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
    TIME     PID    COMM         FUNC             -
    05:18:23 4490   dd           sys_read         read 1048576 bytes
    05:18:23 4490   dd           sys_read         read 1048576 bytes
    05:18:23 4490   dd           sys_read         read 1048576 bytes
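
For anyone unfamiliar with the syntax: that one-liner names a probe point (sys_read), a filter (arg3 > 20000), and a printf-style message with its argument. Below is a toy Python sketch of those semantics, run over made-up events rather than real kernel probes. The `Event` type and this `trace` function are hypothetical and are not bcc's API.

```python
from typing import Callable, Iterable, List, NamedTuple

class Event(NamedTuple):
    """A made-up record of one function call; not a bcc data structure."""
    pid: int
    comm: str
    func: str
    args: tuple  # positional arguments; arg1 is args[0]

def trace(events: Iterable[Event], func: str,
          predicate: Callable[[Event], bool],
          fmt: str, pick: Callable[[Event], tuple]) -> List[str]:
    """Return one formatted line per event that matches func and predicate."""
    lines = []
    for ev in events:
        if ev.func == func and predicate(ev):
            message = fmt % pick(ev)
            lines.append(f"{ev.pid:<6} {ev.comm:<12} {ev.func:<16} {message}")
    return lines

# Invented sample events; args are (fd, buffer address, byte count).
events = [
    Event(4490, "dd", "sys_read", (3, 0x1000, 1048576)),
    Event(4491, "cat", "sys_read", (3, 0x2000, 4096)),     # dropped: too small
    Event(4490, "dd", "sys_write", (4, 0x1000, 1048576)),  # dropped: other func
]

# Equivalent in spirit to: sys_read (arg3 > 20000) "read %d bytes", arg3
matched = trace(events, "sys_read",
                lambda ev: ev.args[2] > 20000,
                "read %d bytes", lambda ev: (ev.args[2],))
for line in matched:
    print(line)
```

The real tool does the filtering in kernel context via BPF, so non-matching events never reach user space; this sketch only mirrors the filter-then-format logic.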

[+] vavrusa|9 years ago|reply

It wasn't mentioned in the article, but I've recently merged a LuaJIT-to-BPF compiler. So you just write Lua, and the kernel bits get compiled into BPF bytecode and loaded. No C.

See https://github.com/iovisor/bcc/blob/master/src/lua/README.md... or examples https://github.com/iovisor/bcc/blob/master/examples/lua/trac... (this one is for tracing).

[+] Roxxik|9 years ago|reply

It shouldn't be that difficult to write some simple compiler to translate a scripty language that looks like awk to BPF-Assembly (or C and stuff that into LLVM). I might look into this stuff for my Bachelor Thesis ;)

[+] bonzini|9 years ago|reply

Yes, people are working on an eBPF backend for systemtap.

[+] lallysingh|9 years ago|reply

So we're not getting DTrace proper, it seems. Instead something else will spring up from the various Linux tracing systems. Maybe this BPF-based one.

It's a shame. One of the nice things about dtrace was that there was a book on it. Good, in-depth documentation on performance tools is hard to find.

[+] brendangregg|9 years ago|reply

Thanks, I wrote the DTrace book with Jim Mauro, and there will be a BPF tracing book as well.

BTW, I wouldn't say "maybe" regarding BPF, as it's integrated in the Linux kernel (unlike most of the other tracers, which are add-ons). Sooner or later everyone who runs Linux is getting it.

[+] asymmetric|9 years ago|reply

> In 2014 I joined the Netflix cloud performance team. Having spent years as a DTrace expert, it might have seemed crazy for me to move to Linux

I thought Netflix was mostly running FreeBSD [1]. Is it only the Open Connect Appliance?

[1]: https://www.freebsdfoundation.org/testimonial/netflix/

[+] brendangregg|9 years ago|reply

When you log in to Netflix and browse videos, you're running on the Netflix cloud, which is massive, AWS/EC2, and mostly Ubuntu Linux. When you hit play, you're running on the OCA FreeBSD CDN, which is also a large deployment.

[+] easytiger|9 years ago|reply

It's really rather unfortunate that big enterprise platforms such as banks are so far behind on their kernel versions that it will be approximately 7-8 years before they have this capability, unless RH backports it, of course.

[+] twblalock|9 years ago|reply

On the other hand, I'm glad the banks who handle my money don't upgrade to the latest and greatest software without taking very, very stringent precautions to make sure everything will work.

[+] unknown|9 years ago|reply

[deleted]

[+] 4ad|9 years ago|reply

Linux is not my favorite operating system, but it seems like we're stuck with it. I'm very happy for all these improvements. Once you got used to a system with a quality and functional tracer, Linux was hard to get back to. But Linux tracing is getting better and better now. I am very satisfied.

[+] Annatar|9 years ago|reply

> Linux is not my favorite operating system, but it seems like we're stuck with it.

It only seems that way. We're never stuck with something as long as we don't accept it. One other factor is at play which works against Linux, and that is that people in IT like shiny new things, and therefore something else always comes along. Hopefully this time around, that something else will be the old new thing (learning from the past, and re-discovery). One way or the other, the clock is ticking on Linux, and one of these days, it won't be as popular any more, because something else will be the new-new thing. It's the nature of this industry:

change is the only constant.

You don't have to accept anything. Don't bow to peer pressure.

[+] honkhonkpants|9 years ago|reply

So how does this relate to uprobes? I've been looking into that lately because I want frequency counts (or coverage analysis) of user space programs but without the nop-sled overhead of xray. Does dtrace supplement or replace uprobes? Or am I really just confused?

[+] cthalupa|9 years ago|reply

DTrace is a Solaris (and BSD/OS X) tracing tool that never quite made it to Linux (there are some attempted ports, but none of them really caught on). BPF, together with frontends like BCC, gives you the same sort of functionality on Linux.

BPF can attach programs to uprobes and instrument around them. It builds on uprobes and does not replace them.

[+] unknown|9 years ago|reply

[deleted]

81 comments