
Linux tracing systems & how they fit together


I’ve been confused about Linux tracing systems for years. There’s strace, and ltrace, kprobes, and tracepoints, and uprobes, and ftrace, and perf, and eBPF,
and how does it all fit together and what does it all MEAN?

Last week I went to Papers We Love, and later Kamal & I hung out with
Suchakra at the École Polytechnique (where LTTng comes from), and
finally I think I understand how all these pieces fit together, more or less.

I’m actually going to leave strace out of this post (even though it’s my favorite thing) because the overhead is so high – we’re only going to talk about tracing systems that are relatively fast / low overhead.

The thing I learned last week that really helped me understand was – you can split Linux tracing systems into data sources (where the tracing data comes from), ways the data gets out of the kernel and into userspace (like “ftrace”), and tracing frontends (the tool you use to actually collect the data). In this post I’ll talk about all of these things:

Data sources:

  • kprobes
  • uprobes
  • Tracepoints
  • lttng-ust
  • USDT / dtrace probes

Mechanisms for collecting your delicious data:

  • ftrace
  • perf_events
  • eBPF
  • sysdig
  • Systemtap kernel module
  • LTTng

User frontends:

  • perf
  • Various ftrace frontends (trace-cmd, catapult, kernelshark, perf-tools)
  • The bcc frontend for eBPF
  • LTTng & SystemTap frontends

It’s still kind of complicated, but breaking it up this way really helps me understand (thanks to Brendan Gregg for suggesting this breakdown on Twitter!).

a picture version

[6 drawings summarizing what this post is about appeared here in the original post.]

What can you trace?

A few different kinds of things you might want to trace:

  • System calls
  • Linux kernel function calls (which functions in my TCP stack are being called?)
  • Userspace function calls (did malloc get called?)
  • Custom “events” that you’ve defined either in userspace or in the kernel

All of these things are possible, but it turns out the tracing landscape is actually pretty complicated.

Data sources: kprobes, tracepoints, uprobes, dtrace probes & more

Okay, let’s do data sources! This is kind of the most fun part – there are so many EXCITING PLACES you can get data about your programs.

I’m going to split these up into “probes” (kprobes/uprobes) and “tracepoints” (USDT / kernel tracepoints / lttng-ust). I think I’m not using exactly the right terminology, but there are 2 distinct ideas here that are useful to understand.

A probe is when the kernel dynamically modifies your assembly program at runtime (like, it changes the instructions) in order to enable tracing. This is super powerful (and kind of scary!) because you can enable a probe on literally any instruction in the program you’re tracing. (though dtrace probes aren’t “probes” in this sense). Kprobes and uprobes are examples of this pattern.

A tracepoint is something you compile into your program. When someone using your program wants to see when that tracepoint is hit and extract data, they can “enable” or “activate” the tracepoint to start using it. Generally a tracepoint in this sense doesn’t cause any extra overhead when it’s not activated, and is relatively low overhead when it is activated. USDT (“dtrace probes”), lttng-ust, and kernel tracepoints are all examples of this pattern.

kprobes

Next up is kprobes! What’s that? From an LWN article:

KProbes are a debugging mechanism for the Linux kernel which can also be used for monitoring events inside a production system. You can use it to weed out performance bottlenecks, log specific events, trace problems etc.

To reiterate – basically kprobes let you dynamically change the Linux kernel’s assembly code at runtime (like, insert extra assembly instructions) to trace when a given instruction is called. I usually think of kprobes as tracing Linux kernel function calls, but you can actually trace any instruction inside the kernel and inspect the registers. Weird, right?

Brendan Gregg has a kprobe script that you can use to play around with kprobes.

For example! Let’s use kprobes to spy on which files are being opened on our computer. I ran

$ sudo ./kprobe 'p:myopen do_sys_open filename=+0(%si):string' 

from the examples and right away it started printing out every file that was being opened on my computer. Neat!!!

You’ll notice that the kprobes interface by itself is a little gnarly though – like, you have to know that the filename argument to do_sys_open is in the %si register and dereference that pointer and tell the kprobes system that it’s a string.
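
Under the hood, the ./kprobe script is basically just writing strings like that into ftrace’s special files for you (more on ftrace below). Roughly – and this is a sketch, I haven’t run these exact commands – it does something like:

# register a new kprobe called "myopen" on the do_sys_open kernel function
$ sudo sh -c "echo 'p:myopen do_sys_open filename=+0(%si):string' >> /sys/kernel/debug/tracing/kprobe_events"
# turn the probe on
$ sudo sh -c "echo 1 > /sys/kernel/debug/tracing/events/kprobes/myopen/enable"
# watch the events stream by
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
# turn it off and delete it when you're done
$ sudo sh -c "echo 0 > /sys/kernel/debug/tracing/events/kprobes/myopen/enable"
$ sudo sh -c "echo '-:myopen' >> /sys/kernel/debug/tracing/kprobe_events"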

I think kprobes are useful in 3 scenarios:

  1. You’re tracing a system call. System calls all have corresponding kernel functions, like do_sys_open.
  2. You’re debugging some performance issue in the network stack or to do with file I/O, and you understand the kernel functions that are called well enough that it’s useful for you to trace them (not impossible!!! The Linux kernel is just code, after all!)
  3. You’re a kernel developer, or you’re otherwise trying to debug a kernel bug, which happens sometimes!! (I am not a kernel developer.)

uprobes

Uprobes are kind of like kprobes, except that instead of instrumenting a kernel function you’re instrumenting a userspace function (like malloc). Brendan Gregg has a good post from 2015.

My understanding of how uprobes work is:

  1. You decide you want to trace the malloc function in libc
  2. You ask the linux kernel to trace malloc for you from libc
  3. Linux goes and finds the copy of libc that’s loaded into memory (there should be just one, shared across all processes), and changes the code for malloc so that it’s traced
  4. Linux reports the data back to you somehow (we’ll talk about how “asking linux” and “getting the data back somehow” works later)

This is pretty cool! One example of a thing you can do is spy on what people are typing into their bash terminals

bork@kiwi~/c/perf-tools> sudo ./bin/uprobe 'r:bash:readline +0($retval):string' 
Tracing uprobe readline (r:readline /bin/bash:0x9a520 +0($retval):string). Ctrl-C to end. 
            bash-10482 [002] d...  1061.417373: readline: (0x42176e <- 0x49a520) arg1="hi" 
            bash-10482 [003] d...  1061.770945: readline: (0x42176e <- 0x49a520) arg1=(fault)
            bash-10717 [002] d...  1063.737625: readline: (0x42176e <- 0x49a520) arg1="hi" 
            bash-10717 [002] d...  1067.091402: readline: (0x42176e <- 0x49a520) arg1="yay" 
            bash-10717 [003] d...  1067.738581: readline: (0x42176e <- 0x49a520) arg1="wow" 
            bash-10717 [001] d...  1165.196673: readline: (0x42176e <- 0x49a520) arg1="cool-command" 
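
By the way, perf (which we’ll get to later) can also set up uprobes for you. Something like this ought to put a uprobe on malloc in libc – the library path depends on your system and I haven’t tested this exact sequence, so treat it as a sketch:

# add a uprobe on malloc in libc (the library path here is just an example)
$ sudo perf probe -x /lib/x86_64-linux-gnu/libc.so.6 --add malloc
# record every malloc call on the system for a second, then look at the results
$ sudo perf record -e probe_libc:malloc -a sleep 1
$ sudo perf script | head
# clean up the probe afterwards
$ sudo perf probe --del probe_libc:malloc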

USDT/dtrace probes

USDT stands for “Userland Statically Defined Tracing”, and “USDT probe” means the same thing as “dtrace probe” (which was surprising to me!). You might have heard of dtrace on BSD/Solaris, but you can actually also use dtrace probes on Linux, though the system is different. It’s basically a way to expose custom events. For example! Python 3 has dtrace probes, if you compile it right.

python.function.entry(str filename, str funcname, int lineno, frameptr)

This means that if you have a tool that can consume dtrace probes, (like eBPF / systemtap), and a version of Python compiled with dtrace support, you can automagically trace Python function calls. That’s really cool! (though this is a little bit of an “if” – not all Pythons are compiled with dtrace support, and the version of Python I have in Ubuntu 16.04 doesn’t seem to be)

Here’s how to tell if you have dtrace probes, from the Python docs: basically you poke around in the binary with readelf and look for a “.note.stapsdt” section (and “stapsdt” notes).

$ readelf -S ./python | grep .note.stapsdt 
[30] .note.stapsdt        NOTE         0000000000000000 00308d78 
# sometimes you need to look in the .so file instead of the binary 
$ readelf -S libpython3.3dm.so.1.0 | grep .note.stapsdt 
[29] .note.stapsdt        NOTE         0000000000000000 00365b68 
$ readelf -n ./python 

If you want to read more about dtrace you can read this paper from 2004 but I’m not actually sure what the best reference is.

kernel tracepoints

Tracepoints are also in the Linux kernel. (here’s an LWN article). The system was written by Mathieu Desnoyers (who’s from Montreal! :)). Basically there’s a TRACE_EVENT macro that lets you define tracepoints like this one (which has something to do with UDP… queue failures?):

TRACE_EVENT(udp_fail_queue_rcv_skb, 
        TP_PROTO(int rc, struct sock *sk), 
        TP_ARGS(rc, sk), 
        TP_STRUCT__entry( 
                __field(int, rc) 
                __field(__u16, lport) 
        ), 
... 

I don’t really understand how it works (I think it’s pretty involved) but basically tracepoints:

  • Are better than kprobes because they stay more constant across kernel versions (kprobes just depend on whatever code happens to be in the kernel at that time)
  • Are worse than kprobes because somebody has to write them explicitly
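
One easy way to see which tracepoints your kernel knows about (and to play with one) is through perf, which we’ll get to later:

# list the scheduler-related tracepoints
$ sudo perf list 'sched:*'
# count how many times the sched_switch tracepoint fires in one second, across all CPUs
$ sudo perf stat -e sched:sched_switch -a sleep 1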

lttng-ust

I don’t understand LTTng super well yet, but – my understanding is that the 4 things above (dtrace probes, kprobes, uprobes, and kernel tracepoints) all need to go through the kernel at some point. lttng-ust is a tracing system that lets you compile tracing probes into your programs, and all of the tracing happens in userspace. This means it’s faster, because you don’t have to context-switch into the kernel. I’ve still used LTTng 0 times so that’s mostly all I’m going to say about that.

Mechanisms for collecting your delicious delicious data

To understand the frontend tools you use to collect & analyze tracing data, it’s important to understand the fundamental mechanisms by which tracing data gets out of the kernel and into your grubby hands. Here they are (there are just 6: ftrace, perf_events, eBPF, sysdig, SystemTap, and LTTng).

Let’s start with the 3 that are actually part of the core Linux kernel: ftrace, perf_events, and eBPF.

ftrace

Those ./kprobe and ./uprobe scripts up there? Those both use ftrace to get data out of the kernel. Ftrace is a kind of janky interface which is a pain to use directly. Basically there’s a filesystem at /sys/kernel/debug/tracing/ that lets you get various tracing data out of the kernel.

The way you fundamentally interact with ftrace is:

  1. Write to files in /sys/kernel/debug/tracing/
  2. Read output from files in /sys/kernel/debug/tracing/
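
For example, here’s a minimal session with the “function” tracer (a sketch – the tracing filesystem is usually at /sys/kernel/debug/tracing, but it can be mounted elsewhere):

$ cd /sys/kernel/debug/tracing
# only trace kernel functions whose names start with tcp_
$ sudo sh -c "echo 'tcp_*' > set_ftrace_filter"
# turn on the function tracer
$ sudo sh -c "echo function > current_tracer"
# read some output, then turn tracing back off
$ sudo cat trace | head
$ sudo sh -c "echo nop > current_tracer"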

Ftrace supports:

  • kprobes
  • tracepoints
  • uprobes
  • I think that’s it.

Ftrace’s output looks like this and it’s a pain to parse and build on top of:

            bash-10482 [002] d...  1061.417373: readline: (0x42176e <- 0x49a520) arg1="hi" 
            bash-10482 [003] d...  1061.770945: readline: (0x42176e <- 0x49a520) arg1=(fault) 
            bash-10717 [002] d...  1063.737625: readline: (0x42176e <- 0x49a520) arg1="hi" 
            bash-10717 [002] d...  1067.091402: readline: (0x42176e <- 0x49a520) arg1="yay" 
            bash-10717 [003] d...  1067.738581: readline: (0x42176e <- 0x49a520) arg1="wow" 
            bash-10717 [001] d...  1165.196673: readline: (0x42176e <- 0x49a520)  

perf_events

The second way to get data out of the kernel is with the perf_event_open system call. The way this works is:

  1. You call the perf_event_open system call
  2. The kernel writes events to a ring buffer in user memory, which you can read from

As far as I can tell, the only thing you can read this way is tracepoints. This is what running sudo perf trace does (there’s a tracepoint for every system call).

eBPF

eBPF is a VERY EXCITING WAY to get data. Here’s how it works.

  1. You write an “eBPF program” (often in C, or likely you use a tool that generates that program for you).
  2. You ask the kernel to attach that eBPF program to a kprobe/uprobe/tracepoint/dtrace probe
  3. Your program writes out data to an eBPF map / ftrace / perf buffer
  4. You have your precious precious data!

eBPF is cool because it’s part of Linux (you don’t have to install any kernel modules), and you can define your own programs to do any fancy aggregation you want, so it’s really powerful. You usually use it with the bcc frontend, which we’ll talk about a bit later. It’s only available on newer kernels though (the kernel version you need depends on which data sources you want to attach your eBPF programs to).

Different eBPF features are available at different kernel versions – there’s a slide with an awesome summary of which features you get at which kernel version.

sysdig

Sysdig is a kernel module + tracing system. It lets you trace system calls and maybe some other things? I find their site kind of confusing to navigate, but I think this file contains the list of all the events sysdig supports. So it will tell you what files are being opened but not the weird internal details of what your TCP stack is doing.
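
For what it’s worth, the sysdig command line uses a tcpdump-ish filter syntax. Something like this should show file opens (I haven’t verified these exact invocations):

# show every open() event on the system, with the process that did it
$ sudo sysdig evt.type=open
# the same thing, but only for one process name
$ sudo sysdig evt.type=open and proc.name=cat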

systemtap

I’m a little fuzzy on how SystemTap works, so we’re going to go from this architecture document:

  1. You decide you want to trace a kprobe
  2. You write a “systemtap program” & compile it into a kernel module
  3. That kernel module, when inserted, creates kprobes that call code from your kernel module when triggered (it calls register_kprobe)
  4. Your kernel module prints output to userspace (using relayfs or something)

SystemTap supports:

  • tracepoints
  • kprobes
  • uprobes
  • USDT

Basically lots of things! There are some more useful words about SystemTap in “Choosing a Linux Tracer”.
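
A SystemTap one-liner looks something like this (a sketch – I haven’t run it, and you need systemtap plus debug info set up for your kernel):

# print the name of every process that calls do_sys_open, until you hit Ctrl-C
$ sudo stap -e 'probe kernel.function("do_sys_open") { printf("%s opened a file\n", execname()) }'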

LTTng

LTTng (Linux Trace Toolkit: next generation) is from Montreal (a lab at École Polytechnique)!! which makes me super happy (Montreal!!). I saw an AMAZING demo of a tool called Trace Compass the other day that reads data that comes from LTTng. Basically it was able to show all the sched_switch transitions between programs and system calls when running tar -xzf somefile.tar.gz, and you could really see exactly what was happening in a super clear way.

The downside of LTTng (like SystemTap) is that you have to install a kernel module for the kernel parts to work. With lttng-ust everything happens in userspace and there’s no kernel module needed.
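
I haven’t actually run LTTng myself, but the kernel-tracing workflow on the command line looks roughly like this (treat it as a sketch):

$ sudo lttng create demo-session
# trace the same sched_switch tracepoint the Trace Compass demo used
$ sudo lttng enable-event --kernel sched_switch
$ sudo lttng start
$ sleep 5
$ sudo lttng stop
# print the recorded events (or open the CTF trace in Trace Compass)
$ sudo lttng view
$ sudo lttng destroy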

Frontends

Okay! Time for frontends! I’m going to categorize them by mechanism (how the data gets out of the kernel) to make it easier.

perf frontends

The only frontend here is perf, it’s simple.

perf trace will trace system calls for you, fast. That’s great and I love it. perf trace is the only one of these I actually use day to day right now. (the ftrace stuff is more powerful and also more confusing / difficult to use)
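
For example (these should work on any reasonably recent kernel with perf installed):

# strace-like output for a single command, but with way lower overhead
$ sudo perf trace ls
# or watch system calls across the whole system until you hit Ctrl-C
$ sudo perf trace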

ftrace frontends

Ftrace is a pain to use on its own and so there are various frontend tools to help you. I haven’t found the best thing to use yet but here are some starting points:

  • trace-cmd is a frontend for ftrace: you can use it to collect and display ftrace data (there’s a quick example after this list). I wrote about it a bit in this blog post and there’s an article on LWN about it
  • Catapult lets you analyze ftrace output. It was originally for Android / Chrome performance, but you can also just analyze plain ftrace with it. So far the only thing I’ve gotten it to do is graph sched_switch events, so you know exactly which processes were running at what time, and which CPU they were on. Which is pretty cool, but I don’t really have a use for it yet?
  • kernelshark consumes ftrace output but I haven’t tried it yet
  • The perf command line tool is a perf frontend and (confusingly) also a frontend for some ftrace functionality (see perf ftrace)
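
Here’s the kind of trace-cmd session I mean (a sketch, not tested exactly as written):

# record a function_graph trace of just the "ls" command (output goes to trace.dat)
$ sudo trace-cmd record -p function_graph -F ls
# pretty-print the recorded data
$ trace-cmd report | head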

eBPF frontends

The only one I know of is the bcc framework: https://github.com/iovisor/bcc. It lets you write eBPF programs, it’ll insert them into the kernel for you, and it’ll help you get the data out of the kernel so you can process it with a Python script. It’s pretty easy to get started with.

If you’re curious about the relationship between eBPF and the BPF you use in tcpdump I wrote a post about eBPF & its relationship with BPF for packet filtering the other day. I think it might be easiest though to think of them as unrelated because eBPF is so much more powerful.

bcc is a bit weird because you write a C program inside a Python program but there are a lot of examples. Kamal and I wrote a program with bcc the other day for the first time and it was pretty easy.
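
If you just want to try bcc without writing any code, it ships with a bunch of ready-made tools. The install path varies by distro (on Ubuntu the packaged versions may be named like opensnoop-bpfcc), but something like:

# which files are being opened, via a kprobe + an eBPF program
$ sudo /usr/share/bcc/tools/opensnoop
# which programs are being exec'd
$ sudo /usr/share/bcc/tools/execsnoop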

LTTng & SystemTap frontends

LTTng & SystemTap both have their own sets of tools that I don’t really understand. THAT SAID – there’s this cool graphical tool called Trace Compass that seems really powerful. It consumes a trace format called CTF (“common trace format”) that LTTng emits.

what tracing tool should I use though

Here’s kind of how I think about it right now (though you should note that I only just figured out how all this stuff fits together very recently so I’m not an expert):

  • perf trace is good, it will trace system calls with low overhead and it’s super simple, there’s not much to learn. A+.
  • For everything else, they’re, well, an investment, they take time to get used to.
  • I think playing with kprobes is a good idea (via eBPF/ftrace/systemtap/lttng/whatever, for me right now ftrace is easiest). Being able to know what’s going on in the kernel is a good superpower.
  • eBPF is only available in kernel versions above 4.4, and some features only above 4.7. I think it makes sense to invest in learning it but on older systems it won’t help you out yet
  • ftrace is kind of a huge pain, I think it would be worth it for me if I could find a good frontend tool but I haven’t managed it yet.

I hope this helped!

I’m really excited that now I (mostly) understand how all these pieces fit together – I am literally typing some of this post at a free show at the jazz festival, listening to blues and having a fun time.

Now that I know how everything fits together, I think I’ll have a much easier time navigating the landscape of tracing frontends!

RyanAdams: Intriguing list. I feel the need to try some of these out now.

#1546 – Picky (comic)

strip for June / 19 / 2017 - The Savvy Consumer (comic)

How Fox Sports laid off Tyson Winter 27 hours later, after telling him his job was safe


One of the weirdest stories to come out of Fox laying off its digital writers, editors and some social media talent as part of their long-dreaded website changes was what happened to Tyson Winter. Winter, a social media specialist who (per his LinkedIn profile) had been with Fox since January after several years at the Pac-12 Networks, tweeted Tuesday that he only found out he was let go 27 hours later, thanks to a vice-president not knowing who he was.

A non-Winter source tells Awful Announcing that Winter attended a meeting for the remaining staff Tuesday and was told he was safe, only to later learn he had been laid off. A second source would later confirm the following retelling of how this played out.

“Basically, Jamie [Horowitz, Fox Sports’ president of national networks and head of digital] showed up today for a big all-hands meeting with everybody left,” the first source said Tuesday. “Pretty much, he acknowledged the [Awful Announcing] article and said he was sad to hear, or bummed to hear, or something like that, disappointed to hear that morale had been low and that people had been stressed. …He defended the [January] meeting [where Horowitz talked about shifting from original written content to written content relaying what TV personalities said, or ghostwritten content under TV personalities’ bylines] , and said his only other option would have been to lie to everyone about what the future was.”

“So Tyson went through that whole meeting. And they said ‘Anybody in this room is safe, you have nothing to worry about.’ Jamie even said for as many people that are down, and as many frowns there are in this room, we want to fill this room back up with smiles, or something along those lines.”

“So Tyson tries to log on to his computer, and it’s not working. He calls IT, and they say ‘Your account’s been deactivated, you should talk to your manager.’ Well, the funny thing is his manager had been fired. So he then gets an email on his personal email from HR saying ‘I need to speak to you immediately.’ So they call him and they basically say, ‘So sorry, we regret to inform you, we were supposed to tell you this yesterday, and there was a miscommunication, and we didn’t.'”

The source had strong criticism for new vice-president Michael Bucklin and his role in Winter’s poorly-handled dismissal, saying Bucklin is only there to please Horowitz and Charlie Dixon (executive vice president of content for FS1 and FS2).

“The new VP who’s been put in charge, apparently his objective is to keep Jamie happy, keep Charlie happy, and ignore what’s happening below you. It’s all about managing up, not managing staff.”

The source said morale amongst the remaining employees remains terrible, both at Fox digital and on the FS1 lot, with many of those who avoided the cuts looking for other jobs. And that’s interesting, because the supposed emphasis here was a shift to digital video (as you can see from their web redesign), but they’ve cut plenty of people doing various social video functions and apparently created a morale disaster with the rest. The way they dealt with these layoffs has also produced major blowback, with this source saying they only spoke up because of how Fox has handled this.

“The business side I get, I don’t care, but when people are getting the way that people are getting treated, that’s when there are consequences for action,” the source said. “It has been a clusterfuck on every level from the second that Pete [Vlastelica, the former Fox Sports executive vice-president of digital] walked out the door.”


RyanAdams: Oops.

A’s owners say moving four miles to new stadium will create $3B in economic benefits, I’m done


Okay, forget considering building a stadium on a site that’s really too small to fit one, because the Oakland A’s just trumped their own crazy yesterday, releasing a report that insists that a new stadium would generate $3.05 billion in economic impact over ten years for Oakland. That’s billion. With a B:

The study, conducted by the Bay Area Council Economic Institute, also concluded that a new stadium would boost annual attendance by roughly one million, up from the 1.5 million or so that the A’s drew in 2016. Building a new ballpark would also produce about 2,000 construction jobs…

The $3.05 billion in economic benefit is broken down into $768 million from construction and related spending, $1.54 billion from game-day spending and $742 million from ballpark operations.

Okay, let’s break this down some. The $768 million from construction is uncontroversial, since a stadium is going to cost close to that much to build, though whether it should be counted as a net win for Oakland is another question. (Are all the vendors providing steel for the stadium going to go out and spend their earnings at the local Safeway?) The rest of the study, though — which features a lovely opening photo of Ryon Healy about to be mobbed by teammates excited about all the new economic benefits he’s bringing to Oakland — relies on the assumption that “gate receipts grow by approximately 2x in the first year of operations of a new stadium while concession spending increases at an even higher rate.”

There are a couple of problems with this. One is, obviously, the substitution effect: If all those new fans would have been spending that money elsewhere in Oakland anyway, it’s not a net benefit to the city. The Bay Area Council Economic Institute study takes this into account by discounting new spending by 20% — a number they apparently got by seeing that a previous study for the Detroit Tigers used 25% as the amount of spending that was substituted, then arbitrarily reduced that to 20% because Oakland is likely to have more out-of-towners attending games.

Then there’s that doubling of gate receipts thing. The study actually calculates that MLB teams saw attendance increase an average of 40% in the first ten years of a new stadium — though it puts its finger on the scale by not counting the New York Yankees and Mets for “lack of applicability,” for which read “they actually saw attendance go down in their new stadiums, that’s not going to help our numbers.” I’m not going to re-run their calculations right now, but suffice to say that that’s a lot to expect from a stadium honeymoon period, and isn’t likely to be sustainable over the long term (cf. the Cleveland Indians, who sold out several seasons in a row after their new stadium opened and now can’t draw flies despite a shiny new league championship trophy).

So, in short: If the A’s move four miles down the road from the Coliseum to new digs, some unknown number of additional people might go to A’s games, and some unknown number of them might have otherwise spent that money eating dinner in San Leandro, and some unknown amount of that cash might end up getting recirculated in the Oakland economy rather than just getting pocketed by the A’s owners and players, and — hey, why don’t we just call it an even three billion dollars? And don’t be bothered by the fact that nobody who has ever tried to find evidence of one of these stadium-sparked public windfalls has ever found any — $3 billion! With a B!

Anyway, here’s hoping that the A’s owners are just using the clear plastic binder gambit because it’s available to them, and they’ll end up actually paying for stadium construction and land acquisition and accepting a site that works for the city of Oakland and not just for them, like they say they will. If that happens, I will forgive them all their silly economic projections, though I make no promises to stop making fun of them for them.

RyanAdams: Sports math is a bit questionable at times.

#1541 – Solved (comic)