TOC:

  1. Intro
  2. Userspace profiling: gprof, gcov
  3. Userspace profiling: Valgrind
  4. Kernel profiling: Intro
  5. Kernel profiling: ftrace
  6. Kernel profiling: perf
  7. Kernel profiling: SystemTap
  8. Various tools

Quick reminder. I’m trying to understand why my nice little program is reading from RAID so slow. I’m using various profiling tools to inspect my block_hasher and get a hands-on experience with profilers.

So, we reviewed userspace profiling and arrived to main theme of my article series which is kernel profiling: tools, profilers, facilities and methods of profiling.

In this article I’ll tell you about facilities that kernel provides to profile it.

Intro

Sometimes when you’re facing really hard performance problem it’s not always enough to profile your applicatiion. As we saw while profiling our application with gprof, gcov and Valgrind problem is somewhere underneath our application — something is holding pread in long I/O wait cycles.

How to trace system call is not clear at first sight — there are various kernel profilers, all of them works in it’s own way, requires unique configuration, methods, analyzis and so on. Yes, it’s really hard to figure it out. Being the biggest open-source project developed by massive community, Linux absorbed several different and sometimes conflicting profiling facilities. And it’s in some sense getting even worse — while some profiles tend to merge (ftrace and perf) other tools emerge — last example is ktap.

To undertand that bazaar lets start from the bottom — what does kernel have that makes it able profile it? Basically there are only 3 kernel facilities that enable profiling:

  • Kernel tracepoints
  • Kernel probes
  • Perf events

It’s features that gives us access to kernel internals. By using them we can measure kernel functions execution, trace access to devices, analyze CPU states and so on.

This very features are really awkward for direct use and accessible only from kernel. Well, if you really want you can write your own Linux kernel module that will utilize these facilities for your custom use, but it’s pretty much pointless. That’s why people have created a few really good general purpose profilers:

  • ftrace
  • perf
  • SystemTap
  • ktap

All of them are based on that features and will be discussed later more thoroughly, but now let’s review features itself.

Kernel tracepoints

Kernel Tracepoints — it’s framework for tracing kernel function via static instrumenting1.

Tracepoint — it’s a place in code where you can bind your callback. Tracepoints can be disabled (no callback) and enabled (has callback). There might be several callbacks though it’s still lightweight — when callback disabled it’s actually looks like if (unlikely(tracepoint.enabled)).

Tracepoint output is written in ring buffer that is export through debugfs at /sys/kernel/debug/tracing/trace. There are also whole tree of traceable events at /sys/kernel/debug/tracing/events that exports control files to enable/disable particular event.

Despite of it’s name tracepoints are base for event-based profiling because besides traceing you can do anything in callback, e.g. timestamping and measuring resource usage. Linux kernel is already (since 2.6.28) instrumented with that tracepoints in many places. For example, __do_kmalloc:

/**
 * __do_kmalloc - allocate memory
 * @size: how many bytes of memory are required.
 * @flags: the type of memory to allocate (see kmalloc).
 * @caller: function caller for debug tracking of the caller
 */
static __always_inline void *__do_kmalloc(size_t size, gfp_t flags,
                                          unsigned long caller)
{
        struct kmem_cache *cachep;
        void *ret;

        /* If you want to save a few bytes .text space: replace
         * __ with kmem_.
         * Then kmalloc uses the uninlined functions instead of the inline
         * functions.
         */
        cachep = kmalloc_slab(size, flags);
        if (unlikely(ZERO_OR_NULL_PTR(cachep)))
                return cachep;
        ret = slab_alloc(cachep, flags, caller);

        trace_kmalloc(caller, ret,
                      size, cachep->size, flags);

        return ret;
}

trace_kmalloc is tracepoint. There are many others in other critical parts of kernel such as schedulers, block I/O, networking and even interrupt handlers. All of them are used by most profilers because they have minimal overhead, fires by event and saves you from modifying kernel.

Ok, so by now you may be eager to insert in all of your kernel modules and profile it to hell, but BEWARE. If you want to add tracepoints you must have a lot of patience and skills because writing your own tracepoints is really ugly and awkward. You can see examples at samples/trace_events/. Under the hood tracepoints are C macros black magic that only bold and fearless persons could understand.

And even if you do all that crazy macro declarations and struct definitions it might just simply not work at all if you have CONFIG_MODULE_SIG=y and don’t sign module. It might seem kinda strange configuration but in reality it’s a default for all major distributions including Fedora and Ubuntu. That said, after 9 circles of hell, you will end up with nothing.

So, just remember:

USE ONLY EXISTING TRACEPOINTS IN KERNEL, DO NOT CREATE YOUR OWN.

Now I’m gonna explain why it’s happening. So if you tired of tracepoints just skip to read about kprobes.

Ok, so some time ago while preparing kernel 3.12 these code was added:

static int tracepoint_module_coming(struct module *mod)
{
          struct tp_module *tp_mod, *iter;
          int ret = 0;

          /*
           * We skip modules that tain the kernel, especially those with different
           * module header (for forced load), to make sure we don't cause a crash.
           */
          if (mod->taints)
                  return 0;

If module is tainted we will NOT write ANY tracepoints. Later it became more adequate

/*
 * We skip modules that taint the kernel, especially those with different
 * module headers (for forced load), to make sure we don't cause a crash.
 * Staging and out-of-tree GPL modules are fine.
 */
if (mod->taints & ~((1 << TAINT_OOT_MODULE) | (1 << TAINT_CRAP)))
        return 0;

Like, ok it may be out-of-tree (TAINT_OOT_MODULE) or staginig(TAINT_CRAP) but any others are no-no.

Seems legit, right? Now what whould you think will be if your kernel is compiled with CONFIG_MODULE_SIG enabled and your pretty module is not signed? Well, module loader will set TAINT_FORCES_MODULE flag for it. And now your pretty module will never pass the condition in tracepoint_module_coming and never show you any tracepoints output. And as I said earlier this stupid option is set for all major distributions including Fedora and Ubuntu since kernel version 3.1.

If you think — «Well, lets sign goddamn module!» — you’re wrong. Modules must be signed with kernel private key that are held by your Linux distro vendor and, of course, not available for you.

The whole terrifying story is available in lkml 1, 2.

As for me I just cite my favorite thing from Steven Rostedt (ftrace maintainer and one of the tracepoints developer):

> OK, this IS a major bug and needs to be fixed. This explains a couple
> of reports I received about tracepoints not working, and I never
> figured out why. Basically, they even did this:
> 
> 
>    trace_printk("before tracepoint\n");
>    trace_some_trace_point();
>    trace_printk("after tracepoint\n");
>
> Enabled the tracepoint (it shows up as enabled and working in the
> tools, but not the trace), but in the trace they would get:
>
>    before tracepoint
>    after tracepoint
>
> and never get the actual tracepoint. But as they were debugging
> something else, it was just thought that this was their bug. But it
> baffled me to why that tracepoint wasn't working even though nothing in
> the dmesg had any errors about tracepoints.
> 
> Well, this now explains it. If you compile a kernel with the following
> options:
> 
> CONFIG_MODULE_SIG=y
> # CONFIG_MODULE_SIG_FORCE is not set
> # CONFIG_MODULE_SIG_ALL is not set
> 
> You now just disabled (silently) all tracepoints in modules. WITH NO
> FREAKING ERROR MESSAGE!!!
> 
> The tracepoints will show up in /sys/kernel/debug/tracing/events, they
> will show up in perf list, you can enable them in either perf or the
> debugfs, but they will never actually be executed. You will just get
> silence even though everything appeared to be working just fine.

Recap:

  • Kernel tracepoints is a lightweight tracing and profiling facility.
  • Linux kernel is heavy instrumented with tracepoints that are used by the most profilers and especially by perf and ftrace.
  • Tracepoints are C marcos black magic and almost impossible for usage in kernel modules.
  • It will NOT work in your LKM if:
    • Kernel version >=3.1 (might be fixed in 3.15)
    • CONFIG_MODULE_SIG=y
    • Your module is not signed with kernel private key.

Kernel probes

Kernel probes is a dynamic debugging and profiling mechanism that allows you to break into kernel code, invoke your custom function called probe and return everything back.

Basically its done by writing kernel module where you register handler for some address or symbol in kernel code. Also according to definition of struct kprobe you can pass offset from address but I’m not sure about that. In your registered handler you can do really anything — write to log, to some buffer exported via sysfs, measure time and infinite amount of possibilities to do. And that’s really nifty contrary to tracepoints where you can only read logs from debugfs.

There are 3 types of probes:

  • kprobes — basic probe that allows you to break into any kernel address.
  • jprobes — jump probes that inserted in the start of function and gives you handy access to function arguments; it’s something like proxy-function.
  • kretprobes — return probes that inserted in the return point of function.

Last 2 types are based on basic kprobes.

All of this generally works like this:

  • We register probe on some address A.
  • kprobe subsystem finds A.
  • kprobe copies instruction at address A.
  • kprobe replaces instruction at A for breakpoint (int 3 in case of x86).
  • Now when execution hits probed address A, CPU trap occurs.
  • Registers are saved.
  • CPU transfers control to kprobes via notifier_call_chain mechanism.
  • And finally kprobes invokes our handler.
  • After all we restore registers, copies back instruction at A and continues execution.

Our handler usually gets as an argument adress where breakpoint happened and registers values in pt_args structures. kprobes handler prototype:

    typedef int (*kprobe_break_handler_t) (struct kprobe *, struct pt_regs *);

In most cases except debugging this info is useless because we have jprobes. jprobes handler has exactly the same prototype as and intercepting function. For example, this is handler for do_fork:

    /* Proxy routine having the same arguments as actual do_fork() routine */
    static long jdo_fork(unsigned long clone_flags, unsigned long stack_start,
              struct pt_regs *regs, unsigned long stack_size,
              int __user *parent_tidptr, int __user *child_tidptr)

Also jprobes doesn’t causes interrupts because it works with help of setjmp/longjmp that are much more lightweight.

And finally the most convenient tool for profiling are kretprobes. It allows you to register 2 handlers — one to invoke on function start and the other to invoke in the end. But really cool feature is that it allows you to save state between those 2 calls, like timestamp or counters.

Instead of thousand words — look at absolutely astonishing samples at samples/kprobes.

Recap:

  • kprobes is a beatiful hack for dynamic debugging, tracing and profiling.
  • It’s a fundamental kernel feature for non-invasive profiling.

Perf events

perf_events is an interface for hardware metrics implemented in PMU (Performance Monitoring Unit) which is part of CPU.

Thanks to perf_events you can easily ask kernel to show you L1 cache misses count regardless of what architecture you are on — x86 or ARM. What CPUs are supported by perf are listed here.

In addition to that perf included various kernel metrics like software context swithes count (PERF_COUNT_SW_CONTEXT_SWITCHES).

And in addition to that perf included tracepoint support via ftrace.

To access perf_events there is a special syscall perf_event_open. You are passing the type of event (hardware, kernel, tracepoint) and so called config, where you specifies what exactly you want depending on type. It’s gonna be function name in case of tracepoint, some CPU metric in case of hardware and so on.

And on top of that there are a whole lot of stuff like event groups, filters, sampling, various output formats and others. And all of that are constantly breaking3, that’s why the only thing you can ask for perf_events is special perf utility — the only userspace utility that is a part of kernel tree.

_perf_events and all things related to it spreads as a plague in kernel and now ftrace is going to be part of perf (1, 2). Some people overreacting on perf related things though it’s useless because perf is developed by kernel big fishes — Info Molnar4 and Peter Zijlstra.

I really can’t tell anything more about perf_events in isolation of perf, so here I finish.

Summary

There are a few Linux kernel features that enables profiling:

  1. tracepoints
  2. kprobes
  3. perf_events

All Linux kernel profilers uses combinations of that features, read details in article for particular profiler.

To read


  1. Tracepoints are improvement of early feature called kernel markers.

  2. Namely in commit b75ef8b44b1cb95f5a26484b0e2fe37a63b12b44

  3. And that’s indended behaviour. Kernel ABI in no sense stable, API is.

  4. Author of current default O(1) process scheduler CFS — Completely Fair Scheduler.