Intro to intro

Initially I wanted to write a single article reviewing Linux profiling tools, but being a very curious person, I let it grow out of hand. So I decided to write a series of articles that would be interesting from a technical point of view without being as broad as a book. So now, please welcome, a whole sequence of articles.


  1. Intro
  2. Userspace profiling: gprof, gcov
  3. Userspace profiling: Valgrind
  4. Kernel profiling: Intro
  5. Kernel profiling: ftrace
  6. Kernel profiling: perf
  7. Kernel profiling: SystemTap
  8. Various tools


Profiling is the dynamic analysis of software: gathering various metrics and computing statistics from them. Usually you profile to analyze performance, though that is not the only use case; for example, there is research on profiling for energy consumption analysis.

Do not confuse profiling with tracing. Tracing records the steps a program takes at runtime so that you can debug it; you are not gathering any metrics.

Also, don’t confuse profiling with benchmarking. Benchmarking is all about marketing. You run some predefined procedure to get a couple of numbers that you can print in your marketing brochures.

A profiler is a program that does profiling.

A profile is the result of profiling: statistics computed from the gathered metrics.

There are a lot of metrics that a profiler can gather and analyze. I won’t list them all, but I’ll try to arrange them into a hierarchy:

  • Time metrics
    • Program/function runtime
    • I/O latency
  • Space metrics
    • Memory usage
    • Open files
    • Bandwidth
  • Code metrics
    • Call graph
    • Function hit count
    • Loop depth
  • Hardware metrics
    • CPU cache hit/miss ratio
    • Interrupt count

A variety of metrics implies a variety of methods to gather them. And I have a beautiful hierarchy for that, yeah:

  • Invasive profiling — changing profiled code
    • Source code instrumentation
    • Static binary instrumentation
    • Dynamic binary instrumentation
  • Non-invasive profiling — without changing any code
    • Sampling
    • Event-based
    • Emulation

(Those are all the methods I know. If you come up with another, feel free to contact me.)

A quick review of the methods.

Source code instrumentation is the simplest one. If you have the source code, you can add special profiling calls to every function (not manually, of course) and then run your program. The profiling calls trace the call graph and can also measure time spent in functions, branch probabilities, and a lot of other things. But oftentimes you don’t have the source code. And that makes me a saaaaad panda.
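To make the idea concrete, here is a minimal sketch of source-level instrumentation in Python. It is only an illustration: the decorator name `instrumented`, the `call_counts` and `total_seconds` tables, and the toy `checksum` and `hash_blocks` functions are all made up for this example and have nothing to do with how gprof or gcov work internally.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-process profile store: call count and total time per function.
call_counts = defaultdict(int)
total_seconds = defaultdict(float)


def instrumented(func):
    """Wrap func so that every call is counted and timed."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if func raises, so failed calls are still counted.
            call_counts[func.__name__] += 1
            total_seconds[func.__name__] += time.perf_counter() - start
    return wrapper


@instrumented
def checksum(block):
    # Stand-in workload, not the real block_hasher checksum.
    return sum(block) % 65536


@instrumented
def hash_blocks(blocks):
    return [checksum(block) for block in blocks]


if __name__ == "__main__":
    hash_blocks([bytes(range(256))] * 4)
    for name in call_counts:
        print(f"{name}: {call_counts[name]} calls, {total_seconds[name]:.6f} s")
```

The profiled code has to be changed (here, by adding a decorator to each function), which is exactly what makes this method invasive.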

Binary instrumentation is what you would guess: you modify the program’s binary image, either on disk (program.exe) or in memory. This is what reverse engineers love to do. To research some critical commercial software or to analyze malware, they instrument the binary and analyze the program’s behaviour. If you’re interested in this, please get in touch with my good friend and uni groupmate Dima Evdokimov (@evdokimovds); he is the research director at Digital Security. He is really into this topic (see, for example, DBI in informational security (in Russian)).

Anyway, binary instrumentation is also really useful in profiling: many modern tools are built on top of binary instrumentation ideas (SystemTap, ktap, DTrace).

OK, so sometimes you can’t instrument even the binary code, e.g. when you’re profiling an OS kernel or some pretty complicated system consisting of many tightly coupled modules that won’t work after instrumentation. That’s why you have non-invasive profiling.

Sampling is the first natural idea that you can come up with when you can’t modify any code. The point is that the profiler periodically reads CPU registers (e.g. the PSW) and analyzes what is going on. By the way, this is the only reasonable way to get hardware metrics: by periodically polling the PMU (performance monitoring unit).
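The sampling idea can be sketched in a few lines of Python. This is only an analogy, under loud assumptions: instead of CPU registers, a helper thread periodically looks at which function the profiled thread is executing, using the CPython-specific `sys._current_frames()`. The names `sample_thread`, `busy_loop` and `profile` are invented for this example.

```python
import collections
import sys
import threading
import time


def sample_thread(target_id, interval, stop, hits):
    """Every `interval` seconds, record the function the target thread is in."""
    while not stop.wait(interval):
        # CPython-specific: maps thread id -> that thread's current frame.
        frame = sys._current_frames().get(target_id)
        if frame is not None:
            hits[frame.f_code.co_name] += 1


def busy_loop(seconds):
    # Stand-in workload that burns CPU for roughly `seconds`.
    deadline = time.perf_counter() + seconds
    iterations = 0
    while time.perf_counter() < deadline:
        iterations += 1
    return iterations


def profile(func, *args, interval=0.001):
    """Run func in the calling thread while a sampler thread observes it."""
    hits = collections.Counter()
    stop = threading.Event()
    sampler = threading.Thread(
        target=sample_thread,
        args=(threading.get_ident(), interval, stop, hits),
    )
    sampler.start()
    try:
        func(*args)
    finally:
        stop.set()
        sampler.join()
    return hits


if __name__ == "__main__":
    for name, count in profile(busy_loop, 0.3).most_common():
        print(f"{name}: {count} samples")
```

Note that `busy_loop` itself is not modified in any way. The price is precision: the result is a statistical estimate, and anything shorter than the sampling interval can be missed entirely.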

Event-based profiling gathers events that must be prepared in advance by the vendor of whatever you are profiling. Examples are inotify, kernel tracepoints in Linux, and VTune events.
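As a rough analogy (an assumption of mine, not one of the tools listed above), the Python interpreter also ships with predefined events: `sys.setprofile` installs a callback that the runtime invokes on every function call and return. The sketch below counts "call" events; `count_calls`, `leaf` and `work` are hypothetical names used only for illustration.

```python
import collections
import sys


def count_calls(func, *args):
    """Run func while counting the interpreter's 'call' events per function."""
    counts = collections.Counter()

    def on_event(frame, event, arg):
        # The interpreter also reports 'return', 'c_call', etc.; we ignore those.
        if event == "call":
            counts[frame.f_code.co_name] += 1

    sys.setprofile(on_event)
    try:
        func(*args)
    finally:
        sys.setprofile(None)  # always remove the hook
    return counts


def leaf(x):
    return x * 2


def work(n):
    return [leaf(i) for i in range(n)]


if __name__ == "__main__":
    print(count_calls(work, 5))
```

The profiler here only listens: the set of available events is fixed by whoever built the runtime, which is the same trade-off you get with kernel tracepoints.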

And finally, emulation is just running your program in an isolated environment such as a virtual machine or QEMU, which gives you full control over program execution but distorts its behaviour.

Problem definition

I’m a big fan of learning from real-world examples instead of mindlessly reading manuals. That’s why I’ll define a problem and try to solve it using profiling.

That said, I have a nice little program that checks data integrity on a given block device. Simply put, it reads data blocks in multiple threads and computes checksums along with bandwidth. Here are the sources.

So, I use that utility to check my 8-disk RAID 0 (standard Linux mdraid). This is how I run the read:

./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000

1000 blocks of size 1 MiB for each of 10 threads.

block_hasher also computes bandwidth by simply dividing the amount of data read by the thread’s running time. And this is the bandwidth I got:

[root@simplex block_hasher]# cat bad.out 
T06: 57.12 MB/s c86253f827c0e40a056d2afc7d6605c291e57400
T08: 56.72 MB/s 9364a42836daa9beadf02c15777b3e1779f57b00
T04: 54.82 MB/s d0d7c3e2faed39d83ea25e468b5714bbfe23e200
T00: 53.06 MB/s c32caf8e5bdebeb2ffa73707e61fad50a751e800
T02: 53.00 MB/s 34a7495fe2ccaac4afee0e7460d9dff051701900
T07: 29.93 MB/s 95b3dc919fc4d61548a3b0737dd2ab03a0bab400
T03: 29.93 MB/s c1228ce6d4920e3bc101f1874bd5beeeb25ec600
T01: 29.89 MB/s 63d484d0fc2456c9a3c18d1d0ef43d60957d1200
T05: 29.89 MB/s 5c229e2fe168fb60a0d56b22f6eaa8fc6675d700
T09: 29.88 MB/s f6eb529ee5b59824a657fb8de43c8c6d3e29cb00
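For reference, the per-thread figure is nothing more than bytes read divided by elapsed time. Here is a minimal Python sketch of that calculation. It is not the block_hasher source: the function name `read_bandwidth` is hypothetical, and I am assuming the unit is MiB (1048576 bytes) per second.

```python
import io
import time

MIB = 1024 * 1024


def read_bandwidth(stream, block_size, block_count):
    """Read up to block_count blocks; return (bytes_read, MiB per second)."""
    start = time.perf_counter()
    bytes_read = 0
    for _ in range(block_count):
        data = stream.read(block_size)
        if not data:  # end of stream
            break
        bytes_read += len(data)
    # Guard against a zero interval on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return bytes_read, bytes_read / MIB / elapsed


if __name__ == "__main__":
    # An in-memory stream stands in for the block device.
    fake_device = io.BytesIO(bytes(8 * MIB))
    size, speed = read_bandwidth(fake_device, MIB, 8)
    print(f"read {size} bytes at {speed:.2f} MiB/s")
```

Because the elapsed time covers the whole life of the reading loop, any time a thread spends waiting for I/O lowers its number directly.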

If you sum the bandwidth of all threads, you’ll get the total bandwidth for the whole RAID.

[root@simplex block_hasher]# cut -f2 -d' ' bad.out |  paste -sd + | bc

Namely, 424.24 MB/s, which is pretty bad. In theory, you can get[1]:

Speed = <IOPS from 1 disk> * <block size> * <disks count> 

180 * 1048576 * 8 = 1509949440 Bytes/s = 1.5 GB/s

In real life you’ll get something around 1 GB/s.
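The arithmetic above is easy to double-check. The sketch below repeats both calculations in Python: it sums the second field of each line of bad.out (the same thing the cut, paste and bc pipeline does) and evaluates the theoretical formula. The function names are mine, and the 180 IOPS per disk figure is simply the value used in the formula above.

```python
# Contents of bad.out, copied from the listing above.
BAD_OUT = """\
T06: 57.12 MB/s c86253f827c0e40a056d2afc7d6605c291e57400
T08: 56.72 MB/s 9364a42836daa9beadf02c15777b3e1779f57b00
T04: 54.82 MB/s d0d7c3e2faed39d83ea25e468b5714bbfe23e200
T00: 53.06 MB/s c32caf8e5bdebeb2ffa73707e61fad50a751e800
T02: 53.00 MB/s 34a7495fe2ccaac4afee0e7460d9dff051701900
T07: 29.93 MB/s 95b3dc919fc4d61548a3b0737dd2ab03a0bab400
T03: 29.93 MB/s c1228ce6d4920e3bc101f1874bd5beeeb25ec600
T01: 29.89 MB/s 63d484d0fc2456c9a3c18d1d0ef43d60957d1200
T05: 29.89 MB/s 5c229e2fe168fb60a0d56b22f6eaa8fc6675d700
T09: 29.88 MB/s f6eb529ee5b59824a657fb8de43c8c6d3e29cb00
"""

MIB = 1024 * 1024


def total_bandwidth(text):
    """Sum the second space-separated field of every non-empty line."""
    return sum(
        float(line.split(" ")[1])
        for line in text.splitlines()
        if line.strip()
    )


def theoretical_bytes_per_second(iops_per_disk, block_size, disk_count):
    """Speed = IOPS from 1 disk * block size * disk count (RAID 0 only)."""
    return iops_per_disk * block_size * disk_count


measured = total_bandwidth(BAD_OUT)
ideal = theoretical_bytes_per_second(180, 1048576, 8)

print(f"measured total: {measured:.2f} MB/s")  # 424.24
print(f"theoretical: {ideal} bytes/s")         # 1509949440
print(f"theoretical: {ideal // MIB} MiB/s")    # 1440
```

Either way you count megabytes, the measured total is well under a third of the theoretical figure.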

To determine why it is slow, we’ll use profilers. We’ll profile block_hasher as well as everything below it, including the Linux kernel.

In this series of articles I’ll try to review the following profilers:

  • gprof
  • gcov
  • Valgrind
  • perf
  • SystemTap
  • ktap
  • VTune
  • Block devices related tools: blktrace, etc.


  1. This trivial formula assumes that block read time is the same for any block size, which in fact it is not. It also applies only to RAID level 0.