TOC:

  1. Intro
  2. Userspace profiling: gprof, gcov
  3. Userspace profiling: Valgrind
  4. Kernel profiling: Intro
  5. Kernel profiling: ftrace
  6. Kernel profiling: perf
  7. Kernel profiling: SystemTap
  8. Various tools

Quick reminder: I’m trying to understand why my nice little program reads from RAID so slowly. I’m using various profiling tools to inspect my block_hasher and get hands-on experience with profilers.

Valgrind

Contrary to popular belief, Valgrind is not a single tool, but a suite of tools, with Memcheck being the default one. At the time of writing, Valgrind consists of:

  • Memcheck — a detector of memory management errors.
  • Cachegrind — a CPU cache access profiler.
  • Massif — a sampling heap profiler.
  • Helgrind — a race condition detector.
  • DRD — a tool for detecting errors in multithreaded applications.

Plus there are unofficial tools that are not included in Valgrind and are distributed as patches.

Even though some of the tools are not profilers per se (e.g. Memcheck is more like a debugger), I will use all of them, simply because they are incredibly useful for every single programmer.

We don’t need to recompile or modify our program in any way, because Valgrind tools use emulation as their profiling method. All of these tools share a common infrastructure that emulates the application runtime — memory management functions, CPU caches, threading primitives, etc. That’s the environment in which our program executes and gets analyzed by Valgrind.

Now let’s look at what Valgrind can do.

Memcheck

Ok, so Memcheck is a memory error detector. Though it’s not a profiler, it’s one of the most useful tools in a programmer’s toolbox.

Let’s launch our hasher under Memcheck

[root@simplex block_hasher]# valgrind --leak-check=full ./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000
==4323== Memcheck, a memory error detector
==4323== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==4323== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==4323== Command: ./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000
==4323== 
==4323== 
==4323== HEAP SUMMARY:
==4323==     in use at exit: 16 bytes in 1 blocks
==4323==   total heap usage: 43 allocs, 42 frees, 10,491,624 bytes allocated
==4323== 
==4323== LEAK SUMMARY:
==4323==    definitely lost: 0 bytes in 0 blocks
==4323==    indirectly lost: 0 bytes in 0 blocks
==4323==      possibly lost: 0 bytes in 0 blocks
==4323==    still reachable: 16 bytes in 1 blocks
==4323==         suppressed: 0 bytes in 0 blocks
==4323== Reachable blocks (those to which a pointer was found) are not shown.
==4323== To see them, rerun with: --leak-check=full --show-reachable=yes
==4323== 
==4323== For counts of detected and suppressed errors, rerun with: -v
==4323== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 6 from 6)

I won’t explain what definitely lost, indirectly lost and the rest mean — that’s what the documentation is for.

From the Memcheck report we can see that there are no errors, except for a little leak: 1 block is still reachable. And from the message

total heap usage: 43 allocs, 42 frees, 10,491,624 bytes allocated

it follows that somewhere I have forgotten to call free. And that’s true: in bdev_open I allocate a struct for the block device, but bdev_close doesn’t free it. By the way, it’s interesting that Memcheck reports a 16-byte loss, while block_device holds an int and an off_t, which should occupy 4 + 8 = 12 bytes. Where did the other 4 bytes come from? Structs are 8-byte aligned (on a 64-bit system), so the int field is padded with 4 bytes.
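
To make the padding point concrete, here is a minimal sketch of such a layout (the field names are my guess, not necessarily the ones used in block_hasher):

#include <stdio.h>
#include <sys/types.h>

/* Hypothetical layout: an int file descriptor followed by an off_t offset.
 * On a typical 64-bit Linux system the compiler inserts 4 bytes of padding
 * after fd so that offset is 8-byte aligned, giving sizeof == 16. */
struct block_device {
    int   fd;       /* 4 bytes            */
                    /* 4 bytes of padding */
    off_t offset;   /* 8 bytes            */
};

int main(void)
{
    printf("sizeof(struct block_device) = %zu\n", sizeof(struct block_device));
    return 0;
}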

Anyway, I’ve fixed the memory leak:

@@ -240,6 +241,9 @@ void bdev_close( struct block_device *dev )
         perror("close");
     }
 
+    free(dev);
+    dev = NULL;
+
     return;
 }

Check:

[root@simplex block_hasher]# valgrind --leak-check=full ./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000
==15178== Memcheck, a memory error detector
==15178== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==15178== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==15178== Command: ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 1000
==15178== 
==15178== 
==15178== HEAP SUMMARY:
==15178==     in use at exit: 0 bytes in 0 blocks
==15178==   total heap usage: 43 allocs, 43 frees, 10,491,624 bytes allocated
==15178== 
==15178== All heap blocks were freed -- no leaks are possible
==15178== 
==15178== For counts of detected and suppressed errors, rerun with: -v
==15178== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 6 from 6)

A real pleasure to see.

To sum up, I’d like to say that Memcheck can do a lot — not only detect memory errors, but also explain them. It’s not enough to say «Hey, you’ve got some error here!» — to fix an error it’s better to know its cause, and Memcheck tells you. It’s so good that it’s even listed as a required skill in job postings for systems programmers.

Cachegrind

Cachegrind — a CPU cache access profiler. What amazed me is how it traces cache accesses — Cachegrind simulates the cache hierarchy; see this excerpt from the documentation:

It performs detailed simulation of the I1, D1 and L2 caches in your CPU and so can accurately pinpoint the sources of cache misses in your code.

If you think that’s easy, please spend 90 minutes reading this great article.

Let’s collect a profile!

[root@simplex block_hasher]# valgrind --tool=cachegrind ./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000
==9408== Cachegrind, a cache and branch-prediction profiler
==9408== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
==9408== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==9408== Command: ./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000
==9408== 
--9408-- warning: Unknown Intel cache config value (0xff), ignoring
--9408-- warning: L2 cache not installed, ignore LL results.
==9408== 
==9408== I   refs:      167,774,548,454
==9408== I1  misses:              1,482
==9408== LLi misses:              1,479
==9408== I1  miss rate:            0.00%
==9408== LLi miss rate:            0.00%
==9408== 
==9408== D   refs:       19,989,520,856  (15,893,212,838 rd   + 4,096,308,018 wr)
==9408== D1  misses:        163,354,097  (   163,350,059 rd   +         4,038 wr)
==9408== LLd misses:         74,749,207  (    74,745,179 rd   +         4,028 wr)
==9408== D1  miss rate:             0.8% (           1.0%     +           0.0%  )
==9408== LLd miss rate:             0.3% (           0.4%     +           0.0%  )
==9408== 
==9408== LL refs:           163,355,579  (   163,351,541 rd   +         4,038 wr)
==9408== LL misses:          74,750,686  (    74,746,658 rd   +         4,028 wr)
==9408== LL miss rate:              0.0% (           0.0%     +           0.0%  )

The first thing I look at is cache misses. But here the miss rate is less than 1%, so it can’t be the problem.

If you are asking yourself how Cachegrind can be useful, I’ll tell you one of my work stories. To accelerate one of our RAID calculation algorithms, my colleague reduced the number of multiplications at the price of more additions and a more complicated data structure. In theory it should have worked better, like Karatsuba multiplication does. But in reality it became much worse. After a few days of hard debugging we launched it under Cachegrind and saw a cache miss rate of about 80%. The extra additions caused more memory accesses and reduced locality, so we abandoned the idea.
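
As a toy illustration of the kind of problem Cachegrind makes visible (this is not the RAID code, just a classic locality example added for clarity): summing a large matrix row by row reuses each cache line it loads, while summing it column by column touches a new line on almost every access.

#include <stdlib.h>

#define N 4096

/* Row-major traversal is cache-friendly: consecutive accesses hit the
 * same cache line. */
static long sum_rows(const int *m)
{
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i * N + j];
    return s;
}

/* Column-major traversal jumps N*sizeof(int) bytes between accesses and
 * misses the D1 cache almost every time; cg_annotate on the generated
 * cachegrind.out file shows the difference per function. */
static long sum_cols(const int *m)
{
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i * N + j];
    return s;
}

int main(void)
{
    int *m = calloc((size_t)N * N, sizeof(int));
    if (m == NULL)
        return 1;
    long s = sum_rows(m) + sum_cols(m);
    free(m);
    return (int)(s & 1);
}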

Massif

Massif — a heap profiler, in the sense that it shows the dynamics of heap allocations, i.e. how much memory your application was using at any given moment.

To do that, Massif samples the heap state, generating a file that is later transformed into a report with the help of the ms_print tool.

Ok, let’s launch it:

[root@simplex block_hasher]# valgrind --tool=massif ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 100
==29856== Massif, a heap profiler
==29856== Copyright (C) 2003-2010, and GNU GPL'd, by Nicholas Nethercote
==29856== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==29856== Command: ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 100
==29856== 
==29856== 

We’ve got a massif.out.29856 file. Let’s convert it to a text profile:

[root@simplex block_hasher]# ms_print massif.out.29856 > massif.profile

The profile contains a histogram of heap allocations

    MB
10.01^::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::#
     |:                                                                 #
     |@                                                                 #::
     |@                                                                 # :
     |@                                                                 # ::
     |@                                                                 # ::
     |@                                                                 # ::@
     |@                                                                 # ::@
     |@                                                                 # ::@
     |@                                                                 # ::@
     |@                                                                 # ::@
     |@                                                                 # ::@
     |@                                                                 # ::@@
     |@                                                                 # ::@@
     |@                                                                 # ::@@
     |@                                                                 # ::@@
     |@                                                                 # ::@@
     |@                                                                 # ::@@
     |@                                                                 # ::@@
     |@                                                                 # ::@@
   0 +----------------------------------------------------------------------->Gi
     0                                                                   15.63

and a summary table of the most noticeable allocations. Example:

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 40        344,706        9,443,296        9,442,896           400            0
 41        346,448       10,491,880       10,491,472           408            0
 42        346,527       10,491,936       10,491,520           416            0
 43        346,723       10,492,056       10,491,624           432            0
 44 15,509,791,074       10,492,056       10,491,624           432            0
100.00% (10,491,624B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.94% (10,485,760B) 0x401169: thread_func (block_hasher.c:142)
| ->99.94% (10,485,760B) 0x54189CF: start_thread (in /lib64/libpthread-2.12.so)
|   ->09.99% (1,048,576B) 0x6BDC6FE: ???
|   |
|   ->09.99% (1,048,576B) 0x7FDE6FE: ???
|   |
|   ->09.99% (1,048,576B) 0x75DD6FE: ???
|   |
|   ->09.99% (1,048,576B) 0x93E06FE: ???
|   |
|   ->09.99% (1,048,576B) 0x89DF6FE: ???
|   |
|   ->09.99% (1,048,576B) 0xA1E16FE: ???
|   |
|   ->09.99% (1,048,576B) 0xABE26FE: ???
|   |
|   ->09.99% (1,048,576B) 0xB9E36FE: ???
|   |
|   ->09.99% (1,048,576B) 0xC3E46FE: ???
|   |
|   ->09.99% (1,048,576B) 0xCDE56FE: ???
|
->00.06% (5,864B) in 1+ places, all below ms_print's threshold (01.00%)

In the table above we can see that we usually allocate about 10 MiB, which is really just 10 blocks of 1 MiB (our block size), one per thread. Nothing special, but it was interesting.
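
For reference, the allocation Massif points to is almost certainly the per-thread read buffer. A sketch of what that probably looks like (my guess, not the actual block_hasher source):

#include <stdlib.h>

/* Each of the 10 worker threads allocates one read buffer of block_size
 * bytes (1 MiB in the runs above), so the heap peaks at roughly
 * 10 * 1 MiB = 10 MiB plus a little bookkeeping. */
static unsigned char *alloc_read_buffer(size_t block_size)
{
    unsigned char *buf = malloc(block_size);   /* 1,048,576 bytes per thread */
    return buf;                                /* freed when the thread finishes */
}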

Of course, Massif is useful: it can show you the history of allocations, how much memory was allocated (alignment included), and which pieces of code allocated the most. Too bad I don’t have any heap problems.

Helgrind

Helgrind is not a profiler but a tool to detect threading errors. It’s a thread debugger.

I’ll just show how I fixed a bug in my code with Helgrind’s help.

When I launched my block_hasher under it, I was sure I would get 0 errors, but instead I got stuck debugging for a couple of days.

[root@simplex block_hasher]# valgrind --tool=helgrind ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 100
==3930== Helgrind, a thread error detector
==3930== Copyright (C) 2007-2010, and GNU GPL'd, by OpenWorks LLP et al.
==3930== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==3930== Command: ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 100
==3930== 
==3930== Thread #3 was created
==3930==    at 0x571DB2E: clone (in /lib64/libc-2.12.so)
==3930==    by 0x541E8BF: do_clone.clone.0 (in /lib64/libpthread-2.12.so)
==3930==    by 0x541EDA1: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.12.so)
==3930==    by 0x4C2CE76: pthread_create_WRK (hg_intercepts.c:257)
==3930==    by 0x4019F0: main (block_hasher.c:350)
==3930== 
==3930== Thread #2 was created
==3930==    at 0x571DB2E: clone (in /lib64/libc-2.12.so)
==3930==    by 0x541E8BF: do_clone.clone.0 (in /lib64/libpthread-2.12.so)
==3930==    by 0x541EDA1: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.12.so)
==3930==    by 0x4C2CE76: pthread_create_WRK (hg_intercepts.c:257)
==3930==    by 0x4019F0: main (block_hasher.c:350)
==3930== 
==3930== Possible data race during write of size 4 at 0x5200380 by thread #3
==3930==    at 0x4E98AF8: CRYPTO_malloc (in /usr/lib64/libcrypto.so.1.0.1e)
==3930==    by 0x4F16FF6: EVP_MD_CTX_create (in /usr/lib64/libcrypto.so.1.0.1e)
==3930==    by 0x401231: thread_func (block_hasher.c:163)
==3930==    by 0x4C2D01D: mythread_wrapper (hg_intercepts.c:221)
==3930==    by 0x541F9D0: start_thread (in /lib64/libpthread-2.12.so)
==3930==    by 0x75E46FF: ???
==3930==  This conflicts with a previous write of size 4 by thread #2
==3930==    at 0x4E98AF8: CRYPTO_malloc (in /usr/lib64/libcrypto.so.1.0.1e)
==3930==    by 0x4F16FF6: EVP_MD_CTX_create (in /usr/lib64/libcrypto.so.1.0.1e)
==3930==    by 0x401231: thread_func (block_hasher.c:163)
==3930==    by 0x4C2D01D: mythread_wrapper (hg_intercepts.c:221)
==3930==    by 0x541F9D0: start_thread (in /lib64/libpthread-2.12.so)
==3930==    by 0x6BE36FF: ???
==3930== 
==3930== 
==3930== For counts of detected and suppressed errors, rerun with: -v
==3930== Use --history-level=approx or =none to gain increased speed, at
==3930== the cost of reduced accuracy of conflicting-access information
==3930== ERROR SUMMARY: 9 errors from 1 contexts (suppressed: 955 from 9)

As we can see, EVP_MD_CTX_create leads to a data race. This is an openssl¹ function that initializes the context for hash calculation. In each thread I hash the blocks I’ve read with EVP_DigestUpdate and then write the result to a file after the final EVP_DigestFinal_ex. So these Helgrind errors are related to openssl functions. And I asked myself — «Is libcrypto thread-safe?». I used my google-fu, and the answer is — not by default. To use EVP functions in multithreaded applications, openssl recommends either registering two crazy callbacks or using dynamic locks (see here). As for me, I’ve just wrapped the context initialization in a pthread mutex and that’s it.


@@ -159,9 +159,11 @@ void *thread_func(void *arg)
     gap = num_threads * block_size; // Multiply here to avoid integer overflow
 
     // Initialize EVP and start reading
+    pthread_mutex_lock( &mutex );
     md = EVP_sha1();
     mdctx = EVP_MD_CTX_create();
     EVP_DigestInit_ex( mdctx, md, NULL );
+    pthread_mutex_unlock( &mutex );
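
For comparison, here is roughly what the callback-based approach looks like with OpenSSL 1.0.x: a sketch built on the CRYPTO_set_locking_callback/CRYPTO_set_id_callback API, not code from block_hasher.

#include <stdlib.h>
#include <pthread.h>
#include <openssl/crypto.h>

/* OpenSSL 1.0.x expects the application to provide one lock per
 * CRYPTO_num_locks() slot plus a thread id callback; without them the
 * EVP functions are not safe to call from multiple threads. */
static pthread_mutex_t *ssl_locks;

static void ssl_locking_cb(int mode, int n, const char *file, int line)
{
    (void)file; (void)line;
    if (mode & CRYPTO_LOCK)
        pthread_mutex_lock(&ssl_locks[n]);
    else
        pthread_mutex_unlock(&ssl_locks[n]);
}

static unsigned long ssl_threadid_cb(void)
{
    return (unsigned long)pthread_self();
}

int ssl_threads_init(void)
{
    int i, n = CRYPTO_num_locks();

    ssl_locks = malloc(n * sizeof(pthread_mutex_t));
    if (ssl_locks == NULL)
        return -1;
    for (i = 0; i < n; i++)
        pthread_mutex_init(&ssl_locks[i], NULL);
    CRYPTO_set_locking_callback(ssl_locking_cb);
    CRYPTO_set_id_callback(ssl_threadid_cb);
    return 0;
}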

If anyone knows more about this — please tell me.

DRD

DRD is one more tool in the Valgrind suite that can detect threading errors. It’s more thorough and has more features, while being less memory hungry.

In my case it detected a mysterious data race around pread.

==16358== Thread 3:
==16358== Conflicting load by thread 3 at 0x0563e398 size 4
==16358==    at 0x5431030: pread (in /lib64/libpthread-2.12.so)
==16358==    by 0x4012D9: thread_func (block_hasher.c:174)
==16358==    by 0x4C33470: vgDrd_thread_wrapper (drd_pthread_intercepts.c:281)
==16358==    by 0x54299D0: start_thread (in /lib64/libpthread-2.12.so)
==16358==    by 0x75EE6FF: ???

pread itself is thread-safe in the sense that it can be called from multiple threads, but access to the data might not be synchronized. For example, you can call pread in one thread while calling pwrite in another, and that’s where you get a data race. But in my case the data blocks do not overlap, so I can’t tell what the real problem is here.
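
To make the «blocks do not overlap» part concrete, here is a sketch of the strided read pattern (names are my assumptions; the real thread_func may differ): with T threads and block size B, thread t reads the blocks at offsets t*B, (t+T)*B, (t+2T)*B, and so on, so no two threads ever touch the same byte range.

#include <unistd.h>
#include <sys/types.h>

/* Sketch of the strided read pattern: the byte ranges read by different
 * threads never overlap, so concurrent pread() calls should not race on
 * the data they return. */
static int read_my_blocks(int fd, unsigned char *buf, int thread_id,
                          int num_threads, size_t block_size, int num_blocks)
{
    off_t gap = (off_t)num_threads * block_size;   /* distance between my blocks */
    off_t off = (off_t)thread_id * block_size;     /* my first block             */

    for (int i = 0; i < num_blocks; i++, off += gap) {
        if (pread(fd, buf, block_size, off) < 0)
            return -1;
        /* ... feed buf to EVP_DigestUpdate() here ... */
    }
    return 0;
}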

Conclusion

The conclusion is dead simple — learn how to use Valgrind; it’s extremely useful.

To read


  1. libcrypto is the library of cryptographic functions and primitives that openssl is based on.