Valgrind

Contrary to popular belief, Valgrind is not a single tool but a suite of tools, with Memcheck being the default one. At the time of writing, the Valgrind suite consists of:

  • Memcheck – detects memory management errors.
  • Cachegrind – a CPU cache access profiling tool.
  • Massif – a sampling heap profiler.
  • Helgrind – a race condition detector.
  • DRD – another tool for detecting errors in multithreaded applications.

There are also unofficial tools that are not included in Valgrind and are distributed as patches.

Valgrind's biggest advantage is that we don't need to recompile or modify our program in any way, because Valgrind tools use emulation for profiling. All of these tools share a common infrastructure that emulates the application runtime – memory management functions, CPU caches, threading primitives, etc. Our program executes inside this emulated environment while Valgrind analyzes it.

In the examples below, I'll use my block_hasher program to illustrate the usage of these profilers, because it's a small and simple utility.

Now let’s look at what Valgrind can do.

Memcheck

Memcheck is a memory error detector – one of the most useful tools in a programmer's toolbox.

Let's launch our hasher under Memcheck:

$ valgrind --leak-check=full ./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000
==4323== Memcheck, a memory error detector
==4323== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==4323== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==4323== Command: ./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000
==4323== 
==4323== 
==4323== HEAP SUMMARY:
==4323==     in use at exit: 16 bytes in 1 blocks
==4323==   total heap usage: 43 allocs, 42 frees, 10,491,624 bytes allocated
==4323== 
==4323== LEAK SUMMARY:
==4323==    definitely lost: 0 bytes in 0 blocks
==4323==    indirectly lost: 0 bytes in 0 blocks
==4323==      possibly lost: 0 bytes in 0 blocks
==4323==    still reachable: 16 bytes in 1 blocks
==4323==         suppressed: 0 bytes in 0 blocks
==4323== Reachable blocks (those to which a pointer was found) are not shown.
==4323== To see them, rerun with: --leak-check=full --show-reachable=yes
==4323== 
==4323== For counts of detected and suppressed errors, rerun with: -v
==4323== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 6 from 6)

I won’t explain what “definitely lost”, “indirectly lost” and the other categories mean – that’s what the documentation is for.

From the Memcheck report we can see that there are no errors, except for a small leak – 1 block is still reachable. From the message

total heap usage: 43 allocs, 42 frees, 10,491,624 bytes allocated

it’s clear that I forgot to call free somewhere. And that’s true: in bdev_open I allocate a struct for block_device, but bdev_close never frees it. By the way, it’s interesting that Memcheck reports a 16-byte loss, while block_device consists of an int and an off_t, which should occupy 4 + 8 = 12 bytes. Where did the other 4 bytes come from? On a 64-bit system structs are 8-byte aligned, so the int field is padded with 4 bytes.

Anyway, I’ve fixed the memory leak:

@@ -240,6 +241,9 @@ void bdev_close( struct block_device *dev )
         perror("close");
     }
 
+    free(dev);
+    dev = NULL;
+
     return;
 }

Check:

$ valgrind --leak-check=full ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 1000
==15178== Memcheck, a memory error detector
==15178== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==15178== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==15178== Command: ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 1000
==15178== 
==15178== 
==15178== HEAP SUMMARY:
==15178==     in use at exit: 0 bytes in 0 blocks
==15178==   total heap usage: 43 allocs, 43 frees, 10,491,624 bytes allocated
==15178== 
==15178== All heap blocks were freed -- no leaks are possible
==15178== 
==15178== For counts of detected and suppressed errors, rerun with: -v
==15178== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 6 from 6)

A real pleasure to see.

To sum up, Memcheck can do a lot – not only detect memory errors, but also explain them. Just saying “Hey, you’ve got an error here!” isn’t enough – to fix an error, it helps to know its cause, and Memcheck provides it. It’s so good that it’s even listed as a required skill in system programmer job postings.

TODO:

  • More examples of memory errors
  • track origins

Cachegrind

Cachegrind is a CPU cache access profiler. What amazed me is how it traces cache accesses – Cachegrind simulates them. See an excerpt from the documentation:

It performs detailed simulation of the I1, D1 and L2 caches in your CPU and so can accurately pinpoint the sources of cache misses in your code.

If you think that’s easy, please spend 90 minutes reading this great article.

Let’s collect a profile!

$ valgrind --tool=cachegrind ./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000
==9408== Cachegrind, a cache and branch-prediction profiler
==9408== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
==9408== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==9408== Command: ./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000
==9408== 
--9408-- warning: Unknown Intel cache config value (0xff), ignoring
--9408-- warning: L2 cache not installed, ignore LL results.
==9408== 
==9408== I   refs:      167,774,548,454
==9408== I1  misses:              1,482
==9408== LLi misses:              1,479
==9408== I1  miss rate:            0.00%
==9408== LLi miss rate:            0.00%
==9408== 
==9408== D   refs:       19,989,520,856  (15,893,212,838 rd   + 4,096,308,018 wr)
==9408== D1  misses:        163,354,097  (   163,350,059 rd   +         4,038 wr)
==9408== LLd misses:         74,749,207  (    74,745,179 rd   +         4,028 wr)
==9408== D1  miss rate:             0.8% (           1.0%     +           0.0%  )
==9408== LLd miss rate:             0.3% (           0.4%     +           0.0%  )
==9408== 
==9408== LL refs:           163,355,579  (   163,351,541 rd   +         4,038 wr)
==9408== LL misses:          74,750,686  (    74,746,658 rd   +         4,028 wr)
==9408== LL miss rate:              0.0% (           0.0%     +           0.0%  )

The first thing I look at is cache misses. But here the miss rate is less than 1%, so caching can’t be the problem.

If you’re asking yourself how Cachegrind can be useful, here is a story from work. To accelerate one of our RAID calculation algorithms, my colleague reduced the number of multiplications at the price of more additions and a more complicated data structure. In theory, it should have performed better, as in Karatsuba multiplication. But in reality it became much worse. After a few days of hard debugging, we launched it under Cachegrind and saw a cache miss rate of about 80%. The extra additions caused more memory accesses and reduced locality. So we abandoned the idea.

IMHO, Cachegrind is not that useful anymore since the advent of perf, which does real cache profiling using the CPU’s PMU (performance monitoring unit), and is therefore more precise and has much lower overhead.

Massif

Massif is a heap profiler in the sense that it shows the dynamics of heap allocations, i.e. how much memory your application was using at any given moment.

To do that, Massif samples the heap state, generating a file that is later transformed into a report with the help of the ms_print tool.

OK, let’s launch it:

$ valgrind --tool=massif ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 100
==29856== Massif, a heap profiler
==29856== Copyright (C) 2003-2010, and GNU GPL'd, by Nicholas Nethercote
==29856== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==29856== Command: ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 100
==29856== 
==29856== 

We got a massif.out.29856 file. Convert it to a text profile:

$ ms_print massif.out.29856 > massif.profile

The profile contains a histogram of heap allocations

    MB
10.01^::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::#
     |:                                                                 #
     |@                                                                 #::
     |@                                                                 # :
     |@                                                                 # ::
     |@                                                                 # ::
     |@                                                                 # ::@
     |@                                                                 # ::@
     |@                                                                 # ::@
     |@                                                                 # ::@
     |@                                                                 # ::@
     |@                                                                 # ::@
     |@                                                                 # ::@@
     |@                                                                 # ::@@
     |@                                                                 # ::@@
     |@                                                                 # ::@@
     |@                                                                 # ::@@
     |@                                                                 # ::@@
     |@                                                                 # ::@@
     |@                                                                 # ::@@
   0 +----------------------------------------------------------------------->Gi
     0                                                                   15.63

and a summary table of the most notable allocations.

Example:

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 40        344,706        9,443,296        9,442,896           400            0
 41        346,448       10,491,880       10,491,472           408            0
 42        346,527       10,491,936       10,491,520           416            0
 43        346,723       10,492,056       10,491,624           432            0
 44 15,509,791,074       10,492,056       10,491,624           432            0
100.00% (10,491,624B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.94% (10,485,760B) 0x401169: thread_func (block_hasher.c:142)
| ->99.94% (10,485,760B) 0x54189CF: start_thread (in /lib64/libpthread-2.12.so)
|   ->09.99% (1,048,576B) 0x6BDC6FE: ???
|   |
|   ->09.99% (1,048,576B) 0x7FDE6FE: ???
|   |
|   ->09.99% (1,048,576B) 0x75DD6FE: ???
|   |
|   ->09.99% (1,048,576B) 0x93E06FE: ???
|   |
|   ->09.99% (1,048,576B) 0x89DF6FE: ???
|   |
|   ->09.99% (1,048,576B) 0xA1E16FE: ???
|   |
|   ->09.99% (1,048,576B) 0xABE26FE: ???
|   |
|   ->09.99% (1,048,576B) 0xB9E36FE: ???
|   |
|   ->09.99% (1,048,576B) 0xC3E46FE: ???
|   |
|   ->09.99% (1,048,576B) 0xCDE56FE: ???
|
->00.06% (5,864B) in 1+ places, all below ms_print's threshold (01.00%)

In the table above we can see that we usually allocate in 10 MiB portions, which are really just 10 blocks of 1 MiB (our block size), one per thread. Nothing special, but it was interesting.

Of course, Massif is useful: it can show you the history of allocations, how much memory was allocated (including alignment overhead), and which pieces of code allocated the most. Too bad I don’t have any heap errors to show.

Helgrind

Helgrind is not a profiler but a tool for detecting threading errors – a thread debugger, if you like.

I’ll just show how I fixed a bug in my code with Helgrind’s help.

When I launched block_hasher under it, I was sure I would get 0 errors, but instead I got stuck debugging for a couple of days.

$ valgrind --tool=helgrind ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 100
==3930== Helgrind, a thread error detector
==3930== Copyright (C) 2007-2010, and GNU GPL'd, by OpenWorks LLP et al.
==3930== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==3930== Command: ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 100
==3930== 
==3930== Thread #3 was created
==3930==    at 0x571DB2E: clone (in /lib64/libc-2.12.so)
==3930==    by 0x541E8BF: do_clone.clone.0 (in /lib64/libpthread-2.12.so)
==3930==    by 0x541EDA1: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.12.so)
==3930==    by 0x4C2CE76: pthread_create_WRK (hg_intercepts.c:257)
==3930==    by 0x4019F0: main (block_hasher.c:350)
==3930== 
==3930== Thread #2 was created
==3930==    at 0x571DB2E: clone (in /lib64/libc-2.12.so)
==3930==    by 0x541E8BF: do_clone.clone.0 (in /lib64/libpthread-2.12.so)
==3930==    by 0x541EDA1: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.12.so)
==3930==    by 0x4C2CE76: pthread_create_WRK (hg_intercepts.c:257)
==3930==    by 0x4019F0: main (block_hasher.c:350)
==3930== 
==3930== Possible data race during write of size 4 at 0x5200380 by thread #3
==3930==    at 0x4E98AF8: CRYPTO_malloc (in /usr/lib64/libcrypto.so.1.0.1e)
==3930==    by 0x4F16FF6: EVP_MD_CTX_create (in /usr/lib64/libcrypto.so.1.0.1e)
==3930==    by 0x401231: thread_func (block_hasher.c:163)
==3930==    by 0x4C2D01D: mythread_wrapper (hg_intercepts.c:221)
==3930==    by 0x541F9D0: start_thread (in /lib64/libpthread-2.12.so)
==3930==    by 0x75E46FF: ???
==3930==  This conflicts with a previous write of size 4 by thread #2
==3930==    at 0x4E98AF8: CRYPTO_malloc (in /usr/lib64/libcrypto.so.1.0.1e)
==3930==    by 0x4F16FF6: EVP_MD_CTX_create (in /usr/lib64/libcrypto.so.1.0.1e)
==3930==    by 0x401231: thread_func (block_hasher.c:163)
==3930==    by 0x4C2D01D: mythread_wrapper (hg_intercepts.c:221)
==3930==    by 0x541F9D0: start_thread (in /lib64/libpthread-2.12.so)
==3930==    by 0x6BE36FF: ???
==3930== 
==3930== 
==3930== For counts of detected and suppressed errors, rerun with: -v
==3930== Use --history-level=approx or =none to gain increased speed, at
==3930== the cost of reduced accuracy of conflicting-access information
==3930== ERROR SUMMARY: 9 errors from 1 contexts (suppressed: 955 from 9)

As we can see, EVP_MD_CTX_create leads to a data race. This is an OpenSSL[1] function that initializes a context for hash calculation. I compute the hash of the blocks read in each thread with EVP_DigestUpdate and then write it to a file after the final EVP_DigestFinal_ex. So these Helgrind errors are related to OpenSSL functions, and I asked myself: “Is libcrypto thread-safe?” I used my google-fu, and the answer is: by default, no. To use EVP functions in multithreaded applications, OpenSSL recommends either registering 2 crazy callbacks or using dynamic locks (see here). As for me, I’ve just wrapped the context initialization in a pthread mutex, and that’s it.

@@ -159,9 +159,11 @@ void *thread_func(void *arg)
     gap = num_threads * block_size; // Multiply here to avoid integer overflow
 
     // Initialize EVP and start reading
+    pthread_mutex_lock( &mutex );
     md = EVP_sha1();
     mdctx = EVP_MD_CTX_create();
     EVP_DigestInit_ex( mdctx, md, NULL );
+    pthread_mutex_unlock( &mutex );

If anyone knows more about this – please tell me.

DRD

DRD is one more tool in the Valgrind suite that can detect threading errors. It’s more thorough and has more features than Helgrind, while being less memory-hungry.

In my case, it detected a mysterious data race in pread.

==16358== Thread 3:
==16358== Conflicting load by thread 3 at 0x0563e398 size 4
==16358==    at 0x5431030: pread (in /lib64/libpthread-2.12.so)
==16358==    by 0x4012D9: thread_func (block_hasher.c:174)
==16358==    by 0x4C33470: vgDrd_thread_wrapper (drd_pthread_intercepts.c:281)
==16358==    by 0x54299D0: start_thread (in /lib64/libpthread-2.12.so)
==16358==    by 0x75EE6FF: ???

pread itself is thread-safe in the sense that it can be called from multiple threads, but access to the underlying data might not be synchronized. For example, you could call pread in one thread while calling pwrite on the same region in another – that’s where you would get a data race. But in my case the data blocks do not overlap, so I can’t tell what the real problem is here.

Conclusion

The conclusion is dead simple – learn how to use Valgrind; it’s extremely useful.

To read


  1. libcrypto is the library of cryptographic functions and primitives that OpenSSL is based on.
