Contrary to popular belief, Valgrind is not a single tool but a suite of tools, with Memcheck being the default one. At the time of writing, the Valgrind suite consists of:
Memcheck – a memory error detector
Cachegrind – a cache and branch-prediction profiler
Callgrind – a call-graph generating cache profiler
Massif – a heap profiler
Helgrind – a thread error detector
DRD – another thread error detector
Plus there are unofficial tools not included in Valgrind that are distributed as patches.
The biggest plus of Valgrind is that we don’t need to recompile or modify our program in any way, because Valgrind tools use emulation as the method of profiling. All of these tools share a common infrastructure that emulates the application runtime – memory management functions, CPU caches, threading primitives, etc. Our program executes inside that emulated environment while Valgrind analyzes it.
In the examples below, I’ll use my block_hasher program to illustrate the profilers, because it’s a small and simple utility.
Now let’s look at what Valgrind can do.
Ok, so Memcheck is a memory error detector – one of the most useful tools in a programmer’s toolbox.
Let’s launch our hasher under Memcheck:
$ valgrind --leak-check=full ./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000
==4323== Memcheck, a memory error detector
==4323== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==4323== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==4323== Command: ./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000
==4323==
==4323==
==4323== HEAP SUMMARY:
==4323== in use at exit: 16 bytes in 1 blocks
==4323== total heap usage: 43 allocs, 42 frees, 10,491,624 bytes allocated
==4323==
==4323== LEAK SUMMARY:
==4323== definitely lost: 0 bytes in 0 blocks
==4323== indirectly lost: 0 bytes in 0 blocks
==4323== possibly lost: 0 bytes in 0 blocks
==4323== still reachable: 16 bytes in 1 blocks
==4323== suppressed: 0 bytes in 0 blocks
==4323== Reachable blocks (those to which a pointer was found) are not shown.
==4323== To see them, rerun with: --leak-check=full --show-reachable=yes
==4323==
==4323== For counts of detected and suppressed errors, rerun with: -v
==4323== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 6 from 6)
I won’t explain what “definitely lost”, “indirectly lost” and the others mean – that’s what the documentation is for.
From the Memcheck report we can see that there are no errors, except a little leak – 1 block is still reachable. From the message

total heap usage: 43 allocs, 42 frees, 10,491,624 bytes allocated

we can tell that somewhere I forgot to call free. And that’s true: in bdev_open I’m allocating a struct for block_device, but bdev_close doesn’t free it.
By the way, it’s interesting that Memcheck reports a 16-byte loss, while block_device consists of an int and an off_t, which should occupy 4 + 8 = 12 bytes. Where did the 4 extra bytes come from? Structs are 8-byte aligned on a 64-bit system, so the int field is padded with 4 bytes.
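Here is a minimal sketch demonstrating that padding (the field names are my illustration – only the types are known from the above):

#include <stdio.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical layout matching the description: an int plus an off_t */
struct block_device {
    int   fd;      /* 4 bytes, followed by 4 bytes of padding */
    off_t offset;  /* 8 bytes, must be 8-byte aligned on x86-64 */
};

int main(void)
{
    /* Prints 16, not 12, because of the padding after the int field */
    printf("sizeof(struct block_device) = %zu\n", sizeof(struct block_device));
    printf("offsetof(offset) = %zu\n", offsetof(struct block_device, offset));
    return 0;
}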
Anyway, I’ve fixed the memory leak:
@@ -240,6 +241,9 @@ void bdev_close( struct block_device *dev )
perror("close");
}
+ free(dev);
+ dev = NULL;
+
return;
}
Check:
$ valgrind --leak-check=full ./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000
==15178== Memcheck, a memory error detector
==15178== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==15178== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==15178== Command: ./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000
==15178==
==15178==
==15178== HEAP SUMMARY:
==15178== in use at exit: 0 bytes in 0 blocks
==15178== total heap usage: 43 allocs, 43 frees, 10,491,624 bytes allocated
==15178==
==15178== All heap blocks were freed -- no leaks are possible
==15178==
==15178== For counts of detected and suppressed errors, rerun with: -v
==15178== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 6 from 6)
A real pleasure to see.
To sum up, Memcheck can do a lot – not only detect memory errors, but also explain them. It’s not enough to say “Hey, you’ve got some error here!” – to fix an error it’s better to know its reason, and Memcheck gives you exactly that. It’s so good that it’s even listed as a skill in system programmer job postings.
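For example, take a minimal use-after-free (a contrived snippet, not from block_hasher):

#include <stdlib.h>

int main(void)
{
    char *p = malloc(16);
    free(p);
    return p[0]; /* invalid read of freed memory */
}

Memcheck will not only report the invalid read, but also print the stack where the block was allocated and the stack where it was freed – exactly the context you need to fix it.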
Cachegrind is a CPU cache access profiler. What amazed me is how it traces cache accesses – Cachegrind simulates them. See an excerpt from the documentation:
It performs detailed simulation of the I1, D1 and L2 caches in your CPU and so can accurately pinpoint the sources of cache misses in your code.
If you think it’s easy, please spend 90 minutes reading this great article.
Let’s collect a profile!
$ valgrind --tool=cachegrind ./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000
==9408== Cachegrind, a cache and branch-prediction profiler
==9408== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
==9408== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==9408== Command: ./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000
==9408==
--9408-- warning: Unknown Intel cache config value (0xff), ignoring
--9408-- warning: L2 cache not installed, ignore LL results.
==9408==
==9408== I refs: 167,774,548,454
==9408== I1 misses: 1,482
==9408== LLi misses: 1,479
==9408== I1 miss rate: 0.00%
==9408== LLi miss rate: 0.00%
==9408==
==9408== D refs: 19,989,520,856 (15,893,212,838 rd + 4,096,308,018 wr)
==9408== D1 misses: 163,354,097 ( 163,350,059 rd + 4,038 wr)
==9408== LLd misses: 74,749,207 ( 74,745,179 rd + 4,028 wr)
==9408== D1 miss rate: 0.8% ( 1.0% + 0.0% )
==9408== LLd miss rate: 0.3% ( 0.4% + 0.0% )
==9408==
==9408== LL refs: 163,355,579 ( 163,351,541 rd + 4,038 wr)
==9408== LL misses: 74,750,686 ( 74,746,658 rd + 4,028 wr)
==9408== LL miss rate: 0.0% ( 0.0% + 0.0% )
The first thing I look at is cache misses. But here the miss rate is less than 1%, so the cache can’t be the problem.
If you’re asking yourself how Cachegrind can be useful, here is one of my work stories. To accelerate one of the RAID calculation algorithms, my colleague reduced the number of multiplications at the price of more additions and a more complicated data structure. In theory it should have worked better, as in Karatsuba multiplication. In reality it became much worse. After a few days of hard debugging we launched it under Cachegrind and saw a cache miss rate of about 80%. The extra additions caused more memory accesses and reduced locality, so we abandoned the idea.
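The classic illustration of this effect is array traversal order – the same arithmetic with wildly different cache behavior:

/* Both functions sum the same 4096x4096 matrix, but the second one
 * jumps a whole 16 KiB row between accesses, missing the D1 cache on
 * almost every load. Cachegrind makes the difference obvious. */
long sum_by_rows(int (*m)[4096])
{
    long sum = 0;
    for (int i = 0; i < 4096; i++)
        for (int j = 0; j < 4096; j++)
            sum += m[i][j]; /* sequential, cache-friendly */
    return sum;
}

long sum_by_cols(int (*m)[4096])
{
    long sum = 0;
    for (int j = 0; j < 4096; j++)
        for (int i = 0; i < 4096; i++)
            sum += m[i][j]; /* strided, cache-hostile */
    return sum;
}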
IMHO, Cachegrind is not that useful anymore since the advent of perf, which does actual cache profiling using the CPU’s PMU (performance monitoring unit), so perf is more precise and has much lower overhead.
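For comparison, a rough perf equivalent would be something like this (assuming a Linux system with perf installed; exact event names vary by CPU):

$ perf stat -e cache-references,cache-misses ./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000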
Massif is a heap profiler, in the sense that it shows the dynamics of heap allocations, i.e. how much memory your application was using at any given moment. To do that, Massif samples the heap state, generating a file that is later transformed into a report with the help of the ms_print tool.
Ok, let’s launch it:
$ valgrind --tool=massif ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 100
==29856== Massif, a heap profiler
==29856== Copyright (C) 2003-2010, and GNU GPL'd, by Nicholas Nethercote
==29856== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==29856== Command: ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 100
==29856==
==29856==
We’ve got a massif.out.29856 file. Convert it to a text profile:
$ ms_print massif.out.29856 > massif.profile
The profile contains a histogram of heap allocations:
MB
10.01^::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::#
|: #
|@ #::
|@ # :
|@ # ::
|@ # ::
|@ # ::@
|@ # ::@
|@ # ::@
|@ # ::@
|@ # ::@
|@ # ::@
|@ # ::@@
|@ # ::@@
|@ # ::@@
|@ # ::@@
|@ # ::@@
|@ # ::@@
|@ # ::@@
|@ # ::@@
0 +----------------------------------------------------------------------->Gi
0 15.63
and a summary table of the most notable allocations.
Example:
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
40 344,706 9,443,296 9,442,896 400 0
41 346,448 10,491,880 10,491,472 408 0
42 346,527 10,491,936 10,491,520 416 0
43 346,723 10,492,056 10,491,624 432 0
44 15,509,791,074 10,492,056 10,491,624 432 0
100.00% (10,491,624B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.94% (10,485,760B) 0x401169: thread_func (block_hasher.c:142)
| ->99.94% (10,485,760B) 0x54189CF: start_thread (in /lib64/libpthread-2.12.so)
| ->09.99% (1,048,576B) 0x6BDC6FE: ???
| |
| ->09.99% (1,048,576B) 0x7FDE6FE: ???
| |
| ->09.99% (1,048,576B) 0x75DD6FE: ???
| |
| ->09.99% (1,048,576B) 0x93E06FE: ???
| |
| ->09.99% (1,048,576B) 0x89DF6FE: ???
| |
| ->09.99% (1,048,576B) 0xA1E16FE: ???
| |
| ->09.99% (1,048,576B) 0xABE26FE: ???
| |
| ->09.99% (1,048,576B) 0xB9E36FE: ???
| |
| ->09.99% (1,048,576B) 0xC3E46FE: ???
| |
| ->09.99% (1,048,576B) 0xCDE56FE: ???
|
->00.06% (5,864B) in 1+ places, all below ms_print's threshold (01.00%)
In the table above we can see that we usually allocate in 10 MiB portions, which are really just 10 blocks of 1 MiB (our block size) – one per thread. Nothing special, but it was interesting.
Of course, Massif is useful: it can show you the history of allocations, how much memory was allocated including alignment overhead, and which pieces of code allocated the most. Too bad I don’t have any heap problems to show here.
Helgrind is not a profiler but a tool to detect threading errors – a thread debugger.
I’ll just show how I’ve fixed a bug in my code with Helgrind’s help. When I launched my block_hasher under it, I was sure I would get 0 errors, but instead I got stuck debugging for a couple of days.
$ valgrind --tool=helgrind ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 100
==3930== Helgrind, a thread error detector
==3930== Copyright (C) 2007-2010, and GNU GPL'd, by OpenWorks LLP et al.
==3930== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==3930== Command: ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 100
==3930==
==3930== Thread #3 was created
==3930== at 0x571DB2E: clone (in /lib64/libc-2.12.so)
==3930== by 0x541E8BF: do_clone.clone.0 (in /lib64/libpthread-2.12.so)
==3930== by 0x541EDA1: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.12.so)
==3930== by 0x4C2CE76: pthread_create_WRK (hg_intercepts.c:257)
==3930== by 0x4019F0: main (block_hasher.c:350)
==3930==
==3930== Thread #2 was created
==3930== at 0x571DB2E: clone (in /lib64/libc-2.12.so)
==3930== by 0x541E8BF: do_clone.clone.0 (in /lib64/libpthread-2.12.so)
==3930== by 0x541EDA1: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.12.so)
==3930== by 0x4C2CE76: pthread_create_WRK (hg_intercepts.c:257)
==3930== by 0x4019F0: main (block_hasher.c:350)
==3930==
==3930== Possible data race during write of size 4 at 0x5200380 by thread #3
==3930== at 0x4E98AF8: CRYPTO_malloc (in /usr/lib64/libcrypto.so.1.0.1e)
==3930== by 0x4F16FF6: EVP_MD_CTX_create (in /usr/lib64/libcrypto.so.1.0.1e)
==3930== by 0x401231: thread_func (block_hasher.c:163)
==3930== by 0x4C2D01D: mythread_wrapper (hg_intercepts.c:221)
==3930== by 0x541F9D0: start_thread (in /lib64/libpthread-2.12.so)
==3930== by 0x75E46FF: ???
==3930== This conflicts with a previous write of size 4 by thread #2
==3930== at 0x4E98AF8: CRYPTO_malloc (in /usr/lib64/libcrypto.so.1.0.1e)
==3930== by 0x4F16FF6: EVP_MD_CTX_create (in /usr/lib64/libcrypto.so.1.0.1e)
==3930== by 0x401231: thread_func (block_hasher.c:163)
==3930== by 0x4C2D01D: mythread_wrapper (hg_intercepts.c:221)
==3930== by 0x541F9D0: start_thread (in /lib64/libpthread-2.12.so)
==3930== by 0x6BE36FF: ???
==3930==
==3930==
==3930== For counts of detected and suppressed errors, rerun with: -v
==3930== Use --history-level=approx or =none to gain increased speed, at
==3930== the cost of reduced accuracy of conflicting-access information
==3930== ERROR SUMMARY: 9 errors from 1 contexts (suppressed: 955 from 9)
As we can see, EVP_MD_CTX_create leads to a data race. This is an OpenSSL1 function that initializes a context for hash calculation. In each thread I calculate the hash for the blocks read with EVP_DigestUpdate and then write it to a file after the final EVP_DigestFinal_ex. So these Helgrind errors are related to OpenSSL functions. I asked myself – “Is libcrypto thread-safe?” I used my google-fu, and the answer is: by default, no. To use EVP functions in multithreaded applications, OpenSSL recommends either registering 2 crazy callbacks or using dynamic locks (see here). As for me, I’ve just wrapped the context initialization in a pthread mutex, and that’s it.
@@ -159,9 +159,11 @@ void *thread_func(void *arg)
gap = num_threads * block_size; // Multiply here to avoid integer overflow
// Initialize EVP and start reading
+ pthread_mutex_lock( &mutex );
md = EVP_sha1();
mdctx = EVP_MD_CTX_create();
EVP_DigestInit_ex( mdctx, md, NULL );
+ pthread_mutex_unlock( &mutex );
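For reference, here is roughly what the “crazy callbacks” setup looks like for OpenSSL 1.0.x (a sketch based on the OpenSSL threading documentation, not what I ended up using):

#include <pthread.h>
#include <openssl/crypto.h>

static pthread_mutex_t *locks;

/* OpenSSL calls this every time it takes or releases one of its
 * internal locks */
static void locking_cb(int mode, int n, const char *file, int line)
{
    if (mode & CRYPTO_LOCK)
        pthread_mutex_lock(&locks[n]);
    else
        pthread_mutex_unlock(&locks[n]);
}

/* OpenSSL uses this to tell threads apart */
static unsigned long id_cb(void)
{
    return (unsigned long)pthread_self();
}

void openssl_thread_setup(void)
{
    locks = OPENSSL_malloc(CRYPTO_num_locks() * sizeof(pthread_mutex_t));
    for (int i = 0; i < CRYPTO_num_locks(); i++)
        pthread_mutex_init(&locks[i], NULL);
    CRYPTO_set_id_callback(id_cb);
    CRYPTO_set_locking_callback(locking_cb);
}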
If anyone knows something about this – please, tell me.
DRD is one more tool in the Valgrind suite that can detect threading errors. It’s more thorough and has more features, while being less memory hungry.
In my case, it detected some mysterious pread data race.
==16358== Thread 3:
==16358== Conflicting load by thread 3 at 0x0563e398 size 4
==16358== at 0x5431030: pread (in /lib64/libpthread-2.12.so)
==16358== by 0x4012D9: thread_func (block_hasher.c:174)
==16358== by 0x4C33470: vgDrd_thread_wrapper (drd_pthread_intercepts.c:281)
==16358== by 0x54299D0: start_thread (in /lib64/libpthread-2.12.so)
==16358== by 0x75EE6FF: ???
pread itself is thread-safe in the sense that it can be called from multiple threads, but access to the underlying data might not be synchronized. For example, you can call pread in one thread while calling pwrite on an overlapping region in another – that’s where you get a data race. But in my case the data blocks do not overlap, so I can’t tell what the real problem is here.
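For context, the read pattern in thread_func looks roughly like this (a simplified sketch – the real code also handles errors and options):

/* Each thread reads its own interleaved blocks: thread 0 reads blocks
 * 0, 10, 20, ..., thread 1 reads blocks 1, 11, 21, ... – the regions
 * never overlap between threads */
off_t gap = (off_t)num_threads * block_size;
off_t offset = (off_t)thread_num * block_size;
for (int i = 0; i < num_blocks; i++) {
    if (pread(fd, buf, block_size, offset) != (ssize_t)block_size) {
        perror("pread");
        break;
    }
    EVP_DigestUpdate(mdctx, buf, block_size);
    offset += gap;
}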
The conclusion is dead simple – learn how to use Valgrind, it’s extremely useful.
libcrypto is a library of cryptography functions and primitives that OpenSSL is based on. ↩︎