Ftrace

October 27, 2014


Ftrace is a framework for tracing and profiling the Linux kernel with the following features:

  • Kernel function tracing
  • Call graph tracing
  • Tracepoint support
  • Dynamic tracing via kprobes
  • Statistics for kernel functions
  • Statistics for kernel events

Essentially, ftrace is built around a smart lockless ring buffer implementation (see Documentation/trace/ring-buffer-design.txt). That buffer stores all the ftrace data and is exposed via debugfs[1] in /sys/kernel/debug/tracing/. All manipulations are done with simple file operations in this directory.
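
For example, here is a minimal sketch of poking around that directory (available_tracers and current_tracer are standard files there):

mount -t debugfs none /sys/kernel/debug 2>/dev/null   # in case it isn't mounted yet (see the footnote)
cat /sys/kernel/debug/tracing/available_tracers       # tracers compiled into this kernel
cat /sys/kernel/debug/tracing/current_tracer          # the one that is active right now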

How ftrace works

As I’ve just said, ftrace is a framework, meaning that it provides only the ring buffer; all the real work is done by so-called tracers. Currently, ftrace includes the following tracers:

  • function – default tracer;
  • function_graph – constructs call graph;
  • irqsoff, preemptoff, preemptirqsoff, wakeup, wakeup_rt – latency tracers. These are the origins of ftrace; they first appeared in the -rt kernel. I won’t give you a lot of info on this topic because it’s more about realtime, scheduling and hardware stuff;
  • nop – you guessed it.

Also, as additional features you’ll get:

  • kernel tracepoints support;
  • kprobes support;
  • blktrace support, though it’s going to be deleted.

Now let’s look at specific tracers.

Function tracing

The main ftrace feature is, well, function tracing (the function and function_graph tracers). To achieve this, kernel functions are instrumented with mcount calls, just like with gprof. But the kernel mcount, of course, is totally different from the userspace one, because it’s architecture dependent. This dependency is required to be able to build call graphs, and more specifically to get the caller address from the previous stack frame.

This mcount call is inserted in the function prologue, and if tracing is turned off it does nothing. But if it’s turned on, it calls an ftrace handler that, depending on the current tracer, writes different data to the ring buffer.
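
From userspace it looks roughly like this (a minimal sketch, assuming debugfs is mounted as above):

cd /sys/kernel/debug/tracing
echo function > current_tracer   # turn the mcount hooks on with the function tracer
head -20 trace                   # read the ring buffer, one traced call per line
echo nop > current_tracer        # switch back to the nop tracer, i.e. turn tracing off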

Events tracing

Event tracing is done with the help of tracepoints. You set an event via the set_event file in /sys/kernel/debug/tracing and then it is traced into the ring buffer. For example, to trace kmalloc, just issue

echo kmalloc > /sys/kernel/debug/tracing/set_event

and now you can see it in the trace file:

tail-7747  [000] .... 12584.876544: kmalloc: call_site=c06c56da ptr=e9cf9eb0 bytes_req=4 bytes_alloc=8 gfp_flags=GFP_KERNEL|GFP_ZERO

and the format is the same as defined in include/trace/events/kmem.h, meaning it’s just a tracepoint.
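
The same tracepoint can also be toggled through the events/ directory hierarchy instead of set_event; a quick sketch:

cd /sys/kernel/debug/tracing
echo 1 > events/kmem/kmalloc/enable   # same effect as writing kmalloc to set_event
cat events/kmem/kmalloc/format        # field layout generated from the tracepoint definition
echo 1 > events/kmem/enable           # or enable the whole kmem subsystem at once
echo 0 > events/kmem/enable           # and turn it back off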

kprobes tracing

In kernel 3.10, support for kprobes and kretprobes was added to ftrace. Now you can do dynamic tracing without writing your own kernel module. But, unfortunately, there is not that much you can fetch with it, just:

  • Register values
  • Memory dumps
  • Symbol values
  • Stack values
  • Return values (kretprobes)

And again, this output is written to the ring buffer. Also, you can calculate some statistics over it.
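
For instance, per-probe hit and miss-hit counts are exported in the kprobe_profile file (a small sketch; it shows something only after probes are defined, as below):

cat /sys/kernel/debug/tracing/kprobe_profile   # columns: event name, probe hits, miss-hits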

Let’s trace something that doesn’t have a tracepoint – something that comes not from the core kernel but from a kernel module.

On my Samsung N210 laptop I have the ath9k WiFi module, which most likely doesn’t have any tracepoints. To check this, just grep available_events:

[tracing]# grep ath available_events 
cfg80211:rdev_del_mpath
cfg80211:rdev_add_mpath
cfg80211:rdev_change_mpath
cfg80211:rdev_get_mpath
cfg80211:rdev_dump_mpath
cfg80211:rdev_return_int_mpath_info
ext4:ext4_ext_convert_to_initialized_fastpath

Let’s see what functions we can put a probe on:

[tracing]# grep "\[ath9k\]" /proc/kallsyms | grep ' t ' | grep rx
f82e6ed0 t ath_rx_remove_buffer	[ath9k]
f82e6f60 t ath_rx_buf_link.isra.25	[ath9k]
f82e6ff0 t ath_get_next_rx_buf	[ath9k]
f82e7130 t ath_rx_edma_buf_link	[ath9k]
f82e7200 t ath_rx_addbuffer_edma	[ath9k]
f82e7250 t ath_rx_edma_cleanup	[ath9k]
f82f3720 t ath_debug_stat_rx	[ath9k]
f82e7a70 t ath_rx_tasklet	[ath9k]
f82e7310 t ath_rx_cleanup	[ath9k]
f82e7800 t ath_calcrxfilter	[ath9k]
f82e73e0 t ath_rx_init	[ath9k]

(The first grep filters symbols from the ath9k module, the second one filters functions that reside in the text section, and the last one filters receive-path functions.)

For example, we will trace the ath_get_next_rx_buf function.

[tracing]# echo 'r:ath_probe ath9k:ath_get_next_rx_buf $retval' >> kprobe_events

This command is not from the top of my head – check Documentation/trace/kprobetrace.txt.

This puts a return probe on our function and fetches the return value (it’s just a pointer).

After we’ve placed the probe, we must enable it:

[tracing]# echo 1 > events/kprobes/enable 

Then we can look for the output in the trace file, and here it is:

midori-6741  [000] d.s.  3011.304724: ath_probe: (ath_rx_tasklet+0x35a/0xc30 [ath9k] <- ath_get_next_rx_buf) arg1=0xf6ae39f4
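
When you’re done, the probe can be disabled and removed again (a sketch; the '-:' deletion syntax comes from the same kprobetrace.txt):

cd /sys/kernel/debug/tracing
echo 0 > events/kprobes/enable        # stop tracing kprobe events
echo '-:ath_probe' >> kprobe_events   # remove just this probe
echo > kprobe_events                  # or clear all dynamic probes at once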

Example (block_hasher)

By default, ftrace collects info about all kernel functions, and that’s huge. But, being a sophisticated kernel mechanism, ftrace has a lot of features, many kinds of options, tunable parameters and so on, which I don’t feel like covering here because there are plenty of manuals and articles on LWN (see the To read section). Hence, it’s no wonder that we can, for example, filter by PID. Here is the script:

#!/bin/sh

# Find where debugfs is mounted
DEBUGFS=`grep debugfs /proc/mounts | awk '{ print $2; }'`

# Reset trace stats by toggling the function profiler off and on
echo 0 > $DEBUGFS/tracing/function_profile_enabled
echo 1 > $DEBUGFS/tracing/function_profile_enabled

# Trace only this shell's PID; exec below replaces the shell with the target program
echo $$ > $DEBUGFS/tracing/set_ftrace_pid
echo function > $DEBUGFS/tracing/current_tracer

exec "$@"

function_profile_enabled toggles collection of statistical info (hit counts and time spent per function).
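
Used on its own, the profiler looks roughly like this (a minimal sketch):

cd /sys/kernel/debug/tracing
echo 1 > function_profile_enabled   # start collecting hit counts and times (resets previous stats)
sleep 10                            # let it gather data for a while
echo 0 > function_profile_enabled   # stop collecting
head trace_stat/function0           # per-CPU results, one file per CPU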

Launch our magic script:

./ftrace-me ./block_hasher -d /dev/md127 -b 1048576 -t10 -n10000

get per-processor statistics from the files in tracing/trace_stat/:

head -n50 tracing/trace_stat/function* > ~/trace_stat

and see the top 5:

==> function0 <==
  Function                               Hit    Time            Avg
  --------                               ---    ----            ---
  schedule                            444425    8653900277 us     19472.12 us 
  schedule_timeout                     36019    813403521 us     22582.62 us 
  do_IRQ                             8161576    796860573 us     97.635 us
  do_softirq                          486268    791706643 us     1628.128 us 
  __do_softirq                        486251    790968923 us     1626.667 us 

==> function1 <==
  Function                               Hit    Time            Avg
  --------                               ---    ----            ---
  schedule                           1352233    13378644495 us     9893.742 us 
  schedule_hrtimeout_range             11853    2708879282 us     228539.5 us 
  poll_schedule_timeout                 7733    2366753802 us     306058.9 us 
  schedule_timeout                    176343    1857637026 us     10534.22 us 
  schedule_timeout_interruptible          95    1637633935 us     17238251 us 

==> function2 <==
  Function                               Hit    Time            Avg
  --------                               ---    ----            ---
  schedule                           1260239    9324003483 us     7398.599 us 
  vfs_read                            215859    884716012 us     4098.582 us 
  do_sync_read                        214950    851281498 us     3960.369 us 
  sys_pread64                          13136    830103896 us     63193.04 us 
  generic_file_aio_read                14955    830034649 us     55502.14 us 

(Don’t pay attention to schedule – it’s just calls into the scheduler.)

Most of the time we are spending in schedule, do_IRQ, schedule_hrtimeout_range and vfs_read, meaning that we are either waiting for a read or waiting for some timeout. Now that’s strange! To make it clearer we can disable the so-called graph time so that child functions won’t be counted. Let me explain: by default ftrace counts a function’s time as the time of the function itself plus all the subroutines it calls. That’s the graph_time option in ftrace.

So we tell ftrace:

echo 0 > options/graph_time

And collect the profile again:

==> function0 <==
  Function                               Hit    Time            Avg
  --------                               ---    ----            ---
  schedule                             34129    6762529800 us     198146.1 us 
  mwait_idle                           50428    235821243 us     4676.394 us 
  mempool_free                      59292718    27764202 us     0.468 us    
  mempool_free_slab                 59292717    26628794 us     0.449 us    
  bio_endio                         49761249    24374630 us     0.489 us    

==> function1 <==
  Function                               Hit    Time            Avg
  --------                               ---    ----            ---
  schedule                            958708    9075670846 us     9466.564 us 
  mwait_idle                          406700    391923605 us     963.667 us  
  _spin_lock_irq                    22164884    15064205 us     0.679 us    
  __make_request                     3890969    14825567 us     3.810 us    
  get_page_from_freelist             7165243    14063386 us     1.962 us    

Now we see the amusing mwait_idle that somebody is somehow calling. From the profile alone we can’t say how that happens.

Maybe we should get a function graph! We know that it all starts with pread, so let’s try to trace down the function calls from pread.

By that moment, I had gotten tired of reading and writing debugfs files and started to use the CLI interface to ftrace, which is trace-cmd.

Using trace-cmd is dead simple – first you record with trace-cmd record and then you analyze it with trace-cmd report.

Record:

trace-cmd record -p function_graph -o graph_pread.dat -g sys_pread64 \
        ./block_hasher -d /dev/md127 -b 1048576 -t10 -n100

Look:

trace-cmd report -i graph_pread.dat | less

And it’s disappointing.

block_hasher-4102  [001]  2764.516562: funcgraph_entry:                   |                  __page_cache_alloc() {
block_hasher-4102  [001]  2764.516562: funcgraph_entry:                   |                    alloc_pages_current() {
block_hasher-4102  [001]  2764.516562: funcgraph_entry:        0.052 us   |                      policy_nodemask();
block_hasher-4102  [001]  2764.516563: funcgraph_entry:        0.058 us   |                      policy_zonelist();
block_hasher-4102  [001]  2764.516563: funcgraph_entry:                   |                      __alloc_pages_nodemask() {
block_hasher-4102  [001]  2764.516564: funcgraph_entry:        0.054 us   |                        _cond_resched();
block_hasher-4102  [001]  2764.516564: funcgraph_entry:        0.063 us   |                        next_zones_zonelist();
block_hasher-4109  [000]  2764.516564: funcgraph_entry:                   |  SyS_pread64() {
block_hasher-4102  [001]  2764.516564: funcgraph_entry:                   |                        get_page_from_freelist() {
block_hasher-4109  [000]  2764.516564: funcgraph_entry:                   |    __fdget() {
block_hasher-4102  [001]  2764.516565: funcgraph_entry:        0.052 us   |                          next_zones_zonelist();
block_hasher-4109  [000]  2764.516565: funcgraph_entry:                   |      __fget_light() {
block_hasher-4109  [000]  2764.516565: funcgraph_entry:        0.217 us   |        __fget();
block_hasher-4102  [001]  2764.516565: funcgraph_entry:        0.046 us   |                          __zone_watermark_ok();
block_hasher-4102  [001]  2764.516566: funcgraph_entry:        0.057 us   |                          __mod_zone_page_state();
block_hasher-4109  [000]  2764.516566: funcgraph_exit:         0.745 us   |      }
block_hasher-4109  [000]  2764.516566: funcgraph_exit:         1.229 us   |    }
block_hasher-4102  [001]  2764.516566: funcgraph_entry:                   |                          zone_statistics() {
block_hasher-4109  [000]  2764.516566: funcgraph_entry:                   |    vfs_read() {
block_hasher-4102  [001]  2764.516566: funcgraph_entry:        0.064 us   |                            __inc_zone_state();
block_hasher-4109  [000]  2764.516566: funcgraph_entry:                   |      rw_verify_area() {
block_hasher-4109  [000]  2764.516567: funcgraph_entry:                   |        security_file_permission() {
block_hasher-4102  [001]  2764.516567: funcgraph_entry:        0.057 us   |                            __inc_zone_state();
block_hasher-4109  [000]  2764.516567: funcgraph_entry:        0.048 us   |          cap_file_permission();
block_hasher-4102  [001]  2764.516567: funcgraph_exit:         0.907 us   |                          }
block_hasher-4102  [001]  2764.516567: funcgraph_entry:        0.056 us   |                          bad_range();
block_hasher-4109  [000]  2764.516567: funcgraph_entry:        0.115 us   |          __fsnotify_parent();
block_hasher-4109  [000]  2764.516568: funcgraph_entry:        0.159 us   |          fsnotify();
block_hasher-4102  [001]  2764.516568: funcgraph_entry:                   |                          mem_cgroup_bad_page_check() {
block_hasher-4102  [001]  2764.516568: funcgraph_entry:                   |                            lookup_page_cgroup_used() {
block_hasher-4102  [001]  2764.516568: funcgraph_entry:        0.052 us   |                              lookup_page_cgroup();
block_hasher-4109  [000]  2764.516569: funcgraph_exit:         1.958 us   |        }
block_hasher-4102  [001]  2764.516569: funcgraph_exit:         0.435 us   |                            }
block_hasher-4109  [000]  2764.516569: funcgraph_exit:         2.487 us   |      }
block_hasher-4102  [001]  2764.516569: funcgraph_exit:         0.813 us   |                          }
block_hasher-4102  [001]  2764.516569: funcgraph_exit:         4.666 us   |                        }

First of all, there is no straight function call chain – it’s constantly interrupted and transferred to another CPU. Secondly, there is a lot of noise, e.g. __inc_zone_state and __page_cache_alloc calls. And finally, there are neither mdraid functions nor mwait_idle calls!

And the reasons are ftrace’s default sources (tracepoints) and the async/callback nature of kernel code. You won’t see a direct function call chain from sys_pread64 – the kernel doesn’t work this way.

But what if we set up kprobes on mdraid functions? No problem! Just add return probes for mwait_idle and md_make_request:

# echo 'r:md_make_request_probe md_make_request $retval' >> kprobe_events 
# echo 'r:mwait_probe mwait_idle $retval' >> kprobe_events

Repeat the routine with trace-cmd to get the function graph:

# trace-cmd record -p function_graph -o graph_md.dat -g md_make_request -e md_make_request_probe -e mwait_probe -F \
            ./block_hasher -d /dev/md0 -b 1048576 -t10 -n100

-e enables a particular event and -F restricts tracing to the launched command. Now look at the function graph:

block_hasher-28990 [000] 10235.125319: funcgraph_entry:                   |  md_make_request() {
block_hasher-28990 [000] 10235.125321: funcgraph_entry:                   |    make_request() {
block_hasher-28990 [000] 10235.125322: funcgraph_entry:        0.367 us   |      md_write_start();
block_hasher-28990 [000] 10235.125323: funcgraph_entry:                   |      bio_clone_mddev() {
block_hasher-28990 [000] 10235.125323: funcgraph_entry:                   |        bio_alloc_bioset() {
block_hasher-28990 [000] 10235.125323: funcgraph_entry:                   |          mempool_alloc() {
block_hasher-28990 [000] 10235.125323: funcgraph_entry:        0.178 us   |            _cond_resched();
block_hasher-28990 [000] 10235.125324: funcgraph_entry:                   |            mempool_alloc_slab() {
block_hasher-28990 [000] 10235.125324: funcgraph_entry:                   |              kmem_cache_alloc() {
block_hasher-28990 [000] 10235.125324: funcgraph_entry:                   |                cache_alloc_refill() {
block_hasher-28990 [000] 10235.125325: funcgraph_entry:        0.275 us   |                  _spin_lock();
block_hasher-28990 [000] 10235.125326: funcgraph_exit:         1.072 us   |                }
block_hasher-28990 [000] 10235.125326: funcgraph_exit:         1.721 us   |              }
block_hasher-28990 [000] 10235.125326: funcgraph_exit:         2.085 us   |            }
block_hasher-28990 [000] 10235.125326: funcgraph_exit:         2.865 us   |          }
block_hasher-28990 [000] 10235.125326: funcgraph_entry:        0.187 us   |          bio_init();
block_hasher-28990 [000] 10235.125327: funcgraph_exit:         3.665 us   |        }
block_hasher-28990 [000] 10235.125327: funcgraph_entry:        0.229 us   |        __bio_clone();
block_hasher-28990 [000] 10235.125327: funcgraph_exit:         4.584 us   |      }
block_hasher-28990 [000] 10235.125328: funcgraph_entry:        1.093 us   |      raid5_compute_sector();
block_hasher-28990 [000] 10235.125330: funcgraph_entry:                   |      blk_recount_segments() {
block_hasher-28990 [000] 10235.125330: funcgraph_entry:        0.340 us   |        __blk_recalc_rq_segments();
block_hasher-28990 [000] 10235.125331: funcgraph_exit:         0.769 us   |      }
block_hasher-28990 [000] 10235.125331: funcgraph_entry:        0.202 us   |      _spin_lock_irq();
block_hasher-28990 [000] 10235.125331: funcgraph_entry:        0.194 us   |      generic_make_request();
block_hasher-28990 [000] 10235.125332: funcgraph_exit:       + 10.613 us  |    }
block_hasher-28990 [000] 10235.125332: funcgraph_exit:       + 13.638 us  |  }

Much better! But for some reason it doesn’t have mwait_idle calls. And it just calls generic_make_request. I’ve tried to record a function graph for generic_make_request (the -g option). Still no luck. So I’ve extracted all functions containing wait and here is the result:

# grep 'wait' graph_md.graph | cut -f 2 -d'|' | awk '{print $1}' | sort -n | uniq -c
     18 add_wait_queue()
   2064 bit_waitqueue()
      1 bit_waitqueue();
   1194 finish_wait()
     28 page_waitqueue()
   2033 page_waitqueue();
   1222 prepare_to_wait()
     25 remove_wait_queue()
      4 update_stats_wait_end()
    213 update_stats_wait_end();

(cut separates out the function-name column, awk prints only the function names, and sort with uniq reduces them to unique names with counts.)

Nothing related to timeouts. I’ve tried to grep for timeout and, damn, nothing suspicious there either.

So, right now I’m going to stop because it’s not going anywhere.

Conclusion

Well, it was really fun! ftrace is such a powerful tool, but it’s made for debugging, not profiling. I was able to get a kernel function call graph, get statistics for kernel execution at the source code level (can you believe it?), and trace some previously unknown functions – and all of that happened thanks to ftrace. Bless it!

To read


  1. This is how debugfs is mounted: mount -t debugfs none /sys/kernel/debug