
Inspecting with Linux perf

It's worth keeping in mind that Valgrind's cachegrind is a simulation. If you have access to a Linux system, you can use the perf tool to get real, honest numbers about your program's cache performance and more. This is highly recommended and something we'll do throughout this book. Like git, perf is many tools gathered under a single banner, each subcommand with its own options.
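
To give a feel for that layout, here are a few invocations you'll reach for often. Treat this as a rough sketch: the events and output available depend on your kernel and hardware, and perf list is the authoritative source for what your machine exposes. The <program> placeholder below stands in for whatever binary you want to measure.

perf list                                   # enumerate the events this machine supports
perf stat -e cycles,instructions <program>  # count hardware events over a full run
perf record <program>                       # sample the program, writing perf.data
perf report                                 # inspect the samples gathered by perf record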

It is a tool well worth reading the documentation for. Anyhow, what does the standard binary look like under perf?

naive_hashmap > perf stat --event task-clock,context-switches,page-faults,cycles,instructions,branches,branch-misses,cache-references,cache-misses target/release/standard > /dev/null

 Performance counter stats for 'target/release/standard':

      6.923765      task-clock (msec)         #    0.948 CPUs utilized
             0      context-switches          #    0.000 K/sec
           633      page-faults               #    0.091 M/sec
    19,703,427      cycles                    #    2.846 GHz
    26,234,708      instructions              #    1.33  insn per cycle
     2,802,334      branches                  #  404.741 M/sec
       290,475      branch-misses             #   10.37% of all branches
       635,526      cache-references          #   91.789 M/sec
        67,304      cache-misses              #   10.590 % of all cache refs

       0.007301775 seconds time elapsed

How does the naive implementation stand up?

naive_hashmap > perf stat --event task-clock,context-switches,page-faults,cycles,instructions,branches,branch-misses,cache-references,cache-misses target/release/naive > /dev/null

 Performance counter stats for 'target/release/naive':

     1323.724713      task-clock (msec)         #    1.000 CPUs utilized
               1      context-switches          #    0.001 K/sec
             254      page-faults               #    0.192 K/sec
   3,953,955,131      cycles                    #    2.987 GHz
  15,390,499,356      instructions              #    3.89  insn per cycle
   4,428,637,974      branches                  # 3345.588 M/sec
         204,132      branch-misses             #    0.00% of all branches
     455,719,875      cache-references          #  344.271 M/sec
          21,311      cache-misses              #    0.005 % of all cache refs

       1.324163884 seconds time elapsed

This squares pretty well with the Valgrind simulation and leads to the same conclusion: too much work is being done to insert. The naive run retires roughly 590 times the instructions and executes over 1,500 times the branches of the standard implementation. The well-studied reader will have seen this coming a mile off; it's still worthwhile to be able to put numbers to a thing you already know.