Inspecting with the Valgrind Suite
Your specific benchmarking output will likely vary but it's pretty clear that the naive implementation is, well, not very fast. What tools do we have available to us to diagnose the issues with our code, guide us toward hot spots for optimization, or help convince us of the need for better algorithms? We'll now leave the realm of Rust-specific tools and dip into tools familiar to systems programmers at large. If you're not familiar with them personally that's a-okay—we'll describe their use and there are plenty of external materials available for the motivated reader. Okay, first, we're going to need some programs that exercise our naive HashMap and the standard HashMap. naive_interpreter would work, but it's doing a lot of extra things that'll muddy the water some. To that end, for examination purposes, we'll need two programs, one to establish a baseline and one for our implementation. Our baseline, called bin/standard.rs:
extern crate naive_hashmap;
extern crate rand;

use std::collections::HashMap;

use rand::{Rng, SeedableRng, XorShiftRng};

fn main() {
    let mut rng: XorShiftRng = SeedableRng::from_seed([1981, 1986, 2003, 2011]);

    let mut hash_map = HashMap::new();
    let mut insert_empty = 0;
    let mut insert_present = 0;
    let mut get_fail = 0;
    let mut get_success = 0;

    for _ in 0..100_000 {
        let key = rng.gen::<u16>();
        if rng.gen::<bool>() {
            let value = rng.gen::<u32>();
            if hash_map.insert(key, value).is_none() {
                insert_empty += 1;
            } else {
                insert_present += 1;
            }
        } else {
            if hash_map.get(&key).is_none() {
                get_fail += 1;
            } else {
                get_success += 1;
            }
        }
    }

    println!("INSERT");
    println!("  empty:   {}", insert_empty);
    println!("  present: {}", insert_present);
    println!("LOOKUP");
    println!("  fail:    {}", get_fail);
    println!("  success: {}", get_success);
}
Our test article program, bin/naive.rs, is exactly the same, save that the preamble is a bit different:
extern crate naive_hashmap;
extern crate rand;

use naive_hashmap::HashMap;

use rand::{Rng, SeedableRng, XorShiftRng};

fn main() {
    let mut rng: XorShiftRng = SeedableRng::from_seed([1981, 1986, 2003, 2011]);

    let mut hash_map = HashMap::new();
    let mut insert_empty = 0;
    let mut insert_present = 0;
    let mut get_fail = 0;
    let mut get_success = 0;

    for _ in 0..100_000 {
        let key = rng.gen::<u16>();
        if rng.gen::<bool>() {
            let value = rng.gen::<u32>();
            if hash_map.insert(key, value).is_none() {
                insert_empty += 1;
            } else {
                insert_present += 1;
            }
        } else {
            if hash_map.get(&key).is_none() {
                get_fail += 1;
            } else {
                get_success += 1;
            }
        }
    }

    println!("INSERT");
    println!("  empty:   {}", insert_empty);
    println!("  present: {}", insert_present);
    println!("LOOKUP");
    println!("  fail:    {}", get_fail);
    println!("  success: {}", get_success);
}
Easy enough and similar in spirit to the benchmark code explored earlier. Now, let's set our baselines. First up, memory allocations:
naive_hashmap > valgrind --tool=memcheck target/release/standard
==13285== Memcheck, a memory error detector
==13285== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==13285== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==13285== Command: target/release/standard
==13285==
INSERT
  empty:   35217
  present: 15071
LOOKUP
  fail:    34551
  success: 15161
==13285==
==13285== HEAP SUMMARY:
==13285==     in use at exit: 0 bytes in 0 blocks
==13285==   total heap usage: 7 allocs, 7 frees, 2,032 bytes allocated
==13285==
==13285== All heap blocks were freed -- no leaks are possible
==13285==
==13285== For counts of detected and suppressed errors, rerun with: -v
==13285== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Seven total allocations, seven total frees, and a grand total of 2,032 bytes allocated. Running memcheck against naive yields the same result:
==13307== HEAP SUMMARY:
==13307==     in use at exit: 0 bytes in 0 blocks
==13307==   total heap usage: 7 allocs, 7 frees, 2,032 bytes allocated
The naive run is noticeably slower, yet its allocation behavior is identical, so it isn't memory allocation that's killing us. As so little memory gets allocated by these programs, we'll skip Valgrind massif; it's unlikely to turn up anything useful. Valgrind cachegrind should be interesting, though. Here's the baseline:
naive_hashmap > valgrind --tool=cachegrind --branch-sim=yes target/release/standard
==13372== Cachegrind, a cache and branch-prediction profiler
==13372== Copyright (C) 2002-2015, and GNU GPL'd, by Nicholas Nethercote et al.
==13372== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==13372== Command: target/release/standard
==13372==
--13372-- warning: L3 cache found, using its data for the LL simulation.
INSERT
  empty:   35217
  present: 15071
LOOKUP
  fail:    34551
  success: 15161
==13372==
==13372== I   refs:      25,733,614
==13372== I1  misses:         2,454
==13372== LLi misses:         2,158
==13372== I1  miss rate:       0.01%
==13372== LLi miss rate:       0.01%
==13372==
==13372== D   refs:       5,400,581  (2,774,345 rd   + 2,626,236 wr)
==13372== D1  misses:       273,218  (  219,462 rd   +    53,756 wr)
==13372== LLd misses:        36,267  (    2,162 rd   +    34,105 wr)
==13372== D1  miss rate:        5.1% (      7.9%     +       2.0%  )
==13372== LLd miss rate:        0.7% (      0.1%     +       1.3%  )
==13372==
==13372== LL refs:          275,672  (  221,916 rd   +    53,756 wr)
==13372== LL misses:         38,425  (    4,320 rd   +    34,105 wr)
==13372== LL miss rate:         0.1% (      0.0%     +       1.3%  )
==13372==
==13372== Branches:       3,008,198  (3,006,105 cond +     2,093 ind)
==13372== Mispredicts:      315,772  (  315,198 cond +       574 ind)
==13372== Mispred rate:        10.5% (     10.5%     +      27.4%   )
Let's break this up some:
==13372== I   refs:      25,733,614
==13372== I1  misses:         2,454
==13372== LLi misses:         2,158
==13372== I1  miss rate:       0.01%
==13372== LLi miss rate:       0.01%
The first section details our instruction cache behavior. I refs: 25,733,614 tells us that the standard program executed 25,733,614 instructions in all. This total is often a useful point of comparison between closely related implementations, as we'll see in a bit. Recall that cachegrind simulates a machine with two levels of instruction and data caching: the first level is referred to as I1 for instructions or D1 for data, and the shared last-level cache is prefixed with LL. Here, we see that the first-level and last-level instruction caches each missed around 2,500 times during our 25-million-instruction run. That squares with how tiny our program is.
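The percentages cachegrind reports are nothing mysterious: each miss rate is simply misses divided by total references. As a quick sanity check, here's a small sketch, with the counts copied from the run above, that recomputes the instruction cache miss rates:

fn main() {
    // Raw counts taken from the cachegrind run of target/release/standard.
    let i_refs: f64 = 25_733_614.0;
    let i1_misses: f64 = 2_454.0;
    let lli_misses: f64 = 2_158.0;

    // A miss rate is just misses over total instruction references.
    println!("I1  miss rate: {:.2}%", 100.0 * i1_misses / i_refs); // ~0.01%
    println!("LLi miss rate: {:.2}%", 100.0 * lli_misses / i_refs); // ~0.01%
}

The same division applies to the data cache and branch sections coming up. The second section is as follows: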
==13372== D   refs:       5,400,581  (2,774,345 rd   + 2,626,236 wr)
==13372== D1  misses:       273,218  (  219,462 rd   +    53,756 wr)
==13372== LLd misses:        36,267  (    2,162 rd   +    34,105 wr)
==13372== D1  miss rate:        5.1% (      7.9%     +       2.0%  )
==13372== LLd miss rate:        0.7% (      0.1%     +       1.3%  )
This is the data cache behavior, split into the two cache layers previously discussed and further divided into reads and writes. The first line tells us that 5,400,581 total reads and writes were made to the caches: 2,774,345 reads and 2,626,236 writes. These totals are then further divided between the first-level and last-level caches. Here it turns out that the standard library HashMap does quite well, with a D1 miss rate of 5.1% and an LLd miss rate of 0.7%. That last percentage is key: the higher it is, the more often our program has to go all the way out to main memory. Doing so, as you'll recall from Chapter 1, Preliminaries – Machine Architecture and Getting Started with Rust, is painfully slow.
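If you'd like to feel that difference directly, here's a toy sketch, not taken from either of our programs, contrasting a cache-friendly sequential traversal with a strided one that defeats the caches. Run each variant under cachegrind and the strided loop will show a dramatically worse D1 miss rate:

fn main() {
    // 16 million u32s, roughly 64 MiB: far larger than any cache level.
    const LEN: usize = 1 << 24;
    let data = vec![1u32; LEN];

    // Sequential traversal: every 64-byte cache line loaded is fully
    // used, so most accesses hit in D1.
    let sequential: u64 = data.iter().map(|&x| u64::from(x)).sum();

    // Strided traversal: one u32 read per 4 KiB touched, so nearly
    // every access misses D1 and pressures the last-level cache.
    let mut strided: u64 = 0;
    let mut i = 0;
    while i < LEN {
        strided += u64::from(data[i]);
        i += 1024;
    }

    println!("{} {}", sequential, strided);
}

The third section is as follows: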
==13372== LL refs:          275,672  (  221,916 rd   +    53,756 wr)
==13372== LL misses:         38,425  (    4,320 rd   +    34,105 wr)
==13372== LL miss rate:         0.1% (      0.0%     +       1.3%  )
This focuses on the combined behavior of data and instruction cache accesses to the LL cache. Personally, I don't often find this section illuminating compared to the previously discussed sections. Your mileage may vary. The final section is as follows:
==13372== Branches:       3,008,198  (3,006,105 cond +     2,093 ind)
==13372== Mispredicts:      315,772  (  315,198 cond +       574 ind)
==13372== Mispred rate:        10.5% (     10.5%     +      27.4%   )
This details the branch misprediction behavior of our program. We're drowning out the standard library's HashMap by branching so extensively in the runner program, but it will still serve to establish some kind of baseline. The first line informs us that 3,008,198 total branches were taken during execution. The vast majority—3,006,105—were conditional branches, branches that jump to a location based on some condition. This squares with the number of conditional statements we have in the runner. The small minority that remain, 2,093 in all, were indirect branches, meaning they jumped to offsets in memory based on the results of previous instructions. Overall, our branch misprediction rate is 10.5%.
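To make the conditional versus indirect distinction concrete, here's a small illustrative sketch, unrelated to our runner: an if compiles down to a conditional branch, while a call through a trait object is an indirect branch, since the jump target is loaded from the vtable at runtime. (The optimizer may well devirtualize a case this trivial; it's meant only as an illustration.)

// Calls through a trait object go via a vtable: the jump target is
// read from memory at runtime, producing an indirect branch.
trait Step {
    fn apply(&self, x: u32) -> u32;
}

struct AddOne;

impl Step for AddOne {
    fn apply(&self, x: u32) -> u32 {
        x + 1
    }
}

fn main() {
    let step: Box<dyn Step> = Box::new(AddOne);
    let mut total = 0u32;
    for i in 0..10u32 {
        // Conditional branch: taken or not, depending on a runtime value.
        if i % 2 == 0 {
            total += i;
        }
        // Indirect branch: the target comes from the vtable pointer.
        total = step.apply(total);
    }
    println!("{}", total);
}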
Alright, how does the naive HashMap implementation stack up?
naive_hashmap > valgrind --tool=cachegrind --branch-sim=yes target/release/naive
==13439== Cachegrind, a cache and branch-prediction profiler
==13439== Copyright (C) 2002-2015, and GNU GPL'd, by Nicholas Nethercote et al.
==13439== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==13439== Command: target/release/naive
==13439==
--13439-- warning: L3 cache found, using its data for the LL simulation.
INSERT
  empty:   35217
  present: 15071
LOOKUP
  fail:    34551
  success: 15161
==13439==
==13439== I   refs:      15,385,395,657
==13439== I1  misses:             2,376
==13439== LLi misses:             2,157
==13439== I1  miss rate:           0.00%
==13439== LLi miss rate:           0.00%
==13439==
==13439== D   refs:       1,609,901,359  (1,453,944,896 rd + 155,956,463 wr)
==13439== D1  misses:       398,465,518  (  398,334,276 rd +     131,242 wr)
==13439== LLd misses:            12,647  (        2,153 rd +      10,494 wr)
==13439== D1  miss rate:           24.8% (         27.4%   +         0.1%  )
==13439== LLd miss rate:            0.0% (          0.0%   +         0.0%  )
==13439==
==13439== LL refs:          398,467,894  (  398,336,652 rd +     131,242 wr)
==13439== LL misses:             14,804  (        4,310 rd +      10,494 wr)
==13439== LL miss rate:             0.0% (          0.0%   +         0.0%  )
==13439==
==13439== Branches:       4,430,578,248  (4,430,540,970 cond +    37,278 ind)
==13439== Mispredicts:          193,192  (      192,618 cond +       574 ind)
==13439== Mispred rate:             0.0% (          0.0%     +       1.5%   )
Already there are some standouts. Firstly, naive executed 15,385,395,657 instructions compared to the baseline's 25,733,614, roughly 600 times as many. The naive implementation is simply doing much, much more work than the standard library is. At this point, without looking at any further data, it's reasonable to conclude that the program under inspection is fundamentally flawed: a rethink of the algorithm is in order. No amount of micro-optimization will fix this. But that was understood; it's why the program is called naive to start with. The second major area of concern is that the D1 cache miss rate, at 24.8%, is almost 20 percentage points higher than baseline, to say nothing of the vastly greater number of reads and writes made to the first-level cache. Curiously, the naive implementation suffers fewer LLd cache misses compared to baseline—10,494 to 34,105. No hypothesis there. Skipping ahead to the branch misprediction section, we find that naive stays on-theme and performs drastically more branches than standard but with a lower total number of mispredictions. This squares with an algorithm dominated by linear seeks and compares, as naive is.
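That profile is exactly what you'd expect from a table that scans on every operation. The actual naive implementation isn't reproduced here, but a linear-scan map in its spirit, sketched below with hypothetical names, makes the cost plain: every probe is a compare plus a conditional branch, a miss touches every element, and the keep-scanning branch is so predictable that mispredictions stay rare even as branch counts explode.

// A hypothetical linear-scan map in the spirit of the naive HashMap.
// Lookups and inserts are O(n) compares, which is why instruction and
// branch counts balloon while mispredictions stay low: the scanning
// branch goes the same way almost every time.
use std::mem;

pub struct LinearMap<K, V> {
    data: Vec<(K, V)>,
}

impl<K: PartialEq, V> LinearMap<K, V> {
    pub fn new() -> Self {
        LinearMap { data: Vec::new() }
    }

    pub fn insert(&mut self, key: K, value: V) -> Option<V> {
        // Scan for an existing key, replacing its value if found.
        for pair in self.data.iter_mut() {
            if pair.0 == key {
                return Some(mem::replace(&mut pair.1, value));
            }
        }
        self.data.push((key, value));
        None
    }

    pub fn get(&self, key: &K) -> Option<&V> {
        // The predictable seek-and-compare hot loop cachegrind saw.
        self.data.iter().find(|pair| &pair.0 == key).map(|pair| &pair.1)
    }
}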