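46.2 Sequential and Parallel Benchmarks
Depending on whether the measured code runs in parallel, Go benchmarks fall into two categories: sequential benchmarks and parallel benchmarks.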
1. Sequential benchmarks
A sequential benchmark is written as follows:
func BenchmarkXxx(b *testing.B) {
	// ...
	for i := 0; i < b.N; i++ {
		// code under test
	}
}
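The earlier benchmarks of the various string concatenation methods belong to this category. The following example shows how a sequential benchmark is actually executed: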
// chapter8/sources/benchmark-impl/sequential_test.go
// The package clause and imports (fmt, sync, sync/atomic, testing, and the
// tls package that provides tls.ID()) are omitted in this listing.

var (
	m     map[int64]struct{} = make(map[int64]struct{}, 10) // rounds that have already logged
	mu    sync.Mutex
	round int64 = 1
)

func BenchmarkSequential(b *testing.B) {
	fmt.Printf("\ngoroutine[%d] enter BenchmarkSequential: round[%d], b.N[%d]\n",
		tls.ID(), atomic.LoadInt64(&round), b.N)
	defer func() {
		atomic.AddInt64(&round, 1) // the next call belongs to the next round
	}()

	for i := 0; i < b.N; i++ {
		mu.Lock()
		_, ok := m[round]
		if !ok {
			// Log only once per round, on the first iteration.
			m[round] = struct{}{}
			fmt.Printf("goroutine[%d] enter loop in BenchmarkSequential: round[%d], b.N[%d]\n",
				tls.ID(), atomic.LoadInt64(&round), b.N)
		}
		mu.Unlock()
	}
	fmt.Printf("goroutine[%d] exit BenchmarkSequential: round[%d], b.N[%d]\n",
		tls.ID(), atomic.LoadInt64(&round), b.N)
}
Run the example:
$ go test -bench . sequential_test.go
goroutine[1] enter BenchmarkSequential: round[1], b.N[1]
goroutine[1] enter loop in BenchmarkSequential: round[1], b.N[1]
goroutine[1] exit BenchmarkSequential: round[1], b.N[1]
goos: darwin
goarch: amd64
BenchmarkSequential-8
goroutine[2] enter BenchmarkSequential: round[2], b.N[100]
goroutine[2] enter loop in BenchmarkSequential: round[2], b.N[100]
goroutine[2] exit BenchmarkSequential: round[2], b.N[100]
goroutine[2] enter BenchmarkSequential: round[3], b.N[10000]
goroutine[2] enter loop in BenchmarkSequential: round[3], b.N[10000]
goroutine[2] exit BenchmarkSequential: round[3], b.N[10000]
goroutine[2] enter BenchmarkSequential: round[4], b.N[1000000]
goroutine[2] enter loop in BenchmarkSequential: round[4], b.N[1000000]
goroutine[2] exit BenchmarkSequential: round[4], b.N[1000000]
goroutine[2] enter BenchmarkSequential: round[5], b.N[65666582]
goroutine[2] enter loop in BenchmarkSequential: round[5], b.N[65666582]
goroutine[2] exit BenchmarkSequential: round[5], b.N[65666582]
65666582	        20.6 ns/op
PASS
ok  	command-line-arguments	1.381s
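The output shows that:
- BenchmarkSequential is executed for several rounds (see the round values in the output);
- the value of b.N used by the for loop differs in every round: 1, 100, 10000, 1000000, and 65666582;
- apart from the first round, where b.N is 1, all rounds run sequentially in a single goroutine (goroutine[2]).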
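By default, each benchmark function (such as BenchmarkSequential) is given an execution time of 1 second. If a round finishes in less than 1 second, go test increases b.N along the sequence 1, 2, 3, 5, 10, 20, 30, 50, 100, and so on. When a round with a small b.N completes very quickly, the benchmark framework skips some of the intermediate values and picks a larger one, which is why b.N jumps straight from 1 to 100 here. Once a new b.N has been chosen, the framework starts another round of the benchmark function, and it keeps doing so until some round takes longer than 1 second. In the example above, the last round uses a b.N of 65666582. This value was most likely computed from the average time per iteration measured in the previous round: go test sees that multiplying that average by an N that is again 100 times larger would push the next round far beyond 1 second, so it instead uses the 1-second target together with the measured average to estimate an iteration count, which gives 65666582.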
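The estimation step described above can be modelled in a few lines. The following is only a simplified sketch, not the code of the testing package: the function name nextN, the fixed cost perOpNs, the 20% safety margin, and the 100x growth cap are all assumptions made for illustration.

```go
package main

import "fmt"

// nextN is a simplified, hypothetical model of how the next b.N could be
// chosen from the previous round. Assumptions for illustration only:
// a 20% safety margin and a growth cap of 100x per round.
func nextN(goalNs, prevN, prevNs int64) int64 {
	if prevNs <= 0 {
		prevNs = 1 // avoid dividing by zero after an extremely fast round
	}
	n := goalNs * prevN / prevNs // iterations that would just fill the target time
	n += n / 5                   // overshoot by 20% so the target is really reached
	if n > 100*prevN {
		n = 100 * prevN // never grow by more than 100x in one step
	}
	if n < prevN+1 {
		n = prevN + 1 // always make progress
	}
	return n
}

func main() {
	const perOpNs = 20 // hypothetical, perfectly stable cost of one iteration

	for _, goal := range []int64{1e9, 2e9} { // target times of 1s and 2s
		n := int64(1)
		fmt.Printf("target %ds:", goal/1e9)
		for {
			fmt.Printf(" %d", n)
			if n*perOpNs >= goal {
				break // this round lasted long enough; it is the final one
			}
			n = nextN(goal, n, n*perOpNs)
		}
		fmt.Println()
	}
}
```

With these assumed numbers the model walks through 1, 100, 10000, and 1000000 before estimating the final count, which is the same shape as the rounds in the output above. The real framework measures each round instead of assuming a fixed cost, so its final b.N will differ from run to run.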
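If a benchmark runs for only 1 second and completes only 10 iterations in that time, the resulting average may have a high standard deviation. If it completes millions or billions of iterations, the average is likely to be much closer to the true value. To increase the number of iterations, use the -benchtime command-line option to lengthen the execution time of the benchmark.
In the following example, the -benchtime option of go test changes the execution time of the benchmark function from the default 1 second to 2 seconds: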
$ go test -bench . sequential_test.go -benchtime 2s
...
goroutine[2] enter BenchmarkSequential: round[4], b.N[1000000]
goroutine[2] enter loop in BenchmarkSequential: round[4], b.N[1000000]
goroutine[2] exit BenchmarkSequential: round[4], b.N[1000000]
goroutine[2] enter BenchmarkSequential: round[5], b.N[100000000]
goroutine[2] enter loop in BenchmarkSequential: round[5], b.N[100000000]
goroutine[2] exit BenchmarkSequential: round[5], b.N[100000000]
100000000	        20.5 ns/op
PASS
ok  	command-line-arguments	2.075s
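With the execution time raised to 2 seconds, b.N in the final round grows to 100000000.
The -benchtime option can also set b.N directly. In that case go test uses the N you specify as the iteration count of the final round: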
$ go test -v -benchtime 5x -bench . sequential_test.go
goos: darwin
goarch: amd64
BenchmarkSequential
goroutine[1] enter BenchmarkSequential: round[1], b.N[1]
goroutine[1] enter loop in BenchmarkSequential: round[1], b.N[1]
goroutine[1] exit BenchmarkSequential: round[1], b.N[1]
goroutine[2] enter BenchmarkSequential: round[2], b.N[5]
goroutine[2] enter loop in BenchmarkSequential: round[2], b.N[5]
goroutine[2] exit BenchmarkSequential: round[2], b.N[5]
BenchmarkSequential-8   	       5	      5470 ns/op
PASS
ok  	command-line-arguments	0.006s
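Although each benchmark function above (such as BenchmarkSequential) actually ran for several rounds, all of those rounds together count as a single execution. Because the data from a single execution may not be representative, we sometimes ask go test explicitly to execute a benchmark several times, collect the data from every execution, and use the result of processing that data statistically as the final result. The -count command-line option sets the number of times each benchmark function is executed: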
$ go test -v -count 2 -bench . benchmark_intro_test.go
goos: darwin
goarch: amd64
BenchmarkConcatStringByOperator
BenchmarkConcatStringByOperator-8   	12665250	        89.8 ns/op
BenchmarkConcatStringByOperator-8   	13099075	        89.7 ns/op
BenchmarkConcatStringBySprintf
BenchmarkConcatStringBySprintf-8    	 2781075	       433 ns/op
BenchmarkConcatStringBySprintf-8    	 2662507	       433 ns/op
BenchmarkConcatStringByJoin
BenchmarkConcatStringByJoin-8       	23679480	        49.1 ns/op
BenchmarkConcatStringByJoin-8       	24135014	        49.6 ns/op
PASS
ok  	command-line-arguments	8.225s
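In this example every benchmark function is executed twice and reports two results. Each execution still consists of several rounds with different values of b.N.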
2. Parallel benchmarks
A parallel benchmark is written as follows:
func BenchmarkXxx(b *testing.B) {
	// ...
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			// code under test
		}
	})
}
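Parallel benchmarks are mainly used to establish a performance baseline for code that relies on multi-goroutine synchronization facilities such as mutexes, read-write locks, and atomic operations. Compared with a sequential benchmark, a parallel benchmark gives a more realistic picture of what the code under test spends on goroutine synchronization when many goroutines are involved. Consider the following example: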
// chapter8/sources/benchmark_paralell_demo_test.go
// The package clause and imports (sync, sync/atomic, testing) are omitted in this listing.

// Counter protected by atomic operations.
var n1 int64

func addSyncByAtomic(delta int64) int64 {
	return atomic.AddInt64(&n1, delta)
}

func readSyncByAtomic() int64 {
	return atomic.LoadInt64(&n1)
}

// Counter protected by a read-write lock.
var n2 int64
var rwmu sync.RWMutex

func addSyncByMutex(delta int64) {
	rwmu.Lock()
	n2 += delta
	rwmu.Unlock()
}

func readSyncByMutex() int64 {
	var n int64
	rwmu.RLock()
	n = n2
	rwmu.RUnlock()
	return n
}

func BenchmarkAddSyncByAtomic(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			addSyncByAtomic(1)
		}
	})
}

func BenchmarkReadSyncByAtomic(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			readSyncByAtomic()
		}
	})
}

func BenchmarkAddSyncByMutex(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			addSyncByMutex(1)
		}
	})
}

func BenchmarkReadSyncByMutex(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			readSyncByMutex()
		}
	})
}
Run the benchmark:
$ go test -v -bench . benchmark_paralell_demo_test.go -cpu 2,4,8
goos: darwin
goarch: amd64
BenchmarkAddSyncByAtomic
BenchmarkAddSyncByAtomic-2     	  75208119	        15.3 ns/op
BenchmarkAddSyncByAtomic-4     	  70117809	        17.0 ns/op
BenchmarkAddSyncByAtomic-8     	  68664270	        15.9 ns/op
BenchmarkReadSyncByAtomic
BenchmarkReadSyncByAtomic-2    	1000000000	         0.744 ns/op
BenchmarkReadSyncByAtomic-4    	1000000000	         0.384 ns/op
BenchmarkReadSyncByAtomic-8    	1000000000	         0.240 ns/op
BenchmarkAddSyncByMutex
BenchmarkAddSyncByMutex-2      	  37533390	        31.4 ns/op
BenchmarkAddSyncByMutex-4      	  21660948	        57.5 ns/op
BenchmarkAddSyncByMutex-8      	  16808721	        72.6 ns/op
BenchmarkReadSyncByMutex
BenchmarkReadSyncByMutex-2     	  35535615	        32.3 ns/op
BenchmarkReadSyncByMutex-4     	  29839219	        39.6 ns/op
BenchmarkReadSyncByMutex-8     	  29936805	        39.8 ns/op
PASS
ok  	command-line-arguments	12.454s
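Here the -cpu 2,4,8 command-line option tells go test to run each benchmark function once with GOMAXPROCS set to 2, once with 4, and once with 8. The output makes it easy to see how the performance of each function under test changes as GOMAXPROCS grows.
Unlike a sequential benchmark, a parallel benchmark starts several goroutines that execute the loop inside the benchmark function in parallel. Again, an example helps to show the execution flow: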
// chapter8/sources/benchmark-impl/paralell_test.go
// The package clause and imports (fmt, sync, sync/atomic, testing, and the
// tls package that provides tls.ID()) are omitted in this listing.

var (
	m     map[int64]int = make(map[int64]int, 20) // goroutine ID -> iterations executed
	mu    sync.Mutex
	round int64 = 1
)

func BenchmarkParalell(b *testing.B) {
	fmt.Printf("\ngoroutine[%d] enter BenchmarkParalell: round[%d], b.N[%d]\n",
		tls.ID(), atomic.LoadInt64(&round), b.N)
	defer func() {
		atomic.AddInt64(&round, 1) // the next call belongs to the next round
	}()

	b.RunParallel(func(pb *testing.PB) {
		id := tls.ID()
		fmt.Printf("goroutine[%d] enter loop func in BenchmarkParalell: round[%d], b.N[%d]\n",
			tls.ID(), atomic.LoadInt64(&round), b.N)
		for pb.Next() {
			// Count the iterations executed by this goroutine.
			mu.Lock()
			m[id]++
			mu.Unlock()
		}

		mu.Lock()
		count := m[id]
		mu.Unlock()
		fmt.Printf("goroutine[%d] exit loop func in BenchmarkParalell: round[%d], loop[%d]\n",
			tls.ID(), atomic.LoadInt64(&round), count)
	})

	fmt.Printf("goroutine[%d] exit BenchmarkParalell: round[%d], b.N[%d]\n",
		tls.ID(), atomic.LoadInt64(&round), b.N)
}
Run the example with -cpu=2:
$ go test -v -bench . paralell_test.go -cpu=2
goos: darwin
goarch: amd64
BenchmarkParalell
goroutine[1] enter BenchmarkParalell: round[1], b.N[1]
goroutine[2] enter loop func in BenchmarkParalell: round[1], b.N[1]
goroutine[2] exit loop func in BenchmarkParalell: round[1], loop[1]
goroutine[3] enter loop func in BenchmarkParalell: round[1], b.N[1]
goroutine[3] exit loop func in BenchmarkParalell: round[1], loop[0]
goroutine[1] exit BenchmarkParalell: round[1], b.N[1]
goroutine[4] enter BenchmarkParalell: round[2], b.N[100]
goroutine[5] enter loop func in BenchmarkParalell: round[2], b.N[100]
goroutine[5] exit loop func in BenchmarkParalell: round[2], loop[100]
goroutine[6] enter loop func in BenchmarkParalell: round[2], b.N[100]
goroutine[6] exit loop func in BenchmarkParalell: round[2], loop[0]
goroutine[4] exit BenchmarkParalell: round[2], b.N[100]
goroutine[4] enter BenchmarkParalell: round[3], b.N[10000]
goroutine[7] enter loop func in BenchmarkParalell: round[3], b.N[10000]
goroutine[8] enter loop func in BenchmarkParalell: round[3], b.N[10000]
goroutine[8] exit loop func in BenchmarkParalell: round[3], loop[4576]
goroutine[7] exit loop func in BenchmarkParalell: round[3], loop[5424]
goroutine[4] exit BenchmarkParalell: round[3], b.N[10000]
goroutine[4] enter BenchmarkParalell: round[4], b.N[1000000]
goroutine[9] enter loop func in BenchmarkParalell: round[4], b.N[1000000]
goroutine[10] enter loop func in BenchmarkParalell: round[4], b.N[1000000]
goroutine[9] exit loop func in BenchmarkParalell: round[4], loop[478750]
goroutine[10] exit loop func in BenchmarkParalell: round[4], loop[521250]
goroutine[4] exit BenchmarkParalell: round[4], b.N[1000000]
goroutine[4] enter BenchmarkParalell: round[5], b.N[25717561]
goroutine[11] enter loop func in BenchmarkParalell: round[5], b.N[25717561]
goroutine[12] enter loop func in BenchmarkParalell: round[5], b.N[25717561]
goroutine[12] exit loop func in BenchmarkParalell: round[5], loop[11651491]
goroutine[11] exit loop func in BenchmarkParalell: round[5], loop[14066070]
goroutine[4] exit BenchmarkParalell: round[5], b.N[25717561]
BenchmarkParalell-2   	25717561	        43.6 ns/op
PASS
ok  	command-line-arguments	1.176s
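The output shows that for every round of the BenchmarkParalell benchmark, go test starts GOMAXPROCS new goroutines. Together these goroutines execute the loop b.N times, and each goroutine takes a roughly equal share of the iterations where it can. In the small early rounds (b.N of 1 and 100) a single goroutine ends up running all of the iterations, while in the later rounds the split is close to even.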
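The sharing of b.N iterations among goroutines can be imitated with a shared atomic counter. This is a minimal sketch under stated assumptions, not the implementation of b.RunParallel: the function distribute is a hypothetical name, and each goroutine here claims exactly one iteration at a time, whereas the real framework may hand out iterations in batches.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// distribute models n loop iterations being shared by `workers` goroutines.
// Every goroutine claims iterations from one shared atomic counter until all
// n have been claimed. It returns how many iterations each goroutine ran.
// This is an illustrative model only, not the testing package's code.
func distribute(n int64, workers int) []int64 {
	var next int64 // number of iterations claimed so far
	counts := make([]int64, workers)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			// This loop plays the role of "for pb.Next() { ... }".
			for atomic.AddInt64(&next, 1) <= n {
				counts[w]++ // each goroutine writes only to its own slot
			}
		}(w)
	}
	wg.Wait()
	return counts
}

func main() {
	const n = 1000000
	counts := distribute(n, 2)

	var total int64
	for w, c := range counts {
		fmt.Printf("worker %d ran %d iterations\n", w, c)
		total += c
	}
	fmt.Printf("total %d of %d\n", total, n)
}
```

The per-worker counts vary from run to run, just like the loop[...] values in the output above, but they always add up to n. With a very small n the goroutine that starts first can claim every iteration before the others run, which matches the loop[0] entries seen in the first two rounds.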