Prometheus云原生监控：运维与开发实战 (Prometheus Cloud-Native Monitoring: Operations and Development in Practice)


4.3.3 Histogram

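In most cases, people tend to rely on the average of a quantitative metric, such as average CPU utilization or average page response time. The weakness of this approach is obvious. Take the average response time of a system's API calls: if most requests complete within about 100 ms while a few take 5 s, the average hides those slow requests. This is the long-tail problem.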

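A slow system may be slow because the average itself is high or because of a long tail. The simplest way to tell the two apart is to group requests by latency range, for example counting how many requests took 0 to 10 ms and how many took 10 to 20 ms. Grouping like this quickly reveals why the system is slow, and it is exactly the problem the histogram exists to solve. Exposing a metric as a Histogram lets us see at a glance how the samples are distributed.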

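A Histogram samples observations (typically request durations or response sizes) over a period of time and counts them in configurable buckets. The samples can later be filtered by range or counted in total, and the data is usually displayed as a histogram. Histograms are useful for analyzing areas such as application performance.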

After installing and starting Prometheus, open http://localhost:9090/metrics to see some of the Histogram data that Prometheus exposes about itself, as shown below.

# HELP prometheus_http_request_duration_seconds Histogram of latencies for HTTP requests.
# TYPE prometheus_http_request_duration_seconds histogram
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/label/:name/values",le="0.1"} 10
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/label/:name/values",le="0.2"} 10
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/label/:name/values",le="0.4"} 10
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/label/:name/values",le="1"} 10
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/label/:name/values",le="3"} 10
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/label/:name/values",le="8"} 10
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/label/:name/values",le="20"} 10
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/label/:name/values",le="60"} 10
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/label/:name/values",le="120"} 10
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/label/:name/values",le="+Inf"} 10
prometheus_http_request_duration_seconds_sum{handler="/api/v1/label/:name/values"} 0.017084245999999997
prometheus_http_request_duration_seconds_count{handler="/api/v1/label/:name/values"} 10
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/query",le="0.1"} 61
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/query",le="0.2"} 61
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/query",le="0.4"} 61
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/query",le="1"} 61
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/query",le="3"} 61
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/query",le="8"} 61
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/query",le="20"} 61
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/query",le="60"} 61
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/query",le="120"} 61
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/query",le="+Inf"} 61
prometheus_http_request_duration_seconds_sum{handler="/api/v1/query"} 0.03728351100000001
prometheus_http_request_duration_seconds_count{handler="/api/v1/query"} 61

As the output above shows, a Histogram exposes three kinds of series. Assume the metric name is <basename>.

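·Bucket counts, named <basename>_bucket{le="<upper bound>"}. The value is the number of samples whose value is less than or equal to the upper bound, so the buckets are cumulative. In the output above, prometheus_http_request_duration_seconds_bucket{handler="/api/v1/query",le="0.1"} 61 means that, of 61 requests in total, all 61 had an HTTP response time ≤ 0.1 s.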

·The sum of all sample values, named <basename>_sum. In the output above, prometheus_http_request_duration_seconds_sum{handler="/api/v1/query"} 0.03728351100000001 means that the 61 HTTP requests took 0.03728351100000001 s in total.

·The total number of samples, named <basename>_count; its value equals <basename>_bucket{le="+Inf"}. In the output above, prometheus_http_request_duration_seconds_count{handler="/api/v1/query"} 61 means that 61 requests have occurred so far.
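The relationship between the three series can be illustrated with a toy histogram. This is a minimal sketch, not the official Prometheus client library: the class name SimpleHistogram, its methods, and the four latencies observed at the end are assumptions made for illustration. The bucket bounds are the ones that appear in the output above.

```python
import math
from bisect import bisect_left


class SimpleHistogram:
    """Toy cumulative histogram (illustrative only, not a real client library)."""

    def __init__(self, upper_bounds):
        # Sorted finite upper bounds plus the implicit +Inf bucket.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.per_bucket = [0] * len(self.upper_bounds)
        self.total = 0.0  # what would be exposed as <basename>_sum
        self.count = 0    # what would be exposed as <basename>_count

    def observe(self, value):
        # Index of the first bucket whose upper bound is >= value ("le").
        idx = bisect_left(self.upper_bounds, value)
        self.per_bucket[idx] += 1
        self.total += value
        self.count += 1

    def cumulative_buckets(self):
        # Each exposed bucket counts every observation <= its upper bound.
        running, out = 0, []
        for bound, n in zip(self.upper_bounds, self.per_bucket):
            running += n
            out.append((bound, running))
        return out


# Hypothetical latencies in seconds.
h = SimpleHistogram([0.1, 0.2, 0.4, 1, 3, 8, 20, 60, 120])
for latency in [0.05, 0.1, 0.3, 5.0]:
    h.observe(latency)

buckets = h.cumulative_buckets()
for bound, cumulative in buckets:
    print(f'le="{bound}" {cumulative}')
print("sum", h.total, "count", h.count)
```

Because the buckets are cumulative, the last (+Inf) bucket always equals the count, which matches the third bullet above.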

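Dividing the _sum series by the _count series yields an average. For example, the average compaction duration of Prometheus over one day, aggregated across instances by dropping the instance label, can be computed as follows.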

sum without(instance)(rate(prometheus_tsdb_compaction_duration_seconds_sum[1d])) 
/ 
sum without(instance)(rate(prometheus_tsdb_compaction_duration_seconds_count[1d]))
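The arithmetic behind this kind of query can be sketched in a few lines of Python. The helper name average_over_window is an assumption made for illustration. The example feeds it the /api/v1/query counters from the earlier output, treating the window as starting from zero, so the result is the average over the whole lifetime of those counters rather than over one day.

```python
def average_over_window(sum_start, sum_end, count_start, count_end):
    """Average observation between two readings of _sum and _count."""
    events = count_end - count_start
    if events == 0:
        # No observations in the window, so there is no average to report.
        return None
    return (sum_end - sum_start) / events


# Counters copied from the /api/v1/query sample shown earlier.
average_seconds = average_over_window(0.0, 0.03728351100000001, 0, 61)
print(f"{average_seconds * 1000:.3f} ms")
```

The result is well under one millisecond per request, which is useful as a baseline when looking at the bucket-based estimates later in this section.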

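Besides the built-in compaction duration, prometheus_local_storage_series_chunks_persisted represents the number of chunks that need to be stored for each time series in Prometheus, and it can likewise be used to compute quantiles of the data waiting to be persisted.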

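A Histogram lets you observe how sample data is distributed. Quantiles are computed from it with the function histogram_quantile(φ float, b instant-vector), where φ (0 ≤ φ ≤ 1) is the quantile to compute. The result is an estimate rather than an exact value: it is derived from the cumulative counts in the _bucket series (here prometheus_http_request_duration_seconds_bucket) by interpolating within a bucket. For example: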

histogram_quantile(0.1, prometheus_http_request_duration_seconds_bucket)
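To see why the result is only an estimate, the following simplified sketch interpolates linearly inside the bucket that contains the requested rank. It is an approximation written for illustration and not the Prometheus implementation; the function name approx_quantile and the (upper_bound, cumulative_count) input format are assumptions. It expects at least one finite bucket followed by the +Inf bucket.

```python
import math


def approx_quantile(phi, buckets):
    """Estimate a quantile from [(upper_bound, cumulative_count), ...].

    The list must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 <= phi <= 1:
        raise ValueError("phi must be within [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = phi * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, c) in enumerate(buckets) if c >= rank)
    upper, cum = buckets[idx]
    if math.isinf(upper):
        # The rank falls into the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[idx - 1][0] if idx > 0 else 0.0
    below = buckets[idx - 1][1] if idx > 0 else 0
    if cum == below:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (cum - below)


# Cumulative buckets of the /api/v1/query handler from the earlier output.
query_buckets = [(0.1, 61), (0.2, 61), (0.4, 61), (1, 61), (3, 61), (8, 61),
                 (20, 61), (60, 61), (120, 61), (math.inf, 61)]
p90 = approx_quantile(0.9, query_buckets)
print(p90)
```

With these buckets the estimate is roughly 0.09 s, even though the 61 requests took only about 0.037 s in total. All observations fall into the first bucket, so the estimate cannot be finer than that bucket's width; this is the sense in which bucket-based quantiles are approximate.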

Further Notes

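A bucket can be understood as a partition of the metric's value range, and the bucket boundaries should be chosen according to how the values are distributed. Suppose xxx_bucket{...,le="0.01"} has the value 10 and xxx_bucket{...,le="100"} has the value 100. Then, of these 100 samples, 10 took at most 10 ms, and the remaining 90 (100 - 10 = 90) had response times between 10 ms and 100 s.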
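The subtraction in this note (100 - 10 = 90) generalizes to any list of cumulative buckets. The sketch below assumes upper bounds of 0.01 s and 100 s with cumulative counts of 10 and 100, and a lower bound of 0 for the first bucket; the function name per_interval_counts is hypothetical.

```python
def per_interval_counts(cumulative):
    """Turn [(upper_bound, cumulative_count), ...] into [(lower, upper, count), ...]."""
    intervals = []
    prev_bound, prev_count = 0.0, 0
    for bound, count in cumulative:
        # Samples in (prev_bound, bound] are the difference of neighbouring buckets.
        intervals.append((prev_bound, bound, count - prev_count))
        prev_bound, prev_count = bound, count
    return intervals


example = [(0.01, 10), (100.0, 100)]
intervals = per_interval_counts(example)
print(intervals)
```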

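In production, φ is commonly set to 0.9, the 0.9 quantile, also known as the 90th percentile. Prometheus 2.2.1 provides the metric prometheus_tsdb_compaction_duration_seconds, which tracks how many seconds compacting the time series database takes. Compaction usually runs once every 2 h, and the bucket series of this Histogram are counters, so you must first apply rate and use a range longer than 2 h. For example, the following query computes the 90th percentile of compaction time over one day: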

histogram_quantile(0.90, rate(prometheus_tsdb_compaction_duration_seconds_bucket[1d]))

Further Notes

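A day has 24 h, and with one compaction every 2 h there are 12 compactions. The result of the query above means that 90% of the compactions (about 10) finished faster than this value, while the remaining 10% (1 to 2 compactions) took longer.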

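Note, however, that if you set φ to a more precise value such as 0.999, you need at least a few thousand data points to get a reasonable and accurate answer. With φ = 0.999 and far fewer data points than that, a single outlier strongly skews the result and makes it inaccurate.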
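A rough back-of-the-envelope check makes the sample-size requirement concrete: the number of observations expected to lie above the φ-quantile is (1 - φ) times the sample count. The helper name expected_samples_above and the sample counts tried below are assumptions chosen for illustration.

```python
def expected_samples_above(phi, sample_count):
    """Expected number of observations above the phi-quantile."""
    return (1 - phi) * sample_count


# 12 is the number of compactions per day from the note above;
# 1000 and 5000 are arbitrary larger sample counts.
for n in (12, 1000, 5000):
    print(n, round(expected_samples_above(0.999, n), 3))

# The same arithmetic for the 90th percentile of 12 daily compactions.
slow_compactions = expected_samples_above(0.9, 12)
```

With only 12 samples, far less than one observation is expected above the 0.999 quantile, so a single outlier decides the answer. Several thousand samples are needed before a handful of observations sit in that tail.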

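Ranges of 5 to 10 min are usually recommended for Histogram queries. If you choose a range of hours or days, the bucket series may carry many labels and all of them go through rate, so a histogram over such a long span becomes very expensive to compute.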

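histogram_quantile is normally the last step of a query expression. Statistically, quantiles cannot be aggregated, nor can arithmetic be meaningfully applied to them. The input to histogram_quantile is bucket counters that pass through rate and often through an aggregation such as sum, so do all of that first and call histogram_quantile last. To stress the order once more: in PromQL, apply rate() first and then sum(), never sum() followed by rate(). The previous example computed the 90th percentile of compaction time over one day. The following example uses sum to compute that 90th percentile across all Prometheus servers and returns a result without the instance label, such as {job="prometheus"} 7.720000000000001.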

histogram_quantile(0.90, sum without(instance)(rate(prometheus_tsdb_compaction_duration_seconds_bucket[1d])))
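Why the aggregation has to happen before the quantile is taken can be shown with raw numbers. The sketch below uses exact nearest-rank percentiles on hypothetical compaction durations from two servers. The data, the server names, and the helper nearest_rank_percentile are all assumptions made for illustration, and the code works on raw values rather than on buckets.

```python
import math


def nearest_rank_percentile(percent, values):
    """Exact percentile of raw values using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical compaction durations in seconds.
server_a = [1.0] * 90 + [2.0] * 10  # 100 compactions, mostly fast
server_b = [9.0] * 10               # 10 compactions, all slow

p90_a = nearest_rank_percentile(90, server_a)
p90_b = nearest_rank_percentile(90, server_b)

# Averaging per-server percentiles ignores how many samples each server has.
average_of_percentiles = (p90_a + p90_b) / 2

# Combining the observations first gives the percentile of the whole fleet.
p90_combined = nearest_rank_percentile(90, server_a + server_b)

print(average_of_percentiles, p90_combined)
```

The average of the two per-server values is 5.0 s, while the 90th percentile of all 110 compactions is 2.0 s. Summing the bucket rates across instances first and calling histogram_quantile last corresponds to the second calculation.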
