Digging into the PostgreSQL cost model
If only one CPU is used, the execution plan will look like this:
test=# EXPLAIN SELECT * FROM t_test WHERE id = 432332;
                        QUERY PLAN
-----------------------------------------------------------
 Seq Scan on t_test  (cost=0.00..71622.00 rows=1 width=9)
   Filter: (id = 432332)
(2 rows)
PostgreSQL will read the entire table sequentially (a sequential scan) and apply the filter to every row. It expects the operation to cost 71622 penalty points. So what does that mean? Penalty points (or costs) are an abstract concept: they exist to compare different ways to execute the same query. If a query can be executed in many different ways, PostgreSQL will decide on the execution plan promising the lowest cost. The question now is: how did PostgreSQL end up with 71622 points?
Here is how it works:
test=# SELECT pg_relation_size('t_test') / 8192.0;
?column?
--------------------
21622.000000000000
(1 row)
The pg_relation_size function returns the size of the table in bytes. Dividing by the block size shows that the relation consists of 21622 blocks of 8,192 bytes (8 KB) each. According to the cost model, PostgreSQL will charge a cost of one for each block it has to read sequentially.
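As a quick cross-check, the planner's own estimate of the table's size in blocks is stored in the relpages column of pg_class. It is maintained by VACUUM and ANALYZE, so it may lag slightly behind reality, but it should match the figure above once the table has been processed:
test=# SELECT relpages FROM pg_class WHERE relname = 't_test';
 relpages
----------
    21622
(1 row)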
The configuration parameter to influence that is:
test=# SHOW seq_page_cost;
seq_page_cost
---------------
1
(1 row)
However, reading blocks from disk is not all there is to it. Every row also has to be sent through the CPU, and the filter has to be applied to it. Two parameters account for those costs:
test=# SHOW cpu_tuple_cost;
cpu_tuple_cost
----------------
0.01
(1 row)
test=# SHOW cpu_operator_cost;
cpu_operator_cost
-------------------
0.0025
(1 row)
This leads to the following calculation (t_test contains 4 million rows, so both the per-tuple CPU cost and the filter have to be applied 4 million times):
test=# SELECT 21622*1 + 4000000*0.01 + 4000000*0.0025;
?column?
------------
71622.0000
(1 row)
As you can see, this is exactly the number shown in the plan. Costs consist of an I/O part and a CPU part, which are combined into a single number. The important thing here is that costs are not tied to real execution times, so it is impossible to translate costs into milliseconds; the number the planner comes up with is really just an estimate used to rank the alternatives.
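To see the model at work, you can change one of the parameters for the current session and ask for the estimate again. Lowering seq_page_cost to 0.5 should cut the I/O part of the estimate in half (21622 * 0.5 + 40000 + 10000 = 61811 - 1000 = 60811), so on this test table the plan should come out at roughly the following:
test=# SET seq_page_cost = 0.5;
SET
test=# EXPLAIN SELECT * FROM t_test WHERE id = 432332;
                        QUERY PLAN
-----------------------------------------------------------
 Seq Scan on t_test  (cost=0.00..60811.00 rows=1 width=9)
   Filter: (id = 432332)
(2 rows)
test=# RESET seq_page_cost;
RESET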
Of course, there are more cost parameters than the ones outlined in this brief example. PostgreSQL also has dedicated parameters for index-related operations:
- random_page_cost = 4: If PostgreSQL uses an index, there is usually a lot of random I/O involved. On traditional spinning disks, random reads are much more expensive than sequential reads, so PostgreSQL accounts for them accordingly. Note that on SSDs, the difference between random and sequential reads has largely disappeared, so it can make sense to set random_page_cost = 1 in the postgresql.conf file (or just for the current session, as shown right after this list).
- cpu_index_tuple_cost = 0.005: If indexes are used, PostgreSQL also charges a small CPU cost for every index entry it has to process.
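For a quick experiment there is no need to touch postgresql.conf at all; the parameter can simply be changed for the current session, and the change is gone once the session ends:
test=# SET random_page_cost = 1;
SET
test=# SHOW random_page_cost;
 random_page_cost
------------------
 1
(1 row)
To make the change permanent for all sessions, adjust postgresql.conf (or use ALTER SYSTEM) and reload the configuration.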
If you are utilizing parallel queries, there are even more cost parameters:
- parallel_tuple_cost = 0.1: This defines the cost of transferring one tuple from a parallel worker process to another process. It basically accounts for the overhead of moving rows around between processes inside the parallel infrastructure.
- parallel_setup_cost = 1000.0: This adjusts the cost of firing up a worker process. Starting processes to run queries in parallel is not free, so this parameter tries to model the costs associated with process management.
- min_parallel_relation_size = 8 MB: This defines the minimum size a table must have to be considered for a parallel query. The larger a table grows, the more worker processes PostgreSQL will consider; the size of the table has to triple to allow for one more worker process, as illustrated right after this list.
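The tripling rule is easy to visualize. Assuming the default of 8 MB, the following quick calculation shows, purely as an illustration, roughly how large a table has to be before the planner will consider one more worker; the actual number of workers is of course also capped by the max_parallel_workers_per_gather setting:
test=# SELECT n AS workers, 8 * 3 ^ (n - 1) AS min_table_size_mb
       FROM generate_series(1, 4) AS n;
 workers | min_table_size_mb
---------+-------------------
       1 |                 8
       2 |                24
       3 |                72
       4 |               216
(4 rows)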