
In my previous post, “Three Configuration Parameters for PostgreSQL That Are Worth Further Investigation!”, I introduced three PostgreSQL parameters that I feel are the most “neglected” and that present the greatest opportunity for performance tuning in a PostgreSQL instance. In this first installment of what will become a three-part series, I would like to demonstrate tuning the “work_mem” parameter.
The “work_mem” parameter controls how much memory each of the following operations can use before spilling to temporary files on disk:
- sorts
- bitmap heap scans
- hash joins
- materialized common table expressions (WITH statements)
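Before tuning anything, it is worth confirming what value the session is actually running with (the PostgreSQL default is 4MB). A quick way to check:
SHOW work_mem;
-- The same value, along with where it was set, from the catalog
SELECT name, setting, unit, source FROM pg_settings WHERE name = 'work_mem';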
To get started, let's create a test table with 100 million rows of data:
CREATE TABLE public.work_mem_test (
  id int PRIMARY KEY,
  value numeric,
  product_id int,
  effective_date timestamp(3)
);

INSERT INTO public.work_mem_test
VALUES (generate_series(0,100000000), random()*1000, random()*100, current_timestamp(3));

CREATE INDEX prod_value_idx ON public.work_mem_test (product_id);

VACUUM ANALYZE public.work_mem_test;
We will then run EXPLAIN ANALYZE with the “COSTS, BUFFERS, VERBOSE” options so that we can see exactly what is going on with the query. For demonstration purposes, I have set “work_mem” to the lowest possible setting of 64kB. In addition, I have set “max_parallel_workers_per_gather” to zero to disable parallel processing, so that we don’t get the variability it introduces. Most systems may also see better gains than this test case, as it was run on a very small 2 vCPU / 8GB Google Cloud SQL PostgreSQL instance:
set session work_mem to '64kB';
set max_parallel_workers_per_gather = 0;
set effective_io_concurrency = 20;
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
SELECT * FROM public.work_mem_test
WHERE id BETWEEN 10000 AND 100000
OR product_id BETWEEN 100 AND 200
ORDER BY value NULLS FIRST;
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=2640424.97..2641959.63 rows=613866 width=27) (actual time=16593.228..16778.132 rows=589379 loops=1)
Output: id, value, product_id, effective_date
Sort Key: work_mem_test.value NULLS FIRST
Sort Method: external merge Disk: 24248kB
Buffers: shared hit=6 read=363702, temp read=15340 written=16486
I/O Timings: read=4120.104
-> Bitmap Heap Scan on public.work_mem_test (cost=7805.12..2539439.23 rows=613866 width=27) (actual time=82.454..15413.822 rows=589379 loops=1)
Output: id, value, product_id, effective_date
Recheck Cond: (((work_mem_test.id >= 10000) AND (work_mem_test.id <= 100000)) OR ((work_mem_test.product_id >= 100) AND (work_mem_test.product_id <= 200)))
Rows Removed by Index Recheck: 48506004
Heap Blocks: exact=2058 lossy=360975
Buffers: shared hit=6 read=363702
I/O Timings: read=4120.104
-> BitmapOr (cost=7805.12..7805.12 rows=614306 width=0) (actual time=80.340..80.342 rows=0 loops=1)
Buffers: shared hit=6 read=669
I/O Timings: read=17.683
-> Bitmap Index Scan on work_mem_test_pkey (cost=0.00..1280.95 rows=82638 width=0) (actual time=12.680..12.680 rows=90001 loops=1)
Index Cond: ((work_mem_test.id >= 10000) AND (work_mem_test.id <= 100000))
Buffers: shared hit=3 read=247
I/O Timings: read=7.831
-> Bitmap Index Scan on prod_value_idx (cost=0.00..6217.24 rows=531667 width=0) (actual time=67.657..67.657 rows=499851 loops=1)
Index Cond: ((work_mem_test.product_id >= 100) AND (work_mem_test.product_id <= 200))
Buffers: shared hit=3 read=422
I/O Timings: read=9.852
Query Identifier: 731698457235411789
Planning:
Buffers: shared hit=39 read=2
I/O Timings: read=2.588
Planning Time: 3.136 ms
Execution Time: 16811.970 ms
(30 rows)
Time: 16903.784 ms (00:16.904)
EXPLAIN, via the BUFFERS keyword, gives us the following data points:
Rows Removed by Index Recheck: 48506004
Heap Blocks: exact=2058 lossy=360975
…
Execution Time: 16811.970 ms
This essentially means that, with 64kB of work_mem, the bitmap structure can only track 2058 heap blocks exactly. For the remainder of the results, everything that does not fit in the bitmap at tuple granularity is stored as lossy blocks, meaning the bitmap no longer points to an exact tuple but rather to a whole block containing many tuples. The recheck condition then has to re-scan each of those blocks for the tuples the query is actually looking for, which is why 48,506,004 rows were removed by the index recheck.
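To put those block counts in perspective, you can check how many 8kB heap pages the test table occupies in total (relpages is kept current by the VACUUM ANALYZE we ran earlier):
SELECT relpages,
       pg_size_pretty(pg_relation_size('public.work_mem_test')) AS heap_size
FROM pg_class
WHERE relname = 'work_mem_test';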
The following formula is a starting point; it may or may not give you the exact setting needed, depending on various factors. Since we started from the lowest possible work_mem, the new setting is computed as a multiple of the current one:
new_mem_in_mbytes = ceil(round(((exact heap blocks + lossy heap blocks) / exact heap blocks) * work_mem_in_bytes / 1048576, 1))
                  = ceil(round(((2058 + 360975) / 2058) * 65536 / 1048576, 1))
                  = 11MB
Note: In most cases, I have found that this formula works well on the first pass; however, as you will see in the subsequent tests, the estimated work_mem setting wasn’t quite enough in this case, and that is likely due to a mis-estimate by the planner.
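If you prefer to let the database do the arithmetic, the same calculation can be run as a simple query; plug in the exact and lossy heap block counts reported by EXPLAIN and the current work_mem in bytes (the values below are from the plan above):
SELECT ceil(round(((2058 + 360975)::numeric / 2058) * 65536 / 1048576, 1)) AS suggested_work_mem_mb;
-- suggested_work_mem_mb
-- -----------------------
--                     11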
Reducing the Lossy Block Access
So for the next test, I will increase “work_mem” to 11MB and re-execute the query.
set session work_mem to '11MB';
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
SELECT * FROM public.work_mem_test
WHERE id BETWEEN 10000 AND 100000
OR product_id BETWEEN 100 AND 200
ORDER BY value NULLS FIRST;
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=2160340.77..2161875.43 rows=613866 width=27) (actual time=11382.002..11574.572 rows=589379 loops=1)
Output: id, value, product_id, effective_date
Sort Key: work_mem_test.value NULLS FIRST
Sort Method: external merge Disk: 24232kB
Buffers: shared hit=23329 read=340379, temp read=3029 written=3034
I/O Timings: read=3618.302
-> Bitmap Heap Scan on public.work_mem_test (cost=7805.12..2090832.53 rows=613866 width=27) (actual time=185.251..10923.764 rows=589379 loops=1)
Output: id, value, product_id, effective_date
Recheck Cond: (((work_mem_test.id >= 10000) AND (work_mem_test.id <= 100000)) OR ((work_mem_test.product_id >= 100) AND (work_mem_test.product_id <= 200)))
Heap Blocks: exact=164090 lossy=198943
-> BitmapOr (cost=7805.12..7805.12 rows=614306 width=0) (actual time=132.954..132.957 rows=0 loops=1)
Buffers: shared hit=675
-> Bitmap Index Scan on work_mem_test_pkey (cost=0.00..1280.95 rows=82638 width=0) (actual time=4.449..4.450 rows=90001 loops=1)
Index Cond: ((work_mem_test.id >= 10000) AND (work_mem_test.id <= 100000))
-> Bitmap Index Scan on prod_value_idx (cost=0.00..6217.24 rows=531667 width=0) (actual time=128.503..128.503 rows=499851 loops=1)
Index Cond: ((work_mem_test.product_id >= 100) AND (work_mem_test.product_id <= 200))
Buffers: shared hit=425
Query Identifier: 731698457235411789
Planning:
Buffers: shared hit=30
Planning Time: 0.179 ms
Execution Time: 11611.071 ms
(26 rows)
Time: 11695.952 ms (00:11.696)
With 11MB, we got many more exact heap blocks, but still not enough memory to make the entire bitmap exact. Applying the formula again, based on the execution plan of this query:
new_mem_in_mbytes = ceil(round(((exact heap blocks + lossy heap blocks) / exact heap blocks) * work_mem_in_bytes / 1048576, 1))
                  = ceil(round(((164090 + 198943) / 164090) * 12582912 / 1048576, 1))
                  = 24MB
Let’s increase just a little bit more to 24MB as the latest iteration of the formula has suggested.
set session work_mem to '24MB';
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
SELECT * FROM public.work_mem_test
WHERE id BETWEEN 10000 AND 100000
OR product_id BETWEEN 100 AND 200
ORDER BY value NULLS FIRST;
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=1709385.33..1710919.99 rows=613866 width=27) (actual time=3651.589..3791.250 rows=589379 loops=1)
Output: id, value, product_id, effective_date
Sort Key: work_mem_test.value NULLS FIRST
Sort Method: external merge Disk: 24232kB
Buffers: shared hit=23329 read=340379, temp read=3029 written=3031
I/O Timings: read=1493.162
-> Bitmap Heap Scan on public.work_mem_test (cost=7805.12..1639877.09 rows=613866 width=27) (actual time=348.261..3201.421 rows=589379 loops=1)
Output: id, value, product_id, effective_date
Recheck Cond: (((work_mem_test.id >= 10000) AND (work_mem_test.id <= 100000)) OR ((work_mem_test.product_id >= 100) AND (work_mem_test.product_id <= 200)))
-> BitmapOr (cost=7805.12..7805.12 rows=614306 width=0) (actual time=188.309..188.311 rows=0 loops=1)
Buffers: shared hit=675
-> Bitmap Index Scan on work_mem_test_pkey (cost=0.00..1280.95 rows=82638 width=0) (actual time=5.090..5.091 rows=90001 loops=1)
Index Cond: ((work_mem_test.id >= 10000) AND (work_mem_test.id <= 100000))
-> Bitmap Index Scan on prod_value_idx (cost=0.00..6217.24 rows=531667 width=0) (actual time=183.215..183.215 rows=499851 loops=1)
Index Cond: ((work_mem_test.product_id >= 100) AND (work_mem_test.product_id <= 200))
Buffers: shared hit=425
Query Identifier: 731698457235411789
Planning:
Buffers: shared hit=30
Planning Time: 0.242 ms
Execution Time: 3828.215 ms
(25 rows)
Time: 3912.670 ms (00:03.913)
No more lossy blocks! The execution time has also dropped quite significantly from the first run.
Handling the Sort Method
Now, we need to pay attention to these explain plan lines:
"Bitmap Heap Scan on public.work_mem_test….rows=613866 width=27"
"Sort Method: external merge Disk: 24232kB."
The sort is still spilling to disk. To fit the entire rowset in memory, we can multiply the estimated input rows by the row width: 613,866 rows × 27 bytes ≈ 16MB. In addition, the sort wrote roughly 24MB to disk, so let’s add that to “work_mem” as well.
16MB + 24MB spilled to disk = 40MB more “work_mem”
So with the current “work_mem” of 24MB plus the additional 40MB computed to keep the sort in memory (rounded up), the total needed is 64MB.
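As a quick cross-check of the 16MB figure, the planner's row estimate multiplied by the row width can be computed directly. Keep in mind this is only the raw tuple data; the in-memory sort also carries per-tuple overhead, which is why the final plan below ends up reporting a quicksort using roughly 60MB:
SELECT pg_size_pretty((613866 * 27)::bigint) AS estimated_sort_data;
-- estimated_sort_data
-- ---------------------
--  16 MB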
Let's run one more test:
set session work_mem to '64MB';
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
SELECT * FROM public.work_mem_test
WHERE id BETWEEN 10000 AND 100000
OR product_id BETWEEN 100 AND 200
ORDER BY value NULLS FIRST;
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=613087.41..614622.07 rows=613866 width=27) (actual time=3186.918..3309.896 rows=589379 loops=1)
Output: id, value, product_id, effective_date
Sort Key: work_mem_test.value NULLS FIRST
Sort Method: quicksort Memory: 61169kB
Buffers: shared hit=6 read=363702
I/O Timings: read=1306.892
-> Bitmap Heap Scan on public.work_mem_test (cost=7805.12..554071.67 rows=613866 width=27) (actual time=245.348..2908.344 rows=589379 loops=1)
Output: id, value, product_id, effective_date
Recheck Cond: (((work_mem_test.id >= 10000) AND (work_mem_test.id <= 100000)) OR ((work_mem_test.product_id >= 100) AND (work_mem_test.product_id <= 200)))
-> BitmapOr (cost=7805.12..7805.12 rows=614306 width=0) (actual time=115.051..115.053 rows=0 loops=1)
Buffers: shared hit=6 read=669
I/O Timings: read=3.561
-> Bitmap Index Scan on work_mem_test_pkey (cost=0.00..1280.95 rows=82638 width=0) (actual time=6.160..6.161 rows=90001 loops=1)
Index Cond: ((work_mem_test.id >= 10000) AND (work_mem_test.id <= 100000))
-> Bitmap Index Scan on prod_value_idx (cost=0.00..6217.24 rows=531667 width=0) (actual time=108.889..108.889 rows=499851 loops=1)
Index Cond: ((work_mem_test.product_id >= 100) AND (work_mem_test.product_id <= 200))
Buffers: shared hit=3 read=422
I/O Timings: read=2.231
Query Identifier: 731698457235411789
Planning:
Buffers: shared hit=30
Planning Time: 0.180 ms
Execution Time: 3347.271 ms
(28 rows)
Time: 3431.188 ms (00:03.431)
With that adjustment, we have significantly increased the efficiency and performance of the query. Compared to the first run, just by tuning “work_mem”, we have shaved approximately 13.5 seconds off the processing time!
What About a Top-N Heapsort?
Now, if we want to demonstrate a top-N heapsort, we can change the query just a little bit more by adding a LIMIT clause:
set session work_mem to '64MB';
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
SELECT * FROM public.work_mem_test
WHERE id BETWEEN 10000 AND 100000
OR product_id BETWEEN 100 AND 200
ORDER BY value NULLS FIRST LIMIT 10;
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=567337.09..567337.12 rows=10 width=27) (actual time=3021.185..3021.190 rows=10 loops=1)
Output: id, value, product_id, effective_date
Buffers: shared hit=6 read=363702
I/O Timings: read=1313.044
-> Sort (cost=567337.09..568871.76 rows=613866 width=27) (actual time=3021.183..3021.186 rows=10 loops=1)
Output: id, value, product_id, effective_date
Sort Key: work_mem_test.value NULLS FIRST
Sort Method: top-N heapsort Memory: 26kB
Buffers: shared hit=6 read=363702
I/O Timings: read=1313.044
-> Bitmap Heap Scan on public.work_mem_test (cost=7805.12..554071.67 rows=613866 width=27) (actual time=235.909..2911.978 rows=589379 loops=1)
Output: id, value, product_id, effective_date
Recheck Cond: (((work_mem_test.id >= 10000) AND (work_mem_test.id <= 100000)) OR ((work_mem_test.product_id >= 100) AND (work_mem_test.product_id <= 200)))
-> BitmapOr (cost=7805.12..7805.12 rows=614306 width=0) (actual time=108.429..108.431 rows=0 loops=1)
Buffers: shared hit=6 read=669
I/O Timings: read=3.114
-> Bitmap Index Scan on work_mem_test_pkey (cost=0.00..1280.95 rows=82638 width=0) (actual time=5.582..5.582 rows=90001 loops=1)
Index Cond: ((work_mem_test.id >= 10000) AND (work_mem_test.id <= 100000))
-> Bitmap Index Scan on prod_value_idx (cost=0.00..6217.24 rows=531667 width=0) (actual time=102.845..102.845 rows=499851 loops=1)
Index Cond: ((work_mem_test.product_id >= 100) AND (work_mem_test.product_id <= 200))
Buffers: shared hit=3 read=422
I/O Timings: read=2.037
Query Identifier: 4969544646514690020
Planning:
Buffers: shared hit=30
Planning Time: 0.177 ms
Execution Time: 3023.304 ms
(32 rows)
Time: 3107.421 ms (00:03.107)
Because we are only returning the top N rows, PostgreSQL can use a different sort method (a top-N heapsort, which only keeps the best rows seen so far), so the memory used is dramatically lower, only 26kB, and the execution time drops a bit further.
As you can see, with a little tuning of the “work_mem” parameter, a lot of performance can be gained in the system. In this example, we increased “work_mem” by a fairly small amount, from 64kB to 64MB. Remember that the setting applies per sort or hash operation rather than per query, so in my mind you never want to raise “work_mem” to a level where, if all backends were using their full allocation at the same time, you could overrun the free memory on the system. Also, remember that there is some overhead to maintaining that memory, so it's really important to find a good balance for your workload. Keep in mind that you can set this parameter at the server level, at the session level with a SET statement, or per role or database (examples below).
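For reference, here is a sketch of the different scopes at which work_mem can be set; the role and database names are illustrative, and the values are examples rather than recommendations:
-- Per session (the approach used throughout this post)
SET work_mem = '64MB';

-- Per role, so only a specific reporting user gets the larger allocation
ALTER ROLE reporting_user SET work_mem = '64MB';

-- Per database
ALTER DATABASE analytics SET work_mem = '64MB';

-- Cluster-wide default, picked up after a configuration reload
ALTER SYSTEM SET work_mem = '16MB';
SELECT pg_reload_conf();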
Enjoy!
