
Understanding High Water Mark Locking Issues in PostgreSQL Vacuums

I recently had a customer that wanted to leverage read replicas to ensure that their read queries would not be impeded by work being done on the primary instance; they also required an SLA of, at worst, a few seconds. Ultimately they weren’t meeting the SLA, and my colleagues and I were asked to look at what was going on.

The first thing we came to understand was the pattern of work on the primary: a fairly frequent large DELETE statement, followed by a data refresh accomplished by a COPY FROM STDIN command against a partitioned table with 16 hash partitions.
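For context, the shape of the table looked something like this (a minimal sketch; large_table and its columns are hypothetical stand-ins for the customer’s schema):

/* Hypothetical sketch of the workload's target: a 16-way hash-partitioned table. */
CREATE TABLE public.large_table (
    id           bigint NOT NULL,
    payload      text,
    refreshed_at timestamp(3)
) PARTITION BY HASH (id);

/* One partition per modulus/remainder pair, 16 in total. */
CREATE TABLE public.large_table_p0 PARTITION OF public.large_table
    FOR VALUES WITH (MODULUS 16, REMAINDER 0);
CREATE TABLE public.large_table_p1 PARTITION OF public.large_table
    FOR VALUES WITH (MODULUS 16, REMAINDER 1);
/* ... and so on through REMAINDER 15 ... */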

The problem being observed was that the SELECTs occurring on the read replica would periodically time out and miss the SLA. Upon investigation, we found that the “startup” process on the read replica would periodically request an “exclusive lock” on some seemingly random partition. This exclusive lock would block the SELECT (which is partition unaware) and cause the timeout. But what was causing the lock?
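A couple of standard catalog queries along these lines (not specific to this incident, but handy on a replica) show who holds the exclusive lock and who is stuck behind it:

/* On the replica: look for the startup process holding/requesting exclusive locks. */
SELECT a.pid,
       a.backend_type,
       l.relation::regclass AS locked_relation,
       l.mode,
       l.granted
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.mode = 'AccessExclusiveLock';

/* And, for any blocked query, which pids are blocking it. */
SELECT pid, query, pg_blocking_pids(pid) AS blocked_by
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;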

After spending some time investigating, the team was able to correlate the exclusive lock with a routine “autovacuum” occurring on the primary. But why was it locking? After inspecting the WAL, it turned out the issue was due to a step in the vacuum process whereby it tries to return free pages at the end of the table back to the OS: truncation of the High Water Mark (HWM). Essentially, the lock is requested on the primary and then transmitted to the replica via the WAL so that the tables can be kept consistent.
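If you want to catch the primary in the act, the truncation step shows up as its own phase in pg_stat_progress_vacuum:

/* On the primary: catch an (auto)vacuum while it is truncating the table. */
SELECT pid,
       relid::regclass AS table_name,
       phase
FROM pg_stat_progress_vacuum
WHERE phase = 'truncating heap';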

To confirm that it was in fact the step in VACUUM that truncates the HWM, we decided to alter each partition of the table to allow VACUUM to skip that step:

ALTER TABLE [table name / partition name] SET (vacuum_truncate = false);
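With 16 partitions, running that ALTER by hand gets tedious; a small DO block can apply it to every partition of the parent (public.large_table here being the hypothetical parent from the sketch above):

/* Disable vacuum truncation on every partition of the parent table. */
DO $$
DECLARE
    part regclass;
BEGIN
    FOR part IN
        SELECT inhrelid::regclass
        FROM pg_inherits
        WHERE inhparent = 'public.large_table'::regclass
    LOOP
        EXECUTE format('ALTER TABLE %s SET (vacuum_truncate = false)', part);
    END LOOP;
END;
$$;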

After letting this run for 24 hours, we saw no further blocking locks causing SLA misses on the replicas. Should we worry about never shrinking the High Water Mark (HWM)? Well, as with everything in IT, it depends. Other DBMS engines like Oracle do not shrink the HWM on their own; maintenance operations such as DBMS_REDEFINITION or ALTER TABLE … SHRINK SPACE / SHRINK SPACE COMPACT typically deal with that. So, back in PostgreSQL, do we need to worry about it? This is where the pg_freespacemap extension can help. We can use this extension and a script to check whether the HWM is in fact growing or staying put. If it is growing, we can simply execute a regular VACUUM with the additional TRUNCATE option to handle it:

VACUUM (verbose, truncate true) [schema].[table name];

When you do this, you will see one additional message in the VACUUM output signifying that the VACUUM truncated the High Water Mark (HWM):

INFO:  table "large_table": truncated 302534 to 302233 pages
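You can corroborate the truncation with a quick before-and-after size check (large_table being the table from the output above):

/* relpages and the on-disk size shrink once VACUUM truncates the empty tail. */
SELECT relpages,
       pg_size_pretty(pg_relation_size(oid)) AS table_size
FROM pg_class
WHERE relname = 'large_table';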

As I stated earlier, we can use pg_freespacemap to see whether we actually need to worry about the High Water Mark (HWM) growing. I could have taken a lot of time to write a script to figure it out, but instead I enlisted Google Gemini to see what it would come up with. After a few iterations, the output was nearly perfect!

CREATE EXTENSION pg_freespacemap;

CREATE OR REPLACE FUNCTION show_empty_pages(p_table_name TEXT)
RETURNS VOID AS $$
DECLARE
    -- Core processing variables
    table_oid_regclass  REGCLASS;
    block_size          BIGINT;
    fsm_granularity     BIGINT;
    max_fsm_free_space  BIGINT;
    total_pages         BIGINT;
    high_water_mark     BIGINT := 0;

    -- Variables for the final summary
    first_empty_block   BIGINT;
    free_pages_at_end   BIGINT;
    free_space_at_end   TEXT;
BEGIN
    -- Setup
    table_oid_regclass := p_table_name::regclass;
    block_size  := current_setting('block_size')::bigint;
    SELECT relpages INTO total_pages FROM pg_class WHERE oid = table_oid_regclass;
    fsm_granularity    := block_size / 256;
    max_fsm_free_space := floor((block_size - 24) / fsm_granularity) * fsm_granularity;

    --------------------------------------------------------------------------------
    -- PASS 1: FIND THE HIGH-WATER MARK (last page with data)
    -- NOTE: pg_freespace() reads the free space map, which is only kept
    -- accurate by VACUUM, so run this after a recent vacuum.
    --------------------------------------------------------------------------------
    FOR i IN REVERSE (total_pages - 1)..0 LOOP
        IF pg_freespace(table_oid_regclass, i) < max_fsm_free_space THEN
            high_water_mark := i;
            EXIT;
        END IF;
    END LOOP;

    --------------------------------------------------------------------------------
    -- FINAL STEP: CALCULATE AND RAISE THE SUMMARY NOTICE
    --------------------------------------------------------------------------------
    first_empty_block := high_water_mark + 1;
    free_pages_at_end := total_pages - first_empty_block;
    IF free_pages_at_end < 0 THEN
        free_pages_at_end := 0;
    END IF;
    free_space_at_end := pg_size_pretty(free_pages_at_end * block_size);

    RAISE NOTICE '-------------------------------------------------------------';
    RAISE NOTICE 'Summary for table: %', p_table_name;
    RAISE NOTICE '-------------------------------------------------------------';
    RAISE NOTICE 'The High Water Mark (HWM) is at page: %', total_pages;
    IF total_pages <> first_empty_block THEN
        RAISE NOTICE 'First potentially empty page is at: %', first_empty_block;
        RAISE NOTICE 'Total Pages in Table: %', total_pages;
        RAISE NOTICE 'Number of potentially truncatable pages at the end: %', free_pages_at_end;
        RAISE NOTICE 'Amount of free space at the end of the table: %', free_space_at_end;
    ELSE
        RAISE NOTICE 'There are no empty pages to truncate';
    END IF;
    RAISE NOTICE '-------------------------------------------------------------';
END;
$$ LANGUAGE plpgsql;

This handy script can be executed periodically to check the High Water Mark (HWM) and will produce output like the following:

(postgres@10.3.1.17:5432) [postgres] > SELECT * FROM show_empty_pages('public.large_table');
NOTICE:  -------------------------------------------------------------
NOTICE:  Summary for table: public.large_table
NOTICE:  -------------------------------------------------------------
NOTICE:  The High Water Mark (HWM) is at page: 302534
NOTICE:  First potentially empty page is at: 302233
NOTICE:  Total Pages in Table: 302534
NOTICE:  Number of potentially truncatable pages at the end: 301
NOTICE:  Amount of free space at the end of the table: 2408 kB
NOTICE:  -------------------------------------------------------------

If there is no free space after the last full block, the output will look like this:

NOTICE:  -------------------------------------------------------------
NOTICE:  Summary for table: public.large_table
NOTICE:  -------------------------------------------------------------
NOTICE:  The High Water Mark (HWM) is at page: 302233
NOTICE:  There are no empty pages to truncate
NOTICE:  -------------------------------------------------------------

So while there is no single right answer on how to deal with this, make sure you understand the implications of each step in the process. In this case, we decided to set the vacuum_truncate option to false, but another option might be to tune vacuum in other ways, such as making it more or less frequent. Always evaluate your own situation; in any case, it’s always good to know what happens in your database when certain commands are executed.
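If you would rather keep truncation enabled and adjust frequency instead, the per-table autovacuum knobs are the place to start; for example (the thresholds here are purely illustrative):

/* Illustrative: vacuum this partition after ~1% of rows change (plus a fixed
   floor of 1000 row versions) instead of the 20% default. */
ALTER TABLE public.large_table_p0 SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold    = 1000
);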

Enjoy!

Follow-Up: Reduce Vacuum by Using “ON CONFLICT” Directive

I previously blogged about ensuring that the “ON CONFLICT” directive is used to keep vacuum from having to do additional work. You can read the original blog here: Reduce Vacuum by Using “ON CONFLICT” Directive

Now that the “MERGE” functionality has been incorporated into Postgres 15 and above, I wanted to ensure that there was no “strange” behavior as it relates to vacuum when using MERGE. As you can see below, “MERGE” performs exactly as expected: when the target row already exists, MERGE takes the WHEN MATCHED branch and performs the update, and exactly one row is marked dead, namely the old row version replaced by the update.

/* Create the table: */
CREATE TABLE public.pk_violation_test (
        id int PRIMARY KEY, 
        value numeric,
        product_id int,
        effective_date timestamp(3)
        );
 
 
/* Insert some mocked up data */
INSERT INTO public.pk_violation_test VALUES ( 
        generate_series(0,10000), 
        random()*1000,
        random()*100,
        current_timestamp(3));
 
/* Verify that there are no dead tuples: */
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup
FROM
    pg_stat_all_tables
WHERE
    relname = 'pk_violation_test';
 
 schemaname |      relname      | n_live_tup | n_dead_tup
------------+-------------------+------------+------------
 public     | pk_violation_test |      10001 |          0

Then, create a simple merge and check the results:

WITH insert_query AS (
    SELECT
        0 AS id,
        44.33893489873 AS value,
        46 AS product_id,
        now() AS effective_date
)
MERGE INTO pk_violation_test pkt
USING insert_query i ON pkt.id = i.id
WHEN MATCHED THEN
    UPDATE SET
        value = i.value, product_id = i.product_id, effective_date = i.effective_date
WHEN NOT MATCHED THEN
    INSERT (id, value, product_id, effective_date)
        VALUES (i.id, i.value, i.product_id, i.effective_date);
MERGE 1

And then check the dead tuple count:

SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup
FROM
    pg_stat_all_tables
WHERE
    relname = 'pk_violation_test';
 schemaname |      relname      | n_live_tup | n_dead_tup
------------+-------------------+------------+------------
 public     | pk_violation_test |      10001 |          1
(1 row)

As expected, only one row is marked dead. MERGE is great functionality and I am glad to see it in Postgres. As you get time, consider converting your “ON CONFLICT” statements to take advantage of it; for reference, the ON CONFLICT form that the MERGE above replaces is shown below.
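For comparison, this is the equivalent upsert written with ON CONFLICT (same hypothetical row as in the MERGE example):

/* The ON CONFLICT upsert that the MERGE above replaces. */
INSERT INTO pk_violation_test (id, value, product_id, effective_date)
    VALUES (0, 44.33893489873, 46, now())
ON CONFLICT (id)
    DO UPDATE SET
        value          = EXCLUDED.value,
        product_id     = EXCLUDED.product_id,
        effective_date = EXCLUDED.effective_date;

One caveat before converting everything: the two are not identical under concurrency. INSERT … ON CONFLICT DO UPDATE will still succeed if another transaction inserts a conflicting row at the same time, whereas MERGE can raise a unique violation in that race.

Enjoy!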

Reduce Vacuum by Using “ON CONFLICT” Directive

I’m always working with customers migrating from Oracle to PostgreSQL. In both DBMS systems, there is I/O impact when using exception handlers such as when handling a PK constraint violation, but the impact in PostgreSQL is different and you should be aware of what is actually going on. For example, when an exception occurs, redo is generated in Oracle (WAL in Postgres) and additional catalog queries are issued in both DBMS systems to get pertinent data about the constraint violation. But what actually happens in Postgres as it relates to MVCC?

Let’s use a simple test to demonstrate. First, create a table with some mocked up “product” data:

/* Create the table: */
CREATE TABLE public.pk_violation_test (
		id int PRIMARY KEY, 
		value numeric,
		product_id int,
		effective_date timestamp(3)
		);


/* Insert some mocked up data */
INSERT INTO public.pk_violation_test VALUES ( 
		generate_series(0,10000), 
		random()*1000,
		random()*100,
		current_timestamp(3));

/* Verify that there are no dead tuples: */
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup
FROM
    pg_stat_all_tables
WHERE
    relname = 'pk_violation_test';

 schemaname |      relname      | n_live_tup | n_dead_tup
------------+-------------------+------------+------------
 public     | pk_violation_test |      10001 |          0

Now, we will execute a simple insert statement using no directive:

/* Perform a simple insert: */
INSERT INTO pk_violation_test
    VALUES (0, 44.33893489873, 46, now());

ERROR:  duplicate key value violates unique constraint "pk_violation_test_pkey"
DETAIL:  Key (id)=(0) already exists.
Time: 1.292 ms

/* Verify the dead tuple count: */
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup
FROM
    pg_stat_all_tables
WHERE
    relname = 'pk_violation_test';

 schemaname |      relname      | n_live_tup | n_dead_tup
------------+-------------------+------------+------------
 public     | pk_violation_test |      10001 |          1

As you can see, the insert failed with a duplicate key error, and the tuple that was written before the violation was detected is now a dead tuple.
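If you want to see that dead tuple on disk, the pageinspect extension can show it. The aborted insert’s row version lands wherever the insert found free space (often the table’s last page), with a t_xmin belonging to the aborted transaction; substitute the block number to match your table:

/* Inspect a heap page; the aborted insert appears as a tuple whose inserting
   transaction (t_xmin) never committed. Block 0 is just an example. */
CREATE EXTENSION IF NOT EXISTS pageinspect;

SELECT lp, t_xmin, t_xmax, t_ctid
FROM heap_page_items(get_raw_page('pk_violation_test', 0))
LIMIT 5;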

Now change the insert to use the “ON CONFLICT” directive and check the dead tuple count:

/* Perform a simple insert using the directive: */
INSERT INTO pk_violation_test
    VALUES (0, 44.33893489873, 46, now())
ON CONFLICT
    DO NOTHING;

INSERT 0 0
Time: 0.889 ms

/* Verify the unchanged dead tuple count: */
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup
FROM
    pg_stat_all_tables
WHERE
    relname = 'pk_violation_test';

 schemaname |      relname      | n_live_tup | n_dead_tup
------------+-------------------+------------+------------
 public     | pk_violation_test |    100001  |          1

As you can see the dead tuple count did not increase, thus reducing the amount of vacuum needed!

Now, most code conversion tools will try to do something with the original exception block. Maybe it gets converted into something like this:

/* Create a simple function with exception logic: */
CREATE OR REPLACE FUNCTION pk_violation_test_func (p_id int, p_value numeric, p_product_id int)
    RETURNS VOID
    AS $$
BEGIN
    BEGIN
        INSERT INTO pk_violation_test (id, value, product_id, effective_date)
            VALUES (p_id, p_value, p_product_id, now());
        RETURN;
    EXCEPTION
        WHEN unique_violation THEN
            -- try an update
            UPDATE
                pk_violation_test
            SET
                value = p_value,
                product_id = p_product_id,
                effective_date = now()
            WHERE
                id = p_id;
            IF found THEN
                RETURN;
            END IF;
    END;
END;
$$
LANGUAGE plpgsql;

So what happens in this case? Watch how we now get TWO dead tuples: one from the failed insert and one from the update’s old row version (a HOT update can be cleaned up by page pruning rather than a full vacuum, but it still leaves a dead row version behind). I will vacuum the table first so that there is no question about how many dead tuples there are:

/* Vacuum the table: */
VACUUM pk_violation_test;

/* Verify the dead tuple count: */
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup
FROM
    pg_stat_all_tables
WHERE
    relname = 'pk_violation_test';
 schemaname |      relname      | n_live_tup | n_dead_tup
------------+-------------------+------------+------------
 public     | pk_violation_test |      10001 |          0

/* Call the sample function */
select * from pk_violation_test_func(0, 44.33893489873, 46);
 pk_violation_test_func
------------------------

(1 row)

/* Verify the dead tuple count: */
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup
FROM
    pg_stat_all_tables
WHERE
    relname = 'pk_violation_test';
 schemaname |      relname      | n_live_tup | n_dead_tup
------------+-------------------+------------+------------
 public     | pk_violation_test |      10001 |          2

As you can see, the insert attempt using the “ON CONFLICT” directive did not create a dead tuple. In turn, this makes inserts that would violate a PK faster and more efficient, and it avoids unnecessary vacuum work. Remember the log message that resulted from the PK violation not being handled? That log message is gone too. A WIN all around!

The point being that it is very important to understand whatever DBMS you are running. Things that seem very simple can have pros and cons that need to be dealt with.

Enjoy!