Why Isn’t My Query Benefiting from Partition Pruning?

BLOGGER'S NOTE: The original posting was worded in a way that made it unclear whether I was referring to the planner failing to prune partitions or the executor failing to prune them. Apologies for the error. The post below has been corrected.

Recently a customer came to me with a poorly performing query that had a perfectly fine execution time on Oracle but, once migrated to Postgres, was slow. Upon investigation, I found that the planner was not pruning partitions, which caused a much longer planning time. Even though the planner was not pruning partitions, the executor was (indicated by the portions of the explain plan marked "never executed").

The query's "WHERE" and "JOIN" clauses explicitly specified the partition key, and the tables were joined on the partition key, so why wasn't pruning happening at planning time?

The setup for the test is at the bottom of the blog post: Jump to Test Case Setup

Original Query

The tables in question are all partitioned by "hire_date" and, as you can see in the "WHERE" clause below, you would expect the planner to prune partitions because the "hire_date" from the driving table is equal to that of the joined table. Yet the planner did not prune:

WITH top_emp AS (
   SELECT 
       DISTINCT
       id,
       first_name,
       last_name,
       hire_date,
       birth_date
   FROM
       employees.employee_part emp
       INNER JOIN employees.salary sal on (sal.employee_id = emp.id)
   WHERE
       emp.hire_date = '1985-01-01'::date
       and emp.gender = 'M'
       and emp.birth_date between '1957-01-01'::date and '1970-01-01'::date
       and sal.amount > 50000
)
SELECT
    dept.employee_id,
    dept.department_id,
    dept.hire_date
FROM
    top_emp emp1,
    employees.department_employee_part dept
WHERE
    emp1.hire_date = dept.hire_date and
    emp1.id = dept.employee_id;

/* Explain Plan */
 Nested Loop  (cost=1467.16..2646.20 rows=307 width=17) (actual time=4.565..4.613 rows=4 loops=1)
   Output: dept.employee_id, dept.department_id, dept.hire_date
   Buffers: shared hit=330
   ->  Unique  (cost=1466.87..1471.11 rows=37 width=31) (actual time=4.537..4.560 rows=4 loops=1)
         Output: emp.id, emp.first_name, emp.last_name, emp.hire_date, emp.birth_date
         Buffers: shared hit=318
         ->  Sort  (cost=1466.87..1467.72 rows=339 width=31) (actual time=4.534..4.539 rows=69 loops=1)
               Output: emp.id, emp.first_name, emp.last_name, emp.hire_date, emp.birth_date
               Sort Key: emp.id, emp.first_name, emp.last_name, emp.birth_date
               Sort Method: quicksort  Memory: 30kB
               Buffers: shared hit=318
               ->  Nested Loop  (cost=0.43..1452.62 rows=339 width=31) (actual time=1.524..4.450 rows=69 loops=1)
                     Output: emp.id, emp.first_name, emp.last_name, emp.hire_date, emp.birth_date
                     Buffers: shared hit=309
                     ->  Seq Scan on employees.employee_part_p1985 emp  (cost=0.00..999.32 rows=37 width=31) (actual time=1.500..4.380 rows=4 loops=1)
                           Output: emp.id, emp.first_name, emp.last_name, emp.hire_date, emp.birth_date
                           Filter: ((emp.birth_date >= '1957-01-01'::date) AND (emp.birth_date <= '1970-01-01'::date) AND (emp.hire_date = '1985-01-01'::date) AND (emp.gender = 'M'::employee_gender))
                           Rows Removed by Filter: 35312
                           Buffers: shared hit=293
                     ->  Index Scan using idx_16991_primary on employees.salary sal  (cost=0.43..12.16 rows=9 width=8) (actual time=0.010..0.014 rows=17 loops=4)
                           Output: sal.employee_id, sal.amount, sal.from_date, sal.to_date
                           Index Cond: (sal.employee_id = emp.id)
                           Filter: (sal.amount > 50000)
                           Rows Removed by Filter: 1
                           Buffers: shared hit=16
   ->  Append  (cost=0.29..31.59 rows=16 width=17) (actual time=0.010..0.011 rows=1 loops=4)
         Buffers: shared hit=12
         ->  Index Scan using contract_employees_p1985_pkey on employees.department_employee_part_p1985 dept_1  (cost=0.29..2.33 rows=1 width=17) (actual time=0.006..0.006 rows=1 loops=4)
               Output: dept_1.employee_id, dept_1.department_id, dept_1.hire_date
               Index Cond: (dept_1.employee_id = emp.id)
               Filter: (emp.hire_date = dept_1.hire_date)
               Buffers: shared hit=12
         ->  Index Scan using contract_employees_p1986_pkey on employees.department_employee_part_p1986 dept_2  (cost=0.29..2.36 rows=1 width=17) (never executed)
               Output: dept_2.employee_id, dept_2.department_id, dept_2.hire_date
               Index Cond: (dept_2.employee_id = emp.id)
               Filter: (emp.hire_date = dept_2.hire_date)
         ->  Index Scan using contract_employees_p1987_pkey on employees.department_employee_part_p1987 dept_3  (cost=0.29..2.33 rows=1 width=17) (never executed)
               Output: dept_3.employee_id, dept_3.department_id, dept_3.hire_date
               Index Cond: (dept_3.employee_id = emp.id)
               Filter: (emp.hire_date = dept_3.hire_date)

......... LOTS OF PARTITIONS ........

         ->  Index Scan using contract_employees_p2000_pkey on employees.department_employee_part_p2000 dept_16  (cost=0.14..0.24 rows=1 width=17) (never executed)
               Output: dept_16.employee_id, dept_16.department_id, dept_16.hire_date
               Index Cond: (dept_16.employee_id = emp.id)
               Filter: (emp.hire_date = dept_16.hire_date)
 Settings: effective_cache_size = '3256704kB', effective_io_concurrency = '128', random_page_cost = '1.1', search_path = '"$user", public, employees', temp_buffers = '128MB', work_mem = '256MB'
 Query Identifier: -1847211568235447798
 Planning:
   Buffers: shared hit=1313
 Planning Time: 4.111 ms
 Execution Time: 5.176 ms

So if the partition keys are equal and the tables are partitioned the exact same way, why is the planner not pruning partitions? The answer can be found on the Postgres mailing lists:

The planner can only push quals down into a subquery, it cannot pull
quals from a subquery into the outer query.
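
In other words, a qual written against the subquery's output can be pushed down into the subquery, but a qual that lives only inside the subquery cannot be propagated out to other tables in the outer query. Here is a minimal sketch of the two directions using the same tables (the DISTINCT is what prevents the subquery from being flattened into the outer query, just as in the original CTE):

```sql
-- Push-down works: the planner can move the outer qual into the subquery,
-- so only the 1985 partition of employee_part is scanned.
SELECT *
FROM (
    SELECT DISTINCT id, hire_date FROM employees.employee_part
) s
WHERE s.hire_date = '1985-01-01'::date;

-- Pull-up does not: because of the DISTINCT, the subquery cannot be
-- flattened, so the constant qual stays trapped inside it and the planner
-- cannot infer dept.hire_date = '1985-01-01' for the join. Every partition
-- of department_employee_part must therefore be planned.
SELECT dept.*
FROM (
    SELECT DISTINCT id, hire_date
    FROM employees.employee_part
    WHERE hire_date = '1985-01-01'::date
) s
JOIN employees.department_employee_part dept
  ON dept.hire_date = s.hire_date AND dept.employee_id = s.id;
```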

So I want (and need) the planner to prune partitions; how can I get it? There are two alternatives, both of which require rewriting the query, although one is a little less invasive than the other. If you know that the partition keys are exactly equal and there can only be one value, then you can add an entry to the "WHERE" clause like this (notice the additional predicate in the lower "WHERE" clause):

WITH top_emp AS (
   SELECT 
       DISTINCT
       id,
       first_name,
       last_name,
       hire_date,
       birth_date
   FROM
       employees.employee_part emp
       INNER JOIN employees.salary sal on (sal.employee_id = emp.id)
   WHERE
       emp.hire_date = '1985-01-01'::date
       and emp.gender = 'M'
       and emp.birth_date between '1957-01-01'::date and '1970-01-01'::date
       and sal.amount > 50000
)
SELECT
    dept.employee_id,
    dept.department_id,
    dept.hire_date
FROM
    top_emp emp1,
    employees.department_employee_part dept
WHERE
    emp1.hire_date = dept.hire_date and
    dept.hire_date = '1985-01-01'::date and
    emp1.id = dept.employee_id;

Or, you could rewrite the query entirely, like this, so that the case where multiple partitions match in the other table is also handled:

SELECT
   employee_id,
   department_id,
   hire_date
FROM (
   SELECT 
       dept.employee_id,
       dept.department_id,
       dept.hire_date,
       count(1)
   FROM
       employees.employee_part emp
       INNER JOIN employees.salary sal on (sal.employee_id = emp.id)
       JOIN employees.department_employee_part dept on (emp.hire_date = dept.hire_date and emp.id = dept.employee_id)  
   WHERE
       emp.hire_date = '1985-01-01'::date
       and emp.gender = 'M'
       and emp.birth_date between '1957-01-01'::date and '1970-01-01'::date
       and sal.amount > 50000
   GROUP BY
       dept.employee_id,
       dept.department_id,
       dept.hire_date) a;

In this case, the following plan is used and partition pruning occurs at planning time. Both planning and execution time are significantly lower as well:

 Subquery Scan on a  (cost=1273.80..1273.83 rows=1 width=17) (actual time=2.629..2.641 rows=4 loops=1)
   Output: a.employee_id, a.department_id, a.hire_date
   Buffers: shared hit=331
   ->  GroupAggregate  (cost=1273.80..1273.82 rows=1 width=25) (actual time=2.628..2.639 rows=4 loops=1)
         Output: dept.employee_id, dept.department_id, dept.hire_date, NULL::bigint
         Group Key: dept.employee_id, dept.department_id, dept.hire_date
         Buffers: shared hit=331
         ->  Sort  (cost=1273.80..1273.80 rows=1 width=17) (actual time=2.620..2.624 rows=69 loops=1)
               Output: dept.employee_id, dept.department_id, dept.hire_date
               Sort Key: dept.employee_id, dept.department_id
               Sort Method: quicksort  Memory: 29kB
               Buffers: shared hit=331
               ->  Nested Loop  (cost=0.74..1273.79 rows=1 width=17) (actual time=0.836..2.604 rows=69 loops=1)
                     Output: dept.employee_id, dept.department_id, dept.hire_date
                     Buffers: shared hit=331
                     ->  Nested Loop  (cost=0.30..967.66 rows=43 width=25) (actual time=0.828..2.557 rows=4 loops=1)
                           Output: emp.id, dept.employee_id, dept.department_id, dept.hire_date
                           Buffers: shared hit=315
                           ->  Seq Scan on employees.department_employee_part_p1985 dept  (cost=0.00..776.50 rows=109 width=17) (actual time=0.801..2.515 rows=9 loops=1)
                                 Output: dept.employee_id, dept.department_id, dept.hire_date
                                 Filter: (dept.hire_date = '1985-01-01'::date)
                                 Rows Removed by Filter: 39071
                                 Buffers: shared hit=288
                           ->  Memoize  (cost=0.30..2.03 rows=1 width=12) (actual time=0.004..0.004 max time=0.020 rows=0 loops=9)
                                 Output: emp.id, emp.hire_date
                                 Cache Key: dept.employee_id
                                 Cache Mode: logical
                                 Hits: 0  Misses: 9  Evictions: 0  Overflows: 0  Memory Usage: 1kB
                                 Buffers: shared hit=27
                                 ->  Index Scan using employee_part_p1985_pkey on employees.employee_part_p1985 emp  (cost=0.29..2.02 rows=1 width=12) (actual time=0.002..0.003 max time=0.012 rows=0 loops=9)
                                       Output: emp.id, emp.hire_date
                                       Index Cond: (emp.id = dept.employee_id)
                                       Filter: ((emp.birth_date >= '1957-01-01'::date) AND (emp.birth_date <= '1970-01-01'::date) AND (emp.hire_date = '1985-01-01'::date) AND (emp.gender = 'M'::employees.employee_gender))
                                       Rows Removed by Filter: 1
                                       Buffers: shared hit=27
                     ->  Memoize  (cost=0.44..8.11 rows=9 width=8) (actual time=0.004..0.009 max time=0.014 rows=17 loops=4)
                           Output: sal.employee_id
                           Cache Key: emp.id
                           Cache Mode: logical
                           Hits: 0  Misses: 4  Evictions: 0  Overflows: 0  Memory Usage: 3kB
                           Buffers: shared hit=16
                           ->  Index Scan using idx_16991_primary on employees.salary sal  (cost=0.43..8.10 rows=9 width=8) (actual time=0.003..0.006 max time=0.009 rows=17 loops=4)
                                 Output: sal.employee_id
                                 Index Cond: (sal.employee_id = emp.id)
                                 Filter: (sal.amount > 50000)
                                 Rows Removed by Filter: 1
                                 Buffers: shared hit=16
 Settings: effective_cache_size = '9830040kB', random_page_cost = '1.1', work_mem = '256MB'
 Query Identifier: 2871877186944609726
 Planning:
   Buffers: shared hit=26
 Planning Time: 0.460 ms
 Execution Time: 2.707 ms

The Verdict

So as you can see, it is very important to get the planner to produce the plan you want. While the extra partitions were ultimately never executed, the planner still had to examine each partition's statistics and build a subplan for it, resulting in wasted planning time. Also keep in mind that this test case is a very simple one. The customer issue I worked on involved many more partitions and far more data, so the wasted planning time, and therefore the impact, was much higher.
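
A quick way to tell which kind of pruning you are getting is to look at the shape of the plan itself; this is a sketch against the test tables from above:

```sql
-- Plan-time pruning: partitions the planner eliminated simply do not
-- appear anywhere in the EXPLAIN output.
-- Execution-time pruning: partitions still appear in the plan but are
-- marked "(never executed)" in EXPLAIN ANALYZE output, as seen above.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM employees.department_employee_part
WHERE hire_date = '1985-01-01'::date;
```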

Test Case Setup

To create a test case for this example, I used the employee database located in this GitHub repository:
https://github.com/h8/employees-database

I then created some partitioned versions of the tables from the sample database as shown below:

-- Table: employees.department_employee_part

CREATE TABLE employees.department_employee_part (
	employee_id int8 NOT NULL,
	department_id bpchar(4) NOT NULL,
	from_date date NOT NULL,
	to_date date NOT NULL,
    hire_date date NOT NULL
)  PARTITION BY RANGE (hire_date);

-- Partitions SQL

CREATE TABLE employees.department_employee_part_p1985 PARTITION OF employees.department_employee_part
( 
    CONSTRAINT department_employee_part_p1985_pkey PRIMARY KEY (employee_id, department_id)
)
    FOR VALUES FROM ('1985-01-01') TO ('1986-01-01');
CREATE TABLE employees.department_employee_part_p1986 PARTITION OF employees.department_employee_part
( 
    CONSTRAINT department_employee_part_p1986_pkey PRIMARY KEY (employee_id, department_id)
)
    FOR VALUES FROM ('1986-01-01') TO ('1987-01-01');
CREATE TABLE employees.department_employee_part_p1987 PARTITION OF employees.department_employee_part
( 
    CONSTRAINT department_employee_part_p1987_pkey PRIMARY KEY (employee_id, department_id)
)
    FOR VALUES FROM ('1987-01-01') TO ('1988-01-01');
CREATE TABLE employees.department_employee_part_p1988 PARTITION OF employees.department_employee_part
( 
    CONSTRAINT department_employee_part_p1988_pkey PRIMARY KEY (employee_id, department_id)
)
    FOR VALUES FROM ('1988-01-01') TO ('1989-01-01');
CREATE TABLE employees.department_employee_part_p1989 PARTITION OF employees.department_employee_part
( 
    CONSTRAINT department_employee_part_p1989_pkey PRIMARY KEY (employee_id, department_id)
)
    FOR VALUES FROM ('1989-01-01') TO ('1990-01-01');
CREATE TABLE employees.department_employee_part_p1990 PARTITION OF employees.department_employee_part
( 
    CONSTRAINT department_employee_part_p1990_pkey PRIMARY KEY (employee_id, department_id)
)
    FOR VALUES FROM ('1990-01-01') TO ('1991-01-01');
CREATE TABLE employees.department_employee_part_p1991 PARTITION OF employees.department_employee_part
( 
    CONSTRAINT department_employee_part_p1991_pkey PRIMARY KEY (employee_id, department_id)
)
    FOR VALUES FROM ('1991-01-01') TO ('1992-01-01');
CREATE TABLE employees.department_employee_part_p1992 PARTITION OF employees.department_employee_part
( 
    CONSTRAINT department_employee_part_p1992_pkey PRIMARY KEY (employee_id, department_id)
)
    FOR VALUES FROM ('1992-01-01') TO ('1993-01-01');
CREATE TABLE employees.department_employee_part_p1993 PARTITION OF employees.department_employee_part
( 
    CONSTRAINT department_employee_part_p1993_pkey PRIMARY KEY (employee_id, department_id)
)
    FOR VALUES FROM ('1993-01-01') TO ('1994-01-01');
CREATE TABLE employees.department_employee_part_p1994 PARTITION OF employees.department_employee_part
( 
    CONSTRAINT department_employee_part_p1994_pkey PRIMARY KEY (employee_id, department_id)
)
    FOR VALUES FROM ('1994-01-01') TO ('1995-01-01');
CREATE TABLE employees.department_employee_part_p1995 PARTITION OF employees.department_employee_part
( 
    CONSTRAINT department_employee_part_p1995_pkey PRIMARY KEY (employee_id, department_id)
)
    FOR VALUES FROM ('1995-01-01') TO ('1996-01-01');
CREATE TABLE employees.department_employee_part_p1996 PARTITION OF employees.department_employee_part
( 
    CONSTRAINT department_employee_part_p1996_pkey PRIMARY KEY (employee_id, department_id)
)
    FOR VALUES FROM ('1996-01-01') TO ('1997-01-01');
CREATE TABLE employees.department_employee_part_p1997 PARTITION OF employees.department_employee_part
( 
    CONSTRAINT department_employee_part_p1997_pkey PRIMARY KEY (employee_id, department_id)
)
    FOR VALUES FROM ('1997-01-01') TO ('1998-01-01');
CREATE TABLE employees.department_employee_part_p1998 PARTITION OF employees.department_employee_part
( 
    CONSTRAINT department_employee_part_p1998_pkey PRIMARY KEY (employee_id, department_id)
)
    FOR VALUES FROM ('1998-01-01') TO ('1999-01-01');
CREATE TABLE employees.department_employee_part_p1999 PARTITION OF employees.department_employee_part
( 
    CONSTRAINT department_employee_part_p1999_pkey PRIMARY KEY (employee_id, department_id)
)
    FOR VALUES FROM ('1999-01-01') TO ('2000-01-01');
CREATE TABLE employees.department_employee_part_p2000 PARTITION OF employees.department_employee_part
( 
    CONSTRAINT department_employee_part_p2000_pkey PRIMARY KEY (employee_id, department_id)
)
    FOR VALUES FROM ('2000-01-01') TO ('2001-01-01');

CREATE INDEX ON employees.department_employee_part USING btree (department_id);

insert into employees.department_employee_part
select de.*,emp.hire_date 
from employees.employee emp join employees.department_employee de on (emp.id = de.employee_id);

vacuum (verbose, analyze) employees.department_employee_part;

CREATE TABLE employees.employee_part (
	id int8 DEFAULT nextval('employees.id_employee_seq'::regclass) NOT NULL,
	birth_date date NOT NULL,
	first_name varchar(14) NOT NULL,
	last_name varchar(16) NOT NULL,
	gender employees.employee_gender NOT NULL,
	hire_date date NOT NULL
) PARTITION BY RANGE (hire_date);

CREATE TABLE employees.employee_part_p1985 PARTITION OF employees.employee_part
( 
    CONSTRAINT employee_part_p1985_pkey PRIMARY KEY (id)
)
    FOR VALUES FROM ('1985-01-01') TO ('1986-01-01');
CREATE TABLE employees.employee_part_p1986 PARTITION OF employees.employee_part
( 
    CONSTRAINT employee_part_p1986_pkey PRIMARY KEY (id)
)
    FOR VALUES FROM ('1986-01-01') TO ('1987-01-01');
CREATE TABLE employees.employee_part_p1987 PARTITION OF employees.employee_part
( 
    CONSTRAINT employee_part_p1987_pkey PRIMARY KEY (id)
)
    FOR VALUES FROM ('1987-01-01') TO ('1988-01-01');
CREATE TABLE employees.employee_part_p1988 PARTITION OF employees.employee_part
( 
    CONSTRAINT employee_part_p1988_pkey PRIMARY KEY (id)
)
    FOR VALUES FROM ('1988-01-01') TO ('1989-01-01');
CREATE TABLE employees.employee_part_p1989 PARTITION OF employees.employee_part
( 
    CONSTRAINT employee_part_p1989_pkey PRIMARY KEY (id)
)
    FOR VALUES FROM ('1989-01-01') TO ('1990-01-01');
CREATE TABLE employees.employee_part_p1990 PARTITION OF employees.employee_part
( 
    CONSTRAINT employee_part_p1990_pkey PRIMARY KEY (id)
)
    FOR VALUES FROM ('1990-01-01') TO ('1991-01-01');
CREATE TABLE employees.employee_part_p1991 PARTITION OF employees.employee_part
( 
    CONSTRAINT employee_part_p1991_pkey PRIMARY KEY (id)
)
    FOR VALUES FROM ('1991-01-01') TO ('1992-01-01');
CREATE TABLE employees.employee_part_p1992 PARTITION OF employees.employee_part
( 
    CONSTRAINT employee_part_p1992_pkey PRIMARY KEY (id)
)
    FOR VALUES FROM ('1992-01-01') TO ('1993-01-01');
CREATE TABLE employees.employee_part_p1993 PARTITION OF employees.employee_part
( 
    CONSTRAINT employee_part_p1993_pkey PRIMARY KEY (id)
)
    FOR VALUES FROM ('1993-01-01') TO ('1994-01-01');
CREATE TABLE employees.employee_part_p1994 PARTITION OF employees.employee_part
( 
    CONSTRAINT employee_part_p1994_pkey PRIMARY KEY (id)
)
    FOR VALUES FROM ('1994-01-01') TO ('1995-01-01');
CREATE TABLE employees.employee_part_p1995 PARTITION OF employees.employee_part
( 
    CONSTRAINT employee_part_p1995_pkey PRIMARY KEY (id)
)
    FOR VALUES FROM ('1995-01-01') TO ('1996-01-01');
CREATE TABLE employees.employee_part_p1996 PARTITION OF employees.employee_part
( 
    CONSTRAINT employee_part_p1996_pkey PRIMARY KEY (id)
)
    FOR VALUES FROM ('1996-01-01') TO ('1997-01-01');
CREATE TABLE employees.employee_part_p1997 PARTITION OF employees.employee_part
( 
    CONSTRAINT employee_part_p1997_pkey PRIMARY KEY (id)
)
    FOR VALUES FROM ('1997-01-01') TO ('1998-01-01');
CREATE TABLE employees.employee_part_p1998 PARTITION OF employees.employee_part
( 
    CONSTRAINT employee_part_p1998_pkey PRIMARY KEY (id)
)
    FOR VALUES FROM ('1998-01-01') TO ('1999-01-01');
CREATE TABLE employees.employee_part_p1999 PARTITION OF employees.employee_part
( 
    CONSTRAINT employee_part_p1999_pkey PRIMARY KEY (id)
)
    FOR VALUES FROM ('1999-01-01') TO ('2000-01-01');
CREATE TABLE employees.employee_part_p2000 PARTITION OF employees.employee_part
( 
    CONSTRAINT employee_part_p2000_pkey PRIMARY KEY (id)
)
    FOR VALUES FROM ('2000-01-01') TO ('2001-01-01');
    
insert into  employees.employee_part select * from  employees.employee;

vacuum (verbose, analyze) employees.employee_part;


Some of My Favorite Things – Postgres Queries

In the spirit of the holiday season, I thought I would write a quick post covering some of my favorite queries that I use on a day-to-day basis working on Postgres. Some of these queries I developed myself; others were found on the internet (hat tip to those who posted them) and further refined.

Many more can be found on my GitHub site:

https://github.com/shane-borden/sqlScripts/tree/master/postgres

I hope these queries also help you in your day-to-day quest to make Postgres run better!

The first three queries rank top SQL from pg_stat_statements by execution count, mean_exec_time, or total_exec_time. I love to use these queries to get a quick glance at what I should focus on tuning. Given that pg_stat_statements tracks lots of things, you can filter out certain query text as necessary so that it does not influence the ranking.

Top SQL by Mean Exec Time

WITH
hist AS (
SELECT queryid::text,
       SUBSTRING(query from 1 for 1000) query,
       ROW_NUMBER () OVER (ORDER BY mean_exec_time::numeric DESC) rn,
       SUM(mean_exec_time::numeric) mean_exec_time
  FROM pg_stat_statements
 WHERE queryid IS NOT NULL
 		AND query::text not like '%pg\_%' 
 		AND query::text not like '%g\_%'
        /* Add more filters here */
 GROUP BY
       queryid,
       SUBSTRING(query from 1 for 1000),
       mean_exec_time::numeric
),
total AS (
SELECT SUM(mean_exec_time::numeric) mean_exec_time FROM hist
)
SELECT DISTINCT
       h.queryid::text,
       ROUND(h.mean_exec_time::numeric,3) mean_exec_time,
       ROUND(100 * h.mean_exec_time / t.mean_exec_time, 1) percent,
       h.query
  FROM hist h,
       total t
 WHERE h.mean_exec_time >= t.mean_exec_time / 1000 AND rn <= 14
 UNION ALL
SELECT 'Others',
       ROUND(COALESCE(SUM(h.mean_exec_time), 0), 3) mean_exec_time,
       COALESCE(ROUND(100 * SUM(h.mean_exec_time) / AVG(t.mean_exec_time), 1), 0) percent,
       NULL sql_text
  FROM hist h,
       total t
 WHERE h.mean_exec_time < t.mean_exec_time / 1000 OR rn > 14
 ORDER BY 3 DESC NULLS LAST;

Top SQL by Total Exec Time

WITH
hist AS (
SELECT queryid::text,
       SUBSTRING(query from 1 for 100) query,
       ROW_NUMBER () OVER (ORDER BY total_exec_time::numeric DESC) rn,
       SUM(total_exec_time::numeric) total_exec_time
  FROM pg_stat_statements
 WHERE queryid IS NOT NULL
 		AND query::text not like '%pg\_%' 
 		AND query::text not like '%g\_%'
 		/* Add more filters here */
 GROUP BY
       queryid,
       SUBSTRING(query from 1 for 100),
       total_exec_time::numeric
),
total AS (
SELECT SUM(total_exec_time::numeric) total_exec_time FROM hist
)
SELECT DISTINCT
       h.queryid::text,
       ROUND(h.total_exec_time::numeric,3) total_exec_time,
       ROUND(100 * h.total_exec_time / t.total_exec_time, 1) percent,
       h.query
  FROM hist h,
       total t
 WHERE h.total_exec_time >= t.total_exec_time / 1000 AND rn <= 14
 UNION ALL
SELECT 'Others',
       ROUND(COALESCE(SUM(h.total_exec_time::numeric), 0), 3) total_exec_time,
       COALESCE(ROUND(100 * SUM(h.total_exec_time) / AVG(t.total_exec_time), 1), 0) percent,
       NULL sql_text
  FROM hist h,
       total t
 WHERE h.total_exec_time < t.total_exec_time / 1000 OR rn > 14
 ORDER BY 3 DESC NULLS LAST;

Top SQL by Execution Count

WITH
hist AS (
SELECT queryid::text,
       SUBSTRING(query from 1 for 100) query,
       ROW_NUMBER () OVER (ORDER BY calls DESC) rn,
       calls
  FROM pg_stat_statements 
 WHERE queryid IS NOT NULL 
 		AND query::text not like '%pg\_%' 
 		AND query::text not like '%g\_%'
 		/* Add more filters here */
 GROUP BY
       queryid,
       SUBSTRING(query from 1 for 100),
       calls
),
total AS (
SELECT SUM(calls) calls FROM hist
)
SELECT DISTINCT
       h.queryid::text,
       h.calls,
       ROUND(100 * h.calls / t.calls, 1) percent,
       h.query
  FROM hist h,
       total t
 WHERE h.calls >= t.calls / 1000 AND rn <= 14
 UNION ALL
SELECT 'Others',
       COALESCE(SUM(h.calls), 0) calls,
       COALESCE(ROUND(100 * SUM(h.calls) / AVG(t.calls), 1), 0) percent,
       NULL sql_text
  FROM hist h,
       total t
 WHERE h.calls < t.calls / 1000 OR rn > 14
 ORDER BY 2 DESC NULLS LAST;

Object Sizes Including Toast

Shows the total size of each table, including a breakdown of its index and TOAST sizes.

SELECT
  *,
  pg_size_pretty(table_bytes) AS table,
  pg_size_pretty(toast_bytes) AS toast,
  pg_size_pretty(index_bytes) AS index,
  pg_size_pretty(total_bytes) AS total
FROM (
  SELECT
    *, total_bytes - index_bytes - COALESCE(toast_bytes, 0) AS table_bytes
  FROM (
    SELECT
      c.oid,
      nspname AS table_schema,
      relname AS table_name,
      c.reltuples AS row_estimate,
      pg_total_relation_size(c.oid) AS total_bytes,
      pg_indexes_size(c.oid) AS index_bytes,
      pg_total_relation_size(reltoastrelid) AS toast_bytes
    FROM
      pg_class c
      LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE relkind = 'r'
  ) a
) a
WHERE table_schema like '%'
AND table_name like '%'
AND total_bytes > 0
ORDER BY total_bytes DESC;

SQL Statements Using CPU

Using pg_stat_statements, this query ranks statements by their combined planning and execution time, treating that total as CPU time.

SELECT
    pss.userid,
    pss.dbid,
    pd.datname AS db_name,
    pss.queryid,
    round((pss.total_exec_time + pss.total_plan_time)::numeric, 2) AS total_time,
    pss.calls,
    round((pss.mean_exec_time + pss.mean_plan_time)::numeric, 2) AS mean,
    round((100 * (pss.total_exec_time + pss.total_plan_time) / sum((pss.total_exec_time + pss.total_plan_time)::numeric) OVER ())::numeric, 2) AS cpu_portion_pctg,
    substr(pss.query, 1, 200) short_query
FROM
    pg_stat_statements pss,
    pg_database pd
WHERE
    pd.oid = pss.dbid
    AND query::text NOT LIKE '%FOR UPDATE%'
    /* Add more filters here */
ORDER BY
    (pss.total_exec_time + pss.total_plan_time) DESC
LIMIT 30;

Stats / Vacuum Projection Script

This script looks at the database- and table-level options set for vacuum and analyze to report when vacuum / analyze is projected to run and when each last ran. It will give you a good idea of how well vacuum and analyze are keeping up:

WITH tbl_reloptions AS (
SELECT
	oid,
    oid::regclass table_name,
    substr(unnest(reloptions), 1,  strpos(unnest(reloptions), '=') -1) option,
    substr(unnest(reloptions), 1 + strpos(unnest(reloptions), '=')) value
FROM
    pg_class c
WHERE reloptions is NOT null)
SELECT
    s.schemaname ||'.'|| s.relname as relname,
    n_live_tup live_tup,
    n_dead_tup dead_tup,
    n_tup_hot_upd hot_upd,
    n_mod_since_analyze mod_since_stats,
    n_ins_since_vacuum ins_since_vac,
    case 
	  when avacinsscalefactor.value is not null and avacinsthresh.value is not null
        then ROUND(((n_live_tup * avacinsscalefactor.value::numeric) + avacinsthresh.value::numeric),0)
      when avacinsscalefactor.value is null and avacinsthresh.value is not null
      	then ROUND(((n_live_tup * (select setting::numeric from pg_settings where name = 'autovacuum_vacuum_insert_scale_factor')) + avacinsthresh.value::numeric),0)
      when avacinsscalefactor.value is not null and avacinsthresh.value is null
      	then ROUND(((n_live_tup * avacinsscalefactor.value::numeric) + (select setting::numeric from pg_settings where name = 'autovacuum_vacuum_insert_threshold')),0)
      else ROUND(((n_live_tup * (select setting::numeric from pg_settings where name = 'autovacuum_vacuum_insert_scale_factor')) + (select setting::numeric from pg_settings where name = 'autovacuum_vacuum_insert_threshold')),0) 
    end as ins_for_vac,
    case 
	  when avacscalefactor.value is not null and avacthresh.value is not null
        then ROUND(((n_live_tup * avacscalefactor.value::numeric) + avacthresh.value::numeric),0)
      when avacscalefactor.value is null and avacthresh.value is not null
      	then ROUND(((n_live_tup * (select setting::numeric from pg_settings where name = 'autovacuum_vacuum_scale_factor')) + avacthresh.value::numeric),0)
      when avacscalefactor.value is not null and avacthresh.value is null
      	then ROUND(((n_live_tup * avacscalefactor.value::numeric) + (select setting::numeric from pg_settings where name = 'autovacuum_vacuum_threshold')),0)
      else ROUND(((n_live_tup * (select setting::numeric from pg_settings where name = 'autovacuum_vacuum_scale_factor')) + (select setting::numeric from pg_settings where name = 'autovacuum_vacuum_threshold')),0) 
    end as mods_for_vac,
    case 
	  when avacanalyzescalefactor.value is not null and avacanalyzethresh.value is not null
        then ROUND(((n_live_tup * avacanalyzescalefactor.value::numeric) + avacanalyzethresh.value::numeric),0)
      when avacanalyzescalefactor.value is null and avacanalyzethresh.value is not null
      	then ROUND(((n_live_tup * (select setting::numeric from pg_settings where name = 'autovacuum_analyze_scale_factor')) + avacanalyzethresh.value::numeric),0)
      when avacanalyzescalefactor.value is not null and avacanalyzethresh.value is null
      	then ROUND(((n_live_tup * avacanalyzescalefactor.value::numeric) + (select setting::numeric from pg_settings where name = 'autovacuum_analyze_threshold')),0)
      else ROUND(((n_live_tup * (select setting::numeric from pg_settings where name = 'autovacuum_analyze_scale_factor')) + (select setting::numeric from pg_settings where name = 'autovacuum_analyze_threshold')),0) 
    end as mods_for_stats,
    case 
      when avacfreezeage is not null
        then ROUND((greatest(age(c.relfrozenxid),age(t.relfrozenxid))::numeric / avacfreezeage.value::numeric * 100),2) 
      else ROUND((greatest(age(c.relfrozenxid),age(t.relfrozenxid))::numeric / (select setting::numeric from pg_settings where name = 'autovacuum_freeze_max_age') * 100),2) 
      end as avac_pct_frz,
    greatest(age(c.relfrozenxid),age(t.relfrozenxid)) max_txid_age,
    to_char(last_vacuum, 'YYYY-MM-DD HH24:MI') last_vac,
    to_char(last_analyze, 'YYYY-MM-DD HH24:MI') last_stats,
    to_char(last_autovacuum, 'YYYY-MM-DD HH24:MI') last_avac,
    to_char(last_autoanalyze, 'YYYY-MM-DD HH24:MI') last_astats,
    vacuum_count vac_cnt,
    analyze_count stats_cnt,
    autovacuum_count avac_cnt,
    autoanalyze_count astats_cnt,
    c.reloptions,
    case
      when avacenabled.value is not null
        then avacenabled.value::text
      when (select setting::text from pg_settings where name = 'autovacuum') = 'on'
        then 'true'
      else 'false'
    end as autovac_enabled
FROM
    pg_stat_all_tables s
JOIN pg_class c ON (s.relid = c.oid)
LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
LEFT JOIN tbl_reloptions avacinsscalefactor on (s.relid = avacinsscalefactor.oid and avacinsscalefactor.option = 'autovacuum_vacuum_insert_scale_factor')
LEFT JOIN tbl_reloptions avacinsthresh on (s.relid = avacinsthresh.oid and avacinsthresh.option = 'autovacuum_vacuum_insert_threshold')
LEFT JOIN tbl_reloptions avacscalefactor on (s.relid = avacscalefactor.oid and avacscalefactor.option = 'autovacuum_vacuum_scale_factor')
LEFT JOIN tbl_reloptions avacthresh on (s.relid = avacthresh.oid and avacthresh.option = 'autovacuum_vacuum_threshold')
LEFT JOIN tbl_reloptions avacanalyzescalefactor on (s.relid = avacanalyzescalefactor.oid and avacanalyzescalefactor.option = 'autovacuum_analyze_scale_factor')
LEFT JOIN tbl_reloptions avacanalyzethresh on (s.relid = avacanalyzethresh.oid and avacanalyzethresh.option = 'autovacuum_analyze_threshold')
LEFT JOIN tbl_reloptions avacfreezeage on (s.relid = avacfreezeage.oid and avacfreezeage.option = 'autovacuum_freeze_max_age')
LEFT JOIN tbl_reloptions avacenabled on (s.relid = avacenabled.oid and avacenabled.option = 'autovacuum_enabled')
WHERE
    s.relname IN (
        SELECT
            t.table_name
        FROM
            information_schema.tables t
            JOIN pg_catalog.pg_class c ON (t.table_name = c.relname)
            LEFT JOIN pg_catalog.pg_user u ON (c.relowner = u.usesysid)
        WHERE
            t.table_schema like '%'
            AND (u.usename like '%' OR u.usename is null)
            AND t.table_name like '%'
            AND t.table_schema not in ('information_schema','pg_catalog')
            AND t.table_type not in ('VIEW')
            AND t.table_catalog = current_database())
    AND n_dead_tup >= 0
    AND n_live_tup > 0
ORDER BY 3;

Unused / Rarely Used Indexes

To keep a well-run system, it’s important to maintain as few indexes as possible. The query below shows which indexes have not been used recently. The original version was obtained from https://www.databasesoup.com/2014/05/new-finding-unused-indexes-query.html (Josh Berkus)

WITH table_scans AS (
    SELECT
        relid,
        tables.idx_scan + tables.seq_scan AS all_scans,
        (tables.n_tup_ins + tables.n_tup_upd + tables.n_tup_del) AS writes,
        pg_relation_size(relid) AS table_size
    FROM
        pg_stat_all_tables AS tables
    WHERE
        schemaname NOT IN ('pg_toast', 'pg_catalog', 'partman')
),
all_writes AS (
    SELECT
        sum(writes) AS total_writes
    FROM
        table_scans
),
indexes AS (
    SELECT
        idx_stat.relid,
        idx_stat.indexrelid,
        idx_stat.schemaname,
        idx_stat.relname AS tablename,
        idx_stat.indexrelname AS indexname,
        idx_stat.idx_scan,
        pg_relation_size(idx_stat.indexrelid) AS index_bytes,
        indexdef ~* 'USING btree' AS idx_is_btree
    FROM
        pg_stat_user_indexes AS idx_stat
        JOIN pg_index USING (indexrelid)
        JOIN pg_indexes AS indexes ON idx_stat.schemaname = indexes.schemaname
            AND idx_stat.relname = indexes.tablename
            AND idx_stat.indexrelname = indexes.indexname
    WHERE
        pg_index.indisunique = FALSE
),
index_ratios AS (
    SELECT
        schemaname,
        tablename,
        indexname,
        idx_scan,
        all_scans,
        round((
            CASE WHEN all_scans = 0 THEN
                0.0::numeric
            ELSE
                idx_scan::numeric / all_scans * 100
            END), 2) AS index_scan_pct,
        writes,
        round((
            CASE WHEN writes = 0 THEN
                idx_scan::numeric
            ELSE
                idx_scan::numeric / writes
            END), 2) AS scans_per_write,
        pg_size_pretty(index_bytes) AS index_size,
        pg_size_pretty(table_size) AS table_size,
        idx_is_btree,
        index_bytes
    FROM
        indexes
        JOIN table_scans USING (relid)
),
index_groups AS (
    SELECT
        'Never Used Indexes' AS reason,
        *,
        1 AS grp
    FROM
        index_ratios
    WHERE
        idx_scan = 0
        AND idx_is_btree
    UNION ALL
    SELECT
        'Low Scans, High Writes' AS reason,
        *,
        2 AS grp
    FROM
        index_ratios
    WHERE
        scans_per_write <= 1
        AND index_scan_pct < 10
        AND idx_scan > 0
        AND writes > 100
        AND idx_is_btree
    UNION ALL
    SELECT
        'Seldom Used Large Indexes' AS reason,
        *,
        3 AS grp
    FROM
        index_ratios
    WHERE
        index_scan_pct < 5
        AND scans_per_write > 1
        AND idx_scan > 0
        AND idx_is_btree
        AND index_bytes > 100000000
    UNION ALL
    SELECT
        'High-Write Large Non-Btree' AS reason,
        index_ratios.*,
        4 AS grp
    FROM
        index_ratios,
        all_writes
    WHERE (writes::numeric / (total_writes + 1)) > 0.02
    AND NOT idx_is_btree
    AND index_bytes > 100000000
ORDER BY
    grp,
    index_bytes DESC
)
SELECT
    reason,
    schemaname,
    tablename,
    indexname,
    index_scan_pct,
    scans_per_write,
    index_size,
    table_size
FROM
    index_groups
WHERE
    tablename LIKE '%';

Rank Wait Events

This is a great query to rank, from most frequent to least frequent, the wait events currently observed on the system in pg_stat_activity. It does not provide a historical reference, just a snapshot of the current moment in time:

WITH waits AS (
    SELECT
        wait_event,
        count(*) AS cnt,
        rank() OVER (ORDER BY count(*) DESC) AS rn
    FROM
        pg_stat_activity
    WHERE
        wait_event IS NOT NULL
    GROUP BY
        wait_event
),
total AS (
    SELECT
        SUM(cnt) AS total_waits
    FROM
        waits
)
SELECT
    h.wait_event,
    h.cnt,
    ROUND(100 * h.cnt / t.total_waits, 1) AS percent
FROM
    waits h,
    total t
WHERE
    h.rn <= 14
UNION ALL
SELECT
    'Others',
    COALESCE(SUM(h.cnt), 0) AS cnt,
    COALESCE(ROUND(100 * SUM(h.cnt) / AVG(t.total_waits), 1), 0) AS percent
FROM
    waits h,
    total t
WHERE
    h.rn > 14
ORDER BY
    2 DESC NULLS LAST;
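Since the query above only captures a single instant, one simple way to approximate a history (a psql-specific suggestion of mine, not part of the original query) is to sample it on an interval with psql’s \watch:

```sql
-- psql only: run a wait-event snapshot, then re-execute it every
-- 5 seconds to watch the instantaneous wait profile change over time:
SELECT wait_event, count(*) AS cnt
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
GROUP BY wait_event
ORDER BY cnt DESC;
\watch 5
```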

Tables With Missing FK Indexes

Often, queries will experience poor run times when foreign keys do not have indexes supporting them. This is a good query to show those missing indexes:

WITH y AS (
    SELECT
        pg_catalog.format('%I.%I', n1.nspname, c1.relname) AS referencing_tbl,
        pg_catalog.quote_ident(a1.attname) AS referencing_column,
        t.conname AS existing_fk_on_referencing_tbl,
        pg_catalog.format('%I.%I', n2.nspname, c2.relname) AS referenced_tbl,
        pg_catalog.quote_ident(a2.attname) AS referenced_column,
        pg_relation_size(pg_catalog.format('%I.%I', n1.nspname, c1.relname)) AS referencing_tbl_bytes,
        pg_relation_size(pg_catalog.format('%I.%I', n2.nspname, c2.relname)) AS referenced_tbl_bytes,
        pg_catalog.format($$CREATE INDEX ON %I.%I(%I);$$, n1.nspname, c1.relname, a1.attname) AS suggestion
    FROM
        pg_catalog.pg_constraint t
        JOIN pg_catalog.pg_attribute a1 ON a1.attrelid = t.conrelid
            AND a1.attnum = t.conkey[1]
        JOIN pg_catalog.pg_class c1 ON c1.oid = t.conrelid
        JOIN pg_catalog.pg_namespace n1 ON n1.oid = c1.relnamespace
        JOIN pg_catalog.pg_class c2 ON c2.oid = t.confrelid
        JOIN pg_catalog.pg_namespace n2 ON n2.oid = c2.relnamespace
        JOIN pg_catalog.pg_attribute a2 ON a2.attrelid = t.confrelid
            AND a2.attnum = t.confkey[1]
    WHERE
        t.contype = 'f'
        AND NOT EXISTS (
            SELECT
                1
            FROM
                pg_catalog.pg_index i
            WHERE
                i.indrelid = t.conrelid
                AND i.indkey[0] = t.conkey[1]))
SELECT
    referencing_tbl,
    referencing_column,
    existing_fk_on_referencing_tbl,
    referenced_tbl,
    referenced_column,
    pg_size_pretty(referencing_tbl_bytes) AS referencing_tbl_size,
    pg_size_pretty(referenced_tbl_bytes) AS referenced_tbl_size,
    suggestion
FROM
    y
ORDER BY
    referencing_tbl_bytes DESC,
    referenced_tbl_bytes DESC,
    referencing_tbl,
    referenced_tbl,
    referencing_column,
    referenced_column;

Blocking Lock Tree

postgres.ai is a great place to get some observability queries and this is one of my favorites:

with recursive activity as (
  select
    pg_blocking_pids(pid) blocked_by,
    *,
    age(clock_timestamp(), xact_start)::interval(0) as tx_age,
    -- "pg_locks.waitstart" – PG14+ only; for older versions:  age(clock_timestamp(), state_change) as wait_age,
    age(clock_timestamp(), (select max(l.waitstart) from pg_locks l where a.pid = l.pid))::interval(0) as wait_age
  from pg_stat_activity a
  where state is distinct from 'idle'
), blockers as (
  select
    array_agg(distinct c order by c) as pids
  from (
    select unnest(blocked_by)
    from activity
  ) as dt(c)
), tree as (
  select
    activity.*,
    1 as level,
    activity.pid as top_blocker_pid,
    array[activity.pid] as path,
    array[activity.pid]::int[] as all_blockers_above
  from activity, blockers
  where
    array[pid] <@ blockers.pids
    and blocked_by = '{}'::int[]
  union all
  select
    activity.*,
    tree.level + 1 as level,
    tree.top_blocker_pid,
    path || array[activity.pid] as path,
    tree.all_blockers_above || array_agg(activity.pid) over () as all_blockers_above
  from activity, tree
  where
    not array[activity.pid] <@ tree.all_blockers_above
    and activity.blocked_by <> '{}'::int[]
    and activity.blocked_by <@ tree.all_blockers_above
)
select
  pid,
  blocked_by,
  case when wait_event_type <> 'Lock' then replace(state, 'idle in transaction', 'idletx') else 'waiting' end as state,
  wait_event_type || ':' || wait_event as wait,
  wait_age,
  tx_age,
  to_char(age(backend_xid), 'FM999,999,999,990') as xid_age,
  to_char(2147483647 - age(backend_xmin), 'FM999,999,999,990') as xmin_ttf,
  datname,
  usename,
  (select count(distinct t1.pid) from tree t1 where array[tree.pid] <@ t1.path and t1.pid <> tree.pid) as blkd,
  format(
    '%s %s%s',
    lpad('[' || pid::text || ']', 9, ' '),
    repeat('.', level - 1) || case when level > 1 then ' ' end,
    left(query, 1000)
  ) as query
from tree
order by top_blocker_pid, level, pid;

Follow-Up: Reduce Vacuum by Using “ON CONFLICT” Directive

I previously blogged about ensuring that the “ON CONFLICT” directive is used in order to keep vacuum from having to do additional work. You can read the original blog here: Reduce Vacuum by Using “ON CONFLICT” Directive

Now that the “MERGE” functionality has been incorporated into Postgres 15 and above, I wanted to ensure that there was no “strange” behavior as it relates to vacuum when using MERGE. As you can see here, “MERGE” performs exactly as expected. For example, when the source row matches an existing row and the UPDATE branch runs instead of the INSERT, exactly one row is marked dead (the old version of the updated row).

/* Create the table: */
CREATE TABLE public.pk_violation_test (
        id int PRIMARY KEY, 
        value numeric,
        product_id int,
        effective_date timestamp(3)
        );
 
 
/* Insert some mocked up data */
INSERT INTO public.pk_violation_test VALUES ( 
        generate_series(0,10000), 
        random()*1000,
        random()*100,
        current_timestamp(3));
 
/* Verify that there are no dead tuples: */
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup
FROM
    pg_stat_all_tables
WHERE
    relname = 'pk_violation_test';
 
 schemaname |      relname      | n_live_tup | n_dead_tup
------------+-------------------+------------+------------
 public     | pk_violation_test |      10001 |          0

Then, create a simple merge and check the results:

WITH insert_query AS (
    SELECT
        0 AS id,
        44.33893489873 AS value,
        46 AS product_id,
        now() AS effective_date) MERGE INTO pk_violation_test pkt
    USING insert_query i ON pkt.id = i.id
    WHEN MATCHED THEN
        UPDATE SET
            value = i.value, product_id = i.product_id, effective_date = i.effective_date
    WHEN NOT MATCHED THEN
        INSERT (id, value, product_id, effective_date)
            VALUES (i.id, i.value, i.product_id, i.effective_date);
MERGE 1

And then check the dead tuple count:

SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup
FROM
    pg_stat_all_tables
WHERE
    relname = 'pk_violation_test';
 schemaname |      relname      | n_live_tup | n_dead_tup
------------+-------------------+------------+------------
 public     | pk_violation_test |      10001 |          1
(1 row)

As expected, only one row is marked dead. MERGE is such great functionality and I am glad to see it in Postgres. As you get time, consider converting your “ON CONFLICT” statements to use it. Enjoy!
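To make the conversion concrete, here is a minimal sketch (my own statements reusing the table from this post, not from the original article) of an “ON CONFLICT” upsert and its MERGE equivalent on PostgreSQL 15+:

```sql
/* Existing upsert style using "ON CONFLICT": */
INSERT INTO pk_violation_test (id, value, product_id, effective_date)
    VALUES (0, 44.33893489873, 46, now())
ON CONFLICT (id)
    DO UPDATE SET
        value = EXCLUDED.value,
        product_id = EXCLUDED.product_id,
        effective_date = EXCLUDED.effective_date;

/* Equivalent MERGE (PostgreSQL 15+): */
MERGE INTO pk_violation_test pkt
USING (
    VALUES (0, 44.33893489873, 46, now())
) AS i (id, value, product_id, effective_date)
    ON pkt.id = i.id
WHEN MATCHED THEN
    UPDATE SET
        value = i.value,
        product_id = i.product_id,
        effective_date = i.effective_date
WHEN NOT MATCHED THEN
    INSERT (id, value, product_id, effective_date)
        VALUES (i.id, i.value, i.product_id, i.effective_date);
```

One caveat worth testing under your own workload: unlike “ON CONFLICT”, MERGE does not use the speculative-insertion path, so heavily concurrent inserts of the same key can still raise a unique violation.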

Reduce Vacuum by Using “ON CONFLICT” Directive

I’m always working with customers migrating from Oracle to PostgreSQL. In both systems, there is I/O impact when using exception handlers, such as when handling a PK constraint violation, but the impact in PostgreSQL is different and you should be aware of what is actually going on. For example, when an exception occurs, redo is generated in Oracle (WAL in Postgres), and additional catalog queries are issued in both systems to get pertinent data about the constraint violation. But what actually happens in Postgres as it relates to MVCC?

Let’s use a simple test to demonstrate. First, create a table with some mocked up “product” data:

/* Create the table: */
CREATE TABLE public.pk_violation_test (
		id int PRIMARY KEY, 
		value numeric,
		product_id int,
		effective_date timestamp(3)
		);


/* Insert some mocked up data */
INSERT INTO public.pk_violation_test VALUES ( 
		generate_series(0,10000), 
		random()*1000,
		random()*100,
		current_timestamp(3));

/* Verify that there are no dead tuples: */
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup
FROM
    pg_stat_all_tables
WHERE
    relname = 'pk_violation_test';

 schemaname |      relname      | n_live_tup | n_dead_tup
------------+-------------------+------------+------------
 public     | pk_violation_test |      10001 |          0

Now, we will execute a simple insert statement using no directive:

/* Perform a simple insert: */
INSERT INTO pk_violation_test
    VALUES (0, 44.33893489873, 46, now());

ERROR:  duplicate key value violates unique constraint "pk_violation_test_pkey"
DETAIL:  Key (id)=(0) already exists.
Time: 1.292 ms

/* Verify the dead tuple count: */
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup
FROM
    pg_stat_all_tables
WHERE
    relname = 'pk_violation_test';

 schemaname |      relname      | n_live_tup | n_dead_tup
------------+-------------------+------------+------------
 public     | pk_violation_test |      10001 |          1

As you can see, the insert failed with a PK violation error, and the tuple that was in violation is now a dead tuple.

Now change the insert to use the “ON CONFLICT” directive and check the dead tuple count:

/* Perform a simple insert using the directive: */
INSERT INTO pk_violation_test
    VALUES (0, 44.33893489873, 46, now())
ON CONFLICT
    DO NOTHING;

INSERT 0 0
Time: 0.889 ms

/* Verify the unchanged dead tuple count: */
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup
FROM
    pg_stat_all_tables
WHERE
    relname = 'pk_violation_test';

 schemaname |      relname      | n_live_tup | n_dead_tup
------------+-------------------+------------+------------
 public     | pk_violation_test |      10001 |          1

As you can see the dead tuple count did not increase, thus reducing the amount of vacuum needed!

Now, most code conversion tools will try to do something with that exception block. Maybe it gets converted into something like this:

/* Create a simple function with exception logic: */
CREATE OR REPLACE FUNCTION pk_violation_test_func (p_id int, p_value numeric, p_product_id int)
    RETURNS VOID
    AS $$
BEGIN
    BEGIN
        INSERT INTO pk_violation_test (id, value, product_id, effective_date)
            VALUES (p_id, p_value, p_product_id, now());
        RETURN;
    EXCEPTION
        WHEN unique_violation THEN
            -- try an update
            UPDATE
                pk_violation_test
            SET
                value = p_value,
                product_id = p_product_id,
                effective_date = now()
            WHERE
                id = p_id;
    IF found THEN
        RETURN;
        END IF;
    END;
END;

$$
LANGUAGE plpgsql;

So what happens in this case? Watch how we now get TWO dead tuples: one for the failed insert and one for the old row version left behind by the update (when it is not a HOT update). I will vacuum the table first so that there is no question about how many dead tuples there are:

/* Vacuum the table: */
vacuum pk_violation_test;

/* Verify the dead tuple count: */
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup
FROM
    pg_stat_all_tables
WHERE
    relname = 'pk_violation_test';
 schemaname |      relname      | n_live_tup | n_dead_tup
------------+-------------------+------------+------------
 public     | pk_violation_test |      10001 |          0

/* Call the sample function */
select * from pk_violation_test_func(0, 44.33893489873, 46);
 pk_violation_test_func
------------------------

(1 row)

/* Verify the dead tuple count: */
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup
FROM
    pg_stat_all_tables
WHERE
    relname = 'pk_violation_test';
 schemaname |      relname      | n_live_tup | n_dead_tup
------------+-------------------+------------+------------
 public     | pk_violation_test |      10001 |          2

As you can see, the insert attempt using the “ON CONFLICT” directive did not create a dead tuple. In turn, this makes inserts that violate a PK more efficient and faster, and it avoids unnecessary vacuum work. Remember the log message that resulted from the PK violation not being handled? That log message is gone too. A WIN all around!
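For contrast, here is a minimal sketch (my own rewrite, not output from a conversion tool; the function name is hypothetical) of the same upsert logic using the directive instead of an exception block:

```sql
/* Same upsert logic, but the dead tuple from the failed insert is avoided: */
CREATE OR REPLACE FUNCTION pk_violation_test_func_oc (p_id int, p_value numeric, p_product_id int)
    RETURNS VOID
    AS $$
BEGIN
    INSERT INTO pk_violation_test (id, value, product_id, effective_date)
        VALUES (p_id, p_value, p_product_id, now())
    ON CONFLICT (id)
        DO UPDATE SET
            value = EXCLUDED.value,
            product_id = EXCLUDED.product_id,
            effective_date = now();
END;
$$
LANGUAGE plpgsql;
```

When the row already exists, this still leaves one dead tuple for the old row version (unless it is a HOT update), but never the second one from the failed insert.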

The point is that it is very important to understand whatever DBMS you are running. Things that seem very simple can have pros and cons that need to be dealt with.

Enjoy!

Use of “pgtt” Extension in Self Managed vs. Cloud Products

There are several benefits to using temp tables in SQL DML operations. Given my long history of working with Oracle databases, I am highly familiar with Oracle-style “Global Temporary Tables” and have used them to solve many performance issues when complex joins are involved. That’s why I was excited when Gilles Darold released the “pgtt” extension for PostgreSQL, and even more excited when Google released the extension for use in the Google CloudSQL for Postgres and AlloyDB for Postgres products.

In my opinion, native Postgres temporary tables leave a lot to be desired. For example, each table has to be created upon use; nothing is persistent. This takes time and leads to Postgres catalog bloat (I’m looking to do a post on this soon). That being said, in self-managed instances, “pgtt” is very easy to use. You can load the extension either at connection time by user, or for anyone that connects to a database. For example (make sure the extension is created in the database before setting these):

alter role test_user SET session_preload_libraries to 'pgtt';

alter database test_pgtt SET session_preload_libraries = 'pgtt';

This will, by default, use the ‘pgtt’ version of temporary tables instead of the default temporary tables Postgres offers. You can then create ‘pgtt’ style tables and refer to them via the pgtt_schema with no further modifications.

To revert back to the “old” style temporary tables, you must alter the role / database and then disconnect; the change takes effect on new sessions. To my knowledge, if the library is enabled this way there is no way to use both styles of temporary tables at the same time.

alter role test_user RESET session_preload_libraries;

alter database pgtt_test RESET session_preload_libraries;

The Set Up…..

/* set flag "pgtt.enabled" on alloydb / cloudsql postgres console  to on */

/* create the extension in every database you need to use global temp tables */
CREATE EXTENSION pgtt;
LOAD 'pgtt';

/* Additional extensions needed for this test */
CREATE EXTENSION "uuid-ossp";

/* create the persistent global temporary table */
CREATE /*GLOBAL*/ TEMPORARY TABLE test_gtt_table (
    id integer,
    lbl text) ON COMMIT DELETE ROWS;

/* create an index on the global temp table */
CREATE INDEX ON pgtt_schema.test_gtt_table (id);

Use of ‘pgtt’ in Cloud Database Products

When you look at managed cloud databases (from any vendor), you have to understand that some items need to be locked down in order to safeguard the “managed” nature of the product. Unfortunately (as of this writing), at least in Google Cloud Postgres products, you cannot manipulate the parameter “session_preload_libraries”. So then you ask, “How can I programmatically use the extension?” At the time of this writing, I have found only two ways: either the user has to explicitly execute “LOAD ‘pgtt’” at the beginning of (or at some point in) their session, or the “LOAD” must be embedded in a function / procedure similar to the following:

/* function definition with loading of extension as a work around */
create or replace function public.test_gtt_function(P_IN_IDS integer[])
	returns table (
		id integer,
		lbl text
	)
    language plpgsql 
AS $func$
BEGIN
LOAD 'pgtt';
	/* revised query below */
	INSERT INTO pgtt_schema.test_gtt_table
		select generate_series,substr(uuid_generate_v4()::text,1,15) FROM generate_series(1, 5000);
	RETURN QUERY
	SELECT mygtt.id,mygtt.lbl from pgtt_schema.test_gtt_table mygtt where mygtt.id = ANY (P_IN_IDS);
END
$func$;

As you can see in the above function, the “LOAD” directive is the first thing executed. While I am not a fan of having to do it this way, the performance impact does not seem huge, but it is one you must nonetheless account for:

(postgres@##########:5432) [tpcc] > LOAD 'pgtt';
LOAD
Time: 3.811 ms
(postgres@##########:5432) [tpcc] > \watch 1
LOAD
Time: 2.922 ms
LOAD
Time: 1.219 ms
LOAD
Time: 1.335 ms
LOAD
Time: 1.300 ms
LOAD
Time: 1.282 ms
LOAD

One reason I do like the implementation within the function is that it is easier to remove the directive there than when it is embedded throughout application code. Either way, the right direction is to get the vendor to allow targeted values for this parameter. That will allow the most flexibility and the least amount of awareness and hassle when using this extension.

So What Happens If You Don’t Load The Library??

So if you do not load the library with one of the options above, the function still works. But what happens? The table actually behaves as a “real” table and the data is never truncated… even after disconnection:

create or replace function public.test_gtt_wo_load_function(P_IN_IDS integer[])
	returns table (
		id integer,
		lbl text
	)
    language plpgsql 
AS $func$
BEGIN
	/* revised query below */
	INSERT INTO pgtt_schema.test_gtt_table
		select generate_series,substr(uuid_generate_v4()::text,1,15) FROM generate_series(1, 5000);
	RETURN QUERY
	SELECT mygtt.id,mygtt.lbl from pgtt_schema.test_gtt_table mygtt where mygtt.id = ANY (P_IN_IDS);
END
$func$;

(postgres@#######:5432) [tpcc] > \set ids ARRAY[3,28,69,235,389,512,678,899]
(postgres@#######:5432) [tpcc] > select * from public.test_gtt_wo_load_function(:ids);
 id  |       lbl
-----+-----------------
   3 | 2f82b04f-db13-4
  28 | 49dfea9a-5c41-4
  69 | d2d27343-a8d4-4
 235 | d5f32cd9-9495-4
 389 | 69842fc5-f0e5-4
 512 | e52c1625-0fab-4
 678 | 0fd320a9-8f08-4
 899 | 452a5f95-6e16-4
(8 rows)

(postgres@#######:5432) [tpcc] > select * from public.test_gtt_wo_load_function(:ids);
 id  |       lbl
-----+-----------------
   3 | 2f82b04f-db13-4
   3 | 632e097d-0f80-4
  28 | 49dfea9a-5c41-4
  28 | bf9ed2c4-14d2-4
  69 | d2d27343-a8d4-4
........
 899 | 452a5f95-6e16-4
 899 | d40c915f-04c7-4
(16 rows)

(postgres@#######:5432) [tpcc] > select * from public.test_gtt_wo_load_function(:ids);
 id  |       lbl
-----+-----------------
   3 | 2f82b04f-db13-4
   3 | 632e097d-0f80-4
   3 | 9cd65fda-ab4a-4
  28 | 49dfea9a-5c41-4
  28 | bf9ed2c4-14d2-4
  28 | 6cdd9a22-5c60-4
.........
 899 | 452a5f95-6e16-4
 899 | d40c915f-04c7-4
 899 | 0319d855-a935-4
(24 rows)

(postgres@#######:5432) [tpcc] > \q
shaneborden_google_com@replication-instance-1:/home/shaneborden_google_com $ psql -h ###### -p 5432 -U postgres -d tpcc
Password for user postgres:
(postgres@#######:5432) [tpcc] > \set ids ARRAY[3,28,69,235,389,512,678,899]
(postgres@#######:5432) [tpcc] > select * from public.test_gtt_wo_load_function(:ids);
 id  |       lbl
-----+-----------------
   3 | 82f66c20-f484-4
   3 | 01ad1b4d-2790-4
   3 | a3c38d05-fb7d-4
   3 | 7a4b2ff8-3e32-4
   3 | af8a96e1-194c-4
   3 | 02586796-1cc6-4
  28 | 62b1fb31-d3c0-4
  28 | 3acfe744-574f-4
  28 | 1e2d64fa-1106-4
  28 | d92a9321-4330-4
  28 | f3b1d45e-d4a7-4
  28 | 270146eb-1098-4
  69 | 506ad97c-fdad-4
  69 | bf3340d5-17a2-4
  69 | 47cbd9eb-83da-4
  69 | 487e8b22-4a05-4
  69 | b17314bc-5822-4
.......
 678 | a8d7dc45-c027-4
 678 | 17b666e5-1fc1-4
 678 | 4386eeb4-7b4e-4
 678 | 2942f25f-3d13-4
 678 | 3853bac0-e0ba-4
 678 | 95499bef-4440-4
 899 | 4503351c-4fe4-4
 899 | 900337bf-a658-4
 899 | 5f1f02f8-29a3-4
 899 | d635c921-8f9b-4
 899 | 0dd42f22-0f8e-4
 899 | 16bb8dcf-4e35-4
(48 rows)

Pretty crazy. So if for some reason the LOAD fails or never occurs, you could end up with a temp table containing duplicate rows.

So what happens if the “pgtt” temp table was never created? You get a failure, as you would expect:

(postgres@#######:5432) [tpcc] > select * from public.test_gtt_wo_load_function(:ids);
ERROR:  relation "pgtt_schema.test_gtt_table" does not exist
LINE 1: INSERT INTO pgtt_schema.test_gtt_table
                    ^
QUERY:  INSERT INTO pgtt_schema.test_gtt_table
select generate_series,substr(uuid_generate_v4()::text,1,15) FROM generate_series(1, 5000)
CONTEXT:  PL/pgSQL function test_gtt_wo_load_function(integer[]) line 4 at SQL statement
Time: 5.204 ms

So it is important to ensure that if you intend to use “pgtt” you have a programmatic method to load the extension… In the meantime, I will be working on my end to find a cleaner way to load it in the managed products I have access to.