Category Archives: SQL

Understanding and Setting PostgreSQL JDBC Fetch Size

By default, the PostgreSQL JDBC driver fetches all rows at once and attempts to load them into memory, unlike other drivers such as Oracle's, which by default fetches only 10 rows at a time. Both defaults have pros and cons; however, in the context of the types of workloads I see every day, the PostgreSQL default is typically not optimal.

As a frame of reference, the default PostgreSQL fetch size is just fine if you have queries that always return small result sets. If there is a chance that larger result sets could be retrieved, you will see a high variance in performance between small and large result sets, and an explicit fetch size should be considered.

To demonstrate, I created a demo application that builds a simulated table with 2 million rows and selects them with various fetch sizes:

  • fetchSize of 1000 (stream 1000 rows at a time)
  • fetchSize of 5000 (stream 5000 rows at a time)
  • fetchSize of 0 (fetch all rows at once) – Default

For a query returning 2 million rows, the test run produced the following results:

java -cp .:/home/shaneborden_google_com/java/postgresql-42.5.4.jar DatabaseFetchSizeTest

--- Database Setup ---
  [MEMORY] Initial Baseline: 6.68 MB Used (Total Heap: 56.00 MB)
Existing table dropped.
New table created: large_data_test
Inserting 2000000 rows... Done in 44.36 seconds.
  [MEMORY] After Data Insertion: 6.72 MB Used (Total Heap: 40.00 MB)

------------------------------------------------------------
--- Running Test: Small Chunk (1000 rows) (Fetch Size: 1000) ---
------------------------------------------------------------
Executing query with fetch size 1000...
  [MEMORY] 1. Before Query Execution: 6.63 MB Used (Total Heap: 40.00 MB)
  [MEMORY] 2. After Query Execution (Data Loaded/Cursor Open): 6.86 MB Used (Total Heap: 40.00 MB)
Test Complete.
  Total Rows Read: 2000000
  Total Time Taken: 1613 ms
  [MEMORY] 3. After All Rows Processed: 6.67 MB Used (Total Heap: 68.00 MB)
  Mode: STREAMING. Expect memory usage to remain low and flat across stages 1-3.

------------------------------------------------------------
--- Running Test: Optimal Chunk (5000 rows) (Fetch Size: 5000) ---
------------------------------------------------------------
Executing query with fetch size 5000...
  [MEMORY] 1. Before Query Execution: 6.67 MB Used (Total Heap: 40.00 MB)
  [MEMORY] 2. After Query Execution (Data Loaded/Cursor Open): 7.76 MB Used (Total Heap: 40.00 MB)
Test Complete.
  Total Rows Read: 2000000
  Total Time Taken: 1104 ms
  [MEMORY] 3. After All Rows Processed: 6.67 MB Used (Total Heap: 56.00 MB)
  Mode: STREAMING. Expect memory usage to remain low and flat across stages 1-3.

------------------------------------------------------------
--- Running Test: Default (0 - all at once) (Fetch Size: 0) ---
------------------------------------------------------------
Executing query with fetch size 0...
  [MEMORY] 1. Before Query Execution: 6.67 MB Used (Total Heap: 40.00 MB)
  [MEMORY] 2. After Query Execution (Data Loaded/Cursor Open): 454.03 MB Used (Total Heap: 780.00 MB)
Test Complete.
  Total Rows Read: 2000000
  Total Time Taken: 1480 ms
  [MEMORY] 3. After All Rows Processed: 454.03 MB Used (Total Heap: 1560.00 MB)
  Mode: ALL-AT-ONCE. Expect a large memory spike at stage 2.

To summarize:

  • fetchSize of 1000 (stream 1000 rows at a time) – Total Time: 1613ms
  • fetchSize of 5000 (stream 5000 rows at a time) – Total Time: 1104ms
  • fetchSize of 0 (fetch all rows at once) – Total Time: 1480ms

So in the above example, just setting the fetch size down from unlimited to 5000 yielded a performance improvement of just over 25%!

If you have a mixture of result set sizes, this second test shows the effect of the fetch size setting when smaller result sets are expected. As demonstrated with a 2000 row test, the variance still exists, but it is not as dramatic, so setting the fetch size at 5000 for most queries would be acceptable:

java -cp .:/home/shaneborden_google_com/java/postgresql-42.5.4.jar DatabaseFetchSizeTest

--- Database Setup ---
  [MEMORY] Initial Baseline: 6.68 MB Used (Total Heap: 68.00 MB)
Existing table dropped.
New table created: large_data_test
Inserting 2000 rows... Done in 0.15 seconds.
  [MEMORY] After Data Insertion: 6.60 MB Used (Total Heap: 40.00 MB)

------------------------------------------------------------
--- Running Test: Small Chunk (1000 rows) (Fetch Size: 1000) ---
------------------------------------------------------------
Executing query with fetch size 1000...
  [MEMORY] 1. Before Query Execution: 6.60 MB Used (Total Heap: 40.00 MB)
  [MEMORY] 2. After Query Execution (Data Loaded/Cursor Open): 6.83 MB Used (Total Heap: 40.00 MB)
Test Complete.
  Total Rows Read: 2000
  Total Time Taken: 41 ms
  [MEMORY] 3. After All Rows Processed: 6.63 MB Used (Total Heap: 40.00 MB)
  Mode: STREAMING. Expect memory usage to remain low and flat across stages 1-3.

------------------------------------------------------------
--- Running Test: Optimal Chunk (5000 rows) (Fetch Size: 5000) ---
------------------------------------------------------------
Executing query with fetch size 5000...
  [MEMORY] 1. Before Query Execution: 6.63 MB Used (Total Heap: 40.00 MB)
  [MEMORY] 2. After Query Execution (Data Loaded/Cursor Open): 7.07 MB Used (Total Heap: 40.00 MB)
Test Complete.
  Total Rows Read: 2000
  Total Time Taken: 28 ms
  [MEMORY] 3. After All Rows Processed: 7.07 MB Used (Total Heap: 40.00 MB)
  Mode: STREAMING. Expect memory usage to remain low and flat across stages 1-3.

------------------------------------------------------------
--- Running Test: Default (0 - all at once) (Fetch Size: 0) ---
------------------------------------------------------------
Executing query with fetch size 0...
  [MEMORY] 1. Before Query Execution: 6.63 MB Used (Total Heap: 40.00 MB)
  [MEMORY] 2. After Query Execution (Data Loaded/Cursor Open): 7.07 MB Used (Total Heap: 40.00 MB)
Test Complete.
  Total Rows Read: 2000
  Total Time Taken: 36 ms
  [MEMORY] 3. After All Rows Processed: 7.08 MB Used (Total Heap: 40.00 MB)
  Mode: ALL-AT-ONCE. Expect a large memory spike at stage 2.

To summarize:

  • fetchSize of 1000 (stream 1000 rows at a time) – Total Time: 41ms
  • fetchSize of 5000 (stream 5000 rows at a time) – Total Time: 28ms
  • fetchSize of 0 (fetch all rows at once) – Total Time: 36ms

So in the above example, using a fetch size of 5000 even for smaller result sets still yielded a 22% improvement!

One thing to keep in mind is that, depending on the width of your rows, the fetch size may need to be tuned further. A network packet is only so large, and in cases of very wide rows, a smaller fetch size may be more beneficial.

Try changing the fetchSize in your application!

Code Appendix

DatabaseFetchSizeTest.java

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

/**
 * A Java application to demonstrate the effect of the JDBC fetchSize parameter
 * on query performance when retrieving a large result set from PostgreSQL.
 *
 * The PostgreSQL JDBC driver, by default, fetches all rows at once. By setting
 * a non-zero fetchSize, we enable cursor mode on the server, allowing the driver
 * to retrieve results in chunks, which is crucial for handling large datasets
 * without running out of memory on the client side.
 */
public class DatabaseFetchSizeTest {

    // --- CONFIGURATION ---
    private static final String DB_URL = "jdbc:postgresql://10.3.1.17:5432/postgres"; // Change 'your_database'
    private static final String DB_USER = "postgres";     // Change this
    private static final String DB_PASSWORD = "Google54321"; // Change this

    private static final String TABLE_NAME = "large_data_test";
    private static final int NUM_ROWS_TO_INSERT = 2000;
    private static final int SMALL_FETCH_SIZE = 1000; // Small size for initial run
    private static final int OPTIMAL_FETCH_SIZE = 5000; // A reasonable chunk size
    private static final int DEFAULT_FETCH_SIZE = 0; // PostgreSQL default (fetches all rows at once)

    public static void main(String[] args) {
        // Ensure the PostgreSQL driver is loaded
        try {
            Class.forName("org.postgresql.Driver");
        } catch (ClassNotFoundException e) {
            System.err.println("PostgreSQL JDBC Driver not found. Please ensure the JAR is in your classpath.");
            return;
        }

        // 1. Setup Environment
        try (Connection conn = getConnection()) {
            setupDatabase(conn);

            // 2. Run Tests
            // Running the 'default' last usually gives the best contrast in memory usage.
            runTest(conn, SMALL_FETCH_SIZE, "Small Chunk (1000 rows)");
            runTest(conn, OPTIMAL_FETCH_SIZE, "Optimal Chunk (5000 rows)");
            runTest(conn, DEFAULT_FETCH_SIZE, "Default (0 - all at once)");


        } catch (SQLException e) {
            System.err.println("Database error occurred: " + e.getMessage());
            e.printStackTrace();
        }
    }

    /**
     * Establishes a connection and sets auto-commit to false to allow streaming.
     */
    private static Connection getConnection() throws SQLException {
        Properties props = new Properties();
        props.setProperty("user", DB_USER);
        props.setProperty("password", DB_PASSWORD);
        // Important: Set the auto-commit to false to enable server-side cursors (streaming)
        Connection conn = DriverManager.getConnection(DB_URL, props);
        conn.setAutoCommit(false);
        return conn;
    }

    /**
     * Creates the test table and populates it with a large number of rows.
     */
    private static void setupDatabase(Connection conn) throws SQLException {
        System.out.println("--- Database Setup ---");
        displayMemory("Initial Baseline");

        try (Statement stmt = conn.createStatement()) {
            // Drop table if it exists
            stmt.executeUpdate("DROP TABLE IF EXISTS " + TABLE_NAME);
            System.out.println("Existing table dropped.");

            // Create new table
            stmt.executeUpdate("CREATE TABLE " + TABLE_NAME + " (id SERIAL PRIMARY KEY, data_value TEXT, timestamp TIMESTAMP)");
            System.out.println("New table created: " + TABLE_NAME);
        }

        // Insert a large number of rows
        String insertSQL = "INSERT INTO " + TABLE_NAME + " (data_value, timestamp) VALUES (?, NOW())";
        try (PreparedStatement pstmt = conn.prepareStatement(insertSQL)) {
            System.out.print("Inserting " + NUM_ROWS_TO_INSERT + " rows...");
            long startTime = System.currentTimeMillis();

            for (int i = 1; i <= NUM_ROWS_TO_INSERT; i++) {
                // Each row has a string of decent length
                pstmt.setString(1, "Data record number " + i + " - Padding to increase row size for better memory test.");
                pstmt.addBatch();
                if (i % 10000 == 0) { // Execute batch every 10,000 rows
                    pstmt.executeBatch();
                    conn.commit();
                }
            }
            pstmt.executeBatch();
            conn.commit();
            long endTime = System.currentTimeMillis();
            System.out.printf(" Done in %.2f seconds.\n", (endTime - startTime) / 1000.0);
        }
        displayMemory("After Data Insertion");
    }

    /**
     * Helper method to calculate and display current used heap memory.
     * @param stage A string label describing the point of measurement.
     */
    private static void displayMemory(String stage) {
        // Request garbage collection hint to get a cleaner reading
        System.gc();
        long totalMemory = Runtime.getRuntime().totalMemory();
        long freeMemory = Runtime.getRuntime().freeMemory();
        long usedMemory = totalMemory - freeMemory;
        double usedMB = usedMemory / (1024.0 * 1024.0);
        double totalMB = totalMemory / (1024.0 * 1024.0);

        System.out.printf("  [MEMORY] %s: %.2f MB Used (Total Heap: %.2f MB)\n",
                stage, usedMB, totalMB);
    }

    /**
     * Runs the query test with a specified fetch size and measures performance and memory.
     * @param conn The database connection.
     * @param fetchSize The JDBC fetch size to use.
     * @param label A description for the test.
     */
    private static void runTest(Connection conn, int fetchSize, String label) {
        System.out.println("\n------------------------------------------------------------");
        System.out.println("--- Running Test: " + label + " (Fetch Size: " + fetchSize + ") ---");
        System.out.println("------------------------------------------------------------");

        String querySQL = "SELECT id, data_value, timestamp FROM " + TABLE_NAME;
        long rowsRead = 0;

        try (Statement stmt = conn.createStatement()) {
            // Set the crucial parameter
            stmt.setFetchSize(fetchSize);
            System.out.println("Executing query with fetch size " + fetchSize + "...");

            displayMemory("1. Before Query Execution");
            long startTime = System.currentTimeMillis();

            try (ResultSet rs = stmt.executeQuery(querySQL)) {

                // CRITICAL MEMORY MEASUREMENT POINT:
                // For fetchSize=0, the entire result set is in memory here.
                displayMemory("2. After Query Execution (Data Loaded/Cursor Open)");

                // Process the result set (simulating application logic)
                while (rs.next()) {
                    rowsRead++;
                    // Read data to ensure all data is pulled from the driver/server
                    rs.getString("data_value");
                }

                long endTime = System.currentTimeMillis();
                long duration = endTime - startTime;

                System.out.println("Test Complete.");
                System.out.println("  Total Rows Read: " + rowsRead);
                System.out.printf("  Total Time Taken: %d ms\n", duration);

                displayMemory("3. After All Rows Processed");

                // Analyze the difference
                if (fetchSize == 0) {
                    System.out.println("  Mode: ALL-AT-ONCE. Expect a large memory spike at stage 2.");
                } else {
                    System.out.println("  Mode: STREAMING. Expect memory usage to remain low and flat across stages 1-3.");
                }
            }

        } catch (SQLException e) {
            System.err.println("Error during test run: " + e.getMessage());
            System.err.println("HINT: Ensure autoCommit is set to FALSE on the connection for fetchSize > 0 to work.");
        }
    }
}

Compile and Execute

PostgreSQL JDBC Fetch Size and Memory Test

This application is designed to demonstrate how the JDBC fetchSize parameter affects not only query time but also the client-side heap memory usage. It measures memory at critical stages: before the query, immediately after execution (when data is fetched), and after all rows have been processed.

1. Prerequisites & Database Configuration

To run this test, you will need the following:

  • Java Development Kit (JDK): Version 8 or newer.
  • PostgreSQL Database: A running instance that you can connect to.
  • PostgreSQL JDBC Driver: The official .jar file for the driver (e.g., postgresql-42.x.x.jar), which you can download from the PostgreSQL website.

Before running, you must update the connection details in DatabaseFetchSizeTest.java to match your environment.

2. Running the Application

When running the application, it’s beneficial to give the Java Virtual Machine (JVM) a specific amount of memory. This makes the memory spike during the default fetch (fetchSize = 0) more noticeable.

You can set the maximum heap size using the -Xmx flag when you run the application from your terminal:

# Example: Setting max heap size to 512MB
java -Xmx512m -cp .:path/to/postgresql-42.x.x.jar DatabaseFetchSizeTest

3. Expected Observations

The most critical comparison is the memory usage reported between Stage 1 (Before Query) and Stage 2 (After Query Execution).

 Fetch Size      | Stage 1 → Stage 2 Memory Change | Reason
-----------------+---------------------------------+-----------------------------------------------------------------------------------------
 0 (Default)     | Large increase                  | The driver loads the entire result set into the JVM’s memory immediately at Stage 2.
 > 0 (Streaming) | Minimal increase                | Only a small chunk of rows (e.g., 5000) is loaded into memory at Stage 2, keeping usage low.

By running this test, you’ll see concrete proof that a streaming fetch (fetchSize > 0) is essential for maintaining memory stability when processing large volumes of data.


Some of My Favorite Things – Postgres Queries

In the spirit of the holiday season, I thought I would write a quick post regarding some of my favorite queries that I use on a day-to-day basis working on Postgres. Some of these queries I have developed myself, and others were found on the internet (hat tip to those who have previously posted) and further refined.

Many more are found on my github site:

https://github.com/shane-borden/sqlScripts/tree/master/postgres

Hope these queries can also help you in your day to day quest to make Postgres run better!

The first three queries rank top SQL from pg_stat_statements by execution count, mean_exec_time, or total_exec_time. I love to use these queries to get a quick glance at what I should focus on tuning. Given that pg_stat_statements tracks lots of things, you can filter out certain “query text” as necessary so that it does not influence the ranking.

Top SQL by Mean Exec Time

WITH
hist AS (
SELECT queryid::text,
       SUBSTRING(query from 1 for 1000) query,
       ROW_NUMBER () OVER (ORDER BY mean_exec_time::numeric DESC) rn,
       SUM(mean_exec_time::numeric) mean_exec_time
  FROM pg_stat_statements
 WHERE queryid IS NOT NULL
 		AND query::text not like '%pg\_%' 
 		AND query::text not like '%g\_%'
        /* Add more filters here */
 GROUP BY
       queryid,
       SUBSTRING(query from 1 for 1000),
       mean_exec_time::numeric
),
total AS (
SELECT SUM(mean_exec_time::numeric) mean_exec_time FROM hist
)
SELECT DISTINCT
       h.queryid::text,
       ROUND(h.mean_exec_time::numeric,3) mean_exec_time,
       ROUND(100 * h.mean_exec_time / t.mean_exec_time, 1) percent,
       h.query
  FROM hist h,
       total t
 WHERE h.mean_exec_time >= t.mean_exec_time / 1000 AND rn <= 14
 UNION ALL
SELECT 'Others',
       ROUND(COALESCE(SUM(h.mean_exec_time), 0), 3) mean_exec_time,
       COALESCE(ROUND(100 * SUM(h.mean_exec_time) / AVG(t.mean_exec_time), 1), 0) percent,
       NULL sql_text
  FROM hist h,
       total t
 WHERE h.mean_exec_time < t.mean_exec_time / 1000 OR rn > 14
 ORDER BY 3 DESC NULLS LAST;

Top SQL by Total Exec Time

WITH
hist AS (
SELECT queryid::text,
       SUBSTRING(query from 1 for 100) query,
       ROW_NUMBER () OVER (ORDER BY total_exec_time::numeric DESC) rn,
       SUM(total_exec_time::numeric) total_exec_time
  FROM pg_stat_statements
 WHERE queryid IS NOT NULL
 		AND query::text not like '%pg\_%' 
 		AND query::text not like '%g\_%'
 		/* Add more filters here */
 GROUP BY
       queryid,
       SUBSTRING(query from 1 for 100),
       total_exec_time::numeric
),
total AS (
SELECT SUM(total_exec_time::numeric) total_exec_time FROM hist
)
SELECT DISTINCT
       h.queryid::text,
       ROUND(h.total_exec_time::numeric,3) total_exec_time,
       ROUND(100 * h.total_exec_time / t.total_exec_time, 1) percent,
       h.query
  FROM hist h,
       total t
 WHERE h.total_exec_time >= t.total_exec_time / 1000 AND rn <= 14
 UNION ALL
SELECT 'Others',
       ROUND(COALESCE(SUM(h.total_exec_time::numeric), 0), 3) total_exec_time,
       COALESCE(ROUND(100 * SUM(h.total_exec_time) / AVG(t.total_exec_time), 1), 0) percent,
       NULL sql_text
  FROM hist h,
       total t
 WHERE h.total_exec_time < t.total_exec_time / 1000 OR rn > 14
 ORDER BY 3 DESC NULLS LAST;

Top SQL by Execution Count

WITH
hist AS (
SELECT queryid::text,
       SUBSTRING(query from 1 for 100) query,
       ROW_NUMBER () OVER (ORDER BY calls DESC) rn,
       calls
  FROM pg_stat_statements 
 WHERE queryid IS NOT NULL 
 		AND query::text not like '%pg\_%' 
 		AND query::text not like '%g\_%'
 		/* Add more filters here */
 GROUP BY
       queryid,
       SUBSTRING(query from 1 for 100),
       calls
),
total AS (
SELECT SUM(calls) calls FROM hist
)
SELECT DISTINCT
       h.queryid::text,
       h.calls,
       ROUND(100 * h.calls / t.calls, 1) percent,
       h.query
  FROM hist h,
       total t
 WHERE h.calls >= t.calls / 1000 AND rn <= 14
 UNION ALL
SELECT 'Others',
       COALESCE(SUM(h.calls), 0) calls,
       COALESCE(ROUND(100 * SUM(h.calls) / AVG(t.calls), 1), 0) percent,
       NULL sql_text
  FROM hist h,
       total t
 WHERE h.calls < t.calls / 1000 OR rn > 14
 ORDER BY 2 DESC NULLS LAST;

Object Sizes Including Toast

Shows the total size of a table, including a breakdown of its index and TOAST sizes.

SELECT
  *,
  pg_size_pretty(table_bytes) AS table,
  pg_size_pretty(toast_bytes) AS toast,
  pg_size_pretty(index_bytes) AS index,
  pg_size_pretty(total_bytes) AS total
FROM (
  SELECT
    *, total_bytes - index_bytes - COALESCE(toast_bytes, 0) AS table_bytes
  FROM (
    SELECT
      c.oid,
      nspname AS table_schema,
      relname AS table_name,
      c.reltuples AS row_estimate,
      pg_total_relation_size(c.oid) AS total_bytes,
      pg_indexes_size(c.oid) AS index_bytes,
      pg_total_relation_size(reltoastrelid) AS toast_bytes
    FROM
      pg_class c
      LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE relkind = 'r'
  ) a
) a
WHERE table_schema like '%'
AND table_name like '%'
AND total_bytes > 0
ORDER BY total_bytes DESC;

SQL Statements Using CPU

Using pg_stat_statements, this query attributes the combined execution and planning time totals as CPU time.

SELECT
    pss.userid,
    pss.dbid,
    pd.datname AS db_name,
    pss.queryid,
    round((pss.total_exec_time + pss.total_plan_time)::numeric, 2) AS total_time,
    pss.calls,
    round((pss.mean_exec_time + pss.mean_plan_time)::numeric, 2) AS mean,
    round((100 * (pss.total_exec_time + pss.total_plan_time) / sum((pss.total_exec_time + pss.total_plan_time)::numeric) OVER ())::numeric, 2) AS cpu_portion_pctg,
    substr(pss.query, 1, 200) short_query
FROM
    pg_stat_statements pss,
    pg_database pd
WHERE
    pd.oid = pss.dbid
    AND query::text NOT LIKE '%FOR UPDATE%'
    /* Add more filters here */
ORDER BY
    (pss.total_exec_time + pss.total_plan_time) DESC
LIMIT 30;

Stats / Vacuum Projection Script

This script looks at the database and table options set for vacuum and analyze to report when vacuum / analyze is projected to run and the last time each actually ran. It will give you a good idea of how well vacuum and analyze are keeping up:

WITH tbl_reloptions AS (
SELECT
	oid,
    oid::regclass table_name,
    substr(unnest(reloptions), 1,  strpos(unnest(reloptions), '=') -1) option,
    substr(unnest(reloptions), 1 + strpos(unnest(reloptions), '=')) value
FROM
    pg_class c
WHERE reloptions is NOT null)
SELECT
    s.schemaname ||'.'|| s.relname as relname,
    n_live_tup live_tup,
    n_dead_tup dead_dup,
    n_tup_hot_upd hot_upd,
    n_mod_since_analyze mod_since_stats,
    n_ins_since_vacuum ins_since_vac,
    case 
	  when avacinsscalefactor.value is not null and avacinsthresh.value is not null
        then ROUND(((n_live_tup * avacinsscalefactor.value::numeric) + avacinsthresh.value::numeric),0)
      when avacinsscalefactor.value is null and avacinsthresh.value is not null
      	then ROUND(((n_live_tup * (select setting::numeric from pg_settings where name = 'autovacuum_vacuum_insert_scale_factor')) + avacinsthresh.value::numeric),0)
      when avacinsscalefactor.value is not null and avacinsthresh.value is null
      	then ROUND(((n_live_tup * avacinsscalefactor.value::numeric) + (select setting::numeric from pg_settings where name = 'autovacuum_vacuum_insert_threshold')),0)
      else ROUND(((n_live_tup * (select setting::numeric from pg_settings where name = 'autovacuum_vacuum_insert_scale_factor')) + (select setting::numeric from pg_settings where name = 'autovacuum_vacuum_insert_threshold')),0) 
    end as ins_for_vac,
    case 
	  when avacscalefactor.value is not null and avacthresh.value is not null
        then ROUND(((n_live_tup * avacscalefactor.value::numeric) + avacthresh.value::numeric),0)
      when avacscalefactor.value is null and avacthresh.value is not null
      	then ROUND(((n_live_tup * (select setting::numeric from pg_settings where name = 'autovacuum_vacuum_scale_factor')) + avacthresh.value::numeric),0)
      when avacscalefactor.value is not null and avacthresh.value is null
      	then ROUND(((n_live_tup * avacscalefactor.value::numeric) + (select setting::numeric from pg_settings where name = 'autovacuum_vacuum_threshold')),0)
      else ROUND(((n_live_tup * (select setting::numeric from pg_settings where name = 'autovacuum_vacuum_scale_factor')) + (select setting::numeric from pg_settings where name = 'autovacuum_vacuum_threshold')),0) 
    end as mods_for_vac,
    case 
	  when avacanalyzescalefactor.value is not null and avacanalyzethresh.value is not null
        then ROUND(((n_live_tup * avacanalyzescalefactor.value::numeric) + avacanalyzethresh.value::numeric),0)
      when avacanalyzescalefactor.value is null and avacanalyzethresh.value is not null
      	then ROUND(((n_live_tup * (select setting::numeric from pg_settings where name = 'autovacuum_analyze_scale_factor')) + avacanalyzethresh.value::numeric),0)
      when avacanalyzescalefactor.value is not null and avacanalyzethresh.value is null
      	then ROUND(((n_live_tup * avacanalyzescalefactor.value::numeric) + (select setting::numeric from pg_settings where name = 'autovacuum_analyze_threshold')),0)
      else ROUND(((n_live_tup * (select setting::numeric from pg_settings where name = 'autovacuum_analyze_scale_factor')) + (select setting::numeric from pg_settings where name = 'autovacuum_analyze_threshold')),0) 
    end as mods_for_stats,
    case 
      when avacfreezeage is not null
        then ROUND((greatest(age(c.relfrozenxid),age(t.relfrozenxid))::numeric / avacfreezeage.value::numeric * 100),2) 
      else ROUND((greatest(age(c.relfrozenxid),age(t.relfrozenxid))::numeric / (select setting::numeric from pg_settings where name = 'autovacuum_freeze_max_age') * 100),2) 
      end as avac_pct_frz,
    greatest(age(c.relfrozenxid),age(t.relfrozenxid)) max_txid_age,
    to_char(last_vacuum, 'YYYY-MM-DD HH24:MI') last_vac,
    to_char(last_analyze, 'YYYY-MM-DD HH24:MI') last_stats,
    to_char(last_autovacuum, 'YYYY-MM-DD HH24:MI') last_avac,
    to_char(last_autoanalyze, 'YYYY-MM-DD HH24:MI') last_astats,
    vacuum_count vac_cnt,
    analyze_count stats_cnt,
    autovacuum_count avac_cnt,
    autoanalyze_count astats_cnt,
    c.reloptions,
    case
      when avacenabled.value is not null
        then avacenabled.value::text
      when (select setting::text from pg_settings where name = 'autovacuum') = 'on'
        then 'true'
      else 'false'
    end as autovac_enabled
FROM
    pg_stat_all_tables s
JOIN pg_class c ON (s.relid = c.oid)
LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
LEFT JOIN tbl_reloptions avacinsscalefactor on (s.relid = avacinsscalefactor.oid and avacinsscalefactor.option = 'autovacuum_vacuum_insert_scale_factor')
LEFT JOIN tbl_reloptions avacinsthresh on (s.relid = avacinsthresh.oid and avacinsthresh.option = 'autovacuum_vacuum_insert_threshold')
LEFT JOIN tbl_reloptions avacscalefactor on (s.relid = avacscalefactor.oid and avacscalefactor.option = 'autovacuum_vacuum_scale_factor')
LEFT JOIN tbl_reloptions avacthresh on (s.relid = avacthresh.oid and avacthresh.option = 'autovacuum_vacuum_threshold')
LEFT JOIN tbl_reloptions avacanalyzescalefactor on (s.relid = avacanalyzescalefactor.oid and avacanalyzescalefactor.option = 'autovacuum_analyze_scale_factor')
LEFT JOIN tbl_reloptions avacanalyzethresh on (s.relid = avacanalyzethresh.oid and avacanalyzethresh.option = 'autovacuum_analyze_threshold')
LEFT JOIN tbl_reloptions avacfreezeage on (s.relid = avacfreezeage.oid and avacfreezeage.option = 'autovacuum_freeze_max_age')
LEFT JOIN tbl_reloptions avacenabled on (s.relid = avacenabled.oid and avacenabled.option = 'autovacuum_enabled')
WHERE
    s.relname IN (
        SELECT
            t.table_name
		FROM
    		information_schema.tables t
    		JOIN pg_catalog.pg_class c ON (t.table_name = c.relname)
    		LEFT JOIN pg_catalog.pg_user u ON (c.relowner = u.usesysid)
        WHERE
            t.table_schema like '%'
            AND (u.usename like '%' OR u.usename is null)
            AND t.table_name like '%'
            AND t.table_schema not in ('information_schema','pg_catalog')
            AND t.table_type not in ('VIEW')
			AND t.table_catalog = current_database())
    AND n_dead_tup >= 0
    AND n_live_tup > 0
ORDER BY 3;

Unused / Rarely Used Indexes

To keep a well-run system, it’s important to maintain as few indexes as possible. This query will show which indexes have not been used recently. The original version was obtained from https://www.databasesoup.com/2014/05/new-finding-unused-indexes-query.html (Josh Berkus).

WITH table_scans AS (
    SELECT
        relid,
        tables.idx_scan + tables.seq_scan AS all_scans,
        (tables.n_tup_ins + tables.n_tup_upd + tables.n_tup_del) AS writes,
        pg_relation_size(relid) AS table_size
    FROM
        pg_stat_all_tables AS tables
    WHERE
        schemaname NOT IN ('pg_toast', 'pg_catalog', 'partman')
),
all_writes AS (
    SELECT
        sum(writes) AS total_writes
    FROM
        table_scans
),
indexes AS (
    SELECT
        idx_stat.relid,
        idx_stat.indexrelid,
        idx_stat.schemaname,
        idx_stat.relname AS tablename,
        idx_stat.indexrelname AS indexname,
        idx_stat.idx_scan,
        pg_relation_size(idx_stat.indexrelid) AS index_bytes,
        indexdef ~* 'USING btree' AS idx_is_btree
    FROM
        pg_stat_user_indexes AS idx_stat
        JOIN pg_index USING (indexrelid)
        JOIN pg_indexes AS indexes ON idx_stat.schemaname = indexes.schemaname
            AND idx_stat.relname = indexes.tablename
            AND idx_stat.indexrelname = indexes.indexname
    WHERE
        pg_index.indisunique = FALSE
),
index_ratios AS (
    SELECT
        schemaname,
        tablename,
        indexname,
        idx_scan,
        all_scans,
        round((
            CASE WHEN all_scans = 0 THEN
                0.0::numeric
            ELSE
                idx_scan::numeric / all_scans * 100
            END), 2) AS index_scan_pct,
        writes,
        round((
            CASE WHEN writes = 0 THEN
                idx_scan::numeric
            ELSE
                idx_scan::numeric / writes
            END), 2) AS scans_per_write,
        pg_size_pretty(index_bytes) AS index_size,
        pg_size_pretty(table_size) AS table_size,
        idx_is_btree,
        index_bytes
    FROM
        indexes
        JOIN table_scans USING (relid)
),
index_groups AS (
    SELECT
        'Never Used Indexes' AS reason,
        *,
        1 AS grp
    FROM
        index_ratios
    WHERE
        idx_scan = 0
        AND idx_is_btree
    UNION ALL
    SELECT
        'Low Scans, High Writes' AS reason,
        *,
        2 AS grp
    FROM
        index_ratios
    WHERE
        scans_per_write <= 1
        AND index_scan_pct < 10
        AND idx_scan > 0
        AND writes > 100
        AND idx_is_btree
    UNION ALL
    SELECT
        'Seldom Used Large Indexes' AS reason,
        *,
        3 AS grp
    FROM
        index_ratios
    WHERE
        index_scan_pct < 5
        AND scans_per_write > 1
        AND idx_scan > 0
        AND idx_is_btree
        AND index_bytes > 100000000
    UNION ALL
    SELECT
        'High-Write Large Non-Btree' AS reason,
        index_ratios.*,
        4 AS grp
    FROM
        index_ratios,
        all_writes
    WHERE (writes::numeric / (total_writes + 1)) > 0.02
    AND NOT idx_is_btree
    AND index_bytes > 100000000
ORDER BY
    grp,
    index_bytes DESC
)
SELECT
    reason,
    schemaname,
    tablename,
    indexname,
    index_scan_pct,
    scans_per_write,
    index_size,
    table_size
FROM
    index_groups
WHERE
    tablename LIKE '%';

Rank Wait Events

This is a great query to rank the wait events currently observed on the system in pg_stat_activity from most frequent to least frequent. It does not provide historical reference, but rather a snapshot of the current moment in time:

WITH waits AS (
    SELECT
        wait_event,
        rank() OVER (ORDER BY count(wait_event) DESC) rn
    FROM pg_stat_activity
    WHERE wait_event IS NOT NULL
GROUP BY wait_event ORDER BY count(wait_event) ASC
),
total AS (
    SELECT
        SUM(rn) total_waits
    FROM
        waits
)
SELECT DISTINCT
    h.wait_event,
    h.rn,
    ROUND(100 * h.rn / t.total_waits, 1) percent
FROM
    waits h,
    total t
WHERE
    h.rn >= t.total_waits / 1000
    AND rn <= 14
UNION ALL
SELECT
    'Others',
    COALESCE(SUM(h.rn), 0) rn,
    COALESCE(ROUND(100 * SUM(h.rn) / AVG(t.total_waits), 1), 0) percent
FROM
    waits h,
    total t
WHERE
    h.rn < t.total_waits / 1000
    OR rn > 14
ORDER BY
    2 DESC NULLS LAST;

Tables With Missing FK Indexes

Often, queries will experience poor run times when foreign keys do not have indexes supporting them. This is a good query to show those missing indexes:

WITH y AS (
    SELECT
        pg_catalog.format('%I.%I', n1.nspname, c1.relname) AS referencing_tbl,
        pg_catalog.quote_ident(a1.attname) AS referencing_column,
        t.conname AS existing_fk_on_referencing_tbl,
        pg_catalog.format('%I.%I', n2.nspname, c2.relname) AS referenced_tbl,
        pg_catalog.quote_ident(a2.attname) AS referenced_column,
        pg_relation_size(pg_catalog.format('%I.%I', n1.nspname, c1.relname)) AS referencing_tbl_bytes,
        pg_relation_size(pg_catalog.format('%I.%I', n2.nspname, c2.relname)) AS referenced_tbl_bytes,
        pg_catalog.format($$CREATE INDEX ON %I.%I(%I);$$, n1.nspname, c1.relname, a1.attname) AS suggestion
    FROM
        pg_catalog.pg_constraint t
        JOIN pg_catalog.pg_attribute a1 ON a1.attrelid = t.conrelid
            AND a1.attnum = t.conkey[1]
        JOIN pg_catalog.pg_class c1 ON c1.oid = t.conrelid
        JOIN pg_catalog.pg_namespace n1 ON n1.oid = c1.relnamespace
        JOIN pg_catalog.pg_class c2 ON c2.oid = t.confrelid
        JOIN pg_catalog.pg_namespace n2 ON n2.oid = c2.relnamespace
        JOIN pg_catalog.pg_attribute a2 ON a2.attrelid = t.confrelid
            AND a2.attnum = t.confkey[1]
    WHERE
        t.contype = 'f'
        AND NOT EXISTS (
            SELECT
                1
            FROM
                pg_catalog.pg_index i
            WHERE
                i.indrelid = t.conrelid
                AND i.indkey[0] = t.conkey[1]))
SELECT
    referencing_tbl,
    referencing_column,
    existing_fk_on_referencing_tbl,
    referenced_tbl,
    referenced_column,
    pg_size_pretty(referencing_tbl_bytes) AS referencing_tbl_size,
    pg_size_pretty(referenced_tbl_bytes) AS referenced_tbl_size,
    suggestion
FROM
    y
ORDER BY
    referencing_tbl_bytes DESC,
    referenced_tbl_bytes DESC,
    referencing_tbl,
    referenced_tbl,
    referencing_column,
    referenced_column;

Blocking Lock Tree

postgres.ai is a great place to get some observability queries and this is one of my favorites:

with recursive activity as (
  select
    pg_blocking_pids(pid) blocked_by,
    *,
    age(clock_timestamp(), xact_start)::interval(0) as tx_age,
    -- "pg_locks.waitstart" – PG14+ only; for older versions:  age(clock_timestamp(), state_change) as wait_age,
    age(clock_timestamp(), (select max(l.waitstart) from pg_locks l where a.pid = l.pid))::interval(0) as wait_age
  from pg_stat_activity a
  where state is distinct from 'idle'
), blockers as (
  select
    array_agg(distinct c order by c) as pids
  from (
    select unnest(blocked_by)
    from activity
  ) as dt(c)
), tree as (
  select
    activity.*,
    1 as level,
    activity.pid as top_blocker_pid,
    array[activity.pid] as path,
    array[activity.pid]::int[] as all_blockers_above
  from activity, blockers
  where
    array[pid] <@ blockers.pids
    and blocked_by = '{}'::int[]
  union all
  select
    activity.*,
    tree.level + 1 as level,
    tree.top_blocker_pid,
    path || array[activity.pid] as path,
    tree.all_blockers_above || array_agg(activity.pid) over () as all_blockers_above
  from activity, tree
  where
    not array[activity.pid] <@ tree.all_blockers_above
    and activity.blocked_by <> '{}'::int[]
    and activity.blocked_by <@ tree.all_blockers_above
)
select
  pid,
  blocked_by,
  case when wait_event_type <> 'Lock' then replace(state, 'idle in transaction', 'idletx') else 'waiting' end as state,
  wait_event_type || ':' || wait_event as wait,
  wait_age,
  tx_age,
  to_char(age(backend_xid), 'FM999,999,999,990') as xid_age,
  to_char(2147483647 - age(backend_xmin), 'FM999,999,999,990') as xmin_ttf,
  datname,
  usename,
  (select count(distinct t1.pid) from tree t1 where array[tree.pid] <@ t1.path and t1.pid <> tree.pid) as blkd,
  format(
    '%s %s%s',
    lpad('[' || pid::text || ']', 9, ' '),
    repeat('.', level - 1) || case when level > 1 then ' ' end,
    left(query, 1000)
  ) as query
from tree
order by top_blocker_pid, level, pid;

Follow-Up: Reduce Vacuum by Using “ON CONFLICT” Directive

I previously blogged about ensuring that the “ON CONFLICT” directive is used in order to keep vacuum from having to do additional work. You can read the original blog here: Reduce Vacuum by Using “ON CONFLICT” Directive

Now that the “MERGE” functionality has been incorporated into Postgres 15 and above, I wanted to ensure that there was no “strange” behavior as it relates to vacuum when using merge. As you can see here, “MERGE” performs exactly as expected. For example, in a merge that either inserts a new row or updates an existing one, exactly one row is marked dead when the row already exists and the update branch is taken instead of the insert.

/* Create the table: */
CREATE TABLE public.pk_violation_test (
        id int PRIMARY KEY, 
        value numeric,
        product_id int,
        effective_date timestamp(3)
        );
 
 
/* Insert some mocked up data */
INSERT INTO public.pk_violation_test VALUES ( 
        generate_series(0,10000), 
        random()*1000,
        random()*100,
        current_timestamp(3));
 
/* Verify that there are no dead tuples: */
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup
FROM
    pg_stat_all_tables
WHERE
    relname = 'pk_violation_test';
 
 schemaname |      relname      | n_live_tup | n_dead_tup
------------+-------------------+------------+------------
 public     | pk_violation_test |      10001 |          0

Then, create a simple merge and check the results:

WITH insert_query AS (
    SELECT
        0 AS id,
        44.33893489873 AS value,
        46 AS product_id,
        now() AS effective_date) MERGE INTO pk_violation_test pkt
    USING insert_query i ON pkt.id = i.id
    WHEN MATCHED THEN
        UPDATE SET
            value = i.value, product_id = i.product_id, effective_date = i.effective_date
    WHEN NOT MATCHED THEN
        INSERT (id, value, product_id, effective_date)
            VALUES (i.id, i.value, i.product_id, i.effective_date);
MERGE 1
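
For comparison, the equivalent upsert written with the “ON CONFLICT” directive would look something like this (a sketch against the same demo table; it was not run as part of this test):

INSERT INTO public.pk_violation_test (id, value, product_id, effective_date)
VALUES (0, 44.33893489873, 46, now())
ON CONFLICT (id) DO UPDATE
    SET value = EXCLUDED.value,
        product_id = EXCLUDED.product_id,
        effective_date = EXCLUDED.effective_date;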

And then check the dead tuple count:

SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup
FROM
    pg_stat_all_tables
WHERE
    relname = 'pk_violation_test';
 schemaname |      relname      | n_live_tup | n_dead_tup
------------+-------------------+------------+------------
 public     | pk_violation_test |      10001 |          1
(1 row)

As expected, only one row is marked dead. Merge is such great functionality and I am glad to see it in Postgres. As you get time, all of your “ON CONFLICT” statements should be converted to use this functionality. Enjoy!

Use of “pgtt” Extension in Self Managed vs. Cloud Products

There are several benefits to using temp tables in SQL DML operations. Given my long relationship and work on Oracle databases, I am highly familiar with the Oracle style “Global Temporary Tables” and have used them to solve many performance issues when complex joins are involved. That’s why I was excited when Gilles Darold released the extension “pgtt” for PostgreSQL. I was even more excited when Google released the extension for use in the Google CloudSQL for Postgres and AlloyDB for Postgres products.

In my opinion, native Postgres temporary tables leave a lot to be desired. For example, each table has to be created upon use; nothing is persistent. This takes time and leads to Postgres catalog bloat (I am looking to do a post on this soon). That being said, in self-managed instances, ‘pgtt’ is very easy to use. You can load the extension either at connection time per user, or for anyone that connects to a database. For example (make sure the extension is created in the database before setting these):

alter role test_user SET session_preload_libraries to 'pgtt';

alter database test_pgtt SET session_preload_libraries = 'pgtt';

This will, by default, use the ‘pgtt’ version of temporary tables instead of the default temporary tables Postgres offers. You can then create ‘pgtt’ style tables and refer to them via the pgtt_schema with no further modifications.

To revert back in the session, you must “disconnect” the session and alter the user / database to return the behavior to the “old” style temporary tables. To my knowledge, if the library is enabled this way there is no way to use both styles of temporary tables at the same time.

alter role test_user RESET session_preload_libraries;

alter database pgtt_test RESET session_preload_libraries;

The Set Up…..

/* set flag "pgtt.enabled" on alloydb / cloudsql postgres console  to on */

/* create the extension in every database you need to use global temp tables */
CREATE EXTENSION pgtt;
LOAD 'pgtt';

/* Additional extensions needed for this test */
CREATE EXTENSION "uuid-ossp";

/* create the persistent global temporary table */
CREATE /*GLOBAL*/ TEMPORARY TABLE test_gtt_table (
id integer,
lbl text) ON COMMIT DELETE ROWS;

/* create an index on the global temp table */
CREATE INDEX ON pgtt_schema.test_gtt_table (id);

Use of ‘pgtt’ in Cloud Database Products

When you look at managed cloud databases (from any vendor), you have to understand that some items need to be locked down in order to safeguard the “managed” nature of the product. Unfortunately (as of this writing), at least in Google Cloud Postgres products, you cannot manipulate the parameter “session_preload_libraries”. So then you ask, “How can I programmatically use the extension?” As of the time of this writing, I have found only two ways: either the user has to explicitly execute “LOAD ‘pgtt'” at the beginning of (or at some point in) their session, or the “LOAD” must be embedded in a function / procedure similar to the following:

/* function definition with loading of extension as a work around */
create or replace function public.test_gtt_function(P_IN_IDS integer[])
	returns table (
		id integer,
		lbl text
	)
    language plpgsql 
AS $func$
BEGIN
LOAD 'pgtt';
	/* revised query below */
	INSERT INTO pgtt_schema.test_gtt_table
		select generate_series,substr(uuid_generate_v4()::text,1,15) FROM generate_series(1, 5000);
	RETURN QUERY
	SELECT mygtt.id,mygtt.lbl from pgtt_schema.test_gtt_table mygtt where mygtt.id = ANY (P_IN_IDS);
END
$func$;

As you can see in the above function, the “LOAD” directive is the first thing in the function. While I am not a fan of having to do it this way, there doesn’t seem to be a huge performance impact, but there is one nonetheless that you must account for:

(postgres@##########:5432) [tpcc] > LOAD 'pgtt';
LOAD
Time: 3.811 ms
(postgres@##########:5432) [tpcc] > \watch 1
LOAD
Time: 2.922 ms
LOAD
Time: 1.219 ms
LOAD
Time: 1.335 ms
LOAD
Time: 1.300 ms
LOAD
Time: 1.282 ms
LOAD

One reason I do like the implementation within the function is that it is easier to remove the directive later than if it were scattered throughout the application code. Either way, the right direction is to get the vendor to allow targeted values for this parameter. This would allow the most flexibility and require the least amount of awareness and hassle when using this extension.

So What Happens If You Don’t Load The Library??

So if you do not load the library with one of the options above, the function still works. But what happens? The table actually behaves as a “real” table and the data is never truncated… even after disconnection:

create or replace function public.test_gtt_wo_load_function(P_IN_IDS integer[])
	returns table (
		id integer,
		lbl text
	)
    language plpgsql 
AS $func$
BEGIN
	/* revised query below */
	INSERT INTO pgtt_schema.test_gtt_table
		select generate_series,substr(uuid_generate_v4()::text,1,15) FROM generate_series(1, 5000);
	RETURN QUERY
	SELECT mygtt.id,mygtt.lbl from pgtt_schema.test_gtt_table mygtt where mygtt.id = ANY (P_IN_IDS);
END
$func$;

(postgres@#######:5432) [tpcc] > \set ids ARRAY[3,28,69,235,389,512,678,899]
(postgres@#######:5432) [tpcc] > select * from public.test_gtt_wo_load_function(:ids);
 id  |       lbl
-----+-----------------
   3 | 2f82b04f-db13-4
  28 | 49dfea9a-5c41-4
  69 | d2d27343-a8d4-4
 235 | d5f32cd9-9495-4
 389 | 69842fc5-f0e5-4
 512 | e52c1625-0fab-4
 678 | 0fd320a9-8f08-4
 899 | 452a5f95-6e16-4
(8 rows)

(postgres@#######:5432) [tpcc] > select * from public.test_gtt_wo_load_function(:ids);
 id  |       lbl
-----+-----------------
   3 | 2f82b04f-db13-4
   3 | 632e097d-0f80-4
  28 | 49dfea9a-5c41-4
  28 | bf9ed2c4-14d2-4
  69 | d2d27343-a8d4-4
........
 899 | 452a5f95-6e16-4
 899 | d40c915f-04c7-4
(16 rows)

(postgres@#######:5432) [tpcc] > select * from public.test_gtt_wo_load_function(:ids);
 id  |       lbl
-----+-----------------
   3 | 2f82b04f-db13-4
   3 | 632e097d-0f80-4
   3 | 9cd65fda-ab4a-4
  28 | 49dfea9a-5c41-4
  28 | bf9ed2c4-14d2-4
  28 | 6cdd9a22-5c60-4
.........
 899 | 452a5f95-6e16-4
 899 | d40c915f-04c7-4
 899 | 0319d855-a935-4
(24 rows)

(postgres@#######:5432) [tpcc] > \q
shaneborden_google_com@replication-instance-1:/home/shaneborden_google_com $ psql -h ###### -p 5432 -U postgres -d tpcc
Password for user postgres:
(postgres@#######:5432) [tpcc] > \set ids ARRAY[3,28,69,235,389,512,678,899]
(postgres@#######:5432) [tpcc] > select * from public.test_gtt_wo_load_function(:ids);
 id  |       lbl
-----+-----------------
   3 | 82f66c20-f484-4
   3 | 01ad1b4d-2790-4
   3 | a3c38d05-fb7d-4
   3 | 7a4b2ff8-3e32-4
   3 | af8a96e1-194c-4
   3 | 02586796-1cc6-4
  28 | 62b1fb31-d3c0-4
  28 | 3acfe744-574f-4
  28 | 1e2d64fa-1106-4
  28 | d92a9321-4330-4
  28 | f3b1d45e-d4a7-4
  28 | 270146eb-1098-4
  69 | 506ad97c-fdad-4
  69 | bf3340d5-17a2-4
  69 | 47cbd9eb-83da-4
  69 | 487e8b22-4a05-4
  69 | b17314bc-5822-4
.......
 678 | a8d7dc45-c027-4
 678 | 17b666e5-1fc1-4
 678 | 4386eeb4-7b4e-4
 678 | 2942f25f-3d13-4
 678 | 3853bac0-e0ba-4
 678 | 95499bef-4440-4
 899 | 4503351c-4fe4-4
 899 | 900337bf-a658-4
 899 | 5f1f02f8-29a3-4
 899 | d635c921-8f9b-4
 899 | 0dd42f22-0f8e-4
 899 | 16bb8dcf-4e35-4
(48 rows)

Pretty crazy. So if for some reason the LOAD does not occur or fails, there could be an issue where the temp table has duplicate rows.

So what happens if the “pgtt” temp table was never created? You get a failure as you expect:

(postgres@#######:5432) [tpcc] > select * from public.test_gtt_wo_load_function(:ids);
ERROR:  relation "pgtt_schema.test_gtt_table" does not exist
LINE 1: INSERT INTO pgtt_schema.test_gtt_table
                    ^
QUERY:  INSERT INTO pgtt_schema.test_gtt_table
select generate_series,substr(uuid_generate_v4()::text,1,15) FROM generate_series(1, 5000)
CONTEXT:  PL/pgSQL function test_gtt_wo_load_function(integer[]) line 4 at SQL statement
Time: 5.204 ms

So it is important to ensure that if you intend to use “pgtt”, you have a programmatic method to load the extension….. and in the meantime, I will be working on my end to get a cleaner way to load it in the managed products I have access to.

“Row Movement” in PostgreSQL… Is it bad?

In Oracle, right or wrong, I was always taught to avoid “row movement” between partitions: the extra workload of a “delete” + “insert” (a rewrite of the row) brings extra I/O, index fragmentation, and the associated risk of a migrating ROWID in cases where the app developers might have used it in their code (now that’s a whole other problem). Oracle didn’t even let you do it by default.

Table by table, you had to explicitly set:

alter table [table name] enable row movement;

Now, you also had to set this to do table reorganizations such as “alter table…. shrink space / shrink space compact”, so it wasn’t something unheard of. However, when a customer recently explained to me that they were going to partition a PostgreSQL table and update the partition key column from null to the date when the row got processed, my mind immediately went to the thought that’s probably bad……. RIGHT??

Well, once I thought about it, maybe it’s not all that bad due to the way MVCC and the subsequent VACUUM operations work in PostgreSQL. The only thing I could think of that might be a factor is that you would lose any potential benefit of HOT (Heap-Only Tuple) updates, since the row will no longer be part of the original partition, seeing that partitions in PostgreSQL are just another table. The benefit, though, is that I could limit my vacuum operations to one single partition and a SMALLER table. A plus for this customer.

**** Note: An implementation like this does not necessarily follow best practices with regards to partitioning. That being said, I was attempting to validate the idea with regards to how PostgreSQL MVCC behaves.

That being said, I wanted to at least be able to prove / disprove my thoughts with a demonstration, so off to PostgreSQL we go. First let’s create a simple partitioned table and use pg_partman to help:

CREATE TABLE partman_test.partman_partitioned (
	id integer not null, 
	val varchar(20) not null,
	created_tmstp timestamp not null,
	event_tmstp timestamp null) 
PARTITION BY RANGE (event_tmstp);

CREATE INDEX partman_partitioned_ix1 ON partman_test.partman_partitioned (id);

SELECT partman.create_parent( p_parent_table => 'partman_test.partman_partitioned',
 p_control => 'event_tmstp',
 p_type => 'native',
 p_interval=> 'daily',
 p_premake => 3);

Now, let’s insert some random data using a date randomizer function to spread the data across the new partitions:

CREATE OR REPLACE FUNCTION partman_test.random_date(out random_date_entry timestamp) AS $$
select current_timestamp(3) + random() * interval '2 days'
$$ LANGUAGE SQL;

INSERT INTO partman_test.partman_partitioned VALUES ( 
		generate_series(0,10000), 
		substr(md5(random()::text), 0,10),
		partman_test.random_date(),
		NULL);

And then, for demonstration purposes, I will set autovacuum to “off” for all the partitions and run 100 updates to move the data into random partitions using the following statements:

ALTER TABLE partman_test.partman_partitioned_default SET (autovacuum_enabled = false);
ALTER TABLE partman_test.partman_partitioned_p2023_09_05 SET (autovacuum_enabled = false);
ALTER TABLE partman_test.partman_partitioned_p2023_09_06 SET (autovacuum_enabled = false);
ALTER TABLE partman_test.partman_partitioned_p2023_09_07 SET (autovacuum_enabled = false);

do $$
declare
  v_id integer;
begin
	for cnt in 1..100 loop
	  select id 
	  FROM partman_test.partman_partitioned 
	  WHERE event_tmstp is null 
	  LIMIT 1 FOR UPDATE SKIP LOCKED
	  INTO v_id;
	  UPDATE partman_test.partman_partitioned
	    SET event_tmstp = partman_test.random_date()
	    WHERE id = v_id and event_tmstp is null;
	  commit;
	end loop;
end; $$;

Once the updates finish, let’s look at the vacuum stats:
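
The column aliases below match the vacuum projection script from the previous post; a trimmed-down sketch of the underlying pg_stat_all_tables columns looks like this (the autovac_enabled value comes from the reloptions / pg_settings logic in that script):

SELECT
    s.schemaname || '.' || s.relname AS relname,
    n_live_tup          AS live_tup,
    n_dead_tup          AS dead_dup,
    n_tup_hot_upd       AS hot_upd,
    n_mod_since_analyze AS mod_since_stats,
    n_ins_since_vacuum  AS ins_since_vac
FROM pg_stat_all_tables s
WHERE s.relname LIKE 'partman_partitioned%';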

relname                                     |autovac_enabled|live_tup|dead_dup|hot_upd|mod_since_stats|ins_since_vac|
--------------------------------------------+---------------+--------+--------+-------+---------------+-------------+
partman_test.partman_partitioned            |true           |       0|       0|      0|              0|            0|
partman_test.partman_partitioned_default    |false          |       0|     100|      0|            100|            0|
partman_test.partman_partitioned_p2023_09_05|false          |      10|       0|      0|             10|           10|
partman_test.partman_partitioned_p2023_09_06|false          |      52|       0|      0|             52|           52|
partman_test.partman_partitioned_p2023_09_07|false          |      38|       0|      0|             38|           38|

The “pgstattuple” extension confirms that dead tuples only exist in the “default” partition. The reason why the numbers don’t exactly match pg_stat_all_tables is a discussion for another day:
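
The per-partition figures below could be gathered with calls like these (a sketch; it requires CREATE EXTENSION pgstattuple, and the partition names are taken from the pg_partman setup above):

SELECT * FROM pgstattuple('partman_test.partman_partitioned_default');
SELECT * FROM pgstattuple('partman_test.partman_partitioned_p2023_09_05');
SELECT * FROM pgstattuple('partman_test.partman_partitioned_p2023_09_06');
SELECT * FROM pgstattuple('partman_test.partman_partitioned_p2023_09_07');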

table_len|tuple_count|tuple_len|tuple_percent|dead_tuple_count|dead_tuple_len|dead_tuple_percent|free_space|free_percent|
---------+-----------+---------+-------------+----------------+--------------+------------------+----------+------------+
   524288|       9901|   475248|        90.65|              82|          3936|              0.75|      3308|        0.63|
     8192|         10|      560|         6.84|               0|             0|               0.0|      7564|       92.33|
     8192|         52|     2912|        35.55|               0|             0|               0.0|      5044|       61.57|
     8192|         38|     2128|        25.98|               0|             0|               0.0|      5884|       71.83|

So, we definitely proved that we didn’t get the benefit of HOT updates, but due to the MVCC model of PostgreSQL, the update becomes just like any other non-HOT update. This is because the updated row behaves as if an indexed column had been changed (the primary, and fairly common, cause of a non-HOT update), and the rest of the MVCC machinery behaves just as it would anyway. I did want to validate with one more tool, but unfortunately the “pg_walinspect” extension was not installed on this CloudSQL for Postgres instance, so I was unable to use it.

What about locks? We do get additional locks to manage because we are affecting two partitions instead of one (but they are all fastpath locks):

(postgres@10.3.0.31:5432) [tpcc] > select locktype, database, relation::regclass, page, tuple, pid, mode,granted,fastpath, waitstart from pg_locks;
   locktype    | database |              relation              | page | tuple |  pid   |       mode       | granted | fastpath | waitstart
---------------+----------+------------------------------------+------+-------+--------+------------------+---------+----------+-----------
 relation      |    69323 | partman_partitioned_p2023_09_05    | NULL |  NULL | 129825 | RowExclusiveLock | t       | t        | NULL
 relation      |    69323 | partman_partitioned_default_id_idx | NULL |  NULL | 129825 | RowExclusiveLock | t       | t        | NULL
 relation      |    69323 | partman_partitioned_default        | NULL |  NULL | 129825 | RowExclusiveLock | t       | t        | NULL
 relation      |    69323 | partman_partitioned                | NULL |  NULL | 129825 | RowExclusiveLock | t       | t        | NULL

If we were to have no row movement between partitions, there are slightly fewer locks to manage:

(postgres@10.3.0.31:5432) [tpcc] > select locktype, database, relation::regclass, page, tuple, pid, mode,granted,fastpath, waitstart from pg_locks;
   locktype    | database |                relation                | page | tuple |  pid   |       mode       | granted | fastpath | waitstart
---------------+----------+----------------------------------------+------+-------+--------+------------------+---------+----------+-----------
 relation      |    69323 | partman_partitioned_p2023_09_05_id_idx | NULL |  NULL | 129825 | RowExclusiveLock | t       | t        | NULL
 relation      |    69323 | partman_partitioned_p2023_09_05        | NULL |  NULL | 129825 | RowExclusiveLock | t       | t        | NULL
 relation      |    69323 | partman_partitioned                    | NULL |  NULL | 129825 | RowExclusiveLock | t       | t        | NULL

Also, be aware that you may need to pay special attention to the vacuum operations and settings of the default partition, as this type of operation may cause significant bloat there over time. However, one positive is that the bloat will be contained to one and only one partition.

One last caveat that comes to mind: be sure to either specify the partition key or explicitly target the “default” partition in your query, because otherwise you will get a multi-partition scan, which can cause additional performance and locking issues.
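
For illustration, a sketch of the difference (the "status" and "created_at" columns are hypothetical): the first statement includes the partition key so the planner can prune to a single partition, while the second targets the default partition directly.

-- prunes to one partition because the partition key is in the predicate
UPDATE partman_partitioned
   SET status = 'processed'
 WHERE id = 42
   AND created_at >= DATE '2023-09-05'
   AND created_at <  DATE '2023-09-06';

-- or update the default partition explicitly
UPDATE partman_partitioned_default
   SET status = 'processed'
 WHERE id = 42;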

Enjoy!

Using the “hint_plan” Table Provided by the PostgreSQL Extension “pg_hint_plan”

Introduction

For those who have worked with Oracle, the pg_hint_plan extension allows you to hint plans in patterns you are likely very familiar with:

  • sql_patch
  • sql_profile
  • sql_plan_baselines

While the functionality provided by pg_hint_plan is currently not nearly as robust (hints list), it does cover most of what you will encounter day to day as a DBA. That being said, one thing that is currently missing is the ability to easily add hints without changing code via stored procedures / functions, as you can in Oracle. The only way to do this today in open source PostgreSQL is to manually manipulate a table named “hints”, typically located in the “hint_plan” schema.

The “hints” table which is provided by the extension is highly dependent (just like Oracle) on a normalized SQL statement. A normalized SQL statement in PostgreSQL is one that has all carriage returns removed, all spaces converted to single spaces and all literals and parameters replaced with a “?”. Typically you have to do this manually, but in this blog post, I am going to show how I have leveraged entries in “pg_stat_statements” along with custom written functions to normalize the statement and place it into the “hints” table. To use this “hints” table feature, the following setting must be enabled at either the session or system level:

set session pg_hint_plan.enable_hint_table to on;

or in postgresql.conf:

pg_hint_plan.enable_hint_table = on
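
Once enabled (and assuming the extension has been created, so the hint_plan schema exists), you can confirm the setting and take a quick look at the initially empty hints table:

show pg_hint_plan.enable_hint_table;
select * from hint_plan.hints;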

What Does a Normalized Statement Look Like?

Typically, when you receive code from a developer, or even code that you write yourself, you format it to make it human readable and easier to interpret. For example, you might want your statement to look like this (notice the parameters / literals in the statement):

SELECT
    b.bid,
    sum(abalance)
FROM
    pgbench_branches b
    JOIN pgbench_accounts a ON (b.bid = a.bid)
WHERE
    b.bid = 12345
    AND a.aid BETWEEN 100 AND 200
GROUP BY
    b.bid
ORDER BY
    1;

Now to normalize the statement for use with the “hints” table it needs to look like this:

select b.bid, sum(abalance) from pgbench_branches b join pgbench_accounts a on (b.bid = a.bid) where b.bid = ? and a.aid between ? and ? group by b.bid order by 1;

You can either manually manipulate the statement to get it into this format, or you can attempt to do it programmatically. I prefer to let the system format it for me as much as possible, so I have written a few helper functions to do this:

Helper Queries:

**** Feel free to utilize these functions, however they may contain errors or may not normalize all statements. They depend on pg_stat_statements, and if the entire statement does not fit within the query field of that view, these functions will not produce the correct output. I will also place them on my public GitHub. If you find any errors or omissions, please let me know. ****

hint_plan.display_candidate_pg_hint_plan_queries

While you can easily select from the “hints” table on your own, this query will show what a normalized statement will look like before loading it to the table. You can leave the “p_query_id” parameter null to return all queries present in the pg_stat_statements in a normalized form or you can populate it with a valid “query_id” and it will return a single normalized statement:

CREATE OR REPLACE FUNCTION hint_plan.display_candidate_pg_hint_plan_queries(
  p_query_id bigint default null
  )
  RETURNS TABLE(queryid bigint, norm_query_string text)
  LANGUAGE 'plpgsql'
  COST 100
  VOLATILE PARALLEL UNSAFE
AS $BODY$
 DECLARE 
 	pg_stat_statements_exists boolean := false;
 BEGIN
   SELECT EXISTS (
    SELECT FROM 
        information_schema.tables 
    WHERE 
        table_schema LIKE 'public' AND 
        table_type LIKE 'VIEW' AND
        table_name = 'pg_stat_statements'
    ) INTO pg_stat_statements_exists;
   IF pg_stat_statements_exists AND p_query_id is not null THEN
    RETURN QUERY
    SELECT pss.queryid,
           substr(regexp_replace(
             regexp_replace(
                regexp_replace(
                   regexp_replace(
                      regexp_replace(pss.query, '\$\d+', '?', 'g'),
                                E'\r', ' ', 'g'),
                              E'\t', ' ', 'g'),
                           E'\n', ' ', 'g'),
                         '\s+', ' ', 'g') || ';',1,100)
 	FROM pg_stat_statements pss where pss.queryid = p_query_id;
   ELSE
    RETURN QUERY
    SELECT pss.queryid,
           substr(regexp_replace(
             regexp_replace(
                regexp_replace(
                   regexp_replace(
                      regexp_replace(pss.query, '\$\d+', '?', 'g'),
                                E'\r', ' ', 'g'),
                              E'\t', ' ', 'g'),
                           E'\n', ' ', 'g'),
                         '\s+', ' ', 'g') || ';',1,100)
 	FROM pg_stat_statements pss;
   END IF;
 END; 
$BODY$;

If our candidate query was this:

select queryid, query from pg_stat_statements where queryid =  -8949523101378282526;
       queryid        |            query
----------------------+-----------------------------
 -8949523101378282526 | select b.bid, sum(abalance)+
                      | from pgbench_branches b    +
                      | join pgbench_accounts a    +
                      | on (b.bid = a.bid)         +
                      | where b.bid = $1           +
                      | group by b.bid             +
                      | order by 1
(1 row)

The display function would return the following normalized query:

SELECT hint_plan.display_candidate_pg_hint_plan_queries(p_query_id => -8949523101378282526);
-[ RECORD 1 ]--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
display_candidate_pg_hint_plan_queries | (-8949523101378282526,"select b.bid, sum(abalance) from pgbench_branches b join pgbench_accounts a on (b.bid = a.bid) where b.bid = ? group by b.bid order by 1;")

You can then verify that the query is normalized properly and then move on toward using the next function to add the normalized query to the “hints” table.

hint_plan.add_stored_pg_hint_plan

Using the same query in the previous section, we will now add it to the “hints” table. This is where it is important to understand what hint you want to add.

CREATE OR REPLACE FUNCTION hint_plan.add_stored_pg_hint_plan(
  p_query_id bigint,
  p_hint_text text,
  p_application_name text default ''
  )
  RETURNS varchar
  LANGUAGE 'plpgsql'
  COST 100
  VOLATILE PARALLEL UNSAFE
AS $BODY$
-- p_hint_text can contain one or more hints either separated by a space or
-- a carriage return character.  Examples include:
-- Space Separated: SeqScan(a) Parallel(a 0 hard)
-- ASCII CRLF Separated: SeqScan(a)'||chr(10)||'Parallel(a 0 hard)
-- Single Hint: SeqScan(a)
-- 
-- Escaped text does not work: /* E'SeqScan(a)\nParallel(a 0 hard)'
 DECLARE 
 	hint_id hint_plan.hints.id%TYPE;
 	normalized_query_text hint_plan.hints.norm_query_string%TYPE;
 	pg_stat_statements_exists boolean := false;
 BEGIN
   SELECT EXISTS (
    SELECT FROM 
        information_schema.tables 
    WHERE 
        table_schema LIKE 'public' AND 
        table_type LIKE 'VIEW' AND
        table_name = 'pg_stat_statements'
    ) INTO pg_stat_statements_exists;
   IF NOT pg_stat_statements_exists THEN
    RAISE NOTICE 'pg_stat_statements extension has not been loaded, exiting';
    RETURN 'error';
   ELSE
    SELECT regexp_replace(
             regexp_replace(
                regexp_replace(
                   regexp_replace(
                      regexp_replace(query, '\$\d+', '?', 'g'),
                                E'\r', ' ', 'g'),
                              E'\t', ' ', 'g'),
                           E'\n', ' ', 'g'),
                         '\s+', ' ', 'g') || ';'
 	 INTO normalized_query_text
 	 FROM pg_stat_statements where queryid = p_query_id;
     IF normalized_query_text IS NOT NULL THEN
		INSERT INTO hint_plan.hints(norm_query_string, application_name, hints)
    	VALUES (normalized_query_text,
    			p_application_name,
    			p_hint_text
    	);
    	SELECT id into hint_id
    	FROM hint_plan.hints
    	WHERE norm_query_string = normalized_query_text;
 	    RETURN cast(hint_id as text);
     ELSE
 		RAISE NOTICE 'Query ID % does not exist in pg_stat_statements', cast(p_query_id as text);
 		RETURN 'error';
     END IF;
   END IF;
 END; 
$BODY$;

The hint text can contain one or more hints, separated either by a space or by a carriage return character. Examples include:

  • Space Separated: SeqScan(a) Parallel(a 0 hard)
  • ASCII CRLF Separated: SeqScan(a)’||chr(10)||’Parallel(a 0 hard)
  • Single Hint: SeqScan(a)
  • Escaped text does not work in the context of this function although this can be used if you are inserting manually to the “hints” table: E’SeqScan(a)\nParallel(a 0 hard)’

SELECT hint_plan.add_stored_pg_hint_plan(p_query_id => -8949523101378282526,
						p_hint_text => 'SeqScan(a) Parallel(a 0 hard)',
						p_application_name => '');

-[ RECORD 1 ]-----------+---
add_stored_pg_hint_plan | 28

Time: 40.889 ms

select * from hint_plan.hints where id = 28;
-[ RECORD 1 ]-----+------------------------------------------------------------------------------------------------------------------------------------------
id                | 28
norm_query_string | select b.bid, sum(abalance) from pgbench_branches b join pgbench_accounts a on (b.bid = a.bid) where b.bid = ? group by b.bid order by 1;
application_name  |
hints             | SeqScan(a) Parallel(a 0 hard)

In the above example, we are forcing a serial sequential scan of “pgbench_accounts” (aliased as “a” in the query). We left the “application_name” parameter empty so that the hint applies to any calling application.
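
A quick way to check that the stored hint is actually being applied is to enable the hint table (and, optionally, pg_hint_plan’s debug output) and run an EXPLAIN of the original statement. This is just a sketch; the exact plan and log output will vary by environment.

set session pg_hint_plan.enable_hint_table to on;
set session pg_hint_plan.debug_print to on;   -- log which hints were used
set session client_min_messages to log;

explain (costs off)
select b.bid, sum(abalance)
from pgbench_branches b
join pgbench_accounts a on (b.bid = a.bid)
where b.bid = 1
group by b.bid
order by 1;
-- expect a non-parallel Seq Scan on pgbench_accounts in the plan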

hint_plan.delete_stored_pg_hint_plan

You could easily just issue a delete against the “hints” table, but in keeping with the function-based approach, a delete helper has also been developed:

CREATE OR REPLACE FUNCTION hint_plan.delete_stored_pg_hint_plan(
  p_hint_id bigint
  )
  RETURNS TABLE(id integer, norm_query_string text, application_name text, hints text)
  LANGUAGE 'plpgsql'
  COST 100
  VOLATILE PARALLEL UNSAFE
AS $BODY$
 BEGIN
    RETURN QUERY
    DELETE FROM hint_plan.hints h WHERE h.id = p_hint_id RETURNING *;
 END; 
$BODY$;

To delete a stored hint, you can call the function as follows:

 SELECT hint_plan.delete_stored_pg_hint_plan(p_hint_id => 28);
-[ RECORD 1 ]--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
delete_stored_pg_hint_plan | (28,"select b.bid, sum(abalance) from pgbench_branches b join pgbench_accounts a on (b.bid = a.bid) where b.bid = ? group by b.bid order by 1;","","SeqScan(a) Parallel(a 0 hard)")

Time: 33.685 ms

select * from hint_plan.hints where id = 28;
(0 rows)

Time: 24.868 ms

As you can see, the “hints” table is very useful and can help you emulate many parts of SQL Plan Management, just like in Oracle.

Enjoy and all feedback is welcomed!!!

Indexes Can Check-Out Anytime, But Can They Leave?

When I think about indexes in any DBMS, I immediately think about the lyrics in the Eagles song “Hotel California”.

“You can check-out any time you like, but you can never leave…..”

Almost every system that has been around for a while is over-indexed. Every index costs something to maintain, some more than others. Systems and code also change: there could be an index that was once needed, but over time the code shifts to use another, more efficient one instead. Guess what, though, you are still maintaining that old index. That costs CPU and I/O, which translates into TIME.

Thankfully, Oracle 12.2 introduced a feature called INDEX USAGE TRACKING. In older versions this was called INDEX MONITORING, and it was much harder and more invasive to use. In the newer versions it just works, and you can read more about both options here:

https://oracle-base.com/articles/12c/index-usage-tracking-12cr2

By default, index usage is recorded in the views “DBA_INDEX_USAGE” and “V$INDEX_USAGE_INFO”. The “V$” view is flushed to the “DBA” view every 15 minutes, and for most purposes using the “DBA” view is just fine. The thing to remember is that querying only these views tells you the indexes that are being USED, not the ones that are NOT USED. To get the ones that are not used, we just need to expand the query a bit:

SELECT
    di.owner,
    di.index_name,
    di.index_type,
    di.table_name
FROM
    dba_indexes di
WHERE
             di.owner in (select username from dba_users where oracle_maintained = 'N')
    AND di.uniqueness = 'NONUNIQUE'
    AND NOT EXISTS (
        SELECT
            1
        FROM
            dba_index_usage iu
        WHERE
                iu.owner = di.owner
            AND iu.name = di.index_name
    )
GROUP BY
    di.owner,
    di.index_name,
    di.index_type,
    di.table_name
ORDER BY
    di.owner,
    di.table_name,
    di.index_name;

Now that we have the indexes that are not used, which ones have the biggest impact on database operations? We just expand the query again to leverage the “DBA_TAB_MODIFICATIONS” table. This will allow us to see how many operations occur on each table that the index sits on:

SELECT
    di.owner,
    di.index_name,
    di.index_type,
    di.table_name,
    SUM(tm.inserts + tm.updates + tm.deletes) operations
FROM
    dba_indexes di,
    dba_tab_modifications tm
WHERE
             di.owner in (select username from dba_users where oracle_maintained = 'N')
    AND di.owner = tm.table_owner
    AND di.table_name = tm.table_name
    AND di.uniqueness = 'NONUNIQUE'
    AND NOT EXISTS (
        SELECT
            1
        FROM
            dba_index_usage iu
        WHERE
                iu.owner = di.owner
            AND iu.name = di.index_name
    )
GROUP BY
    di.owner,
    di.index_name,
    di.index_type,
    di.table_name
ORDER BY
    SUM(tm.inserts + tm.updates + tm.deletes) ASC NULLS FIRST,
    di.owner,
    di.table_name,
    di.index_name;

Now, I have done an equi-join against DBA_TAB_MODIFICATIONS, so this only shows us unused indexes on tables where the system has recorded operations. If you wanted to show all unused indexes regardless of whether any operations were recorded, just change the join condition against DBA_TAB_MODIFICATIONS to an outer join, as sketched below.
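
A sketch of that variation (the same logic as above, just re-joined with an ANSI outer join so that indexes on tables with no recorded DML still appear, with NULL operations):

SELECT
    di.owner,
    di.index_name,
    di.index_type,
    di.table_name,
    SUM(tm.inserts + tm.updates + tm.deletes) operations
FROM
    dba_indexes di
    LEFT OUTER JOIN dba_tab_modifications tm
        ON  di.owner = tm.table_owner
        AND di.table_name = tm.table_name
WHERE
             di.owner in (select username from dba_users where oracle_maintained = 'N')
    AND di.uniqueness = 'NONUNIQUE'
    AND NOT EXISTS (
        SELECT
            1
        FROM
            dba_index_usage iu
        WHERE
                iu.owner = di.owner
            AND iu.name = di.index_name
    )
GROUP BY
    di.owner,
    di.index_name,
    di.index_type,
    di.table_name
ORDER BY
    SUM(tm.inserts + tm.updates + tm.deletes) ASC NULLS FIRST,
    di.owner,
    di.table_name,
    di.index_name;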

In my case, there were some very large indexes that were unused sitting on tables that had LOTS of operations (names changed to protect the innocent):

OWNER  INDEX NAME  INDEX TYPE             TABLE NAME  OPERATIONS
-----  ----------  ---------------------  ----------  ----------
HR     FK_1_IDX    NORMAL                 EMPLOYEES    295931545
HR1    EMP_TOT     FUNCTION-BASED NORMAL  EMP_HOURS   1374288673
HR2    EMP_TOT     FUNCTION-BASED NORMAL  EMP_HOURS   1425114200
HR3    EMP_TOT     FUNCTION-BASED NORMAL  EMP_HOURS   2487284020

Adding “OPERATIONS” to the query gives us some sense of the “BENEFIT” of dropping the index, and based on the results you can present your case for dropping it. People always love to have the “BENEFIT” quantified, and this is the closest you can get without performing a session trace, which can be invasive and makes it fairly difficult to capture the exact statements you are looking for.

Hopefully this will give you the additional information you need to advocate for dropping that index that is no longer being used.

SQLPlus is Expensive!


I was recently tasked to work on a server that had severe performance issues. The first thing I noticed was that the server was at times running at a load average of 3-4x the number of cores on the host, and at steady state about 1.5x! Further investigation yielded a large quantity of SQLPlus invocations from a set of ETL scripts. Digging even more, I found that the script performed ETL across more than 100 schemas, 75% of which were processed via parallel background processes with a random “sleep” built in.

This is where I noticed something interesting. Instead of coding the script to invoke SQLPlus once, run a bunch of statements and exit, the coder constructed the script to invoke SQLPlus, run a statement, exit, invoke SQLPlus, run a statement, and on and on and on.

My first thought was that this has to be EXPENSIVE right?

When a SQLPlus login is made, a lot of things happen. Simply stated, the OS has to start a database process, and the database has to do a bunch of work to get your session ready (password authentication, memory allocation, updating tables with login details, setting up permissions, and so on). Managing processes is one of the most expensive things an operating system can do.

In terms of the database itself, no less than 19 individual statements are executed in order to prepare your session. Digging into those statements, probably one of the most expensive is:

sql_id:  9zg9qd9bm4spu
update user$ set spare6=DECODE(to_char(:2, 'YYYY-MM-DD'), '0000-00-00', to_date(NULL), :2) where user#=:1
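
If you want to see how hard this statement is being hit on your own system, a quick check along these lines works (a sketch; the sql_id is the one shown above):

select sql_id, executions, round(elapsed_time/1e6, 2) elapsed_seconds
  from v$sqlstats
 where sql_id = '9zg9qd9bm4spu';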


It can be so expensive that Oracle has actually logged a bug (Bug 33121934, Library cache lock / load lock / mutex x during connection storm due to update user$) whose fix lets you stop updating the last login time on every login and instead do it only every X seconds. The explain plan for this statement is actually quite lengthy, and there is a lot more work going on than you might think!

So based on all of that, I headed back to my scripts, only to discover that for every invocation of the script (the script is invoked for each of the 114 schemas) there were a minimum of 11 SQLPlus invocations and a maximum of 27, depending on the schema. Every minute this script ran, there were a total of almost 1400 SQLPlus invocations!!! Definitely a big contributor to the very high load average on the server.
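
One quick way to quantify this connection churn from the database side is to sample the logon statistics a minute apart and compare the values (a simple sketch; requires access to V$SYSSTAT):

select name, value
  from v$sysstat
 where name in ('logons cumulative', 'logons current');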

As with anything in IT, change is tough and there is hardly ever much appetite to change something that “works”. That is why it is important to always quantify the benefits. To do this, I wrote a quick script to show the difference in CPU utilization between a script that makes many SQLPlus invocations and one that makes a single invocation and then calls many SQL scripts.

The resultant stats are as follows. The original script in the test system I used made one SQLPlus invocation for each SQL file that was called; it executed SQLPlus 61 times and yielded the following system usage:

real     0m0.876s
user     0m0.480s
sys      0m0.463s

The modified script that reduced the number of calls from 61 to 14 (77%) yielded the following system usage:

real     0m0.188s
user     0m0.117s
sys      0m0.085s

*** NOTE: The test system I was working on had no data to manipulate for ETL so most of the work was the SQLPlus call. Which is what I wanted to demonstrate anyway!

As you can see, reducing the number of SQLPlus calls by 77% reduced system usage by:

  • real: 79%
  • user: 75%
  • sys: 82%

So with a small change to group sql statements into one SQLPlus call, we are able to drastically improve system stability and reduce resource utilization. A WIN in my book! Once this gets through the test cycles, I will follow up with a post on the improvement seen in production.
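
For illustration, a minimal sketch of the consolidation (script and connection names are hypothetical): instead of the shell loop invoking SQLPlus once per SQL file, a single driver script runs every step inside one SQLPlus session.

-- driver.sql (hypothetical): every ETL step runs inside ONE SQL*Plus session,
-- invoked once from the shell, e.g.:  sqlplus -s etl_user@etldb @driver.sql
WHENEVER SQLERROR EXIT SQL.SQLCODE
@etl_step_1.sql
@etl_step_2.sql
@etl_step_3.sql
EXIT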

Remember the Recyclebin

When poking around a system, make sure to look at the dba_recyclebin. Oftentimes, this functionality within Oracle seems to be overlooked and forgotten about, in part because the parameter “recyclebin” defaults to “on” in Oracle 19c, so it quietly accumulates dropped objects. You can also manage this parameter at the session level.

https://docs.oracle.com/en/database/oracle/oracle-database/19/refrn/RECYCLEBIN.html#GUID-B6F1AA5B-DF1E-4EAF-BDFD-E70EFBAD997A

Should a table be dropped by mistake, it can often be recovered, so long as the column “CAN_UNDROP” for that object shows the value “YES” in dba_recyclebin, by issuing the following commands:

select owner, object_name, original_name, can_undrop from dba_recyclebin;
flashback table hr.emp to before drop;

It’s good functionality. The recyclebin also helps when your application must drop and recreate objects on a frequent basis, because the drop is much faster: the command does not have to wait for the segments to be released.

The area that most people forget about is “purging” the recyclebin. As a matter of fact, I recently saw a system that had 5M objects in the recyclebin totaling several hundred GB of space, most of it several years old. That’s several hundred GB of space being backed up every day for objects which have been dropped!
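
For a one-off cleanup, the recyclebin can be purged manually (the database-wide form requires the appropriate privilege):

purge recyclebin;        -- current user's recyclebin only
purge dba_recyclebin;    -- entire database, as a privileged user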

Generally, purging the recyclebin is a manual operation. However, there is a way to implement a dbms_scheduler job which will do this on a regularly scheduled basis.

BEGIN
  DBMS_SCHEDULER.CREATE_JOB (
	job_name => 'WEEKLY_PURGE_RECYCLEBIN',
	job_type => 'PLSQL_BLOCK',
	job_action => 'DECLARE
		V_PRE_RECYCLEBIN_COUNT PLS_INTEGER := 0;
		V_PRE_RECYCLEBIN_SIZE NUMBER(38) :=0;
		V_POST_RECYCLEBIN_COUNT PLS_INTEGER := 0;
        V_SQL_STATEMENT VARCHAR2(500);
        V_DAYS_TO_PURGE PLS_INTEGER := 7;
        V_STALE_DICT_STATS PLS_INTEGER := 0;
        err_code NUMBER;
        err_msg VARCHAR2(500);
        CURSOR purge_bin_cursor IS select owner, object_name, original_name from dba_recyclebin where type=''TABLE'' and to_date(droptime,''YYYY-MM-DD:HH24:MI:SS'') < sysdate - V_DAYS_TO_PURGE;
		BEGIN
			select round(nvl(sum(space * 8192),0)/1024/1024,2) into V_PRE_RECYCLEBIN_SIZE from dba_recyclebin;
			select count(*) into V_PRE_RECYCLEBIN_COUNT from dba_recyclebin where to_date(droptime,''YYYY-MM-DD:HH24:MI:SS'') < sysdate - V_DAYS_TO_PURGE;
			IF V_PRE_RECYCLEBIN_COUNT > 0
			THEN
				DBMS_OUTPUT.PUT_LINE(''Purging ''||V_PRE_RECYCLEBIN_COUNT||'' objects, reclaiming ''||V_PRE_RECYCLEBIN_SIZE||'' MB from the recyclebin'');
				FOR purge_rec in purge_bin_cursor
				LOOP
                	V_SQL_STATEMENT := ''purge table ''||purge_rec.owner||''."''||purge_rec.object_name||''"'';
                	DBMS_OUTPUT.PUT_LINE(''Executing: ''||V_SQL_STATEMENT||'' for original object name ''||purge_rec.original_name);
					EXECUTE IMMEDIATE V_SQL_STATEMENT;
				END LOOP;
				select count(*) into V_POST_RECYCLEBIN_COUNT from dba_recyclebin;
				DBMS_OUTPUT.PUT_LINE(''Recyclebin has ''||V_POST_RECYCLEBIN_COUNT||'' objects remaining in the recyclebin'');
				select count(*) into V_STALE_DICT_STATS from dba_tab_statistics where owner = ''SYS'' AND TABLE_NAME = ''OBJ$'' and stale_stats = ''YES'';
				IF V_STALE_DICT_STATS = 1
				THEN
					DBMS_OUTPUT.PUT_LINE(''Executing Dictionary Statistics'');
					DBMS_STATS.GATHER_DICTIONARY_STATS;
				ELSE
					DBMS_OUTPUT.PUT_LINE(''Dictionary Statistics not stale'');
				END IF;
			ELSE
				DBMS_OUTPUT.PUT_LINE(''Nothing to purge from the recyclebin'');
			END IF;
		EXCEPTION
			WHEN OTHERS THEN
		    err_code := SQLCODE;
      		err_msg := SUBSTR(SQLERRM, 1, 500);
      		DBMS_OUTPUT.PUT_LINE(''An error was encountered - ''||SQLCODE||'' -ERROR- ''||SQLERRM);
		END;',
	start_date => sysdate,
	repeat_interval => 'FREQ=MONTHLY;BYDAY=1SAT;BYHOUR=18',
	end_date => NULL,
	enabled => TRUE,
	auto_drop => FALSE,
	comments => 'This job purges the recyclebin on the first Saturday of the Month at 18:00'
);
END;
/

As implemented above, this job runs on the first Saturday of every month and purges everything in the recyclebin that was dropped more than 7 days before the run. No longer will you have to worry about the recyclebin accumulating too many un-purged objects!

Enjoy!

Improper Use of the Oracle ‘Rownum’ Pseudocolumn

The other day I found myself needing to explain to some developers why their use-case of the Oracle ‘rownum’ pseudocolumn was yielding a result in one database instance, but a completely different result in another.

In this situation, the correct result is the ‘maximum’ value of the column; however, this query was also occasionally returning the exact ‘minimum’ value of that column. How could this happen? The answer lies in using the ‘rownum’ pseudocolumn correctly. Of course there are other (probably better) ways to write this query without ‘rownum’, but I’m not here to debate that right now….

** Note the tables in the query have been changed to protect the innocent.

select column_a from (select column_a,rownum rowid0 from schema.table order by column_a desc ) aa where aa.rowid0 =1;

The Oracle documentation states that which result you get depends on how Oracle accesses the rows in the query. Your results can vary based on a number of factors (e.g., the order in which the data was inserted into the table, or whether there is an index on the table and how that index is used). For further information, see the documentation here:

https://docs.oracle.com/cd/B28359_01/server.111/b28286/pseudocolumns009.htm#SQLRF00255

For further explanation, let’s explore the explain plans encountered in each system:

Correct Result:


--------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name                          | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                               |       |       |     1 (100)|          |
|*  1 |  VIEW                        |                               |  1257 | 32682 |     1   (0)| 00:00:01 |
|   2 |   COUNT                      |                               |       |       |            |          |
|   3 |    INDEX FULL SCAN DESCENDING| SCHEMA_TABLE_PK               |  1257 |  6285 |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("AA"."ROWID0"=1)


22 rows selected.

Incorrect Result:


---------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name                          | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |                               |       |       |     4 (100)|          |
|*  1 |  VIEW                   |                               |  1257 | 32682 |     4  (25)| 00:00:01 |
|   2 |   SORT ORDER BY         |                               |  1257 |  6285 |     4  (25)| 00:00:01 |
|   3 |    COUNT                |                               |       |       |            |          |
|   4 |     INDEX FAST FULL SCAN| SCHEMA_TABLE_PK               |  1257 |  6285 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("AA"."ROWID0"=1)


24 rows selected.

As you can see, the major difference here is that the two systems have not chosen the same access path in which to return the data. In one system a plan utilized an ‘INDEX FULL SCAN DESCENDING’ access path, while the other utilized an ‘INDEX FAST FULL SCAN’ access path.

Is this really that different? Turns out it is.

Ask Tom explained the reason why very concisely:
(Ask TOM “Difference between Full Index Scans and Fast Full Index Scans”)

They state that:

“An index fast full scan reads the ENTIRE index, unsorted, as it exists on disk. It is basically using the index as a “skinny” version of the table. The query in question would only be accessing attributes in the index (we are not using the index as a way to get to the table, we are using the index INSTEAD of the table) We use multiblock IO and read all of the leaf, branch and the root block. We ignore the branch and root blocks and just process the (unordered) data on the leaf blocks.

An index full scan is when we read the index a block at a time – from start to finish. We’ll read the root block, navigate down the left hand side of the index (or right if we are doing a descending full scan) and then when we hit the leaf block – we’ll read across the entire bottom of the index – a block at a time – in sorted order. We use single block IO, not multiblock IO for this operation.”

Well, there you have it, and this is why the result is different. How can we keep this from occurring in the future? The answer is to utilize the ‘rownum’ pseudocolumn correctly. Remember, rownum is not a real column, so in order to get the right results it needs to be applied after the data is in the sorted order that you want. To do that, make sure you write the query so that ‘rownum’ is applied after the sort. Using the same query above, let’s ‘rewrite’ it in such a way that it will achieve the desired results:

select column_a from (select column_a,rownum from (select column_a from schema.table order by column_a desc)) where rownum = 1;

See the steps now?

  1. Retrieve data in sorted order
  2. Apply the ‘rownum’ pseudocolumn
  3. Filter for the desired value in the list

If you must use the ‘rownum’ pseudocolumn, writing your query in this manner will ensure that you always get the same result.

Enjoy!