Bloat that won't come back: prevention, pg_repack, and PostgreSQL 19's REPACK

Frank Heikens·
postgresqlOperations

Bloat that won't come back: prevention, pg_repack, and PostgreSQL 19's REPACK

You delete 60% of an old audit table and the disk usage doesn't budge. Autovacuum is running on schedule. Read traffic hasn't changed. There are no new writes filling the holes. A week later, the table is the same size. A month later, it's larger.

Two months in, the nightly backup of that schema is 40% bigger than last quarter. The restore drill misses its recovery objective. The read replica falls behind during business hours, and the storage line on the cloud bill keeps drifting up without any traffic to explain it. None of this is broken — this is exactly what PostgreSQL is supposed to do. But the longer it goes uncorrected, the more it costs you, and the dashboards most teams watch will not tell you which tables are causing it.

The pattern: irreversible bloat

Some bloat is self-healing. Some isn't. The difference is whether the empty space gets reused.

Picture two tables on the same database:

  • Table A is a transactional ledger. Twenty thousand rows in per hour, ten thousand updates, two thousand deletes. Bloat exists at any moment, but autovacuum keeps it bounded and the new writes fill the holes. The table breathes.
  • Table B is an audit log. It accumulated for three years. Last month, retention rules required deleting everything older than 18 months — about 60% of rows. Almost nothing new comes in. The pages are now mostly empty, and they will stay mostly empty.

Bloat in PostgreSQL is not one phenomenon — it is two. Reusable bloat is what Table A produces: dead tuples appear, autovacuum marks them reusable, the next write fills the slot. Irreversible bloat is what Table B produces: dead tuples accumulate in regions of a table that no longer take writes, and nothing in PostgreSQL returns that disk to the OS on its own.

The second shape is the one that deserves intervention. Table A doesn't need pg_repack. Table B does — but ideally, Table B shouldn't have existed in that shape in the first place.

That's the through-line of this post: predictable table structure first, reclamation tools when prevention failed.

Why monitoring misses this

Standard monitoring shows disk usage and autovacuum activity. Neither distinguishes the two shapes. A table at constant size could be breathing healthily or permanently hemorrhaging space — they look identical on a graph. Autovacuum running every thirty seconds tells you the system is doing its job; it tells you nothing about whether the work is recoverable. The metrics that matter — dead-tuple ratio in regions that no longer take writes, free-space distribution across pages, last-write age per region — are rarely in the dashboards most teams inherit.

How Arq sees this

This is the distinction Arq is built to make. Reusable bloat on healthy tables is tracked but not flagged. Irreversible bloat — dead tuples accumulating in regions of a table that no longer take writes — surfaces as a finding, with the table, the affected disk, and the reason the space won't come back on its own. The dashboard tells you what is happening. The finding tells you whether it's the kind you need to act on.

Prevention: partitioning instead of deleting

If you know in advance that data will age out — logs, events, metrics, time-series anything — partition by the same boundary you'll use to expire it. Then "delete the old data" becomes "drop the partition", which is metadata-only, returns the disk to the OS immediately, and never produces dead tuples.

CREATE TABLE events (
    id          bigint GENERATED ALWAYS AS IDENTITY,
    occurred_at timestamptz NOT NULL,
    payload     jsonb NOT NULL
) PARTITION BY RANGE (occurred_at);

CREATE TABLE events_2026_01 PARTITION OF events
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
-- ... one per month, automated by pg_partman or your own job

Eighteen months later, expiring the oldest month is a single command:

DROP TABLE events_2026_01;

There is no DELETE, no autovacuum work, no bloat. The table that used to be the textbook bloat-magnet is now self-cleaning.

If you can't partition (the boundary doesn't fit your data, you have foreign keys pointing in, you're stuck on managed Postgres without declarative partitioning support), you can still get most of the benefit with DELETE in chunks plus VACUUM between batches — but you've already accepted you'll need to think about reclamation later.

Prevention: fillfactor for update-heavy tables

Even on tables that aren't delete-heavy, updates can produce slow bloat that surprises people. PostgreSQL's MVCC model never updates a row in place: an UPDATE writes a new row version and marks the old one dead. If the new version fits on the same page as the old one, PostgreSQL can do a HOT update — Heap-Only Tuple — which avoids having to update every index that points at the row. If the new version doesn't fit on the page, PostgreSQL puts it on a different page, and now every index needs a fresh entry. Inserts and updates on that table get slower, indexes grow, and dead tuples accumulate unevenly.

The fillfactor storage parameter controls how full PostgreSQL fills each page when writing it the first time. The default is 100 — pack pages to the brim. Excellent for read-mostly tables; hostile for update-heavy ones, because every update has to spill to a new page.

For tables that get updated frequently, lowering fillfactor leaves slack on each page so updates stay HOT, indexes stay smaller, and bloat stays predictable. 80 is a reasonable starting point for an update-heavy table, but the right setting depends on the table's shape, the columns being updated, the indexes on them, and how read-heavy versus write-heavy the workload is. A table where every update touches a wide TEXT column may need 70 or even 60 to stay HOT; a table whose updates only flip a small integer flag may be fine at 90.

ALTER TABLE accounts SET (fillfactor = 80);
-- existing pages are unaffected; new pages will be written at 80% full
VACUUM FULL accounts;  -- or pg_repack, see below — to apply to existing data

The honest way to tune fillfactor is to measure. The HOT-update ratio in pg_stat_user_tables tells you whether your updates are landing in-page or spilling:

SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_pct
FROM   pg_stat_user_tables
WHERE  n_tup_upd > 0
ORDER  BY n_tup_upd DESC;

If a frequently-updated table is consistently sitting at a low HOT-update percentage, lower its fillfactor, repack so existing pages pick up the new setting, and check the numbers again a week later. There's no universal best value — only a value that makes your table predictable.

This isn't free, either: you trade some disk space upfront for better update behaviour. Worth it on hot transactional tables. Wrong on archive tables you only ever read.

What this costs you

The technical bill turns into four operational ones, and they compound.

  • Storage cost. On managed Postgres — RDS, Cloud SQL, Aiven, Heroku, Supabase, Neon at the provisioned tier — you pay for allocated disk, not for used disk. A 40% bloated 200 GB table is 80 GB you bought and aren't using, every month, until you reclaim it. The bill does not reset because autovacuum ran.
  • IO amplification. A bloated table has the same row count and more pages. Sequential scans read more bytes for the same answer. Index scans visit more disk because the index bloats alongside the heap. WAL volume goes up because more pages get touched. Read latency drifts upward without any application change to explain it.
  • Backups and restore. Bigger heap, bigger pg_dump, bigger physical backups. More consequentially, bigger restores. A 90-minute RTO that comfortably covered last quarter's database may not cover this quarter's bloated one — and you usually find out during a real incident, not a drill.
  • Replication and failover. Larger WAL means slower catch-up on read replicas, especially after a network blip. Larger base backups mean longer time-to-warm for a fresh standby. Failover windows stretch, often by minutes.

None of this surfaces as a single visible failure. It surfaces as the cost and risk profile drifting in the wrong direction across several dimensions at once.

When you still need to reclaim space

You inherited a table. It's already bloated. Prevention is somebody else's problem, in the past. What now?

PostgreSQL ships with three options before extensions enter the picture, and it helps to be precise about what each one does:

Command Lock Returns disk to OS Notes
VACUUM minor No Marks dead tuples reusable. Internal-only.
VACUUM FULL ACCESS EXCLUSIVE Yes Rewrites the table. Blocks all reads and writes.
CLUSTER ACCESS EXCLUSIVE Yes Rewrites the table physically ordered by an index. Blocks all reads and writes.

For an online system, "blocks all reads and writes" is a non-starter on any table large enough to need reclaiming. That gap is what pg_repack filled.

pg_repack: what we've leaned on for ~15 years (PG 18 and earlier)

pg_repack is an extension that does what VACUUM FULL does — fully rewrite the table to a new file, then swap — but online. While the rewrite runs, the table stays readable and writable.

Mechanism: pg_repack copies the table chunk by chunk while a trigger captures concurrent writes into a log table, replays those changes against the copy, and takes a brief ACCESS EXCLUSIVE lock to swap names. The heavy work doesn't block; only the swap does.

# repack one table
pg_repack -d mydb -t public.events

# repack everything in the database
pg_repack -d mydb --all

# repack and physically reorder by an index (online CLUSTER)
pg_repack -d mydb -t public.events --order-by=occurred_at

What bites in practice:

  • Disk space. Roughly table + indexes worth of free disk during the rewrite. At 90% full, plan first.
  • Primary key required. No PK or replica identity, no online repack — same constraint as REPACK CONCURRENTLY in PG 19.
  • Replication. Streaming replicas are fine; logical replication subscribers interact awkwardly with the swap, so check the version notes on both publisher and subscriber.

What changes in PostgreSQL 19: the REPACK SQL statement

PostgreSQL 19 (currently in development; see the devel docs) adds a native REPACK SQL statement. Same mechanism as pg_repack, delivered as a first-class SQL command rather than an extension.

REPACK [ ( option [, ...] ) ] [ table_name [ USING INDEX [ index_name ] ] ]

Options:

  • CONCURRENTLY — online operation, brief swap lock only. Uses a temporary replication slot, capped by max_repack_replication_slots.
  • USING INDEX index_name — reorder rows by the index (the CLUSTER-equivalent).
  • VERBOSE — INFO-level progress reporting.
  • ANALYZE — run ANALYZE afterwards.

Without a table name, REPACK walks every table where the caller has the new MAINTAIN privilege. Progress shows up in pg_stat_progress_repack alongside VACUUM and CLUSTER.

Same task, both worlds

Goal PG ≤ 18 (pg_repack extension) PG 19+ (built-in)
Repack one table online pg_repack -d db -t public.events REPACK (CONCURRENTLY) public.events;
Reorder by an index online pg_repack -d db -t public.events --order-by=occurred_at REPACK events USING INDEX events_occurred_at_idx;
Repack everything pg_repack -d db --all REPACK;
Skip a table when repacking everything pg_repack -d db --all -T public.huge_archive (no built-in equivalent yet — script around it)
Show progress --verbose flag (VERBOSE) option or query pg_stat_progress_repack
Update statistics afterwards --no-analyze to skip; otherwise it analyses (ANALYZE) option, or run ANALYZE separately
Required privilege Owner or superuser MAINTAIN (or owner)
Required setup CREATE EXTENSION pg_repack; per database, plus matching client binary None — built in

The mental model is unchanged. The ergonomics improve. You stop managing an extension's lifecycle alongside your server's.

What stays the same, what to watch

The constraints carry over. Online repack still requires a primary key or replica identity. The rewrite still needs roughly table-plus-indexes worth of free disk; if it runs out, the operation fails, not your database. The brief ACCESS EXCLUSIVE lock at the swap is still a tiny outage. The new note: REPACK CONCURRENTLY consumes a replication slot, so max_repack_replication_slots needs capacity in clusters already near their slot cap for logical replication.

Bottom line

PostgreSQL doesn't free space automatically. It reuses space — until it can't.

If you can't predict your storage behaviour, you don't control your system.

Bloat is not an operational issue. It's a design decision that shows up later — sometimes years later, sometimes during a recovery drill. Partition tables you'll expire. Lower fillfactor on tables you'll update heavily. Repack only when prevention failed. On PostgreSQL 18 and earlier, that means pg_repack. From 19 forward, it means REPACK. The mechanics carry over; only the typing gets shorter.

Reactions