Bloat that won't come back: prevention, pg_repack, and PostgreSQL 19's REPACK
Bloat that won't come back: prevention, pg_repack, and PostgreSQL 19's REPACK
You delete 60% of an old audit table and the disk usage doesn't budge. Autovacuum is running on schedule. Read traffic hasn't changed. There are no new writes filling the holes. A week later, the table is the same size. A month later, it's larger.
Two months in, the nightly backup of that schema is 40% bigger than last quarter. The restore drill misses its recovery objective. The read replica falls behind during business hours, and the storage line on the cloud bill keeps drifting up without any traffic to explain it. None of this is broken — this is exactly what PostgreSQL is supposed to do. But the longer it goes uncorrected, the more it costs you, and the dashboards most teams watch will not tell you which tables are causing it.
The pattern: irreversible bloat
Some bloat is self-healing. Some isn't. The difference is whether the empty space gets reused.
Picture two tables on the same database:
- Table A is a transactional ledger. Twenty thousand rows in per hour, ten thousand updates, two thousand deletes. Bloat exists at any moment, but autovacuum keeps it bounded and the new writes fill the holes. The table breathes.
- Table B is an audit log. It accumulated for three years. Last month, retention rules required deleting everything older than 18 months — about 60% of rows. Almost nothing new comes in. The pages are now mostly empty, and they will stay mostly empty.
Bloat in PostgreSQL is not one phenomenon — it is two. Reusable bloat is what Table A produces: dead tuples appear, autovacuum marks them reusable, the next write fills the slot. Irreversible bloat is what Table B produces: dead tuples accumulate in regions of a table that no longer take writes, and nothing in PostgreSQL returns that disk to the OS on its own.
The second shape is the one that deserves intervention. Table A
doesn't need pg_repack. Table B does — but ideally, Table B
shouldn't have existed in that shape in the first place.
That's the through-line of this post: predictable table structure first, reclamation tools when prevention failed.
Why monitoring misses this
Standard monitoring shows disk usage and autovacuum activity. Neither distinguishes the two shapes. A table at constant size could be breathing healthily or permanently hemorrhaging space — they look identical on a graph. Autovacuum running every thirty seconds tells you the system is doing its job; it tells you nothing about whether the work is recoverable. The metrics that matter — dead-tuple ratio in regions that no longer take writes, free-space distribution across pages, last-write age per region — are rarely in the dashboards most teams inherit.
How Arq sees this
This is the distinction Arq is built to make. Reusable bloat on healthy tables is tracked but not flagged. Irreversible bloat — dead tuples accumulating in regions of a table that no longer take writes — surfaces as a finding, with the table, the affected disk, and the reason the space won't come back on its own. The dashboard tells you what is happening. The finding tells you whether it's the kind you need to act on.
Prevention: partitioning instead of deleting
If you know in advance that data will age out — logs, events, metrics, time-series anything — partition by the same boundary you'll use to expire it. Then "delete the old data" becomes "drop the partition", which is metadata-only, returns the disk to the OS immediately, and never produces dead tuples.
CREATE TABLE events (
id bigint GENERATED ALWAYS AS IDENTITY,
occurred_at timestamptz NOT NULL,
payload jsonb NOT NULL
) PARTITION BY RANGE (occurred_at);
CREATE TABLE events_2026_01 PARTITION OF events
FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
-- ... one per month, automated by pg_partman or your own job
Eighteen months later, expiring the oldest month is a single command:
DROP TABLE events_2026_01;
There is no DELETE, no autovacuum work, no bloat. The table that
used to be the textbook bloat-magnet is now self-cleaning.
If you can't partition (the boundary doesn't fit your data, you have
foreign keys pointing in, you're stuck on managed Postgres without
declarative partitioning support), you can still get most of the
benefit with DELETE in chunks plus VACUUM between batches — but
you've already accepted you'll need to think about reclamation later.
Prevention: fillfactor for update-heavy tables
Even on tables that aren't delete-heavy, updates can produce slow
bloat that surprises people. PostgreSQL's MVCC model never updates a
row in place: an UPDATE writes a new row version and marks the old
one dead. If the new version fits on the same page as the old one,
PostgreSQL can do a HOT update — Heap-Only Tuple — which avoids
having to update every index that points at the row. If the new
version doesn't fit on the page, PostgreSQL puts it on a different
page, and now every index needs a fresh entry. Inserts and updates on
that table get slower, indexes grow, and dead tuples accumulate
unevenly.
The fillfactor storage parameter controls how full PostgreSQL fills
each page when writing it the first time. The default is 100 — pack
pages to the brim. Excellent for read-mostly tables; hostile for
update-heavy ones, because every update has to spill to a new page.
For tables that get updated frequently, lowering fillfactor leaves
slack on each page so updates stay HOT, indexes stay smaller, and
bloat stays predictable. 80 is a reasonable starting point for an
update-heavy table, but the right setting depends on the table's
shape, the columns being updated, the indexes on them, and how
read-heavy versus write-heavy the workload is. A table where every
update touches a wide TEXT column may need 70 or even 60 to stay HOT;
a table whose updates only flip a small integer flag may be fine at
90.
ALTER TABLE accounts SET (fillfactor = 80);
-- existing pages are unaffected; new pages will be written at 80% full
VACUUM FULL accounts; -- or pg_repack, see below — to apply to existing data
The honest way to tune fillfactor is to measure. The HOT-update
ratio in pg_stat_user_tables tells you whether your updates are
landing in-page or spilling:
SELECT relname,
n_tup_upd,
n_tup_hot_upd,
round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
WHERE n_tup_upd > 0
ORDER BY n_tup_upd DESC;
If a frequently-updated table is consistently sitting at a low
HOT-update percentage, lower its fillfactor, repack so existing
pages pick up the new setting, and check the numbers again a week
later. There's no universal best value — only a value that makes
your table predictable.
This isn't free, either: you trade some disk space upfront for better update behaviour. Worth it on hot transactional tables. Wrong on archive tables you only ever read.
What this costs you
The technical bill turns into four operational ones, and they compound.
- Storage cost. On managed Postgres — RDS, Cloud SQL, Aiven, Heroku, Supabase, Neon at the provisioned tier — you pay for allocated disk, not for used disk. A 40% bloated 200 GB table is 80 GB you bought and aren't using, every month, until you reclaim it. The bill does not reset because autovacuum ran.
- IO amplification. A bloated table has the same row count and more pages. Sequential scans read more bytes for the same answer. Index scans visit more disk because the index bloats alongside the heap. WAL volume goes up because more pages get touched. Read latency drifts upward without any application change to explain it.
- Backups and restore. Bigger heap, bigger
pg_dump, bigger physical backups. More consequentially, bigger restores. A 90-minute RTO that comfortably covered last quarter's database may not cover this quarter's bloated one — and you usually find out during a real incident, not a drill. - Replication and failover. Larger WAL means slower catch-up on read replicas, especially after a network blip. Larger base backups mean longer time-to-warm for a fresh standby. Failover windows stretch, often by minutes.
None of this surfaces as a single visible failure. It surfaces as the cost and risk profile drifting in the wrong direction across several dimensions at once.
When you still need to reclaim space
You inherited a table. It's already bloated. Prevention is somebody else's problem, in the past. What now?
PostgreSQL ships with three options before extensions enter the picture, and it helps to be precise about what each one does:
| Command | Lock | Returns disk to OS | Notes |
|---|---|---|---|
VACUUM |
minor | No | Marks dead tuples reusable. Internal-only. |
VACUUM FULL |
ACCESS EXCLUSIVE |
Yes | Rewrites the table. Blocks all reads and writes. |
CLUSTER |
ACCESS EXCLUSIVE |
Yes | Rewrites the table physically ordered by an index. Blocks all reads and writes. |
For an online system, "blocks all reads and writes" is a non-starter
on any table large enough to need reclaiming. That gap is what
pg_repack filled.
pg_repack: what we've leaned on for ~15 years (PG 18 and earlier)
pg_repack is an extension that does what VACUUM FULL does —
fully rewrite the table to a new file, then swap — but online.
While the rewrite runs, the table stays readable and writable.
Mechanism: pg_repack copies the table chunk by chunk while a
trigger captures concurrent writes into a log table, replays those
changes against the copy, and takes a brief ACCESS EXCLUSIVE lock
to swap names. The heavy work doesn't block; only the swap does.
# repack one table
pg_repack -d mydb -t public.events
# repack everything in the database
pg_repack -d mydb --all
# repack and physically reorder by an index (online CLUSTER)
pg_repack -d mydb -t public.events --order-by=occurred_at
What bites in practice:
- Disk space. Roughly table + indexes worth of free disk during the rewrite. At 90% full, plan first.
- Primary key required. No PK or replica identity, no online
repack — same constraint as
REPACK CONCURRENTLYin PG 19. - Replication. Streaming replicas are fine; logical replication subscribers interact awkwardly with the swap, so check the version notes on both publisher and subscriber.
What changes in PostgreSQL 19: the REPACK SQL statement
PostgreSQL 19 (currently in development; see the devel
docs) adds a
native REPACK SQL statement. Same mechanism as pg_repack,
delivered as a first-class SQL command rather than an extension.
REPACK [ ( option [, ...] ) ] [ table_name [ USING INDEX [ index_name ] ] ]
Options:
CONCURRENTLY— online operation, brief swap lock only. Uses a temporary replication slot, capped bymax_repack_replication_slots.USING INDEX index_name— reorder rows by the index (the CLUSTER-equivalent).VERBOSE— INFO-level progress reporting.ANALYZE— runANALYZEafterwards.
Without a table name, REPACK walks every table where the caller
has the new MAINTAIN privilege. Progress shows up in
pg_stat_progress_repack alongside VACUUM and CLUSTER.
Same task, both worlds
| Goal | PG ≤ 18 (pg_repack extension) |
PG 19+ (built-in) |
|---|---|---|
| Repack one table online | pg_repack -d db -t public.events |
REPACK (CONCURRENTLY) public.events; |
| Reorder by an index online | pg_repack -d db -t public.events --order-by=occurred_at |
REPACK events USING INDEX events_occurred_at_idx; |
| Repack everything | pg_repack -d db --all |
REPACK; |
| Skip a table when repacking everything | pg_repack -d db --all -T public.huge_archive |
(no built-in equivalent yet — script around it) |
| Show progress | --verbose flag |
(VERBOSE) option or query pg_stat_progress_repack |
| Update statistics afterwards | --no-analyze to skip; otherwise it analyses |
(ANALYZE) option, or run ANALYZE separately |
| Required privilege | Owner or superuser | MAINTAIN (or owner) |
| Required setup | CREATE EXTENSION pg_repack; per database, plus matching client binary |
None — built in |
The mental model is unchanged. The ergonomics improve. You stop managing an extension's lifecycle alongside your server's.
What stays the same, what to watch
The constraints carry over. Online repack still requires a primary
key or replica identity. The rewrite still needs roughly
table-plus-indexes worth of free disk; if it runs out, the operation
fails, not your database. The brief ACCESS EXCLUSIVE lock at the
swap is still a tiny outage. The new note: REPACK CONCURRENTLY
consumes a replication slot, so max_repack_replication_slots
needs capacity in clusters already near their slot cap for logical
replication.
Bottom line
PostgreSQL doesn't free space automatically. It reuses space — until it can't.
If you can't predict your storage behaviour, you don't control your system.
Bloat is not an operational issue. It's a design decision that
shows up later — sometimes years later, sometimes during a recovery
drill. Partition tables you'll expire. Lower fillfactor on tables
you'll update heavily. Repack only when prevention failed. On
PostgreSQL 18 and earlier, that means pg_repack. From 19 forward,
it means REPACK. The mechanics carry over; only the typing gets
shorter.