The challenge
An analytics platform was ingesting product events from a growing customer base into a single "raw_events" table. The original schema was correct but slow: every event hit a logged table with seven indexes, the WAL was the bottleneck, and ingest throughput had topped out around 8,000 events per second. RDS IOPS were the dominant cost line, and a planned 3x customer expansion was going to break the pipeline before the next contract was signed.
The team had read the usual advice — "use a queue," "shard the table," "move to a dedicated TSDB" — but those answers all meant a multi-quarter migration. The real question was: can PostgreSQL keep up if we stop fighting it? The catch was a hard requirement that anything that reached the audited reporting tables had to be durable, consistent, and indexed. We couldn't sacrifice integrity on the analytics side to save the ingest side.
Our solution
We split the pipeline into two clearly separated stages with two different durability contracts.
Stage 1 (ingest): an UNLOGGED staging table that takes the firehose. UNLOGGED tables in PostgreSQL bypass the write-ahead log entirely (and are truncated on crash recovery), which is exactly the right trade-off for short-lived staging data — we get 5-10x raw insert throughput in exchange for losing the staged rows if the server crashes. Combined with batched COPY (not INSERT) from a Python ingest service and a deliberate "no indexes on the staging table" rule, raw ingest jumped from 8K to 80K+ events/sec on the same RDS instance.
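A minimal sketch of the stage-1 path, assuming a psycopg2-style connection and illustrative table/column names (`staging_events`, `event_id`, `payload`, `received_at`) that are not from the original system:

```python
import io

# Staging DDL (run once) -- hypothetical schema for illustration:
#   CREATE UNLOGGED TABLE staging_events (
#       event_id    uuid,
#       payload     jsonb,
#       received_at timestamptz
#   );
# Note: no primary key and no indexes on this table -- that is the point.

def to_copy_text(rows):
    """Serialize rows into PostgreSQL COPY text format: tab-separated
    columns, \\N for NULL, one line per row. Assumes values contain no
    tabs or newlines; real code should escape them or use CSV format."""
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join("\\N" if v is None else str(v) for v in row))
        buf.write("\n")
    buf.seek(0)
    return buf

def ingest_batch(conn, rows):
    """Bulk-load one batch with COPY via a psycopg2 connection:
    one round trip per batch instead of one per row."""
    with conn.cursor() as cur:
        cur.copy_expert(
            "COPY staging_events (event_id, payload, received_at) FROM STDIN",
            to_copy_text(rows),
        )
    conn.commit()
```

Batch size is a tuning knob: a few thousand rows per COPY amortizes the round trip without holding events in the producer for long.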
Stage 2 (durable): a Celery worker drains the staging table in micro-batches into the real, fully-logged, indexed `events` table inside a single transaction with idempotent inserts. The audited reporting path only ever reads from the durable table. If the server crashes mid-ingest, we lose at most a few seconds of in-flight events from the staging table — and the upstream producers retry, so the durable table converges back to correct.
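The drain step can be sketched as a single atomic statement: a `DELETE ... RETURNING` CTE feeds the `INSERT`, so a batch either moves completely or not at all. Table and column names are hypothetical, matching the sketch above, and the batch size is an assumed default:

```python
# DELETE ... RETURNING in a CTE makes the move atomic within one
# statement; ON CONFLICT DO NOTHING makes producer retries idempotent.
DRAIN_SQL = """
WITH moved AS (
    DELETE FROM staging_events
    WHERE ctid IN (
        SELECT ctid FROM staging_events LIMIT %(batch_size)s
    )
    RETURNING event_id, payload, received_at
)
INSERT INTO events (event_id, payload, received_at)
SELECT event_id, payload, received_at FROM moved
ON CONFLICT (event_id) DO NOTHING
"""

def drain_once(conn, batch_size=5000):
    """Move one micro-batch from staging into the durable table.
    Returns the number of rows inserted (duplicates are skipped),
    so the Celery task can loop until the staging table is empty."""
    with conn.cursor() as cur:
        cur.execute(DRAIN_SQL, {"batch_size": batch_size})
        inserted = cur.rowcount
    conn.commit()
    return inserted
```

With a single drainer this is safe as-is; concurrent drainers would additionally want `FOR UPDATE SKIP LOCKED` semantics on the batch selection.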
We paired this with a small but careful operational layer: a Grafana dashboard for stage-1 fill rate, a hard alert if the drainer falls behind, partition rotation on the durable table, and a written runbook for the three failure modes that matter.
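The "drainer falls behind" alert reduces to one number: how many rows are waiting in staging versus how many seconds of ingest that represents. A sketch, with hypothetical names and an assumed lag budget:

```python
# The Grafana panel graphs this count; the alert fires when the
# backlog exceeds the configured lag budget.
STAGING_BACKLOG_SQL = "SELECT count(*) FROM staging_events"

def staging_backlog(conn):
    """Rows currently waiting in the UNLOGGED staging table."""
    with conn.cursor() as cur:
        cur.execute(STAGING_BACKLOG_SQL)
        return cur.fetchone()[0]

def drainer_is_behind(backlog_rows, ingest_rate_per_s, max_lag_s=30):
    """Alert condition: backlog is larger than max_lag_s seconds of
    ingest at the current rate, i.e. the drainer is losing ground."""
    return backlog_rows > ingest_rate_per_s * max_lag_s
```

The lag budget doubles as the crash-loss bound: a 30-second budget means a crash loses at most about 30 seconds of staged events.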
- UNLOGGED staging table with no indexes — engineered specifically for raw insert throughput
- Batched COPY ingest from a Python/FastAPI service, not row-at-a-time INSERTs
- Celery drainer moving micro-batches into the durable, fully-indexed events table
- Idempotent inserts (ON CONFLICT DO NOTHING) so producer retries are safe
- Monthly partitioning on the durable table to keep vacuum and index work bounded
- End-to-end lag dashboards and alerts in Grafana / Datadog
- Written runbook covering RDS failover, drainer crash, and producer back-pressure
- Audit sign-off on the durability profile before launch — no surprises
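The monthly partition rotation from the list above can be sketched as a small DDL generator, assuming the durable table is declared with `PARTITION BY RANGE (received_at)` (table and column names are illustrative):

```python
import datetime

def month_partition_ddl(table, year, month):
    """Generate the CREATE TABLE statement for one monthly range
    partition, e.g. events_2024_05 covering May 2024. A scheduled job
    runs this ahead of time and drops partitions past retention."""
    start = datetime.date(year, month, 1)
    # First day of the following month (rolls over December -> January).
    end = datetime.date(year + (month == 12), month % 12 + 1, 1)
    name = f"{table}_{start:%Y_%m}"
    return (
        f"CREATE TABLE {name} PARTITION OF {table} "
        f"FOR VALUES FROM ('{start}') TO ('{end}')"
    )
```

Dropping an expired partition is a near-instant `DROP TABLE`, which is what keeps vacuum and index maintenance bounded on the durable table.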