February 26, 2026 ubb

The Metering Problem Nobody Talks About

Everyone talks about whether you should charge based on usage. Almost nobody talks about the unglamorous reality that comes right after you make that decision: you have to actually measure the usage. Accurately. In real time. At scale. Without double-counting, without gaps, without your engineers debugging a billing discrepancy at 2am because two microservices disagreed on how many API calls a customer made in November.

Metering is the unsexy infrastructure problem that makes or breaks usage-based pricing. And most companies get it wrong in ways that are embarrassingly predictable.

The Three Failure Modes

1. Off-by-one errors that add up to real money

Off-by-one errors in metering are not just a developer annoyance. They're a financial liability. Consider a system whose aggregation windows are inclusive on both ends, so an event that lands exactly on a period boundary is counted twice, or two services that disagree about which end of the range is exclusive. Both bugs are trivially easy to introduce in billing aggregation code. At 1 million API calls per customer per month at $0.001/call, the monthly bill is $1,000, so a consistent 1% overcounting error is $10 per customer per month in wrongful overcharges. That sounds small until you multiply it by every customer and every month it went unnoticed. Discover it 18 months into your Series B and the refund calculation becomes an interesting conversation with your CFO.

The Metronome engineering team has written extensively about this: billing logic needs to be treated with the same rigor as financial accounting code, not the same rigor as "let's grep the logs and see what we get." Every aggregation query needs deterministic output. Every count needs to be reproducible against the same raw event data.
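As a minimal sketch of the boundary problem, assuming events carry timezone-aware UTC timestamps: one common convention is a half-open window, inclusive at the start and exclusive at the end, applied identically everywhere. The function names and the 2025 dates below are illustrative assumptions, not taken from any vendor's code.

```python
from datetime import datetime, timezone

def count_in_period(timestamps, start, end):
    """Count events in the half-open window [start, end)."""
    return sum(1 for ts in timestamps if start <= ts < end)

def count_inclusive(timestamps, start, end):
    """Buggy variant: both ends inclusive, so boundary events land in two periods."""
    return sum(1 for ts in timestamps if start <= ts <= end)

oct_start = datetime(2025, 10, 1, tzinfo=timezone.utc)
nov_start = datetime(2025, 11, 1, tzinfo=timezone.utc)
dec_start = datetime(2025, 12, 1, tzinfo=timezone.utc)

events = [
    datetime(2025, 10, 31, 23, 59, 59, tzinfo=timezone.utc),
    nov_start,  # lands exactly on the period boundary
    datetime(2025, 11, 15, 12, 0, tzinfo=timezone.utc),
]

# Half-open windows: every event belongs to exactly one period.
october = count_in_period(events, oct_start, nov_start)
november = count_in_period(events, nov_start, dec_start)

# Inclusive windows: the boundary event is billed in both months.
october_buggy = count_inclusive(events, oct_start, nov_start)
november_buggy = count_inclusive(events, nov_start, dec_start)
```

With half-open windows the per-period counts always sum to the number of raw events, which gives you a cheap invariant to assert in tests.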

2. Clock skew between distributed systems

Your API gateway, your application server, your database, and your message queue are all running on different machines with clocks that drift. NTP helps but doesn't eliminate skew. In a distributed system processing millions of events, clock skew creates events that arrive "late" — after the billing period has closed — or "early" — before the usage technically happened from the customer's perspective.

This matters because billing periods are sacred. A customer billed for October usage doesn't want November events included because your Kafka consumer was processing a backlog. The production-grade solution, as Orb documents in their architecture guide, is to timestamp events at the point of occurrence — not at the point of ingestion — and use idempotent event keys to deduplicate late arrivals. Your billing system should accept events out of order and reconstruct the correct picture retroactively.
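A rough sketch of the idea, assuming each event carries both an occurrence timestamp set by the emitter and an ingestion timestamp set by the pipeline. The `UsageEvent` fields and the `billing_period` helper are hypothetical names for illustration; calendar-month periods in UTC are also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class UsageEvent:
    event_id: str          # idempotency key, assigned once by the emitter
    customer_id: str
    occurred_at: datetime  # when the billable action happened
    ingested_at: datetime  # when the pipeline received it

def billing_period(event):
    """Assign the event to the calendar month in which it occurred (UTC)."""
    ts = event.occurred_at.astimezone(timezone.utc)
    return (ts.year, ts.month)

# An October event that a backlogged consumer only processed in November.
late = UsageEvent(
    event_id="evt-1",
    customer_id="cust-42",
    occurred_at=datetime(2025, 10, 31, 23, 58, tzinfo=timezone.utc),
    ingested_at=datetime(2025, 11, 1, 0, 7, tzinfo=timezone.utc),
)

period = billing_period(late)  # keyed on occurrence time, so this is October
```

Keeping `ingested_at` alongside `occurred_at` is still useful: the gap between the two tells you how late events typically arrive, which informs how long to wait before finalizing an invoice.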

3. Idempotency: the failure that double-bills your best customers

Idempotency is what separates a billing system from a billing incident. When your event pipeline has a hiccup — a retry, a network partition, a consumer restart — events get delivered more than once. If your metering system isn't idempotent, those duplicates get counted. That customer who does $50k/month with you? They just got billed $55k because your Kafka consumer retried a batch during a deployment.

Every usage event needs a globally unique ID. Every metering endpoint needs to deduplicate against those IDs. This sounds obvious. It's also the thing that gets skipped in the rush to ship. Metronome's architecture uses event IDs as deduplication keys at ingest time, ensuring that even if the same event arrives 10 times, it's counted exactly once. This is non-negotiable infrastructure for any production billing system.
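To make the deduplication step concrete, here is a minimal in-memory sketch. It is not Metronome's implementation; the `Meter` class and its methods are hypothetical, and a real system would presumably back the seen-ID set with a durable store and a uniqueness constraint rather than a Python `set`.

```python
class Meter:
    """Counts usage per customer, ignoring event IDs it has already seen."""

    def __init__(self):
        self._seen = set()   # event IDs already counted
        self._counts = {}    # customer_id -> total quantity

    def ingest(self, event_id, customer_id, quantity=1):
        """Return True if the event was counted, False if it was a duplicate."""
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        self._counts[customer_id] = self._counts.get(customer_id, 0) + quantity
        return True

    def usage(self, customer_id):
        return self._counts.get(customer_id, 0)

meter = Meter()

# The same event is delivered 10 times, for example after consumer retries.
results = [meter.ingest("evt-001", "cust-42") for _ in range(10)]

# A genuinely new event is still counted.
meter.ingest("evt-002", "cust-42")

total = meter.usage("cust-42")
```

Only the first delivery of `evt-001` returns `True`, so the customer's total reflects two events rather than eleven.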

What Production-Grade Metering Actually Looks Like

The companies that get metering right treat their usage pipeline as a first-class financial system, not a logging afterthought. The architecture typically has four layers:

Event emission — Every billable action emits a structured event with an immutable ID, an accurate timestamp, a customer identifier, and an event type. Events are emitted at the application layer, not inferred from logs.
Durable ingestion — Events land in a durable queue (Kafka, Kinesis, Pub/Sub) before any processing. If processing fails, the events survive. The queue is the source of truth, not your database.
Idempotent aggregation — Aggregation jobs consume from the queue and write to a billing store using the event ID as a deduplication key. Re-running the same aggregation job over the same window always produces the same number.
Audit trail — Raw events are retained separately from aggregations. If a customer disputes their bill, you can reconstruct their usage from first principles. This is also a compliance requirement in most enterprise contracts.
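The four layers above can be sketched end to end in a few lines. This is a toy model under stated assumptions: a Python list stands in for the durable queue, events are counted one unit each, and all names (`emit`, `ingest`, `aggregate`) are hypothetical.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class UsageEvent:
    event_id: str
    customer_id: str
    event_type: str
    occurred_at: datetime

# Layer 1: event emission. The ID is assigned once, at the application layer.
def emit(customer_id, event_type, occurred_at):
    return UsageEvent(str(uuid.uuid4()), customer_id, event_type, occurred_at)

# Layer 2: durable ingestion. A list stands in for Kafka, Kinesis, or Pub/Sub.
event_log = []

def ingest(event):
    event_log.append(event)  # append-only, never updated in place

# Layer 3: idempotent aggregation over a half-open window [start, end).
def aggregate(log, start, end):
    seen = set()
    totals = defaultdict(int)
    for e in log:
        if e.event_id in seen or not (start <= e.occurred_at < end):
            continue
        seen.add(e.event_id)
        totals[(e.customer_id, e.event_type)] += 1
    return dict(totals)

start = datetime(2025, 10, 1, tzinfo=timezone.utc)
end = datetime(2025, 11, 1, tzinfo=timezone.utc)

call = emit("cust-42", "api_call", datetime(2025, 10, 5, tzinfo=timezone.utc))
ingest(call)
ingest(call)  # redelivery of the same event
ingest(emit("cust-42", "api_call", datetime(2025, 10, 6, tzinfo=timezone.utc)))

# Layer 4: audit trail. The raw log is retained, so totals can be recomputed.
first_run = aggregate(event_log, start, end)
second_run = aggregate(event_log, start, end)
```

Because `aggregate` is a pure function of the raw log and the window, re-running it yields the same totals, and the duplicate delivery stays visible in the log for auditing without affecting the bill.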

Orb's billing infrastructure documentation describes this as "immutable metering" — the idea that your raw event log is append-only and sacrosanct, and all billing computations are derived from it deterministically. If you need to fix a billing error, you don't edit the events. You add a correction event and re-run the aggregation.
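One way to picture the correction pattern, as a sketch rather than Orb's actual data model: corrections are new entries with a signed quantity that reference the event they adjust, and the billable total is recomputed from the full log. `LedgerEntry` and `billable_quantity` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class LedgerEntry:
    entry_id: str
    customer_id: str
    quantity: int                   # negative for downward corrections
    corrects: Optional[str] = None  # ID of the entry being corrected, if any

def billable_quantity(ledger, customer_id):
    """Recompute usage from the raw log, counting each entry ID once."""
    seen = set()
    total = 0
    for entry in ledger:
        if entry.customer_id != customer_id or entry.entry_id in seen:
            continue
        seen.add(entry.entry_id)
        total += entry.quantity
    return total

ledger = [
    LedgerEntry("evt-1", "cust-42", 100),
    LedgerEntry("evt-2", "cust-42", 50),
]
before = billable_quantity(ledger, "cust-42")

# Suppose evt-2 over-reported usage by 20 units. The original entry is left
# untouched; a correction is appended and the aggregation is re-run.
ledger.append(LedgerEntry("cor-1", "cust-42", -20, corrects="evt-2"))
after = billable_quantity(ledger, "cust-42")
```

The original `evt-2` entry still reads 50, so anyone auditing the bill can see both what was first reported and why the total changed.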

The Organizational Problem

The technical challenges are solvable. The organizational challenge is harder: most engineering teams don't treat billing as critical infrastructure until they've had their first billing incident. Metering gets built as a side project by the engineer who "owns" billing, using whatever tooling is already in the stack, with test coverage that's optimistically described as "basic."

The correct posture is to treat your metering pipeline with the same operational rigor as your payments processing code. That means: dedicated ownership, comprehensive test coverage including edge cases around billing period boundaries, chaos testing for duplicate event delivery, and customer-facing dashboards so customers can see their usage before they get the bill. That last one isn't just a nice feature — it's your first line of defense against billing disputes and the single most effective thing you can do to reduce invoice shock.
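A duplicate-delivery test does not need heavy tooling to start with. The sketch below simulates at-least-once delivery by randomly duplicating and reordering event IDs, then checks that the count is unchanged. `count_unique` is a hypothetical stand-in for your real metering code, and the seed and sizes are arbitrary.

```python
import random

def count_unique(deliveries):
    """Stand-in for the system under test: count each event ID once."""
    return len(set(deliveries))

def simulate_at_least_once(event_ids, rng, max_copies=3):
    """Deliver every event between 1 and max_copies times, in random order."""
    deliveries = []
    for event_id in event_ids:
        deliveries.extend([event_id] * rng.randint(1, max_copies))
    rng.shuffle(deliveries)
    return deliveries

rng = random.Random(2026)  # fixed seed so a failing run can be reproduced
event_ids = [f"evt-{i}" for i in range(1000)]

for _ in range(20):
    deliveries = simulate_at_least_once(event_ids, rng)
    # Duplicates and reordering must not change the billed count.
    assert count_unique(deliveries) == len(event_ids)
```

Swapping `count_unique` for a call into the actual ingestion path turns this into a regression test that fails the moment deduplication breaks.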

Billing surprises are churn events. Customers who understand their consumption in real time don't get surprised. Customers who get surprised call their account manager at renewal and ask for credits. Get the metering right and the pricing model takes care of itself. Get it wrong and you'll spend more time on billing disputes than on product.

Sources

Metronome — Billing Concepts & Event Architecture — idempotency, event IDs, and deduplication in production billing
Orb — Metered Billing Architecture — immutable event logs, late arrival handling, billing period boundaries
Orb — Real-Time Usage-Based Billing — clock skew, distributed systems challenges, customer-facing dashboards
Metronome Engineering Blog — Why We Built Metronome — production pain points that drove the product design
