To understand the challenge at Amberdata, you have to start with scale. Today, our platform:
Produces 8–10 TB of data per day
Consumes 24–30 TB daily via Redpanda
Stores 2.6 PB in object storage
Handles 50B messages per day
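To put those figures in perspective, here is a rough back-of-the-envelope calculation. The midpoints are assumptions for illustration, not exact internal numbers, but they show the workload profile: very small messages arriving at very high frequency.

```python
# Rough arithmetic on the published figures above; midpoints are assumptions.
messages_per_day = 50e9                  # ~50B messages/day
consumed_bytes_per_day = 27e12           # midpoint of the 24-30 TB/day range

avg_message_bytes = consumed_bytes_per_day / messages_per_day
messages_per_second = messages_per_day / 86_400

print(f"~{avg_message_bytes:.0f} bytes per message")        # ~540 bytes
print(f"~{messages_per_second:,.0f} messages per second")   # ~578,704
```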
Three years ago, we were operating at roughly 10% of this scale, yet our AWS bills were already unsustainable. Scaling the system as it existed would have broken the business financially. That forced a hard realization: efficiency could not be an afterthought. It had to become a core engineering objective.
“Growth at all costs” is a myth. Growth has to be defensible.
When I joined, we lacked the observability needed to define what “efficient” even meant. Costs were opaque, targets were aspirational, and tradeoffs were implicit. We changed that by treating cost as a first-class engineering metric, on equal footing with uptime and latency. Our North Star became clear: a cost-accountable organization built on strong observability, explicit ownership, and architectures designed to scale economically.
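As a concrete illustration of what treating cost as a first-class metric can look like, here is a minimal sketch (not our production tooling; the metric name and port are placeholders) that pulls daily AWS spend per service from Cost Explorer and exposes it as a Prometheus gauge, so cost lands on the same dashboards as uptime and latency:

```python
import datetime as dt
import time

import boto3
from prometheus_client import Gauge, start_http_server

ce = boto3.client("ce")  # AWS Cost Explorer
DAILY_COST = Gauge("aws_daily_cost_usd", "Unblended daily AWS cost", ["service"])

def refresh_costs() -> None:
    """Fetch yesterday's cost per AWS service and update the gauge."""
    today = dt.date.today()
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": str(today - dt.timedelta(days=1)), "End": str(today)},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    for group in resp["ResultsByTime"][0]["Groups"]:
        service = group["Keys"][0]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        DAILY_COST.labels(service=service).set(amount)

if __name__ == "__main__":
    start_http_server(9102)      # Prometheus scrape endpoint (placeholder port)
    while True:
        refresh_costs()
        time.sleep(6 * 3600)     # Cost Explorer data refreshes a few times a day
```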
Early-stage startups optimize for speed and product-market fit. Efficiency usually comes later. We didn’t have the luxury of waiting. Our first step was brutally simple: we stopped paying for things we did not need. Instead of chasing marginal optimizations, we focused on a few governing principles.
We paired these changes with commitment-based savings, such as Compute Savings Plans and the Enterprise Discount Program (EDP). This phase did not make us cheap; it made us stable, buying us time to redesign the system properly.
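The arithmetic behind commitment-based savings is straightforward. The numbers below are purely illustrative assumptions, not our actual rates or discounts, but they show why covering a stable compute baseline with a commitment buys breathing room:

```python
# Illustrative figures only; real Savings Plan discounts vary by term and instance family.
on_demand_monthly = 100_000      # hypothetical steady-state on-demand compute bill
covered_fraction = 0.80          # share of the baseline covered by the commitment
plan_discount = 0.30             # hypothetical effective discount on covered usage

effective_monthly = (
    on_demand_monthly * covered_fraction * (1 - plan_discount)
    + on_demand_monthly * (1 - covered_fraction)
)
print(f"Effective spend: ${effective_monthly:,.0f} "
      f"(saves ${on_demand_monthly - effective_monthly:,.0f}/month)")
```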
The original architecture was well-suited for finding product-market fit, but eventually became a bottleneck for high-volume ingestion. Over three years, we systematically rebuilt the platform without interrupting 24/7 operations. Stabilizing costs in Phase 1 provided the leverage to transition to Redpanda and a Delta Lake foundation.
For a high-frequency data platform, ingestion resiliency and cost-efficiency are paramount. We achieved this by decoupling our ingestion layer from the streaming backbone, migrating our collectors to Redpanda Connect (Benthos) on Kubernetes. This transition significantly accelerated our speed-to-market and established High Availability (HA) by distributing collectors across multiple Availability Zones (AZs). This elastic architecture enables us to scale instantly to meet market volatility while maintaining a lean resource profile.
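Our collectors run as Redpanda Connect (Benthos) pipelines, but the shape of the ingestion path is easy to sketch. The snippet below is a simplified stand-in, with hypothetical broker addresses, topic name, and payload, showing a collector publishing a normalized event over Redpanda’s Kafka-compatible API:

```python
import json

from kafka import KafkaProducer  # Redpanda speaks the Kafka wire protocol

# Hypothetical brokers and topic; real collectors are Redpanda Connect pipelines.
producer = KafkaProducer(
    bootstrap_servers=["redpanda-0.internal:9092", "redpanda-1.internal:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",        # wait for full replication before confirming ingest
    linger_ms=20,      # small batching window to cut per-request overhead
)

trade = {"exchange": "example", "pair": "BTC-USD", "price": 64250.5, "ts": 1700000000}
producer.send("market.trades.raw", value=trade)
producer.flush()
```

Because collectors are stateless and talk only to the streaming backbone, Kubernetes can add or remove replicas across AZs as market volume moves.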
The move from self-managed Kafka to Redpanda was the turning point for our platform’s economics. Redpanda’s streamlined architecture requires a much smaller infrastructure footprint, delivering superior performance with far fewer resources than the legacy setup. By running across multiple AZs, we achieved the redundancy the old system lacked while significantly reducing hardware costs and operational burden.
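Redundancy across AZs comes down to replication. As a minimal sketch, with a hypothetical topic and sizing, and assuming brokers are spread across three AZs with rack awareness configured, creating a topic with three replicas means a single zone can disappear without losing a partition:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers=["redpanda-0.internal:9092"])

# Hypothetical topic: three replicas so every partition survives the loss of an AZ.
topic = NewTopic(name="market.trades.raw", num_partitions=24, replication_factor=3)
admin.create_topics([topic])
```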
The result is a system capable of handling 10x the throughput with vastly improved cost efficiency and near-zero data gaps.
Running commodity nodes at scale was a significant cost center that did not differentiate our product. By outsourcing standard node infrastructure to specialized partners like StreamingFast and QuickNode, we effectively traded high and unpredictable infrastructure costs for scalable, predictable OpEx. Their specialized streaming primitives allow for massive parallelization at a fraction of the total cost of ownership (TCO) of running equivalent node infrastructure with an in-house platform team.
At the same time, we continue to operate proprietary nodes where they provide a strategic advantage or a direct revenue-generating opportunity. This hybrid approach eliminates the overhead of routine maintenance while ensuring we capture the highest margins on the infrastructure that actually moves the needle for our business.
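Operationally, consuming a hosted node looks like any other JSON-RPC call. The endpoint below is a placeholder rather than a real URL, but the request is the standard Ethereum eth_blockNumber method:

```python
import requests

RPC_URL = "https://example-endpoint.quiknode.pro/<token>/"  # placeholder, not a real endpoint

payload = {"jsonrpc": "2.0", "id": 1, "method": "eth_blockNumber", "params": []}
resp = requests.post(RPC_URL, json=payload, timeout=10)
resp.raise_for_status()

latest_block = int(resp.json()["result"], 16)  # result is a hex-encoded block number
print(f"Latest block: {latest_block}")
```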
Legacy PostgreSQL and TimescaleDB instances did not scale economically. The cost of maintaining enough “always-on” compute and high-performance EBS storage to handle our growing volume was a primary driver of our move to S3.
Today, our entire 2.6 PB footprint lives in our Delta Lake, which has fundamentally changed our cost structure.
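Landing data in the lake is a plain object-store write rather than a provisioned database transaction. A minimal sketch using the delta-rs Python bindings, with a hypothetical bucket, table, and schema, looks like this:

```python
import pandas as pd
from deltalake import write_deltalake

# Hypothetical batch of normalized trades; the real pipeline writes from streaming jobs.
batch = pd.DataFrame(
    {
        "exchange": ["example", "example"],
        "pair": ["BTC-USD", "ETH-USD"],
        "price": [64250.5, 3120.8],
        "trade_date": ["2024-01-01", "2024-01-01"],
    }
)

write_deltalake(
    "s3://example-market-data/delta/trades",  # hypothetical S3 path
    batch,
    mode="append",
    partition_by=["trade_date"],              # partitioning keeps historical scans cheap
)
```

Storage is billed per byte at object-store rates, and compute only exists while a query or job is actually running.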
We decoupled real-time and historical access to avoid resource contention. Traders need low-latency streams, while researchers need high-throughput scans. Forcing both through the same compute path was inefficient and fragile.
Our data now lives in a unified Delta Lakehouse, providing ACID reliability and schema consistency. We are migrating our compute layer to Trino to enable high-concurrency SQL directly on S3-backed Delta tables, reducing reliance on expensive warehouse credits.
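Querying those Delta tables through Trino is ordinary SQL over the Python DB-API client; the coordinator host, catalog, schema, and table names below are hypothetical:

```python
import trino

conn = trino.dbapi.connect(
    host="trino.internal.example",  # hypothetical coordinator address
    port=8080,
    user="research",
    catalog="delta",                # assumed name for the Delta Lake connector catalog
    schema="market_data",
)

cur = conn.cursor()
cur.execute(
    """
    SELECT pair, avg(price) AS avg_price, count(*) AS trades
    FROM trades
    WHERE trade_date = DATE '2024-01-01'
    GROUP BY pair
    ORDER BY trades DESC
    LIMIT 10
    """
)
for pair, avg_price, trades in cur.fetchall():
    print(pair, avg_price, trades)
```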
For sub-second analytical workloads, we continue to use Apache Pinot, where its indexing excels. Real-time delivery leverages WebSockets across horizontally scalable microservices, with Istio enforcing edge traffic controls to protect infrastructure margins.
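From a client’s point of view, the real-time path is simply a WebSocket subscription. The endpoint and subscribe message below are hypothetical placeholders, not our published API:

```python
import asyncio
import json

import websockets

async def stream_trades() -> None:
    uri = "wss://streaming.example.com/v1/ws"  # placeholder endpoint
    async with websockets.connect(uri) as ws:
        # Hypothetical subscription message; real channels and auth differ.
        await ws.send(json.dumps({"action": "subscribe", "channel": "trades", "pair": "BTC-USD"}))
        async for raw in ws:
            event = json.loads(raw)
            print(event)

asyncio.run(stream_trades())
```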
Infrastructure changes alone are not enough. We decentralized cost ownership to domain teams. Cost stopped being a leadership concern and became part of everyday engineering decisions. Every team owns its own spend.
When spending spikes, the owning team investigates within hours and fixes it immediately. The result is a tight feedback loop where engineers see the impact of architectural decisions in near real time.
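That feedback loop relies on spend being attributable. As a sketch, assuming resources carry a hypothetical `team` cost-allocation tag, a simple job can compare yesterday’s spend per team against its trailing average and flag the owner when it jumps:

```python
import datetime as dt
from collections import defaultdict

import boto3

ce = boto3.client("ce")
today = dt.date.today()

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": str(today - dt.timedelta(days=8)), "End": str(today)},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],   # assumes a "team" cost-allocation tag
)

history = defaultdict(list)
for day in resp["ResultsByTime"]:
    for group in day["Groups"]:
        team = group["Keys"][0]
        history[team].append(float(group["Metrics"]["UnblendedCost"]["Amount"]))

for team, costs in history.items():
    *previous, yesterday = costs
    baseline = sum(previous) / len(previous) if previous else 0.0
    if baseline and yesterday > 1.5 * baseline:   # arbitrary 50% spike threshold
        print(f"ALERT {team}: ${yesterday:,.0f} yesterday vs ${baseline:,.0f} baseline")
```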
Three years ago, our infrastructure was unsustainable. Today, we process ten times the data volume with significantly higher reliability, while reducing our total infrastructure spend by nearly 50% from where we started.
By treating cost as a primary engineering metric, we evolved a fragile startup stack into a resilient, enterprise-grade platform. We’ve proven that speed and efficiency are mutually reinforcing; by architecting for scale, we unlocked faster delivery cycles and redirected our talent from maintenance to delivering the product features our customers care about most.