When I joined Amberdata in early 2023, we covered four major blockchains: Bitcoin, Bitcoin Cash, Litecoin, and Ethereum. We also had one big problem: onboarding a new blockchain was neither quick nor cost-effective. Several factors contributed to this, but the most critical was collection speed, since sweeping an entire chain from genesis to the latest block means processing tens of terabytes of data. In this post, we will look at how a process that used to take months and cost over $50,000 per chain now takes just a few hours and costs under $5,000.
To start the journey, let's dive into what the collection architecture looked like for Ethereum and what it took to onboard a new blockchain on top of it:
Node Setup - Provision nodes in our ecosystem, sync them, and bring up additional nodes with our providers.
Block Collection - Add blockchain-specific components: ECS templates, EC2 instances, Postgres instances, etc.
Block Enrichment Collection - Capture transaction call stacks to accurately track balances and state changes.
Re-org Handling - Run a collector that re-collects the top n blocks; this requires replicating all of the Block Collection infrastructure.
REST Service - Extend our API to support the new blockchain.
This architecture was well suited to supporting a blockchain from the present forward, but historical data remained problematic. The system was not built to backfill years of history, so we had to deploy multiple instances by hand to export data in ranges, load it manually, and validate it extensively. The process was slow, error-prone, and lacked observability.
In short, while the pipeline handled live data effectively, it wasn’t built for rapid onboarding of new blockchains or historical completeness.
We set out to solve that problem, beginning with collection. Our new approach needed to be:
Fast and Scalable
Redundant and Idempotent
Repeatable and Extensible
Accurate and Cost Effective
The first step was to figure out how to get all the data from a blockchain as fast as possible. Our original system was built over half a decade ago when RPC calls were the only method to collect blockchain data.
Today the landscape looks very different: multiple high-performance options exist. After a round of discovery and benchmarking, we adopted a hybrid model, using QuickNode RPC for flexibility and StreamingFast for high-throughput streaming.
Let’s take a closer look at how each fits into our new architecture.
Streaming vs. RPC
| Aspect | StreamingFast (Streaming) | QuickNode (RPC) |
|---|---|---|
| Architecture | Event-driven and streaming-based. Indexes and streams decoded blockchain data (blocks, traces, state deltas). | Request-response model. Calls full nodes and returns raw block, transaction, and call data. |
| Access Pattern | Push-based; can stream full blocks or specific datasets using WASM-compiled Rust modules. | Pull-based; limited by per-request rate limits and latency, though calls can be batched for higher throughput. |
| Throughput | Millions of events per second possible. | Limited by rate limits and node performance; difficult to scale beyond tens of thousands of RPC calls/sec. |
| Granularity | Complete block data with all transactions, logs, and trace data. | Piecewise data (e.g., balances, token transfers, traces) requiring hundreds of calls per full block. |
| Integration & Processing | Deterministic, parallelizable modules that preprocess on-chain events into structured Protobuf datasets. | Developers must manually parse logs, decode events, and maintain their own indexing layer, resulting in high engineering overhead. |
| Scalability | Horizontally scalable via Substreams modules. | Limited by single-node throughput and provider constraints. |
| Cost per TB | Lower: a single, parallelizable pass through the chain. | Higher: repeated RPC calls and slow replay for trace data. |
| Latency | Sub-second for live streaming. | Dependent on RPC response time and batching. |
| Reliability | Built for production ingestion workloads. | Dependent on provider rate limits and node sync state. |
| Vendor & Ecosystem | Open-source (core to The Graph Network); can self-host or use decentralized indexers. | Managed SaaS; closed, provider-hosted infrastructure. |
| Use Case | Best Option | Why |
|---|---|---|
| Building trading or analytics platforms (e.g., Amberdata, Nansen, Dune) | StreamingFast | Efficiently consumes all on-chain events and builds derived datasets. |
| Dapps & ad-hoc information (token details, DEX state, etc.) | QuickNode (RPC) | Lightweight and cost-effective for point queries. |
| Per-block, event-driven data (swaps, balance changes, etc.) | StreamingFast | Streams only the relevant data as it happens. |
| Historical indexing (full-chain scans) | StreamingFast | Indexed data accessible without replaying the chain. |
| Event pipelines & downstream analytics | StreamingFast | Substreams deliver low-latency, structured data streams. |
While both approaches play a role in our ecosystem, StreamingFast emerged as the foundation of our large-scale ingestion strategy. For Amberdata, the key requirements were clear:
Full historical indexing with the highest possible granularity
Decoded blockchain data pipelines for analytical and derived datasets
Multi-chain support through reproducible, template-based ingestion
Cost efficiency at petabyte scale
Meeting these needs required a platform capable of replacing thousands of RPC calls and custom decoders with a composable, high-throughput, and extensible data pipeline.
With StreamingFast, we can now ingest and process terabytes of on-chain data in hours instead of months, while maintaining the performance, scalability, and reproducibility that institutional data platforms demand.
QuickNode RPC still plays a complementary role, ideal for lightweight queries, on-demand lookups, and API-driven integrations. But for high-volume, historical, and analytical ingestion, StreamingFast powers the core of our architecture.
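As an illustration of the kind of point query RPC remains well suited for, the sketch below asks a node for an ERC-20 token's symbol() via a single eth_call. This is an assumption-level sketch, not our production enrichment code: the endpoint URL is a placeholder for a provider endpoint, and decoding the ABI-encoded return value is omitted to keep the example short.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// rpcRequest and rpcResponse model the JSON-RPC 2.0 envelope used by Ethereum nodes.
type rpcRequest struct {
	JSONRPC string        `json:"jsonrpc"`
	ID      int           `json:"id"`
	Method  string        `json:"method"`
	Params  []interface{} `json:"params"`
}

type rpcResponse struct {
	Result string          `json:"result"`
	Error  json.RawMessage `json:"error,omitempty"`
}

// tokenSymbolRaw calls the ERC-20 symbol() function (selector 0x95d89b41) on
// the given token contract and returns the raw ABI-encoded result as hex.
func tokenSymbolRaw(endpoint, token string) (string, error) {
	req := rpcRequest{
		JSONRPC: "2.0",
		ID:      1,
		Method:  "eth_call",
		Params: []interface{}{
			map[string]string{"to": token, "data": "0x95d89b41"},
			"latest",
		},
	}
	body, err := json.Marshal(req)
	if err != nil {
		return "", err
	}

	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var out rpcResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	if len(out.Error) > 0 {
		return "", fmt.Errorf("rpc error: %s", out.Error)
	}
	return out.Result, nil // ABI-encoded string; decoding omitted for brevity
}

func main() {
	// Placeholder endpoint; substitute your provider's HTTPS URL.
	raw, err := tokenSymbolRaw("https://your-rpc-endpoint.example.com",
		"0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2") // WETH on Ethereum mainnet
	if err != nil {
		panic(err)
	}
	fmt.Println("raw symbol() return data:", raw)
}
```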
With StreamingFast as the backbone, we designed a modular collection system that could ingest, validate, and serve blockchain data at scale. The architecture is composed of five core components:
Mappers: WASM-compiled Rust modules running within the StreamingFast ecosystem. Each mapper emits the specific dataset we need (e.g., blocks, transactions, balances) over gRPC.
Collector: A Go service that registers the mapper and opens a gRPC channel to continuously receive the streaming data (a minimal sketch of this receive loop follows the component list).
Auditor: A validation job ensuring complete coverage across every block and transaction. It uses collection metadata emitted by the mappers to verify that nothing is missed between collection and sink.
Loader: For datasets exposed via REST, the loader writes validated data into the appropriate database for API serving.
Enhancement Layer: Handles on-demand enrichment via RPC, such as fetching token metadata or other contextual data that doesn’t need to be streamed.
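To make the Collector's receive loop concrete, here is a minimal Go sketch. BlockData and BlockStream are hypothetical stand-ins for the Protobuf message and generated gRPC client stream defined by a mapper's .proto (not the actual StreamingFast API), and the fake stream exists only so the example runs on its own. The important part is the shape of the loop: drain one block range, hand every message to a sink, and surface any error so the whole range can simply be re-run.

```go
// A simplified sketch of the Collector loop. BlockData and BlockStream are
// hypothetical stand-ins for the generated gRPC stubs our mappers expose.
package main

import (
	"errors"
	"io"
	"log"
)

// BlockData stands in for the decoded Protobuf message emitted by a mapper.
type BlockData struct {
	Number uint64
}

// BlockStream has the shape of a gRPC server-streaming client: Recv returns
// messages until the range is exhausted (io.EOF) or an error occurs.
type BlockStream interface {
	Recv() (*BlockData, error)
}

// collect drains one block range from the stream and hands every message to
// the sink. On failure the caller re-runs the same range; because the sink is
// idempotent, replays are safe.
func collect(stream BlockStream, sink func(*BlockData) error) error {
	for {
		msg, err := stream.Recv()
		if errors.Is(err, io.EOF) {
			return nil // range fully consumed
		}
		if err != nil {
			return err
		}
		if err := sink(msg); err != nil {
			return err
		}
	}
}

// fakeStream lets the sketch run without a live mapper endpoint.
type fakeStream struct{ next, stop uint64 }

func (f *fakeStream) Recv() (*BlockData, error) {
	if f.next >= f.stop {
		return nil, io.EOF
	}
	b := &BlockData{Number: f.next}
	f.next++
	return b, nil
}

func main() {
	err := collect(&fakeStream{next: 0, stop: 5}, func(b *BlockData) error {
		log.Printf("collected block %d", b.Number)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

Returning the error instead of retrying in place keeps the collector itself stateless; the workflow layer decides whether to re-queue the range.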
While it took a couple of months to build all components, the generalized design allows us to onboard a new blockchain, including full historical data, in about a week.
We achieved this by running hundreds of collectors in parallel, orchestrated through Argo Workflows, Kubernetes, and Karpenter. This combination gave us elastic scaling, fault tolerance, and cost efficiency, all with minimal manual intervention.
In the process, we developed a reusable template for parallel processing, now a workhorse across multiple data pipelines at Amberdata. The template can be adapted for any large-scale collection task with only a few key configuration parameters (a simplified range-partitioning helper is sketched after the list):
Processing Range - Total block range to collect
Number of Workers - Number of discrete worker pods to run
Processor Steps - Collect → Validate → Load
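Given those parameters, the fan-out step reduces to splitting the total processing range into per-worker block ranges. The helper below is a hypothetical sketch, not our production template, but it shows how a processing range and a worker count translate into the discrete assignments that each worker pod then runs Collect → Validate → Load over.

```go
package main

import "fmt"

// blockRange is one worker's assignment: an inclusive [Start, End] window.
type blockRange struct {
	Start, End uint64
}

// splitRange divides [start, end] (inclusive) into at most `workers`
// contiguous, non-overlapping ranges of near-equal size.
func splitRange(start, end uint64, workers int) []blockRange {
	total := end - start + 1
	if workers < 1 {
		workers = 1
	}
	if uint64(workers) > total {
		workers = int(total)
	}
	chunk := total / uint64(workers)
	extra := total % uint64(workers) // spread the remainder over the first ranges

	ranges := make([]blockRange, 0, workers)
	cursor := start
	for i := 0; i < workers; i++ {
		size := chunk
		if uint64(i) < extra {
			size++
		}
		ranges = append(ranges, blockRange{Start: cursor, End: cursor + size - 1})
		cursor += size
	}
	return ranges
}

func main() {
	// e.g. 20M blocks spread across 100 worker pods; print the first few assignments.
	for _, r := range splitRange(0, 19_999_999, 100)[:3] {
		fmt.Printf("worker range: %d-%d\n", r.Start, r.End)
	}
}
```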
In just a few months, we built a scalable collection engine capable of scanning and indexing an entire blockchain in hours, not months.
Standing up infrastructure for a new chain now takes less than a week, and the cost per chain dropped from ~$50,000 to under $5,000.
Beyond speed and cost, we gained several structural benefits:
Comprehensive validation: every block, transaction, and event is accounted for (a simplified coverage check is sketched below).
Full observability: immediate visibility into collection gaps or anomalies.
Idempotency: any failed range can be safely reprocessed without affecting downstream datasets.
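To make the validation and idempotency points concrete, here is a simplified sketch of the kind of coverage check the Auditor performs, assuming each worker reports the inclusive block range it ingested. This is an illustrative helper, not the production Auditor, which also reconciles transaction- and event-level metadata emitted by the mappers.

```go
package main

import (
	"fmt"
	"sort"
)

// collected is a block range a worker reports as successfully ingested.
type collected struct {
	Start, End uint64 // inclusive
}

// findGaps returns the block ranges inside [start, end] that no worker
// covered. Any gap can simply be re-collected: because the pipeline is
// idempotent, reprocessing a range never corrupts downstream datasets.
func findGaps(start, end uint64, ranges []collected) []collected {
	sort.Slice(ranges, func(i, j int) bool { return ranges[i].Start < ranges[j].Start })

	var gaps []collected
	next := start // first block we still expect to see
	for _, r := range ranges {
		if r.Start > next {
			gaps = append(gaps, collected{Start: next, End: r.Start - 1})
		}
		if r.End+1 > next {
			next = r.End + 1
		}
		if next > end {
			break
		}
	}
	if next <= end {
		gaps = append(gaps, collected{Start: next, End: end})
	}
	return gaps
}

func main() {
	reported := []collected{{0, 999_999}, {1_000_000, 1_499_999}, {1_600_000, 2_000_000}}
	for _, g := range findGaps(0, 2_000_000, reported) {
		fmt.Printf("missing blocks %d-%d, re-queue this range\n", g.Start, g.End)
	}
}
```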
What was once a manual, error-prone process has evolved into a deterministic, scalable, and reproducible collection framework, a foundation that continues to power Amberdata’s expansion into new blockchains today.
Today we focused on the ingestion portion of the pipeline, but once you can collect the data, what then? You need to transform, enrich, and aggregate what you have collected into a digestible format for customers. That is exactly what we will look at next time with our DEX trades pipeline: it collects raw swaps from diverse protocols across a handful of EVM chains and turns them into a single cross-chain trades dataset, giving customers an overview of the DeFi ecosystem.