When I joined Amberdata in early 2023, we had coverage of four major blockchains: Bitcoin, Bitcoin Cash, Litecoin, and Ethereum. However, we suffered from one big issue: onboarding new blockchains quickly and cost-effectively. Several factors contributed to this, but the most critical was collection speed: sweeping an entire chain from genesis to the latest block requires processing tens of terabytes of data. In this post, we will look at how a process that used to take months and cost over $50,000 per chain was reduced to just a few hours and under $5,000.

To start the journey, let's dive into what the collection architecture looked like for Ethereum and what it took to onboard a new blockchain with it:

 
  1. Node Setup - Setting up nodes in our ecosystem, syncing them, and then bringing up nodes with our providers

  2. Block Collection - Adding blockchain-specific components: ECS templates, EC2 instances, Postgres instances, etc.

  3. Block Enrichment Collection - Capturing transaction call stacks to accurately track balances and state changes.

  4. Re-org Handling - Running a collector that re-collects the top N blocks to detect chain reorganizations (a simplified sketch of this check follows the list). This requires replicating all of the Block Collection infrastructure.

  5. REST Service - Extending our API to support the new blockchain.
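
To make the re-org handling step more concrete, here is a minimal Go sketch of the idea (not our production collector): it re-fetches the most recent N block headers, compares their hashes against what was stored earlier, and treats the first mismatch as the start of the range that must be re-collected. The in-memory store and fetch function are hypothetical stand-ins for the real database and RPC layers.

```go
package main

import "fmt"

// Header is a minimal view of a block header for re-org detection.
type Header struct {
	Number uint64
	Hash   string
}

// firstDivergence re-checks the top n blocks ending at head and returns the first
// height whose freshly fetched hash no longer matches the stored one. The boolean
// is false when no re-org is detected in that window.
func firstDivergence(stored map[uint64]Header, fetch func(uint64) Header, head, n uint64) (uint64, bool) {
	start := uint64(0)
	if head >= n {
		start = head - n + 1
	}
	for num := start; num <= head; num++ {
		if old, ok := stored[num]; !ok || old.Hash != fetch(num).Hash {
			return num, true // everything from num..head must be re-collected
		}
	}
	return 0, false
}

func main() {
	// Previously collected headers (hashes are placeholders).
	stored := map[uint64]Header{100: {100, "0xaaa"}, 101: {101, "0xbbb"}, 102: {102, "0xccc"}}
	// Pretend the chain re-organized at block 102.
	fetch := func(num uint64) Header {
		if num == 102 {
			return Header{102, "0xddd"}
		}
		return stored[num]
	}
	if from, reorged := firstDivergence(stored, fetch, 102, 3); reorged {
		fmt.Printf("re-org detected: re-collect blocks %d..%d\n", from, 102)
	}
}
```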

This architecture was well suited to supporting a blockchain from the present forward, but historical data remained problematic. The system was not built to backfill years of history, so we had to manually deploy multiple instances to export data in ranges, load it by hand, and validate it extensively. This process was slow, error-prone, and lacked observability.

In short, while the pipeline handled live data effectively, it wasn’t built for rapid onboarding of new blockchains or historical completeness.

We set out to solve that problem, beginning with collection. Our new approach needed to be:

  1. Fast and Scalable

  2. Redundant and Idempotent

  3. Repeatable and Extensible

  4. Accurate and Cost Effective

The first step was to figure out how to get all the data from a blockchain as fast as possible. Our original system was built over half a decade ago when RPC calls were the only method to collect blockchain data. 

Today the landscape looks very different, with multiple high-performance options available. After a round of discovery and benchmarking, we adopted a hybrid model: QuickNode RPC for flexibility and StreamingFast for high-throughput streaming.

Let’s take a closer look at how each fits into our new architecture.

Streaming vs. RPC

| Aspect | StreamingFast (Streaming) | QuickNode (RPC) |
| --- | --- | --- |
| Architecture | Event-driven and streaming-based. Indexes and streams decoded blockchain data (blocks, traces, state deltas). | Request-response model. Calls full nodes and returns raw block, transaction, and call data. |
| Access Pattern | Push-based; can stream full blocks or specific datasets using WASM-compiled Rust modules. | Pull-based; limited by per-request rate limits and latency; can be batched for higher throughput. |
| Throughput | Millions of events per second possible. | Limited by rate limits and node performance; difficult to scale beyond tens of thousands of RPC calls/sec. |
| Granularity | Complete block data with all transactions, logs, and trace data. | Piecewise data requiring hundreds of calls for a full block (e.g., balances, token transfers, traces). |
| Integration & Processing | Deterministic, parallelizable modules that preprocess on-chain events into structured Protobuf datasets. | Developers must manually parse logs, decode events, and maintain their own indexing layer; high engineering overhead. |
| Scalability | Horizontally scalable via Substreams modules. | Limited by single-node throughput and provider constraints. |
| Cost per TB | Lower; a single pass through the chain, parallelizable. | Higher; repeated RPC calls and slow replay for trace data. |
| Latency | Sub-second for live streaming. | Dependent on RPC response time and batching. |
| Reliability | Built for production ingestion workloads. | Dependent on provider rate limits and node sync state. |
| Vendor & Ecosystem | Open-source (core to The Graph Network). Can self-host or use decentralized indexers. | Managed SaaS; closed-source platform. |
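
To make the pull-based access pattern concrete, here is a minimal Go sketch (not production code) that issues a single JSON-RPC batch of eth_getBlockByNumber requests; the endpoint URL is a placeholder for whatever provider you use. Even with batching, every block still costs explicit round-trips and per-request rate-limit budget, which is what makes full-chain scans expensive over RPC.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// rpcRequest is a standard JSON-RPC 2.0 request envelope.
type rpcRequest struct {
	JSONRPC string        `json:"jsonrpc"`
	ID      int           `json:"id"`
	Method  string        `json:"method"`
	Params  []interface{} `json:"params"`
}

// fetchBlocksBatch requests a contiguous range of blocks in one JSON-RPC batch.
// endpoint is a placeholder for your provider's HTTPS URL.
func fetchBlocksBatch(endpoint string, from, to uint64) ([]json.RawMessage, error) {
	batch := make([]rpcRequest, 0, to-from+1)
	for n := from; n <= to; n++ {
		batch = append(batch, rpcRequest{
			JSONRPC: "2.0",
			ID:      int(n - from),
			Method:  "eth_getBlockByNumber",
			// Hex-encoded block number; true = include full transaction objects.
			Params: []interface{}{fmt.Sprintf("0x%x", n), true},
		})
	}

	body, err := json.Marshal(batch)
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// Each element is one JSON-RPC response object from the batch.
	var results []json.RawMessage
	if err := json.NewDecoder(resp.Body).Decode(&results); err != nil {
		return nil, err
	}
	return results, nil
}

func main() {
	// Placeholder endpoint; substitute your provider URL.
	blocks, err := fetchBlocksBatch("https://example-endpoint.invalid", 19_000_000, 19_000_009)
	if err != nil {
		fmt.Println("batch failed:", err)
		return
	}
	fmt.Printf("fetched %d block payloads\n", len(blocks))
}
```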

Use Case Comparison 

| Use Case | Best Option | Why |
| --- | --- | --- |
| Building trading or analytics platforms (e.g., Amberdata, Nansen, Dune) | StreamingFast | Efficiently consumes all on-chain events and builds derived datasets. |
| Dapps & ad-hoc information (token details, DEX state, etc.) | QuickNode (RPC) | Lightweight and cost-effective for point queries. |
| Per-block, event-driven data (swaps, balance changes, etc.) | StreamingFast | Streams only the relevant data as it happens. |
| Historical indexing (full-chain scans) | StreamingFast | Indexed data accessible without replaying the chain. |
| Event pipelines & downstream analytics | StreamingFast | Substreams deliver low-latency, structured data streams. |


Why StreamingFast Became Our Backbone

While both approaches play a role in our ecosystem, StreamingFast emerged as the foundation of our large-scale ingestion strategy. For Amberdata, the key requirements were clear:

  • Full historical indexing with the highest possible granularity

  • Decoded blockchain data pipelines for analytical and derived datasets

  • Multi-chain support through reproducible, template-based ingestion

  • Cost efficiency at petabyte scale

Meeting these needs required a platform capable of replacing thousands of RPC calls and custom decoders with a composable, high-throughput, and extensible data pipeline.

With StreamingFast, we can now ingest and process terabytes of on-chain data in hours instead of months, while maintaining the performance, scalability, and reproducibility that institutional data platforms demand.

QuickNode RPC still plays a complementary role, ideal for lightweight queries, on-demand lookups, and API-driven integrations. But for high-volume, historical, and analytical ingestion, StreamingFast powers the core of our architecture.

Scaling Collection

With StreamingFast as the backbone, we designed a modular collection system that could ingest, validate, and serve blockchain data at scale. The architecture is composed of five core components:


  1. Mappers: WASM-compiled Rust modules running within the StreamingFast ecosystem. Each mapper emits the specific dataset we need (e.g., blocks, transactions, balances) over gRPC.

  2. Collector: A Go service that registers the mapper and opens a gRPC channel to continuously receive the streaming data.

  3. Auditor: A validation job ensuring complete coverage across every block and transaction. It uses collection metadata emitted from the mappers to verify that nothing is missed between collection and sink (a simplified sketch of this gap check follows the list).

  4. Loader: For datasets exposed via REST, the loader writes validated data into the appropriate database for API serving.

  5. Enhancement Layer: Handles on-demand enrichment via RPC, such as fetching token metadata or other contextual data that doesn’t need to be streamed.
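
As an illustration of the Auditor's coverage check, the simplified Go sketch below (not the production job) takes the block numbers reported in collection metadata plus the expected range, and returns the missing sub-ranges that need to be re-collected.

```go
package main

import "fmt"

// BlockRange is a contiguous, inclusive range of missing block heights.
type BlockRange struct {
	From, To uint64
}

// missingRanges compares the block numbers seen in collection metadata against
// the expected [from, to] range and returns every gap as an inclusive range.
func missingRanges(collected []uint64, from, to uint64) []BlockRange {
	seen := make(map[uint64]bool, len(collected))
	for _, n := range collected {
		seen[n] = true
	}

	var gaps []BlockRange
	var open *BlockRange
	for n := from; n <= to; n++ {
		switch {
		case !seen[n] && open == nil:
			open = &BlockRange{From: n, To: n} // start a new gap
		case !seen[n]:
			open.To = n // extend the current gap
		case open != nil:
			gaps = append(gaps, *open) // close the current gap
			open = nil
		}
	}
	if open != nil {
		gaps = append(gaps, *open)
	}
	return gaps
}

func main() {
	// Block numbers reported by the collectors, with gaps.
	collected := []uint64{100, 101, 103, 104, 108}
	for _, gap := range missingRanges(collected, 100, 110) {
		fmt.Printf("re-collect blocks %d..%d\n", gap.From, gap.To)
	}
}
```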

Parallelization and Scale

While it took a couple of months to build all components, the generalized design allows us to onboard a new blockchain, including full historical data, in about a week.

We achieved this by running hundreds of collectors in parallel, orchestrated through Argo Workflows, Kubernetes, and Karpenter. This combination gave us elastic scaling, fault tolerance, and cost efficiency, all with minimal manual intervention.

In the process, we developed a reusable template for parallel processing, now a workhorse across multiple data pipelines at Amberdata. The template can be adapted for any large-scale collection task with only a few key configuration parameters (a sketch of the range split is shown after the list):

  • Processing Range - Total block range to collect

  • Number of Workers - Number of discrete worker pods to run

  • Processor Steps - Collect → Validate → Load
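
As a rough illustration of how the template fans the work out across pods (not the actual Argo Workflows definition), the Go snippet below splits a total block range evenly across a configured number of workers; each resulting sub-range is handed to one worker pod that runs the Collect → Validate → Load steps.

```go
package main

import "fmt"

// WorkerRange is the inclusive block range assigned to a single worker pod.
type WorkerRange struct {
	Worker   int
	From, To uint64
}

// splitRange divides [from, to] into `workers` roughly equal, contiguous chunks.
// Earlier workers absorb the remainder so every block is assigned exactly once.
func splitRange(from, to uint64, workers int) []WorkerRange {
	total := to - from + 1
	base := total / uint64(workers)
	extra := total % uint64(workers)

	ranges := make([]WorkerRange, 0, workers)
	start := from
	for i := 0; i < workers; i++ {
		size := base
		if uint64(i) < extra {
			size++ // spread the remainder over the first `extra` workers
		}
		if size == 0 {
			break // more workers than blocks
		}
		ranges = append(ranges, WorkerRange{Worker: i, From: start, To: start + size - 1})
		start += size
	}
	return ranges
}

func main() {
	// Example: 20M blocks split across 100 workers; show the first few assignments.
	for _, r := range splitRange(0, 19_999_999, 100)[:3] {
		fmt.Printf("worker %d: blocks %d..%d\n", r.Worker, r.From, r.To)
	}
}
```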


Results and Impact

In just a few months, we built a scalable collection engine capable of scanning and indexing an entire blockchain in hours, not months.

Standing up infrastructure for a new chain now takes less than a week, and the cost per chain dropped from ~$50,000 to under $5,000.

Beyond speed and cost, we gained several structural benefits:

  • Comprehensive validation: every block, transaction, and event is accounted for.

  • Full observability: immediate visibility into collection gaps or anomalies.

  • Idempotency: any failed range can be safely reprocessed without affecting downstream datasets.

What was once a manual, error-prone process has evolved into a deterministic, scalable, and reproducible collection framework, a foundation that continues to power Amberdata’s expansion into new blockchains today.

What’s Next

Today we focused on the ingestion portion of the pipeline, but once you can collect the data, what then? You need to transform, enrich, and aggregate what you have collected into a digestible format for customers. That is exactly what we will look at next time with our DEX trades pipeline: it collects raw swaps from diverse protocols across a handful of EVM chains and turns them into a single cross-chain trades dataset, giving customers an overview of the DeFi ecosystem.

Cory VanHooser
