Authors: Zhuolun Xiang, Siyuan Han, Zekun Li, and Alexander Spiegelman

Since the birth of computing technology, engineers and researchers have continuously explored how to push computing resources to their performance limits, striving to maximize efficiency while minimizing the latency of computing tasks. The two pillars of high performance and low latency have always shaped the development of computer science, influencing a wide range of fields from CPUs, FPGAs, and database systems to recent artificial intelligence infrastructures and blockchain systems. In the pursuit of high performance, pipelining technology has become an indispensable means. Since the introduction of pipelining technology with the IBM System/360 in 1964 [1], it has been at the core of high-performance system design, driving key discussions and innovations in the field.

Pipelining technology is not only applied in hardware but also has widespread applications in the database field. For example, Jim Gray introduced the pipelined parallel approach in his book "High-Performance Database Systems" [2]. This method breaks down complex database queries into multiple stages that run simultaneously, thereby improving efficiency and performance. Pipelining technology is also crucial in the field of artificial intelligence, particularly in the widely used deep learning framework TensorFlow. It utilizes data pipelining to parallel process data preprocessing and loading, ensuring a smooth data flow for training and inference, making AI workflows faster and more efficient [3].

Blockchain is no exception. Its core functionality is similar to that of a database, processing transactions and updating states, but it adds the challenge of Byzantine fault-tolerant consensus. The key to improving blockchain throughput (transactions per second) and reducing latency (time to final confirmation) lies in optimizing the interactions of different stages—ordering, execution, submission, and synchronization of transactions—under high load. This challenge is particularly critical in high-throughput scenarios, as traditional designs struggle to maintain low latency.

To explore these concepts, let’s revisit a familiar analogy: the automobile factory. Understanding how assembly lines revolutionized manufacturing helps us grasp the evolution of blockchain pipelining—and why next-generation designs like Zaptos [8] will push blockchain performance to new heights.

From Automobile Factory to Blockchain

Imagine you are the owner of an automobile factory with two main goals:

· Maximize throughput: Assemble as many cars as possible each day.

· Minimize latency: Shorten the construction time for each car.

Now, envision three types of factories:

Simple Factory

In a simple factory, a group of multi-skilled workers assembles a car step by step. One worker assembles the engine, the next installs the wheels, and so on—producing one car at a time.

What’s the problem? Some workers often find themselves waiting, leading to overall low production efficiency because no one is working on different parts of the same car simultaneously.

Ford Factory

Enter the Ford assembly line [4]! Here, each worker focuses on a single task. The car moves along a conveyor belt, and as each car passes, a dedicated worker adds their components.

What’s the result? Multiple cars are simultaneously at different stages of assembly, and all workers are busy. Throughput significantly increases—but each car still needs to pass through each worker sequentially, meaning the latency time for each car remains unchanged.

Magic Factory

Imagine a magic factory where all workers can work on a car simultaneously! No longer needing to move the car from one station to the next, every part of the car is built at the same time.

What’s the outcome? Cars are assembled at record speed, with every step occurring in sync. This is the ideal scenario for solving the throughput and latency problem.

Now, that’s enough about automobile factories—what about blockchain? It turns out that designing high-performance blockchains is not much different from optimizing assembly lines.

Blockchain as an Automobile Factory

In blockchain, processing a block is akin to assembling a car. The analogy is as follows:

· Workers = Validator resources

· Car = A block

· Assembly tasks = Stages such as consensus, execution, and submission

Just as a simple factory processes one car at a time, a blockchain that processes only one block at a time will lead to underutilization of resources. In contrast, modern blockchain designs strive to operate like the Ford assembly line—simultaneously processing different stages of multiple blocks. This is where pipelining technology comes into play.

The Evolution of Blockchain Pipelining

Traditional Architecture: Sequential Blockchain

Imagine a blockchain that processes blocks sequentially. Validators need to:

Receive block proposals.
Execute the block to update the blockchain state.
Continue to reach consensus on that state.
Persist the state to the database.
Start consensus on the next block.

Where’s the problem?

· Execution and submission are on the critical path of the consensus process.

· Each consensus instance must wait for the previous one to complete before starting.

This setup is akin to factories before Ford: workers (resources) often find themselves idle when focusing on just one block (car) at a time. Unfortunately, many existing blockchains still fall into this category, resulting in low throughput and high latency.

Aptos: Parallelized Performance

Diem introduced a pipelined architecture that decouples execution and submission from the consensus stage, while the consensus stage itself also adopts a pipelined design.

· Asynchronous execution and submission [5]: Validators first reach consensus on a block, then execute the block based on the state of the parent block. Once the required number of validators signs off, the state is persisted to storage.

· Pipelined consensus (Jolteon [6]): New consensus instances can start before the previous one completes, similar to a moving assembly line.

This increases throughput by allowing different blocks to be at different stages simultaneously and significantly reduces block time to just 2 message delays. However, the leader-based design of Jolteon may create bottlenecks, as the leader can become overloaded during transaction distribution.

Aptos further optimized the pipeline with Quorum Store [7], a mechanism that decouples data distribution from consensus. Quorum Store no longer relies on a single leader to broadcast large data blocks in the consensus protocol but separates data distribution from metadata ordering, allowing validators to asynchronously and in parallel distribute data. This design leverages the total bandwidth of all validators, effectively eliminating the leader bottleneck in consensus.

Illustration: How Quorum Store balances resource utilization in leader-based consensus protocols.

Thus, the Aptos blockchain has created the "Ford factory" of blockchains. Just as Ford's assembly line revolutionized car production—allowing different stages of different cars to occur simultaneously—Aptos processes different stages of different blocks at the same time. Each validator's resources are fully utilized, ensuring that no part of the process is left waiting. This clever orchestration results in a high-throughput system, making Aptos a powerful platform for efficiently and scalably handling blockchain transactions.

Illustration: Pipelined processing of consecutive blocks in the Aptos blockchain. Validators can pipeline different stages of consecutive blocks to maximize resource utilization and increase throughput.

While throughput is crucial, end-to-end latency—from transaction submission to final confirmation—is equally important. For applications like payments, decentralized finance (DeFi), and gaming, every millisecond counts. Many users have experienced delays during high-traffic events because each transaction must sequentially pass through a series of stages: client-full node-validator communication, consensus, execution, state certification, submission, and full node synchronization. Under high load, stages like execution and full node synchronization can introduce additional latency.

Illustration: The pipelined architecture of the Aptos blockchain. The diagram shows client Ci, full node Fi, and validator Vi. Each box represents a stage that a transaction block must go through from left to right in the blockchain. The pipeline includes five stages: consensus (including distribution and ordering), execution, certification, submission, and full node synchronization.

This is akin to the Ford factory: while the assembly line maximizes overall throughput, each car still needs to pass through each worker sequentially, resulting in longer completion times. To truly push blockchain performance to the limit, we need to create a "magic factory"—where these stages run in parallel.

Zaptos: Towards Optimal Blockchain Latency

Zaptos [8] further reduces latency through three key optimizations, without sacrificing throughput.

· Optimistic execution: Reduces pipeline latency by starting execution immediately upon receiving a block proposal. Validators immediately add the block to the pipeline and speculatively execute it after the parent block completes. Full nodes also perform optimistic execution to verify state proofs after receiving proposals from validators.

· Optimistic submission: Immediately writes the state to storage after block execution—even before state certification. When validators finally certify the state, only minimal updates are needed to complete the submission. If a block is ultimately not ordered, its optimistically submitted state will be rolled back to maintain consistency.

· Fast certification: Validators begin certifying the state of executed blocks in parallel by sending certification messages in the final consensus round, without waiting for consensus to complete. This optimization effectively reduces one round of pipeline latency in common scenarios.

Illustration: The parallel pipelined architecture of Zaptos. Other stages, except for consensus, are effectively hidden within the consensus stage, thereby reducing end-to-end latency.

Through these optimizations, Zaptos effectively hides the latency of other pipeline stages within the consensus stage. Therefore, if the blockchain adopts a consensus protocol with optimal latency, the overall blockchain latency can also reach optimal levels!

Talk is cheap; data speaks

We evaluated the end-to-end performance of Zaptos through geographically distributed experiments, using Aptos as a high-performance baseline. More details can be found in the paper [8].

On Google Cloud, we simulated a globally decentralized network consisting of 100 validators and 30 full nodes, distributed across 10 regions, using commercially available machines similar to those deployed by Aptos.

Throughput-Latency

Illustration: Common performance comparison between Zaptos and Aptos blockchains.

The above figure compares the relationship between end-to-end latency and throughput for the two systems. Both experience a gradual increase in latency as the load increases, with sharp spikes occurring at maximum capacity, but Zaptos consistently shows more stable latency before reaching peak throughput, reducing latency by 160 milliseconds under low load and over 500 milliseconds under high load.

Impressively, Zaptos achieves sub-second latency at 20k TPS in a production-grade mainnet environment—this breakthrough makes real-world applications that require speed and scalability possible.

Latency Breakdown

Illustration: Latency breakdown of the Aptos blockchain.

Illustration: Latency breakdown of Zaptos.

The latency breakdown charts detail the duration of each pipeline stage for validators and full nodes. Key insights include:

· Up to 10k TPS: The overall latency of Zaptos is nearly equivalent to its consensus latency, as the optimistic execution, certification, and optimistic submission stages are effectively "hidden" within the consensus stage.

· Beyond 10k TPS: As the time for optimistic execution and full node synchronization increases, the non-consensus stages become more significant. Nevertheless, Zaptos significantly reduces overall latency by overlapping most stages. For example, at 20k TPS, the baseline total latency is 1.32 seconds (consensus 0.68 seconds, other stages 0.64 seconds), while Zaptos achieves 0.78 seconds (consensus 0.67 seconds, other stages 0.11 seconds).

Conclusion

The evolution of blockchain architecture is akin to the transformation in manufacturing—from simple sequential workflows to highly parallelized assembly lines. Aptos's pipelined approach significantly enhances throughput, while Zaptos takes it a step further by reducing latency to sub-second levels while maintaining high TPS. Just as modern computing architectures leverage parallelism to maximize efficiency, blockchains must continuously optimize their designs to eliminate unnecessary latency. By comprehensively optimizing the blockchain pipeline for minimal latency, Zaptos paves the way for real-world blockchain applications that require speed and scale.

References

[1] Gene M. Amdahl, Gerrit A. Blaauw, and Frederick P. Brooks. 1964. "Architecture of the IBM System/360." IBM Journal of Research and Development. https://doi.org/10.1147/rd.82.0087

[2] David DeWitt, and Jim Gray. 1992. "Parallel Database Systems: The Future of High Performance Database Systems." Communications of the ACM. https://doi.org/10.1145/129888.129894

[3] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin et al. 2016. "TensorFlow: a System for Large-Scale Machine Learning." In 12th USENIX symposium on operating systems design and implementation (OSDI). https://arxiv.org/abs/1605.08695

[4] The Moving Assembly Line and the Five-Dollar Workday. https://corporate.ford.com/articles/history/moving-assembly-line.html

[5] Zekun Li, and Yu Xia. 2021. DIP-213 - Decoupled Execution. https://github.com/diem/dip/blob/7dc44ee57bb7efe76559f05dcc6851d97e2d3149/dips/dip-213.md

[6] Rati Gelashvili, Lefteris Kokoris-Kogias, Alberto Sonnino, Alexander Spiegelman, and Zhuolun Xiang. 2022. "Jolteon and Ditto: Network-Adaptive Efficient Consensus with Asynchronous Fallback." In International conference on financial cryptography and data security (FC). https://arxiv.org/abs/2106.10362

[7] Quorum Store: How Consensus Horizontally Scales on the Aptos Blockchain. https://medium.com/aptoslabs/quorum-store-how-consensus-horizontally-scales-on-the-aptos-blockchain-988866f6d5b0

[8] Zhuolun Xiang, Zekun Li, Balaji Arun, Teng Zhang, and Alexander Spiegelman. 202 2025. "Zaptos: Towards Optimal Blockchain Latency." arXiv preprint arXiv:2501.10612. https://arxiv.org/abs/2501.10612

免责声明：本文章仅代表作者个人观点，不代表本平台的立场和观点。本文章仅供信息分享，不构成对任何人的任何投资建议。用户与作者之间的任何争议，与本平台无关。如网页中刊载的文章或图片涉及侵权，请提供相关的权利证明和身份证明发送邮件到support@aicoin.com，本平台相关工作人员将会进行核查。

How to build a high-performance blockchain