a16z: How to Achieve a Secure and Efficient zkVM in Phases (A Must-Read for Developers)


The original text is from a16zcrypto

Compiled by Odaily Planet Daily Golem (@web3_golem)


zkVMs (zero-knowledge virtual machines) promise to "democratize SNARKs," allowing anyone (even those without specialized SNARK expertise) to prove that they correctly executed an arbitrary program on a given input (or witness). Their core strength is developer experience, but today they face major challenges in both security and performance. For zkVMs to fulfill their promise, designers must overcome these challenges. In this article, I outline the likely phases of zkVM development, which will take years to complete.

Challenges

On security, a zkVM is a highly complex software project that is still riddled with vulnerabilities. On performance, proving that a program executed correctly can be hundreds of thousands of times slower than running it natively, which keeps most applications from being deployable in the real world today.

Despite these very real challenges, most of the blockchain industry portrays zkVMs as ready to deploy today. Indeed, some projects are already paying significant compute costs to generate proofs of on-chain activity. But because zkVMs are still imperfect, this is merely an expensive way of pretending a system is protected by SNARKs, when in reality it is either protected by permissioning or, worse, exposed to attack.

We are years away from a secure and performant zkVM. This article proposes a series of staged, concrete goals for tracking zkVM progress, goals that can cut through the hype and help the community focus on real advances.

Security Phases

A SNARK-based zkVM typically combines two main components:

  • Polynomial Interactive Oracle Proof (PIOP): An interactive proof framework for proving statements about polynomials (or constraints derived from them).

  • Polynomial Commitment Scheme (PCS): Ensures that the prover cannot lie about polynomial evaluations without being detected.

A zkVM essentially encodes valid execution traces as a constraint system, roughly meaning that it forces the virtual machine to use its registers and memory correctly, and then applies a SNARK to prove that those constraints are satisfied.
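To make this concrete, here is a minimal sketch in Rust of what "an execution trace satisfying transition constraints" means. The toy two-register machine and its single instruction are inventions for illustration, not any real zkVM's design; a real zkVM encodes such constraints as polynomial identities and proves them with a PIOP and PCS rather than checking them directly.

```rust
// Toy illustration only: a 2-register machine whose single instruction is
// "r0 <- r0 + r1; pc <- pc + 1". A valid execution trace is one where every
// consecutive pair of rows satisfies the transition constraint below.

#[derive(Clone, Copy)]
struct Row {
    pc: u64, // program counter
    r0: u64, // register 0
    r1: u64, // register 1
}

/// Transition constraint between one step of execution and the next.
fn transition_holds(cur: Row, next: Row) -> bool {
    next.r0 == cur.r0.wrapping_add(cur.r1) && next.r1 == cur.r1 && next.pc == cur.pc + 1
}

/// A trace is valid if every adjacent pair of rows satisfies the constraint.
fn trace_is_valid(trace: &[Row]) -> bool {
    trace.windows(2).all(|w| transition_holds(w[0], w[1]))
}

fn main() {
    let trace = [
        Row { pc: 0, r0: 1, r1: 2 },
        Row { pc: 1, r0: 3, r1: 2 },
        Row { pc: 2, r0: 5, r1: 2 },
    ];
    // A SNARK proves this predicate succinctly; here we simply evaluate it.
    println!("trace satisfies constraints: {}", trace_is_valid(&trace));
}
```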

The only way to ensure that a system as complex as a zkVM is free of bugs is formal verification. Below is a breakdown of the security phases. Phase 1 focuses on a correct protocol, while Phases 2 and 3 focus on a correct implementation.

Security Phase 1: Correct Protocol

  1. A formally verified proof of the PIOP's soundness;

  2. A formally verified proof that the PCS is secure under the relevant cryptographic assumptions or idealized models;

  3. If Fiat-Shamir is used, a formally verified proof that the succinct argument obtained by combining the PIOP with the PCS is secure in the random oracle model (augmented as needed with other cryptographic assumptions);

  4. A formally verified proof that the constraint system the PIOP is applied to is equivalent to the semantics of the VM;

  5. A comprehensive "gluing together" of all of the above pieces into a single, formally verified proof that the resulting SNARK is secure for running any program specified by the VM's bytecode. If the protocol is intended to be zero-knowledge, this property must also be formally verified, so that no sensitive information about the witness leaks.

A caveat on recursion: if the zkVM uses recursion, every PIOP, commitment scheme, and constraint system involved anywhere in that recursion must be verified before this phase can be considered complete.
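To give a sense of what a formally verified soundness proof is about, here is a minimal Lean 4 sketch of the shape such a statement can take. Every name and type below is a placeholder invented for illustration; a real formalization must also account for soundness error, adversary resources, and the random oracle.

```lean
-- Placeholder types for statements, witnesses, and proofs.
variable {Stmt Wit Proof : Type}

-- The relation the VM's constraint system is meant to capture:
-- "witness w attests that statement x is true".
variable (R : Stmt → Wit → Prop)

-- The verifier, modeled as a Boolean predicate on (statement, proof).
variable (verify : Stmt → Proof → Bool)

/-- Idealized soundness: whenever the verifier accepts, a valid witness exists. -/
def Sound : Prop :=
  ∀ (x : Stmt) (π : Proof), verify x π = true → ∃ w, R x w
```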

Security Phase 2: Correct Verifier Implementation

Formal verification that an actual implementation of the zkVM verifier (in Rust, Solidity, etc.) matches the protocol verified in Phase 1. Achieving this ensures that the implemented protocol is sound, and not merely a theoretical design or an inefficient specification written in, say, Lean.

Phase 2 focuses solely on the verifier implementation (rather than the prover) for two reasons. First, a correct verifier suffices for soundness: it guarantees that the verifier cannot be convinced that a false statement is true. Second, a zkVM verifier implementation is an order of magnitude simpler than the prover implementation.

Security Phase 3: Correct Prover Implementation

Formal verification that an actual implementation of the zkVM prover correctly generates proofs for the proof system verified in Phases 1 and 2. This ensures completeness: any system using the zkVM cannot get stuck with true statements that it is unable to prove. If the prover is intended to be zero-knowledge, this property must also be formally verified.

Expected Timeline

  • Phase 1 Progress: We can expect incremental achievements next year (e.g., ZKLib). However, no zkVM will fully meet the requirements of Phase 1 for at least two years;

  • Phases 2 and 3: These phases can advance in parallel with aspects of Phase 1. For example, some teams have already shown that an implementation of the Plonk verifier matches the protocol in the paper (even though the protocol in the paper may itself not be fully verified). Even so, I do not expect any zkVM to reach Phase 3 in less than four years, and possibly longer.

Key Considerations: Fiat-Shamir Security and Verified Bytecode

A major complicating factor is the open research questions surrounding the security of the Fiat-Shamir transformation. All three phases treat Fiat-Shamir and random oracles as unimpeachably secure, but in reality the entire paradigm may harbor vulnerabilities, owing to the gap between the idealized random oracle and the hash functions used in practice. In the worst case, a system that has reached Phase 2 could later turn out to be completely insecure because of a Fiat-Shamir problem. This deserves serious concern and ongoing research; we may need to modify the transformation itself to better guard against such vulnerabilities.
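For readers unfamiliar with the transformation, the sketch below shows the basic idea in Rust: the interactive verifier's random challenge is replaced by a hash of the transcript so far. The transcript contents are made up and the example assumes the third-party sha2 crate; the security analyses above model this hash as a random oracle, which is exactly the idealization at issue.

```rust
// Illustrative Fiat-Shamir sketch, not any particular zkVM's transcript format.
// Assumes the `sha2` crate as a Cargo dependency.

use sha2::{Digest, Sha256};

/// Derive the next "verifier" challenge by hashing everything sent so far.
fn fiat_shamir_challenge(transcript: &[u8]) -> [u8; 32] {
    let mut hasher = Sha256::new();
    hasher.update(transcript);
    hasher.finalize().into()
}

fn main() {
    // The prover appends its first message (e.g., polynomial commitments)...
    let mut transcript: Vec<u8> = b"prover message 1: polynomial commitments".to_vec();

    // ...and derives the challenge itself instead of waiting for a verifier.
    // A malicious prover can grind over its own messages, which is why proving
    // this step secure (Phase 1, item 3) is subtle.
    let challenge = fiat_shamir_challenge(&transcript);
    transcript.extend_from_slice(&challenge);

    println!("derived challenge (first 8 bytes): {:02x?}", &challenge[..8]);
}
```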

Non-recursive systems are theoretically more robust because certain known attacks involve circuits similar to those used in recursive proofs.

Another caveat: a proof that a program executed correctly (as specified by its bytecode) is of limited value if the bytecode itself is buggy. The practicality of zkVMs therefore depends heavily on methods for producing formally verified bytecode, which is itself a huge challenge beyond the scope of this article.

About Post-Quantum Security

For at least the next five years (and likely longer), quantum computers will not pose a serious threat, whereas vulnerabilities are an existential risk. So the primary focus now should be on meeting the security and performance phases discussed in this article. If non-quantum-secure SNARKs allow us to meet those phases faster, we should use them until post-quantum SNARKs catch up, or until there is serious concern that cryptographically relevant quantum computers are imminent.

Current Performance of zkVM

Today, zkVM provers incur an overhead of roughly one million times the cost of native execution: if a program takes X cycles to run, proving it ran correctly costs roughly X times one million CPU cycles. This was true a year ago and remains true today.

Popular narratives often describe this overhead in ways that sound acceptable. For example:

  • "The cost of generating proofs for all Ethereum mainnet activities is less than a million dollars per year."

  • "We can almost generate Ethereum block proofs in real-time using a cluster of dozens of GPUs."

  • "Our latest zkVM is 1000 times faster than its predecessor."

While these statements are technically accurate, they can be misleading without proper context. For example:

  • Being 1,000 times faster than an older zkVM still leaves the absolute speed very slow; it says more about how bad things were than about how good they now are.

  • There are proposals to increase the amount of computation handled by the Ethereum mainnet by a factor of 10, which would leave today's zkVM performance even further behind.

  • What people call "near real-time proofs for Ethereum blocks" is still far slower than what many blockchain applications need (for example, Optimism's 2-second block time is much faster than Ethereum's 12-second block time).

  • Requiring dozens of GPUs to run continuously without error does not meet acceptable liveness guarantees.

  • Spending under a million dollars a year to prove all Ethereum mainnet activity should be set against the fact that a full Ethereum node spends only about $25 a year on that computation.

For applications outside of blockchains, this overhead is plainly too high; no amount of parallelization or engineering can absorb it. A reasonable baseline is that zkVM proving should be no more than 100,000 times slower than native execution, and even that is only a first step. Truly mainstream adoption will likely require overhead closer to 10,000x or lower.

How to Measure Performance

SNARK performance has three main components:

  • The inherent efficiency of the underlying proof system.

  • Application-specific optimizations (e.g., precompilation).

  • Engineering and hardware acceleration (e.g., GPUs, FPGAs, or multi-core CPUs).

While the latter two are crucial for practical deployment, they apply to any proof system and so do not reflect its inherent overhead. For example, adding GPU acceleration and precompilation to a zkEVM can easily make it 50 times faster than a purely CPU-based approach without precompilation, which is enough to make a fundamentally less efficient system look superior to one that simply has not been polished in the same way.

The discussion below therefore focuses on SNARK performance without specialized hardware or precompilation. This differs from current benchmarking practice, which often collapses all three factors into a single "headline number," and is akin to judging a diamond by its polishing time rather than its intrinsic clarity. The goal is to isolate the inherent overhead of the general-purpose proof system, helping the community strip away confounding factors and focus on real progress in proof system design.

Performance Phases

Here are five performance milestones. First, prover overhead on CPUs must fall by several orders of magnitude; only then should attention shift to further reductions through hardware. Memory usage must improve as well.

In all the phases below, developers should not have to tailor their code to the zkVM to reach the required performance. Developer experience is the main advantage of zkVMs; sacrificing DevEx to hit performance benchmarks would defeat the purpose of having a zkVM in the first place.

These metrics focus on prover costs. However, if verifier costs are unbounded (i.e., there is no limit on proof size or verification time), any prover metric can be met trivially. So for a system to meet any of the phases below, it must also stay within stated maximums for proof size and verification time.

Performance Requirements

Phase 1 Requirement ("Reasonable, Non-Trivial Verification Costs"):

  • Proof Size: The proof size must be smaller than the witness size.

  • Verification Time: Verifying the proof must be no slower than running the program natively (i.e., simply executing the computation without any correctness proof).

These are minimal succinctness requirements. They ensure that proof size and verification time are no worse than sending the witness to the verifier and having the verifier check its correctness directly.

Phase 2 and Beyond Requirements:

  • Maximum Proof Size: 256 KB.

  • Maximum Verification Time: 16 milliseconds.

These cutoffs are deliberately generous, to accommodate new kinds of fast proving techniques that may come with higher verification costs. At the same time, they rule out proofs so expensive that few projects would be willing to put them on a blockchain.
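As a tiny illustration, the sketch below encodes both sets of verification-cost requirements as checks in Rust. The struct, field names, and the sample figures in main() are assumptions chosen for illustration, not measurements of any real system.

```rust
// Illustrative sketch only: the verification-cost requirements as simple checks.

struct VerifierCosts {
    witness_bytes: u64, // size of the witness
    proof_bytes: u64,   // size of the proof
    native_run_ms: f64, // time to run the program natively, in ms
    verify_ms: f64,     // time to verify the proof, in ms
}

/// Phase 1: proof smaller than the witness, verification no slower than native execution.
fn meets_phase1(c: &VerifierCosts) -> bool {
    c.proof_bytes < c.witness_bytes && c.verify_ms <= c.native_run_ms
}

/// Phase 2 and beyond: proof at most 256 KB, verification at most 16 ms.
fn meets_phase2_plus(c: &VerifierCosts) -> bool {
    c.proof_bytes <= 256 * 1024 && c.verify_ms <= 16.0
}

fn main() {
    let c = VerifierCosts {
        witness_bytes: 4 << 20,  // hypothetical 4 MiB witness
        proof_bytes: 180 * 1024, // hypothetical 180 KiB proof
        native_run_ms: 40.0,
        verify_ms: 12.0,
    };
    println!("phase 1 verification requirement:  {}", meets_phase1(&c));
    println!("phase 2+ verification requirement: {}", meets_phase2_plus(&c));
}
```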

Speed Phase 1

Single-threaded proving must be at most 100,000 times slower than native execution, measured across a range of applications (not just Ethereum block proving), and without relying on precompilation.

Concretely, imagine a RISC-V processor running at about 3 billion cycles per second on a modern laptop. Reaching Phase 1 means being able to prove about 30,000 RISC-V cycles per second (single-threaded) on that same laptop, while keeping verification costs "reasonable and non-trivial" as defined above.
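The back-of-the-envelope arithmetic below, using that 3-billion-cycles-per-second laptop figure, shows how the overhead factors discussed in this article translate into provable cycles per second; nothing here is a benchmark.

```rust
// Overhead factor -> provable cycles per second, for a ~3 GHz-equivalent laptop.

fn provable_cycles_per_sec(native_cycles_per_sec: f64, overhead_factor: f64) -> f64 {
    native_cycles_per_sec / overhead_factor
}

fn main() {
    let native = 3.0e9; // ~3 billion RISC-V cycles per second natively

    // Roughly today's overhead (~1,000,000x): about 3,000 proven cycles/second.
    println!("today (~1e6x):  {:>9.0} cycles/s", provable_cycles_per_sec(native, 1.0e6));
    // Speed Phase 1 (100,000x): about 30,000 proven cycles/second.
    println!("phase 1 (1e5x): {:>9.0} cycles/s", provable_cycles_per_sec(native, 1.0e5));
    // Speed Phase 2 (10,000x): about 300,000 proven cycles/second.
    println!("phase 2 (1e4x): {:>9.0} cycles/s", provable_cycles_per_sec(native, 1.0e4));
}
```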

Speed Phase 2

Single-threaded proofs must be at most 10,000 times slower than native execution.

Alternatively, because some promising SNARK approaches (especially those based on binary fields) are poorly served by today's CPUs and GPUs, a system can qualify for this phase via FPGAs (or even ASICs), by comparing two quantities:

  • The number of RISC-V cores that a single FPGA can simulate at native speed;

  • The number of FPGAs required to simulate and prove (nearly) real-time execution of RISC-V.

If the latter is at most 10,000 times the former, the system qualifies for Phase 2. Proof size must still be at most 256 KB, and verification time on a standard CPU at most 16 milliseconds.
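The sketch below spells out this alternative criterion as a single check; the two input numbers are hypothetical placeholders for the quantities to be measured.

```rust
// Illustrative sketch of the FPGA-based criterion for Speed Phase 2.

/// cores_per_fpga: RISC-V cores one FPGA can simulate at (near-)native speed.
/// fpgas_to_prove: FPGAs needed to prove that execution in (near) real time.
fn meets_speed_phase2_fpga(cores_per_fpga: u64, fpgas_to_prove: u64) -> bool {
    // Qualifies if proving needs at most 10,000x more FPGAs than simulating.
    fpgas_to_prove <= 10_000 * cores_per_fpga
}

fn main() {
    let cores_per_fpga = 4; // hypothetical
    let fpgas_to_prove = 25_000; // hypothetical
    println!(
        "qualifies for Speed Phase 2 (FPGA criterion): {}",
        meets_speed_phase2_fpga(cores_per_fpga, fpgas_to_prove)
    );
}
```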

Speed Phase 3

In addition to achieving Speed Phase 2, proving overhead of under 1,000x is achieved across a wide range of applications by using automatically synthesized, formally verified precompilation. In essence, the instruction set is dynamically customized for each program to speed up proving, but in a developer-friendly and formally verifiable way.

Memory Phase 1

Speed Phase 1 is achieved with the prover using less than 2 GB of memory (while also achieving zero-knowledge).

This is crucial for many mobile devices and browsers, and it opens up countless client-side zkVM use cases. Client-side proving matters because our phones are our constant link to the real world: they track our location, credentials, and so on. If generating a proof requires more than 1 to 2 GB of memory, that is simply too much for most mobile devices today. Two points of clarification:

  • The 2 GB limit must hold for large statements (those that take trillions of CPU cycles to run natively). Proof systems that stay within the space limit only for small statements are not broadly applicable.

  • If the prover is very slow, it is easy to keep the prover's space below 2 GB of memory. Therefore, to make Memory Phase 1 non-trivial, I require that Speed Phase 1 be met within the 2 GB space limit.

Memory Phase 2

Speed Phase 1 is achieved with the prover using less than 200 MB of memory (10 times better than Memory Phase 1).

Why push below 2 GB? Consider a non-blockchain example: every time you visit a website over HTTPS, you download certificates for authentication and encryption. Instead, the site could send zk proofs of possessing those certificates. A large site might issue millions of such proofs per second; if each proof takes 2 GB of memory to generate, that adds up to petabytes of RAM in total. Pushing memory usage down further is crucial for deployments outside of blockchains.
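The arithmetic below makes that estimate explicit. The number of concurrent proofs is a hypothetical stand-in for "millions of proofs per second"; the point is only the order of magnitude.

```rust
// RAM needed if many proofs are generated concurrently, each with its own prover.

fn total_ram_pb(concurrent_proofs: f64, gb_per_proof: f64) -> f64 {
    concurrent_proofs * gb_per_proof / 1.0e6 // 1 PB = 1,000,000 GB (decimal units)
}

fn main() {
    let concurrent_proofs = 1.0e6; // hypothetical: ~a million proofs in flight at once

    // At 2 GB per prover (Memory Phase 1): about 2 PB of RAM.
    println!("2 GB per proof   -> {:.1} PB", total_ram_pb(concurrent_proofs, 2.0));
    // At 200 MB per prover (Memory Phase 2): about 0.2 PB of RAM.
    println!("0.2 GB per proof -> {:.1} PB", total_ram_pb(concurrent_proofs, 0.2));
}
```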

Precompilation: The Last Mile or a Crutch?

In zkVM design, a precompilation is a dedicated SNARK (or constraint system) tailored to a specific function, such as Keccak or SHA hashing, or the elliptic curve group operations used in digital signatures. On Ethereum, where most of the heavy lifting is Merkle hashing and signature checks, a few hand-crafted precompilations can reduce prover overhead. But relying on them as a crutch will not get SNARKs where they need to be, for the following reasons:

  • Still Too Slow for Most Applications (Inside and Outside Blockchains): Even with hashing and signature precompilations, current zkVMs are still too slow, whether in a blockchain setting or outside one, because the core proof system is inefficient.

  • Security Failures: Handwritten precompilations that have not been formally verified are almost certainly riddled with errors, potentially leading to catastrophic security failures.

  • Poor Developer Experience: In most current zkVMs, adding a new precompilation means hand-writing a constraint system for each function, essentially reverting to a 1960s-style workflow. Even with existing precompilations, developers must refactor their code to call each one. We should be optimizing for security and developer experience, not sacrificing both for incremental performance gains; doing so only shows that performance is not yet where it needs to be.

  • I/O Overhead and No RAM: While precompilations can speed up cryptography-heavy tasks, they may not deliver meaningful acceleration for more diverse workloads, because passing inputs and outputs to them is costly and they cannot use RAM. Even within blockchains, as soon as you move beyond monolithic L1s like Ethereum (for example, to build a series of cross-chain bridges), you face different hash functions and signature schemes. Writing precompilation after precompilation for the same problem does not scale and carries enormous security risk.

For all these reasons, our top priority should be improving the efficiency of the underlying zkVM. The technology that produces the best zkVMs will also produce the best precompilations. I do believe precompilations will remain essential in the long run, but only once they are automatically synthesized and formally verified. That way the developer-experience advantages of zkVMs are preserved while catastrophic security risks are avoided. This view is reflected in Speed Phase 3.

Expected Timeline

I expect a few zkVMs to achieve Speed Phase 1 and Memory Phase 1 later this year. I believe we will also achieve Speed Phase 2 within the next two years, but it is currently unclear whether we can achieve this without some new ideas that have yet to emerge. I anticipate that the remaining phases (Speed Phase 3 and Memory Phase 2) will take several years to realize.

Summary

Although this article lays out the zkVM security and performance phases separately, the two are not fully independent. As more vulnerabilities are discovered in zkVMs, some of them will likely be fixable only at a significant cost in performance. Performance work should therefore take a back seat until a zkVM reaches Security Phase 2.

zkVMs have the potential to truly democratize zero-knowledge proofs, but they are still in their infancy, beset by security challenges and massive performance overhead. Hype and marketing make real progress hard to assess. By laying out clear security and performance milestones, I hope to provide a roadmap that cuts through the distractions. We will get there, but it will take time and sustained effort.

