Summary of the First Sui Mainnet Outage

CN

When issues arise, the Sui engineering team quickly diagnoses the problem and releases a fix, which validators then deploy, minimizing network downtime.

Event Overview

Between approximately 1:15 AM and 3:45 AM Pacific Time on November 21, 2024 (5:15 PM to 7:45 PM UTC+8 on November 21, 2024), the Sui mainnet stalled completely. All validators entered a crash loop, which halted all transaction processing.

Cause of the Problem

An assertion in the congestion control code failed: when a transaction's estimated execution cost was zero, the assertion caused validators to crash. The crash required both of the following conditions:

  1. Congestion control is running in TotalGasBudgetWithCap mode:
      • This mode was briefly enabled in protocol version 63 and then reverted, before being re-enabled in protocol version 68 with the cumulative scheduler.
  2. The network receives a transaction that meets both of the following conditions:
      • A mutable shared object input
      • Zero MoveCall commands

When the network received such a transaction, every validator crashed immediately.
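The trigger conditions listed above can be sketched as a single predicate. This is only an illustration: the `CongestionMode`, `Command`, and `Tx` types and the `matches_crash_conditions` function are hypothetical, simplified stand-ins, not the actual Sui types or code.

```rust
/// Hypothetical, simplified congestion control modes (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CongestionMode {
    Other,
    TotalGasBudgetWithCap,
}

/// Hypothetical, simplified transaction commands (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Command {
    MoveCall,
    TransferObjects,
    SplitCoins,
}

/// Hypothetical, simplified view of a transaction.
pub struct Tx {
    /// True if the transaction takes at least one shared object mutably.
    pub has_mutable_shared_input: bool,
    /// The commands the transaction executes, in order.
    pub commands: Vec<Command>,
}

/// Returns true when all conditions described in the post-mortem hold:
/// the mode is TotalGasBudgetWithCap, the transaction has a mutable shared
/// object input, and it contains zero MoveCall commands.
pub fn matches_crash_conditions(mode: CongestionMode, tx: &Tx) -> bool {
    let move_calls = tx
        .commands
        .iter()
        .filter(|c| **c == Command::MoveCall)
        .count();

    mode == CongestionMode::TotalGasBudgetWithCap
        && tx.has_mutable_shared_input
        && move_calls == 0
}
```

Under these assumptions, a transaction that only transfers objects while mutably touching a shared object matches the conditions, whereas the same transaction with a single MoveCall command does not.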

What Is Congestion Control?

The Sui network's object-based architecture supports large-scale parallel processing of different users' transactions, which most other networks cannot achieve. However, if multiple transactions write to the same shared object, those transactions must be executed sequentially, so the throughput of transactions involving that object is limited.

The congestion control system limits the rate of transactions that write to the same shared object, which prevents long-running checkpoints from overloading the network.
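The idea of limiting writes per shared object can be illustrated with a small per-object budget tracker. This is a minimal sketch under stated assumptions, not Sui's implementation: the `PerObjectBudget` type, the `try_schedule` method, the string object IDs, and the single fixed `limit` are all hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical tracker: accumulated estimated cost per shared object
/// within one scheduling window (for example, one commit).
pub struct PerObjectBudget {
    /// Maximum accumulated cost allowed per object in the window.
    limit: u64,
    /// Cost already scheduled against each object ID.
    used: HashMap<String, u64>,
}

impl PerObjectBudget {
    pub fn new(limit: u64) -> Self {
        Self { limit, used: HashMap::new() }
    }

    /// Schedules a transaction if every shared object it writes to still
    /// has room for `cost`. Returns false (the transaction is deferred)
    /// without recording anything if any object would exceed the limit.
    pub fn try_schedule(&mut self, shared_objects: &[&str], cost: u64) -> bool {
        let fits = shared_objects.iter().all(|id| {
            let used = self.used.get(*id).copied().unwrap_or(0);
            used.saturating_add(cost) <= self.limit
        });
        if !fits {
            return false;
        }
        for id in shared_objects {
            let entry = self.used.entry((*id).to_string()).or_insert(0);
            *entry = (*entry).saturating_add(cost);
        }
        true
    }
}
```

With a limit of 100, two transactions of cost 60 on the same object cannot both be scheduled in one window, while a transaction of cost 60 on a different object is unaffected. That is the parallelism property the section describes.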

We recently upgraded the congestion control system to improve the utilization of shared objects by estimating transaction complexity more accurately. However, the code for the new TotalGasBudgetWithCap mode contained a bug, which led to this outage.

How Was the Problem Resolved?

Once the problem was identified, the code fix was straightforward (see PR #20365). The fix has been deployed to the mainnet (v1.37.4) and testnet (v1.38.1).

PR #20365: Modified bump_object_execution_cost to use saturating addition and to allow zero-cost transactions.
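The difference between the failing behavior and the fix can be sketched as follows. The actual function body is not shown in this summary, so both functions below are hypothetical reconstructions written only to illustrate the two ideas named above: dropping the non-zero assertion and using saturating addition.

```rust
use std::collections::HashMap;

/// Hypothetical sketch of the pre-fix behavior: a zero estimated cost
/// trips the assertion and the process panics.
pub fn bump_cost_strict(costs: &mut HashMap<String, u64>, object_id: &str, cost: u64) {
    assert!(cost > 0, "estimated execution cost must be non-zero");
    let entry = costs.entry(object_id.to_string()).or_insert(0);
    *entry += cost;
}

/// Hypothetical sketch of the post-fix behavior: zero-cost transactions
/// are accepted, and the running total saturates at u64::MAX instead of
/// overflowing.
pub fn bump_cost_saturating(costs: &mut HashMap<String, u64>, object_id: &str, cost: u64) {
    let entry = costs.entry(object_id.to_string()).or_insert(0);
    *entry = (*entry).saturating_add(cost);
}
```

In this sketch, calling `bump_cost_saturating` with a cost of zero simply leaves the object's total unchanged, whereas `bump_cost_strict` would panic on the same input.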

🌟 Mainnet v1.37.4: https://github.com/MystenLabs/sui/releases

Thanks to the validator community's prompt response, the Sui network returned to normal only 15 minutes after the fix was released.

What Did We Learn?

  • The incident detection and response system worked well: automatic alerts and community reports arrived almost simultaneously, allowing us to quickly mobilize the team for diagnosis and repair.

  • The validator community performed excellently: the Sui network returned to normal almost immediately after the fix was released.

Preventive Measures

  1. Improve testing: add more adversarial transaction types, similar to the one that triggered this crash, to uncover potential issues.

  2. Optimize the build process: speed up the building of debug and release binaries to further reduce incident response time. Part of the downtime during this outage was spent waiting for the release build.

Thanks to the support of the community and the validators, the Sui network recovered rapidly!

