Incident Report: Mumbai 2 user-activated protocol override

TL;DR The updated Mumbai upgrade proposal, Mumbai 2, patched a vulnerability that could potentially halt block production on the Tezos network. No funds were at risk.

On March 7th 2023, we announced Mumbai 2, a patched version of the Mumbai protocol proposal addressing a liveness vulnerability witnessed on the Ghostnet test network. In this report, we revisit the events as they occurred and the decisions taken in response to this incident. But first, we provide a short summary:

An “inconsistent hash” error was reported on Ghostnet level #2,022,087.
The issue was traced back to an inconsistency in the Tezos protocol cache, which, for one deployed smart contract, stored different representations of certain values depending on node uptime.
The network was safe: at worst, this issue would affect network liveness (i.e. block production would stall or stop), but it would not lead to an inconsistent ledger state.
Mumbai 2: A patched version of the Mumbai protocol proposal was published on March 7th.
Mumbai 2 activated successfully on Tezos Mainnet on block level #3,268,609.
There was no evidence that this issue was exploited, or that anyone attempted to exploit it, on Ghostnet or Tezos Mainnet.

Incident discovery

On February 21st 2023, we observed that several nodes in the Ghostnet test network reported an “inconsistent hash” error message for block proposals for level #2,022,087.

Our investigation identified an issue in the way Michelson Lambdas are stored in the economic protocol’s cache, which manifested as a divergence between different nodes in the cache entry for a deployed contract, KT1Ja7Cq1HUTzmk1Qh8iERrEzp1LCjRXvqei.

This divergence was not the result of a bug in the smart contract or in the Michelson interpreter, but rather of a difference in runtime behavior: depending on their uptime and recent activity, some nodes would store an “unoptimized”, human-readable version of the Lambda’s argument, while others would store an optimized, byte-based representation.
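To make the failure mode concrete, here is a toy model in Python. It is only a sketch: the two byte strings are hypothetical stand-ins rather than real Michelson encodings, and it assumes the stored representation ends up in whatever gets hashed. Under that assumption, two nodes holding the same logical value in different forms compute different hashes and reject each other's blocks.

```python
import hashlib

# Hypothetical stand-ins for the two forms a cached value could take.
# Neither is a real Michelson encoding.
READABLE = b"tz1-example-address"        # "unoptimized", human-readable form
OPTIMIZED = bytes.fromhex("00a1b2c3d4")  # compact byte form of the same value


def state_hash(cache_entry: bytes) -> str:
    """Hash the stored representation, as a toy proxy for a resulting state hash."""
    return hashlib.sha256(cache_entry).hexdigest()


# One node kept the readable form, another holds the optimized form.
long_running_node = state_hash(READABLE)
restarted_node = state_hash(OPTIMIZED)

print(long_running_node == restarted_node)  # False
```

Both nodes agree on what the value means, yet they disagree on the hash, which is the kind of mismatch an “inconsistent hash” error reports.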

Risk assessment and mitigation

This issue threatened network liveness, as it risked dividing nodes depending on the values stored in their protocol caches. If each side accounted for more than one third of the attestation power but less than two thirds, block production would halt. It’s worth highlighting that under Tenderbake consensus rules, there was no risk of a diverging network split where each fork progressed separately¹.

In this worst-case scenario, the ledger state would remain safe, but not live.

Even so, it was quickly noted that should that scenario occur, there would be a relatively straightforward path to recovery. It would suffice to reboot enough nodes holding the unoptimized value in the cache for them to be updated with the optimized value, and get the network unstuck.
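The arithmetic behind the stall and the recovery can be sketched as follows. This is an illustration, not Tenderbake's implementation: it assumes a block needs strictly more than two thirds of the attestation power, and the function names and power shares are hypothetical.

```python
from fractions import Fraction

# Assumption for this sketch: a block needs strictly more than 2/3
# of the total attestation power behind it.
QUORUM = Fraction(2, 3)


def can_progress(optimized_share: Fraction) -> bool:
    """True if either side of a two-way split holds more than the quorum."""
    unoptimized_share = 1 - optimized_share
    return max(optimized_share, unoptimized_share) > QUORUM


def reboot_until_live(unoptimized_powers, optimized_share):
    """Reboot unoptimized nodes, largest first, until one side reaches quorum.

    Returns the number of reboots and the resulting optimized share.
    """
    rebooted = 0
    for power in sorted(unoptimized_powers, reverse=True):
        if can_progress(optimized_share):
            break
        optimized_share += power  # a rebooted node reloads the optimized value
        rebooted += 1
    return rebooted, optimized_share


# Hypothetical split: 55% of the power on the optimized side, 45% on the other.
start = Fraction(55, 100)
print(can_progress(start))  # False: neither side exceeds two thirds

unoptimized = [Fraction(20, 100), Fraction(15, 100), Fraction(10, 100)]
print(reboot_until_live(unoptimized, start))  # (1, Fraction(3, 4))
```

In this made-up split, rebooting a single node holding 20% of the power is enough to bring the optimized side to 75% and unblock the network.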

Despite this, and despite the lack of evidence of any Byzantine motivation behind the deployment of this contract, we decided to treat this issue as a 0-day vulnerability and worked on fixing it quickly and silently, for two reasons:

The vulnerability had been witnessed on Ghostnet, a public test network, and nothing prevented it from happening again on the same test network, or on Mainnet if an exploit were derived from it.
The information necessary to weaponize this bug was public, and using it did not require a deep understanding of the core issue: it only required deploying and interacting with a similar contract on Mainnet.

We also investigated various ways of further mitigating the problem via an Octez shell update, but none were found to be satisfactory. Thus, we had no option but to address this issue at its core and modify the way the Tezos economic protocol interacts with the cache: when updating an entry, values should always be normalized to the optimized byte representation.
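The idea of normalizing on update can be illustrated with a minimal sketch. It is not the actual protocol patch: the `NormalizingCache` class, the `normalize` helper, and the use of hex text as the “readable” form are all hypothetical choices made for this example.

```python
from typing import Dict, Union

Value = Union[str, bytes]


def normalize(value: Value) -> bytes:
    """Return the byte form of a value.

    Hypothetical convention for this sketch: the readable form is hex text.
    """
    if isinstance(value, bytes):
        return value
    return bytes.fromhex(value)


class NormalizingCache:
    """Toy cache that stores only the optimized byte form of each entry."""

    def __init__(self) -> None:
        self._entries: Dict[str, bytes] = {}

    def update(self, key: str, value: Value) -> None:
        # Normalize on every update, whichever form the caller holds.
        self._entries[key] = normalize(value)

    def find(self, key: str) -> bytes:
        return self._entries[key]


# Two nodes receive the same logical value in different forms.
node_a = NormalizingCache()
node_b = NormalizingCache()
node_a.update("KT1-example", "00a1b2c3d4")
node_b.update("KT1-example", bytes.fromhex("00a1b2c3d4"))

print(node_a.find("KT1-example") == node_b.find("KT1-example"))  # True
```

Because normalization happens at the single point where entries are written, the stored form no longer depends on a node's uptime or recent activity.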

We took the decision to deploy this fix only on Mumbai, after considering the following:

While this was a high-risk threat, it was not a safety issue (no funds were compromised and there was no risk of an inconsistent ledger state) but rather a liveness one: at worst, block production would slow down or grind to a halt.
Even though this bug was indeed present in Lima on Tezos Mainnet, there was no evidence that the contract triggering the incident was deployed with an intent to attack the network, nor were there subsequent attempts to exploit it, on either Ghostnet or Mainnet.
Mitigation was straightforward: rebooting the affected nodes. If the situation escalated (e.g. through a repeated exploit), we would still have the option to deploy a patch for Lima on Mainnet as a user-activated upgrade.

After thorough testing and review, an updated protocol proposal, Mumbai 2, was announced on March 7th.

In hindsight, this decision seems to have been the right one: the user-activated protocol override was not controversial, and after the original Mumbai proposal successfully passed the Promotion period vote, Mumbai 2 activated at block #3,268,609 on March 29th. Moreover, we witnessed neither a repetition of the incident nor any attempt to exploit it.

Moving forward

Ideally, we would have discovered this issue on bleeding-edge test networks like Dailynet or Mondaynet.

However, this requires being able to support environments closer to Tezos Mainnet in test networks: more smart contracts and rollups deployed, increased traffic, etc. Moving forward, we are working towards increasing the capability to reproduce Mainnet conditions in our test infrastructure. It is also imperative to increase community participation in bleeding-edge test networks.

Tezos is constantly evolving, and a new protocol upgrade, Nairobi, is set to activate on Tezos Mainnet around June 23rd. Staying on top of the blockchain game requires us to move fast. Yet, we don’t buy the part of the mantra which requires breaking things – at least, not on Mainnet.

¹ Indeed, when we said that adopting Tenderbake was a trade-off between (more) safety and (less) liveness, we had scenarios like this in mind: in the event of a network split, it is not possible for both forks to advance, as at most one can reach sufficient attestation power.