Lessons Learned from the Babylon Protocol Upgrade: A Retrospective

Summary:

Babylon (aka protocol 005), the second Tezos protocol amendment jointly developed by Nomadic Labs and Cryptium Labs, was successfully activated on block 655361. Since then, we’ve continued analysing and monitoring the new features, but have also engaged in a deeper reflection on the upgrade process from its development period, pre-injection, to the period following the activation.

This article summarises the lessons learned in five parts: the development of Babylon, the proposal period, continued testing during the on-chain test period, activation, and post-activation. For each of these parts we identify the missteps made over the past months and draw lessons for current and future core developers to improve on the process for future protocol upgrades.

The process of developing Babylon had many firsts. It was the first time that two independent core development teams worked together on a proposal, which was a non-trivial step towards decentralisation of core development. Furthermore, given its large set of features, it was a steep step from Athens. While Athens proved that live upgrades worked, Babylon showed that large parts of the codebase can be amended in order to deliver meaningful improvements.

TL;DR: Less rush, more testing, more documentation, more community involvement.

Developing Babylon

Issue: controlling the set of changes and timeline

An issue during the development of Babylon was that, despite agreeing beforehand on a set of features, we ended up adding a few more along the way. This was due partly to some backlog from Athens and partly to requests for features that we were afraid to leave unanswered for three more months. Furthermore, we were trying to respect a specific timeline for injection because we were afraid of a never-ending development cycle where changes kept piling up.

Working towards a moving target on a fixed timeline considerably complicated internal development and rushed the final stages of the release, which affected testing and communication with the community.

How to improve

In future upgrades we will compile a list of features early on and gather input from the wider ecosystem. We will limit development to those specific features, which will give us a clearer view of a realistic timeline. If needed, injection will be delayed until proper review, testing and communication are done.

Proposal Period for Babylon

The first proposal (Babylon PsBABY5n) was injected on the 26th of July 2019. Shortly after this, a new updated version (Babylon PsBABY5H) was proposed during the same proposal period.

Issue: Binary format of endorsements

The second injection was in response to feedback from wallet developers, which led us to revert a change in the binary format of endorsements. The new binary format of manager transactions was kept in order to ensure that user transactions from 004 could not be executed on 005.

The proposal phase was indeed conceived with the idea that there could be a series of proposals, counter proposals and iterations.

However, this extra injection implied more off-chain coordination, and it could have been avoided if we had gathered feedback from developers impacted by the binary format of operations before the first proposal.

How to Improve

We see three areas of improvement:

Ensure that the binary formats are well documented, before the injection

All the binary formats of the protocol can be obtained with the new binary tezos-codec for any of the encodings used in the Tezos codebase. We are also working on a new tool that will parse and display the binary format of a specific operation.

Any breaking change introduced by a proposal is highlighted in the Changelog that we publish together with every release and there are specific guidelines that are suggested to handle the migration.

We will continue improving the quality and accessibility of this documentation and we will make sure that it is available to third party developers a few weeks before a proposal.

Allow developers to test properly, before the injection

Using the simple bash sandbox is usually enough to interact with a Tezos node and client and rapidly test by hand. Additionally a number of more complex end-to-end tests can be automated using one of the two frameworks that are present in the code base, Flextesa and the Python test framework.

These frameworks are used by core developers, which means that they will be maintained in the future and that there are a number of existing tests that can serve as inspiration to third party developers. We believe that with the tools presented any future incompatibility in the format of operations can be detected early in the development process and independently, without the need for a testchain.

However, these tools might not have been advertised enough to third party developers, and we will make an effort to better document them and encourage their adoption and feedback.

Publish the code and changelog before the injection

For Babylon we published a description of the features as well as links to their implementation as we built them over time, for example Quorum Caps and Making implicit accounts delegatable.

However, for both Athens and Babylon the changelog and final code were only published at the same time as the proposal was injected. In the future we will leave a few weeks between publication and injection to allow ample time for last-minute corrections.

Exploration & Testing Periods for Babylon

Issue: the issue with Big Maps and the Hotfix

During the test period of Babylon PsBABY5H we discovered a bug affecting big maps. More details are in the documentation. The bug was particularly hard to catch because it was a single line in one large patch and because it didn’t break any functionality but caused a performance degradation.

Despite the above, we believe that with a more rigorous review process and more automated tests we should be able to avoid such situations in the future.

The natural course of action would have been to vote nay in the promotion vote and to propose a patched version of Babylon, PsBabyM1, incurring a delay of 3 months for the voting procedure to complete again.

However, given that multiple big maps (together with entry-points) were strongly requested by smart contract developers, we felt a sense of urgency from some parts of the community to deliver this feature as soon as possible.

For this reason we decided to make available a new version of the Tezos node that, in case of a positive vote for Babylon PsBABY5H, would activate the patched version PsBabyM1 instead, as explained in the related blog post.

Adding this possibility was not meant as a suggestion, but merely as an additional option for the community. The majority of nodes preferred this option and PsBabyM1 was activated successfully.

We have always considered user-activated updates a legitimate part of the governance process when used for urgent and critical hotfixes, like we did in the past and as explained in the “Amendments at work” blog post.

On the positive side, this hotfix unlocked the many new possibilities offered by big maps. We are preparing a blog post on the topic for smart contract authors.

How to Improve

There were several issues at play here.

To allow for faster iteration over multiple improved versions of a proposal, we’ve discussed several ways to reduce the length of a voting procedure in case of a negative vote.

The hotfix also led us to reflect on the user-activated update mechanism. Because of a technical limitation in the current implementation, user-activated updates can only be introduced by releasing new binaries or recompiling the code, a process which can leave non-developers out. We are implementing a node where user-activated updates can be set easily by users in a config file, instead of depending on a mainnet release.
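To make the idea of a config-driven user-activated update concrete, here is a minimal sketch of how a node could resolve which protocol to activate from a JSON config file. The key and field names (`user_activated_protocol_overrides`, `replaced_protocol`, `replacement_protocol`) are assumptions made for this example and should not be read as the node's actual configuration schema. The protocol names are the short names used in this article rather than full hashes.

```python
import json

# Hypothetical config layout: key and field names are illustrative
# assumptions, not the node's actual configuration schema.
EXAMPLE_CONFIG = """
{
  "user_activated_protocol_overrides": [
    {"replaced_protocol": "PsBABY5H", "replacement_protocol": "PsBabyM1"}
  ]
}
"""


def load_overrides(config_text: str) -> dict[str, str]:
    """Map each voted protocol to the protocol the node should activate instead."""
    config = json.loads(config_text)
    entries = config.get("user_activated_protocol_overrides", [])
    return {e["replaced_protocol"]: e["replacement_protocol"] for e in entries}


def protocol_to_activate(voted: str, overrides: dict[str, str]) -> str:
    """Return the configured replacement, or the voted protocol if none is set."""
    return overrides.get(voted, voted)


overrides = load_overrides(EXAMPLE_CONFIG)
print(protocol_to_activate("PsBABY5H", overrides))  # PsBabyM1
```

With a layout along these lines, opting in or out of a hotfix becomes a matter of editing one entry in a file rather than installing a different binary.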

As for the big map bug itself, this kind of bug can be found through more thorough testing and stricter review requirements during the development process. As the number of developers competent in protocol development grows, so will the number of reviews per merge request.

In addition, more developer activity on the testnets prior to injection would be helpful. One way to achieve that is to look into incentivizing participation in the testnet. Another way, which has already been enacted, consists in always having a testnet running the proposal currently being voted on.

Activation of Babylon

Issue: the mempool glitch at activation

Right after the activation of Babylon there was a slowdown of block production in the network due to some nodes being blocked. These nodes had correctly migrated to Babylon but were still receiving operations from Athens which they were no longer capable of deserialising, causing a failure of the mempool and of the node.

Fortunately, the situation could be resolved by simply restarting the nodes, which erased the old operations. However, those operations were still propagated by the network for an hour, so some nodes required multiple restarts. A patch to fix the mempool was ready minutes after we realised the problem, and many bakers readily updated their nodes.

The failure of the mempool was caused by a bug which was known and fixed on the master branch of the Tezos code base but was incorrectly ported to the mainnet branch.
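The failure mode can be illustrated with a toy mempool that keeps running when it receives an operation it cannot decode. This is a sketch of the general defensive pattern, not the actual patch: the one-byte version tags, `decode_operation` and the `Mempool` class are all hypothetical stand-ins for the real binary encodings and node components.

```python
from dataclasses import dataclass, field


class DecodeError(Exception):
    """Raised when raw bytes do not match the active protocol's format."""


# Hypothetical one-byte version tags standing in for the real encodings.
TAG_004 = 0x04
TAG_005 = 0x05


def decode_operation(raw: bytes, active_tag: int) -> bytes:
    """Return the payload if the operation carries the active protocol's tag."""
    if not raw or raw[0] != active_tag:
        raise DecodeError(f"cannot decode operation {raw.hex()}")
    return raw[1:]


@dataclass
class Mempool:
    active_tag: int
    pending: list = field(default_factory=list)
    rejected: int = 0

    def receive(self, raw: bytes) -> bool:
        """Accept a decodable operation; drop and count anything else."""
        try:
            self.pending.append(decode_operation(raw, self.active_tag))
            return True
        except DecodeError:
            self.rejected += 1  # stale operation: discard it and keep running
            return False


mempool = Mempool(active_tag=TAG_005)
mempool.receive(bytes([TAG_004]) + b"old transfer")  # still gossiped after migration
mempool.receive(bytes([TAG_005]) + b"new transfer")
print(mempool.pending, mempool.rejected)
```

The point of the pattern is that a decoding failure is treated as a property of the incoming operation, so it is rejected without affecting the rest of the mempool or the node.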

How to Improve

Once again, this kind of bug can be avoided with a simpler release process, which we are currently implementing and which will be in place before the next release. The goal is to reduce to a minimum the number of steps needed to release Mainnet starting from the master development branch. This will simplify testing and leave less room for errors.

In the future, we will run testnets before injection that simulate the migration from the old state machine to the new. In this instance it would have meant running a testnet with 004 and then upgrading the running network to 005, while at the same time generating load by randomly sending transactions. This kind of testnet would have caught the previous bug.

Post-Activation of Babylon

Issue: the formula for block rewards

After the activation of Babylon, users realised that the computation of rewards was slightly different from the formula published in the blog post describing Emmy+, the improved consensus algorithm.

The reason for this difference is a bug in the implementation of the formula, which results in a loss of precision when calculating the baking rewards.
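This post does not detail where the precision is lost, so the following is only a generic illustration of how integer arithmetic on token amounts can lose precision. It is neither the Emmy+ formula nor the actual bug. The function names and the values are made up for the example. It scales an amount by a fraction `num/den` in two ways: dividing first discards the remainder early, while multiplying first rounds only once at the end.

```python
def scale_divide_first(base_mutez: int, num: int, den: int) -> int:
    """Divide before multiplying: the remainder of base/den is discarded early."""
    return (base_mutez // den) * num


def scale_multiply_first(base_mutez: int, num: int, den: int) -> int:
    """Multiply before dividing: only the final result is rounded down."""
    return (base_mutez * num) // den


# Illustrative values only, not protocol constants.
base, num, den = 1_000_000, 29, 30
print(scale_divide_first(base, num, den))    # 966657
print(scale_multiply_first(base, num, den))  # 966666
```

The two results differ by 9 units even though both functions compute the same mathematical expression, which is the kind of discrepancy a unit test comparing the code against the published formula would surface.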

This precision loss does not affect the security of Emmy+ and a fix will be offered in the upcoming protocol upgrade.

How to Improve

This kind of bug should be spotted and avoided more easily. We see two lessons to learn here.

Can we write fewer bugs in the future?

We are going to improve our release process for protocols by having:
- comprehensive unit tests
- a clearer, stricter policy for code reviews
- a clearer freeze and review period before injection

Bugs still happen: how can we make sure we notice them?

Despite having Babylonnet running since September 27th, we didn’t sufficiently encourage participation and for this reason the traffic on the network has not been a realistic sample of mainnet.

Furthermore the lack of testnet support from block explorers has strongly limited the capacity of users and developers to inspect the data of the testnet.

In the last month there was a reorganisation of the testnets infrastructure to improve the service and engage more members of the community.

Moreover several block explorers, such as TzStats, are working to support Babylonnet and will support future test chains.

Issue: restricting originated (KT1) contracts from paying transaction fees

With the implementation of the delegation process simplification, originated accounts (KT1) can no longer pay for transaction fees. The goal of this change is to ensure that all transaction fees are always paid by tz1 addresses, and remove the computational overhead produced by fees paid through KT1 accounts, as smart contracts need to be fully executed in order to verify their validity. This results in the potential for dramatic mempool optimisations and increased throughput.

However, the now legacy multi-step delegation process led to a common scenario, where all the funds of the tz1 account were transferred to the KT1 account to maximise the amount delegated. Before Babylon, this was not an issue as the KT1 account was able to pay for transaction fees.

With Babylon and KT1 accounts no longer able to pay for transaction fees, implicit accounts that had empty balances before the protocol upgrade were funded with 1µꜩ (0.000001ꜩ) to prevent the requirement for a new allocation burn.

This led to two issues. First, the creation of these accounts and their funding with 1µꜩ was not documented, which led to trouble for block explorers. Second, the 1µꜩ balance was not sufficient to pay for a transaction.

To assist affected accounts, Cryptium Labs funded all the implicit accounts in this situation with 0.01ꜩ, which is high enough for the account to pay for at least one transfer transaction (funding, for instance, the tz1 address with enough fees to pay for several transactions).
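The arithmetic behind the two balances can be sketched as follows. The unit conversion (1ꜩ = 1,000,000µꜩ) follows from the amounts quoted above. The fee value is an assumption chosen purely for illustration, since actual fees depend on the operation and on baker settings, and the helper names are hypothetical.

```python
from decimal import Decimal

MUTEZ_PER_TEZ = 1_000_000


def tez_to_mutez(amount: str) -> int:
    """Convert a decimal tez string to a whole number of mutez (µꜩ)."""
    scaled = Decimal(amount) * MUTEZ_PER_TEZ
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{amount} tez is not a whole number of mutez")
    return int(scaled)


def affordable_transfers(balance_mutez: int, fee_mutez: int) -> int:
    """Number of transfers whose fee the balance can cover."""
    return balance_mutez // fee_mutez


# Assumption for illustration only: not an actual protocol or baker fee.
ASSUMED_FEE_MUTEZ = 2_000

print(affordable_transfers(tez_to_mutez("0.000001"), ASSUMED_FEE_MUTEZ))  # 0
print(affordable_transfers(tez_to_mutez("0.01"), ASSUMED_FEE_MUTEZ))      # 5
```

Under this assumed fee, the 1µꜩ migration balance covers no transfer at all, while the 0.01ꜩ top-up covers a handful.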

How to Improve

Regarding the effect of token creation during protocol migration and the way it is interpreted by block explorers, we feel the cleanest, most consistent approach is to introduce a receipt attached to migration blocks showing all of the balance updates caused by the migration. This, however, requires a change in the Tezos environment, and not merely a protocol change. Protocol environments are versioned and designed to change over time (for instance to accommodate new cryptographic libraries).
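As a rough sketch of what such a receipt could enable, the example below models a migration receipt as a list of balance updates that a block explorer replays to reconcile its own view of account balances. The `BalanceUpdate` type, its fields, and the placeholder addresses are hypothetical and do not reflect any actual receipt format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BalanceUpdate:
    account: str       # address (placeholders are used below)
    change_mutez: int  # positive for credits, negative for debits
    origin: str        # what caused the update, e.g. "migration"


def apply_receipt(balances: dict[str, int], receipt: list[BalanceUpdate]) -> dict[str, int]:
    """Return the balances obtained after applying every update in the receipt."""
    result = defaultdict(int, balances)
    for update in receipt:
        result[update.account] += update.change_mutez
    return dict(result)


def total_change(receipt: list[BalanceUpdate], origin: str) -> int:
    """Net amount created (or destroyed) by updates with the given origin."""
    return sum(u.change_mutez for u in receipt if u.origin == origin)


# Two previously empty implicit accounts, each funded with 1 mutez.
receipt = [
    BalanceUpdate("tz1-placeholder-a", 1, "migration"),
    BalanceUpdate("tz1-placeholder-b", 1, "migration"),
]
print(apply_receipt({"tz1-placeholder-a": 0}, receipt))
print(total_change(receipt, "migration"))
```

With an explicit list like this, an explorer would not need to infer migration side effects by comparing the state before and after the activation block.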

Regarding the effect of the transaction fee, funding the accounts directly outside of the protocol upgrade was a simple low-tech solution which worked. However, if anything of the sort were to be repeated in the future, it should involve clearer communication with the affected users in order to ensure a smooth upgrade for everyone.

Things that Worked

This is the first time that a running decentralised, permissionless and censorship resistant blockchain protocol evolved in a meaningful way. Babylon paves the way for a chain that can evolve over time and adapt the best technologies from the entire ecosystem.

Furthermore, it is the first time, for Tezos, that two independent core development teams worked on the same protocol. Hiccups were abundant, but it worked and it showed that core development can be decentralised while also moving fast and evolving the protocol.

Lastly, let’s not forget the many features that worked:

Closing Remarks

In summary, we have identified the following as key areas to improve the proposal process:

- Outline the desired features ahead of time, strictly focus development on them, and wait for all reviews, testing and documentation to be done.
- Make feature documentation and changelogs more accessible and visible ahead of time, to give the community more time to engage.
- Release independent features to testnets, so ecosystem developers have early access and can provide feedback in advance.
- Reorganise and maintain the testnets.
- More reviews per merge request and more unit and integration testing.

Lastly, it is important to remember the scope of these changes for Tezos. Babylon is, feature-wise, a steep step from Athens: it touched almost all the main areas of the protocol, namely Michelson, the voting procedure, accounts and consensus. With the Babylon upgrade the Tezos community proved to the world that we are the first blockchain that can significantly amend a running protocol. Although there were more than a few setbacks, they should not deter us from improving Tezos over time, as now is the time to create the foundation for a long-lasting and relevant blockchain protocol.
