Also coming with Jakarta: spring-cleaning the Michelson interpreter

In recent months the Michelson team at Nomadic Labs has launched a project dedicated to paying off the technical debt in the interpreter. Michelson is the language of smart contracts in the Tezos blockchain and its interpreter is an integral part of the Tezos economic protocol. It evolves along with the protocol and in fact many new features were added to it in previous upgrades. As commonly happens when developing software, technical debt accumulated with these changes and it was decided that it was time to make a dedicated effort to pay at least a part of it off.

What is technical debt?

It is a very broad term encompassing all the inefficiencies, suboptimal design choices, insufficient testing and even bugs that accumulate in a software project over its lifetime. Writing a good piece of software is a very difficult and time-consuming task. It relies on information that isn’t always readily available. Environments change and so do users’ needs. When pressed by deadlines or uncertainty, developers often create suboptimal solutions, either erroneously or even intentionally as a trade-off necessary to deliver on time.

It is called “debt” because allowing for these deficiencies in the short term allows software to be delivered more quickly, but over time they take a toll on its quality. Hence it’s important to keep the technical debt low and pay it off by fixing inefficiencies on a regular basis.

Technical debt in the Michelson interpreter

When creating a programming language, an additional complication with respect to technical debt arises. While the language’s interpreter (or compiler) accumulates technical debt like any other piece of software, it must still support all the software previously written in the language. So it is imperative to ensure that every program previously written still works in the same way it did before any upgrade. This requirement makes it challenging to redesign programming languages and demands extensive testing of each change.

That said, it is sometimes possible (albeit risky) to announce a breaking change to the language, making it clear to the users that their programs may no longer work with the new version of the language. This was done for instance with Python. Python 3 was incompatible with Python 2 and the breaking changes were at least partially motivated by technical debt. The transition was so painful that most Linux distributions still ship with Python 2 (either as the default or additional interpreter), as a lot of software was never updated to work with Python 3.

However, in the case of Michelson this approach cannot work, because a smart contract, once submitted to the blockchain, stays there forever. The protocol must retain the ability to interpret old contracts, no matter how many protocol upgrades have been implemented since the contract’s origination. The need to support every existing contract is often a great obstacle to refactoring and improving the code.

In addition, sometimes certain features of Michelson turn out to be less useful or safe than expected. They might also conflict with other features that are later deemed to be more important. It seems that technical debt can also arise in utterly normal development processes, because of external situational changes! For example, users behave differently than expected, or new scientific developments are announced that would be good to take advantage of. So, is the blockchain doomed to support those legacy features forever?

Fortunately there is a solution! The Michelson interpreter has a so-called legacy mode, which supports all the features that were ever used on the blockchain. As the name suggests, it is an optional behaviour, which is only enabled for executing contracts that already exist on-chain. New contracts are type-checked in normal mode before they are originated, and normal mode does not have to support all the legacy features. Indeed, thanks to these two distinct modes of operation it is possible to deprecate old features and disallow origination of contracts using them, while still supporting them in old contracts.

But does it really help? The blockchain still has to support deprecated features in the legacy mode, doesn’t it?

Patching legacy contracts

In the Michelson team at Nomadic Labs, we decided that the time had come to take steps towards eliminating currently¹ deprecated features for good. Unlike Python, which is a general-purpose language, Michelson only makes sense within the Tezos blockchain. Because the blockchain is publicly available, we, unlike the developers of Python, do have access to all the Michelson programs, or at least to all those that can have an impact on the blockchain. Although we cannot modify the on-chain data², we can tell the nodes to replace one contract with another at a later block. Because of the evolving nature of the Tezos protocol, each new protocol version can (and often does) alter the way nodes store information about the current state of the blockchain (called the context). We can use this mechanism to modify contracts stored in the context. Although contracts retain their original form in the blocks that originated them, they can be patched at a later block by the protocol migration.

This gives us a clear procedure to remove deprecated features even from the legacy mode of the interpreter:

1. Announce the feature deprecation and remove the feature from the normal mode.
2. Wait until the protocol deprecating the feature gets activated.
3. Type check all the contracts on chain in the normal mode and select those that fail.
4. Patch the selected contracts so that they type check successfully again.
5. Finally remove legacy features from the legacy mode.
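
The selection step of this procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the script hashes and the toy type checker are hypothetical, and the toy checker merely rejects scripts whose storage declaration mentions the contract type. In practice the check is done by the real type checker running in normal mode.

```python
from typing import Callable, Dict, List


def select_contracts_to_patch(
    scripts: Dict[str, str],
    typechecks_in_normal_mode: Callable[[str], bool],
) -> List[str]:
    """Return the hashes of the scripts that fail to type check in normal mode."""
    return sorted(
        script_hash
        for script_hash, source in scripts.items()
        if not typechecks_in_normal_mode(source)
    )


def toy_typecheck(source: str) -> bool:
    """Hypothetical stand-in for the type checker: it rejects any script whose
    storage declaration mentions the `contract` type."""
    _, _, after_keyword = source.partition("storage")
    storage_declaration = after_keyword.partition(";")[0]
    return "contract" not in storage_declaration


# Hypothetical scripts, keyed by made-up hashes.
example_scripts = {
    "exprA": "parameter unit ; storage (pair key_hash (contract unit)) ; code { FAILWITH }",
    "exprB": "parameter unit ; storage (pair key_hash address) ; code { FAILWITH }",
}

to_patch = select_contracts_to_patch(example_scripts, toy_typecheck)
```

Passing the type checker in as a parameter keeps the selection logic independent of how the check is actually performed.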

The waiting step is unfortunate, but necessary. We have to wait for the protocol that deprecates the feature to get activated, so that no new contracts using the deprecated feature can be originated after we do the patching.

In the past we did deprecate features in several protocol upgrades, but this was the first time we actually started patching contracts in order to remove those features. After some patching and hacking on the tezos-node we managed to extract all the smart contract scripts from the mainnet and type checked them using the tezos-client typecheck command. We found eight contract scripts that required patching, although most of them were instantiated multiple times. The patches were mostly trivial to write and all the scripts type checked successfully again. But does the story end here? As a careful reader will probably guess, in fact it has only just begun! We now need to make sure that all the patched contracts work exactly as they did before.

One example of a deprecated feature we decided to patch away was the possibility to store typed references to other smart contracts (values of the Michelson contract type) in storage. These references were at some point forbidden from appearing in contract storages, so we had to find the contracts that still held contract references in storage and replace those references. The following patch to smart contract KT1MzfYSbq18fYr4f44aQRoZBQN72BAtiz5 is an example of such a change:

--- patched_contracts/exprtgpMFzTtyg1STJqANLQsjsMXmkf8UuJTuczQh8GPtqfw18x6Lc.original.tz
+++ patched_contracts/exprtgpMFzTtyg1STJqANLQsjsMXmkf8UuJTuczQh8GPtqfw18x6Lc.patched.tz
@@ -1,10 +1,5 @@
 { parameter (or (lambda %do unit (list operation)) (unit %default)) ;
-  storage
-    (pair key_hash
-          (contract
-             (or (option address)
-                 (or (pair (option address) (option mutez))
-                     (or mutez (or (pair (option address) (option mutez)) address)))))) ;
+  storage (pair key_hash address) ;
   code { DUP ;
          CAR ;
          IF_LEFT
@@ -28,6 +23,8 @@
                NIL operation ;
                { DIP { DIP { DUP } ; SWAP } ; SWAP } ;
                { DIP { DIP { DIP { DROP } } } } ;
+               CONTRACT (or (option address) (or (pair (option address) (option mutez)) (or mutez (or (pair (option address) (option mutez)) address))));
+               IF_SOME {} {PUSH string "Bad contract in storage"; FAILWITH};
                AMOUNT ;
                SENDER ;
                SOME ;

As can be seen here, contract ... in the storage type is replaced by an untyped address. The difference between these values is only in their types: contract is parametrised by the referenced contract’s type, while address is not. The latter can be converted into the former by specifying that type. This is exactly what the two lines added in the body of the contract do.
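
To make the conversion concrete, here is a small Python model of what the two added lines do. It is a simplification and not the interpreter’s implementation: the chain is modelled as a hypothetical dictionary from addresses to parameter types, types are compared as plain strings, and the function names and addresses are made up for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical model of the chain: address -> parameter type of the contract there.
ContractTypes = Dict[str, str]
TypedReference = Tuple[str, str]  # (address, parameter type)


def contract_of_address(
    address: str, expected_type: str, chain: ContractTypes
) -> Optional[TypedReference]:
    """Model of the CONTRACT instruction: yield a typed reference only if the
    address hosts a contract with the expected parameter type."""
    actual_type = chain.get(address)
    if actual_type is None or actual_type != expected_type:
        return None
    return (address, expected_type)


def load_typed_reference(
    address: str, expected_type: str, chain: ContractTypes
) -> TypedReference:
    """Model of `CONTRACT ... ; IF_SOME {} { PUSH string ... ; FAILWITH }`."""
    reference = contract_of_address(address, expected_type, chain)
    if reference is None:
        # Corresponds to the FAILWITH branch in the patched script.
        raise RuntimeError("Bad contract in storage")
    return reference


# Made-up addresses and types, for illustration only.
example_chain = {"KT1Target": "unit", "KT1Other": "nat"}
typed = load_typed_reference("KT1Target", "unit", example_chain)
```

The failing branch in this model is the runtime check that the original script did not need, because the stored value was already typed.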

It’s worth noting that this patched version of the contract can fail at runtime while the original version couldn’t. It would fail if the stored address pointed to a contract of the wrong type. This is unfortunate, but could not be helped. We will take steps to verify that the change doesn’t break the contract, and once we do, all should be well.

Verifying the patched contracts

How do we make sure that we didn’t break a program of perhaps a thousand lines of code that we’re seeing now for the first time? It’s especially worrisome, considering that Michelson is not a particularly human-friendly language. There’s no (formal or even informal) specification for the contract, no documentation, and all we know about the author is their public cryptographic key (albeit, in actuality there’s no guarantee that the person who originated the contract is also the one who wrote it). The situation seems almost hopeless.

Fortunately one of the particularly terrifying contracts was written by us some time ago. There was some documentation and even some tests written in the form of a shell script. So we started with that one. We rewrote the test, because it relied too much on old behavior of the tezos-client that has since changed. Reworking that test also gave us a thorough understanding of the purpose and interface of the contract. We ran the test against both the original and the patched version of the contract and fortunately both versions passed. So far, so good.

With other contracts we had less luck, though. Fortunately, what happens on a blockchain stays there forever. Thus we have a complete record of every transaction ever made to these contracts. We could replay these transactions with the patched version of the script and check whether they yield the same results as they did originally. How can we replay a transaction to a smart contract while altering some of its aspects? Of course, we have the tezos-client run script command. It accepts a script, a storage, and a parameter and executes the given script in the current context of the node. However, there’s a lot more of the blockchain’s state to reproduce than just storage and parameter. The contract has a balance and an address (which is more or less randomly assigned to the contract during origination). In addition, the transaction comes from some other account, which also has an address, and so forth.

Indeed, as we replayed transactions with the original scripts to verify the soundness of our procedure, we faced some contracts failing on transactions that had succeeded in the past. The apparent point of failure was checking user signatures. But why would a signature that used to be valid become invalid over time? Fortunately, the failures occurred in the contract we had tests for. From these tests we managed to deduce that the signature is produced by signing a certain message with the user’s private key, and that a part of that message was the address of the contract being called.

With the run script command the node originated a new contract with a fresh address to run the script for. Hence the replayed scripts had a different address than they originally did, and that is why historic signatures didn’t work in replay. Therefore, it was necessary to force execution with a given address. So before testing the scripts we had to add new options to the tezos-client. This is how paying off technical debt can sometimes lead to new features in the software.
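
The effect is easy to reproduce in a toy model. The Python sketch below uses HMAC from the standard library as a symmetric stand-in for a real public-key signature, and the message layout (contract address followed by a payload) is an assumption made purely for illustration. The key, addresses and payload are made up.

```python
import hashlib
import hmac


def sign(secret_key: bytes, contract_address: str, payload: bytes) -> bytes:
    """Sign a message that embeds the address of the contract being called.
    HMAC is only a stand-in here; real contracts check public-key signatures."""
    message = contract_address.encode() + b"|" + payload
    return hmac.new(secret_key, message, hashlib.sha256).digest()


def check(
    secret_key: bytes, contract_address: str, payload: bytes, signature: bytes
) -> bool:
    """Recompute the signature for the contract's own address and compare."""
    expected = sign(secret_key, contract_address, payload)
    return hmac.compare_digest(expected, signature)


# Made-up key, addresses and payload.
key = b"user-secret"
payload = b"transfer 10"
historic_signature = sign(key, "KT1Original", payload)

valid_at_original_address = check(key, "KT1Original", payload, historic_signature)
valid_at_fresh_address = check(key, "KT1Fresh", payload, historic_signature)
```

In this model the historic signature verifies only when the script runs under the original address, which is why replay needs a way to force that address.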

Having done that we were able to replay transactions from the past.

Scaling up

Of course with all the contracts there were several hundred transactions to replay in total. Typing the parameters for all these transactions on the command line takes a lot of time. So it was necessary to devise an automatic process of replaying the transactions and checking results.

We decided to use the API of the tzkt.io indexer to download historic transaction data for our contracts. We wrote a script that calls the local tezos-client to execute the contract for each transaction. It then parses the transaction result and compares it to what the indexer reported.
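
The core loop of such a script might look like the following Python sketch. It is not the script we published: the record fields are hypothetical and do not follow the indexer’s schema, the runner is an injected function standing in for a call to tezos-client run script, and only the resulting storage is compared.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class HistoricTransaction:
    """Hypothetical record of one past call, as obtained from an indexer."""
    operation_hash: str
    parameter: str
    storage_before: str
    storage_after: str


# (script, storage, parameter) -> resulting storage
Runner = Callable[[str, str, str], str]


def replay(
    script: str, transactions: Iterable[HistoricTransaction], run_script: Runner
) -> List[str]:
    """Re-execute each transaction and describe every divergence from history."""
    mismatches = []
    for tx in transactions:
        try:
            result = run_script(script, tx.storage_before, tx.parameter)
        except Exception as exc:
            mismatches.append(f"{tx.operation_hash}: execution failed ({exc})")
            continue
        if result != tx.storage_after:
            mismatches.append(
                f"{tx.operation_hash}: expected {tx.storage_after}, got {result}"
            )
    return mismatches


def toy_counter_runner(script: str, storage: str, parameter: str) -> str:
    """Toy runner: behaves like a contract that adds the parameter to storage."""
    return str(int(storage) + int(parameter))


history = [
    HistoricTransaction("op1", parameter="2", storage_before="0", storage_after="2"),
    HistoricTransaction("op2", parameter="3", storage_before="2", storage_after="5"),
    HistoricTransaction("op3", parameter="1", storage_before="5", storage_after="7"),
]
report = replay("toy script", history, toy_counter_runner)
```

Collecting mismatches instead of stopping at the first one makes it possible to produce a summary over all contracts in a single run.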

Developing the script took a lot of time as we had to translate between two different encodings of Michelson values (JSON and Micheline), distinguish between internal operations generated by the script and the “main” operation that triggered them, and generally understand the data returned by the indexer in various situations. We also had to decode operation results returned by the indexer and compare them to those produced by tezos-client. Lastly, we had to produce a nice summary indicating which operations to which contracts failed, so that we could debug them.

After fixing all false positives and errors we discovered some operations that we couldn’t get to work. One of the contracts had previously been patched during a migration, so its storage type had already changed. Of course, reproducing transactions from before that patch was not possible, so we had to skip those³. Moreover, some contracts relied on big maps, which are stored in the node’s storage, and currently there’s no way to inject their modified values into the run script command. It’s not at all obvious what an interface for this should look like, not to mention implementation problems. Hence we decided to give up and leave these operations unchecked. There were few of them, though.

Code review

The contracts will be patched along with a protocol migration. The node iterates over all the contracts, looking for hashes of the contracts we want to modify. Once such a hash is found, the contract for that account is replaced with a hard-coded binary string representing the patched version of the contract. How do we review such a change? Binary representation of Michelson code is even less readable to a human eye than the code itself. Reviewers could and did review the test for the one contract that we had a written test for, as well as the migration code. But how do reviewers read the contract patches if they’re binary-encoded?
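
For readers who prefer code, the migration logic described above can be modelled in Python as follows. The real migration is part of the protocol and is not written this way: here the context is a plain dictionary, SHA-256 stands in for the protocol’s script hash, and all names and byte strings are hypothetical.

```python
import hashlib
from typing import Dict


def script_hash(code: bytes) -> str:
    """Stand-in for the protocol's script hash (an assumption for illustration)."""
    return hashlib.sha256(code).hexdigest()


def migrate(context: Dict[str, bytes], patches: Dict[str, bytes]) -> int:
    """Replace every contract whose code hash is listed in `patches` with the
    hard-coded patched code, and return the number of contracts replaced."""
    replaced = 0
    # Iterate over a snapshot so that the context can be updated safely.
    for address, code in list(context.items()):
        patched_code = patches.get(script_hash(code))
        if patched_code is not None:
            context[address] = patched_code
            replaced += 1
    return replaced


# Made-up binary scripts: two instances of one legacy script and one other contract.
legacy_code = b"legacy script bytes"
patched_code = b"patched script bytes"
example_context = {
    "KT1First": legacy_code,
    "KT1Second": legacy_code,
    "KT1Unrelated": b"some other script",
}
replaced_count = migrate(example_context, {script_hash(legacy_code): patched_code})
```

Looking contracts up by hash rather than by address means that every instance of the same script is patched, which matters because most of the affected scripts were instantiated several times.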

To make the review process easier we attached to the migration code both the original and the patched code for all the modified contracts as well as diffs between them. We wrote unit tests to verify that the patched versions were equivalent to hard-coded binary representations. We also checked that the supplied diffs are the same when comparing the hard-coded binary representation and the text files attached to the original code.

Finally we published the aforementioned script to let the reviewers run the test for themselves. Replaying all the transactions takes a few hours, so it’s not very convenient to run, but not that bad if one has to do it only once.

At the time of writing this post, the migration code is merged into the master branch and constitutes a part of the Jakarta proposal. Obviously it won’t be executed until Jakarta is accepted by the community and activated, though. That said, having performed all these tests and checks, we’re confident that the patched contracts work the same as they did before, or at least they do for all the test cases that reality itself graciously provided us. After all, what better test of software is there than real usage by real users?

¹ Actually the word “currently” here refers to the moment of making the decision, because while working on this, we already decided to deprecate more features, which were not included in the scope of this work.
² Changing the contracts directly would disturb block hashes, thus making these blocks invalid.
³ The states of the storage that the indexer remembered were invalid according to the patched version of the contract, so attempts to replay these transactions with the patched version resulted in a type error.