1. Research & development
  2. > Blog >
  3. Measuring test coverage

Measuring test coverage

28 July 2020
Nomadic Labs

As we have discussed earlier on this blog, notably in the post on regression testing, testing is an important complement to formal verification. A critical task when testing is to ensure a high degree on test coverage. In this post we discuss how we measure test coverage of the OCaml implementation of tezos using bisect_ppx, and our work towards ensuring that the Michelson interpreter of the node is fully covered by tests.

Test coverage

Testing is simpler to set up than formal verification but comes with a major caveat: it is incomplete. Each test run verifies only one execution path — as opposed to formal methods that can give guarantees on all possible execution paths of the system under test.

The incompleteness of testing provokes a natural question in the mind of the wary test engineer: to what extent do my tests cover the set of behaviors of the tested system? Or, what parts of the code are tested and what parts are not? Many measures have been proposed to answer these and related questions, collected under the term of “test coverage”.

In this blog post, we focus on one such measure, namely program point coverage. This consists of measuring the ratio of program points (informally, a program point can be thought of as a location in the source code) that have been visited during the test run. Other measures include function coverage (which functions have been visited?), edge coverage (which edges in the program’s control-flow graph have been visited?) and condition coverage (which boolean sub-expressions in guard-expressions have been evaluated to both true and false).

However, full coverage does not imply that all behaviors of the system has been explored, not for any of the measures above. Notably, the behavior of a program statement may change depending on the program state when the statement is executed. But none of the above measures quantify to what extent possible such program states are explored. Instead, they are defined in terms of control-flow graph properties of the tested executions. The takeaway is that full code coverage does not guarantee full system correctness.

Nonetheless, there is still value in pursuing a high degree of test coverage. First, it is an important proxy measure for the amount of testing that has been performed. And second, because it detects code that has not been tested at all. Such code may contain bugs that are triggered when executed in any program state. Systematically testing code is a cheap insurance against such bugs.

In this section, we will detail our usage of the bisect_ppx tool to measure program point coverage in the Michelson interpreter by our test suite.

The bisect_ppx code coverage tool

bisect_ppx is a code coverage tool for OCaml and Reason programs. A program that has been compiled with bisect_ppx writes a log file with coverage data just before terminating. A set of such log files can be converted into an HTML report detailing and summarizing the set of program points that has been visited during the execution.

For each instrumented .ml file that has been visited we obtain a coverage measure (e.g. “73% of the program points were visited”), and each file can be inspected in detail to understand what program points have been visited and not.

Demonstration of bisect_ppx
An example of using bisect_ppx, courtesy of bisect_ppx

bisect_ppx instruments the tested program by inserting code that associates each program point with a counter that is incremented when the program point is visited. Then, when compiled, the instrumented program is linked with bisect_ppxs run-time library. This library registers an __atexit hook that is called at the end of the instrumented program’s execution. This hook writes the program-point counters to a log file. At this point, the developer can run their test suite to gather coverage data. Finally, the bisect-reporter-ppx program included in bisect_ppx, reads the log files and generates an HTML report.

An example of bisect_ppx instrumentation

Consider the following test file for which we want to measure coverage:

let add n1 n2 = n1 + n2

After running bisect_ppx we obtain the following, instrumented source code:

module Bisect_visit___src___lib_addition___addition_lib___ml =
    let ___bisect_visit___ =
      let point_definitions =
        "\132\149\166\190\000\000\000\004\000\000\000\002\000\000\000\005\000\000\000\005\144\160P@" in
      let `Staged cb =
        Bisect.Runtime.register_file "src/lib_addition/addition_lib.ml"
          ~point_count:1 ~point_definitions in
open Bisect_visit___src___lib_addition___addition_lib___ml
let add n1 n2 = ___bisect_visit___ 0; n1 + n2

We see that bisect_ppx has added a module Bisect_visit___src___lib_addition___addition_lib___ml containing a serialized definition of the program points in the source file, and a function ___bisect_visit___ that increments a program point counter. The function add has been associated to program point 0. Hence, the ___bisect_visit___ function is called in the definition of add with the argument 0. Consequently, each call to add increments the counter for program point 0.

Test coverage for the Michelson interpreter

We have used bisect_ppx to measure the test coverage of the Michelson interpreter in the Tezos node, and to detect Michelson instructions that were lacking test cases. We had two goals: first, to make sure that each branch of the step function in the interpreter (corresponding roughly to each Michelson instruction) is covered by at least one test. Second, to make sure that each instruction has at least one dedicated test. Our work on this topic can be followed in merge request !1261, that also contains information on how to set up bisect_ppx for Tezos.

To measure progress on the first objective, we run bisect_ppx on the full test suite of the Tezos node. For the second objective, we run it using a test suite that targets individual Michelson instructions.

Baseline results

We take as our baseline commit #0478afc7 of the Tezos node. This version precedes our work towards full coverage, and for this reason has a low coverage.

The results using bisect_ppx for the Michelson instruction test suite are available here. We see that the test suit for instructions covers 60.79% of the interpreter. Several instructions are missing tests: instructions for big maps such as EMPTY_BIG_MAP, MEM and GET; the instructions SOURCE and SENDER related to the transaction in which a contract is executed; and arithmetic operations such as INT that converts an integer (int) to a natural number (nat).

The coverage obtained when using the full Tezos node test suite is better, at 70.21% test coverage. However, it is still worrying to see that some instructions, such as SIZE for lists and MAP for maps, have no tests at all!

Adding Tests

There is a simple solution to the lacking coverage: writing more tests. Fortunately, this is often quite straightforward using the Python Execution and Testing Framework for Tezos. This framework provides a programmable API for the tezos-client and tezos-node binaries. Writing tests is especially simple for instructions such as INT, that depend only on the state of the stack. We will have to do some more work for instructions such as SOURCE that interact with the block chain.

Testing simple instructions

For simple instructions, our testing approach consists of writing a simple contract that takes a sample input of the instruction by parameter, and leaves the result of the instruction in storage. For instance, to test the INT operation, we write the contract int.tz:

parameter nat;
storage (option int);
# this contract takes a natural number as parameter, converts it to an
# integer and stores it.
code { CAR; INT; SOME; NIL operation; PAIR };

Then, we write a pytest that executes the contract with a sample input, retrieves the resulting storage and asserts that it is equal to the expected one. As we have a large number of tests like this, we use pytest parameterization. This enables the execution of the same test with different parameters. In our case, we write one test method that is parameterized by a tuple (contract, param, storage, expected).

    def test_contract_input_output(self, client, contract, param, storage, expected):
        contract = f'{CONTRACT_PATH}/{contract}'
        run_script_res = client.run_script(contract, param, storage)
        assert run_script_res.storage == expected

It runs contract, using the run script command of tezos-client. Here, client is a fixture provided by the Tezos Python Execution and Testing framework, that wraps a tezos-client binary. The run script command is called with the parameter param and the initial storage storage. Then, the test verifies that after execution, the final storage contains expected.

For instance, to test the INT instruction, we could instantiate the test parameters with ('int.tz', 'None', '0', '(Some 0)'), meaning that the when the int.tz contract is passed 0 by parameter, then the final storage should contain (Some 0). These instantiations are given by decorating the test method. The resulting code looks like this:

    @pytest.mark.parameterize(..., [

       # ...

       # Test INT
       ('int.tz', 'None', '0', '(Some 0)'),
       ('int.tz', 'None', '1', '(Some 1)'),
       ('int.tz', 'None', '9999', '(Some 9999)'),

       # ...

    def test_contract_input_output(self, client, contract, param, storage, expected):
        # ...
More involved testing

Testing instructions that depend on the state of the block chain requires more effort, as exemplified by our tests for the SOURCE instruction. This instruction pushes the address of the contract that initiated the current transaction, called the source. Note that this may differ from the sender, which is the address of the contract that initiated the current internal transaction and that is obtained by the SENDER instruction.

Let us demonstrate the difference between the two instructions with a scenario involving two contracts. The first contract, receiver, puts the result of SOURCE in storage:

parameter unit;
storage address;
code { DROP; SOURCE; NIL operation; PAIR }

The second contract, proxy, transfers whatever amount it receives to another contract, as specified by the parameter of the transaction:

parameter (contract unit);
storage unit;
code {
      DIP { NIL operation };

Now, assume that the contract bootstrap2 creates a transaction to the proxy contract sending 99 ꜩ and giving the address of the receiver contract as parameter . Then the smart contract proxy creates an internal transaction that transfers the received 99 ꜩ to receiver. This is illustrated by the following figure:

The proxy contract transfers money received from bootstrap2 to the receiver  contract

In this scenario, the source of the internal transaction (from proxy to receiver) is bootstrap2, whereas the sender of that transaction is the proxy.

We test SOURCE to verify that it operates as expected. First, we test a direct transfer from bootstrap2 to the receiver contract, and then a transfer through the proxy contract as in scenario described above. In both cases, we assert that source address left in storage of receiver is that of bootstrap2:

    def test_source(self, client):
        init_store = IDENTITIES['bootstrap4']['identity']
                           1000, 'bootstrap1')

        # direct transfer to the contract
        client.transfer(99, 'bootstrap2', 'receiver', ['--burn-cap', '10'])

        source_addr = IDENTITIES['bootstrap2']['identity']
        assert_storage_contains(client, 'receiver', f'"{source_addr}"')

        # indirect transfer to the contract through proxy
        contract_addr = client.get_contract_address('receiver')
        client.transfer(99, 'bootstrap2', 'proxy',
                        ['--burn-cap', '10', '--arg', f'"{contract_addr}"'])
        assert_storage_contains(client, 'receiver', f'"{source_addr}"')

Final results

From the beginning, we’ve had a reasonable degree of testing in place; which was reflected in the core’s stability. Despite this, we committed to achieving 100% (or close to it) of code coverage. To that end, since April 2019, we have added 7191 new Python integration tests, using 66 new test contracts. As a result of this effort, we now obtain a code coverage of 99.12%2 for the tests for the instruction integration tests and 100% for the full contract test suite. We cover all branches of the interpreter function in script_interpreter.ml.

In addition to quantifying the confidence in the test suite of the interpreter, bisect_ppx helped us discover dead code in the interpreter. The interpreter contained a “peephole optimization” for the UNPAIR macro. This optimization was supposed to match the expansion of the UNPAIR macro and transform directly a stack on the form (Pair a b) : S to a stack of form a : b : S, without executing the full expansion. To our surprise, we discovered that this branch of the interpreter is never visited during testing. Quite unexpectedly, since the UNPAIR macro is very commonly used in our test contracts.

It turns out that the optimization of UNPAIR was never applied. The expansion of this macro had been changed since the optimization was added to the interpreter, but the optimization had not been updated to match. Consequently, the pattern match in interpreter no longer matched the actual definition of UNPAIR. Thanks to bisect_ppx, we were able to detect this dead code. As gas updates in Babylon removes much of the gains from this optimization, we removed it in the current Carthage protocol.


Now that we have achieved a high level of code coverage in the interpreter, the next goal is to maintain this level and to extend it to remaining parts of the Tezos code base.

For the first objective, we have added code coverage measurements to our continuous integration system. This lets developers calculate test coverage through a button push in the CI’s interface. In the future, we would also like a developer to be automatically informed if, for example, they add a new Michelson instruction but forget to add tests verifying its implementation.

For the second objective, we are working on covering fully the type checker of Michelson, currently at 75.94% percent. To achieve this goal, we are writing both unit tests and integration tests. We are also encouraging developers to add automatic tests for all new functionality and when modifying existing behaviors.

End Notes

  1. For Python test cases, this counts the difference between test items collected by pytest when running pytest tests_python/tests/test_contract* in commit #0478afc7 and in commit #395bee9dc. For test contracts, this counts the difference between the number of *.tz files present in the src/bin_client/test/ respectively tests_python/contracts in commit #0478afc7 respectively commit #395bee9dc

  2. The missing 0.88% percent comes from a case of the deprecated STEPS_TO_QUOTA instruction which is unreachable through the tezos-client commands used for integration testing of contracts.