Testing is an important complement to the formal methods that we use throughout Tezos to ensure software quality. In this blog post we discuss how we use regression testing through the pytest-regtest tool.
Regression testing
Regression testing is a coarse-grained testing method for detecting unintended changes in the behavior of the system under test. We have applied regression testing to tezos-client, notably to smart-contract type checking and execution.
Broadly, regression testing consists of two components: first, a way of running a test scenario and storing the behavior of the system under test; second, a way of running the same test again and comparing the behavior of the first and second runs.
This is why we call regression testing coarse-grained: a regression test passes only if the output is exactly the same as in the first, stored run. Consequently, a regression test will also detect non-functional changes, such as changes in the whitespace output by the tested system. Such nitpicking can be overly restrictive, and we will show later in this blog post how to relax overly strict verification.
In our case, we have enabled regression testing on an existing test suite that exercises the type checker and interpreter of Michelson smart contracts. In the first run of the test suite, the output of each command passed to tezos-client by a test case is stored in a log file by the test runner. After verifying the validity of the output, we commit these log files to source control. In subsequent runs, the output of each command sent to the client is compared to the expected output stored in the corresponding log file.
Regression testing using pytest-regtest
To implement regression tests, we plug pytest-regtest into the pytest-based Python testing and execution framework used for integration testing of Tezos. This plugin provides the regtest fixture, which is used as a file object: it can be written to, or used as a context manager.
If the --regtest-reset flag is passed to pytest, then all output written to the regtest fixture during a given test case is stored in a log file. On subsequent runs (unless the --regtest-reset flag is passed again), the output written to the fixture is compared to what was recorded during the first run.
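As a minimal illustration of the fixture, independent of Tezos (the describe_addition function here is made up for the example), a test simply writes whatever output it wants checked to the fixture:

def describe_addition(a, b):
    # Stand-in for the system under test.
    return f"{a} + {b} = {a + b}"

def test_describe_addition(regtest):
    # A first run with --regtest-reset records this output; subsequent
    # runs compare against the recorded log and fail on any difference.
    regtest.write(describe_addition(10, 15) + "\n")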
Let’s write a test case and then extend it with regression testing. We start out with a classic unit test that exercises the Michelson ADD instruction. First, we define a Michelson contract that takes a pair of natural numbers as parameter and, using the ADD instruction, calculates their sum and puts it in storage:
parameter (pair nat nat);
storage (option nat);
code { CAR; UNPAIR; ADD; SOME; NIL operation; PAIR; }
We store the contract in add_example.tz and write the following pytest:
class TestRegressionExample:
    def test_example_add(self, client):
        add_contract = 'add_example.tz'
        run_script_res = client.run_script(
            add_contract, 'None', 'Pair 10 15', amount=None
        )
        assert run_script_res.storage == '(Some 25)'
The test runs the add_example.tz contract with parameters 10 and 15 using the run_script method of the client fixture of the Tezos Python testing suite, and then verifies that the result is indeed 25.
As expected, this test passes:
$ pytest -rf tests_python/tests/test_contract_opcodes.py -k test_example_add
============================== test session starts ===========================================
platform linux -- Python 3.7.3, pytest-4.4.0, py-1.8.0, pluggy-0.9.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: .../tezos/tests_python, inifile: pytest.ini
plugins: timeout-1.3.3, parallel-0.0.9
collected 302 items / 301 deselected / 1 selected
tests_python/tests/test_contract_opcodes.py::TestContractOpcodes::test_example_add PASSED
This is all very well, but say that we want to test some other aspect of the ADD instruction, such as its gas consumption. We could write a new test case, but this quickly gets tiresome, as Michelson has some 80 instructions that should all be tested. Furthermore, gas constants are likely to evolve, so hard-coding them in the tests makes for brittle tests. Our test suite already contains tests for all instructions; it would be more convenient if we could piggy-back on the existing tests, but without explicitly asserting the gas costs of each instruction in those tests.
First, how do we inspect the gas consumption of a contract? We can track the gas consumption of a program (and, by extension, of its instructions) using the --trace-stack option of the client’s run script command. When this option is given, the interpreter reports a stack trace that contains the contents of the stack and the remaining gas at each program point during execution. Let’s try it (the output has been redacted for brevity):
$ tezos-client run script add_example.tz on storage None and input 'Pair 10 15' --trace-stack
storage
(Some 25)
emitted operations
big_map diff
trace
- location: 9 (remaining gas: 799557 units remaining)
[ (Pair (Pair 10 15) None) ]
- location: 10 (remaining gas: 799557 units remaining)
[ (Pair 10 15) @parameter ]
- location: 13 (remaining gas: 799556 units remaining)
[ (Pair 10 15) @parameter
(Pair 10 15) @parameter ]
- location: 14 (remaining gas: 799556 units remaining)
[...]
- location: 11 (remaining gas: 799554 units remaining)
[ 10
15 ]
- location: 18 (remaining gas: 799553 units remaining)
[ 25 ]
- location: 19 (remaining gas: 799553 units remaining)
[ (Some 25) ]
- location: 20 (remaining gas: 799552 units remaining)
[ {}
(Some 25) ]
[...]
We can deduce that the gas consumption of the ADD instruction in this execution is less than one gas unit (location 18), and that the gas consumption of the whole script is no more than five units. This is correct, and we would like future tests to signal if this changes unexpectedly. A solution would be to write the whole stack trace to the regtest fixture.
To implement this idea, we modify our test:
    def test_example_add_reg(self, client, regtest):
        add_contract = "add_example.tz"
        run_script_res = client.run_script(
            add_contract, 'None', 'Pair 10 15', amount=None, trace_stack=True
        )
        regtest.write(run_script_res.client_output)
        assert run_script_res.storage == '(Some 25)'
The modified test requires the regtest fixture by including it as a formal parameter. Then we instruct the client fixture to pass the --trace-stack flag to tezos-client through the trace_stack=True argument. Finally, we write the complete output from the client, as obtained from run_script_res.client_output, to regtest. Now if we run test_example_add_reg, it fails due to the lack of a pre-existing log file:
$ pytest tests_python/tests/test_regression_example.py -k example_add_reg
tests_python/tests/test_regression_example.py::TestRegressionExample::test_example_add_reg FAILED [100%]
================== FAILURES ======================
____________________________ TestRegressionExample.test_example_add_reg ____________________________
regression test output differences for tests/test_regression_example.py::TestRegressionExample::test_example_add_reg:
> --- current
> +++ tobe
> @@ -1,43 +1 @@
> -storage
> - (Some 25)
> -emitted operations
>
> -big_map diff
[...]
We run again with the --regtest-reset flag to create the log file:
$ pytest --regtest-reset tests_python/tests/test_regression_example.py -k example_add_reg
The generated log file is found in tests_python/tests/_regtest_outputs/test_regression_example.TestRegressionExample\:\:test_example_add_reg.out, and we can verify that it contains the expected stack trace:
$ cat tests_python/tests/_regtest_outputs/test_regression_example.TestRegressionExample\:\:test_example_add_reg.out
storage
(Some 25)
emitted operations
big_map diff
trace
- location: 9 (remaining gas: 799557 units remaining)
[ (Pair (Pair 10 15) None) ]
- location: 10 (remaining gas: 799557 units remaining)
[ (Pair 10 15) @parameter ]
- location: 13 (remaining gas: 799556 units remaining)
[ (Pair 10 15) @parameter
(Pair 10 15) @parameter ]
[...]
- location: 11 (remaining gas: 799554 units remaining)
[ 10
15 ]
- location: 18 (remaining gas: 799538 units remaining)
[ 25 ]
- location: 19 (remaining gas: 799537 units remaining)
We can now run the regression test and verify that it passes:
$ pytest tests_python/tests/test_regression_example.py -k example_add_reg
=================================================================== test session starts ===========================
platform linux -- Python 3.7.3, pytest-4.4.0, py-1.8.0, pluggy-0.9.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: .../tezos/tests_python, inifile: pytest.ini
plugins: timeout-1.3.3, parallel-0.0.9, regtest-1.4.2
collected 2 items / 1 deselected / 1 selected
tests_python/tests/test_regression_example.py::TestRegressionExample::test_example_add_reg PASSED [100%]
========================== 1 passed, 1 deselected in 2.29 seconds ================================================
Now if a developer inadvertently introduces a change that modifies the gas consumption of the ADD instruction or its functioning, then this regression test should break. Likewise if the developer intentionally changes the semantics of the instruction. However, this should not be taken for granted. A good practice when writing tests is to test the test itself, by intentionally breaking the tested functionality and verifying that the corresponding test breaks. We follow this advice and introduce a change in the gas cost of additions.
Gas costs are (mainly) defined in src/proto_alpha/lib_protocol/michelson_v1_gas.ml. Assume a cat, well-meaning but imprudent, prances over the developer’s keyboard, introducing the following change:
diff --git a/src/proto_alpha/lib_protocol/michelson_v1_gas.ml b/src/proto_alpha/lib_protocol/michelson_v1_gas.ml
index f61e519fe..228af62c6 100644
--- a/src/proto_alpha/lib_protocol/michelson_v1_gas.ml
+++ b/src/proto_alpha/lib_protocol/michelson_v1_gas.ml
@@ -182,7 +182,7 @@ module Cost_of = struct
let abs int = atomic_step_cost (61 + ((int_bytes int) / 70))
let int _int = free
let neg = abs
- let add i1 i2 = atomic_step_cost (51 + (Compare.Int.max (int_bytes i1) (int_bytes i2) / 62))
+ let add i1 i2 = atomic_step_cost (9000 + 51 + (Compare.Int.max (int_bytes i1) (int_bytes i2) / 62))
Oops! Suddenly additions have a fixed overhead of over 9000! It would be unfortunate if this change passed our test suite. Let us verify that it does not. We recompile the node and run the tests again:
$ pytest tests_python/tests/test_regression_example.py -k example_add_reg
=================================================================== test session starts ===========================
platform linux -- Python 3.7.3, pytest-4.4.0, py-1.8.0, pluggy-0.9.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: .../tezos/tests_python, inifile: pytest.ini
plugins: timeout-1.3.3, parallel-0.0.9, regtest-1.4.2
collected 2 items / 1 deselected / 1 selected
tests_python/tests/test_regression_example.py::TestRegressionExample::test_example_add_reg FAILED [100%]
================== FAILURES ======================
____________________________ TestRegressionExample.test_example_add_reg ____________________________
regression test output differences for tests/test_regression_example.py::TestRegressionExample::test_example_add_reg:
> --- current
> +++ tobe
> @@ -28,16 +28,16 @@
> - location: 11 (remaining gas: 799554 units remaining)
> [ 10
> 15 ]
> - - location: 18 (remaining gas: 799413 units remaining)
> + - location: 18 (remaining gas: 799538 units remaining)
All is well: the change is reflected in the output and the test does indeed break. Curiously, the difference between the previous gas consumption and the new one is not 9000, but 799538 - 799413 = 125. The reason is that atomic_step_cost works with an internal gas unit, which is later scaled to obtain the coarser user-facing gas unit; the numbers here are consistent with a scaling factor of 72, since 9000 / 72 = 125.
An eavesdropping client
In order to avoid modifying all existing test cases and specifying manually the output that should be registered in each test, we wrote a new pytest fixture, client_regtest. This fixture combines the existing client fixture with the regtest fixture of pytest-regtest. In addition, it automatically stores the output of all commands passed to tezos-client. For example, we can now express the example test case as:
    def test_example_add_reg_client_regtest(self, client_regtest):
        client = client_regtest
        add_contract = "src/bin_client/test/contracts/opcodes/add_example.tz"
        run_script_res = client.run_script(
            add_contract, 'None', 'Pair 10 15', amount=None, trace_stack=True
        )
        assert run_script_res.storage == '(Some 25)'
Using this method, we add regression testing to the existing test cases simply by changing the fixture.
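To give an idea of how such an eavesdropping fixture could work, here is a minimal sketch. For illustration it only wraps the run_script method used above; the actual client_regtest fixture intercepts all client commands and may be implemented differently:

import pytest

@pytest.fixture
def client_regtest_sketch(client, regtest):
    # Wrap run_script so that every call also records the client's raw
    # output (run_script_res.client_output in the tests above) to the
    # regtest log.
    original_run_script = client.run_script

    def recording_run_script(*args, **kwargs):
        result = original_run_script(*args, **kwargs)
        regtest.write(result.client_output)
        return result

    client.run_script = recording_run_script
    yield client
    client.run_script = original_run_script  # undo the wrapping after the test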
Scaling down the pedantry: adding output conversions
One difficulty with regression testing is dealing with output that may change from one test run to another. For instance, a contract that uses the NOW instruction will have a different stack trace on each run, as the timestamp pushed by NOW differs on each execution. Like a merciless river, time flows ever forwards. There is little we can do to change the laws of physics, but we can apply “output conversions” to the logs written by pytest to replace such data in the output with fixed markers.
We have written a client_regtest_scrubbed fixture that extends client_regtest with a series of such conversions. For example, a timestamp such as 2019-09-23T10:59:00Z is replaced by [TIMESTAMP], and operation hashes by [OPERATION_HASH].
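Conceptually, such a conversion is just a regular-expression substitution applied to the output before it is written to the log. A sketch, with illustrative patterns that may differ from the ones actually used in the test suite:

import re

# Illustrative patterns: an ISO-8601 timestamp and a Base58 operation hash.
TIMESTAMP_RE = re.compile(r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z')
OPERATION_HASH_RE = re.compile(r'\bo[1-9A-HJ-NP-Za-km-z]{50}\b')

def scrub(output: str) -> str:
    # Replace run-dependent data with fixed markers so that the logs
    # are stable from one run to the next.
    output = TIMESTAMP_RE.sub('[TIMESTAMP]', output)
    output = OPERATION_HASH_RE.sub('[OPERATION_HASH]', output)
    return output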
An issue with long paths in pytest-regtest
During the implementation of regression testing, we discovered an issue in pytest-regtest’s generation of log files. Tests with long parameterizations would create log files with names whose length exceeded the operating system’s path-length limit. We have patched pytest-regtest to replace the explicit parameters in a log file name with a hash when the name would exceed the limits imposed by the operating system.
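In sketch form, the renaming scheme looks like this (the exact length limit and hashing details in the patch may differ):

import hashlib

MAX_BASENAME = 255  # a common file-name length limit on Linux file systems

def log_file_basename(test_id: str) -> str:
    # Short names are used verbatim; overlong ones have their tail
    # replaced by a hash so the file name stays within the OS limit.
    name = test_id + '.out'
    if len(name) <= MAX_BASENAME:
        return name
    digest = hashlib.sha256(test_id.encode('utf-8')).hexdigest()[:16]
    keep = MAX_BASENAME - len('.out') - 1 - len(digest)
    return test_id[:keep] + '-' + digest + '.out'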
Discussion
We have followed the method outlined above to add regression testing to the Tezos integration tests, notably in the existing test suite test_contracts_opcode.py, which tests Michelson instruction semantics. We have also applied it to test macro expansion.
As a result, the Michelson instructions are now covered by regression tests. This decreases the risk that a future change inadvertently modifies macro expansion, or the semantics or gas cost of an instruction, as doing so will in many cases alert the developer by breaking the test suite. However, due to the dynamic nature of gas costs, it is impossible for any test suite to cover all behaviors. Think of the addition of numbers, whose cost depends on the size of the operands: tests can only cover a finite number of operands, but since integers and naturals are unbounded in Michelson, they can never cover all of them. Consequently, a change that alters the gas cost for an untested pair of operands will not be detected. Nonetheless, regression testing provides an important safety net that helps to nip bugs in the bud early. Moving forward, we would like to extend regression testing to other parts of the test suite.
Our current approach to output conversion is quite ad hoc: we use the same set of output conversions for all tests. For example, this results in all timestamps being removed from the output, even those that are constant between two runs. In the future, we would like to use test-specific conversions, targeting directly the parts of the output that are expected to change.
Finally, regression testing is coarse-grained and therefore fragile: in addition to picking up unintended changes, it also detects intended, harmless ones. For instance, in merge request metastatedev/tezos!54, the UNPAIR macro is replaced by an instruction. UNPAIR being widely used, this change breaks a very large number of regression tests. If such a change also modifies instruction semantics or gas usage in an unintended way, then bugs may be introduced. In such cases, it is important that the developer inspects the diffs of the regression logs very carefully, and that they keep the change to the code minimal to ease review!
Indeed, as with any method for verification and validation, regression testing should be approached as a complementary activity to careful code review and formal methods.