- Introduction
- Tale 1: a new tail recursive monadic interpreter
- Tale 2: hunt for a serialization performance bug
- Conclusion
Introduction
We subscribe to Kent Beck’s motto:
Make it work, make it right, make it fast!
— in that order. This implies an incremental software development process:
- When you make it work, you may elide corner cases or design considerations required to make it right; and
- when you make it right, you may introduce abstractions which are good for provability but are too naive to make it fast in the implementation.
This is normal: incremental development separates concerns so we can properly deal with each of them.
We’d like to share with you two tales of making Tezos faster in Granada by following the motto above:1
- Tale 1 led to an optimization of the Michelson interpreter; and
- Tale 2 led to an optimization of the serialization layer.
Both tales are set after the “make it right” stage and during the “make it fast” stage. The code was clean (and we wanted to keep it that way); we just wanted to increase performance.
With these two optimizations, Granada — the latest, and recently-accepted, amendment proposal of the Tezos economic protocol — executes smart contracts with much better gas consumption.2 For example, a typical call to the Dexter entrypoint `XtzToToken` consumes roughly
- 75600 units of gas in Florence, and
- 9320 units of gas in Granada.
That’s an eightfold improvement.3
Tale 1: a new tail recursive monadic interpreter
Tezos executes smart contracts in Michelson, a strongly-typed low-level language.4 Roughly speaking, when provided with
- an input argument, and
- a storage state, and
- a stack of pending operations,
a Michelson script computes
- a new storage state, and
- a new stack of pending operations, and
- it may also carry out some concrete action to change the state of the blockchain — the technical term is a side-effect.
An aside: side-effects via monads
Monads are how functional programmers deal with effectful computation, meaning programs that may compute a return value and also have side-effects on the ambient world as they do so (input/output, token transfers, etc).
The impatient reader is welcome to skip forward to our toy MichelBaby interpreter and return to this subsection for reference if required, especially when we mention `bind` later.
Still here? Great!
A monad is a type-constructor, meaning that given a type `t` and a monad `monad`, we have a type `t monad`. There are many different monads — e.g. the error monad, the state monad, and the `Lwt` (lightweight threads) monad — but for the purposes of this blog post, these all just represent actions on some state machine.5 Thus `t monad` is a type for computations that compute a `t` while exerting side-effects on the state described by the monad `monad`.
A monad comes with the following key operations:
- `return : 'a -> 'a monad`
  This inputs a value of type `'a` and returns that same value, in an empty (or initial) state.
- `bind : 'a monad -> ('a -> 'b monad) -> 'b monad`
  This binds a stateful computation from `'a` to `'b`, to an input `'a`-in-state, thus obtaining an output `'b`-in-state.
- `run : 'a monad -> 'a`
  This discards the state and just returns the value.6
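To make these operations concrete, here is a minimal sketch — a toy of our own, not the Tezos monad — where the only “state” is the possibility of failure:

```ocaml
(* A toy error monad: the "state" is just the possibility of failure.
   This is an illustrative sketch, not the Tezos monad. *)
type 'a monad = Ok of 'a | Error of string

let return (x : 'a) : 'a monad = Ok x

let bind (m : 'a monad) (f : 'a -> 'b monad) : 'b monad =
  match m with Ok x -> f x | Error msg -> Error msg

let run (m : 'a monad) : 'a =
  match m with Ok x -> x | Error msg -> failwith msg

(* Usage: a division that fails on zero, sequenced with [bind]. *)
let safe_div x y =
  if y = 0 then Error "division by zero" else return (x / y)

let result = run (bind (safe_div 10 2) (fun q -> return (q + 1)))
(* result = 6 *)
```

Here `bind` sequences `safe_div` with a pure continuation; if any step fails, the `Error` short-circuits past the rest of the computation.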
C, Java, and Python have monads — or more precisely, one monad: a single “global state monad” which represents how the actual state of the computer changes as it executes instructions. Since in these languages the global state monad permeates all computation, it need not be mentioned explicitly — but it is still there, as you can check: just run a virtual machine, take a snapshot, et voilà, your global state is saved to disk as a datum.
OCaml (in common with many other functional programming languages) allows us to be far more fine-grained, by bundling state-fragments up into monads.7 This has at least two benefits:
- fragments of machine state are now encapsulated as typed data and can be passed as arguments to other functions (as our crude example above of snapshotting a VM illustrates); and also,
- using monads encourages the programmer to be discriminating about precisely what effects a given computation really needs — if any.
So if we think about it, execution of a Michelson script requires a few distinct kinds of state:
- The state of the blockchain (e.g. to transfer tokens).
- The gas level consumed in the current block (especially in case we are about to run out of gas).
- A script may interact with the environment via I/O.
- A script may also fail because it tried to perform some invalid operation; this too is interaction with state — namely the failure monad, whose effect (if triggered) is to say “I have no return value because I have failed”.
In practice the Michelson interpreter uses a monad in OCaml which is the composition of three different monads: the error monad, the state monad, and the `Lwt` (lightweight threads) monad. If the reader sees `return`, `bind`, or `run` below — these are just the monadic glue coming from this or a closely related monadic combination.
An interpreter for Michelson scripts
An interpreter is a program that runs scripts. Let’s sketch what an OCaml interpreter might look like for a simple object-level language8 MichelBaby. MichelBaby is derived from a subset of Michelson’s instructions as used in the currently-live economic protocol Florence; so what follows can be viewed as a simplified but indicative reflection of the smart contracts system as it is currently deployed.
Here’s the OCaml type declaration for the abstract datatype `instr` of MichelBaby instructions (the impatient reader can skip forward to compare with the second version):
type (_, _) instr =
| Push : 'b -> ('s, 'b * 's) instr
| Mul : (z * (z * 's), z * 's) instr
| Dec : (z * 's, z * 's) instr
| CmpNZ : (z * 's, bool * 's) instr
| Loop : ('s, bool * 's) instr
-> (bool * 's, 's) instr
| Dup : ('a * 's, 'a * ('a * 's)) instr
| Swap : ('a * ('b * 's), 'b * ('a * 's)) instr
| Drop : ('a * 's, 's) instr
| Dip : ('s, 't) instr
-> ('a * 's, 'a * 't) instr
| IO : ( z * 's, z * 's) instr
| Seq : ('s, 't) instr * ('t, 'u) instr
-> ('s, 'u) instr
(The type `z` above is the integer type from the standard OCaml library `Zarith` for infinite precision arithmetic.)
We notice that:
- `instr` is polymorphic over two type parameters: `('s, 't) instr`.
  - The first parameter `'s` represents the input stack type, and
  - the second parameter `'t` represents the output stack type.
- Each line in the datatype declaration corresponds to an individual instruction. The type parameters give useful information on the intended meaning. For example:
  - `Mul : (z * (z * 's), z * 's) instr` is an instruction that inputs a stack headed by two integers (`z * (z * 's)`) and outputs a stack headed by one integer (`z * 's`). Intuitively, `Mul` pops the two integers off the stack, multiplies them, and pushes the result.
  - `Push : 'b -> ('s, 'b * 's) instr` is an instruction that inputs a value of type `'b` and returns an instruction that inputs a stack of type `'s` and outputs a stack of type `'b * 's`. Intuitively, `Push b` pushes `b` (of course).
- There is a special sequence constructor `Seq` that
  - inputs an instruction that inputs a stack `'s` and outputs a stack `'t`, and
  - inputs an instruction that inputs a stack `'t` and outputs a stack `'u`, and
  - returns a composite instruction that inputs a stack `'s` and outputs a stack `'u`.
  Note how the types make it obvious what the interpreter should do with `Seq (I, J)`: first `I`, then `J`.
- `IO` is a basic effectful instruction — for concreteness, the reader can assume that it writes the topmost integer of the stack to a file; it could be used, for example, as a hook for logging and profiling execution.
MichelBaby typechecking for free, courtesy of the OCaml typechecker
This style of polymorphic datatype declaration makes the OCaml typechecker act as a MichelBaby typechecker too: only (representations of) well-typed scripts can inhabit the `instr` type.
For example, consider the following MichelBaby implementation of the factorial, `fact`:
(* Inline syntactic sugar for sequencing *)
let ( @ ) s1 s2 = Seq (s1, s2)
(* MichelBaby instruction to calculate n! *)
let fact n =
assert (Z.(n > zero));
Push n
@ Push (Z.of_int 1)
@ Dup @ CmpNZ @ Dip Swap
@ Loop (Dup @ Dip Swap @ Mul @ IO @ Swap @ Dec @ Dup @ CmpNZ)
@ Drop
Hurrah, this is well-typed!
OCaml’s typechecker does not know that `fact` is a MichelBaby sequence of instructions to compute the factorial, but it does certify that `fact` has the following type:
fact : z -> ('s, z * 's) instr
In English, this is the OCaml typechecker telling us: `fact` will input an integer and return a MichelBaby instruction that inputs a stack and returns a stack of the same shape, except it has an integer pushed onto it.
Now suppose we forget the `CmpNZ` instruction above. Then the OCaml typechecker will reject the definition with the following error message:
This expression has type (bool * 'a, 'b) instr
but an expression was expected of type (Z.t * 'c, 'd) instr
Type bool is not compatible with type Z.t
This is the OCaml typechecker, giving us MichelBaby typechecking for free.
Tracing factorial: a standard toy example
Our implementation `fact` of factorial is a tracing factorial, meaning that it computes the factorial and contains an `IO` tracing (or breakpoint) hook, so we can observe intermediate results of the computation using an input/output primitive of our choice — a similar hook appears in our efficient OCaml factorial function below.
Tracing factorial is interesting because it is an easy-to-understand toy example of a mostly pure computation with some side-effects — which is what most smart contracts look like. This is a convenient toy example: no implication is intended that tracing factorial is the only thing that could be done in MichelBaby or Michelson!
The `step` function, Version 1
The `instr` datatype allows us to represent well-typed programs. Now we need to design an interpreter. First, suppose we are given a function `step`:
step : ('i, 'o) instr -> 'i -> 'o monad
Then we can write our interpreter as follows:
type ('storage, 'argument) well_typed_script =
(('storage * 'argument) * unit, (('storage * operations) * unit)) instr
type ('storage, 'argument) interpretation =
'storage -> 'argument -> ('storage * operations) monad
let interpreter : type storage argument.
(storage, argument) well_typed_script -> (storage, argument) interpretation =
fun instr storage argument ->
bind (step instr ((storage, argument), ()))
@@ fun ((storage, operations), ()) -> return (storage, operations)
So now we just have to write code for `step`. Standard practice is to follow the inductive structure of `instr`. We present Version 1 of the `step` function (cf. Versions 1.1 and 2 below):
let rec step : type a b. (a, b) instr -> a -> b monad =
fun instr stack ->
match (instr, stack) with
| Push x, stack
-> return (x, stack)
| Mul, (x, (y, stack))
-> return (Z.mul x y, stack)
| Dec, (x, stack)
-> return (Z.(sub x (of_int 1)), stack)
| CmpNZ, (x, stack)
-> return (Z.(compare x zero <> 0), stack)
| Loop _, (false, stack)
-> return stack
| Loop body, (true, stack)
-> bind (step body stack) @@ fun stack -> step (Loop body) stack
| Dup, (x, stack)
-> return (x, (x, stack))
| Swap, (x, (y, stack))
-> return (y, (x, stack))
| Drop, (_, stack)
-> return stack
| Dip i, (x, stack)
-> bind (step i stack) @@ fun stack -> return (x, stack)
| IO, (z, _)
-> bind (io z) @@ fun () -> return stack
| Seq (i1, i2), stack
-> bind (step i1 stack) @@ fun stack -> step i2 stack
The recursive function `step` observes the instruction and the stack, and produces a monadic computation using `return` and `bind`.
Let’s read through some cases of this definition:
- Case for `Push`. As prescribed by the definition of `Push`, the input stack type is `'s` and we must return a stack of the form `'b * 's`. That is the type of `(x, stack)`. Besides, since we need the returned value to be of type `('b * 's) monad`, we use `return` to turn `(x, stack)` into a monadic computation.
- Case for `Seq`. As prescribed by the definition of `Seq`, the instruction `i1` must be of type `('s, 't) instr` and `i2` of type `('t, 'u) instr`. The recursive call `step i1 stack` is well-typed and returns a value of type `'t monad`. Using `bind`, we can retrieve the stack of type `'t` needed to interpret `i2`. The final computation has type `'u monad`, as expected.
- Case for `IO`. Here, we use an effectful operation `io : z -> unit monad` and compose it with a computation that returns the stack unchanged.
We check that the interpreter correctly computes the factorial on an input (other inputs work too!):
# run (step (fact (Z.of_int 100)) ()) |> fst = ocaml_fact (Z.of_int 100);;
- : bool = true
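The code above leans on the ambient Tezos monad, the effectful `io` primitive, and Zarith’s `z`, so it does not run standalone. The following self-contained sketch — plain `int` in place of `z`, the identity monad in place of the real one, and `IO` omitted — behaves the same way and can be pasted into a vanilla OCaml toplevel:

```ocaml
(* Dependency-free sketch of the Version 1 interpreter:
   int replaces Zarith's z, the identity monad replaces the real one,
   and the effectful IO instruction is omitted. *)
type 'a monad = 'a
let return x = x
let bind m f = f m

type (_, _) instr =
  | Push : 'b -> ('s, 'b * 's) instr
  | Mul : (int * (int * 's), int * 's) instr
  | Dec : (int * 's, int * 's) instr
  | CmpNZ : (int * 's, bool * 's) instr
  | Loop : ('s, bool * 's) instr -> (bool * 's, 's) instr
  | Dup : ('a * 's, 'a * ('a * 's)) instr
  | Swap : ('a * ('b * 's), 'b * ('a * 's)) instr
  | Drop : ('a * 's, 's) instr
  | Dip : ('s, 't) instr -> ('a * 's, 'a * 't) instr
  | Seq : ('s, 't) instr * ('t, 'u) instr -> ('s, 'u) instr

let rec step : type a b. (a, b) instr -> a -> b monad =
 fun instr stack ->
  match (instr, stack) with
  | Push x, stack -> return (x, stack)
  | Mul, (x, (y, stack)) -> return (x * y, stack)
  | Dec, (x, stack) -> return (x - 1, stack)
  | CmpNZ, (x, stack) -> return (x <> 0, stack)
  | Loop _, (false, stack) -> return stack
  | Loop body, (true, stack) ->
      bind (step body stack) (fun stack -> step (Loop body) stack)
  | Dup, (x, stack) -> return (x, (x, stack))
  | Swap, (x, (y, stack)) -> return (y, (x, stack))
  | Drop, (_, stack) -> return stack
  | Dip i, (x, stack) -> bind (step i stack) (fun stack -> return (x, stack))
  | Seq (i1, i2), stack -> bind (step i1 stack) (fun stack -> step i2 stack)

let ( @ ) s1 s2 = Seq (s1, s2)

(* The tracing factorial from the post, minus the IO hook. *)
let fact n =
  Push n @ Push 1 @ Dup @ CmpNZ @ Dip Swap
  @ Loop (Dup @ Dip Swap @ Mul @ Swap @ Dec @ Dup @ CmpNZ)
  @ Drop

let () = assert (fst (step (fact 5) ()) = 120)
```

Swapping the identity monad back for the real error/Lwt combination changes nothing in `step` itself — which is precisely the point of keeping the monadic glue behind `return` and `bind`.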
This interpreter has a good property: it can’t fail. Because we get MichelBaby type-checking for free, every application of `step` to a well-typed argument in `instr` is guaranteed to execute successfully (provided the ambient execution environment doesn’t suffer an overflow; see the next paragraph). We can’t fail on an ill-formed operation like multiplying two strings or popping an empty stack, because our free MichelBaby typechecker will detect and reject the instruction sequence as ill-typed.
Unfortunately, our interpreter above is not yet “right”, because `step` is not tail recursive.
Note that:
- Recursive calls to `step` are not the final step in the computation in the clauses for `Seq` (in `step i1 stack`) and `Dip` (in `step i stack`).
- For comparison, the calls in `Loop` to `step (Loop body) stack` and in `Seq` to `step i2 stack` are tail recursive.
This is not “right” because each call to the interpretation loop that passes through a non-tail-recursive call to `step` may consume a bit of the OCaml call stack, and we can use this to provoke incorrect behaviour, namely a stack overflow (when the call stack fills up and the computer has to terminate execution because it runs out of memory):
# let rec long_seq accu n =
if n = 0 then accu
else long_seq (Seq (accu, Seq (Push Z.zero, Drop))) (n - 1);;
val long_seq : ('a, 'b) instr -> int -> ('a, 'b) instr = <fun>
# step (long_seq (Push Z.zero) 10000000) ();;
Stack overflow during evaluation (looping recursion?).
Exactly when the stack overflow error occurs is architecture-dependent, since it depends on how much stack is available to consume. A cleaner way to handle this is to count the nesting depth of non-tail-recursive calls, and abort if we go too deep.9
The `step` function, Version 1.1
We add a `depth` counter to Version 1 of `step` (cf. Version 2 below):
let rec step : type a b. int -> (a, b) instr -> a -> b monad =
fun depth instr stack ->
if depth > 100000 then fail "Too much recursion"
else match (instr, stack) with
| Push x, stack
-> return (x, stack)
| Mul, (x, (y, stack))
-> return (Z.mul x y, stack)
| Dec, (x, stack)
-> return (Z.(sub x (of_int 1)), stack)
| CmpNZ, (x, stack)
-> return (Z.(compare x zero <> 0), stack)
| Loop _, (false, stack)
-> return stack
| Loop body, (true, stack)
-> bind (step (depth + 1) body stack) @@ fun stack -> step depth (Loop body) stack
| Seq (i1, i2), stack
-> bind (step (depth + 1) i1 stack) @@ fun stack -> step depth i2 stack
| Dup, (x, stack)
-> return (x, (x, stack))
| Swap, (x, (y, stack))
-> return (y, (x, stack))
| Drop, (_, stack)
-> return stack
| Dip i, (x, stack)
-> bind (step (depth + 1) i stack) @@ fun stack -> return (x, stack)
| IO, (z, _)
-> bind (io z) @@ fun () -> return stack
The program is still non-tail-recursive but at least we have replaced a stack overflow which we cannot control, with an explicit branch within our own program which we can control.
Some tests are required to determine whether `100000` is a good limit.10
Once this limit value is appropriately chosen, we can claim to have made the interpreter “right”:
- it follows a standard OCaml programming style (hence natural to reason about), and
- it is robust to stack overflows.
Making the interpreter fast
How fast is our `step` function? To get an idea, we can compare how long it takes to compute the factorial of one hundred — `100! = 100*99*98*...*1` — with an equivalent native OCaml implementation of the factorial:
let lwt_fact n =
let rec aux k accu =
if Z.(compare k zero = 0) then return accu
else
let accu = Z.mul accu k in
bind (io accu) (fun () -> aux (Z.sub k (Z.of_int 1)) accu)
in
aux n (Z.of_int 1)
Here are the benchmarks, for the OCaml factorial function and V1.1 of our `step` function:
┌─────────────┬──────────┬────────────┐
│ Name │ Time/Run │ Percentage │
├─────────────┼──────────┼────────────┤
│ OCaml │ 3.71us │ 23.4% │
│ Step (V1.1) │ 15.83us │ 100.0% │
└─────────────┴──────────┴────────────┘
On this specific example, the interpreter is more than four times slower than the reference implementation in OCaml. Why?
Profiling execution with the Linux `perf` command, we learn that 50% of the time is spent in the monadic combinators (mentioned above). Mostly this is due to the code for handling `Seq` above, in which `bind` is called to glue together the interpretation of the two instructions of a sequence: this has a negative impact on performance because `bind` happens to be a rather complex operation when the monad includes the `Lwt` (lightweight threads) monad.
We need to get rid of those `bind`s.
Reducing the number of `bind`s
Let’s reexamine the MichelBaby code for `fact` above:
(* Inline syntactic sugar for sequencing *)
let ( @ ) s1 s2 = Seq (s1, s2)
(* MichelBaby instruction to calculate n! *)
let fact n =
assert (Z.(n > zero));
Push n
@ Push (Z.of_int 1)
@ Dup @ CmpNZ @ Dip Swap
@ Loop (Dup @ Dip Swap @ Mul @ IO @ Swap @ Dec @ Dup @ CmpNZ)
@ Drop
The MichelBaby script contains only one inherently effectful instruction: `IO`. Yet each `Seq` (written `@` above) calls a monadic `bind` operation, which is expensive. This means that we pay the cost of the monadic abstraction repeatedly, when in fact a pure computation could perform all but one of the calculations more efficiently. Can we separate the pure instructions from the impure ones, and then use `bind` just when impure instructions enter the scene?
We can start by syntactically separating the pure computations from the impure ones, taking inspiration from our OCaml version of the factorial:
let lwt_fact n =
let rec aux k accu =
if Z.(compare k zero = 0) then return accu
else
let accu = Z.mul accu k in
bind (io accu) (fun () -> aux (Z.sub k (Z.of_int 1)) accu)
in
aux n (Z.of_int 1)
Notice how
- the accumulator `accu` accumulates the result of a sequence of multiplications — that’s the pure computation — whereas
- the impure computation — namely the call to the effectful operation `io` — is performed in the body of the `aux` function.
We will now rephrase `step` so that:
- the pure computations rewrite the input stack passed as an argument to a tail recursive call to the next instruction, and
- the impure computations return monadic computations.
First, we must rephrase the datatype that defines the instructions, so that every instruction points to the next instruction (compare the code below with the first version; the type `z` was explained above):
type (_, _) instr =
| KHalt : ('s, 's) instr
| KPush : 'b * ('b * 's, 'f) instr ->
('s, 'f) instr
| KMul : (z * 's, 'f) instr ->
(z * (z * 's), 'f) instr
| KDec : (z * 's, 'f) instr ->
(z * 's, 'f) instr
| KCmpNZ : (bool * 's, 'f) instr ->
(z * 's, 'f) instr
| KLoop : ('s, bool * 's) instr * ('s, 'f) instr ->
(bool * 's, 'f) instr
| KDup : ('a * ('a * 's), 'f) instr ->
('a * 's , 'f) instr
| KSwap : ('a * ('b * 's), 'f) instr ->
('b * ('a * 's), 'f) instr
| KDrop : ('s, 'f) instr ->
('a * 's, 'f) instr
| KDip : ('t, 's) instr * ('a * 's, 'f) instr ->
('a * 't, 'f) instr
| KIO : (z * 's, 'f) instr ->
(z * 's, 'f) instr
Let’s compare `Mul` from the first version with `KMul` here:
- `Mul : (z * (z * 's), z * 's) instr` is an instruction that
  - inputs a stack headed by two integers (`z * (z * 's)`), and
  - outputs a stack headed by one integer (`z * 's`).
  Intuitively, `Mul` pops the two integers off the stack, multiplies them, and pushes the result.
- `KMul : (z * 's, 'f) instr -> (z * (z * 's), 'f) instr` is an instruction that
  - inputs a sequence of instructions that inputs a stack headed by an integer (`z * 's`) and outputs a stack `'f`, and
  - outputs a sequence of instructions that inputs a stack headed by two integers (`z * (z * 's)`) and outputs a stack `'f`.
Intuitively, `KMul` just adds `Mul` onto the head of an existing instruction sequence.
- Our new datatype does not represent instructions; it represents instruction-sequences.
- The datatype-constructors represent instructions to push instructions onto sequences.
Ever written a shopping list?
The family sits around the table calling out household essentials, and then Dad (or Mum — whoever does the shopping) adds items to a written list.
`KMul` corresponds to somebody saying “Don’t forget to buy multiplication”, and then Dad (or Mum) writes “Mul-ti-pli-cation” at the end of the list. Except that, because this blog post is about Very Serious Programming, our lists start on the right-hand side and expand leftwards.
The issue with the cost of `bind` is also relevant in this analogy. Most of the time it doesn’t matter what order items are put into the shopping basket, so it makes sense to split the shopping list into a small number of big heavy things that need to be put in the trolley last, and a large number of smaller things that can be picked up efficiently by traversing the supermarket in whatever order best suits that supermarket’s layout. Isn’t computer science wonderful?
`KHalt` is a new instruction, to represent the end of our to-do list (think: an EOF or EOS marker).
There is one catch to our to-do list / shopping list analogy (to be fair, the reader can expect programming smart contracts to be a little harder than shopping): it may not always be possible to determine the next instruction entirely in advance. Execution may dynamically depend on input parameters: for example, a control-flow operator like `KLoop` dynamically chooses the next instruction by observing a Boolean.
This means that we need another stack — a control stack — that will allow control-flow operators to dynamically define what the next instruction should be:
type (_, _) instrs =
| KNil : ('s, 's) instrs
| KCons : ('s, 't) instr * ('t, 'u) instrs -> ('s, 'u) instrs
Intuitively, if `instr` represents a sequence of instructions then `instrs` represents a list of sequences of instructions, with the constraint that the final stack type (`'t` above) of each sequence of instructions on the list must be compatible with the input stack type (`'t` again) of the next sequence in the list (if any).
The `step` function, Version 2
We present Version 2 of the `step` function (cf. Versions 1 and 1.1 above):
let step : type a b. (a, b) instr -> a -> b monad =
fun i stack ->
let rec exec : type a i. (a, i) instr -> (i, b) instrs -> a -> b monad =
fun k ks s ->
match (k, ks) with
| KHalt, KNil -> return s
| KHalt, KCons (k, ks) -> exec k ks s
| KIO k, ks ->
let z, _ = s in
bind (io z) (fun () -> exec k ks s)
| KPush (z, k), ks -> exec k ks (z, s)
| KLoop (ki, k), ks -> (
match s with
| true, s -> exec ki (KCons (KLoop (ki, k), ks)) s
| false, s -> exec k ks s )
| KMul k, ks ->
let x, (y, s) = s in
exec k ks (Z.mul x y, s)
| KDec k, ks ->
let x, s = s in
exec k ks (Z.sub x (Z.of_int 1), s)
| KCmpNZ k, ks ->
let x, s = s in
exec k ks (Z.(compare x zero) <> 0, s)
| KDup k, ks ->
let x, s = s in
exec k ks (x, (x, s))
| KSwap k, ks ->
let x, (y, s) = s in
exec k ks (y, (x, s))
| KDrop k, ks ->
let _, s = s in
exec k ks s
| KDip (ki, k), ks ->
let x, s = s in
exec ki (KCons (KPush (x, k), ks)) s
in
exec i KNil stack
Compared with Versions 1 and 1.1, Version 2 uses far fewer monadic combinators:
- Version 2 has one use of `return` (for `KHalt`, of course) and one use of `bind` (for `KIO`).
- Versions 1 and 1.1 have four uses of `bind` and more uses of `return` than we care to count.
How does this work?
- Pure computations (i.e. stack-related and arithmetic operations) just rewrite the input stack and tail-call the next instruction on the to-do list.
- The impure `KIO` computation calls `bind` to glue its interpretation to the rest of the evaluation.
- Control operators, like `KLoop`, update the control stack to dynamically determine the next instruction sequence.
Question: does this improve performance? Answer: Yes, significantly:
┌───────────────┬──────────┬────────────┐
│ Name │ Time/Run │ Percentage │
├───────────────┼──────────┼────────────┤
│ OCaml │ 3.62us │ 22.9% │
│ Step (V1.1) │ 15.83us │ 100.0% │
│ Step (V2) │ 6.39us │ 40.4% │
└───────────────┴──────────┴────────────┘
On this specific micro-benchmark, Version 2 is roughly two and a half times faster than Version 1.1, and its speed is comparable (to within a factor of two) with the efficient native OCaml implementation.
Note that protection against stack overflows (as discussed above) is not required, because the new interpreter is tail recursive. Note also that `step` Version 2 still gives us MichelBaby typechecking for free, so it offers as much static guarantee of correctness as Version 1.
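One step the post leaves implicit is how a `Seq`-style script (Version 1) is turned into the continuation form that Version 2 executes. A standard way is a type-preserving `compile` function that threads “the rest of the program” through each constructor. The sketch below is our own illustration under simplifying assumptions (plain `int` for `z`, effects omitted, names such as `kinstr` and `compile` are ours), not the protocol code:

```ocaml
(* Our illustration of translating Seq-style instructions into
   continuation style. int replaces z; effects are omitted. *)
type (_, _) instr =
  | Push : 'b -> ('s, 'b * 's) instr
  | Mul : (int * (int * 's), int * 's) instr
  | Dec : (int * 's, int * 's) instr
  | CmpNZ : (int * 's, bool * 's) instr
  | Loop : ('s, bool * 's) instr -> (bool * 's, 's) instr
  | Dup : ('a * 's, 'a * ('a * 's)) instr
  | Swap : ('a * ('b * 's), 'b * ('a * 's)) instr
  | Drop : ('a * 's, 's) instr
  | Dip : ('s, 't) instr -> ('a * 's, 'a * 't) instr
  | Seq : ('s, 't) instr * ('t, 'u) instr -> ('s, 'u) instr

type (_, _) kinstr =
  | KHalt : ('s, 's) kinstr
  | KPush : 'b * ('b * 's, 'f) kinstr -> ('s, 'f) kinstr
  | KMul : (int * 's, 'f) kinstr -> (int * (int * 's), 'f) kinstr
  | KDec : (int * 's, 'f) kinstr -> (int * 's, 'f) kinstr
  | KCmpNZ : (bool * 's, 'f) kinstr -> (int * 's, 'f) kinstr
  | KLoop : ('s, bool * 's) kinstr * ('s, 'f) kinstr -> (bool * 's, 'f) kinstr
  | KDup : ('a * ('a * 's), 'f) kinstr -> ('a * 's, 'f) kinstr
  | KSwap : ('a * ('b * 's), 'f) kinstr -> ('b * ('a * 's), 'f) kinstr
  | KDrop : ('s, 'f) kinstr -> ('a * 's, 'f) kinstr
  | KDip : ('t, 's) kinstr * ('a * 's, 'f) kinstr -> ('a * 't, 'f) kinstr

type (_, _) kinstrs =
  | KNil : ('s, 's) kinstrs
  | KCons : ('s, 't) kinstr * ('t, 'u) kinstrs -> ('s, 'u) kinstrs

(* [compile i k] prepends instruction [i] to the continuation [k]. *)
let rec compile : type s t f. (s, t) instr -> (t, f) kinstr -> (s, f) kinstr =
 fun i k ->
  match i with
  | Push x -> KPush (x, k)
  | Mul -> KMul k
  | Dec -> KDec k
  | CmpNZ -> KCmpNZ k
  | Dup -> KDup k
  | Swap -> KSwap k
  | Drop -> KDrop k
  | Loop body -> KLoop (compile body KHalt, k)
  | Dip d -> KDip (compile d KHalt, k)
  | Seq (i1, i2) -> compile i1 (compile i2 k)

(* Tail-recursive execution, as in Version 2 (without the monad). *)
let run : type a b. (a, b) kinstr -> a -> b =
 fun i stack ->
  let rec exec : type s t. (s, t) kinstr -> (t, b) kinstrs -> s -> b =
   fun k ks s ->
    match (k, ks) with
    | KHalt, KNil -> s
    | KHalt, KCons (k, ks) -> exec k ks s
    | KPush (x, k), ks -> exec k ks (x, s)
    | KLoop (ki, k), ks -> (
        match s with
        | true, s -> exec ki (KCons (KLoop (ki, k), ks)) s
        | false, s -> exec k ks s)
    | KMul k, ks -> let x, (y, s) = s in exec k ks (x * y, s)
    | KDec k, ks -> let x, s = s in exec k ks (x - 1, s)
    | KCmpNZ k, ks -> let x, s = s in exec k ks (x <> 0, s)
    | KDup k, ks -> let x, s = s in exec k ks (x, (x, s))
    | KSwap k, ks -> let x, (y, s) = s in exec k ks (y, (x, s))
    | KDrop k, ks -> let _, s = s in exec k ks s
    | KDip (ki, k), ks -> let x, s = s in exec ki (KCons (KPush (x, k), ks)) s
  in
  exec i KNil stack

let ( @ ) s1 s2 = Seq (s1, s2)

let fact n =
  Push n @ Push 1 @ Dup @ CmpNZ @ Dip Swap
  @ Loop (Dup @ Dip Swap @ Mul @ Swap @ Dec @ Dup @ CmpNZ)
  @ Drop

let () = assert (fst (run (compile (fact 5) KHalt) ()) = 120)
```

Note the `Seq` case of `compile`: sequencing dissolves at compile time into continuation plumbing, which is exactly why the interpreted program no longer pays a `bind` per `Seq`.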
Is that everything?
Granada’s new Michelson interpreter gains most of its efficiency from the program transformation described above, but three other optimizations are also substantive:
-
We use a local gas counter in the interpretation loop instead of the gas level found in the context.11 The OCaml compiler can represent this local gas counter in a machine register instead of a heap-allocated object; reading from and writing to a machine register is several orders of magnitude faster than for a heap-allocated object.
-
Similarly, we cache the top of the stack by passing it as a dedicated argument in the interpretation loop, separately from the tail of the stack. Semantically this makes no difference — it’s still a stack! — but by this device we encourage the compiler to store the top of the stack in a machine register, potentially saving many trivial RAM read/writes as the top of the stack would otherwise get pushed to and then popped from memory (see a paper on stack caching for interpreters).
-
The Michelson interpreter includes an integrated logging tool, so that users can test smart contract code off-chain (e.g. profile their smart contracts before they go live). This logging tool is inactive for live on-chain code, but in Florence the logger has nonzero cost even when inactive — thus, live smart contracts run slightly slower in Florence than they would if the profiler did not exist at all. For Granada, we found a way to exploit the control stack to implement genuinely zero-cost logging.
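The second optimization — caching the top of the stack in its own argument — can be illustrated on a toy stack machine (our own simplified model, unrelated to the actual Michelson code): the top value travels in the dedicated parameter `accu`, which the compiler can keep in a register, while only the rest of the stack lives in an allocated list.

```ocaml
(* Toy illustration of top-of-stack caching: [accu] is the cached top
   of the stack, [stack] is the rest. A naive interpreter would keep
   everything in one heap-allocated list. *)
type op = Push of int | Add | Mul

let run (program : op list) : int =
  let rec exec ops accu stack =
    match ops with
    | [] -> accu
    | Push n :: ops -> exec ops n (accu :: stack)
    | Add :: ops -> (
        match stack with
        | x :: stack -> exec ops (accu + x) stack
        | [] -> failwith "stack underflow")
    | Mul :: ops -> (
        match stack with
        | x :: stack -> exec ops (accu * x) stack
        | [] -> failwith "stack underflow")
  in
  match program with
  | Push n :: ops -> exec ops n []
  | _ -> failwith "program must start with a push"

let () = assert (run [ Push 2; Push 3; Add; Push 4; Mul ] = 20)
```

Semantically this is the same stack machine; the point is purely that arithmetic on `accu` never touches the heap-allocated list.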
Tale 2: hunt for a serialization performance bug
Doing serialization right thanks to well-typed encodings
- Serialization converts data values into a form that can be stored or transmitted on the network (e.g. when you save a text file to disk, it gets serialized to 0s and 1s).
- Deserialization is the dual operation of reconstructing the data from its serialized representation.
(De)serialization is ubiquitous in Tezos. For example:
- Deserialization occurs whenever data is read from the chain (e.g. loading a script source code to execute a smart contract call).
- Serialization occurs whenever the protocol stores information on the chain.
- (De)serialization occurs whenever nodes communicate over the Tezos network.
Nearly anything useful you might do on a blockchain involves either reading from the chain, writing to it, or communicating with another node — so the short list above covers pretty much everything, other than the occasional few microseconds of pure computation.
The node and the economic protocol use a library named `data-encoding` — version 0.2 in Florence; version 0.3 in Granada — that provides convenient operations to define encodings, to automatically generate functions converting values into sequences of bytes or JSON objects, and of course functions to decode serialized data back into values. Serialization functions should not generally be written by hand: that is hard and error-prone work. Better to generate them automatically. To this end, `data-encoding` provides combinators to define a value of type `t encoding` for an arbitrary serializable type `t`.
For example, suppose `t` is a basic binary tree:
type tree = Leaf of int | Binary of tree * tree
Then we can produce an encoding just by following the structure of `t` as follows:
let tree_encoding : tree encoding =
mu "tree" @@ (* `fix` is this function -> *) fun self ->
union [
case
(Tag 0)
~title:"Leaf"
int31
(function Leaf x -> Some x | _ -> None)
(fun x -> Leaf x);
case
(Tag 1)
~title:"Binary"
(obj2 (req "left" self) (req "right" self))
(function Binary (left, right) -> Some (left, right) | _ -> None)
(fun (left, right) -> Binary (left, right));
]
With that definition, we describe the encoding declaratively, following the equation that defines the type of the values it works over. The combinators’ types are informative and so limit the chances of error:
val mu : string ->
('a encoding -> 'a encoding) ->
'a encoding
val case : title:string -> case_tag ->
'a encoding -> ('t -> 'a option) -> ('a -> 't) ->
't case
val union : ?tag_size:[`Uint8 | `Uint16] ->
't case list ->
't encoding
So `data-encoding` makes serialization right: we avoid writing error-prone boilerplate, and thanks to the combinators
- the input required from the programmer is minimal (thanks to their expressivity),
- the results are fairly likely to be correct (thanks to the combinators’ precise types), and
- the results are safe (again, thanks to static typing).
How costly is this abstraction?
Serialization with `data-encoding` version 0.2 (as in Florence, and thus before Granada) may have been “right”, but it was not particularly fast. Typically, if we compare the time required to encode a binary tree with \(2^{20}\) nodes using `data-encoding` with the time taken to encode the same tree using `Marshal` (the unsafe serialization module of the OCaml standard library)12, we get the following numbers:
┌───────────────────┬──────────┬────────────┐
│ Name │ Time/Run │ Percentage │
├───────────────────┼──────────┼────────────┤
│ Marshal │ 67.46ms │ 30.4% │
│ data-encoding 0.2 │ 221.82ms │ 100.0% │
└───────────────────┴──────────┴────────────┘
`data-encoding` version 0.2 is roughly three times slower than `Marshal` in that example.13
One might assume this is just the price of having a nice abstraction for, and a well-typed implementation of, serialization. After all, `Marshal` is untyped and implemented in C, and low-level programming in C can be fast. Surprisingly, this is not the case.
Using the Linux `perf` command, we can observe that a specific function in `data-encoding` takes half of the execution time during writing: `check_cases`.14 `check_cases` checks that `data-encoding`’s combinator for `union` types is properly applied to a nonempty list of disjoint cases. It is surprising to see `check_cases` called during serialization of concrete values: one would have expected it to be called just once, when the definition of the encoding is processed. In fact, things are even worse than they appear, because `check_cases` has quadratic complexity with respect to the number of cases. Thus execution time degrades further if we add new data constructors to our type of binary trees:
┌───────────────────┬──────────┬───────────┐
│ Name │ Time/Run │Percentage │
├───────────────────┼──────────┼───────────┤
│ Marshal │ 74.25ms │ 25.4% │
│ data-encoding 0.2 │ 292.16ms │ 100.0% │
└───────────────────┴──────────┴───────────┘
For datatypes with five or more constructors, the performance of `data-encoding` seems to bottom out at roughly ten times slower than `Marshal`.
So, why is `check_cases` called during serialization of concrete values? The investigation is simple, since the definition of `tree_encoding` uses only two combinators: `union` and `mu`. `check_cases` is called during serialization of concrete values because of a bug in `mu`, which we now explain.
Fortunately, the case for `mu`-based encodings in the function `write_rec` is a mere one-liner:
| Mu {fix; _} ->
write_rec (fix e) state value
This line says that `fix` (the function passed to `mu`) is called to get the encoding of the value to be encoded. Here `Mu` is the internal data constructor used in the function `mu`. Let us see what this `fix` function is when `write_rec` is executed on our example `tree_encoding`:
1. let tree_encoding : tree encoding =
2. mu "tree" @@ (* `fix` is this function -> *) fun self ->
3. union [
4. case
5. (Tag 0)
6. ~title:"Leaf"
7. int31
8. (function Leaf x -> Some x | _ -> None)
9. (fun x -> Leaf x);
10. case
11. (Tag 1)
12. ~title:"Binary"
13. (obj2 (req "left" self) (req "right" self))
14. (function Binary (left, right) -> Some (left, right) | _ -> None)
15. (fun (left, right) -> Binary (left, right));
16. ]
This is a correct encoding function!
In fact the return type of the fix
function is tree encoding
so it makes sense to recursively call write_rec
to encode value
in the interpretation of mu
since it is of type tree
.
Besides, self
will be equal to e
in the interpretation of mu
,
that is Mu {fix; ...}
which is also consistent because tree_encoding
is used to serialize the sub-trees.
However, even though the definition of tree_encoding
is functionally correct,
it has a hidden performance bug.
Each time we write a tree,
we execute union
(see line 3 in the code above), and union
carries out some sanity checks, including the aforementioned check_cases
.
The encoding for trees is an immutable value, so its definition need only be checked once — not each time a tree is serialized!
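The general principle is to validate at definition time so that serialization can run check-free. A minimal sketch of that principle, using types and names of our own rather than data-encoding's API:

```ocaml
(* Sketch: sanity checks run once, when the encoding value is constructed;
   the returned write function is then trusted for every value written. *)
type 'a encoding = { name : string; write : 'a -> string }

let checked_encoding ~name write =
  if name = "" then invalid_arg "encoding: empty name"; (* one-time check *)
  { name; write }

let int_enc = checked_encoding ~name:"int" string_of_int

(* However many values we serialize with int_enc, the check never re-runs. *)
```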
How to fix fix
?
Now that we understand the problem with fix
, how can we address it?
The idea is simple: the construction of the encoding for an arbitrary recursive type (of which the binary trees considered above are an example) should execute fix
only once
and then produce an encoding that we can trust to serialize an arbitrary number of data elements of that type,
with no further checks.
This works because an encoding is an immutable value
so executing fix
on e
always returns the same value.
On the implementation side,
we just need to call fix e
once,
then remember the result in a local reference
and return the content of this reference in subsequent evaluations of fix e
.
Programmers may recognize this as the common technique called memoization (caching the results of function calls).
Here is an excerpt of the corrected mu
that illustrates how we proceed:
let self = ref None in
let fix_f = fix in
let fix e =
  match !self with
  | Some (e0, e') when e == e0 -> e'
  | _ ->
      let e' = fix_f e in
      self := Some (e, e') ;
      e'
in
...
self
remembers that fix e = e'
.
This allows us to return e'
when fix
is called with e
again
instead of recomputing fix e
.
We re-run our first benchmark on data-encoding
version 0.2 but with a correct mu and see that this patch significantly improves the performance of data-encoding
:
┌────────────────────────────────┬──────────┬────────────┐
│ Name │ Time/Run │ Percentage │
├────────────────────────────────┼──────────┼────────────┤
│ Marshal │ 69.19ms │ 61.5% │
│ data-encoding with correct mu │ 112.42ms │ 100.0% │
└────────────────────────────────┴──────────┴────────────┘
We reduced the execution time, and also our encoding is now less sensitive to the number of cases in the encoding. Indeed, the second benchmark (in which our type for binary trees is extended with additional data constructors) performs almost as well (112.42ms vs 128.18ms):
┌───────────────────────────────┬──────────┬────────────┐
│ Name │ Time/Run │ Percentage │
├───────────────────────────────┼──────────┼────────────┤
│ Marshal │ 74.14ms │ 57.8% │
│ data-encoding with correct mu │ 128.18ms │ 100.0% │
└───────────────────────────────┴──────────┴────────────┘
Is that everything?
The optimizations described above account for the bulk of the performance improvements moving from data-encoding 0.2
in Florence to data-encoding 0.3
in Granada.
data-encoding 0.3
does include other incremental optimizations, which yield real but less significant speedups, and we can sum it up in a little equation as follows:
data-encoding 0.3 in Granada = data-encoding 0.2 from Florence +
correct mu +
some other optimizations
Yet marginal gains are still gains, and they accumulate: putting all of them together, we were able to reach nearly the same level of efficiency as Marshal
for important encodings — in particular the one dedicated to the Micheline
format, a central data representation in the Tezos protocol.
Conclusion
We said at the start of this blog post that we followed a process of “Make it work, make it right, make it fast”.
These three qualities, far from being in opposition, complemented one another to make our optimizations possible and safe. We needed the codebase to be of high quality — having a sound architecture and using appropriate abstractions — to find, perform, and check our optimizations. Clean, well-engineered code meant that we could understand the code, effectively profile it, locate sources of inefficiency, and then apply local rewrites to improve efficiency.
Code that is right is easier to optimize. Even in the context of a critical system like Tezos, where correctness and security have the highest priority — indeed, especially in such a context — making it right also helps to make it fast.
-
Some background: the Tezos blockchain has what amounts to an operating system layer called the economic protocol, which is unusual amongst blockchains in that it is upgradable by stake-weighted community vote.
The currently-live economic protocol is Florence, and an upgrade to Granada is now scheduled for August 2021. So this blog post can be thought of as an inside peek into how a significant performance optimization was attained for Tezos’ next big OS upgrade. Developing such updates is a large part of Nomadic Labs’ day-to-day activities. ↩
-
A blockchain is “just” a distributed state machine and smart contracts are “just” programs that run on that machine to modify the state. “Gas” is used to bound computation and so prevent denial of service attacks: a smart contract receives a gas budget when started, and if it exhausts its gas, it gets terminated. Lower gas consumption is a corollary of optimizing on-chain computation to make it run more efficiently. Or to put it another way: optimizing Tezos to run faster on a given piece of actual hardware, means that more useful on-chain computation can be fit into a given gas budget. ↩
-
You can use another language to write your smart contract, but to execute it on the Tezos blockchain you’ll need to compile to Michelson. So does that make Michelson a bytecode language? Yes, but just calling it that somewhat belies its power: we describe it as a “strongly-typed low-level language” instead, which arguably captures the spirit of things somewhat better. ↩
-
The subtle simplification here is that the monad does not quite represent a single state; it represents a state-change (i.e. a side-effect). However, if we assume that all computations start from an initial empty state, then a state change can be identified with the result of applying the state change to an initial empty state. ↩ -
That’s like running a calculation on a calculator, reading off the result … then switching off the calculator. The calculator had a state but we don’t care about it any more because we have the result we wanted. ↩
-
We’d like to call them bytes of state … but that would be confusing. Perhaps bits of state? ↩
-
Object-level language here just means that MichelBaby is the language we are implementing. Contrast this with the meta-level language which is the language in which we are writing the implementation; OCaml, in this case. ↩
-
Python does this for you whether you want it to or not. There’s a system-wide (mutable) limit of 1000 on recursive calls. The implication is that if your function has recursed that many times, it’s actually doing a loop (i.e. an iteration). Tail recursion is a functional programmer’s version of iteration and in particular, OCaml automatically recognizes and optimizes tail recursive calls to avoid stack allocation, so that a tail recursive OCaml program is operationally just a (generalisation of an) iterative loop. ↩
-
“Good” here means something specific: In practice, smart contracts are subject to gas limits — specifically, the current Operation Gas Limit for the currently-live Florence protocol is 1,040,000; see also this limit directly in the relevant line in the Florence source code — so a smart contract that gobbles our interpreter’s resources is likely to be terminated for exhausting its gas allocation anyway by the ambient Tezos abstract machine, and this would normally happen before the local hardware that is running the Tezos node on which our abstract machine happens to be executing, runs out of stack space. Thus a “good” limit is
- large enough to allow complete gas exhaustion for practical smart contracts that run on-chain (so in practice it would never actually be reached), yet
- small enough to prevent stack overflows on the underlying execution environment (so our code has predictable, controllable behaviour that we can reason about).
Does the fact that we do not expect this branch to be reached in practice make it irrelevant? Not really: for impractical smart contracts, stack could still be exhausted before gas. For instance, a deep recursion of sufficiently cheap operations (e.g.
PUSH 1; DROP
) might still overflow the call stack before gas is exhausted. So we do still have to detect and fail deep recursions — just in case. ↩ -
The context is the state of the blockchain used to validate a block. ↩
-
The Marshal module of OCaml is untyped. For this reason, the module must trust the programmer that some serialized bytes can be turned into a value of a given type. If the programmer is wrong, the program may crash. ↩
-
Test file here: comparing-marshal-and-data-encoding-0.2.ml.html. ↩
-
The files linked are at https://gitlab.com/nomadic-labs/data-encoding/-/tree/0.2/src at time of writing. However, we will give permalinks to an archived snapshot of the repository at https://archive.softwareheritage.org/browse/origin/content/?branch=refs/tags/0.2&origin_url=https://gitlab.com/nomadic-labs/data-encoding/&path=src/. ↩