At Nomadic Labs we are proud to create next-gen software … but we are even prouder to help create the next generation of software scientists!
So it’s our great pleasure to host and work with several PhD students, who bring unique perspectives on the technology which they help us to develop, and whose interest in blockchain we are happy to nurture and inform.
In this blogpost, we will ask five questions of one of our students, Guillaume Bau (and a couple of questions of his supervisor at Nomadic Labs).
Over to you Guillaume … no the mic is over here … that’s right, you’re on air!
I’m currently a PhD student at LIP6 and Nomadic Labs. I graduated with a computer science BSc in 2007 from the Université Pierre et Marie Curie (now Sorbonne Université). I worked for some time in some network analysis and cybersecurity companies, mainly as a C language developer. In 2016, I entered a startup targeting floating point manipulations errors and approximations. This was the occasion for me to use my theoretical background: I developed an abstract interpreter for C and Ada, designed to detect floating point arithmetic misbehaviour in avionics software.
Then, while first entering Nomadic in October 2018, I had the opportunity to continue my academic studies: I graduated with a master’s degree and worked on a prototype for an abstract interpreter for Michelson as my Nomadic Labs internship.
2. Tell us more about the topic of your PhD: who is your mentor, and what are your research objectives?
The topic of my PhD is to develop a static analysis tool for the Michelson smart contract programming language, helping developers to assert correctness properties in their Michelson code and to detect errors in it.
It will use Abstract Interpretation, a static analysis technique which tends to give quick results, even for thousand-line contracts, and which does not require user intervention or programming to guarantee some properties, as are required with proof assistants.1 Abstract interpretation works by abstracting known, unknown, and partially-known values as overapproximations, e.g. using an interval [1,10] if we know a value might be 1 or might be 10. This often allows us to deduce useful mathematical or logical properties, for example, “this value is always different from 0”, or “this piece of code cannot be reached”.
Abstract domains are responsible for managing Michelson instructions with respect to these value approximations, and a large part of the work needed to write an abstract interpreter is to develop the abstract domains handling some specific language feature like Michelson sets or maps, or some specific property, like storage changes though multiple calls to a specific smart contract entrypoint. Of course, because we are always manipulating overapproximations, an analysis can raise false alarms2 if the approximation is too loose.
Much of my PhD work goes into developing new abstract domains that make useful trade-offs between precision and performance, to help Tezos smart contracts developers guarantee properties of their work. For instance, I wrote an abstract domain of “smashed” lists, able to represent as a unique abstract value the set of elements contained inside a Michelson list, and an improved domain will consist in keeping abstract items of the list separate.
My research director is Antoine Miné from LIP6, and my mentors are Vincent Botbol and Mehdi Bouaziz from Nomadic Labs. My work integrates with Mopsa, Antoine Miné’s modular framework for Abstract Interpretation. This is currently maintained by Antoine with the help of the Mopsa team, and allows building analysers for multiple languages (C and Python analysis are provided built-in), while providing useful primitives and a robust framework for abstraction compositions.
3. What is the value added and innovative side of your PhD thesis (if applicable, in the Tezos blockchain)?
Michelson has an interesting combination of language features. It is high-level in some respects — it’s a strictly-typed language with high-level data structures — yet low-level in others — e.g. it is stack-based. Stack-based languages are uncommon3 and there is relatively little work on static analysis for them, and furthermore, Michelson’s high-level data structures create opportunities to develop rather precise customised abstractions.
Michelson is also interesting because it’s a smart contract language, not a regular programming language, so you tend to write different kinds of programs, for which you may need to prove different kinds of correctness properties. Some of these may be new and specific to smart contract programming languages; like token-preserving invariants or resources analysis (gas, storage size evolution, and so on).
Interesting properties also emerge from the underlying blockchain platform, e.g. properties of inter-contract and transactions behaviour, and these can depend on the underlying blockchain implementation.4 That is: if you have a specific blockchain implementation C running smart contracts X, Y, and Z, then an effective analysis of the (full panoply of possible) interactions between smart contracts X, Y, and Z in the context of C, may be key to proving safety, even if you personally only care about the correctness of contract X! As you might imagine this can get quite complicated, which is why effective automated tools can be so valuable.
Thus from the point of view of a hands-on real-life Tezos smart contract developer, an automatic static analysis tool would add real value. It would help find errors earlier in you development cycle, and could prevent blocked contracts or malicious exploitation. Abstract interpretation is scalable to real-life use-cases and does not require much human work to obtain its guarantees of properties. I would argue that for the working programmer, this is a desirable combination of features!
4. What is the benefit of preparing a PhD and spending most of your time in a private company vs. a University lab?
University labs are fecund environments and ideas can be exchanged and relations between different topics explored — but it’s not always the case that the actual developers of the system you’re studying are working in the very same office. It would be like working on a C language analyser and being able to pop over to Brian Kernighan in the neighboring office to ask “Why did you do it like that?”
Studying at Nomadic, I’m surrounded by Tezos developers, some of whom are working on techniques for certifying smart contracts with Coq; some of whom are working on the Michelson interpreter; and we are all reviewing code from other Tezos contributors. So I’m right at the heart of Michelson’s development, with the ability to pre-empt the language’s evolution, and even participate in it.
Blockchains are an interesting topic, being at the intersection of several important disciplines: distributed systems, networking, cryptography, language design and analysis and more. So there is much opportunity for new and useful academic research.
And, the Tezos blockchain in particular considered safety in its design right from the start: the main node is written in a memory-safe language, the cryptographic routines are formally verified, and Michelson was built without implicit failures. Nomadic engineers share a desire to apply formal verification to guarantee the inner working of the node and network behaviour. This culture from Tezos and particularly from Nomadic Labs implied working with the OCaml language, and working on Abstract Interpretation, both of which I’m very interested in.
I have a background in and am familiar with basic abstract interpretation techniques, to e.g. find numeric invariants, but I have a lot more to learn to be able to build systems capable of proving properties involving e.g. authentication and authorisation. While working with the Mopsa framework, I’m learning how real-life and industrially robust abstract interpreters are developed, with advanced feature sets, like modular analysis and abstraction compositions.
I would be very pleased to be able to make a small contribution to making smart contracts more secure — initially on the Tezos chain, but I hope the research may be transferable to others — and I look forward to studying and learning at Nomadic Labs and creating a practical project which will help real Tezos users to create value, by making things that people want and can (safely) use!
Ph.D. students are expected to absorb vast amounts of information. My role is to give Guillaume guidance, advice, and a framework to understand the material — as well as to suggest pertinent literature and references where relevant.
Blockchain is a particularly challenging and exciting space in this regard, because it is inherently interdisciplinary and because, being so new, there is less of a body of knowledge to guide us.
My background includes expertise in formal verification (in particular using abstract interpretation) and in the implementation of static analysers. I also know the Tezos codebase and its ecosystem intimately, having participated in its evolution from even before the mainnet launch back in September 2018.
My technical input is then twofold:
- I will help Guillaume by proposing techniques to increase performance and precision5 in his static analyser design, and also
- I will guide him towards useful real-life use-cases, to make his work maximally useful for the smart contract developer community.
Many of the most expensive errors on blockchains are related to buggy smart contracts. Formal verification can help critical software development but it remains had to apply: it’s great when it works, but tooling tends to be either incomplete or hard to apply (or both).
Guillaume hopes to create a fully automated analyser. The user just needs to provide a specification (e.g. “only my tz1 address may remove tokens from this contract”) and the analyser will automatically check whether the specification is met.
Guillaume’s tool will also be able to detect coding style anti-patterns, dead code and security issues.
The abstract interpretation method Guillaume is using has two strengths. First, it’s great with the numerical (i.e. arithmetic) properties which are common in smart contract code (asset management, ERC20-like contracts, decentralised exchanges, etc.). Second, it’s fast: thousands of lines of code can be analysed in a few seconds.
I look forward to integrating this analysis into the Tezos smart contract development process and so helping users and developers in the Tezos ecosystem to catch more bugs, more quickly, in their smart contracts.
To be more precise (and in fairness to proof assistants):
- there is a highly labour-saving subset of properties for which abstract interpretation tends to not require user intervention or programming (e.g. detecting integer overflows), and
- there is another still-useful subset for which abstract interpretation is applicable, but which may require some user intervention e.g. to refine invariants (“user X has permission to transfer the contract tokens” might require us to refine our invariant with a precondition like “user X has address tz1”), and
- there is a third subset for which abstract interpretation can not provide useful invariants in general: e.g. because the properties involved are just too complex for a general abstract domain to check. In this case, you could try to build a customised abstract domain to prove a specific instance of this property — but this may just be a case of picking the right tool for the job and it may be quicker and easier to prove your invariant using other means, and then input it into the analyser as an assumption.
So abstract interpretation is not a magic wand, but it can feel like that sometimes. ↩
This is like a false positive in a medical test: not necessarily bad in and of itself, but it wastes time. ↩
May I expand on this in a footnote please? [Sure, please do -ed]
Stack-based general programming languages are uncommon, but low-level assembly and bytecode languages are frequently stack-based: e.g. Intel and ARM assembly, Java VM bytecode, Microsoft CLR bytecode, and OCaml bytecode.
So Michelson is an interesting engineering hybrid that combines features typical of a low-level compile-target language, with features of high-level programming languages. It is possible to program directly in Michelson, though I wouldn’t want to myself. A key point is that verifying low-level languages requires to reconstruct the programmer’s intent to some degree, which can be hard in some cases and may lead to loss of precision. But, Michelson’s high-level features ameliorate this, and furthermore, being strictly-typed means that specific classes of problems can never arise, which facilitates the writing of both an interpreter and an abstract interpreter. ↩
For instance, transaction evaluation ordering differs between Ethereum and Tezos, and Tezos has notions of internal vs. external operations. A good abstract interpreter for Tezos, if it is to be maximally useful, may need to know about and account for these specificities. ↩
More performance = Increased speed. More precision = the analysis raises fewer false positive alarms. ↩
Sound means something rather precise in mathematics, namely “correctness with respect to a logical model that is proven to mathematical precision” …. ↩
… and correct means “the stated invariants are maintained and there can be no exceptions, counterexamples, or corner cases”. Knowing how to translate intuitive notions of correctness into verifiable invariants is a key skill, but that’s a whole other discussion. ↩