Rust Programming Language RustConf 2021, 15 Sep 2021

From YouTube: RustConf 2021 - Compile-Time Social Coordination by Zac Burns

Description

Compile-Time Social Coordination by Zac Burns

You can write good code. So can I. But can we write correct code together? The hardest problem facing the ordinary developer of today is not in algorithms or frameworks. Bugs are commonly found between the lines. Projects contain rules that must be adhered to everywhere but are specified nowhere. They are conventions, tribal knowledge, and best practices. Let's learn how Rust makes it easier to write code that agrees and is consistent across files, crates, and persons. This is the story of how I stopped stepping on everyone's toes and learned to love the borrow checker.

A

A biologist, a physicist, and a Rust programmer are observing a house. Two people go into the house, and a while later three people exit the house. The biologist says they reproduced.

A

The physicist says there was an error in the measurement, and the Rust programmer says there are now negative one persons in the house. Three people, three different sets of assumptions about how the world works.

A

I've been programming for a long time, more than 20 years now. Over that period, I've held several jobs across various unrelated industries with diverse teams and used many programming languages.

A

Each scenario came with its own unique problems, but in every environment the same issue has come up over and over again. That issue was maintaining consistency in code written by different people and at different times. A well-architected, well-implemented program is internally consistent. There are patterns and design choices that are adhered to in remote locations across the code.

A

When done well, it's a beautiful thing, but if you change or add code, how do you keep consistency with the design choices already made?

A

Let me give you an example of what can go wrong when consistency is not maintained.

A

Here's a fictional game component in a made-up, maximally terse and permissive language called Hazard Lang.

A

We have a Poison class, which encapsulates game component logic, and an event that damages the entity with each tick of the game loop by subtracting one from its health.

A

Can you spot the bug? It's a trick question. The bug can't be seen when looking at this code in isolation. The bug only manifests when this code is used within a system of components that make different sets of assumptions.

A

Let's look at another component that in practice would be in another file. Being in another file may make it less likely that you ever see the code side by side.

A

This Missile class has an event, on collide, that damages any entity it collides with. The difference between this component and the previous is that it locks the entity before modifying its state.

A

There's no obvious problem with either the Poison or Missile component taken individually. Only by looking at both simultaneously can we see that each makes different assumptions about the system they belong to. One locks the entity before modifying its state and the other does not. Which one is correct?

A

This is another trick question. No component by itself can be either correct or incorrect. The whole program is only correct if every component agrees with the same set of assumptions as every other component. The Poison component may assume the system has two copies of the state: one copy for updating, while another thread looks at a read-only copy for rendering. The Missile component may assume that the system allows for concurrent entity updates. The larger the program, the more chances there are for inconsistencies, since any line could create an inconsistency with any other line.

A

The potential for inconsistency grows exponentially as a function of the program's size. This is why large scale development is difficult.

A

Over the last 20 years, much of my attention has been devoted to cultivating best practices for scaling code bases without creating inconsistency. What did I learn?

A

Cease development. Read all 27 million lines of the program and its dependencies. After a month, when you fully understand how each line interacts with every other line, make a small change and hope nobody else touched anything over that time. Repeat until you go out of business. I'm being sarcastic. What are some practical things we do?

A

We write comments. Comments help, but it's not enough. The architecture is spread out all over the code. To explain how every line conforms with the overall design would be redundant, so people don't do that. Worse, if you're writing new code, any relevant comments would be, by definition, somewhere else, where you can't see them.

A

You can't have new code agree with the comments if you don't know the comments exist.

A

We gain experience and we share tribal knowledge. It helps, but it's not enough. One way we disseminate lessons learned over time is by communicating best practices.

A

An example of a best practice is "no global variables", but nobody agrees on what the best practices should be. The reason nobody agrees is that everything is situational. If you are writing embedded software, global variables may indeed be the best answer to a problem.

A

So best practices are more like ideas to consider because I got into trouble a few times. As such, best practices are not a solution to finding out what constraints apply to the code you are writing right now.

A

The neglected company wiki is not the answer, nor is code review. I could go on, but that's not the point. Here is the point: the more I consider the problem of maintaining consistency, the more I'm convinced that the best return on our investment as a discipline is in the compiler and in compiler-aided solutions. Unlike communication or mentorship, the compiler scales to any team size, giving personalized advice to every contributor exactly when they need it. Unlike a company wiki, the compiler cannot be ignored or out of date.

A

Unlike some random blog posts on the internet, the compiler has complete knowledge of your project's source.

A

For the remainder of the time, I want to tell you how Rust's compiler and standard API work together to create a pit of success for compile-time social coordination, starting with the locked entity example. We'll also look at how you can leverage the same features in your libraries to create consistent pits of success of your own.

A

Let's look at how a mutex is used in Rust, preventing the problem we saw with the game components. The first line moves the entity into the mutex.

A

Now, if we try to access the entity data without locking, the whole program fails to compile. The compiler will tell us that the entity cannot be accessed because it has been moved into the mutex. So we roll that back.

A

Only by locking the mutex can we modify the underlying data. Past the second line, the lock is no longer held, and once again the data in the entity is not accessible. For some of you, this is not new information.
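
The slide is not part of this transcript. A minimal sketch of the pattern being described, assuming a hypothetical Entity struct with a health field and a hypothetical damage function (neither name comes from the talk), might look like this:

```rust
use std::sync::Mutex;

// Hypothetical game entity; the name and field are assumptions for illustration.
struct Entity {
    health: i32,
}

// Subtracts one from the entity's health and returns what is left.
// The data is reachable only through the guard returned by `lock`.
fn damage(entity: &Mutex<Entity>) -> i32 {
    let mut guard = entity.lock().unwrap(); // the lock is acquired here
    guard.health -= 1;
    guard.health
} // `guard` is dropped here, which releases the lock

fn main() {
    let entity = Entity { health: 10 };
    let shared = Mutex::new(entity); // `entity` is moved into the mutex

    // entity.health -= 1; // does not compile: `entity` was moved into the mutex

    let remaining = damage(&shared);
    println!("health after one tick: {remaining}");
}
```

Because the mutex owns the entity, there is no way to reach the health field except through the guard, so a forgotten call to lock becomes a compile error rather than a data race.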

A

Of course a mutex would own its data in Rust. Of course you cannot modify data while the lock is not held. But I want you to rediscover the joy and quiet brilliance of this API.

A

Some people new to Rust may be hearing it for the first time. I'm confident that they didn't hear it before Rust, because this API is impossible in C++, C#, Python, JavaScript, Java, PHP, Pascal, Go, Swift, or any other mainstream language.

A

People coming from these languages may have recently spent time debugging a forgotten call to lock, or they may have recently stored a reference to data protected by the lock to use later on the UI thread, or maybe they locked the wrong lock or forgot to release the lock. None of these bugs are possible in Rust.

A

We can go even further. The consistency guarantee afforded by the mutex API allows us to make some wild optimizations that would never get past code review in any other language.

A

There's an innovative API on Mutex, get_mut, which takes a unique reference to self, returning a wrapper over a unique reference to our data. The comment for get_mut says: since this call borrows the Mutex mutably, no actual locking needs to take place.

A

The mutable borrow statically guarantees no locks exist. So there is a way to get access to the data in the entity without locking, but it's checked by the compiler. Whoa.
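
As a small sketch of that API, assuming the same hypothetical Entity struct and an illustrative heal_exclusive function (both are assumptions, not code from the talk), access without locking could look like this:

```rust
use std::sync::Mutex;

// Hypothetical game entity, used only for illustration.
struct Entity {
    health: i32,
}

// Takes the mutex by unique reference. While this borrow is alive, no other
// reference to the mutex, and therefore no lock guard, can exist.
fn heal_exclusive(entity: &mut Mutex<Entity>, amount: i32) -> i32 {
    // `get_mut` returns LockResult<&mut Entity>. It does not lock; the
    // result is an error only if the mutex was poisoned by a panic.
    let data = entity.get_mut().unwrap();
    data.health += amount;
    data.health
}

fn main() {
    let mut entity = Mutex::new(Entity { health: 5 });
    let health = heal_exclusive(&mut entity, 3);
    println!("health: {health}");
}
```

The "nothing else has access" claim is not a comment that reviewers have to trust. It is the &mut in the function signature, and the compiler rejects any caller that cannot prove it.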

A

Imagine committing some code in another language with a comment stating that a lock does not have to be acquired here, because we know that nothing else has access to the data at this point. First, you would not be trusted to make this assertion.

A

Even if your claim were verified, the change would likely be rejected in review, because we do not know the assertion will hold in the future. This is again the same social coordination problem playing out across time. Because the social coordination problem is so hard, we've become accustomed to the habit of engineering suboptimal software to avoid making mistakes. We've even invented entire architectures that are elaborate kid gloves, hiding the important part of the problem we are actually trying to solve: transforming the data.

A

What at first glance appears to be a restriction imposed by the compiler actually grants us the freedom to write better software without being constrained by the limitations of communication and shared knowledge between its authors.

A

Let's look at another example in Hazard Lang, this time of serialization.

A

Here we have a generic serialization function in Hazard Lang. It takes a collection of names and values to serialize, zips them together to form a property collection, loops over each property, then writes a tag for each name, followed by a write of the corresponding value to the file. Where's the bug? The theme of this talk is bugs that come about through poor social coordination, bugs that don't fit on one screen.

A

Let's take a look at another file. The scheme used here by hazard serializer is to have the index in the schema serve as the tag. There's a comment stating that the schema needs to be consistent.

A

Okay, no bug yet, but our story is not over. Remember this part about the schema needing to be consistent.

A

One day the original developer leaves and another developer is hired to replace them. The new developer writes this usage of hazard serializer.

A

Maybe they didn't notice the utility function that the original developer wrote, so they do some things by hand.

A

They set up their schema with a predetermined, consistent order, just like the docs said to, and they write tagged property names and values just like they are supposed to. So the new dev tries their usage of hazard serializer and writes a file, but when they take a look in the app, the data is all wrong.

A

They ask around and get the story that, because there was so much trouble maintaining consistency in the schema between the server and the client, the convention among the team is to use alphabetical ordering in the schemas.

A

Here it is, using name, date for the writer, but the reader, which is implemented somewhere else by someone else, uses date, name. The new dev, being the proactive sort and ready to make a good impression, decides to fix this once and for all.

A

The dev makes this commit. The old version is on top and the new version is on the bottom. The dev added a call to names.sort(), which ensures that there is a consistent ordering that adheres to the internal convention of having alphabetical names.

A

Their code now works. There's an added benefit that there are fewer requirements for calling this function, so the bug won't be hit in the future. And we ship. Yay! Except... except nobody looked at these two pieces of code at the same time: the old serialize function that I showed you at the start with the new, fixed tag function. Do you know how many things a person can keep in their head at once?

A

Three to four. So even in the few minutes I distracted you with a story about the new dev, we may have forgotten to consider how this change would interact with the old code.

A

So what's the problem? Well, if you look at both sections at the same time, you can see that we are iterating over a list while modifying it. Oops. The program probably won't crash. Instead, it will write garbage data, which is arguably worse. At no point in time were this serialize function and the new tag function on the same screen at the same time. People changed different bits, fixing the problems they were aware of, creating local consistency but global inconsistency.

A

Remember that these examples are simplified. Real code bases are comprised of huge directed graphs of function calls being mutated concurrently by multiple people. Even if two inconsistent nodes in that graph were just a few hops away, there could be hundreds of nodes reachable within the same distance.

A

Finding the inconsistency is much like finding a needle in a haystack if you don't know a priori where to look. Our example may seem contrived, but the issue is common enough that a Google search for the phrase "don't iterate a list while modifying" comes up with over 50 million results. This time-lapse animation shows only the files and directory structure of a project that I worked on at The Graph to index blockchain data.

A

If the nodes for structs and functions were included here, there would be more than 3,500 more nodes on the screen and uncountably more connections. Yet this is a modestly sized workspace maintained by a relatively small team.

A

Avoiding iterating over a list while modifying it in Hazard Lang required social coordination, but it's a mistake that I've never made in Rust, even when working with other people. The compiler ensures for me that the connected nodes in our call graph are consistent. In Rust, there are three ways to pass values. You can use &T, a shared reference to T. A shared reference is immutable unless you implement special protections to guard against the problems that come with shared mutability. We call that interior mutability, but the typical shared reference is read-only.

A

There's also &mut T, a unique reference to T, which enables the permission to write to the value, and finally T, which transfers ownership of the value. Using these three types, we make explicit what are hidden contracts in Hazard Lang. The Rust compiler can verify these contracts.
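
The three ways of passing a value can be sketched as follows. The Schema type and the function names are hypothetical, chosen only to echo the serializer example:

```rust
// Hypothetical schema type: the position of a name in the list is its tag.
struct Schema {
    names: Vec<String>,
}

// &Schema: shared reference. The function may read but not modify.
fn tag_of(schema: &Schema, name: &str) -> Option<usize> {
    schema.names.iter().position(|n| n == name)
}

// &mut Schema: unique reference. The function may modify, and the caller
// must prove that nobody else is looking at the schema meanwhile.
fn sort_schema(schema: &mut Schema) {
    schema.names.sort();
}

// Schema: ownership is transferred. The caller cannot use the value again.
fn into_names(schema: Schema) -> Vec<String> {
    schema.names
}

fn main() {
    let mut schema = Schema {
        names: vec!["name".to_string(), "date".to_string()],
    };
    println!("{:?}", tag_of(&schema, "name"));
    sort_schema(&mut schema);
    println!("{:?}", tag_of(&schema, "name"));
    let names = into_names(schema);
    // `schema` can no longer be used here: it was moved into `into_names`.
    println!("{names:?}");
}
```

Each signature states a contract that in a more permissive language would live in a comment or in someone's head.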

A

Let's see how it does that. Our serializer bug boiled down to iterating the list while modifying it. If we try the same in Rust, we get a compiler error.

A

This is the distilled version of the bug. Here, at names.iter(), a temporary is constructed that maintains the state of the iterator. The iterator holds a shared reference to names, but on the following line, the call to sort takes names by unique reference. It is a contradiction for a reference to be both unique and shared at the same time. Contradictions are bugs.
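
The slide is not reproduced in this transcript. A sketch of the distilled bug, with the rejected lines left as comments so the rest compiles, might look like the following. The tags function is a hypothetical stand-in for the serializer's tagging step:

```rust
// Returns (tag, name) pairs, where the tag is the index in sorted order.
fn tags(names: &mut Vec<String>) -> Vec<(usize, String)> {
    // The buggy shape, which the compiler rejects with error E0502
    // (a mutable borrow while a shared borrow is still in use):
    //
    // for name in names.iter() { // the iterator holds a shared reference
    //     names.sort();          // sort needs a unique reference: rejected
    //     println!("{name}");
    // }

    // An accepted shape: the unique borrow taken by `sort` ends before the
    // shared borrow taken by `iter` begins.
    names.sort();
    names.iter().cloned().enumerate().collect()
}

fn main() {
    let mut names = vec!["name".to_string(), "date".to_string()];
    for (tag, name) in tags(&mut names) {
        println!("{tag}: {name}");
    }
}
```

The same check applies if the call to sort is buried several function calls deep, because a callee that sorts has to ask for &mut access in its signature.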

A

What's neat here is that Rust will detect this contradiction through any number of layers in structs and function calls, so that even if the code is not simple like in the example, it will not compile with the inconsistency. The bug we introduced in our serializer cannot occur in Rust, because the whole program would fail to compile. We're almost done. I promised to show you how to use the compiler to enforce your own rules that would otherwise require social coordination.

A

We will use the type system and compiler errors to teach new developers architectural decisions. This is going to be a bit of a doozy, so please prepare yourself by enjoying this picture.

A

Okay, let's go. First, to set up the situation: at Edge & Node, we write multi-threaded web servers that each serve thousands of requests every second. These servers read data from a database. The database has a connection limit, and connections take time to set up, so to avoid going over the limit or incurring the setup cost on every request, we use a connection pool. One day, we notice that a server instance stops serving requests.

A

There's no warning. Everything stops: CPU usage flatlines, there's no disk usage, no queries are served, and if we restart, everything is okay again until the next time that it happens. Why? Note that at this point you're looking at an issue that is going to be difficult to debug. It's a real server with lots of code. Customers are panicked, and the issue only occurs once every few days under vast amounts of load. There is no stack trace in the logs or smoking gun of any kind.

A

Here's the bug we found. First, get a connection from the pool, then save data using the connection, and lastly emit an event to notify any listeners that there is new data available. Where's the bug? It's not here. It's in the interaction of this code with other code.

A

Let's look at some more code from a file far far away.

A

The code in emit event also appears straightforward. Our pub/sub goes through the database, so we grab a connection and write an event to it. There is no obvious bug in this code either.

A

The problem is that, when emit event is called, we are already holding a connection from the connection pool. The connection held is not returned to the pool until it is dropped after emit event returns. In normal circumstances, this is okay. The second connection is acquired and then both are released. First, the connection in emit event is released.

A

Then the outer connection is released. But rarely, if 50 requests hit emit event simultaneously, they're already holding the limit of 50 connections, so the call to get connection within emit event never returns, because the connection pool is empty. Since emit event never returns, none of the outer connections are returned to the pool, and the whole request pipeline is deadlocked across all threads.

A

In order to understand this bug, you have to be aware of very specific architectural details. You have to know that connections are pooled. You have to know that events go through the database. You have to know that connections return to the pool on drop. You have to know everything that those disparate details imply, and you have to know, for any function that you are writing, that no caller of your function holds a connection if any call you might make would attempt to acquire one.

A

So you have to know all the code up and down the stack at all points and keep all this in your head on top of thinking about whatever problem you are actually trying to solve. Since interacting with a database happens in many places, there's a large surface area of code susceptible to this bug.

A

At this moment of discovery, experience and tribal knowledge are formed. The developer working on the bug says: aha, it is incorrect, in the general case, to hold on to a pooled resource and ask for another from the same pool. Doing so will always eventually deadlock.

A

They share this information with the team, write a blog post, add a comment to get connection, and go on a conquest to stamp out every instance of this bug they can find. But this bug is really subtle. It's easy to miss, even when you know what to look for, because you have to analyze regions of a directed graph of function calls.

A

You have to look at the scope of the liveness of the connection. In this case the scope is between get connection and past the end of emit event.

A

You have to see what function calls overlap with that scope then traverse the directed graph of calls until you visit every reachable node and verify that none of those nodes attempts to get a connection.

A

Even if you get it right, someone can make a minor edit in the middle of the graph in the future. This edit may completely change the set of reachable nodes.

A

They may be unaware that upstream a connection is held while downstream a new connection is obtained, because from where the edit is made they may see neither. So here's this bug: it's hard to detect, easy to create, is not fixable via architecture, and hurts users in production. It's time for compile-time social coordination.

A

What we want to do as leaders is to take our learnings about the resource starvation hazard inherent to pooled resources and encode those learnings in the type system, so that the compiler can then teach the rest of the team at scale and at the appropriate time.

A

The rule that we want to enforce to prevent this bug from coming up again is that each request may hold up to one connection, but never more during the scope of the request.

A

The solution is to create a token representing the permission to obtain a connection from the pool. We can ensure this permission is granted once per request by making the token constructor private. get connection is then amended to take a unique reference to the token. What that does is tie the unique loan of the token to the loan of the connection from the pool.

A

The connection is returned to the pool on drop, so only by returning the connection to the pool can we release our loan of the token. What is the effect of this change?

A

Returning to the call site of emit event, we now need to pass our token into get connection, and we need to pass our token into emit event for it to be able to obtain a connection. But now this won't compile, because the connection is still alive when we try to borrow the token the second time. The compiler forces us to add this line to return the connection to the pool before calling emit event.
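
The token technique can be sketched roughly as below. Every name here (the pool module, Token, Connection, with_request, and the statement strings) is an assumption for illustration, and the connection is a stand-in that does no I/O. The real code at Edge & Node is not shown in this transcript.

```rust
mod pool {
    use std::marker::PhantomData;

    /// Zero-sized permission to hold one connection. The private field
    /// means code outside this module cannot construct a token.
    pub struct Token(());

    /// Stand-in for a pooled connection. It keeps the unique borrow of the
    /// token alive for as long as the connection exists.
    pub struct Connection<'t> {
        _loan: PhantomData<&'t mut Token>,
    }

    impl Connection<'_> {
        /// Pretends to run a statement and reports what was run.
        pub fn execute(&mut self, statement: &str) -> String {
            format!("ok: {statement}")
        }
    }

    impl Drop for Connection<'_> {
        fn drop(&mut self) {
            // A real pool would take the connection back here.
        }
    }

    /// Borrows the token uniquely for the lifetime of the connection.
    pub fn get_connection(_token: &mut Token) -> Connection<'_> {
        Connection { _loan: PhantomData }
    }

    /// Hypothetical request entry point: the only place a token is created.
    pub fn with_request<R>(handler: impl FnOnce(&mut Token) -> R) -> R {
        let mut token = Token(());
        handler(&mut token)
    }
}

use pool::{get_connection, with_request, Token};

fn emit_event(token: &mut Token) -> String {
    let mut conn = get_connection(token);
    conn.execute("NOTIFY new_data")
}

fn save_and_notify(token: &mut Token) -> Vec<String> {
    let mut conn = get_connection(token);
    let saved = conn.execute("INSERT data");
    // Without this line the next call does not compile: the token would be
    // borrowed uniquely twice while `conn` is still waiting to be dropped.
    drop(conn);
    let notified = emit_event(token);
    vec![saved, notified]
}

fn main() {
    let results = with_request(save_and_notify);
    println!("{results:?}");
}
```

The Drop implementation matters in this sketch. Without it, the borrow held by the connection would end at its last use, and the second borrow of the token would be accepted.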

A

This fixes the bug not just here, but everywhere it might exist in our source, now and in the future.

A

That's it. A dozen lines of code to set up the rules, and the graph traversal search for inconsistency is now mechanically executed by the compiler, removing the error-prone and easily forgotten work from the developer. As a bonus, the token is removed at compile time. There is no heap allocation or any runtime cost at all.

A

All of these problems (locking entity data, modifying a list while iterating over it, and resource starvation) have two things in common. One is that they happen all the time.

A

The second is that they're all part of a broader class of problems fixed as a natural consequence of the borrow checker. In fact, many other social coordination problems, like memory management (if I pass a pointer to your library, whose responsibility is it to free that memory?), safe global variables, high-performance non-defensive code, security, even Wasm support, are all underpinned by the borrow checker.

A

The borrow checker is the beating heart of Rust, and is why I use Rust. You could say I use Rust because of its safety, performance, WebAssembly, productivity, excellent tooling, supportive community, its empowerment, etc.

A

All true, but none of these I consider differentiators. They are important, but they are literally the minimum bar. I have no use for any language where I cannot write programs with excellent runtime performance, or which does not compile to the platforms I care about, for example. Among the small set of languages that meets this minimum standard, I ask what sets them apart.

A

For me, it is lifetimes and the borrow checker. Lifetimes have a reputation for being hard to learn. These fears aren't entirely misguided. Learning Rust is hard.

A

It took me longer to learn Rust than any other language, but there's a narrative being perpetuated in our community about the borrow checker being difficult, that people fight the borrow checker. While true from a certain lens, I believe this narrative to be misguided and counterproductive.

A

Do you want to know what was harder than learning lifetimes? Learning the same lessons through 20 years of making preventable mistakes.

A

The whole result is refreshing because there is a single unifying concept that provides a benefit across almost all APIs. The accumulation of many small wins adds up. You want to know, in a sentence, what's so important here? It is that there is finally a language that both has a string concatenation method and I'm not afraid to use it.

A

At the risk of being hyperbolic, I believe that the borrow checker has rendered obsolete much of the knowledge that I've gained over the past 20 years, and I think we haven't even seen how far this experiment will go. Suppose that the future of programming can shed defensive architectural patterns, endless debugging, and passing on best practices and tribal knowledge manually, and learn to love one concept, that of lifetimes. In that case, we will see farther and accomplish more than our predecessors. If you're not yet using Rust, that is the trade-off that I present to you. The choice is now yours.

A

If you like the idea of solving social coordination at compile time, you may also enjoy solving social coordination through incentive systems, which is one of the things I work on at Edge & Node, while using Rust. If that sounds appealing to you, we are hiring Rust developers.

A

You can contact me at That3Percent@gmail.com for any questions about this talk or what we're building. Thanks.