Description
Could you write a Rust program that never, ever has to stop? Not even for configuration changes or binary upgrades? One that will handle the traffic for thousands of applications on many more virtual machines that get up and down at any time? The Sōzu HTTP reverse proxy is there to solve that problem. This talk will cover various parts of its architecture, from its streaming HTTP parser built with nom, its single-threaded worker handling events with mio, and all the associated tooling to command and control the proxy.
Geoffroy Couprie
http://geoffroycouprie.com/
https://github.com/Geal
https://users.rust-lang.org/users/geal/activity
Yeah, and the use of nom is great, so I'm here to talk about Sōzu. It's a reverse HTTP proxy we built at Clever Cloud. (Let me just check that I have something that works... yep, okay, all right.) So, how could we do a really good application platform, where you just git push your application and we handle everything, do monitoring, upgrades, whatever, and it handles security and routing? Mostly, the reverse proxy has a big part to play in there.
So this is the thing that stands between the internet and your application. Like, you could have ten virtual machines for your application; you point the DNS name to one of the few servers we have, and those will just dispatch the requests to the backend machines. And so we built one, and what I said when I presented it first is: it should never, ever stop. Because we had one problem there: we use HAProxy, which is a really great reverse proxy.
So we started to do a proxy from scratch, and we started writing it in Rust because, hey, the hype was striking again. It's good when you have, like, a lot of applications, with backends on virtual machines coming up and down, and up and down: here's a new application, here's a new frontend, here's a new certificate, and it should handle all that, reloaded at runtime. And it's quite good as a front-end server for microservices as well. And so, why is changing configuration at runtime hard?
Imagine you have HAProxy, this venerable, fast, very stable reverse proxy. Okay: you change the configuration, and you add new backend servers; like, you replace a virtual machine with a new one on a different IP. So you will write a new configuration file and tell HAProxy: okay, so now this is the new configuration, reload it. It's quite snappy, and it tries to not lose that many connections when it's reloading.
But the thing is, it loses a lot of information. Like: when can I actually remove a backend server? Because I have removed it from the configuration, but there might actually be connections still going on, and I lose that kind of information. So we wanted to have something that's quite dynamic and can handle that kind of very specific case. There's also another issue: we have to upgrade the code at some point, and this is actually quite hard to do without losing any connections.
And the last thing: Rust was a very good candidate for this task. We want to be able to predict memory usage and, as we know, garbage collection can make prediction a bit hard. And one of the other things we wanted is to make it very easy to extend. And so we came up with this name, Sōzu. It's this kind of Japanese fountain: a Japanese garden is going to be very, very quiet, and there's only the sound of the bamboo knocking on the rock disturbing it sometimes. It's a very good analogy: it's peaceful, like a proxy in production.
A
So
it's
made
with
norm,
of
course,
like
building
a
whole
HTTP
parser
in
Nam
is
an
interesting
exercise.
It
uses
mu,
not
not
Tokyo,
and
we
ask
for
HTTP
190s
and
not
not
HCP
to
yet
so
a
very
quickly
numb.
We
talked
about
it
a
few
time,
so
it's
possible
community
of
its
macros
lots
of
lots
of
macros,
so
Bassel
community
is
just
a
way
to
combine
pieces
of
code
and,
like
you
have
code
that
will
look
like
a
grammar,
but
it's
just
code.
A
It's
not
something
that
will
generate
something
else
as
to
what
and
inside
nan
passes,
are
very,
very
small.
It's
like
hard
to
read,
you
have
any
function.
Is
that
there's
an
input,
type,
a
node
to
type
an
error
type
and
when
what
the
parser
will
return
is
ever
result
value
with
the
remaining
part
of
the
input
or
I
know
me
with
poor,
possibly
a
pointer
to
that
of
the
input
or
I.
Don't
have
enough
the
in
up
data
to
the
side,
which
is
very
important
so,
as
I
said,
macros,
it's
very
small.
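To make that shape concrete, here is a simplified stand-in, not nom's real API (nom's actual result type is generic and richer); it only illustrates a parser function with remaining input, an error case, and an "incomplete" case:

```rust
// A simplified stand-in for nom's result type: a parser takes an
// input slice and returns the remaining input plus a value, an
// error, or a signal that more input is needed to decide.
#[derive(Debug, PartialEq)]
enum ParseResult<'a, O> {
    Done(&'a [u8], O),   // remaining input, parsed value
    Error(&'static str), // the data seen so far cannot match
    Incomplete(usize),   // need at least this many more bytes
}

// A tiny parser: an uppercase HTTP method token followed by a space.
fn method<'a>(input: &'a [u8]) -> ParseResult<'a, &'a [u8]> {
    match input.iter().position(|&b| b == b' ') {
        Some(pos) if input[..pos].iter().all(u8::is_ascii_uppercase) => {
            ParseResult::Done(&input[pos + 1..], &input[..pos])
        }
        Some(_) => ParseResult::Error("invalid method"),
        // No space yet: it may simply not have arrived.
        None => ParseResult::Incomplete(1),
    }
}
```

Feeding `b"GET /index"` yields `Done` with the method and the rest of the input; feeding `b"GE"` yields `Incomplete`, not an error.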
So, as I said, we have this Incomplete thing. It's useful for streaming, because you want to parse things and you don't know if you will get the whole data. Like, I will parse HTTP, but when I read from the socket, I only get what has arrived so far; it may stop right inside a header, and I have to make sure that doesn't count as an error, because I know more data can be coming. So I have a way to tell the parser: okay, this part is done; no error.
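A minimal illustration of that streaming behavior (my own sketch, not Sōzu's parser): an incomplete answer just means "keep buffering and retry":

```rust
// Why "incomplete" is not an error when parsing a stream: each
// read() from a socket can stop anywhere, even in the middle of a
// header, so the parser must distinguish "malformed input" from
// "keep the bytes you have and wait for more".
fn headers_complete(buf: &[u8]) -> Option<usize> {
    // The HTTP header section ends with an empty line: "\r\n\r\n".
    buf.windows(4).position(|w| w == b"\r\n\r\n").map(|p| p + 4)
}

// Accumulate a newly read chunk, then retry the parse.
fn feed(buf: &mut Vec<u8>, chunk: &[u8]) -> Option<usize> {
    buf.extend_from_slice(chunk);
    headers_complete(buf)
}
```

The first chunk stopping mid-header returns `None` ("wait"), and the parse succeeds once the rest arrives.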
I'm still waiting for my data. And the way it happens in nom is: I made this forwarding abstraction, because basically I don't want to take the request or the response and build a whole object in memory that I will just write back to the network. I just want to look at the data coming in and say: okay, from here to there, I give it to the backend; from here to there, delete; and then I insert a buffer here, whatever. So I have behavior that is just based on that stream of buffered data.
So, whenever the parser has recognized something, it will say: okay, advance by this many bytes; and the proxy will consume this many bytes in the buffer, advance in the state machine, and recognize: okay, now we got a request line; now we got the Host; now we got a Content-Length; and so on. Having a parser that works that way allows us to do very fast streaming parsing.
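The consume-and-advance idea can be sketched like this (illustrative code, not Sōzu's actual state machine, which tracks many more states):

```rust
// Sketch of "consume and advance": the parser never builds a full
// request object; it reports how many bytes were recognized and
// moves a state machine forward, one CRLF-terminated line at a time.
#[derive(Debug, PartialEq, Clone, Copy)]
enum State {
    RequestLine,
    Headers,
    Done,
}

// Try to consume one line; on success, return the next state and
// how many bytes of the buffer the caller may now discard.
fn step(state: State, buf: &[u8]) -> Option<(State, usize)> {
    let line_end = buf.windows(2).position(|w| w == b"\r\n")? + 2;
    let next = match state {
        State::RequestLine => State::Headers,
        // An empty line ends the header section.
        State::Headers if line_end == 2 => State::Done,
        State::Headers => State::Headers,
        State::Done => State::Done,
    };
    Some((next, line_end))
}
```

Each successful `step` tells the caller exactly how many buffered bytes can be forwarded or dropped; `None` means "incomplete, wait for more".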
How can we avoid losing connections? We can lose connections for different reasons: if the process crashes, we lose all of the state; so how can we avoid restarting the process, since that is one of the reasons we can lose connections? That's why I chose to have a multi-process architecture in this Rust proxy. First, I know that things will panic, or crash; like, right now I have OpenSSL inside that thing, so I know there's still a big chunk of C that can do a lot of damage in there.
So you might ask: why would I try to isolate stuff in sandboxes and make multiple processes to recover from crashes? Like, maybe I don't trust Rust enough? I actually trust Rust a lot here, because I'm really shipping this; but it's good to rely on different layers of protection, like the UNIX philosophy of processes, and the system really helps us there. So, how can we avoid losing connections when we upgrade?
You have the code that's running, and you want to replace the processes and make sure you won't lose what's currently running. So you just spawn new workers, and then tell the old ones: okay, stop! Stop accepting new connections; you will just terminate. They will wait for the current connections to terminate, and then they will close; and the master process will fork itself and keep control over the workers, and then it should be all right.
Well, in fact it isn't all right, because a listening socket has its own queue of incoming connections, and if you tell a worker "stop accepting new connections", there may still be some in the queue. So you actually need to move the listening socket from one worker to the next. But with this architecture we can have something that we just upgrade; maybe you've seen this nice GIF of people changing the tires on a car that's still moving, driving tilted up on its side wheels: that's what we're doing here. Cool.
So we have a list of workers; basically we're just cloning everything, it's the basic fork-exec, with a small wait or whatever. We use the fork and exec functions from the nix crate, which is quite good. And the thing is, we want the different workers and the master all talking to each other, but we want a typed interface, and what we have there is the UnixStream interface, which is just a UNIX socket: it's just bytes, I can send bytes and receive bytes.
It's not typed, so I built a channel abstraction over it. Basically, the protocol is JSON messages separated by a null character. I wanted to do a nice binary format, but a colleague said: no, people will not build anything with it if you push binary formats on them again. So: start with JSON; I know I'll be right in the end. And so, something that's quite interesting is what happens when I start a worker.
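The framing itself is simple enough to sketch without a JSON library (illustrative code; the real messages are serialized structs):

```rust
// Framing on the command channel: each message is a JSON document,
// and messages are separated by a null byte. Hand-rolled here to
// show only the framing, not the serialization.
const DELIMITER: u8 = 0;

fn encode(messages: &[&str]) -> Vec<u8> {
    let mut out = Vec::new();
    for m in messages {
        out.extend_from_slice(m.as_bytes());
        out.push(DELIMITER); // terminate each message
    }
    out
}

fn decode(buf: &[u8]) -> Vec<String> {
    buf.split(|&b| b == DELIMITER)
        .filter(|s| !s.is_empty()) // drop the trailing empty slice
        .map(|s| String::from_utf8_lossy(s).into_owned())
        .collect()
}
```

A nice property of this framing is that a null byte can never appear inside a JSON document, so no escaping is needed.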
I will send the configuration as a message, but then I will switch to another protocol, and I don't need to make something completely generic for that. So (this is where maybe I should have made other choices) I put the channel in blocking mode, I write the message, then I put it in non-blocking mode, and then I do an into(), and I just change the type of the messages I can send and receive. So I'm using the same channel, the same buffers under the hood, but I'm just saying: okay, now, after I've written the configuration, this channel will just handle other types of messages. And on the worker side, I have the same thing. When I exec a new process, I tell the brand new process: okay, here's the file descriptor for the channel; so it will start and know where we must talk.
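That type switch over an unchanged transport can be sketched with a phantom type parameter. This is a simplified illustration: the `ConfigMessage`/`OrderMessage` names and the `Vec<u8>` transport are stand-ins, not Sōzu's types.

```rust
use std::marker::PhantomData;

// A typed channel over an untyped transport: the bytes and buffers
// are reused, but the types we may send are changed by converting
// the wrapper once the configuration phase is over.
struct Channel<Tx, Rx> {
    buffer: Vec<u8>, // stands in for the socket and its buffers
    _marker: PhantomData<(Tx, Rx)>,
}

struct ConfigMessage(String);
struct OrderMessage(String);

impl<Tx, Rx> Channel<Tx, Rx> {
    // Keep the underlying transport, change the message types.
    fn into_typed<Tx2, Rx2>(self) -> Channel<Tx2, Rx2> {
        Channel { buffer: self.buffer, _marker: PhantomData }
    }
    fn contents(&self) -> &[u8] {
        &self.buffer
    }
}

impl Channel<ConfigMessage, ConfigMessage> {
    fn new() -> Self {
        Channel { buffer: Vec::new(), _marker: PhantomData }
    }
    fn send_config(&mut self, msg: ConfigMessage) {
        self.buffer.extend_from_slice(msg.0.as_bytes());
    }
}

impl Channel<OrderMessage, OrderMessage> {
    fn send_order(&mut self, msg: OrderMessage) {
        self.buffer.extend_from_slice(msg.0.as_bytes());
    }
}
```

After `into_typed()`, the compiler refuses `send_config` on the converted channel, while the underlying buffer is untouched.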
So, when you fork a process, all the file descriptors are shared. So if you have a UNIX socket pair, you take both ends: we have the server part and the client part. In the parent we keep the server one, and in the child we keep the client one, and it just has to know at which number it is. Once it's starting, it takes this file descriptor, makes a UnixStream from it, and receives the configuration in blocking mode, because we don't want to receive just a small part of it.
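The descriptor trick can be sketched with the standard library alone. This is an illustration, not Sōzu's code: both ends live in one process here so the sketch stays runnable, where the real proxy forks and execs in between, and the payload is a made-up placeholder.

```rust
use std::io::{Read, Write};
use std::os::unix::io::{FromRawFd, IntoRawFd};
use std::os::unix::net::UnixStream;

// Parent and child each keep one end of a socketpair; after exec the
// child only knows the descriptor *number*, so it rebuilds a typed
// UnixStream from it.
fn handshake() -> std::io::Result<Vec<u8>> {
    let (mut parent_end, child_end) = UnixStream::pair()?;

    // Simulate exec: forget the typed wrapper, keep only the number.
    let fd: i32 = child_end.into_raw_fd();

    // "Child": recover a typed stream from the inherited descriptor.
    // SAFETY: fd is a valid socket descriptor that we own.
    let mut child = unsafe { UnixStream::from_raw_fd(fd) };

    // "Parent": send the serialized configuration first, in blocking
    // mode, so the child reads all of it before doing anything else.
    parent_end.write_all(b"{\"config\":\"...\"}")?;
    drop(parent_end); // close, so the child sees end-of-stream

    let mut config = Vec::new();
    child.read_to_end(&mut config)?;
    Ok(config)
}
```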
We want to get the whole configuration, and then it starts, having a complete channel to talk with the master process. So now, can we change the configuration without a restart? There's another system for that: the master process exposes a UNIX socket, for when you want to talk to the proxy.
It's not listening on localhost, because anybody could be talking on localhost; it's a file somewhere on the file system, and there are access rights to write to or read from that thing. And this is where you will just talk with your CLI and say: okay, add a new frontend; add a new virtual machine; remove this certificate, because we can change the TLS configuration on the fly. And here is a small example where we want to add a new frontend for this application.
Why do we have a "processing" answer? It's a configuration change, it should happen, like, right now. But let's say we're removing a backend server: okay, remove the backend server, and then the answer is: nope, I'm processing, I still have this many requests lingering. And we send out the OK when it's done, when all the connections have ended.
Because I started this project a year and a half ago, and Tokio did not exist; that's basically why. There are things that I could not do then that I would do with Tokio now, and we can talk about that afterwards. So, mio. The idea is very, very simple: you give it a socket, and you say: I want to know when I can read from or write to this socket.
That's it. And so you have this Poll object, and you register your socket with this Poll object, and you call the poll method, and it will say: okay, I have these new events, like: this socket is readable, and this one is writable, and this one is closed. And you handle the events as they come up; that is, you have your state machine that's meant to absorb them: okay, now this thing is readable, what do I do? Now this thing is writable, what do I do?
And the hard part is knowing how the events are sent. Like, maybe you've seen the difference between edge-triggered events and level-triggered events, and one-shot. Edge-triggered and level-triggered: that's something that comes from signal processing, but basically, the way it works in epoll is: when it's level-triggered and the socket is readable, you get the readable event over and over and over and over, until you've read all of the data from the socket and there's nothing left.
The way it's done with a socket is a sequence: you read, you read, you read, and then you get an error that says: hey, if you try to read again, I would block. And that's how it's done. Edge-triggered is: okay, this socket is readable, and you won't get the event again; say you register that this socket was readable, and you do nothing.
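In both modes you end up reading until the kernel answers "would block"; with edge-triggered events, draining completely is mandatory, since you will not be reminded. A std-only sketch of such a drain loop:

```rust
use std::io::{ErrorKind, Read};

// Drain a non-blocking socket until the kernel says "if you read
// again, I would block". Returns Ok(true) if the peer closed the
// connection, Ok(false) if the socket is merely empty for now.
fn drain<R: Read>(sock: &mut R, out: &mut Vec<u8>) -> std::io::Result<bool> {
    let mut chunk = [0u8; 4096];
    loop {
        match sock.read(&mut chunk) {
            Ok(0) => return Ok(true), // end of stream
            Ok(n) => out.extend_from_slice(&chunk[..n]),
            // Not a failure: there is just nothing more to read yet.
            Err(ref e) if e.kind() == ErrorKind::WouldBlock => {
                return Ok(false);
            }
            Err(e) => return Err(e),
        }
    }
}
```

The key design point is that `WouldBlock` is handled as a normal outcome, not an error, mirroring the Incomplete case in the parser.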
A
I
will
not
tell
you
again
that
it's
readable,
it's
your
job
and
so
there's
a
I
won't
say
easy,
but
there's
a
good
way
to
animate
the
current
thing,
with
our
two
structures
like
basic
e
readiness
and
interest,
and
we
store
for
circuit.
The
interest
like
I
want
to
be
able
to
read
from
the
socket
I
want
to
be
able
to
know
if
I
get
the
hub
like
a
it's.
It's
closed.
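A sketch of those two structures (illustrative; mio provides its own types for readiness, these plain bitmasks just show the idea):

```rust
// Two per-socket bitmasks: `interest` is what the session's state
// machine wants to be told about, `readiness` is what the event
// loop has observed so far; we only act on their intersection.
const READABLE: u8 = 0b001;
const WRITABLE: u8 = 0b010;
const HUP: u8 = 0b100; // peer hung up / socket closed

#[derive(Default)]
struct SocketState {
    interest: u8,
    readiness: u8,
}

impl SocketState {
    // Record events reported by the poll loop for this socket.
    fn on_event(&mut self, events: u8) {
        self.readiness |= events;
    }
    // The events we should actually handle right now.
    fn actionable(&self) -> u8 {
        self.interest & self.readiness
    }
}
```

With edge-triggered events this is what lets you remember "this socket was readable" across poll iterations, even if you could not act on it immediately.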
I want to write to the backend socket; and once the whole request is done, it's the reverse: I want to read from the backend socket and then write to the front socket. And this is where most of the complexity is. The idea of using mio is very simple, but you have to handle the complexity yourself. Tokio hides and handles the complexity for you, until there's a leaky abstraction somewhere and you have to do things yourself; but Tokio is quite good. I have a lot of fun developing this.
As you see, with mio, with the sockets, with all the things I do, it's a good way to get interested in how the networking code works. And we made it so that it should be easy to build tools around it: there's this UNIX socket exposing the configuration language, and you can make anything you want with that. Like, you could make a small dashboard, your own website to control the proxy. Something I was dreaming of: you could build a Let's Encrypt tool around the proxy.
It would request the certificate, and then send the certificate and key to the proxy; and you can build a lot of tools like that to automate a complete infrastructure. The way it works at Clever Cloud: we have a RabbitMQ queue that delivers the events for the configuration changes, and we have a small tool that sits beside the proxy, gets the changes, and sends them to the proxy.
There are lots of easy issues, so feel free to contribute; there are lots of things to do. Something I want to do, like, very soon is terminating the TLS at the backend server. Right now all the TLS is done on the server with OpenSSL, and I don't want to use OpenSSL, but I don't have the time to switch to rustls; please help me. But we could actually terminate the TLS connection to the backend server.
Basically, when I've read the headers from HTTP and then there's a body, I don't need to read the body; I just want to say: okay, this much data should go from this socket to this socket. Why do I have to read it into a buffer in my proxy and then write it out from that buffer again? And so there's the splice syscall, which says:
Okay: you read into a kernel buffer, and then you write from that kernel buffer to the destination, and the data does not go through user space; you can get quite fast with this. And there's also HTTP/2 support to do, so we need help for that.
So, it's called Sōzu, it's quite good, it's there on GitHub, and that's it. But since I have ten minutes, I can either go through all the slides I had left out because I thought it would be too long, or I can just answer questions.
So that was the idea: whenever you read from a socket, it gets to a kernel buffer first, like from the network card to the buffer, and then you read it into user space, and then you write it somewhere else. And the idea is to just move it from one kernel buffer to the next, and it's done; the data never has to visit your process's memory. That's something I didn't mention, but I had quite some fun with it.
I said garbage collection can be an issue with that kind of project, so I want to have predictable memory usage. And at first I saw that if I tried to allocate the buffers every time a request was coming in, it was quite slow and quite unpredictable in the response time. So I just started to pre-allocate: like, I come onto the machine and say, okay, so I take all the RAM; it's mine. Well, most of the time your proxy will be alone on its machine, so I take all the RAM, and then I put a bound on the number of connections depending on the size of the buffers I give. So: I have eight gigs, I have this size for a buffer, and so I do a division and decide, okay, this many concurrent requests.
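The division described above, as a sketch (the numbers are illustrative, not Sōzu's defaults):

```rust
// Sizing arithmetic: divide the RAM dedicated to the proxy by the
// buffer cost of one session (front buffer + back buffer) to get a
// hard bound on concurrent sessions; past that bound, new
// connections are rejected instead of slowing everyone down.
fn max_sessions(total_ram: u64, buffer_size: u64, buffers_per_session: u64) -> u64 {
    total_ram / (buffer_size * buffers_per_session)
}
```

For example, with 8 GiB of RAM and two 16 KiB buffers per session, this bounds the proxy at 262,144 concurrent sessions.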
Why do I do that? Because when I don't have any more resources to give to new connections or new requests, I should just reject them. Because if I try to handle more than I can with the system resources, I would just start to be slow, and I would slow every other request running concurrently; so I would just start dropping stuff instead of getting slow.
There are a lot of optimizations to do. Like, at first I did a very naive solution: there was a buffer for communication from front to back, and there was one for communication from back to front. At some point I thought: maybe I can use the same buffer for front-to-back and back-to-front, since they may not be used at the same time. Let's try; let's put lots of assertions everywhere. It works, so, yeah.
What else do I have... yeah: every worker is single-threaded. Why? Because each has its own memory, and they're not sharing anything with any other worker. When you send the configuration to the master process, it sends the configuration to all the workers; they do not share anything. There's no mutex or that kind of thing, no atomics between threads. And since I have different workers, I can just put each of them on its own core, and when I have each of them on its own core, I get great data locality.
A
That
means
that,
if,
if
I
had
a
process
that
it
will
just
jump
from
one
car
to
the
next
all
runs,
the
data
does
to
be
removed
from
the
cache
and
move
from
one
place
to
the
next.
If
they
stay
on
the
same
car,
all
of
the
input
on
data
and
code
is
always
in
the
casualties
view,
and
so
we
can
get
great
performance
benefit
and
there
was
other
than
nothing
ever
fingers
increasing
with
that
is.
There
are
some
kinds
of
network
card.
A
They
have
packet
queues
for
each
car,
so,
instead
of
having
one
specific
part
of
the
memory
where
all
of
the
calls
go
and
take
the
packets,
each
core
has
its
own
part
of
the
memory
to
get
the
packets,
so
they
don't
have
to
share
that
part
ever
and
again,
a
grid
pattern
of
gainin.
That
was
one
of
the
ID
I
wanted
in
the
new
proxy
at
first.