From YouTube: Incrementally Building Incremental - Alec Holmes
Description
This talk walks through the development process of incremental xDS led by Alec Holmes and Joshua Rutherford inside the open source repository "envoyproxy/go-control-plane". It touches on differences between SOTW and Incremental xDS, implementation hurdles tackled while building out the new protocol, and design changes needed in the pre-existing codebase to build out Incremental. Alec lays out the remaining goals and discusses the next steps for the repository.
So I'm a core engineer at Grey Matter, and I've worked there since Grey Matter's inception. We have large customers; in fact, we operate in production in a large global enterprise, and in order to meet scale requirements, we found that incremental xDS is a necessary feature set for go-control-plane. So we set about to add it.
I've been contributing to go-control-plane here and there: small PRs that fixed minor issues and that I saw would mature the repo a little more. But incremental, as a whole protocol implementation, is a large feature set, and this is my biggest contribution so far in my career to an upstream repo, an open source upstream repo.
So this right here is a high-level timeline that lays out the go-control-plane implementation path while we were adding incremental. In March 2018, the initial snapshot cache of go-control-plane was released. This was the first tagged revision of go-control-plane, and it contained the simple snapshot cache, which I'm sure many of you are familiar with.
In October of 2019, the incremental protocol was released by the community. This was an upstream change to Envoy itself. The protocol was defined as a spec, but it wasn't implemented anywhere; I believe Envoy only had CDS functioning when that was released. Then in December of 2019, I swooped in to actually begin the implementation and write a proposal on implementing incremental inside of go-control-plane, as I had seen some traction in the java-control-plane.
But there was nothing there in go-control-plane. Then in July 2020, the mux and linear caches came out; those were targeted at things like better opaque resource handling and other conveniences to help state-of-the-world protocols, which is a step in the right direction. But we still believed that the incremental protocol was the right way forward for performance at scale, and as of this month, the PR for incremental is open, working, and ready for review.
So I have linked the initial incremental xDS implementation plan. This was our upfront planning document, in case anyone wants to read it. I'd like to thank the team at Lyft and go-control-plane for the feedback they provided and the help they gave me in working through the design, as well as thinking about edge cases, failure scenarios, and things like that.
The main features here: we really set out to achieve performance at scale. So we wanted to minimize data over the wire; we needed the management server to be a little smarter, to do things like state management; and of course we wanted to maintain backwards compatibility, so as not to break the code of users who have inherited go-control-plane as an upstream resource.
So the implementation itself consisted of a few things. I had to get my hands into the server and the cache for go-control-plane, the two main pillars of the code.
So again, the cache is just a list of snapshots per client, and when things are updated, it's the job of the server to understand who has subscribed to these resources, when they should receive changes, and also when clients unsubscribe. So that whole subscription functionality has also been enabled.
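The subscription bookkeeping described above can be sketched roughly like this. This is a hypothetical illustration in Go, not go-control-plane's actual types; the names (`subscription`, `resourceVersions`) are my own.

```go
package main

import "fmt"

// subscription is an illustrative sketch of the per-client state the
// server tracks for delta xDS: which resources the client wants, and
// the version of each that it last acknowledged.
type subscription struct {
	// resource name -> last ACKed version ("" means never sent).
	resourceVersions map[string]string
	// wildcard subscriptions receive every resource of the type.
	wildcard bool
}

func newSubscription(wildcard bool) *subscription {
	return &subscription{
		resourceVersions: map[string]string{},
		wildcard:         wildcard,
	}
}

// subscribe adds resources to the tracked set without clobbering
// versions the client has already acknowledged.
func (s *subscription) subscribe(names ...string) {
	for _, n := range names {
		if _, ok := s.resourceVersions[n]; !ok {
			s.resourceVersions[n] = ""
		}
	}
}

// unsubscribe drops resources so the client stops receiving updates.
func (s *subscription) unsubscribe(names ...string) {
	for _, n := range names {
		delete(s.resourceVersions, n)
	}
}

func main() {
	sub := newSubscription(false)
	sub.subscribe("clusterA", "clusterB")
	sub.unsubscribe("clusterA")
	fmt.Println(len(sub.resourceVersions)) // one remaining subscription
}
```

The `subscribe`/`unsubscribe` pair mirrors the `resource_names_subscribe` and `resource_names_unsubscribe` fields of a `DeltaDiscoveryRequest`, which is how a delta client edits its subscription set in place rather than restating it on every request.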
I had to come up with a clever versioning system that actually targets the individual resources themselves. Before, go-control-plane just used the global request/response version carried in the DiscoveryRequest and DiscoveryResponse objects, and delta doesn't really have that anymore; it just has system_version_info, a simple debugging field, but that's not really a valid way of detecting change at a granular level.
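One way to get a version per resource, and roughly the idea here, is to derive it from the serialized resource itself, so that a resource's version changes exactly when its content does. This sketch hashes the bytes with SHA-256; go-control-plane's actual scheme may differ, and `resourceVersion` is an illustrative name.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// resourceVersion derives a per-resource version from the resource's
// serialized bytes: deterministic, and it changes whenever the
// underlying content changes.
func resourceVersion(serialized []byte) string {
	sum := sha256.Sum256(serialized)
	return hex.EncodeToString(sum[:8]) // short, stable identifier
}

func main() {
	v1 := resourceVersion([]byte("cluster-config-a"))
	v2 := resourceVersion([]byte("cluster-config-b"))
	fmt.Println(v1 == resourceVersion([]byte("cluster-config-a"))) // deterministic
	fmt.Println(v1 != v2)                                          // content-sensitive
}
```

A content hash like this sidesteps any need for a global counter: the server can compare the hash a client last ACKed against the hash of the cached resource to decide whether that one resource needs resending.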
So the implementation itself was fairly straightforward. The only difficult part was the actual diffing, and creating a fast way to do that, because again we're targeting performance at scale. We don't want to hold back the server with a slow diffing algorithm; we needed it to be quick. With the map implementation we chose, we were able to keep the invasiveness to the existing external API pretty minimal.
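With per-resource versions kept in maps, the diff reduces to two linear passes: anything whose current version differs from what the client last ACKed is changed, and anything the client knew about that is no longer present is removed. A minimal sketch, with illustrative names rather than go-control-plane's actual API:

```go
package main

import "fmt"

// diff compares the versions a client last ACKed against the versions
// currently in the cache. It returns the names that must be (re)sent
// and the names that should be reported as removed.
func diff(acked, current map[string]string) (changed, removed []string) {
	for name, version := range current {
		// covers both brand-new resources (acked[name] == "")
		// and resources whose version moved.
		if acked[name] != version {
			changed = append(changed, name)
		}
	}
	for name := range acked {
		if _, ok := current[name]; !ok {
			removed = append(removed, name)
		}
	}
	return changed, removed
}

func main() {
	acked := map[string]string{"a": "v1", "b": "v1"}
	current := map[string]string{"a": "v2", "c": "v1"}
	changed, removed := diff(acked, current)
	fmt.Println(len(changed), len(removed)) // a and c changed; b removed
}
```

Both passes are O(n) in the number of resources, which is what keeps a diff-per-update affordable at scale compared to resending the full state of the world.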
All you have to do to inherit this change is implement these callbacks, and you're pretty much good to go there. They can be implemented in a manner similar to what you've done with state of the world, and with this new implementation you don't actually have to change the way you set snapshots or create watches. There is a new CreateDeltaWatch function defined in the snapshot cache interface, but that isn't needed unless you're actually implementing your own version of the server.
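To make the shape of that interface change concrete, here is a hypothetical sketch of a cache exposing both watch entry points side by side. The placeholder types stand in for the real DiscoveryRequest/DeltaDiscoveryRequest protos, and the signatures are illustrative rather than go-control-plane's actual ones.

```go
package main

import "fmt"

// Placeholder types standing in for the real xDS protos.
type Request struct{ TypeURL string }
type DeltaRequest struct{ TypeURL string }
type Response struct{ Version string }
type DeltaResponse struct{ SystemVersionInfo string }

// Cache sketches how the cache interface grows a delta entry point
// next to the existing state-of-the-world watch.
type Cache interface {
	// Existing state-of-the-world watch: unchanged by the delta work.
	CreateWatch(req Request) (ch chan Response, cancel func())
	// New entry point: only custom server implementations call this.
	CreateDeltaWatch(req DeltaRequest) (ch chan DeltaResponse, cancel func())
}

// snapshotCache is a toy implementation showing both watches coexisting
// over the same underlying resource pool.
type snapshotCache struct{}

func (c *snapshotCache) CreateWatch(req Request) (chan Response, func()) {
	ch := make(chan Response, 1)
	return ch, func() { close(ch) }
}

func (c *snapshotCache) CreateDeltaWatch(req DeltaRequest) (chan DeltaResponse, func()) {
	ch := make(chan DeltaResponse, 1)
	return ch, func() { close(ch) }
}

func main() {
	var cache Cache = &snapshotCache{}
	_, cancel := cache.CreateDeltaWatch(DeltaRequest{TypeURL: "example-type"})
	cancel()
	fmt.Println("delta watch created and cancelled")
}
```

The point of keeping the delta watch as a separate method is exactly what's described above: existing users who only set snapshots and rely on the built-in server never have to touch it.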
We had to come up with something similar and compartmentalized, because you could have scenarios where certain clients are in state-of-the-world mode but others are in delta mode. So again, they're sharing the same resource pool but receiving items differently.
So I wanted to talk about some challenges in implementing this code and working in the repo. I did spend quite a lot of time familiarizing myself with the code base. I had to reverse engineer a lot of the relationships between the cache and the server because, as I said before, I had just been doing minor contributions; I didn't really fully understand what the code was doing. In doing so, I actually went back and contributed a lot of documentation and some resources for newcomers to read, to hopefully better understand the code itself.
That way, they don't have to share the same pain that I did when implementing this large feature set. So again, I'm not going to touch on this much, but the versioning at the resource level was another challenge, because we had to develop a whole new algorithm just to do that, and we couldn't use a lot of the pre-existing code because of the differing discovery objects.
And the last thing I wanted to talk about was the upstream changes while building incremental. This is a fast-growing repo; it's maturing quickly, and I'm really happy about that, but it meant I spent a long time so far out in isolation, on my own.
So here is the PR: everything's passing, it's working, good to go. It is ready for review, and I just want to thank all those who have already reviewed it and provided some feedback. I know it's large, but I really do appreciate your efforts; it's really welcomed, and thank you again. So go check the PR out if you're interested. I would love to have y'all's feedback, so feel free to comment, or reach out to me specifically if you have any questions on the code.
So here is the integration test running. You'll notice that it has a lot of log statements with the hashed versions. If you actually want to check this out more, I provide instructions to run it; feel free to go look at it and let me know what you guys think. So, what's next? I'm currently working on implementing ADS for incremental. All of the xDS services are complete, but ADS does need to be completed.
I know there are some more features that I need to build for that to actually be done, and I'm pretty sure that's probably going to be the most used implementation of incremental. The mux and linear cache implementations need to be done as well; I need to go back and do those because, again, those came out as I was building this, so I didn't have time to also implement incremental for those, not just the simple cache.
I need to think about failure scenarios. I actually want to test this and see how it does in production. Well, not just production, but I want to see it in a real deployment; I haven't done that yet. I also want to performance benchmark this: I want to see how it compares to state of the world, and what kind of performance gains we're looking at. I really want to put the protocol through the wringer in this repo.
But again, thank you all for tuning in to my talk. Go check out the PR; I have a list of resources for the talk in my GitHub, so feel free to check those out. That should include the slides and all the screenshots and things like that. Thank you again; I appreciate all of you who've helped out. Oh, and I'd also like to mention that I am in the Envoy Slack. Feel free to message me personally, or reach out to me in the xDS or control plane channels.