Description
Diego 2019 Project Update - Sunjay Bhatia & Amin Jamali, Pivotal
In this talk, the Diego team will survey how the Diego components interact inside of CF to run application instances and tasks and then dive into how those interactions have evolved over the past year to improve system stability, security, and scale. This talk will also review how recent work in Diego supports powerful platform capabilities such as first-class support for app developer specified sidecars, improved reliability of the CF routing tiers, and other features that the core Cloud Foundry teams are working on or considering for development today.
For more info: https://www.cloudfoundry.org/
A
Alright, so: speed. First, we expect to move quickly, deliver features to our users, and reduce feedback loops. That's how we think of it as part of the Diego team. We want to execute on our backlog quickly, so we're continuously providing value to our users, and so they can in turn move quickly.
A
Some examples: OCI mode, a collaboration with the Garden team, in order to increase the speed at which you can scale your applications and create containers. Instead of having your application droplet, the tar file, copied into a container, there's a feature where you can make that droplet a container image layer.
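To make the droplet-as-a-layer idea a bit more concrete, here is a minimal Go sketch of the general shape: instead of extracting the droplet tarball into the container filesystem, it is described by digest and size as an image layer. The struct and function names are illustrative, not the actual Garden or Diego APIs.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// LayerDescriptor is an illustrative stand-in for an OCI image layer
// descriptor: a media type, a content digest, and a size in bytes.
type LayerDescriptor struct {
	MediaType string
	Digest    string
	Size      int64
}

// describeDropletAsLayer treats an existing droplet tarball as a layer
// blob instead of something to copy into the container root filesystem.
func describeDropletAsLayer(dropletPath string) (LayerDescriptor, error) {
	f, err := os.Open(dropletPath)
	if err != nil {
		return LayerDescriptor{}, err
	}
	defer f.Close()

	h := sha256.New()
	size, err := io.Copy(h, f)
	if err != nil {
		return LayerDescriptor{}, err
	}

	return LayerDescriptor{
		MediaType: "application/vnd.oci.image.layer.v1.tar+gzip",
		Digest:    fmt.Sprintf("sha256:%x", h.Sum(nil)),
		Size:      size,
	}, nil
}

func main() {
	desc, err := describeDropletAsLayer("droplet.tar.gz")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", desc)
}
```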
We've also worked with the Garden team to calculate more accurate CPU metrics, and with the Garden Windows team on route integrity for Windows, so that communication from the Gorouter to your applications is over TLS; we're bringing that feature as a beta, experimental feature for users on Windows. There's also more parity for Windows with CF SSH, with external network plugins, so you have the same experience across Linux and Windows. We've also worked with the Buildpacks and CAPI teams on bringing sidecar support to the platform, and with the CAPI and observability teams on tagging logs and metrics for your applications with more information, so you have more to work with as a downstream consumer.
B
Next up we have stability and scalability. As a team, we expect Diego components to keep running and to keep your apps running. We've come to expect Diego to perform well in larger installations running thousands of apps, and we should continue to expect that from Diego and keep refining and improving on this experience. To be a bit more specific, issues that happen at scale can often be attributed to environmental faults, which the system should continue to become more resilient to.
B
We worked on various features over the past year that helped solidify and improve the Diego components as far as stability and operating at scale go. To name a few things, we'll start with NATS. NATS failures, where messages are somehow missed, are common in installations. Specifically, there was an issue where deleted apps remained routable in that scenario, so we've lowered the chance of running into it by keeping a cache of route removals.
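As a rough illustration of that fix (not the actual route-emitter code), a small cache can remember recent unregistrations so they can be re-sent for a while; all names here are illustrative.

```go
package main

import (
	"sync"
	"time"
)

// removalCache is an illustrative sketch of remembering recent route
// unregistrations so they can be re-sent, in case the original NATS
// message was lost and a deleted app would otherwise stay routable.
type removalCache struct {
	mu       sync.Mutex
	ttl      time.Duration
	removals map[string]time.Time // route -> when the removal was recorded
}

func newRemovalCache(ttl time.Duration) *removalCache {
	return &removalCache{ttl: ttl, removals: make(map[string]time.Time)}
}

// Record notes that a route was unregistered.
func (c *removalCache) Record(route string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.removals[route] = time.Now()
}

// Pending returns removals still within the TTL that should be re-emitted,
// dropping entries that have expired.
func (c *removalCache) Pending() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	var out []string
	for route, at := range c.removals {
		if time.Since(at) > c.ttl {
			delete(c.removals, route)
			continue
		}
		out = append(out, route)
	}
	return out
}

func main() {
	cache := newRemovalCache(30 * time.Second)
	cache.Record("deleted-app.example.com")
	for _, route := range cache.Pending() {
		_ = route // re-publish the unregistration message here
	}
}
```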
B
There has also been an issue in some of the larger installations where Locket held on to some of its SQL connections, and we've improved on that as far as connection lifetime goes, without actually impacting performance. There are also new metrics to help monitor that connection usage over time.
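For the connection-lifetime part, Go's standard database/sql pool already exposes the relevant knobs; the sketch below shows the general idea with placeholder settings and an arbitrary driver, not Locket's actual configuration.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql" // driver choice is illustrative
)

func main() {
	// The DSN is a placeholder; a real deployment would get this from config.
	db, err := sql.Open("mysql", "diego:password@tcp(127.0.0.1:3306)/locket")
	if err != nil {
		log.Fatal(err)
	}

	// Bound how long any single connection may live, so connections are
	// recycled instead of being held indefinitely, and cap pool sizes so
	// a large installation doesn't exhaust the database.
	db.SetConnMaxLifetime(10 * time.Minute)
	db.SetMaxOpenConns(20)
	db.SetMaxIdleConns(10)

	// db.Stats() exposes pool counters (open, in-use, idle) that can be
	// emitted as metrics to monitor connection usage over time.
	log.Printf("open connections: %d", db.Stats().OpenConnections)
}
```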
A
All right, now let's take a deeper look into one of the problems that we addressed in the past year. Before we start, just for anyone who's not familiar: we're going to be talking about Locket, which is Diego's distributed locking component, backed by a SQL database. The BBS and auctioneer use it to maintain which of their many nodes is the active one, and the Diego cells also periodically check in with Locket so that the locking component can keep track of which cells are available for the control plane.
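A simplified sketch of the cell-presence side of that, with an illustrative client interface rather than Locket's real gRPC API:

```go
package main

import (
	"context"
	"log"
	"time"
)

// PresenceClient is an illustrative interface for a Locket-like service:
// a cell registers itself and must keep checking in before the TTL expires.
type PresenceClient interface {
	RegisterPresence(ctx context.Context, cellID string, ttl time.Duration) error
}

// maintainPresence checks in with the lock service on every tick so the
// control plane keeps considering this cell available for placement.
func maintainPresence(ctx context.Context, client PresenceClient, cellID string) {
	ttl := 15 * time.Second
	ticker := time.NewTicker(ttl / 3) // heartbeat well inside the TTL
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := client.RegisterPresence(ctx, cellID, ttl); err != nil {
				// A single missed check-in is tolerated; convergence decides
				// what to do if the cell stays missing.
				log.Printf("presence check-in failed: %v", err)
			}
		}
	}
}

func main() {
	// Wiring up a real client is out of scope for this sketch.
	_ = maintainPresence
}
```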
A
The BBS ends up using this information from Locket to make some decisions during convergence, which is the process of reconciling the desired state from the Cloud Controller with the actual state of the world that's running in your foundation. When the BBS makes decisions about changing the state of an LRP, or if an LRP crashes or whatnot, it sends events to the route emitters that run on each cell. They communicate with the Gorouter about existing routes, and they get this information from the BBS via events.
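Conceptually, a convergence pass looks something like the following heavily simplified sketch; the real BBS models and logic carry far more state than this.

```go
package main

import "fmt"

// Illustrative, heavily simplified types.
type DesiredLRP struct {
	ProcessGUID string
	Instances   int
}

type ActualLRP struct {
	ProcessGUID string
	Index       int
	CellID      string
}

type Event struct{ Description string }

// converge compares desired state against actual state and the set of
// cells still checking in with Locket, and returns the events that would
// be sent on to the route emitters.
func converge(desired []DesiredLRP, actual []ActualLRP, presentCells map[string]bool) []Event {
	var events []Event

	running := map[string]int{}
	for _, a := range actual {
		if presentCells[a.CellID] {
			running[a.ProcessGUID]++
		} else {
			events = append(events, Event{fmt.Sprintf("instance %s/%d is on a missing cell", a.ProcessGUID, a.Index)})
		}
	}

	for _, d := range desired {
		for have := running[d.ProcessGUID]; have < d.Instances; have++ {
			events = append(events, Event{fmt.Sprintf("start a new instance of %s", d.ProcessGUID)})
		}
	}
	return events
}

func main() {
	events := converge(
		[]DesiredLRP{{ProcessGUID: "app-1", Instances: 2}},
		[]ActualLRP{{ProcessGUID: "app-1", Index: 0, CellID: "cell-1"}},
		map[string]bool{"cell-1": true},
	)
	for _, e := range events {
		fmt.Println(e.Description)
	}
}
```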
A
So here we have a diagram of what the expected steady state of a system would look like. We have a BBS that knows about three LRPs: two running on cell one, one on cell two. Locket has a record of each of these cells, and they're both checking in with Locket periodically. However, in real environments running at scale, there are environmental faults that happen: network issues, latency, et cetera.
A
What happens when a cell is unable to check in with Locket for some amount of time? Cells are periodically checking in with Locket to make sure the state of the system stays updated, so what happens to the work running on a cell that is unable to check in with the control plane?
A
In the past, before the changes that we made recently, when cell one was unable to talk to Locket, the BBS would notice this during convergence, which runs periodically, and the LRPs running on that cell were replaced: the BBS tried to replace all the LRPs on this missing cell. What actually ends up happening is that you get downtime in your application, because those new replacement instances shadow the old instances. So why does this shadowing happen?
A
Fundamentally, we always represented LRPs as groups in the API and in all of our internal logic: all of the endpoints to fetch LRPs and all of the event endpoints grouped LRPs together. A group is basically the potentially evacuating instance and the ordinary, regular instance that correspond to an individual application instance.
A
So what did we do to fix it? Instead of grouping LRPs, we're now dealing with them individually. Instead of adding another field to the group, we thought it made more sense to move forward with a flattened structure, so we can be more flexible in the future if we need to make more changes. This was kind of a big change.
A
Ordinary means that the LRP is running normally; it's kind of a combination of the cell state and the LRP state together. Evacuating means the LRP is on a cell that's evacuating. Suspect means it's on a cell that we don't know what's actually happening with, because it hasn't checked in with Locket, so it might be gone, or it might not be.
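A rough sketch of what the flattened shape looks like compared to the old grouped one, with the three presence values; the types are illustrative, not the exact BBS models.

```go
package main

import "fmt"

// Presence captures the three states described above.
type Presence string

const (
	Ordinary   Presence = "ordinary"   // running normally on a healthy cell
	Evacuating Presence = "evacuating" // on a cell that is being evacuated
	Suspect    Presence = "suspect"    // on a cell that has stopped checking in
)

// FlatActualLRP is an illustrative version of the flattened record: one
// record per instance, with its presence carried alongside it.
type FlatActualLRP struct {
	ProcessGUID string
	Index       int
	CellID      string
	Presence    Presence
}

// The older, grouped shape paired an instance with its potentially
// evacuating counterpart; adding a third, suspect member would have meant
// widening this struct everywhere it was used.
type ActualLRPGroup struct {
	Instance   *FlatActualLRP
	Evacuating *FlatActualLRP
}

func main() {
	lrp := FlatActualLRP{ProcessGUID: "app-1", Index: 0, CellID: "cell-1", Presence: Suspect}
	fmt.Printf("%s/%d on %s is %s\n", lrp.ProcessGUID, lrp.Index, lrp.CellID, lrp.Presence)
}
```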
A
So with this change, what happens now? LRPs on the missing cell aren't immediately replaced: they're marked as suspect, and replacements are created but not immediately used. They're kind of waiting until we know that we should use them. Since we don't technically know the state of the cell, we're optimistic and expect the work to continue running, to ensure maximal routability to your apps during small environmental flakes, like one missed check-in with Locket.
A
If convergence notices that a cell is missing, we don't want to drop routing to your applications; we want to maintain routability as much as possible. So if a cell goes missing, once the replacements start running, they take over the routes to that application, and the original instances are deleted. So you should have a seamless transition if a cell kind of blips out of presence: if a cell is unable to talk with Locket, you should have a seamless transition and not have any downtime for your applications running on that cell.
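The takeover decision can be sketched roughly like this (illustrative types, not the actual BBS code): routes stay pointed at the suspect instance until a replacement reports that it is running.

```go
package main

import "fmt"

type instance struct {
	guid    string
	running bool
	suspect bool
}

// resolveSuspect decides which instance should receive routes: the suspect
// instance keeps its routes until the replacement is actually running, at
// which point the replacement takes over and the suspect is removed.
func resolveSuspect(suspect, replacement instance) (routable instance, remove *instance) {
	if replacement.running {
		return replacement, &suspect
	}
	return suspect, nil
}

func main() {
	suspect := instance{guid: "old-instance", running: true, suspect: true}
	replacement := instance{guid: "new-instance", running: false}

	routable, _ := resolveSuspect(suspect, replacement)
	fmt.Println("routing to:", routable.guid) // still the suspect instance

	replacement.running = true
	routable, remove := resolveSuspect(suspect, replacement)
	fmt.Println("routing to:", routable.guid, "removing:", remove.guid)
}
```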
A
Here we're just seeing the removal of those LRPs from the database; they're no longer going to be routable. However, if the cell actually ends up coming back, so that the next time it's able to check in with Locket the control plane is able to find it again, we just delete the replacements. We know that the original instances are already running and already in a good state; we don't need to penalize the cell for missing a heartbeat. We just keep those running and use them, and get rid of the replacement instances.
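And the other branch, sketched in the same illustrative style: when the missing cell checks back in, the suspect instance is promoted back to ordinary and the unused replacements are discarded.

```go
package main

import "fmt"

type instanceRecord struct {
	guid     string
	presence string
}

// cellReturned handles the case where the missing cell checks in with Locket
// again before any replacement has taken over: the suspect instance is kept
// and promoted back to ordinary, and the unused replacements are deleted, so
// the cell is not penalized for a single missed heartbeat.
func cellReturned(suspect instanceRecord, replacements []instanceRecord) (keep instanceRecord, discard []instanceRecord) {
	suspect.presence = "ordinary"
	return suspect, replacements
}

func main() {
	keep, discard := cellReturned(
		instanceRecord{guid: "original", presence: "suspect"},
		[]instanceRecord{{guid: "replacement", presence: "ordinary"}},
	)
	fmt.Println("keep:", keep.guid, "as", keep.presence)
	for _, d := range discard {
		fmt.Println("delete:", d.guid)
	}
}
```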
B
We want to focus on the right things within our team so that we can realize the goals of the community and the foundation, and that means prioritizing and de-prioritizing work to ensure future savings in effort for the core CF teams. This means that refining and fixing the current feature set that's in Diego is a high priority for us, and introducing new features is not as high a priority. If a feature set can be realized as part of the evolution of the platform, it might be best to start it there.
B
Another area that we are continuing to focus on is the Diego operator experience: reducing the time to debug component and app failures. That's an area where we have heard from users and customers that they would like to have more visibility, and an easier time debugging these failures when and if they happen.
A
You can actually now disable those endpoints if you are not using TCP routing, and your app traffic and CF SSH traffic will go over TLS ports that are proxied by the Envoy proxy that comes with each of your application instances, so everything will be over TLS in that regard. It's not currently compatible with TCP routing, but if you just have HTTP apps, everything should be fine. And of course, we're also trying to work on updating dependencies in a more automated fashion.
B
That's all we have for you. Shoutouts that we want to give out: to our PM Josh and the developers on the team; to the CF component teams that we worked closely with, Buildpacks, CAPI, CF Networking, Garden, Garden Windows, and Percy; and most importantly, the CF community, every one of you. A lot of these issues have been discovered via GitHub issues and Slack, so please keep them coming.
B
This is how we can make the system more robust and more stable. If you're experiencing any issues at scale, whether with scheduling or with your apps crashing, please let us know; and if you'd like to have a better experience debugging these failures, please let us know as well. With that, we have a few minutes for any questions, if you have any.