From YouTube: Network Service Mesh Meeting 2019-12-10
Description
Network Service Mesh Meeting 2019-12-10
A
Okay, it's now five past the hour, so let's get started. Welcome to the December 10th Network Service Mesh meeting. Make sure that you add yourself to the attendee list in the document; the document is shared over the chat.
A
So next week we have the CNCF Networking Working Group, which has been rebooted. They meet every two weeks on Tuesday at 9:00 a.m.; the next call is on Tuesday, December 17th, so immediately after the Network Service Mesh meeting. Among major events going on, the FOSDEM 2020 call for proposals is now closed; it finished November 19th. The acceptance letters are going to be sent, but I was told we are still waiting, and they did announce that they're going to post registrations on December 9th. Does anyone who is following that know whether that happened? Yeah.
D
Actually, the talks that we submitted got some declines, so...
C
I did speak to one of the organizers. If we do wind up with someone who really would like to give a talk, we may be able to supply something under the wire; again, if someone feels strongly, reach out. Otherwise, the CFPs are closed. I know the organizers; they're super nice people, and FOSDEM is like the least formal conference on earth. It's a good time. Yeah.
A
Well, we have KubeCon + CloudNativeCon Europe coming up in Amsterdam, March 30th through April 2nd. The call for proposals closed on December 4th; notifications will go out in January, and the schedule will be announced on January 22nd. There are multiple talks that have been sent in. Please add your CFP to the spreadsheet that is on the list if you haven't; I don't think I've added mine.
A
We also have ONES North America coming up. The CFPs should open up relatively soon; I don't think we have an exact date yet, and I'm not even sure we have a URL for it, other than the fact that it's going to be announced. As we learn more, we'll post more about it.
B
I see, okay; I don't know why I can't find the agenda today. Okay, so here we have a playlist; they are also linked in the agenda.
B
There were a number of discussions around two specific topics. One is the issue that Elia is facing. I see that Ryan is here on the call, so Ryan, please give a quick update, because we were discussing today what's going on with the support for Helm 3.
E
What's going on: I put together this PR that wraps the Helm client. It looks at what version of the Helm client you have when invoking make, and it invokes the Helm client accordingly. That's just the first step; from what I can tell right now, with my limited exposure to the project, this gets us going toward Helm 3 support. If we were to go ahead and merge this, what it allows us to do is continue with Helm 2 as we've done, but folks who might be using a distro that starts packaging Helm 3 won't be surprised by things exploding on them when they try to run the examples. I think this also insulates CI: as we move forward and start adopting Helm 3 in CI, this gives us a bridge to get there, so we don't have to make the jump to Helm 3 right away.
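The PR itself isn't shown in the call, but the core idea Ryan describes — look at which Helm client is installed and branch the invocation on it — can be sketched. This is a hypothetical illustration, not the actual PR; the function names and the exact argument shapes are invented for the sketch (Helm 2 took the release name via `--name`, Helm 3 takes it positionally).

```go
package main

import (
	"fmt"
	"strings"
)

// helmMajor guesses the major version of the helm client from the
// output of `helm version --short`. A Helm 2 client prints something
// like "Client: v2.16.1+ge13bc94"; Helm 3 prints "v3.0.1+g7f62278".
func helmMajor(short string) int {
	short = strings.TrimSpace(short)
	short = strings.TrimPrefix(short, "Client: ")
	switch {
	case strings.HasPrefix(short, "v2"):
		return 2
	case strings.HasPrefix(short, "v3"):
		return 3
	default:
		return 0 // unknown
	}
}

// deployArgs builds the install invocation for a chart, differing
// between Helm 2 (release name via --name) and Helm 3 (positional).
func deployArgs(major int, release, chart string) []string {
	if major == 2 {
		return []string{"helm", "install", "--name", release, chart}
	}
	return []string{"helm", "install", release, chart}
}

func main() {
	for _, v := range []string{"Client: v2.16.1+ge13bc94", "v3.0.1+g7f62278"} {
		m := helmMajor(v)
		fmt.Println(m, strings.Join(deployArgs(m, "nsm", "deployments/helm/nsm"), " "))
	}
}
```

In the real wrapper this decision would happen once, when make invokes the helper, so every example and CI job gets the right client behavior without changes.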
C
Alright, it's probably not strictly related to the PR; we'll take a look. We've had a little bit of instability recently in the CI. As you might imagine, when you run hundreds and hundreds of tests across multiple clouds, you occasionally run into some interesting hiccups, my favorite historically being with one of the public cloud providers.
C
That's very good, much appreciated, and kudos to you guys for already being on Helm 3 — that's really recent. And I very much like that Helm 3 is a great example of the community listening, because apparently what they got was a whole lot of "Tiller must die, die, die", and they listened.
B
Which I think he's trying to avoid and to work around in some other ways. But what we know is that there is a fix for these issues in the current Helm master, and eventually Helm 3.1 will be released, though I don't know when that will happen. It's good that we are already preparing for this, but we probably want to adopt 3.1 when it gets released.
B
Okay, and it's good that you mentioned these hundreds of tests that we run across the clouds, because that brings us to the CI discussion, which I would like to raise here as well; we had it on the morning call — okay, morning for me. Are we sure that we want to run all these cloud tests on each and every push of something? I know that we get some optimizations there.
C
Okay, so I totally get that. Ideally, what you would like is a world where you have a bunch of unit tests that run well, that give you sufficient coverage and sufficient confidence that you will probably be okay, and that run quickly; and then you probably still want to run the cloud testing in CI on every merge to master, so you know who broke what. Does that sound about right?
B
I would design my ideal world slightly differently. In my ideal world, when a developer develops a PR, it's run against Packet, for example, only, which will take X time, whatever that is, but I don't have to wait for some clusters to get spun up in some cloud provider. Then, before I as a maintainer merge it, I can press a button, whatever that button is, and the tests get run against it. Alternatively, the developer can choose to run them themselves if they want to, but not against each and every push that they do.
F
I was going to say: another option for this problem, when you have long-running tests, is that the cross-cloud tests might be a third stage after the integration tests. Say your integration stage takes an hour, and your cross-cloud one, if each cloud is an hour or so, say three hours, whatever — it queues. So if you have five PRs, the first one might fire off and go all the way through to the third stage, the cross-cloud stage, and then...
C
You want to try and limit that, and you also end up with the simple acknowledgement that it's an imperfect system. So, for example, you can get what I call the ships-in-the-night problem, where you've got PR 1 and PR 2, both pass the CI, and so you merge PR 1, and PR 2 merges clean but also completely breaks things — say PR 1 changed the name of a function and PR 2 added an invocation of that function.
C
So I think this is actually slightly germane to the conversation I wanted to have around paths, because one of the things I think will come out of the shifting of forwarders to cross-connects, as well as the path stuff, is that a whole lot of stuff becomes much more unit-testable, much more reliably, because the system becomes much simpler to reason about.
C
So what I think I'm hearing is: the testing is taking too long, even as we bring in more optimizations. We have had some issues lately with a little bit of intrinsic flakiness that we're still chasing down, and we get those issues periodically.
G
Hello guys, this was me — my name is Alex, actually Alessandro, but you can call me Alex. I work for Red Hat, and I've been talking with Ed for quite a while now about building an operator to install Network Service Mesh. So if I can, I can actually talk a little bit about that — I don't know if we're going to follow the order on your agenda, but it was me, yes.
G
We are trying to get that community as strong as we can. So, to the question: what is an operator? An operator is, we can say, a regular Kubernetes workload, but its main goal is to regulate the whole lifecycle of your application, whatever application it is. We kind of distinguish these two into operator and operand. It covers everything that is related to Kubernetes resources, be they custom resources or regular resources such as daemon sets, deployments, stateful sets, or whatever.
G
Then you can begin to bring in other steps, such as backing up and restoring stuff. You can do high-availability checks, intelligent metric collection, and after that you can get deep insights into your applications. Using the operator you can expose specific metric endpoints, and once you are mature enough you can bring your operator to the auto-pilot phase, which can actually rebuild everything, change cloud providers, restart whatever you want, because you are not dealing with only YAML files.
G
You are inputting every single bit of your configuration through a custom API that is built for that purpose. So an operator is basically taking the human knowledge that is used to, let's say, manage the whole lifecycle of your application, and putting that into code. Once you have it, you can manage your application in a very automated way. That's kind of the way I define an operator, and what we at Red Hat try to do is spread operators around.
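The "human knowledge into code" idea boils down to a reconcile loop: read the desired state from the custom resource's spec, compare it to what is actually running, and act to converge. A minimal, dependency-free Go sketch of that loop follows — the field names (`ForwarderReplicas`) are invented for illustration and are not from any real NSM operator; a real one would use controller-runtime and the Kubernetes API instead of these toy structs.

```go
package main

import "fmt"

// Spec is the desired state, as it would come from a custom
// resource's spec (hypothetical field for illustration).
type Spec struct {
	ForwarderReplicas int
}

// Cluster stands in for the observed state a real operator
// would read back from the Kubernetes API.
type Cluster struct {
	ForwarderReplicas int
}

// reconcile moves the observed state one step toward the desired
// state and reports what it did — the core loop an operator runs
// every time the CR or the cluster changes.
func reconcile(spec Spec, c *Cluster) string {
	switch {
	case c.ForwarderReplicas < spec.ForwarderReplicas:
		c.ForwarderReplicas++
		return "scaled up forwarders"
	case c.ForwarderReplicas > spec.ForwarderReplicas:
		c.ForwarderReplicas--
		return "scaled down forwarders"
	default:
		return "in sync"
	}
}

func main() {
	spec := Spec{ForwarderReplicas: 3}
	c := &Cluster{}
	for i := 0; i < 4; i++ {
		fmt.Println(reconcile(spec, c))
	}
}
```

Because the loop only ever compares desired against observed, it is level-triggered: it converges the same way whether the drift came from a user edit, a crashed pod, or a fresh install.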
G
No, it's supposed to be completely easy to install anywhere, and to me that's key, because we want it not only on OpenShift but also on regular Kubernetes. So wherever you need to install the operator itself, it should work just fine. One of the things that could actually make this operator really powerful is bringing in another piece of code called the Operator Lifecycle Manager, which helps with versioning, with upgrades, and even with downgrades if needed.
G
That thing could take care of the whole lifecycle of your operator and of the application. Also, one thing that I think is really important to say: looking into the NSM architecture, I spotted that you have two other applications alongside NSM — one of them is Spire and the other one is Jaeger tracing. Those two could have their own operators, right? So the Operator Lifecycle Manager could actually take care of those dependencies, and you get a very app-store-like experience.
G
If you go to that level of maturity, you can say: hey, my operator depends also on Spire and depends also on Jaeger, and the Operator Lifecycle Manager will immediately install those for you, before getting your operator running, and your operator will install everything on top of it. Did I answer your question? Sorry.
G
So I have my screen here. It is in a very, I would say, sketchy state yet. Here, let me see — I can run the operator from the Operator SDK on my local machine, so I'm running that; it is talking to a Kubernetes API that is already configured here. I'm just actually running straight from the root of my project, and I have a few watches here.
G
Looking into the Kubernetes side, Spire is already deployed; I did that by hand using Helm. There are a few things that I'm still troubleshooting, but if I apply the CR that I have here for NSM, it should apply everything. It spat out a few error messages because it doesn't have some objects, but everything is coming up here. If you can see, the pods are here; I have the initial webhook coming up here.
G
My service accounts are all here; I have a daemon set coming up here. I'm still troubleshooting this last container in the network service manager — there is something related, probably, to permissions; I don't know what happened, but it's almost there. It's in the very beginning, but it is quite straightforward. I can actually post on the chat the link to my GitHub for you guys to take a look. It is really in a very, very early stage, but that's pretty much it.
G
If we look apart from the CR, which is taking the specs, it's coming from the API. I could have a separate session on that, because it would take a long time, but we could look at specific pieces of the API. One is the piece that builds the CR, which is this one, the NSM spec; and on the NSM status I have nothing, actually. I need you guys to help me understand here what you would want to see on your NSM object.
G
So my spec is here. I could have status fields from that particular piece that I just showed, where I can see all the services and endpoints, how it's configured, and what it is delivering. And this is with just a few lines of code — like two weeks, give or take, on my free time, and some of it was working time, building an operator.
C
You actually raise a super interesting question about what actually makes sense to put in a status. Just the ability for people to see, in a straightforward way, the status of the system as a system — independently of the other good things that are done here — is actually really cool. So we would definitely want to think about what a meaningful status of the system is.
G
It's pretty hard to actually know everything upfront, so what I can do to help you guys and contribute is to have a brainstorm on that and bring something on the status to the next meeting; we can try to evolve it little by little. And my question at this point is — I don't know if I showed you guys; let me show you something else here real quick. Let me see if I can — let me just change this.
G
Oh, let me just — I'll try to stop sharing and bring another screen, just a second. Let me bring the other one in. Okay, now I think I can — yeah, I think you can see. I have two monitors here, and I get lost sometimes. Okay.
G
So this is the OpenShift OperatorHub. Let's say that I can see NSM here. A developer would just type something — I could pick anything, like Spark, for example, which is a popular one. So I'll have Spark here; I can click there, I can see an install button here, and I can have a whole UI on that. I can put some configurations on top of that.
C
A lot of us here totally get that operators are wonderful. That's part of why I was so excited when you showed up, because operators have been one of those things that we just haven't gotten to, but we've known that we've wanted them for a long time. So we're super excited that you're willing to work on it.
G
Well, in my point of view, my big role would be to kick-start the project. I don't need to keep that on my GitHub account; I would gladly leave it to the NSM org on the Network Service Mesh GitHub — it's not a problem for me. I would kick-start the project and bring people in. If I can help with any kind of technical stuff — we can even do workshop sessions on how to build the thing; we can do that online.
A
Yeah, let's do that. So how about we discuss this on Slack, and let's figure out what the right shape of this looks like. It doesn't need to be a private repo; it should be in the main repo itself. So let's go over some of the details on this, and we can work out whether we want to schedule something for people in the community. If you want to go over that, we can — what's your time zone? We would love to have you.
A
One thing that I will also mention towards this: our heuristic when we build NSM is that we try to build it with no single point of failure. So if your data plane dies, we can repopulate the data plane with its information to continue. If the manager dies, we can ask the data plane and the clients and the peers, "What do you think the state of the world is?", and then come to a consensus.
A
So one thing — and this is something we're going to be very careful with in the community, to help progress this forward — is that we make sure that we build our operators in the same spirit. Operators generally lean in this direction, but that's also something that we want to make sure we don't lose.
G
Sure, I can totally do that. This, I think, is OperatorHub — I don't know if you guys know it; this is a community one that is outside of OpenShift, so it's there for the whole community, any version of Kubernetes, just for you to see. It's kind of the same experience, and we try to match those two.
C
Well, yeah. We had talked a few weeks ago about wanting to move to consolidating the forwarder as a cross-connect network service, to simplify the overall operation of Network Service Mesh. In the course of looking at that refactoring, we had to rethink how our healing was done.
C
Now, our current healing is very complex, and it sort of involves the fact that, for cross-connects, the forwarders were quite a bit different in some ways, even though the APIs are semantically similar. Now that they're becoming more like the other elements of the system, we needed to rethink the healing, and hopefully rethink it in a way that's simpler. In the course of that rethinking, this path concept emerges; let me walk through it in stages.
C
So the proposal is, number one: leave the Network Service API alone. The existing API doesn't change: you still have a network service request — the request has a connection and mechanism preferences, gets passed to a Request that returns a connection, and you can Close a connection — so all of that remains the same. But then you introduce what I'm calling a path. I'll start over here on the connection: effectively, you replace the repeated network service managers with this path element, and you no longer need the state.
C
So if you come around to the path element: a path is an index that points into a repeated — in other words, a list — of path segments. Think of a repeated in protobuf as an array; the index points to a particular element in this list of path segments and tells you where in the path you happen to be at this moment. Then the path segment itself has, first, a name.
C
This is basically the name of the thing that is providing the Request API: if it's a network service manager, it would be the name of that network service manager; if it's a forwarder, it would be the name of the forwarder; and if it's a network service endpoint, it's the name of the network service endpoint. The ID is the connection ID that was issued by that entity for this path segment, and then we have two tokens associated with it.
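The structure just described — an index into a repeated list of segments, each carrying a name, a connection ID, and two expiring tokens — can be sketched in Go. This is a sketch following the spoken description, not the eventual proto; field names are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// PathSegment records one hop in the request path: who handled the
// request (client, NSMgr, forwarder, or endpoint), the connection ID
// that hop issued, and the two tokens with their expirations.
type PathSegment struct {
	Name           string // e.g. "nsmgr-1", "forwarder-1", "nse-1"
	ID             string // connection ID issued by this hop
	Token          string // request token, set by the requester
	TokenExpires   time.Time
	ReturnToken    string // response token, set by the responder
	ReturnExpires  time.Time
}

// Path is an index pointing into a repeated (list of) PathSegment,
// telling you where in the path you currently are.
type Path struct {
	Index    int
	Segments []PathSegment
}

// Current returns the segment the index points at, or nil if the
// index is out of range.
func (p *Path) Current() *PathSegment {
	if p.Index < 0 || p.Index >= len(p.Segments) {
		return nil
	}
	return &p.Segments[p.Index]
}

func main() {
	p := Path{Index: 0, Segments: []PathSegment{
		{Name: "nsc-1", ID: "1"},
		{Name: "nsmgr-1", ID: "2"},
	}}
	fmt.Println(p.Current().Name)
}
```

Because each hop only reads the segment the index points at, a hop never needs to understand the whole chain — it just trusts the tokens on its own segment.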
C
There's the request token, and the response token from the endpoint that's accepting your connection, which you can think of as an authorization token. The reason that's there is that you want to be able to re-request in the event of some difficulty, or in any number of events that I'll get to shortly. An example of this would be: you could have index 0 pointing to the first element of the path segments, and then your path segments could be, well —
C
First I went through the network service manager, which successfully went through a forwarder; the forwarder handed it back to the network service manager, and the network service manager handed it off to the network service endpoint. These are all the IDs that were issued by each of these for their connection, and so you have sort of an end-to-end picture, flowing back through the return path, of exactly what path is being taken here. And this path can be extended through chains of NSEs, if need be.
C
If that is actually a thing that logically makes sense — for example, we have passthrough entities where that might be a good idea. And then a quick note about path token expirations: any good JWT token has an expiration claim, so both the request token and the response token are always going to have an expiration. The request token expiration is set by the requester.
C
The response token's expiration is set by the responder, and the one thing that has to be true about these is that they've got to expire before the underlying identity certificate expires, because those also have expirations — they should always expire before they become invalid. Is all of this making sense so far?
C
When we talked about the forwarder stuff, we jumped to an activity diagram, and it turns out that I've got one here I wanted to walk through. It turns out that this ends up being fairly simple, and it turns out that auto-healing can be a very robust, emergent property of the underlying system.
C
So this is an activity diagram, and it starts here. The presumption is that this is either starting because you're a leaf client who wants the network service, but this could also be where you start if you're a pass-through thing like a network service manager, a forwarder, or a passthrough NSC as well. So you do whatever it is you do to compute your request.
C
You do that for the endpoint you're going to send the request to, and you make sure to append a new path segment to the path; in doing so you have to increment the path index, and then you set the request token in that path segment. You send that request to the endpoint, which receives the request and does whatever processing of the request it's going to do.
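The "append a segment, bump the index, set the token" step above can be sketched as a small, self-contained helper. This is an illustration of the described flow under assumed type and method names, not NSM's actual code.

```go
package main

import "fmt"

// Minimal path types for this sketch (names assumed).
type PathSegment struct {
	Name  string
	ID    string
	Token string
}

type Path struct {
	Index    int
	Segments []PathSegment
}

// extend appends a segment for the named hop and advances the index
// to point at it — exactly the step a requester performs before
// sending the request onward, per the activity diagram.
func (p *Path) extend(name, id, token string) {
	p.Segments = append(p.Segments, PathSegment{Name: name, ID: id, Token: token})
	p.Index = len(p.Segments) - 1
}

func main() {
	p := &Path{}
	p.extend("nsc-1", "1", "jwt-a")   // leaf client starts the path
	p.extend("nsmgr-1", "2", "jwt-b") // manager extends it onward
	fmt.Println(p.Index, p.Segments[p.Index].Name)
}
```

Each hop that acts as a client repeats this same call, which is what lets passthrough elements reuse one behavioral flow.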
C
This might include extending the path for a request that it sends, as a client, to someone else, using exactly the same flow you normally get in the client. Presuming that that works out well — we're not going into the error case — you want to set the path segment name to be the name of the network service endpoint, which also could be the name of a network service manager.
C
Or, if it's playing that role, a forwarder. You set the path segment ID to whatever ID you've issued for the connection, and you set the path segment response token for this path segment. At this point, this red bar is essentially a fork: the normal process becomes returning the connection, because you've done your work at this point, and then there's the forking piece, which you can see being done with a go routine.
C
This means that if the client dies, you're always going to have your connections cleaned up eventually. So, going back to the client: we return the connection to the client; it receives that connection, presuming that it does not actually receive an error. We come to another sort of fork point here: this little circle with an X through it means that we're done with the flow, which basically means whatever function we called to do this, we're going to return from. But we also then run a go routine that waits for either the halfway point of the request token, or the halfway point of the response token, or for monitor connection to indicate that we're missing a connection from the endpoint because it was deleted. That comes back up around here to resetting the path segment request token and resending the request, and we fall through again.
C
Its monitor connection eventually reestablishes a connection to the endpoint that has restarted, and it gets an initial state transfer. The initial state transfer lacks the connection that it was expecting but believes it has with the endpoint, so the client simply resends the request, including its path, to the endpoint, and gets back in return a connection.
C
The forwarder sends a request to the network service manager, and then the network service manager sends the request to the network service endpoint; there's negotiation along these steps, and each one results in a connection. Now, let's say the network service manager restarts. The client is going to learn, from its initial state transfer, that connection one is gone, so it's going to restart its request process with the network service manager.
C
So the only element in here that is effectively aware that anything has necessarily gone wrong in this case is the network service client. Now, in parallel, the forwarder also has a connection that it expects — it's a client of the network service manager, which is connection three — and it may also discover that that connection is missing and may also re-request it. But again, everyone else in the chain is going to treat this as if it were a refresh of the token, so it doesn't look strange or abnormal.
C
Yeah, no, that's absolutely true. In fact, it's sort of requisite for that, because we have to have healing, and this is essentially a simple way to get to a generic set of healing in the system. So, dropping back really quickly — I actually have slides in here that link to all of this — dropping down to the last one, with advantages: this is a huge simplification of our healing.
C
It uses a single behavioral flow everywhere, which means that we can write some very small components that can be used in forwarders, network service managers, and network service endpoints — including passthrough network service endpoints, etcetera — to do healing. You get robustness as a property of the system: it can heal if all the components except for the leaf client restart. So you can restart all the network service managers and all the forwarders, and it will succeed in healing. Healing only flows backwards, not forwards.
C
This ends up being really simplifying, and healing is indistinguishable from token refresh if you're the endpoint. You can't actually tell the difference between "the guy before you died and the guy before him is re-requesting as part of a healing chain" and "the guy before you decided to refresh his token" — they look exactly the same to you. From a security point of view, this actually increases our security greatly, because connections expire unless their tokens are refreshed.
C
And then, from a robustness point of view, connections do not get torn down unless they expire. So if I'm an endpoint, until that connection expires I am always available to have someone heal that connection, so we have to worry much less about the timing of healing. If, for example, you were to issue tokens with a 10-minute lifetime, then if some element comes back quickly, you will heal quickly; but if some element comes back slowly, for some strange reason...
C
The client — you know, we hit this fork point: one half of it forks and returns the connection to whatever asked for it, but the other goes and runs a go routine. At the halfway point of expiry of either the request token or the response token, whichever is sooner, it comes back up here, sets a new request token with a new expiration, and resends it.
C
Really, from a scaling point of view, it comes down to how many requests you expect to receive. So, for example, if I'm a network service manager on a node: nodes are typically scaled to run about a hundred pods. If you presume each of those hundred pods, for some bizarre reason, has five connections going, then you're talking about a hundred renewals per minute for the refresh.
B
I want to add something really quick. When we as a community speak about healing and being able to react to failures, I would just like to point to the opening keynote for KubeCon North America, where it was said that failures are here, and we need to provide infrastructure that is able to cope with them. So it absolutely has to.