From YouTube: Multi-Network community sync for 20230524
A: Start recording. All right, welcome everyone to the Multi-Network community sync, it's May 24th. On the agenda today we are continuing the discussion about DRA and, if time allows, we can continue talking about the KEP. If there's any other agenda item, please add yourself, probably before the KEP discussion, because that one usually takes more time than the rest of the meeting. And I think, Patrick, you're up first. Do you want to kick it off?
B: So, just let me know... I think it's still rendering for me. There, that's the first slide, so yeah. This is a presentation of the general concepts and features of dynamic resource allocation, with a twist: there are some additional thoughts at the end, where we can spend as much time as we have discussing how it relates to multi-network. My name is Patrick Ohly, I work for Intel on upstream Kubernetes, and I'm currently mostly focusing on dynamic resource allocation,
B: besides some other maintainer activities that I'm also responsible for in Kubernetes. Let's get started. I assume all of you know most of this, but this is usually where I start explaining what the current resource model in Kubernetes is, because it defines exactly the things that we are trying to improve upon with dynamic resource allocation. First of all, resources are local to a node, advertised by kubelet. It's a simple model of things that are countable and linear, so kubelet and the scheduler can make assumptions about how they can be split up.
B: Resources are specified in the pod spec as something that belongs to containers, but for you that already becomes a problem when you're talking about network resources, a network card, because that is something that is shared between containers. As for resource types: CPU and RAM are considered native resources. Then we have extended resources; that's where we have some extension mechanisms in Kubernetes. Those are identified and discovered by device plugins and advertised by kubelet, but they still fit into that model of a single counter,
B: some number of things, basically, on a node. I already mentioned that the scheduler makes assumptions about these resources during pod scheduling, and it can do that because it doesn't need to consult with anyone else. There is a scheduler extender concept, but it's fairly limited; it's ultimately still the scheduler that decides whether a pod fits onto a node. Then it subtracts all resources used by that scheduled pod from the available resources on the node, and then it knows how much is left and can continue scheduling the next pod.
B: So DRA is an attempt to overcome these limitations that I just mentioned. It's a rethinking of what a more flexible resource API could look like, and the use cases that we have in mind are, for example, things that aren't local to a node: something that is connected to a node on demand through the IP network, like an IP camera, or through some special hardware interconnect, like CXL, might be examples. These are things that can't be managed by a single kubelet.
B: So we need some different way of tracking how much of those are available in a cluster. Then we have more complex hardware nowadays. GPU accelerators are the main example that we currently have, and they are also the one example with actual implementations, for Intel and NVIDIA GPUs. Those GPUs are more complex: a single slice of a GPU has multiple aspects to it, like how much RAM it has and how many compute units, and then there are different types of GPUs with different features, and those aren't even integers.
B: It's a feature flag, basically, that says this GPU can do this or that, and these are things that must be considered in combination to select a suitable GPU for a pod. And then another use case is specifying, somewhere in the Kubernetes API, additional parameters that are required to pre-configure that hardware.
B: The FPGA example is listed here: reprogramming an FPGA is typically a privileged operation, something that we don't want to allow individual programs to do themselves, because that would imply that we need to give them full access to the hardware. Instead, the driver needs to set up the FPGA first, load the program, then hand over a device identifier, the device node, to the program with fewer privileges, and then the application can just use the pre-programmed FPGA.
B: A GPU instance is one example, or, if it's something that's expensive to set up like an FPGA, we might even want to share that same pre-programmed FPGA between different pods, either concurrently or one after the other, depending on what's most suitable for the workload. All of that needs a different API shape to define the relationships between containers and the resources that they're using. Another term, another technology, that you'll see mentioned is CDI, the Container Device Interface specification. That's a lower-level thing;
B: it's almost like an implementation detail in DRA. It's a JSON spec, supported by container runtimes, that enables someone to inject additional resources into a container.
B: Next, the resource class. If you have questions, feel free to interrupt, or I can keep going and we can come back to the slides. So anyway, the resource class is the privileged object created by an admin when deploying support for certain hardware, together with a resource driver. The resource class has parameters that are known to be privileged, so these are things that normal users shouldn't be allowed to set, something that gets chosen by the admin. It also ties a certain resource category to a specific implementation
B: through the driver name. This is like a CSI driver, a container storage interface driver, in a way. Then the resource claim: that's the separate object created by the user, together with their application, and the resource claim is where the user can specify parameters, basically their requirements for the resource. That's a difference compared to how it's done in SIG Storage. We are also using the resource claim status to store information about the actual allocated resource.
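[For reference, a minimal sketch of the two objects just described, based on the alpha resource.k8s.io API from Kubernetes 1.26/1.27; the driver name and parameter object below are made-up placeholders, not from the talk:]

```yaml
# Privileged object, created by the admin together with the resource driver.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClass
metadata:
  name: example-gpu               # placeholder
driverName: gpu.example.com       # hypothetical vendor driver name
---
# User-created object describing what the workload needs.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: my-gpu
  namespace: default
spec:
  resourceClassName: example-gpu
  allocationMode: WaitForFirstConsumer   # "delayed allocation"
  parametersRef:                         # points at a vendor-defined object
    apiGroup: gpu.example.com
    kind: GpuClaimParameters             # hypothetical vendor CRD
    name: my-gpu-parameters
```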
B: That simplifies a lot of things. A lot of problems in SIG Storage nowadays are around synchronizing the PersistentVolume object and the PersistentVolumeClaim object, and there are still cases where you can have leaked volumes when something gets deleted at the wrong time. It's a fairly complicated dance with lots of finalizers to prevent things from happening that are just difficult to track, because it's two different objects. There was a KEP in Kubernetes a while back
B: that said that status is also okay to be used for information that can't be restored from some other state. That's why we can do this. It wasn't possible before, because status was supposed to be something that is ephemeral and can be restored, but not anymore; we can do that now. And finally, the pod has some additional fields that reference a resource claim, and I'll give you an example on the next slide.
B: Now, this is built in. What isn't built in is how to represent parameters. Those are not embedded inside those objects, but rather live in separate objects, typically CRDs defined by a vendor. You get full validation of the parameters at the time of creating them, which is nice for users, because if they make a mistake, they will get very early feedback from the Kubernetes cluster that something is wrong.
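[A sketch of what such a vendor-defined parameter object might look like; the API group, kind, and fields are hypothetical, only to illustrate that the CRD schema gives that early validation:]

```yaml
apiVersion: gpu.example.com/v1alpha1   # hypothetical vendor API group
kind: GpuClaimParameters
metadata:
  name: my-gpu-parameters
  namespace: default
spec:
  count: 1          # typed fields instead of a string-to-string map,
  memory: 8Gi       # validated by the CRD schema at creation time
```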
B: Another object, another concept, is the communication between the scheduler and the resource driver. I'm trying to keep deployment simple, so all I'm requiring is that a resource driver can connect to the API server, and all communication goes through the API server, in a shared PodSchedulingContext object.
B: That also has the advantage that multiple different drivers can collaborate on the same object and work together to pick nodes. The actual protocol is a bit more involved and may involve some back and forth between the scheduler and the driver, but so far it's working fine. And I've already mentioned the allocation state that is stored in the claim status.
B: So here's an example. There are multiple different objects involved; that's the downside of the flexibility that we had to have, but we were basically asked during the API review of the initial implementation to really limit our modifications to the core API to very few fields. So this is what we have added: containers have, under resources, a new claims array. It's ultimately just a list of names, and the name here is internal to the pod, chosen by the user.
B: That name refers to some other new array, the pod-level resourceClaims, and that is where we define where the resource comes from. Directly referring, as in this example, to some existing resource claim is one option. The other option is to reference a template, a resource claim template, and then there is some additional tooling in Kubernetes to create an actual resource claim from the template for each individual pod. That's what you would typically use if each pod needs its own resource.
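[A sketch of the pod-side fields just described, using the alpha API shape from Kubernetes 1.26/1.27; names are placeholders:]

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  resourceClaims:                  # new pod-level array: where the claim comes from
  - name: gpu                      # name internal to the pod, chosen by the user
    source:
      resourceClaimName: my-gpu    # direct reference to an existing claim
      # alternative: resourceClaimTemplateName: my-gpu-template
  containers:
  - name: app
    image: registry.example.com/app:v1   # placeholder
    resources:
      claims:
      - name: gpu                  # refers back to spec.resourceClaims above
```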
B: The last connection here that I just highlighted is from the resource claim to the config map, in this case. Typically, instead of a config map, it would be a CRD, something that defines exactly what the parameters are, and then the API will be a lot nicer: it will not just be a string-to-string map as in a config map, it could be arbitrary.
B: Nope, okay, yeah. This is an overview of the components and the types that we have had to add in Kubernetes. What the KEP is about is the left-hand side, the core Kubernetes changes. Obviously we need to add new types to the API server; ResourceClaim is shown here as the core object that everyone interacts with. The resource claim controller inside the controller manager is what handles the resource claim templates, but pretty much everyone else then
B: just deals with resource claims. It's a unified approach to providing resources for pods, instead of having two different code paths. Basically, everyone just waits for a resource claim to show up and then works with that, and the resource claim controller is the only component that actually deals with the template object.
B: The scheduler has a new plugin built in that knows about resource claims. In the filtering stage it looks at resource claims and whether they are allocated, and it collects potential nodes by working with the other plugins. Then, in the scheduler's Reserve phase, the resource plugin is called again. If a resource claim hasn't been allocated yet, we can't proceed in the Reserve phase, but we can do what's called delayed allocation: at that point the scheduler knows roughly where the pod could run. It knows about the available CPU and RAM,
B: it knows about other volumes, for example, and then the resource plugin, if sufficient information is available, can say: okay, let's create that resource claim for that node. That's where the PodSchedulingContext object comes in. It's not shown here, but that's what the resource plugin creates and what the vendors then react upon.
B: One key goal that we have for scheduling is that we never schedule a pod onto a node unless all resources are really available. For CPU and RAM, that's something the scheduler can guarantee itself; for resource claims, we achieve it because we expect the resource driver to do a real allocation, to set aside the necessary resources for that resource claim, before the pod gets scheduled. The scheduler waits for that: it's waiting for the status change in the resource claim that says this is allocated.
B: That's where the resource driver might set up some device nodes and prepare a CDI file, and what it reports back to kubelet is the so-called container device interface ID, a simple string that kubelet just passes through to the container runtime. Container runtimes know what to do with those IDs: they basically read the JSON file, apply the necessary changes to the container, and then the application can run.
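[A minimal sketch of a CDI spec file; runtimes accept these in JSON or YAML form. The vendor, class, and paths are placeholders; the ID the driver would report back to kubelet is then e.g. gpu.example.com/gpu=gpu0:]

```yaml
# e.g. /var/run/cdi/gpu.example.com-gpu.yaml (placeholder path)
cdiVersion: "0.5.0"
kind: gpu.example.com/gpu
devices:
- name: gpu0
  containerEdits:              # changes the runtime applies to the container
    deviceNodes:
    - path: /dev/gpu0          # device node prepared by the driver
    env:
    - GPU_VISIBLE_DEVICES=gpu0
```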
B: Only indirectly. I probably should have a slide with the PodSchedulingContext object, because that's where the communication between the scheduler and the resource driver happens. Before the scheduler selects a node, it knows which resource claims haven't been allocated yet; it knows that those resource claims are waiting for the pod. And here is what the resource plugin then does.
B: It creates a PodSchedulingContext object that initially just has a list of potential nodes, as identified by the first attempt to schedule the pod, and that list ignores where resources are available, because that wasn't known at the time yet. The resource driver then sees that for this pod the scheduler is trying to schedule, it sees what nodes might fit for the pod, and it can reply with a separate list,
B: saying: okay, I checked your nodes, but these are unsuitable because I don't have resources there, and it reports that back in the same PodSchedulingContext object. So ultimately that PodSchedulingContext object will have information from all resource drivers that are involved with that particular pod.
B: It might be more than one; the pod might have two resource claims that need to be allocated for it, from different vendors, but they both use the same PodSchedulingContext object. The scheduler watches for changes of that object, which triggers another scheduling attempt for the pod, and now it has more information. It sees that, okay, these two drivers that I'm waiting for both agree that this node here is suitable.
B: Let's try to allocate for it, and that's where the scheduler plugin again updates the PodSchedulingContext object, this time with the selected node name set. That again means scheduling needs to be aborted at that point. It's an iterative process, so it's slower than normal pod scheduling, but that's a price that has to be paid for the flexibility that comes with parameters that are entirely under the control of a vendor.
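[A sketch of the negotiation object described above, based on the v1alpha2 alpha API; node and claim names are placeholders:]

```yaml
apiVersion: resource.k8s.io/v1alpha2
kind: PodSchedulingContext
metadata:
  name: example-pod        # same name and namespace as the pod
  namespace: default
spec:
  potentialNodes:          # written by the scheduler's resource plugin
  - node-1
  - node-2
  selectedNode: node-1     # set once the involved drivers agree
status:
  resourceClaims:          # written back by the resource drivers
  - name: gpu
    unsuitableNodes:
    - node-2               # "I checked your nodes; no resources here"
```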
A: And that's the last thing that you said; that's why I'm asking. It does rely on the external controller here, right? Is that true? Yes? What I'm getting at is: let's assume we have all the resources for a pod, right? So there is no contention, nothing, no race, and I'm just scheduling a pod.
A: Now there will be an additional gRPC call and the processing, and you said it's indirect, but it's still in the scheduling process, that will be made to an external controller, which has to return and process and provide me information. So basically, what I'm getting at is the time from kubectl apply to the moment where the pod is ready.
B: No, there's no way we can build that into the scheduler. People have tried. People have tried to come up with some taxonomy of fairly flexible, generic resource parameters, but the problem is there are always some limitations that you can't model, or that you forgot to model, and then you need to update that model permanently to adapt to new hardware. We are not even trying that.
B: The entire parameter handling is on the resource driver's side. So let's talk about that for a second. It does need two parts now. The device plugin mechanism only had one; it only had a local kubelet part. Here we have a controller part, the part that interacts with the scheduler, and we have a local part on each node, and those also need to communicate between themselves
B: somehow. We have an example driver that is again using a CRD, where it's publishing node state in a CRD for its own internal consumption. That can get documented, so users can query the state of that hardware, but it's outside of DRA; it's entirely defined by the vendor.
B: Okay, yeah, short history. This has been a discussion for a while. It actually started with CDI, an effort between NVIDIA and Intel, starting at the lower-level parts, just talking with runtime developers about what that JSON spec should look like. Some container runtimes adopted it and implemented it, and then in the middle of 2021
B: we started talking about how to do the Kubernetes part, leading up to a first KEP draft that got accepted for Kubernetes 1.25, and we got to the point where it was declared implementable, with some caveats, some opens. There also was a prototype that I had been implementing along with the KEP, but the KEP got merged fairly late, and it had some changes that just made it impossible to actually get code into 1.25.
B: Instead, I proposed something fairly early in the 1.26 cycle, but because it was such a big change, we had problems getting reviewer attention. It got merged eventually.
B: It certainly wasn't easy, but it got merged as alpha in 1.26, and it has been alpha since then, because we are now exploring use cases and addressing review comments that came out of all of those review periods since 1.26. We're still working on those, and we have various opens for beta defined in the KEP that we still need to work on. The other thing that happened: well, I'm mentioning Intel here because I'm Intel, but NVIDIA also has a public alpha-quality driver for their hardware
B: that is based on DRA, so depending on your hardware, you can use either one or the other. In Kubernetes 1.27 we did improvements and cleanups, and benchmarks were added. The example driver that I mentioned, I have a link on the next slide, got published for KubeCon this year, and there was a presentation at KubeCon Europe where Kevin Klues from NVIDIA and Alexey Fomenko talked about that example driver and how to implement a DRA driver. So that's what happened this year, and now the big opens.
B: Well, some opens: scheduler enhancements are necessary. I mentioned that multiple pod scheduling attempts are needed for a single pod. The way the scheduler currently handles that is that it inevitably puts the pod into a backoff queue; there is a five-second delay that is completely unnecessary, but that's how it currently happens. That is being dealt with: I have PRs pending that eliminate that backoff period, because it's normal for such a pod to need multiple attempts.
B: Then the beta criteria also include defining how the cluster autoscaler could work with a workload that involves dynamic resource allocation. The traditional model is that the cluster autoscaler uses the same plugins as kube-scheduler; it therefore knows which pods fit onto which nodes, and that's no longer true for dynamic resources. So we are thinking about some kind of plugin mechanism for the autoscaler. It can't be exactly the same, because that would mean communicating through the API server, and that clearly would be disruptive for the rest of the cluster.
B: So that's where we may end up with direct gRPC calls, or WASM plugins, or native plugins that need to be compiled in. It's a bit open at this time, but I had a discussion at KubeCon Europe with the cluster autoscaler maintainers, or one of them, about this topic, and we'll pick that up. And then, eventually, if we solve all of these technical questions...
B: One of those that we also want to address is about CNI and how this could be used for networking. Eventually we hope to get to GA, perhaps in 2024, so that would be, yeah, Kubernetes 1.31. It's certainly a long road, but that's inevitable, because it needs time to get feedback, incorporate feedback, make changes; that's just the way it is. So I'm not going to do a demo today. I think our time is better spent on discussions than on me just walking through some shell commands.
B: If you are interested in the demo, I encourage you to go to this repo here, kubernetes-sigs/dra-example-driver. It has full instructions for bringing up a kind cluster, so no real hardware is needed, and the example driver simulates a GPU driver, so it's more realistic than what we have in-tree. Then you can see how this all works in a multi-node cluster,
B: just on your normal development machine, assuming that you have Docker on it and are running Linux or macOS. On the right-hand side you can see where you can find us if you have other questions. The main channel that we're using on Slack is the SIG Node channel, because SIG Node is the sponsoring SIG
B: for this work. We have regular meetings, if you want to be involved in those: we have one on Monday, where we talk through technical questions around DRA with our counterparts at NVIDIA, and there's also the CNCF Container Orchestrated Devices working group, where CDI gets discussed. And here's a list of names, just for your information, if you need to ask specific questions or want to associate some names with this project; these are all the people that have been working on this, both from Intel and NVIDIA. And I think... well.
B: Oh yes, I do. So you might remember this slide here, how it looks for DRA at the moment. For multi-network it wouldn't be that different. One difference is that these resources probably, and I'm speculating here, are not associated with individual containers, because the network is something that gets set up for the pod. So we need some kind of concept for pod-level resources, something that only shows up once here and doesn't need to be referenced in the containers.
B: That would be the first change that we somehow need to achieve, but it's doable. We had some ideas: it could be some field in the status that tells kubelet that this is a pod-level resource and then triggers some slightly different behavior, or it could be a new field in the resource claim that says, yes, this is a pod-level resource.
B: Both options probably would work. Then I wanted to call out that here I'm now actually using a CRD for the parameters, multinet.k8s.io, in a very first v1alpha1. That would be something that your working group could own, control, and define, with a spec that has parameters that are required to be supported by the different hardware vendors.
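[Purely illustrative, matching the speculation above: what a working-group-owned parameter object could look like. The API group, kind, and every field here are hypothetical:]

```yaml
apiVersion: multinet.k8s.io/v1alpha1   # hypothetical working-group API
kind: NetworkClaimParameters
metadata:
  name: backend-net-params
  namespace: default
spec:
  network: backend        # reference to the Network object from the KEP
  interfaceType: sriov    # vendor-neutral parameters every driver must support
  bandwidth: 100G
```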
B: I think that mirrors what you are trying to do with an in-tree API, something that all drivers somehow need to support, but it could be your CRD, so you wouldn't have to do it in-tree. You would have more flexibility with this approach by being out of tree, at least in the experimental phase. If you feel that at some point these parameters need to be in-tree, nothing is going to stop you from writing a KEP, asking the API reviewers to approve a new API group, and making that an in-tree type.
B: The advantage of that is that it actually would be rolled out using the normal Kubernetes deployment mechanisms. It's currently still fairly problematic to have core APIs defined by CRDs, because the upgrade path is unclear and the deployment and installation of the CRDs is unclear. It's an unsolved problem that SIG Storage also has, because the snapshot API that they define is a CRD, and even after several years they still don't know exactly how to convince everyone who installs a Kubernetes cluster to keep that CRD updated.
B: The other part that I want to call out is that this could basically be a common API, supported and implemented by different hardware vendors, and the resource class mechanism could be used to select the actual implementation for that abstract API. And you could define that there must be, in each cluster,
B: a resource class... I think I forgot... well, the name could be standardized too, and then the contract between the hardware vendor and the user would be that a resource class with multinet.k8s.io's name must support these parameters, must do something useful, must do something specific, but that is defined by your standard, by the working group.
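[A sketch of that contract: a resource class with a standardized name, bound in each cluster to whichever vendor driver implements it; the driver name is a placeholder:]

```yaml
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClass
metadata:
  name: multinet.k8s.io              # standardized name, part of the contract
driverName: dra.vendor.example.com   # cluster-specific implementation
```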
B: So one idea, and it's really not specified anywhere, this usage field is new, it's really just an idea, a proposal that I haven't even written up anywhere. I did create a pull request for some experimental changes; at the bottom of the slide is something that I started discussing with folks, and it has gathered one or two comments.
B: Right now, kubelet does this node-prepare call when it sees a container that references the resource claim. For pod-level resources we could change that, so that if there is any such resource claim, even if it's not referenced individually by containers, it does the preparation, and it could handle additional data, something that the plugin wants to pass into CNI somehow. All of that could go into the resource handle, or it could be part of the gRPC interface.
B: The big open that I see is the question of how you could then deal with this and how you could build additional APIs or additional behavior in core Kubernetes around this, and I don't have a proposal for that, to be honest. So we could standardize on certain fields in the resource claim status.
B: So, for example, this data field here that I'm showing on the right-hand side, in the resource handle: that's part of the API, and it would be possible to have some specification that says this data field in an allocated network resource must be in a certain format, and that could be something that could be read and interpreted by other components. But I'm really thinking aloud here; I don't know whether that is suitable, but yeah, that's it.
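[Continuing that thinking-aloud in YAML terms: after allocation, the claim status carries an opaque per-driver string, and a network specification could standardize its format. The shape inside the data field below is entirely hypothetical:]

```yaml
# Fragment of a ResourceClaim after allocation (v1alpha2 shape):
status:
  allocation:
    resourceHandles:
    - driverName: dra.vendor.example.com   # placeholder
      data: |    # opaque string; a standard could require a JSON shape like:
        {
          "network": "backend",
          "interface": "net1",
          "ips": ["10.1.2.3/24"]
        }
```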
A: We need that kind of level of integration for that, because... and this is another one... I'm going to start poking holes, sorry Patrick.
A: It's quite a good idea where you had this network CRD. I agree, that's nice, because then it's flexible, we are not in core, and our development isn't stifled. But how do I deal with services? What do I reference? They will not allow me, in a service, to reference that CRD.
A: So that's my first problem, and I don't think we want to put resource claims into Services. Now, what do I do with network policies? The same story. And those are just the first two; then I have Ingress, Gateway API; I have at least those sorts of things at the top of my head. Pod is just the first element that we are currently tackling, because that's the core piece,
A: but look at our list of requirements, not just the first phase, right? We need to integrate with the rest of core Kubernetes. Yes, resource claims is a nice place for implementing the thing, and I assume that what you're saying about the usage, the CNI part, could be an optional thing... maybe not even fully optional, because in our current approach
A: it's flexible in terms of how far I want to reuse CNI and how far I want to push into the CRI, and it has to be aware of multi-networking, versus doing it through some controller.
B: This would be, yeah, a separate KEP, but basically it says: okay, I'm using DRA as a starting point, and once I hit kubelet, this is what we do on top of it. That would be additional interaction between kubelet and some other daemon. It could be another gRPC interface that needs to be invoked by kubelet, where you pass information back and forth, and you could define those pod status fields that you mentioned. I think that extends what starts with DRA.
D: One thing I'd like to say, the thing I like about this, that feels really good: I think when we are thinking that a pod is going to be attached to some networks, there's some stuff in there that is just generic. It's just going to be in a network, and the network will just return an IP address, and the implementation just deals with it. But I think we are going to have needs where we have to pass more complicated configuration through, and this is an existing mechanism
D: we've got where we could do that. So you don't just say: I've got this network, and the network field is going to give you whatever you need in order to do services; in addition, here is some random bit of plumbing information that that network needs. And the fact that DRA already exists and we can just piggyback on it sounds perfect. I mean, I haven't dug into lots of the details here, and I may be missing something.
B: Yeah, I want to point out one thing that I missed earlier. You were concerned about scheduling performance: if all that you need is a mechanism to pass additional parameters, and you don't really need to influence the scheduling, then we could also extend the resource class and have a field there that says this particular resource class doesn't need a real allocation.
B: We could just handle the allocation in the scheduler and basically say: okay, we automatically allocate, we know that this is going to work, we don't need to piggyback through the resource driver controller, unless you want to. That should speed up the scheduling quite a bit, and depending on how your implementation works, that may be a valuable optimization. Anyway, I'm rambling. There are some aspects that need to be considered, like what happens, for example, if the parameter object gets deleted, because it is separate.
B: That is something that the user could do while the resource claim is allocated. So we do have a concept of copying relevant values out of the parameter object into the allocation status, so that if, later on, the user deletes the parameter object, the resource claim is still usable. But anyway, I think there are things that we can change, because DRA, as I said, is alpha. We can still make those changes fairly easily, if we discuss them with you guys now.
A: Coming back to the parameters idea, and to what you're saying: keep in mind, I would imagine the network here is the network that we talk about in our KEP today, right? So the parameters will be referencing that, this Network object that we are seeing here.
A: You say it is network parameters, but I need something that I can reference from the other objects, as I mentioned, because the core issue isn't going away: how am I going to deal with services? That's the core thing here; we need to make this integrate with Kubernetes. So what do I reference in services? Let's say I can reference the CRD, because my KEP is going to introduce that CRD of a Network, which then your resource class or something can reference. That's fine!
A: They just need to point to which network they apply to. So Network would be the object that we keep discussing today, and then that one would reference the parameters itself. So, Pete, that doesn't change, and you still have another object that you have to reference for the network, because you have to have some common element that you can reference across other objects. I think we keep forgetting about that one, which is the core piece of this whole thing.
A: We already have a multi-networking referencing capability from Multus, and it already does that and it already works; we have that in production across the board, and it works. If we just settled on that, this would just be a replacement for Multus: instead of using an annotation, do this. But that's not the point of the whole effort; the point is for this to integrate fully into core Kubernetes, to have it across the board with all the other objects.
E: You just reference something, right? Correct me if I'm wrong, but when you create the resource claim, you reference a parameter with just any kind of object that Kubernetes is aware of. So you...
A: Okay, so now keep in mind... let me take an example; maybe I am, and I apologize, imposing the way we are thinking about services. But what we are thinking about for services is to have an ability... so, if you think about a service, and I'm saying a Kubernetes service here...
A: IPs, all of those, right. You cannot have a service across multiple interfaces; that's impossible, right? You cannot have the same IP across multiple interfaces that belong to different networks; that's something you naturally can't do. So a service has to be bound to a specific network and be advertised and load-balanced on a specific network.
A: So this is... I see what you're saying, but then, is it feasible? Maybe let's boil it down to, like, from...
A: So this is where, and this is just one construct; the other thing would be network policies, where you probably have even more ideas. And this is where, conceptually, network policies, putting aside the implementation, I know how you integrate multiple CNIs, let's put that aside, let's assume we can synchronize everything... conceptually, at a high level, for network policies themselves today, I would agree.
A: There is something that is more flexible here, because this is just blocking things, right? So I should be able to: if I don't specify any network, I apply the policy to a pod, and then every interface should, let's say, block these specific networks. But then maybe I want, as an ability, to apply a network policy just on that one network; I don't care what happens on the other ones, those are internal and should just be unblocked.
A: But this one specific interface goes into the internet or something, and for that one I want to apply my network policy. I should be able to select a specific interface with a network policy, and it should be optional in this case, but I should be able to do that. And the same for services. And I don't want to go into, I apologize, specific implementations of those, but conceptually...
A: These concepts predate multi-networking, and that was sufficient when you had only one interface in the pod; that was perfectly fine, it worked. But if you're going with multiple networks into the pod, and now you're fully aware of them through Kubernetes core APIs, I should additionally be able to take any of the concepts that relate to networking and individually select one of those networks. And that boils down to: how do I do that?
A: I cannot do it with resource claims, because then I am going to put the resource names everywhere in all those objects, and that kind of doesn't fit; it doesn't quite translate, at least in my opinion. So that's why the effort of this group is to introduce something core. I saw some of the minutes from last week, where Prashant was saying that we want to make this Network a core API, so that it can be referenced everywhere else, not only in the pod, and I do see...
A: That's how I would see it, but I wouldn't replace the whole initial concept of having a central reference for the network with this, with just the QoS kind of advantage of this whole thing. Pete, I'm not sure... yeah, I would like you to read more on this, and I'm not sure whether... because you see this one as a good approach for the whole networking, but I don't think we can, in the underlying implementation.
E: Yeah, you need to think of DRA as a mechanism that you can use, and currently you can still define your APIs in the core. This is what I'm trying to say; it's not contradicting it. I think the point is to use DRA for networking, because you will end up in situations where you want sophisticated resources for the networking, and then you will end up requesting them from DRA, and you need to integrate that into your networking API anyway. So I think that's the discussion here; keep that in mind.
E: DRA is a mechanism for you, okay, to use. Now, you can define the API... I also agree that the API needs to be in the core, this is my personal opinion, the objects themselves, right? But it doesn't mean that, when you request a network, you cannot do it with a resource claim and say that this is the network that you want.
E: And this is the only problem it solves for you: the QoS, and requesting sophisticated resources like SR-IOV and FPGA and other stuff. Yes, it will all be bundled together, and it will be very easy to use, because even today you request a network and you need to specify the resources, you know, with sophisticated hardware.
A: And maybe that's the gist here, right? Maybe we need to keep the way we reference it in the pod and basically treat it as a resource claim for a network, but then still have the pod Network object itself, so that the API piece is satisfied as well.
C: Well, right, this would be more the characteristics that you would specify, or the need: I need this resource. But I mean, not the generic setup of the network abstraction, so to speak. If you need, like I said, SR-IOV, or if you need a 100-gig interface, well, specify it as a resource claim and let the controller handle it.
A: No, I would imagine this as your resource claim, so...
A: Point taken, but let me just finish: you then point to a pod network, which uses and references a... okay, I see what you're saying, because...
C: I assume that everything is in core, but then I have a pod that wants to have a... let's go even higher: I want the 400-gig port for this pod on this network, and I need to have a 400-gig port. How do you specify this? That specific need goes in as a resource claim, and sort of the scheduler will do the mixing and matching to find you the card that can give...
A: ...you that, right. But then you're saying that would be a separate thing, or in addition to just saying I want to connect to this network, and...
E: And the nice thing here, yes, like Patrick said, is that we can change DRA so that in the neutral case, which in most cases is, you know, just a veth, just an IP address, you don't need anything in scheduling; nothing is going to change, and we can modify DRA to not involve any, you know, consulting with the controller and stuff like that. But once, on the network, you want SR-IOV, or you want more sophisticated stuff,
E: then, you know, you can also do the controller stuff, because it's already built into DRA. So it will simplify future problems for us that we're probably going to tackle, yeah. So...
A: In Zoom participants, right-click on my name and you can just make me host or something like that, and then I think we're going to call it here.
A: Thank you, Patrick. Patrick, additionally, do you have a link to this presentation that you can share?
A: I think let's digest what we discussed today and continue the discussion; let's try to make a decision next week about how we want to deal with DRA in this effort. Okay, we are past time, so I'm calling it for today. See you everyone next week.