Description
Tim Hockin presents a webinar where he looks at different models for integrating Kubernetes into your network in both single-cluster and multi-cluster environments. He looks at IPs, gateway configurations, and how to navigate security boundaries, describing pros and cons of each solution, so that developers can make the best choice for their particular environment.
Presenter:
Tim Hockin, Principal Software Engineer @Google
Moderator: Okay, let's get started. Thank you, everyone, for joining us today, and welcome to today's CNCF webinar, "Kubernetes Network Models: Why Is This So Dang Hard?" I'm Jerry Fallon and I'll be moderating today's webinar. I'd like to welcome our presenter today, Tim Hockin, Principal Software Engineer at Google. Just a few housekeeping items before we get started: during the webinar, you are not able to talk as an attendee. There is a Q&A box at the bottom of your screen; please feel free to drop your questions in there and we'll get to as many as we can at the end. This is an official webinar of the CNCF and, as such, is subject to the CNCF code of conduct. Please do not add anything to the chat or questions that would be in violation of the code of conduct. In short, please be respectful of your fellow participants and presenters.
Tim Hockin: Thanks so much for the introduction. I am excited to be here today. I am talking about one of my favorite topics, which is networking and Kubernetes.
This is a problem that has plagued a lot of people for a long time, and people, five years after the introduction of Kubernetes, are still wrestling with the model of how it works and how to integrate it. A few weeks ago I prepared some slides just for reading, and they got tens of thousands of views. I thought that was really interesting, so I figured I would present them here today.
So before I jump into that, let me introduce myself. My name is Tim. I work at Google on Kubernetes and GKE and related projects. I was part of the original Kubernetes team; I've been working on it since before it was open sourced. As tends to happen with projects like that, somebody had to do the networking, and I drew the short straw. So I mostly pay attention to things like networking, storage, nodes, multi-cluster, the lower-level topics within the system, but today we're going to be focusing on the network model.
So what do I mean by the network model? Let's go back to basics a little bit. When you have a Kubernetes cluster, it is a bunch of machines. We tend to think of it in terms of virtual machines, but the truth is it doesn't really matter; I don't care if they're virtual or physical. Those machines are plugged into some network, right? This makes sense; this shouldn't be surprising to anybody.
This is standard machine stuff. Those machines we call nodes, and we run our workloads, pods, on those nodes. Kubernetes is interesting because pods get IP addresses. The question that we have here is: how do pods get those IP addresses, and how do they integrate with the rest of the network?
There are basically two rules for Kubernetes integration. One: all pods on a given node can communicate with all other pods on all nodes without NAT. Don't worry if that doesn't resonate for you yet; we'll walk through it. And rule two: agents on a node, for example system daemons, can communicate with all the pods on that node. That's so that we can do things like health checks and system management agents. I'm going to focus less on that today.
Now, in this case, I'm carving off a /16, which is 64 thousand, 64k, IP addresses. It's important to note that it is not required that a cluster be a single IP range, and in fact we're seeing more and more users today who are not doing it this way. But it's very common, and historically that is how people have done it, and it makes my pictures a whole lot easier.
Here's the fun part: each node also gets a carve-out from the cluster's space for the pods that will run on it. In this example, I've shown that two nodes are each taking a /24, which is 256 IP addresses, which is a lot. And again, you don't have to give every node a /24, but it's easy for the pictures and for people to think about in whole octets, so we're going to run with this for this example, as before.
It's not required that nodes have predefined IP ranges, but it is typically how it is done, and it's going to make my drawings a lot easier, so we're going to assume that every node has a pre-allocated space. This does put a bound on the number of pods that you can run concurrently on a given node: if every pod gets an IP address, a node is only going to be able to run 256 pods.
The good news is that's a lot. So when those pods run, each of them gets an IP address, and again, I mentioned they get it from the node's IP range. Not always; there are some implementations that get IP addresses via more dynamic mechanisms, but this is how it's usually done and how we usually represent it. So again, each pod gets an IP address, and you can see in this case that the IP address for each pod comes from the range allocated to its node.
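Here is a minimal sketch, in Python, of the carve-out just described: a cluster-wide /16 split into per-node /24 ranges, with pod IPs drawn from each node's range. The 10.0.0.0/16 prefix is an illustrative value, not anything Kubernetes mandates.

```python
import ipaddress

cluster_cidr = ipaddress.ip_network("10.0.0.0/16")  # 65,536 addresses for the cluster
node_ranges = cluster_cidr.subnets(new_prefix=24)   # 256 x /24, 256 addresses each

node1_range = next(node_ranges)  # 10.0.0.0/24
node2_range = next(node_ranges)  # 10.0.1.0/24

# Each pod on a node takes a free address from that node's range.
pods_on_node1 = list(node1_range.hosts())[:3]
print(node1_range, pods_on_node1)  # the node's /24 and its first three pod IPs
```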
In short, we put up a wall around the cluster and we pretend that things outside the cluster don't really exist, or rather, that they're not part of the fundamental networking model. So if I have a client outside of my cluster and I want to talk to a pod or a service inside my cluster, can I do it? How do I do it? What if I have a pod in my cluster and I want to reach some other service outside my cluster, say an auth server or a database?
So let's start with what I call fully integrated mode, or flat mode. I'm terrible at naming things, so I apologize in advance; they're just going to get worse from here. In flat mode, you can see that each node, like I showed before, has its own IP address, but it also gets an IP range from the network.
Everyone who's on that broader network, so those clients and those servers, the things that I've labeled "other" here, they know how to deal with the fact that each node has more than one IP address, and they understand that. It might be that the routing is configured statically in the network devices, the top-of-rack switches, or even on each node; or you might use BGP for advertisement; or it might be configured deep in the fabric, like in a cloud environment.
The point is, people don't have to think about where those addresses live; they can just talk to those IP addresses. Now, I called it flat mode, and anybody who's a networking expert is immediately raising their hand: this might be L2-flat, it might be L3-flat. That is a different discussion with its own trade-offs, and I don't really want to get into it today. I don't want to talk about CNI drivers today either; really, they're somewhat orthogonal to the discussion.
The main point here is that pods and everything else share an IP space. This makes communications really easy to reason about: everything's on the same network, and there are no overlapping IP addresses.
This gives you full integration. This is really great when you have a lot of IP space; Kubernetes loves IPs, and we've shown here that we've carved off a /16, right? That is a lot of IP space. It's good when the network is very programmable or dynamic, when you have control over the routing. It's great when you need high integration and performance: there are no translations and no proxies in between the clients and the servers, regardless of where they are. It's also really good when Kubernetes makes up a large part of your footprint, when you can afford the IP space and you can justify it because you're spending a lot of your budget on Kubernetes. It's not so good when you have IP fragmentation or scarcity of IP addresses; you can read that as brownfields. Lots of established companies have already taken the 10/8 space, carved it up, and routed it around their infrastructure.
It's also not really good when your network infrastructure is very difficult to configure or not very dynamic, or if Kubernetes is just a tiny piece of your overall footprint; then this may not be the best model for you. At the complete other end of the spectrum, we have a model I call fully isolated, or air-gapped, mode.
In this model, the networks are completely disconnected; there is no connectivity in or out. There's no such thing as a client outside of your cluster reaching into your cluster; it just doesn't happen. Within your cluster you can still maintain the Kubernetes requirements, but between clusters, or to other things, connectivity doesn't exist.
In fact, this is an interesting model because you can reuse all of the IP space: since everything is completely disconnected, there's no reason you can't allocate the same IP addresses in every cluster. They're basically on different networks. So you can see in this model there is no connectivity; they are, as the expression goes, air-gapped. There is nothing connecting them.
This is really good when you don't need any integration, when you have a workload that needs to run in isolation. It applies when IP space is scarce or when the network is not programmable. One of the reasons people look at this model is because they care a lot about security, and this makes it much easier to reason about security boundaries.
Some people have claimed that this is the only way, or that this is the default way, for Kubernetes; I've heard people say, quote, "you have to choose which overlay you're going to use." Those sorts of statements are false; obviously, we've looked at two different models already. That said, this is a very, very common model.
This can be implemented via overlays; it's very common to use things like VXLAN, and there are a lot of products out there that will help you set up VXLAN overlays. But it doesn't have to be: it can also be set up via private routing rules. So another common approach is not to use an overlay encapsulation but to use straight IP addressing with private BGP advertisement, so the routes are limited and only the nodes in the cluster know about each other.
In fact, in this model you can reuse pod IPs in each cluster, which is a major motivation for it. Depending on what "gateway" means for you, you may or may not be able to reuse node IP addresses, but node IPs are very low-count compared to pod IPs, right? In this example, we're looking at 256 times more IP addresses for pods. So here you can see I can use the same pod range for every cluster, because pods in one cluster never directly talk to pods in another.
One of the things that's happening in upstream Kubernetes right now is making it more possible to run your clusters with less monolithic IP chunks, which will take away some of the value of this model, which I think is a good thing.
It's not so good when you have to debug connectivity. When you're bringing traffic in through a gateway, the traffic necessarily gets more complicated: there's going to be some form of translation, and you need to figure out what is happening there. It becomes much harder to reason about what exactly is happening. It's also not very good if you need direct-to-endpoint communications; there are some systems out there that assume clients can talk directly to their endpoints.
They don't use things like load balancers, and if you need those direct communications, then this gateway model starts to break down. It's also problematic if you need a lot of services exposed: the gateways can be complicated, or sometimes expensive, especially if you're not using HTTP, where you can reuse IP addresses. If you're using layer-4 services, TCP or UDP but not HTTP, you can end up requiring a lot of infrastructure to bring traffic in and out.
Those gateways become more complicated, because you have a limited number of them. It can also be problematic if you have a large number of nodes, because things like overlays and route advertisement tend to scale poorly, and you need more and more infrastructure as you get larger.
So I seem to have painted that model in a bad light, but it is a very common model, so I want to dig more into what "gateway" really means. Here's the simplest form of gateway: you use your nodes as the gateways. In this model, all the traffic into or out of the cluster is handled explicitly by the nodes in the cluster; there are no extra load balancers or routers or anything else. The traffic comes through the node.
So, like I said, the nodes have one leg in the main network, which would be their node IP address here, the 10.240 network, and they have another interface in the pod network, which is the 10.0 network.
In this case, I'm going to take this apart even further and talk about how exactly ingress and egress work here. Let's talk about ingress first. Service NodePorts are something anybody who's played with Kubernetes has probably run up against and tried to figure out if they can make do what they want, and it's a confusing mechanism, so I'm going to dig into it a little bit further.
In the case of a service NodePort, you have a client here, 128.1.1, and it's sending a packet. That packet, as it transits, is targeted at a node IP address; remember, the client only knows node IP addresses, not pod IP addresses. Hopefully you have a service that is load balanced, or is only a single instance, because I'm sending it to a port on the node, and it doesn't really matter which node, because node ports exist on all of the nodes.
So you need some mechanism to figure out which of those nodes you're going to send it to, which is a slightly different conversation. But let's assume you have DNS or something, and it's going to tell you that the service you're looking for is available on these nodes at this port. And the port in a service NodePort isn't arbitrary: you don't get to choose port 80. Kubernetes generally manages those ports for you, and it uses a high port range to avoid conflicts, so you'll see ports in the 30000 range.
When the packet arrives at the node, the node is going to use the destination port of the IP packet to figure out which service you're talking to. So if you're talking to port 30001, you might be going to service foo, and if you're talking to port 30002, you're going to service bar. It's going to do a destination network address translation, or DNAT, and it's going to pick one of the backends that it understands as being part of service foo or service bar, depending on which port the packet came in on.
And it's going to rewrite that IP packet. The most common implementation is iptables, but we also have implementations now in IPVS and nftables and eBPF, which are all able to perform the same basic logic.
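As a toy illustration (not kube-proxy itself, just the idea), here is the NodePort DNAT step sketched in Python: map the destination port to a service, pick one backend pod, and rewrite the destination. All ports and addresses here are made up for the example.

```python
import random

node_ports = {30001: "foo", 30002: "bar"}  # NodePort -> service name
backends = {                               # service -> pod endpoints
    "foo": [("10.0.0.2", 8080), ("10.0.1.2", 8080)],
    "bar": [("10.0.1.3", 9090)],
}

def dnat(packet):
    """Rewrite the destination of a packet dict, the way DNAT would."""
    service = node_ports[packet["dst"][1]]            # which service? by dst port
    packet["dst"] = random.choice(backends[service])  # pick a backend pod
    return packet

pkt = {"src": ("128.1.1.1", 53012), "dst": ("10.240.0.2", 30001)}
print(dnat(pkt))  # dst is now one of service foo's pod endpoints
```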
We're going to forward that packet, and Linux is actually quite good at this, so it'll forward the packet on to the destination pod that we wanted it to arrive at. Note here that the client didn't choose which pod it's talking to; it just talked to the port. And I'm sorry, I'm seeing a chat come in there.
The packet arrives at the destination pod. Now, one of the common things that people do is ingress traffic into an L7 proxy. What does that mean? They'll run something like Envoy or NGINX or HAProxy inside their cluster, they'll route all the traffic through a node port into that, and they'll use it to do further forwarding. Things like HTTP are wonderful because they have headers that you can use to tell what you actually meant: what host were you actually looking for?
This is, for example, the in-cluster ingress controllers that many people run. So again, going back to this diagram, you can see that once the packet has arrived at that pod, I can do anything I want with it. On the reverse path, the host is going to do the opposite of that DNAT translation and convert the packet back, so when your original client gets the response, it seems to come from the node's node port that it was talking to originally.
So the client is happy and the pods are happy and everybody's happy, except if you cared about some of the finer details, which we'll get into in a bit. Now, looking at the egress side: if I want to send traffic from my cluster to something else, we often call this SNAT, source NAT. The typical model here is to use what Linux calls IP masquerade.
What IP masquerade does is say that traffic leaving this node, this machine, is going to look like it came from this machine. So when I've got my client inside the cluster and I want to send traffic out of my cluster, we send the packet out, and when it reaches the edge of the node, it's going to be translated to the node's IP address, so all the traffic from that node appears to come from that node's IP.
If you were tcpdump-ing it, you would see the source address as node 2 in the first cluster. Now, because we're doing ingress and egress through nodes, when the packet arrives at the destination cluster, what we saw happening with NodePort is also going to happen, and so now you have a packet whose source address and destination address have both been modified; it's got two translations along the way.
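A small sketch of the masquerade idea, assuming illustrative addresses: the source of an egressing packet is rewritten to the node's IP, and a conntrack-style record remembers the mapping so the reply can be reversed.

```python
import itertools

NODE_IP = "10.240.0.2"
ports = itertools.count(40000)  # node-side ports handed out per flow
conntrack = {}                  # (node_ip, port) -> original pod (ip, port)

def snat(packet):
    """Rewrite the source of an egressing packet and record the mapping."""
    new_src = (NODE_IP, next(ports))
    conntrack[new_src] = packet["src"]  # remember how to un-NAT the reply
    packet["src"] = new_src
    return packet

def un_snat(reply):
    """Reverse the translation for a reply heading back to the pod."""
    reply["dst"] = conntrack[reply["dst"]]
    return reply

out = snat({"src": ("10.0.0.2", 53012), "dst": ("192.0.2.10", 443)})
back = un_snat({"src": ("192.0.2.10", 443), "dst": out["src"]})
print(back["dst"])  # ('10.0.0.2', 53012), the original pod address
```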
The network fabric handled it all for you. There's a different model, though, which uses a virtual IP address, and this is used in some cloud providers to provide a slightly different interface: instead of having to know which node and which node port to use, which is kind of a clunky human interface, you address a VIP. It's very similar to NodePort, but instead of using the destination port, we're going to use the destination IP to make those same routing decisions. I'm not going to re-animate the same drawing, but the packet will flow through basically the same path and go through the same sorts of translations; the difference is that instead of a human having to address a port in the 30000 range, they can address a virtual IP address. This is more compatible with things like typical DNS and open source software, which isn't always easily configured for arbitrary ports.
With the egress path, you still need something like IP masquerade, SNAT, to egress.
Otherwise, it's behaviorally very similar. Again, the packet will arrive at the proxy, the proxy will choose which backend it's going to go to, and it will forward the packet. One of the interesting things about the proxy model is that it can either route to a node port, which is how Kubernetes started with support for, for example, Amazon ELBs, or it can route directly to pod IP addresses, if that proxy is smart enough to know how to get onto the island.
The problem here, and in many of these models, is that the mechanism will obscure the client IP. In this case, because the proxy terminated the original session and opened a new session, traffic looks like it's coming from that proxy's IP address. So if you have a server that needs to understand the client IP address, you have to pass it through via some other mechanism.
Again, some proxies, like ELB, support things like the PROXY protocol, which will include a header on a TCP stream. If you're using HTTP proxies, then you've got ways of encoding it in HTTP headers. If you don't have those things, you have a much more difficult time getting the true client IP address. And like the VIP model, you still need something like SNAT in order to be able to leave the cluster. Now, I could probably talk for another hour just about ingress.
In fact, I have another slide deck, which I tried to merge in here and then realized I was way over time, that goes into more detail about how these models work, the various trade-offs of them, and the actual details of how Kubernetes implements them. But I'll have to do that at another webinar.
So we looked at island mode, and one of the variants of island mode that we're seeing in more usage now is what I call archipelago mode, which is roughly bigger islands, or groups of islands.
Within the archipelago, the model is effectively the flat model, which means that your multiple clusters can talk to each other without translation; you have a flat space and you carve off a lot of IPs within that archipelago. But when you integrate it with the rest of the network, it becomes island mode again. Like island mode, this can be implemented as an overlay or not, and it still needs gateways to come in and out; again, these could be gateways using the nodes.
Well, sorry, backing up: you can't reuse IP addresses between clusters, but you can between archipelagos. So we've seen, for example, customers who set up an archipelago per cloud region or per data center, and they can economize on IP space between those but still have high levels of connectivity within a particular region.
Like island mode itself, if you need to debug connectivity between clusters, this can get complicated. It can be complicated if you need those direct-to-endpoint communications from outside of the archipelago, or if you need to expose a lot of services to the non-Kubernetes environment.
So if you have large numbers of services, it can be a problem, like plain island mode. If you're relying on client IPs for firewalls, it can be problematic. And if you have large numbers of nodes across all of your clusters, now your scale limit is the number of nodes in the archipelago instead of the number of nodes in each cluster. So you can see there's a real trade-off to be made in this model.
The gateway options are very similar to plain island mode, so I won't go back into them again, but I will look for an opportunity to present more on ingress and the ingress modes later, and I'm hoping that we will soon have the ability to talk more about egress gateways and egress modes.
So, of course, you want to know: which one should I use? And the sad truth is, there is no right answer. Unfortunately, I don't know your environment, and in the abstract there's no way that I can tell you which one is best. They have real trade-offs, and I've tried to elucidate some of those trade-offs here, but you have to make some value decisions yourself.
If I could, I would wave my magic wand and give everybody IPv6, and then flat mode would not be such a big deal, but that isn't possible yet, and there are other considerations beyond just IP efficiency. So I'm happy to talk with people about which modes they think make sense for them, but there isn't a magical answer. Now, we've got plenty of time for questions, so I'd like to switch to those, because I think this is such a nuanced topic that the details will be fun to talk about.
Moderator: Okay, Tim, thank you very much for the presentation. We have plenty of time for questions, as Tim said, so if anyone has anything they'd like to ask, please drop it into the Q&A box and we'll get to as many as we can before the end of the session.
Tim Hockin: I believe they're recording; yes, the question is, will there be a recording? Sure, there will be a shared recording of this. I apologize for the audio. I see in the chat a question about ways to look over the slides: yes, I have posted the slides already on my Speaker Deck, so if you go to speakerdeck.com, t-h-o-c-k-i-n, you can find the slides there, and I imagine the CNCF will share them as well.
Question: can you explain island mode? Sure, so let's flip back to island mode, since this is the most common model that I see for larger enterprises.
All right, so plain old island mode. In this model, let me just pick on one particular implementation: if you look at the two different clusters, you can set up a VXLAN overlay, for example. Again, this is not the only way to do it.
Just one example. With a VXLAN overlay, each of the nodes in that cluster has an agent running that understands how to route VXLAN packets. They might get that information by gossip, by talking to each other, or they might get it through the Kubernetes API or through some other configuration mechanism, but they all know how to route to each other.
So node 1 in the top cluster knows that if it needs to talk to pod C, it's going to encapsulate the packet and forward it on to node 2, right? But it's only between the nodes within that cluster that that information is shared. So if pod A in the top cluster wants to talk to pod A in the bottom cluster, it doesn't know how to get there, and that's why it's called an island: because there isn't a bridge between these two things.
They're not connected to each other, and that's why I drew the edge of the cluster in a darker shade, to emphasize that it's kind of a barrier. Within the cluster, they all know how to reach each other, but outside the cluster it doesn't work. Now, I picked on VXLAN, but there are other models you can use, for example BGP, to share routing information between those nodes.
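To make the "island" boundary concrete, here is a toy model of the per-cluster route sharing described above: every node in a cluster knows which node hosts each pod range and can encapsulate traffic accordingly, but it knows nothing about another cluster's pods. All addresses are illustrative.

```python
import ipaddress

# node IP -> pod range it hosts; shared only *within* a cluster
# (via gossip, the Kubernetes API, BGP, etc.)
cluster1_routes = {"10.240.0.1": "10.0.0.0/24", "10.240.0.2": "10.0.1.0/24"}

def encapsulate(dst_pod_ip, routes):
    """Find the node that owns dst_pod_ip; it becomes the outer destination."""
    for node_ip, pod_range in routes.items():
        if ipaddress.ip_address(dst_pod_ip) in ipaddress.ip_network(pod_range):
            return {"outer_dst": node_ip, "inner_dst": dst_pod_ip}
    raise LookupError(f"no route to {dst_pod_ip}: it is on another island")

print(encapsulate("10.0.1.7", cluster1_routes))  # goes via node 10.240.0.2
# encapsulate("10.0.1.7", {})  # another cluster's pod: raises LookupError
```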
Let me go through the questions; I can't find the follow-up. So: NAT scalability issues. Yeah, NAT scalability is a fun question.
When you're doing all these translations, the kernel has to keep track of which translations it's doing, so there's a facility called conntrack, connection tracking, which the Linux kernel offers and which we use heavily in the default implementations. Now, there are many implementations; services are an abstraction. Connection tracking is what we set up by default.
When you run kube-proxy, we set up a large number of connection tracking records; we make a lot of space for the kernel to be able to track connections. This is generally fine for most users, and for TCP in particular it tends not to be a huge problem, because TCP is connection-oriented, and when the connection closes, we can immediately clean up the connection tracking.
If you use a lot of UDP services, it can be problematic. UDP doesn't have a connection, so we have to time out UDP conntrack records. What we have seen occasionally is customers who have a high number of UDP-based services, with connection tracking records that are just sitting around waiting to be timed out, and for some of those customers we've given them some flags.
You can set those on kube-proxy to change the scaling for how many connection tracking records you create. Some other models, like Cilium as a replacement for kube-proxy, choose, I believe, not to use the kernel's connection tracking mechanism; it uses its own connection tracking.
The node is going to look at the destination; that's the first decision: which service did you need to talk to? And given that, it's going to use (I'll talk about the model I'm most familiar with, which is iptables) logic that says: okay, I've determined that you are heading for service foo.
Having determined that you're aiming for service foo, it will then figure out: okay, service foo has some number of pods behind it, and it's going to try to choose a pod for you to route to. I won't get into it very deeply.
Sometimes it's going to actually route it to another node, which adds some complexity to the logic and makes it that much more difficult for humans to reason about, because conntrack, or a tcpdump for example, will show multiple connections with multiple address translations happening across multiple nodes. This is the part that's really unfortunate about that model.
Kubernetes has introduced some parameters, like external traffic policy, which allow you to control that a little bit more and limit it to only sending traffic to pods on the same node. Which is great, except if you happen to have routed to a node that doesn't have any backends, in which case it will end up being a black hole.
Something we do in the cloud load balancers is, when we implement the VIP, we have a health check on each node which tells us whether or not each service has a backend on that node. That was a long sentence. Given node 1: does it have backends for foo? If the answer is yes, the VIP will potentially route to node 1. If the answer is no, the VIP will not route to node 1; it will only route to node 2, or whichever nodes actually have backends.
How do you increase the default number of pods from 100 to 254, like you mentioned? I know the IPs need to be there. So, the number of pods you run on a node is governed by two different flags: one is the number of IP addresses you allocate to that node, and the other is how many pods kubelet is willing to run. Kubelet generally sets a limit; I think the default is 110 for the number of pods on a node.
There are some people, some use cases, where you want to raise that. What we don't want to do, what we discourage people from doing, is making the number of pods and the number of IP addresses very close to each other. What happens in that case is you can potentially reuse IP addresses very quickly, and, Kubernetes being a distributed system, there's generally something out there that has cached an IP address, whether it's DNS, which has a time-to-live measured in seconds but is still not instantaneous, or something else.
So we generally say: however many pods you want to run on a node, your IP allocation should be about twice that big. In fact, what we do by default in GKE is, when you tell us how many pods you want to run on a node, we round up to a power of two and then double it. So if you want to run 110 pods, that rounds up to 128, which doubles to 256.
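That sizing rule is easy to compute; here is a small sketch of it (my reading of the rule as described, not GKE's actual code):

```python
def node_pod_ips(max_pods: int) -> int:
    """Round the requested max pods up to a power of two, then double it."""
    power_of_two = 1 << (max_pods - 1).bit_length()  # smallest 2^n >= max_pods
    return power_of_two * 2

print(node_pod_ips(110))  # 256 -> a /24 per node
print(node_pod_ips(32))   # 64  -> a /26 per node
```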
Many users, most users in fact, don't need to run 100 pods per node, and they can actually dial that in the other direction and set it lower. So if they say, "I only need to run 32 pods per node," then we round that up to 32 and we double it to 64, and they get four times as many nodes out of the same IP space.
So if you look at a model like island mode, where you might potentially reuse those IP addresses, there is simply no way that you can have a pod directly speak to another cluster's pod, because it can't tell whether it's talking to its own private version of that IP address or to the other cluster's version of that IP address.
Would IP-in-IP overlays work? Any overlay should work if you can route the traffic into the other cluster deterministically. VXLAN is the one that has sort of risen as very popular right now, because it's a fairly simple overlay, but people are certainly doing it with GRE or other encapsulation mechanisms.
Could you give some idea on node pools: what is the use case for more than one node pool? Sure. Node pools are not an official Kubernetes concept, but they're implemented by many of the Kubernetes providers. A node pool is just a group of machines that have something in common, so I can speak to the way GKE implements it.
Next question: how do you increase the default number of pods? Oh, we did that one. What are some of the challenges and trade-offs you've observed when you correlate these models with low-latency packet processing?
The short answer for low-latency packet processing is: the more things that have to touch the packet, the worse off you're going to be, right? That seems pretty self-evident. If you're bouncing through proxies, you're generally going to have higher latency than if you're not, and so for customers who really focus on and care about low-latency packet processing, we encourage them to go with fully integrated flat mode as much as they can, and sometimes that means having different, mixed models.
Question: are gateways managed by Kubernetes only? No, absolutely not; you can do whatever you want to make those gateways work. Kubernetes happens to have some abstractions that make it relatively easy to set up load balancers in various cloud environments, but Kubernetes is wonderful in this way: there's an API and there's the implementation, and they are distinct, and you can implement your own, again, quote, "cloud provider" via whatever mechanism you want.
If you have some load balancing infrastructure that you like, or some creative traffic routing or network fabric that you can leverage, you are welcome to do your own thing. Kubernetes gives you lots of interesting API mechanics, so you can watch for updates and implement your own controllers, and in fact all of our controllers are open source, so you can embrace and extend the controllers that we've already written to work in your own environments.
How would flannel versus Calico work; which pattern are they? Flannel and Calico are generally island modes, with Calico being a BGP model; well, actually a mixed model now, but primarily it was focused on BGP.
It can integrate with the larger network, but it also can not, depending on where you're going to route those advertisements. Flannel, or Canal, are VXLAN implementations, so those are generally island mode implementations. Do you know any way to simplify this model and remove the dependency on NAT? Do you see service mesh solutions adding value to the network layer in Kubernetes, or the opposite?
B
So
nat
is
one
of
the
implementations
that
we
use
for
cube
proxy
right.
If
you
wanted
to
use
something
like
a
service
mesh
instead
of
a
cube
proxy,
that
is
a
completely
valid
implementation
and
it
wouldn't
need
nat
in
the
same
way
like
it
doesn't
rely
on
the
kernel's
nat,
but
it's
going
to
be
doing
effectively
translation
in
the
proxy,
so
you
still
have
some
amount
of
state
that
is
stored
as
your
routing
pockets
from
around
I'm
a
big
believer
in
funny.
I
was
just
talking
about
this
this
morning
on
twitter.
I'm a big believer in service mesh and the idea that eventually most users will want some of what service mesh offers. Probably most users will not want all of it, but most users will want some subset of it, and so service meshes are becoming more powerful and easier to use.
Virtualization? Sorry, that wasn't the question; that was a follow-up, I think. Okay: do you have references for setting up these different modes, technical configs and tools?
The references are generally in the implementations, so there's documentation on setting up flannel and Canal for island mode, and Calico has lots of great docs for setting up Calico. The default Kubernetes implementation kind of assumes flat mode, but it doesn't assume it very hard; it just doesn't assume anything about networking other than the Kubernetes networking model, which means pods can talk to other pods.
Suggestions on local platforms to explore and experiment with Kubernetes networking options? I am a big fan of kind myself, but kind builds on Docker, and there's only so much you can do with it before you end up running into things that the kernel lets you do which you might not be able to do in a real network. So I personally use kind for testing stuff out when I want to play with, like, L2 space.
I use Google Cloud when I want to play with a non-L2 space. Is kube-proxy based on iptables? Kube-proxy has several modes built into it. One of them is iptables; that was the default, and maybe still is the default and most common mode.
Increasingly, people are using IPVS as the implementation, which is a somewhat optimized path through the kernel, but very similar in internal mechanisms and very different in configuration. There's also a userspace mode, which hardly anybody uses anymore. And there have been proposals to add nftables directly, although I think we're going to see a standalone replacement for kube-proxy that uses nftables; Cilium, for example, uses eBPF. Rather than folding all of that into one giant kube-proxy binary, I've been encouraging people to build their own kube-proxy replacements.
B
In
most
implementations,
you
can
use
the
nodes
ip
and
node
port,
but
I'm
not
sure
why
you
would
want
to.
Unless
you
have
some
weird
consistency
requirement.
Node
ports,
very
frankly
were
designed
to
build
higher
level
load
balancers.
So
when
people
are
using
node
ports
directly,
it's
for
me
it's
a
smell
like
I
feel
like
you're
missing
a
piece.
Headless services? They do not work very well in island mode, because if you have five replicas, you need five IP addresses, or five ports, to address them, and the problem is that the number of replicas can change dynamically, and it's much slower to program load balancing infrastructure than it is to program within the cluster. So in general, headless services are the case I mentioned around direct-to-endpoint connectivity.
B
It
does
not
map
very
well
into
highland
modes
and
I'm
told
I
have
no
minutes
left,
I'm
happy
to
do
more
questions
via
whatever
mechanism
people
can
find
me
I'm
available
on
twitter
on
slack
on
github,
on
whatever
medium,
I
want
to
also
throw
a
quick
shout
out.
We
have
another
ambassador
webinar
coming
september.
25Th
caitlyn
is
doing
and
bowie
we'll
be
talking
about
the
evolution
of
the
ingress
api,
which
we
just
touched
on
barely
today
and
it's
new,
hopefully
replacement
called
gateway.
Moderator: Okay, well, thank you very much, Tim, for a wonderful presentation, and thank you all so much for your participation as well. As I said before, today's presentation and slides will be available on the CNCF website later today. That is all the time we have; thank you very much for joining us today, everyone, and stay safe.