Description
Get the latest on the sidecarless Cilium Service Mesh from Liz Rice and Thomas Graf
00:00 Introduction & headlines
07:53 What is Cilium Service Mesh?
11:52 Cilium Service Mesh moves service mesh into the operating system
15:27 Beta tester comments
18:44 Early performance data
26:52 Security and mutual authentication in Cilium Service Mesh
37:37 Performance improvements with Cilium for authenticated, encrypted traffic
43:43 Cilium Service Mesh roadmap
Liz: Hello and welcome to the eBPF and Cilium Office Hours, also known as the eCHO livestream. This is episode 44, and we're going to be talking about what we've been learning so far with the sidecarless service mesh, the Cilium Service Mesh beta program. But before we get to that, let's, as always, talk through some headlines. Do let us know where you're watching from, say hi in the chat. It's always great to hear where you're watching us from and to get your questions and comments.
Liz: I'm really hoping there are going to be lots of questions on the service mesh program, because I know a lot of people have been really interested in what's happening there. But before we get to that, let's look at some headlines. I've got a couple of interesting blog posts that have been written about Cilium. As always, the show notes are on HackMD; you can find them, is it that side, that side?
Liz: That's where the link to the show notes is, so you'll be able to find links to everything that we're sharing today.
Liz: The first of these blog posts that I wanted to share today is this one from RVU. For those of us in the UK, you might know this brand as Uswitch, and they've got a really nice case study here about how they're using Cluster Mesh with Cilium. There's another blog post, this time about debugging a networking problem that was actually happening in Cilium, and the folks here at SuperOrbital did make a contribution upstream to fix this problem. I think it's a really interesting story about how they went about diagnosing that issue.
Liz: It was an intermittent problem, and those kinds of stories are always great to read about, I think. So if you have a story that you would like to tell, perhaps through the Cilium blog, or you've written it elsewhere and you would like us to help promote that story and share it with the rest of the world, do get in touch with us.
Liz: Bill Mulligan has put together this form on the Cilium page where you can ask for help, whether you want some help building a story for a presentation, or perhaps you'd like us to retweet something that you've been working on. You can use that form as an easy way of getting in touch and getting some help from the folks in Cilium. So, lots of people to say hello to. Let's quickly say hello to Quentin, always good to have Quentin here, and Raphael's here.
Liz: We have Fabio, and, have I clicked on next? Joe's here, and Nico, who I met at the AWS Summit earlier this week, so good to see you online as well as in person this week. Jarno's here, and Jarno has been doing lots of work on service mesh, so we will have the support if we get asked a really tricky technical question. Tony, great to see you, hello! And hello to Russell and to Mattia, good to see everyone here.
Liz: Let's move on to look at a few more posts that have caught our eye this week. Quentin collated a lot of these for us, so thanks to him for that. Oh, I think I didn't open the link for this one.
Liz: So this one is OpenTelemetry auto-instrumentation using eBPF. They're able to instrument your Go executables and convert your network requests into OpenTelemetry data, to be displayed in things like Jaeger. This is pretty cool; there's actually a walkthrough here, a getting started guide. I haven't tried it, but if you try it out, let us know how it goes. Good to see this kind of automatic instrumentation; being able to instrument your apps with the power of eBPF is really great, I think.
Liz: This was another slightly unusual bit of news, but I think it's really interesting to see how eBPF is becoming the de facto standard. This is actually from LineageOS, which is an Android operating system. I'm not super familiar with it myself, but it's interesting that Android has removed iptables in favor of eBPF, as we can see here, and they're seeing that as a good thing; it will obviously make a lot of their networking more efficient.
Liz: It does mean that LineageOS have made the decision not to support kernels older than 4.4, I think is the level, and I think we're going to see this a lot: in order to take advantage of eBPF, newer kernels are going to be required. So it's interesting to see that in the mobile world.
Liz: Okay, and oh, the last link I have here is for a really great book that my colleagues Natalia and Jed have written. There's a download link here to get this book about using eBPF for security observability. It'll actually give you a little sneak preview of some functionality that we've been working on internally that we'll be releasing shortly.
Liz: So if you download that book, you'll know what I'm talking about. Okay, we've got a few more people who've joined that I don't think we've already said hello to: hi to Rodrigo, hello to Alejandro from Tokyo, and hello to Yarian from Israel, great to have you all with us. We also have another special guest with us today, and so it's my pleasure to welcome to eCHO my friend and colleague Thomas Graf, who's going to share lots of information today about service mesh. Welcome, Thomas.
A
Yeah,
I
think
this
is
going
to
be
a
really
interesting
edition
of
the
show.
We've
obviously
had
the
civilian
service
mesh
out
in
the
world
in
the
beta
program
for
a
few
months
now
I
think
we've
learned
a
lot,
so
we
want
to
share
that
today.
Give
you
some
highlights
what
we've
learned
and
talk
a
bit
about
what's
coming
next,
which
should
be
really
exciting
so
to
get
started
thomas.
Do
you
want
to
tell
us
what
is
psyllium
service
mesh.
Thomas: So, what is a service mesh? We'll start very broad and high-level first, and then go into what is unique about Cilium Service Mesh. A service mesh essentially is a technology that looks at providing connectivity between services, and then in addition provides observability, security, traffic management, and resilience. This cloud in the middle, the service mesh cloud, is the attempt to provide all of this transparently.
Thomas: So let's look at what that actually means: what kind of observability, what type of security, what type of traffic management. First of all, we want resilient connectivity. We want connectivity that, if something goes wrong, retries, very similar to TCP: we want retransmissions, we want connectivity to resume if there was a temporary glitch. This is often referred to as retries.
Thomas: Then we want Layer 7 traffic management, so we want Layer 7 load balancing. We want to be able to load balance based on HTTP headers, host name, request-based load balancing and so on, and we want this for HTTP, gRPC, and other Layer 7 protocols. Then we want identity-based security. We want mutual authentication: we want to be able to validate the identity of services, allowing services to authenticate each other.
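The Layer 7 routing decisions Thomas describes, such as picking a backend from an HTTP header or sending a percentage of traffic to a canary, can be sketched in a few lines. This is a hypothetical illustration of the idea, not Cilium or Envoy code; the header names, backend names, and canary weight are invented:

```python
import random

def pick_backend(headers: dict, canary_weight: float = 0.1) -> str:
    """Route a request to a backend based on Layer 7 information.

    Mimics what a service mesh proxy does: inspect HTTP headers and
    apply percentage-based (canary) routing.
    """
    # Header-based routing: a specific host goes to a dedicated backend.
    if headers.get("host") == "api.internal":
        return "api-backend"
    # Explicit opt-in to the canary via a request header.
    if headers.get("x-canary") == "true":
        return "canary-backend"
    # Percentage-based routing: send ~canary_weight of traffic to the canary.
    if random.random() < canary_weight:
        return "canary-backend"
    return "stable-backend"
```

In a real mesh this decision happens inside the proxy or the kernel datapath, but the logic, match on L7 fields first and fall back to weighted selection, is the same.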
Thomas: We want observability and tracing. We want to see, at the request level, what requests are being sent between services, and we want to do this not at the network level but again at the Layer 7 level, where we can see API calls, HTTP requests, and HTTP responses. We want to see HTTP return codes, we want to see the latency, we want to see the entire service map, what services are talking to each other, and so on. And then, very importantly, that brings us to the next concept.
Thomas: This is why the sidecar model was introduced, which essentially moved this mesh library, this functionality, into a sidecar. That means we no longer have to use a language-specific mesh library; we can use a sidecar proxy instead and keep our applications unmodified. That's this picture, essentially, where we see that the sidecars are now injected and all the service communication goes through a sidecar as it leaves the application and before it enters the application.
Thomas: On the other side, what we are proposing with Cilium Service Mesh is the following: we want to get rid of the sidecar proxy. We don't necessarily want to get rid of the proxy; we're actually huge fans of Envoy (hello, Matt, you've done a great job creating Envoy). Envoy has been integrated with Cilium for many years, so quite a bit of functionality in Cilium is implemented via Envoy, a combination of eBPF and Envoy.
Thomas: What we don't necessarily like is one sidecar proxy per app or per pod. What we are providing with Cilium Service Mesh is essentially very similar to how namespacing technology and other things that are clearly part of the kernel have built a foundation for containers: we want service mesh to also become part of the operating system and be transparently available, without requiring additional sidecar proxies. This picture, I think, demonstrates this.
Thomas: This means that, from an integration perspective, it will look something like this: you will have a service mesh that sits as part of your network stack, as part of your operating system stack, essentially just above TCP. So we see service mesh as an extension of TCP. TCP is kind of the old-school service mesh: it also has some security functionality, and it has retransmissions, which were groundbreaking when they were introduced. Service meshes add similar functionality, but adapted to microservices technology and to cloud native infrastructure as we need it today.
Thomas: If we go back and look at the difference when we provide similar functionality but directly in eBPF, we can see that from a performance perspective it has a massive benefit. The yellow here is a proxy-based infrastructure that uses sidecar injection to provide L7 visibility.
B
Blue
is
the
baseline.
No
visibility
at
all
and
red
is
an
ebpf
based
http
visibility,
library
that,
for
example,
provides
open,
telemetry
tracing
data,
and
that
is
incredibly
powerful
because
all
of
a
sudden,
we
can
get
this
visibility
data
without
without
without
introducing
a
lot
of
overhead
and
before
we
leave
it
off.
I
wanted
to
show
kind
of
the
other
difference
which
is
the
overhead
overall,
so
this
picture
tries
to
show
the
footprint
that
essentially
happens
when
we're
on
a
car.
Thomas: This is the view of one node, and it's assuming that we have 30 pods running; these are the blue boxes. In a sidecar model, we also have to run 30 sidecar proxies; those are the green boxes. So you can see that at least half the containers running will be sidecars, whereas in a sidecar-free model you don't pay that cost.
Liz: And I think that's something our beta testers have really seen the effect of; they've been really excited about it, and I've actually got a few comments from some of our beta testers. We sent out a survey, and we actually have all the raw results published.
Liz: But I think something that was also striking was that it's not just about the performance and the overhead, it's also about the complexity of management. I think this first comment here really sums it up. Another person is a big fan of Envoy but not hugely fond of the sidecar model and the extra latency and complexity involved.
Liz: This was just a selection of a few of the comments that we had, and you can see these terms of complexity, overhead, and extra resources appearing again and again. So the simplification that we're bringing with Cilium Service Mesh is really resonating with the folks who've been trying it out, which is really great.
Liz: So I guess some of the other things that we learned from our beta testers were really about what features they wanted to prioritize. This is a blog post that I wrote back in January, so some of you may have already seen this, but I'm sure it still holds true.
Liz: Observability really stood out as the thing that people most wanted to get from service mesh. There was not a single respondent, or perhaps there was one respondent, who said that wasn't a need for them at this time, but we can see that there's been a huge requirement for observability, ingress, encryption, and some of those Layer 7 management features: things like rate limiting, retries, circuit breaking, and canary rollouts. They're all important to people, but I think you have to take them in conjunction with the comments.
Liz: Perhaps we should share some of the early data that's been collected. Joe on the Isovalent team has been doing some really great work to create some reproducible tests, because we don't want to just tell you that the performance is really great; we want you to be able to go and recreate those results for yourself and verify and validate that they are correct. I think I have a copy of those results here.
Thomas: So yes, we have compared Cilium against a sidecar model. What we're measuring here is done with reproducible scripts, so you can run them yourself as well, and we measure the request rate, requests per second, and the CPU consumption while doing so. For requests per second, more is strictly better: you want more requests per second at the lowest possible CPU consumption. And then we also measure latency: mean latency, and p95 and p99 latency as well. With latency, lower is better; you want the lowest latency possible, because the higher the latency, the longer your clients wait, and in a chain of services you essentially pay that cost multiple times.
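The latency metrics Thomas lists (mean, p95, p99) can be computed from raw per-request samples like this. A generic sketch, not the actual benchmark scripts; the nearest-rank percentile definition is one common choice among several:

```python
def percentile(samples, p):
    """Return the p-th percentile (0-100) using nearest-rank on sorted data."""
    ordered = sorted(samples)
    # Nearest-rank: index of the smallest value covering p% of the samples.
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def summarize_latency(samples_ms):
    """Summarize request latencies the way benchmark reports usually do."""
    return {
        "mean": sum(samples_ms) / len(samples_ms),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
    }
```

The reason p95/p99 matter more than the mean is exactly the chaining effect he mentions: a request that fans out across several services hits the tail of each hop's distribution, so tail latency dominates end-to-end behavior.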
Thomas: What we're seeing across the board is that Cilium performs significantly better. Here we're looking at requests per second and the latency at different rates of requests per second, and we can see that Cilium is performing better, with almost half the latency compared to a sidecar model.
Thomas: There's a slight difference depending on the different rates of requests per second that we're running, but it's more or less the same type of difference. If we go to the next slide, we can see the throughput itself, so this would matter if you are, for example, streaming.
Thomas: The last slide here is even more interesting. This is the time we've measured to bring up a pod, and this is where there's actually a significant difference between the two models. If we have a sidecar model and we require a sidecar proxy for every pod, we have to inject that sidecar into the pod, and the pod cannot really start until that sidecar is ready and available, whereas if we don't...
Thomas: This simply shows the cost of sidecar injection. This is not even the data path impact; this is just the cost of actually injecting the sidecar and starting it up. It's very, very significant, and you will have to wait quite a while for your pods to become ready. And I think, as we talk about security later on, we also have numbers that we can show comparing mTLS done in an Istio sidecar model to mTLS with Cilium, and how that compares to doing authentication with WireGuard, for example.
Liz: Yeah, just adding a little bit more onto that question of startup time for a pod: if it's going to take about twice as long, that's kind of as expected, because your network traffic is going through roughly twice the length of path, so that totally makes sense. It seems really intuitive.
Thomas: This is something that's often underestimated, and I think a lot of us can relate to it. This problem is called bufferbloat: if you have lots of devices between two endpoints, when you, for example, stream video, all the devices in between will buffer to achieve maximum throughput, and the more of these buffering devices you have, the worse it gets.
Liz: Shall we turn to the question of security? Because I think one comment people had made was that they'd like to understand the security model, and we also know that a lot of people are turning to service mesh because they want encrypted traffic between their pods. So should we talk a little bit first about that kind of east-west traffic and the mutual TLS requirements?
Thomas: Yes, let's do that. The last blog post we shared specifically talked about the sidecarless model, and the big ask based on that was: okay, how are you going to do mutual authentication in that model? What does this mean for secrets? How can we, for example, integrate SPIFFE, and so on? So this is a sneak preview of the blog post that will go live next week, which provides the details on how Cilium Service Mesh and Cilium (this will also be available just for Cilium) will do mutual authentication in that context.
Thomas: If we scroll down a little bit, we can see the basics of what mutual authentication actually is. Probably everybody knows this, so I'll be very quick. Mutual authentication essentially validates a sender and a receiver, and they can both validate and authenticate each other: the receiver knows "I'm receiving from a sender I know and trust", and the sender knows "I'm sending to a receiver I know and trust". And we have a certificate authority on top, which provides and creates certificates and makes sure to build this trust. That's mutual authentication.
Thomas: Mutual authentication is typically done with something like mTLS, but other examples would be IPsec or SSH. There are many, many more common examples where we use it every day. But let's look at TLS. TLS will look something like this.
Thomas: What's important to understand here is that both the handshake and the payload, the data, are sent over the same connection. It's a TCP connection: you have a TLS handshake in front, and then on that same TCP connection you also send the data. This is amazing for the internet, which is essentially a set of unreliable, untrusted networks; TCP connections can traverse everything on the internet, which makes TLS highly compatible, and it's why we use it every day for internet traffic.
Thomas: If we compare that to a different form of how mutual authentication is often done, we can look at IPsec with IKE, which is what was typically done, and often still is done, in enterprise networking. This is where we build an authenticated network below: we use IPsec to, for example, authenticate all the nodes with each other and build a trusted network. When you run Cilium in transparent encryption mode, this is exactly what is being done.
Thomas: All the nodes will receive certificates and keys, and they will authenticate each other, so you could not actually join the Cilium network if you do not have a certificate and a key. You can revoke those certificates, of course, and nodes will only communicate with other nodes that are trusted, and as part of this we get authentication, encryption, integrity, and so on. On the left is the mTLS model, with a sidecar, with a proxy. So those are the two models that are often used and sometimes compared with each other.
Thomas: We want the best of both, so we will combine them together, but before I get ahead of myself, let's look at the pros and cons of both models real quick. For the mTLS model as done today, the big pro is that it uses service-level identity: each service has its own identity and certificate, which leads to better security compared to the node-level authentication that is often done with IPsec, where only a node has an identity and all the workloads on that node rely on that single node identity.
Thomas: TLS has been built for the internet, and it primarily works for TCP and QUIC, where it is built in. It really struggles elsewhere. For example, it is possible with DTLS to make it work for UDP, but it's no longer as simple as for TCP, and for a variety of other network protocols, like multicast, it doesn't work at all, no chance. And it also leaks the entire application topology to the underlying network, which means all the pod IPs are actually visible.
B
Things
that
we
actually
want
to
avoid
the
node
level
authentication
with
ipsec
and
ike
great,
because
it's
completely
transparent.
Obviously
the
con
is
that
we
don't
have
service
level
keys.
Yet
it
is
also
significantly
more
efficient
and
scalable
because
it
does
not
require
these
side.
Car
proxies
right.
That's
not
relevant
to
tls,
specifically.
B
With
with,
if
with
mtls
the
the
big
downside
is
that
when
a
node
gets
compromised
and
you
lose
that
certificate
that
identity,
all
the
workloads
on
that
node
are
essentially
compromised
as
well
from
a
certificate
perspective
right,
you,
one
identity
has
just
a
broader
meaning.
So
how
can
we
actually
get
to
a
model
where
we
combine
both
of
these
into
a
model
that
is
stronger
and
we
benefit
from
the
cons
of
or
the
pros
of
both
sides,
and
this
is
what
psyllium
will
start
doing.
B
So
we
are
separating
the
handshake
portion
and
the
data
portion,
which
means
tls,
is
really
great
to
run
the
handshake.
We
want
to
use
service,
specific
identities
and
keys,
and
we
want
to
use
tls
as
the
handshake
mechanism
to
authenticate
services
with
each
other.
But
then
we
don't
want
to
rely
on
tls
for
the
actual
encapsulation
of
the
data,
because
that's
very
limiting.
Thomas: The picture has two arrows now. We have the mTLS handshake here, which will happen between the Cilium daemons, the Cilium agents running on the nodes, and this will authenticate the workloads with each other. And then we have eBPF, which transparently holds up the connection data until that authentication has happened, and can then use IPsec or WireGuard, or optionally even nothing, to actually do the integrity and encryption of the connection.
Thomas: In this model you can even choose to say "I want authentication, but I don't want encryption", which you cannot choose with a pure mTLS model with the sidecar. This gives a lot of flexibility, and it completely decouples the limitations of TLS from providing a secure connectivity pipeline. In terms of who will be managing keys: if we now step away from the data path, let's look at who will be owning and generating keys.
Thomas: Whether it's SPIFFE, whether it's using cert-manager, Istio, Vault, Linkerd, SMI, whatever it is, we can reuse all of that for identity management and certificate management, but then provide a much more reliable, much more efficient, sidecar-free data path to actually provide the connectivity and the mutual authentication between the services.
Thomas: And this is a preview of what's coming with the Cilium SPIFFE integration. It will be as simple as this: you can write a CiliumNetworkPolicy, our policy language in Cilium, and in this case it's using SPIFFE ID selectors.
B
You
can
essentially
require
that
that
part
certain
services
require
mtls
or
require
mutual
authentication
by
allowing
or
by
doing
an
ingress
rule
like
this
and
allowing
from
the
spiffy
id
and
then
silly
will
automatically
retrieve
the
certificates
from
spiffy
photo
services
and
do
the
mutual
authentication
thereof,
I'm
showing
the
policy
for
one
side
of
the
service.
In
our
case,
both
parts
will
need
to
opt
in
so
both
app
one
and
app.
Two
will
need
this
policy
and
both
need
to
essentially
allow
each
other.
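At the time of this recording the SPIFFE integration was still a preview, so the exact selector syntax was not final. As a rough sketch only, an ingress rule of the kind Thomas describes has the general shape below; the app names and SPIFFE ID are invented, and the SPIFFE selector itself is shown commented out because its final field names were not yet published:

```yaml
# Hypothetical sketch: require that traffic into app1 comes from app2.
# The label-based rule is standard CiliumNetworkPolicy; the SPIFFE
# selector is illustrative only (check the Cilium docs for the final
# syntax once the integration ships).
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: app1-require-mutual-auth
spec:
  endpointSelector:
    matchLabels:
      app: app1
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: app2
      # Illustrative: tie the rule to a SPIFFE identity rather than
      # only to pod labels.
      # spiffeIDs:
      #   - "spiffe://trust.example.org/ns/default/sa/app2"
```

As Thomas notes, a mirror-image policy on app2's side would also be needed, so both teams opt in.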
Thomas: This ensures that if, say, app1 and app2 are owned by different app teams, both app teams have to buy into this model and allow it. Is a proxy still needed? I think that's sometimes a bit of a confusion. Let's say we have this model where we actually don't need a sidecar whatsoever.
Thomas: Even in the case where a proxy is still needed, you already divide your latency by two, because you need only one proxy and not two. In terms of performance, it may actually be surprising how impactful this is. This is comparing requests per second for Istio with mTLS; Cilium with Envoy mTLS; Cilium with WireGuard; and then Cilium with no mutual authentication at all, so no encryption, no integrity, pure networking. The blue is essentially the baseline; that's what the network can do. And then with Cilium and WireGuard...
Thomas: This is the request rate per second that we can achieve if we encrypt and authenticate everything with WireGuard, so at the network level. Cilium with Envoy is the static Envoy configuration that Cilium uses to authenticate with TLS and with standard certificates, and you can see it's impactful: there's a clear performance difference. We were actually a little bit surprised that Cilium with Envoy was outperforming Istio mTLS a little; I don't know quite why that is, as they should be relatively similar, because they are essentially doing more or less the same thing: Envoy-based TLS. The big difference is between WireGuard and the TLS model in the sidecar case, which shows that you can get mutual authentication but with radically better performance.
Thomas: If we look at the latency, things become really interesting, I would say. Here, lower is better. Obviously the best latency is with no authentication at all; Cilium with WireGuard is twice the latency; but then it gets really bad as we go into an Istio mTLS model. So based on these numbers, and based on the other properties that we gain, the separation of authentication and payload is incredibly interesting to us, and this is the model that we will pursue to provide mutual authentication for both Cilium Service Mesh and Cilium.
Thomas: Yeah, so for requests per second it will be like the blue, because once it's authenticated it will have the raw performance, but the latency will be like the red, because we will still have to do the authentication handshake up front. That's just the cost of doing authentication. There's actually one really good benefit: if we go back up here, this handshake does not necessarily have to be done for every new connection.
Thomas: Right, let's say we have workloads running and the workloads have authenticated each other. We could do that once and then cache the authentication result, let's say for an hour, so we only need to re-authenticate every hour or so. And then it gets even better, because we can actually re-authenticate based on an interval, even for a very long-lived connection. So even for the connection down here, which is then secured with IPsec or WireGuard, if you want to continuously re-authenticate it every hour, that is also possible.
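The caching Thomas describes, authenticating a peer once and reusing the result until a TTL expires, can be sketched like this. A hypothetical illustration of the idea only, not Cilium code; the handshake itself is stubbed out and the peer identity string is invented:

```python
import time

class AuthCache:
    """Cache mutual-authentication results per peer identity with a TTL."""

    def __init__(self, ttl_seconds: float, handshake, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.handshake = handshake  # expensive mTLS-style handshake, stubbed
        self.clock = clock          # injectable clock, useful for testing
        self._cache = {}            # peer identity -> expiry timestamp

    def is_authenticated(self, peer_id: str) -> bool:
        """Return True if the peer is authenticated, re-running the
        handshake only when the cached result has expired."""
        now = self.clock()
        expiry = self._cache.get(peer_id)
        if expiry is not None and now < expiry:
            return True  # still within the TTL, skip the handshake
        if self.handshake(peer_id):
            self._cache[peer_id] = now + self.ttl
            return True
        return False
```

The same structure supports the periodic re-authentication he mentions: a long-lived connection simply calls `is_authenticated` on an interval, and the handshake reruns only when the TTL has lapsed.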
Liz: I think one thing that also struck me, particularly when you were showing the network policy, was that rather than having a separate service mesh and a networking implementation, we're pulling this functionality into the networking elements. So it really does speak to the simplicity, making it much easier to...
Thomas: Exactly right. But app1 here will actually only allow traffic from app2, not just based on who has the certificate, but based on whether it's actually an app2 pod. So you cannot take that key and, from somewhere else, initiate a connection to app1; that will not work. This essentially combines network policy segmentation (what pod can talk to what other pod) with authentication on top, so you get the combination of both, which is very, very powerful.
A
As
always,
security
is
best
in
layers
defense
in
depth
and
and
that's
exactly
what
we're
seeing
here,
great
there's,
a
very
nice
comment
from
ray
saying
that
this
is
very
timely
and
looking
forward
to
the
blog
post
yeah.
I
think
that's
going
to
be
a
really
good
read
when
it's
published
so
we're
hoping
for
that
next
week.
I
think
if
you
have
questions,
do
put
them
into
the
chat
as
we
go
along.
I
think
yeah
we've
got
another
15
minutes
or
so
so
we've
got
plenty
of
time
for
for
questions.
Liz: Okay, so we've recently published, as part of the Cilium documentation, a public-facing roadmap. This includes some areas that are not service mesh, but there's a big section here about service mesh, because obviously that's a big focus for the project right now. Some of the features that you'll see in this list already exist in the beta, and some of them don't yet; some of them are things like the SPIFFE integration that we've been talking about today, and some of these things you will see graduating to stable in the upcoming v1.12 release. If you're a Cilium user, you may have seen there are some release candidates already out there in the wild, so you can see exactly which of these features are making it into those release candidates as we go. So I guess we could talk through what these are, starting at the top here: we no longer need to maintain a separate Cilium Service Mesh feature branch; it's all been integrated into Cilium.
Liz: It's the manifestation of service mesh becoming part of the networking component; it really is there. And I think that in v1.12 we are expecting to graduate the Kubernetes ingress capability to stable. We've had really good responses on that, and it seems to be behaving very well. That includes being a Kubernetes-conformant ingress, and the visibility: we talked about how visibility was such an important part of this, and you automatically get visibility of ingress traffic through Hubble. And support for additional annotations, Thomas?
Thomas: There is the Kubernetes ingress standard, but almost everybody uses some form of annotations, whether it's for canaries or for SSL passthrough: lots of additional functionality that almost all ingresses support in some way but that is not part of the Kubernetes ingress standard, and we want that to be supported. We're at 100% upstream conformance now; we're passing the conformance tests, which is great, and we're now adding the annotations that you most need. So whatever annotations you're currently using, whether it's with Ambassador, with Contour or NGINX or Istio, whatever additional functionality you need that is not found in the Kubernetes ingress spec itself.
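Using the built-in ingress that Liz and Thomas describe looks like using any other conformant controller; selection is done via the ingress class. A minimal sketch, with the host name and service details invented:

```yaml
# Minimal Kubernetes Ingress served by Cilium's sidecar-free ingress
# support; ingressClassName selects the Cilium controller.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo
spec:
  ingressClassName: cilium
  rules:
    - host: demo.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo-service
                port:
                  number: 80
```

Because this is the standard `networking.k8s.io/v1` resource, existing manifests can be pointed at Cilium by changing only the ingress class; the extra behaviors Thomas mentions (canaries, SSL passthrough) are layered on via annotations.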
Liz: And with more on the observability side, we have a Hubble OpenTelemetry collector, and Cilium itself generates Prometheus metrics, so there are some additional metrics there.
A
I
I
personally
haven't
seen
a
lot
of
feedback
on
that,
so
I
would
love
to
see
a
few
more
people
trying
that
out
and
and
telling
us
how
they
get
on
with
that,
but
the
prometheus
metrics.
I
think
we
are,
I
don't
know
if
we
can
count
them
counting
those
as
stable
in
1.12
yet,
but
they
certainly
seem
to
be
behaving
very
well.
Liz: And then, well, the ordering of these bullet points is not super logical, but...
A
The
next
point
really
speaks
to
the
control
plane,
the
the
configuration
of
service
mesh.
So
I
think
particularly
for
the
what
I'd
like
to
call
vanilla
service
mesh
use
cases
if
you're
doing
basic
configuration.
You
can
already
do
that
with
the
kubernetes
ingress
that
that
we
already
have
we're
looking
at
adding
additional
annotations
to
kubernetes.
A
To provide more of that kind of layer 7 functionality, things like the retries and so on. And this is all about how you get the configuration from a high-level abstraction into the Envoy configuration, which, I would say, for us is quite a low-level abstraction: the configuration of those Envoy listeners. Would you agree with that?
B
Yeah, absolutely. I think if we look at what we can already do, we can already cover the 80% service mesh use case, right? We can already provide visibility, and ideally you don't need to configure much to actually enable that. That's kind of the point: it's essentially a single knob. You turn on HTTP visibility and you get HTTP visibility, you get Prometheus metrics, you get OpenTelemetry support for that. Then there is a set of layer 7 functionality.
So this can do path-based routing, canary, retries, percentage-based routing, request-based routing and so on, right? And then the mTLS or security side we've talked about before: that will be driven by network policy plus, for example, SPIFFE or cert-manager or Vault, which then means that we don't really need our own massive set of CRDs or something. The goal here really is to create something that feels as native as possible with the existing Kubernetes experience that most app teams are already aware of.
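As a sketch of the "network policy drives the security side" idea: Cilium's existing CRD already expresses layer 7 rules. The CRD shape and HTTP-rule syntax below are real Cilium features; the labels and path are made up for illustration:

```yaml
# CiliumNetworkPolicy allowing only GET /public from a labelled client.
# Labels, port, and path are illustrative placeholders.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-public-get
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: "80"
              protocol: TCP
          rules:
            http:
              - method: GET
                path: /public
```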
A
We have seen quite a lot of interest in using SMI as that kind of configuration interface. So I guess there's a balance there, between a set of CRDs and the CRDs that teams are familiar with. I get the impression that people see SMI as a nice middle ground between the kind of complexity and the functionality that people really want to use.
A
I
think
we've
kind
of
talked
more
grafana
dashboards
totally
makes
sense
and-
and
I've
seen
actually
a
couple
of
questions
coming
through
about
metrics.
Let
me
just
bring
up
some
of
those,
so
does
hubble
include
metrics
for
mutual
authentication.
I
I'm
sure
the
answer
is.
It
will
absolutely.
A
And
actually
here's
an
interesting
question.
So
aaron
was
asking
joe,
but
I'm
sure
we
can
answer
this
as
well,
for
everyone
else
can
hubble
showcase
the
performance
metrics.
I
think
the
answer
is
yes.
Through
grafana
in
particular,
the
metrics
are
exposed
through
prometheus
and
then
grafana
yeah.
B
And I think, for example, Hubble can also measure the HTTP latency of request/response, and you can actually... let's say you have a very stable workload: you can actually use that as a way to benchmark your service mesh as well. You don't necessarily need an artificial benchmark to do this. You can run your own apps, run different service meshes underneath, and use OpenTelemetry metrics to figure out how fast, like, what is the...
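As a small sketch of looking at this kind of HTTP flow data interactively, the Hubble CLI can filter observed flows by protocol. `hubble observe` and its `--protocol` and `--namespace` flags are real; the namespace here is a placeholder:

```shell
# Stream HTTP request/response flows for workloads in a given namespace.
# "demo" is an illustrative namespace name.
hubble observe --protocol http --namespace demo --follow
```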
B
Yeah, so clearly... we're getting this question every day now. So clearly, as we're completing Ingress, Gateway API is next. The internal discussion was SMI or Gateway API; most recently so many people have been asking for Gateway API that, I think, it will...
A
And
if
people
want
to
help
contribute
to
that,
you
know
we
psyllium
is
an
open
source
project.
So
we
are,
you
know,
able
to
deliver
things
more
quickly
if
people
pitch
in
and
help
so.
B
Yeah, let me point out: for all of these high-level service mesh configuration topics, you do not need to know about eBPF, and you don't need to know about the internals of Envoy. We have built a way for the Cilium service mesh to essentially consume raw Envoy configuration, which means that implementing, for example, SMI or Ingress or Gateway API is primarily about taking that specification and translating it into Envoy configuration. So you don't need specific eBPF knowledge; you don't even need knowledge about Cilium internals. So it's actually a...
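As a sketch of that "raw Envoy configuration" path: the service mesh beta introduced a CRD for handing Envoy resources to Cilium's built-in proxy. The CRD name is real; the listener body below is a trimmed, illustrative fragment, not a complete working listener:

```yaml
# Feed raw Envoy resources to Cilium's built-in Envoy via the
# CiliumEnvoyConfig CRD (service mesh beta). Body is intentionally
# elided; a real listener needs filter chains, routes, and clusters.
apiVersion: cilium.io/v2alpha1
kind: CiliumEnvoyConfig
metadata:
  name: example-listener
spec:
  resources:
    - "@type": type.googleapis.com/envoy.config.listener.v3.Listener
      name: example-listener
      # ...filter chains, routes, and clusters would follow here...
```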
A
Question
from
carsten
hi
carsten
regarding
hubble,
open
telemetry.
Will
that
be
included?
So
you
don't
need
to
set
up
a
collector
yourself.
B
Yes,
I
think
that
that's
the
goal
we
would
love
to
have
the
conversation
with
everybody
like
how
that
should
look
like
how,
like
what
type
of
collectors
should
we
include
by
default,
should
we
use
a
particular
base
image
to
use
and
add
the
the
hubble
open
telemetry
collector.
B
On
top
of,
we
would
love
to
have
that
conversation
for
now
we're
helping
users
to
essentially
build
their
own
images
with
the
hubble
hotel
built
in,
but
we
want
to
get
to
a
better
default
image
that
will
have
the
right
set
of
collectors
included
for
that
suits.
Most
most
people.
B
Yes. So, first of all, you can run the full Cilium feature set on the cloud providers, but you can also run Cilium in so-called chaining mode and only get the service mesh functionality on top, or only get our policy functionality on top. So you don't necessarily need to rip out the CNI at all. You can run Cilium Service Mesh, and all aspects of Cilium, on top of existing CNIs as well.
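As a sketch of installing Cilium in chaining mode on top of an existing CNI (here the AWS VPC CNI): the `cni.chainingMode` Helm value is real, but the exact mode name and companion flags vary by platform, so verify them in the Cilium docs for your version:

```shell
# Install Cilium chained on top of the AWS VPC CNI rather than
# replacing it. Flags mirror the documented AWS chaining setup;
# other platforms use different chaining modes.
helm install cilium cilium/cilium --namespace kube-system \
  --set cni.chainingMode=aws-cni \
  --set enableIPv4Masquerade=false \
  --set tunnel=disabled
```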
A
Great,
so
this
is
your
last
chance
if
you're
out
there
and
you
have
a
question-
get
get
typing
in
very
quickly
really
great
question
actually
from
from
jaren,
he
asked
in
in
two
parts
he
said:
will
spiffy
get
its
own
demo
episode
and
then
also
asked
or
maybe
mutual
authentication.
I
think
absolutely.
That
will
be
something
we'll
want
to
to
show
off
in
an
echo
episode
that
will
yeah.
B
We
should
definitely
get
all
the
the
the
different
contributors
on
as
well.
There's
been
so
many
people
contributing
to
to
to
this
providing
feedback
providing
code
in
the
pr.
So
this
is
not
just
an
eye
surveillance
or
like
a
silium
core
team
kind
of
effort.
A
lot
of
people
have
come
in
with
interest
on
this
and
are
con
are
contributing
to
get
spiffy
their
spf
integration
done.
A
Just
going
back
to
the
cloud
provider
cni
is:
can
we
run
city
of
service
mesh
on
gke
without
data
plane,
v2
data
plane,
v2
is
based
on
silium,
but
yeah.
B
Yeah, so I'm hoping I understand the question correctly. Ingress is essentially provided by the eBPF data path that runs on all your nodes, so Cilium does not need to deploy an additional proxy pod or something, like how it is sometimes done by other ingress controllers. Essentially, all your Kubernetes worker nodes become capable of performing ingress as part of the eBPF/Envoy data path, which means it's very transparent. It just magically starts working.
B
You
enable
ingress
control
with
psyllium
and
psyllium
will
start
implementing
ingress
services
either
for
the
ingress
class
psyllium
or
if
you,
if
you
choose
for
stylum,
to
be
the
default
ingress,
then
for
all
of
the
ingress
services.
There's
no
additional
parts,
there's
there's!
No,
there
are
no
additional
proxies
or
something
that
needs
to
be
scheduled
ceiling
already
has
on
the
boy
built
in
and
that's
already
running,
on,
all
of
your
nodes.
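As a sketch of flipping that switch: enabling the built-in ingress controller is a single Helm value in Cilium 1.12. The value name is real; check your chart version's `values.yaml` for related options such as making Cilium the default ingress class:

```shell
# Turn on Cilium's built-in ingress controller on an existing install.
helm upgrade cilium cilium/cilium --namespace kube-system --reuse-values \
  --set ingressController.enabled=true
```

Individual Ingress objects then select it with `spec.ingressClassName: cilium`.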
A
Wonderful! So, Thomas, any last thoughts before we wrap up the show today?
B
Yeah
we've
been
blown
away
by
the
feedback,
we're
getting
so
far.
We're
kind
of
prepared
to
okay,
we'll
we'll
start
evolving
into
becoming
a
service
mesh
a
little
bit,
and
this
will
be
fun
and
the
responses
have
been
like
crazy,
like
so
many
people.
I've
kind
of
this
is
exactly
what
we
want.
Can
we
really
get
rid
of
the
side?
Cars?
That
would
be
amazing.
We
want.
B
We
want
the
simplicity,
we
want
performance
gains,
we
want
a
better
security
model,
we
want
the
not
having
massively
complex
crds
and
I
think
this
chance
to
kind
of
provide
feedback
from
what
we
have
all
learned
operating
service
measures
in
the
last
couple
of
years.
This.
So
this
opportunity
is
really,
I
think,
intriguing
for
a
lot
of
people.
Like
I've
learned
a
lot.
We
can.
We
almost
have
a
chance
to
redo
service
mesh
a
little
bit.
I
think
that's
very
exciting.
A
Yeah, I personally think it's really exciting that we're bringing this functionality into the networking layer. I think the more we can push this sort of infrastructure, you know, as low down the stack as possible and make it simpler for developers, so that developers can concentrate on writing application code and don't have to worry about managing so many different components, this is all to the good in the world of cloud native.
A
So
I'm
sure
some
of
us
have
our
focus,
we're
thinking
ahead
to
kubecon
in
a
couple
of
weeks
time,
if
you
are
there
both
thomas-
and
I
will
be
there
along
with
many
other
folks
from
from
psyllium
from
the
project
contributors
who
are
part
of
iso
vegan
and
also
external
contributors.
So
hopefully
we'll
might
get
a
chance
to
meet
some
of
you
face
to
face
there,
which
will
be
brilliant.
A
Look
out
for
that
blog
post,
describing
the
the
psyllium
approach
to
authentication
and
encryption
that
we
will
be
publishing
next
week.
Do
follow,
follow
cillian
project
on
twitter
and
that's
probably
the
the
first
place
that
you'll
see
it
perhaps
on
linkedin
as
well,
and
I
think
with
that.
I
just
want
to
say
thank
you
to
everyone
for
joining.
In
the
conversation,
all
your
great
questions
as
always,
really
lovely
to
have
that
interaction,
and
thank
you
again
to
thomas
for
spending
time
with
us
today.