From YouTube: Webinar: Best Practices for Deploying a Service Mesh in Production: From Technology to Teams
Description
Successfully operating a service mesh in production requires much more than just `kubectl apply`: it requires drawing clear lines of responsibility and accountability among platform, service, application, security, and devops teams. In this webinar, we will showcase several real-world Linkerd adopters who have gone “beyond the mesh” and organized their engineering teams to collaborate more effectively in order to run reliable, cloud-native applications.
Presenters:
William Morgan, Co-Founder & CEO @Buoyant
Ana Calin, Systems Engineer @Paybase
William King, CTO and Founder @Subspace
Matt Young, VP of Cloud Engineering @EverQuote
Erica (moderator): Alright, let's go ahead and get started. I'd like to thank everyone who's joining us today. Welcome to today's CNCF webinar, "Best Practices for Deploying a Service Mesh in Production: From Technology to Teams." My name is Erica; I'm a business development manager for cloud native technologies and a CNCF ambassador, and I'll be moderating today's webinar, which will be a conversation between William Morgan, co-founder and CEO at Buoyant; Ana Calin, systems engineer at Paybase; William King, CTO and founder at Subspace; and Matt Young, VP of cloud engineering at EverQuote.

A few housekeeping items before we get started: during the webinar attendees aren't able to talk, but there's a Q&A box on your screen, so feel free to drop your questions in there and we'll get to as many of those as we can at the end. Please remember this is an official CNCF webinar and is subject to the CNCF code of conduct; please do not add anything to the chat or questions that would be a violation of that code.
William Morgan: This is me: I'm William Morgan, one of the creators of Linkerd, which is a service mesh. I'm CEO of a company called Buoyant, which does lots of service mesh things, including sponsoring and maintaining Linkerd. We built a product called Dive, a delivery platform for service meshes. I have delivered many service mesh talks and webinars; basically my entire life began with the service mesh and will end with the service mesh fading into obscurity. Well, hopefully not. So that's me, but the actually interesting people are here today, so what I want to do is let them introduce themselves.
Matt Young: Hi everybody, my name is Matt, and I run our cloud engineering team at EverQuote. EverQuote operates a leading online insurance marketplace in the United States that connects consumers who are seeking insurance of various types with insurance providers, to help them protect life's most important assets: their family, property, and future.

In short, we connect a whole lot of people who want to shop for something with a whole bunch of people who are providing services, and we do a bunch of machine learning and smart analytics, combined with a fairly sizable web-facing set of services, to make that happen. My team partners with our engineering teams, who are my customers, and we build a platform full of services and curated patterns that lets our teams manage their own services in production.
Ana Calin: Again, my name is Ana; I'm a systems engineer, or infrastructure engineer, at Paybase. Paybase is a payment services provider, specifically for marketplaces, gig and sharing economies, blockchain businesses, or any type of fintech we find ourselves working with. We operate in a very regulated space, which means that for us specifically it's important to be highly reliable, available, and scalable, just as our customers expect.
William Morgan: Awesome. Well, thank you, all three of you, for joining us today. We're going to post the slides on the CNCF website, but I'll just point out, skipping ahead to the very, very end, that I have a couple of links in here. Our esteemed panelists didn't mention some really exciting stuff: I have a link to Subspace's big launch, its emergence-from-stealth announcement; Matt has an upcoming talk at ServiceMeshCon; and Ana actually delivered a talk at the last ServiceMeshCon.
William Morgan: What I want to focus on, and the reason why I've asked Ana and William and Matt to join us, is kind of the organizational aspect. Once you actually have a service mesh that you have deployed to some environment somewhere, how do engineers interact with it? What has to change, or doesn't have to change, around the way the teams are structured? And basically, how do you actually operate this thing from the team and human perspective, as opposed to from the perspective of, you know, the computers and the bits and bytes?
B
B
You
know,
as
opposed
to
the
developers
or
the
business
logic
implementers
so
tool
for
giving
them
the
observability
the
reliability
and
security
primitives
right.
This
is
like
kind
of
stuff
that
you
get.
Those
primitives
are
critical
for
cloud
native
architectures,
which
is
why
we
want
to
give
them
to
them,
and
we
do
it.
The
kind
of
the
magic
beans
is
we
do
it
with
no
developer
involved,
ideally
there's
some
asterisks
in
there
right,
ideally
what
the
service
mesh
delivered
and
the
reason
why
it's
so
useful.
William Morgan: It's not actually the features themselves; it's the fact that it delivers those features to the platform team in a way that decouples them from the developer teams. So rather than asking the developer teams to all implement TLS in the exact same way, you know, and fighting with the product managers who are trying to deliver, you know, business logic features, we can do that at the platform level. Rather than having instrumentation and telemetry fragmented across all the apps...
William Morgan: ...we can give you a consistent layer of telemetry at the platform level, and so on. So that is what a service mesh is in practice. They all follow a similar pattern, and I'm going to mostly talk about Linkerd here, because that's the one that I'm most familiar with, but the reality is almost every service mesh follows a very similar pattern, which is that you have a control plane and a data plane of proxies.
William Morgan: Linkerd is an open source, open governance service mesh; it's a CNCF project, and we're very happy about that. It's been in production for probably much longer than this slide suggests, including at companies like Paybase and EverQuote and Subspace; it has all sorts of GitHub stars, which is very important, and a more or less stable release cadence. Okay, very last section here, just to make this really concrete.
B
You
know
what
is
linker
D
actually
do,
there's
a
set
of
features
around
observability,
there's
a
set
of
features
around
reliability
and
there's
a
set
of
features
around
security,
and,
as
we
have
our
conversation
with
our
panelists,
you
know
a
lot
of
these
features
are
going
to
be
brought
to
the
surface,
and
so
on
the
observability
side.
We
have
things
like
service
level,
golden
metrics,
so
success
rates,
Layton
sees
throughput
service
topologies
on
the
retry
side
or
on
the
reliability
side.
We
have
things
like
retries,
timeouts
and
load
balancing
multi
cluster
support.
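To make the observability side concrete, here is a minimal sketch of pulling those golden metrics with the Linkerd CLI. It assumes a Linkerd 2.x install and uses the `emojivoto` demo namespace as a placeholder; on newer releases (2.10+) these commands moved under the `viz` extension.

```sh
# Golden metrics (success rate, requests/sec, p50/p95/p99 latency)
# for every meshed deployment in the namespace.
linkerd stat deployments -n emojivoto

# Live view of the heaviest traffic flowing through one deployment.
linkerd top deploy/web -n emojivoto
```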
William Morgan: So that's where we spend a lot of our time and energy, and I guess we'll find out whether we did a good job at that or not. Okay, so now on to the fun part. Hopefully that all made sense; if it didn't, the resources slide at the very, very end of the slide deck has a couple of links to some docs and blog posts and things that you can read to help you in thinking about the service mesh as a category. Okay, so now the kind of fun part here.
B
Hopefully,
we
will
all
learn
something
new,
because
all
three
of
these
people
have
actually
deployed
a
service
mesh
to
production
and
has
to
live
with
the
consequences
of
that
decision
every
day.
Okay,
so
this
is
this
is
the
big
list
of
questions,
but
we're
actually
going
to
go
through
this
one
by
one,
everybody
feeling
ready:
yes,
all
right,
okay,
so
the
very
first
question
which
of
course,
I
missed
how
big
is
your
engineering
organization
and
how
is
it
structured
Matt?
Why
don't
we
start
with
you.
Matt Young: Sure. Our engineering organization at EverQuote is roughly around a hundred people all-in, across disciplines. My immediate team is seven or eight (I'm bad at headcount; say seven), but we're growing. The way we're structured is something that we've pivoted on over the last year. You know, in the past the team was largely operationally based, where we were, you know, sort of just doing what was needed, but over the last year we've really changed over to more of a forward-looking team, tasked with building out a platform that allows us to solve problems for our engineering teams so that they don't have to solve them individually. So in a way we're an embedded startup inside a recently public company: my customers are all of the engineering teams, and my product is all the cloud things, their service hosting environment.
Ana Calin: For us, our engineering team is unusually small. We have a total of five people: that includes two systems engineers, so infrastructure engineers or SREs, and two to three software engineers. The way the team is split, the way the world looks, is that although the systems engineers maintain the infrastructure, the monitoring systems, and the service mesh, our software engineers are able to deploy new versions of an application themselves without having to make major changes to infrastructure, and everyone gets involved in everything.
William Morgan: Great, so we've got a nice range of sizes here: we've got 5, 30, and 100 engineers. All right, the next question. William, I think you've got a head start on this already, so why don't you keep going with it: at Subspace, who owns the service mesh, and how does the rest of the organization interact with it?
E
We
kind
of
take
the
approach
of
the
service
mesh
and
the
tooling
is
kind
of
the
page
highway,
and
if
a
software
engineers
need
to
go
off-roading,
they
can
do
everything
custom,
but
most
look
at
it
and
say
the
tooling:
you
need
the
service.
Miss
provides,
isn't
worth
it,
so
they
take.
The
templates
get
the
service
deployed
oftentimes
in
under
an
hour.
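As a rough sketch of what that paved-highway onboarding can look like with Linkerd (the namespace and manifest names here are placeholders, not Subspace's actual setup):

```sh
# Opt an entire namespace into the mesh; Linkerd's admission webhook
# then injects the sidecar proxy into every pod created there.
kubectl annotate namespace my-team linkerd.io/inject=enabled

# Deploy the service from the shared template and verify its proxy.
kubectl apply -f my-service.yaml -n my-team
linkerd check --proxy -n my-team
```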
Matt Young: Ownership: from an "if it breaks, who fixes it" perspective, that would be our team. If it's ownership in terms of who's been a proponent for it and who's rolled it out, that's also my team. However, I think, at least at EverQuote, our applications increasingly are viewing the infrastructure that they need as inclusive to their definition of what their service is, whether that's core infrastructure components like stores, buckets, and things like that; you know, now we have Terraform and workload descriptions alongside the service. The same is true for some of the configuration of the mesh. We have roughly a quarter of our services, the most critical ones, in the mesh now, with adoption happening over the coming quarter and a half. So initially I would say it's more of a shared ownership model; however, the way we prioritized and how we're staging this was done in close collaboration with the teams that needed it, right.
Matt Young: There's more I could talk to there, and in the "why did we adopt the mesh and how did we roll it out" question. But, you know, EverQuote is about five or six years old, seven depending on how you count, so there's, I don't want to say strata, but there's a number of different epochs, time periods, and different service architectures, and the most recent few years are primarily Kubernetes-hosted for new services. So, you know, before we had a service mesh we needed to do timeouts and retries, so we actually have some services and/or libraries in use that do some of that. Some of the features that a mesh provides that you mentioned: for many services it's a way for them to prune out things like that, but we haven't done that yet. I can speak more to that in maybe the following question and let the others speak, but it would be a little more...
William Morgan: Sometimes, depending on the organization, there are things that developers may care about that kind of fall into the service mesh realm of functionality, right? Like: I care about how retries are going to work for my service, or I care about, you know, the timeouts that callers are setting when they call my service.
Ana Calin: Yeah, they do need to care about latency and retries, but after we implemented Linkerd we haven't seen a big change, like a latency increase, in the performance of our system. In fact, we were able to make other changes at the same time that enabled us to offer the same kind of performance for the system.
William King: I'd say on our side we're very latency-aware; we measure everything in milliseconds or smaller. We actually use Linkerd to help: a service is able to insist that its clients and consumers are not setting timeouts longer than a certain amount, or other retry budgets. A consumer is able to be more aggressive and have a lower threshold, but a service is able to say what its expectations are. Basically, from an SRE, SLO-type perspective, we use the service mesh to help standardize that.
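A hedged sketch of what encoding those expectations looks like as a Linkerd ServiceProfile; the service name, route, and values below are illustrative placeholders, not Subspace's real configuration:

```sh
# Per-route timeout plus a retry budget that caps how aggressive
# retrying consumers can be against this service.
kubectl apply -f - <<EOF
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: sessions.game.svc.cluster.local
  namespace: game
spec:
  routes:
  - name: POST /v1/connect
    condition:
      method: POST
      pathRegex: /v1/connect
    timeout: 50ms          # the service's stated latency expectation
    isRetryable: true      # safe to retry, within the budget below
  retryBudget:
    retryRatio: 0.2        # retries may add at most 20% extra load
    minRetriesPerSecond: 10
    ttl: 10s
EOF
```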
William King: We don't really have much of that distinction here; it's kind of a co-partnership on that, but at the service, namespace, architecture level, and we'll go through and agree that this particular service should have these characteristics, and then both sides will implement to that.
Ana Calin: As I said, though, our team has a very flat structure. In terms of making sure we're doing well, I guess it comes down to measuring the performance of the system and not being paged constantly when we're on call, and we haven't seen an impact ever since we implemented Linkerd, the right version for us. You know, after we solved all of the initial bugs we encountered, we haven't seen the performance change either way.
William King: For us, we're just getting to the point where we've got SREs who are driving and doing things like Linkerd upgrades or Kubernetes node scaling, and it's been great to be able to change the type of node that our entire cluster is using while the cluster is still in a zero-downtime state.
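For reference, a minimal sketch of that kind of zero-downtime node swap with stock kubectl (node names are hypothetical); with the workloads meshed, traffic rebalances around the evicted pods as each node drains:

```sh
# Replace nodes one at a time while the cluster keeps serving.
kubectl cordon old-node-1          # stop new pods landing on the old node
kubectl drain old-node-1 --ignore-daemonsets --delete-emptydir-data
# ...bring up a replacement node of the new type, wait for it to be
# Ready, then repeat for the next node...
kubectl delete node old-node-1
```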
William King: The goals from the platform team were kind of being able to know: is our overall platform, and the service we're providing to customers and the gamers, still operating at a nominal state? If it is, then okay, all of these things that would otherwise require large coordination can keep continuing. If it's not, they're the ones who are able to at least shine a broad flashlight on where the problem might be. That's one of the things that we've really valued about the observability.
Matt Young: I guess there's a couple of different ways to answer that. At EverQuote we've just finished planning for the quarter, and really talking about what's a service versus what's a platform has been a topic. So, to use the definitions that we've adopted internally, you know, we would say a service is something that delivers value to you directly: like, here's a thing you can call, here's a service I'm running.
Matt Young: Then, within the larger consolidated engineering team (for the context of this discussion), we have a data engineering portion, and they run a data and analytics platform, right, that people can put data into. Our cloud platform that we're running is comprised of, you know, some shared Terraform modules and Kubernetes clusters and the service mesh. So in that respect, yes, we are a platform team, and we're producing something that our teams can just come to and use. I think we're still...
C
You
know
midway
through
the
full
rollout,
so
you
know
I'll
caveat
it
was
saying
you
know
we
still
have
some
work
to
do
before.
I
would
call
it
like
a
done
platform
which,
to
me
means
I
can
back
away
slowly
from
it
all
of
the
core
use.
Cases
are
covered
and
documented
with
examples
you
know
we're
still
more
and
be
like.
Well,
here
are
the
dozen
or
so
services
on
it
and
if
we're
going
to
add
a
new
one,
we'll
do
what
they're
doing,
but
it's
not
completely
self-serve
debt.
Matt Young: So at EverQuote we had the happy misfortune of having way more load than we expected, a little bit sooner than we expected. Over the last couple of years we've seen traffic to our consumer-facing services just, you know, double, triple, and up, so we had a number of monoliths that were being decomposed in the process. You know, in some cases we actually have great...
C
You
know
you
know
very
discreet,
classically
defined
microservices,
but
in
other
cases
we
have
what's
more
really
a
distributed
model
with
or
somewhere
in
between
and
I,
don't
mean
that
in
a
bad
way,
I
just
mean
we
needed
to
scale
some
portions
more
than
others,
but
we
still
do
have
either
temporal
coupling
or
in
some
cases,
other
forms
of
coupling
still
present,
which
again
is
not
necessarily
broken.
So
our
initial
motivation
for
bringing
in
a
service
mesh
it
was,
as
still
at
the
time,
was
to
load
balance.
C
Chair
PC,
we
had
grown
as
an
organization
to
the
point
where
simple
rest
interfaces,
while
expedient
became
a
little
more
difficult
to
manage
without
very
strict.
You
know
swagger
definitions
or
open
api
specs,
which
didn't
always
happen
so
proto
and
gr
pc
was
chosen
as
an
RPC
typed
language
for
many
of
the
new
services.
But
both
you
know
all
of
the
cloud
providers
didn't
at
the
time
have
l7
load,
balancing
and
many
still
don't
so
you
know
we
had
lots
of
load
and
no
way
to
load
balance
it.
So
that
was
our
initial
motivation.
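Some background on why a mesh solves this: gRPC multiplexes many requests over a few long-lived HTTP/2 connections, so connection-level (L4) balancing tends to pin all of a client's traffic to a single pod. A sketch of the fix with Linkerd (the deployment and namespace names are hypothetical):

```sh
# Inject the Linkerd proxy, which balances individual gRPC requests
# (L7) across all endpoints instead of pinning whole connections.
kubectl get deploy quote-api -n prod -o yaml \
  | linkerd inject - \
  | kubectl apply -f -

# Confirm request load is now spread across the backend's pods.
linkerd stat pods -n prod --from deploy/quote-api
```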
Matt Young: The second one: the ePHI or other data that our customers give us that's either of a medical nature or the like, where there are compliance issues and we need to ensure that we have mTLS, encryption in transit, as well as encryption at rest, for everything. So having a service mesh, you know, that's one of those things that we can provide to all teams without all teams having to deal with authentication and encryption and mTLS themselves.
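A small sketch of how you might verify that the mesh is actually delivering that encryption in transit (namespace and deployment names are placeholders; on Linkerd 2.10+ these commands live under the `viz` extension):

```sh
# Show which workload-to-workload edges carry a verified mTLS identity.
linkerd edges deployment -n payments

# Sample live traffic; tls=true on each request confirms encryption.
linkerd tap deploy/ledger -n payments
```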
Matt Young: That was the second big one, and then the third was observability. Obviously, you know, when we were a 20-person company with a big shared code base, everyone just kind of knew what was going on, but now we have dozens of services and rising, and teams that are growing not just in number but also across geographies, where we're now, you know, a multi-region team, if you will.
Matt Young: And there's actually a fourth one; I don't want to hog too much time here, but you know we're rolling out continuous deployment for our services. We're using Flux CD and Flagger, for Kubernetes-hosted services at least, and the observability and metrics that come out of the mesh can help us form the predicates that we use for canaries. That's active work in flight for us; we've got, you know, pilots up now, and we like what we see so far.
Matt Young: So we're doing things like, this quarter, taking all of our proto that we build in CI and generating Linkerd service profiles from it. So now our observability is not just at that service level; moving forward it will be at the route level, or at the method-invocation level, and that's a huge win, because, you know, when something goes wrong or when we have an issue, we can very quickly see where the issue is.
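A sketch of what that CI step can look like, using Linkerd's built-in protobuf-to-ServiceProfile generation (the file, service, and namespace names are hypothetical):

```sh
# Turn the .proto that defines the service into a ServiceProfile,
# so every RPC method becomes a named route with its own metrics.
linkerd profile --proto ./quotes.proto quotes-svc -n prod | kubectl apply -f -

# Per-route success rates and latencies, not just per-service totals.
linkerd routes svc/quotes-svc -n prod
```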
Ana Calin: For us the main motivation was gRPC load balancing as well. Our application is a distributed monolith that is deployed on top of Kubernetes as microservices, so it's quite complex, and it has, I think last time we counted, over 50 microservices; realistically maybe more by now, though not 100. And we are in a regulated space.
D
So
MPLS
and
encryption
and
security
was
really
important
to
us,
but
we
are
able
to
find
other
ways
to
go
along
that
the
main
issue
was
scalability
and
being
able
for
services
that
communicate
through
jealousy
and
protocol
being
able
to
load
balance.
G
RPC
was
a
pain
point.
Any
fooi
wouldn't
have
used
the
service
mash.
We
would
have
had
to
change
the
way
the
services
communicate
with
each
other
or
even
build
ourselves
a
smash,
but
I.
Don't
think
that
was
something
that
yeah.
William King: My co-founder and I actually came out of a regulated telecom space, so we brought forward a lot of those best practices, and we figured that if we were going to build in something at the infrastructure level, we might as well start using best practices. It's a lot easier to greenfield, bring those in, and establish them as tooling than it is to try and backport them later. For us, it was actually more about bringing in determinism, and more about services being able to self-configure.
E
So
some
of
the
examples
because
we're
doing
clusters
between
multi-cloud,
both
from
on
pram
bare-metal
to
cloud
hosted
versions.
We
were
seeing
strange
connectivity
issues
between
them
and
having
the
service
mesh,
run
MPLS
or
run
basically
the
ketchup
rock
scene
between
the
services
in
the
cluster,
and
we
were
able
to
get
ways
to
do
it
between
clusters.
E
It
actually
brought
a
lot
of
determinism
in
and
services
were
able
to
go
through
and
self
configure
how
they
wanted
the
service
mesh
to
react,
and
since
we
use
scaffold
and
helm
for
a
lot
of
our
CI
CD
deployment
process,
we
were
able
to
specify
that
in
the
actual
deployment.
So
we
could
make
as
a
discrete
unit
a
service
mess
change
like,
for
instance,
we
were.
Matt Young: That's been our experience as well. We haven't really noticed any issues of latency with Linkerd, and we've been able to spend that worry budget elsewhere. In particular, there's the more nuanced, more adaptive way that load balancing happens in Linkerd, you know. In particular, we run some fairly large clusters where we opportunistically run some workloads on faster nodes when model-training things aren't busy, and so, you know, it's not uniform.
William Morgan: This is really good to hear. One of the challenges we really faced early on in talking about the concept of the service mesh to people was, you know, it seems like a bad idea, right? Like, you're adding thousands of proxies everywhere, and you're going to incur a hit there, and so, you know, we had to talk about how...
Ana Calin: I remember, William, with one of the very first versions of Linkerd, when we installed it, we saw major, major latency added to our services. But then it turned out there was a bug between the application and Linkerd, and then we worked together and solved that, and after that was fixed we haven't seen much latency added.
William Morgan: I think there are two questions here, and maybe we'll try and address them at the same time, because I want to make sure we have space at the end for audience questions. So, you know: what's been the biggest organizational challenge in rolling out a service mesh? By organizational I mean people; you know, I understand that deploying anything in Kubernetes is a challenge just from the kind of nature of the beast. And then, what's been the most surprising benefit? So, William, why don't we start with you?
William King: I would say for us the biggest organizational challenge was kind of two parts, and we solved each of them in an interesting way. One was being able to find a shared set of configurations that works for all services, when we know that's impossible. So we found a way of working to define a sane default, plus how we migrate off that default for specific scenarios, for as long as they have to be off the default, and then, where possible, try and bring them back in. So that was managing configuration.
E
That
was
an
organizational
challenge
more
of
that
related
to
working
with
some
amazing
engineers
in
our
team
who
were
learning
how
to
go
from
a
service
mesh
so
folks,
who
had
never
actually
been
in
an
SRE
or
an
Operations
type
of
role.
Even
we
had
a.
We
have
a
nickname
internally
you're
the
SOE
intern,
for
these
sets
of
projects
where
it's
basically
you're
getting
the
matrix
level
of
how
does
a
service
mesh
work?
How
do
all
the
components
work?
How
do
you
change
and
configure
individual
components
to
override
the
defaults
so
organizational
challenge?
E
Matt Young: So I think one of the challenges; well, it wasn't to initially adopt the mesh. I mean, we had very concrete problems of, I'll say, manual load balancing happening before we had a solution to load balance gRPC. So, you know, I guess at a high level, the challenge has been that for teams that have an acute, concrete need which a mesh solves, that's easy. What's a little bit harder is in a growing company, and we are; we have an enormous opportunity...
C
That
can
be
just
from
a
people
or
a
project
management
perspective.
A
little
bit
of
a
challenge,
however,
I
think
it's
solvable
and
when
you
show
them
some
of
the
stuff
they
get
like
hey,
you
can
come
to
the
measure
you
can
implement
until
s
yourself
or
you
know
some
of
the
some
of
the
you
know
we're
standardizing
on
an
observability
stack
that
is
kind
of
really
heavily
leveraged,
consistent,
metrics
coming
from
these
services,
so
that
we
can
say
hey
if
you
hop
on
the
mesh.
C
Here's
all
this
alerting
and
monitoring
an
anomaly
detection
and
other
things
that
you'll
get
out-of-the-box
that
you
would
otherwise
maybe
have
to
manage
yourself.
So
that's
one
challenge.
Matt Young: Another challenge we've had: you know, we shifted to Kubernetes a couple of years ago, and some of the difficulty is there. I mean, as an aside: the first time my partner saw the peanut butter I was eating a couple of years ago, this raw peanut butter stuff, she said, "Oh, this is okay, but it doesn't taste like it's done to me." Kubernetes doesn't feel like it's done yet, right? It's useful, it's a step in the right direction, it's doing a lot of positive things, but it feels like it hasn't arrived. There is a barrier to entry, and in particular for us, we have both Kubernetes and non-Kubernetes workloads.
C
So
I
think
one
of
the
challenges
has
been
that
teams
now
need
to
kind
of,
in
particular
when
they
have
services
both
inside
and
outside
kubernetes
know,
it's
forced
us
to
address
some
technical
debt
and
learning
around.
How
do
we
handle
east/west
versus
north-south
traffic
right?
How
do
we
you
know?
What
are
the
finer
points
of
this
and
I?
Think?
Matt Young: A positive aspect, though, is that we now have had a number of discussions about how we're making some choices, like using nginx now instead of cloud-vendor-specific ingresses, those kinds of use cases, and an outcome has been, you know, a higher level of knowledge about the bowels of the networking that was not there before.
William Morgan: Okay, great, thank you. And you're the real engineer here, right; the rest of us have devolved into management roles and are in our ivory towers shuffling org charts around, so keep us pure. Ana, what's been the biggest organizational challenge at Paybase from rolling out Linkerd?
Ana Calin: My team saying, "I'm ready to deploy to production." I would deploy to production, and then I'd have to run, the team saying, "Stop, stop, stop, roll back, it's not working." Again, the talk that I did with Risha at ServiceMeshCon talks about those challenges. That has been the main thing, but we were able to solve them, and we were able to do that through collaboration between the different teams. And, just to sort of go ahead into the next questions...
Ana Calin: If there's something I wished someone would have told me: Risha and I came up with this matrix of how to troubleshoot something as complex as a service mesh when your own application is very complex, and I just wish I'd had access to that when I was deploying it. But yeah, that has been the biggest challenge, I would say. And in terms of a surprising benefit: being able to see, on the UI, the dependency tree between services.
B
Great
and
that
decision
matrix
that
you
in
Risha
came
up
with
that's
in
your
talk,
which
is
like
so
there's
a
link
to
that
at
the
end.
Ok,
we're
gonna
do
one
last
question
here
from
me
and
we're
gonna
have
to
stay
really
focused
because
I
want
to
leave
a
bunch
of
time
for
the
audience.
Q&Amp;A
we've
got
a
whole
bunch
there.
So
very
last
question
30
seconds
or
less.
Maybe
you've
already
answered
this
Anna,
but
we
can
start
with
you
what
what's
your
best
advice
for
other
organizations
who
want
to
adopt
a
service
watch.
D
Would
just
say
don't
be
afraid
to
reach
out
to
the
to
the
team,
who's
who's,
managing
who's,
managing
your
service
smash,
sorry
contributing
maintaining
that's
the
one
I'm
maintaining
the
smash.
So
for
us
we
are
able
to
contact
you
guys
over
slack,
and
that
was
the
fastest
way.
We
were
able
to
fix
everything
before
seeing
on
our
side
and
don't
be
intimidated
because
it
looks
it's
very
complex.
Service
mash
is
very
complex.
So
just
take
everything
incrementally
and
add
things
as
you
go.
That's
it
that's.
William King: Like Ana said, with a service mesh the incremental approach is the best way to look at it. I would take it a step further and say: while you're looking to adopt an incremental approach, get something working, and then, when you get something working at a very small level, break it. See how it breaks, understand how to triage it, roll back the break, then go and add the next piece of the feature, and try to take things on as functional units. So, say, north-south through an API gateway as one unit, east-west between services and namespaces as another unit, and between multiple clusters as its own separate unit. You'll learn a lot about the subtleties and the insides of the abstractions by seeing how it breaks and then putting it back together.
Matt Young: I think it's safe to say that if you're dabbling in these waters, in the service mesh space everything is shiny, so be really, really, really clear about what problems you're actually trying to solve, and ruthlessly prioritize. There are many features of Linkerd, for example, that we haven't explored yet, because we've really needed to focus on the ones we focused on, and take an incremental approach and iterate.
Matt Young: As an example, you know, we have at this point some namespaces where we've got everything in the namespace meshed. In our new environments that we're building out for our next-generation stuff, it'll be the default to have the service mesh enabled, and the exception will be when you're not on it. But it's very easy to roll out something very broadly and then discover what you don't know. So also a big +1 to reaching out to the upstream communities; it's one of the advantages of working in an open-source-based CNCF stack.
William Morgan: Great. Okay, well, I'm glad to hear that the community aspect is coming out here. Okay, so we've got a couple of minutes left. While we've all been talking, Ariel has been slaving away behind the scenes curating all the questions that have come in, so I have no idea what these questions are; I have not looked at them yet. We're going to find out together.
William Morgan: How about this: we've got, let's see, okay, four questions total, and let's make it a free-for-all, so rather than me directing everything, just jump in.
Matt Young: So, I think somebody asked: why did you move to Linkerd from Istio, or Istio versus Linkerd; there's two or three questions on that. For us, we rolled out Istio first. We still have one workload on it, because it uses header-based dynamic path routing, which Linkerd doesn't do. We've kind of found that Istio was very broad; it seems to have a ton of features, but it's also very difficult.
C
It
has
a
lot
of
moving
parts
and
for
us,
most
importantly,
it's
very
opinionated
on
ingress
gateway,
and
we
wanted
to
have
the
flexibility
to
choose
our
own
aggresses,
as
we
still
haven't
consolidated
on
one
single
API
gateway
type
like
Ambassador
or
glue
or
something
else
so
for
us.
Linker
D
was
a
little
more
narrowly
focused
and
more
towards
a
less
configuration
and
less
barrier
to
entry,
as
well
as
being
a
little
bit
less
overhead
in
terms
of
performance.
William King: We actually ran into issues, and the reason we migrated, or the start of the migration from Istio to Linkerd, was when Helm options and istioctl options were not being respected. You dive into the code and you realize, okay, there's a significant difference between the two, and there's no way to configure that particular construction. Whereas with Linkerd, two hours later we were up and running with the full cluster in a beta environment, and we didn't really look back.
William Morgan: Yeah, that's great. I think there's a question on latency and overhead and metrics. There are some metrics: if you search for Kinvolk and Linkerd, maybe Kinvolk, Linkerd, Istio, you'll see a performance comparison that was done in May of last year. So it's almost a year old, and both projects have released several versions since, but that was the most comprehensive benchmark that I'm aware of, and all that stuff is downloadable, so you can reproduce those graphs yourself, or...
B
Try
try
it
with
the
new
versions
and,
let's
see
what
happens
and
then
there's
a
question
about
the
underlying
proxy
service.
Does
that
have
the
greatest
impact
on
performance
and
latency?
Or
is
it
the
policy
driven
part
of
the
mesh
that
cause
the
greatest
resource
contention?
Late
selection
question
I
think
either
either
of
those
could
have?
Certainly
the
proxy
has
a
huge
impact
on
performance.
B
You
know,
because
that's
the
thing
where
every
single
call
you're
making
between
services
now
has
to
go
between
not
just
one
but
two
proxies
and
a
tox
to
be,
and
they
have
a
proxy
kind
of
on
both
of
the
client
side
and
at
the
service
side.
So
if
that
proxy
is
not
as
fast
as
humanly
possible
and
it's
computationally
possible,
then
you're
losing
on
performance
on
the
policy
side,
you
know
I,
guess
it
depends
on
how
policy
is
done.
So
it's
easy
for
link
rudy,
because.
Matt Young: It integrates pretty nicely, and, you know, I hope to have some better results to talk about soon, but our pilots are looking great so far. On the ingress side, we're using nginx today, and the reasons why I kind of covered in the last question, but the reason is we're operating clusters in multiple clouds, and so we want our application definitions to be as portable as possible. So not having to tokenize things like ingress declarations for, you know, this cloud vendor's ingress versus that cloud vendor's ingress is something that drove us.
William King: I'd say we're on the Envoy side, and that was more from a performance and rapid-reconfiguration standpoint. We also got an advantage on the gRPC-to-REST transcoding, because we were greenfielding in protobuf and gRPC from the beginning; we didn't have to Swagger-define and spec out all of the REST side.
Matt Young: So we're, you know, gRPC under the covers and can then expose REST. There was a question about multi-tenancy and service mesh. I can only speak to Linkerd right now, but you can run multiple instances of the mesh for different tenants if you wanted to. However, because of CRDs, at least for the near future, until versioned CRDs are more real, you're stuck with one version of the mesh across a particular cluster, but you can run multiple control planes in parallel without too much drama.
Ana Calin: So when we installed Linkerd, the Helm chart for Linkerd wasn't that advanced, so we decided to just create an in-house script. The plan was always to get to the Helm chart with Linkerd, but we just haven't had time to do that, so we just do it via script, and there isn't really downtime, or if there is, maybe a couple of minutes max. So we haven't seen much downtime. And I think the question was specifically about Terraforming infrastructure...
Matt Young: We're in the same place. We initially installed just manually, by hand, because the Helm charts didn't exist at all. We're using Terraform for infrastructure, so the cluster itself is Terraformed, but the workloads themselves we don't have in Terraform. One approach that we found works pretty well so far for having GitOps methodologies, but with Helm as well, is to use Flux CD. However, we're using a mixture, and we're looking at the Helm operator, but Flux actually has some capacity to template things out.
William King: We approach it at the cluster level: we pick a cluster out of rotation, more from a Kubernetes federation perspective, and update from that. We actually only started using Linkerd after the Helm charts were solid, so that was one of the hurdles that really had to be overcome before we started trying it out.