From YouTube: Webinar: The Open-Source Observability Playbook
Description
Observability within distributed systems is essential to understanding how your application is performing and establishing application reliability. In this talk, we will discuss the rise of microservices in the cloud, the pillars of observability (metrics, logs and traces), and open source tools for tracing such as OpenTelemetry, as well as OpenTracing and OpenCensus. The goal is to understand the tools themselves, as well as discuss best practices in using them. Using the right tools and best practices can accelerate application development, ensure that applications perform as expected, and reduce user impacts.
Presenter:
Hen Peretz, Head of Solutions Engineering @Epsagon
A
The Open-Source Observability Playbook. I'm Jerry Fallon, and I'll be moderating today's webinar. We'd like to welcome our presenter today, Hen Peretz, Head of Solutions Engineering at Epsagon. Just a few housekeeping items before we get started: during the webinar, you are not able to talk as an attendee. There is a Q&A box at the bottom of your screen; please feel free to drop your questions in there, and we'll get to as many as we can at the end.
A
This is an official webinar of the CNCF and, as such, is subject to the CNCF Code of Conduct. Please do not add anything to the chat or questions that would be in violation of the Code of Conduct. Please be respectful of your fellow participants and presenters. Please note that the recording and slides will be posted later today to the CNCF webinar page at cncf.io/webinars. With that, I'll hand it over to Hen to kick off today's presentation.
B
Thanks, Jerry. Can you guys see my screen?
B
So let me just do a quick introduction about myself. I'm Hen Peretz, and I'm running the solutions engineering team at Epsagon. We are building a monitoring solution for modern applications, whether microservices or serverless: everything that is basically distributed.
B
Our tool gives you the ability to add tracing and logging pretty seamlessly, and also to correlate between metrics, which is what we call full observability for those modern applications. In this talk, I'm going to cover quite a few things.
B
The first thing I want to do is some sort of recap of the old-school observability methods that are currently used, I'm guessing, by a few of you already here in the crowd, and how you are able to actually achieve observability easily. I'll also do some kind of live demos; hopefully nothing will actually blow up.
B
It's been a while since I did that. Something we're also going to touch on is how open source tools today give you the ability to implement some sort of do-it-yourself observability. They have improved significantly throughout the years, and I think they can be an excellent start for people who want to start off by just gaining some more understanding of how the system actually works. But first things first.
B
So let's just ask ourselves: why do we even need monitoring? It can be for various reasons, but I think the main one is just to make sure that our business works. If we have a website that is currently running and accepting payments from customers, or getting some orders inside the website, and that website is actually working properly, that means that our business is actually working properly. And then we can think about what types of aspects we want to monitor.
B
I think I read this in Google's SRE book; I'll share the links later.
B
There are four golden signals that you want to get from your system in order to understand whether it's working properly and whether it's healthy, and they are pretty intuitive and logical to understand. The first one is latency: if your system is suffering from large latency when it comes to requests, it's affecting your customers, and they are having a bad experience. The second is traffic: how many people are actually going into that website. If that traffic spikes or drops low, that's something you want to monitor in order to be able to understand whether your system is healthy or not.
B
The third is errors. This is something that is not that trivial when it comes to distributed systems; it's not that easy to understand whether you have errors in your system, and just being able to put some sort of spotlight on the errors can be extremely useful. The last one is saturation, which means how many of my services are being highly, frequently accessed.
B
So all those monitoring tools are pretty much agent-based, and the downside to that is that they only collect host data, and they only collect metrics. Where that becomes a problem is that it doesn't give you a full understanding of what's going on within the services. You're able to see that your payment service, for example, is running in, let's say, a container, so you can see the CPU and you can see the metrics; but that doesn't really give you enough understanding of whether the database is being heavily accessed, or whether something inside your own business logic is actually causing the errors, which is what we actually need in order to troubleshoot.
B
We need some more debug data there, so metrics are not enough. This is the first place we try to go: whenever an engineer gets a call saying that one of the services is acting out, or that something is acting a little fishy, the first thing you want to do is get as much debug data as you can, and we go directly to the logs in order to get that. And today, this is how logging works.
B
If you think about it, you have your logs being written to the output, probably to some file in your containers, or you remotely move them to some other vendor, and they're basically being dumped either locally or remotely. An agent just collects that data and sends it either to your own proprietary Elasticsearch that you run yourself, or maybe you actually push it to some log vendor, to be able to handle high traffic.
B
So everything is being done using agents that move logs from one place to another, and if you look at what is actually being done today, those methods of logging and monitoring can actually work pretty well.
B
That is, if you have a monolithic application and you have only one system. But think about taking logs from different types of services in your system and making sense of them; that's where the hard part comes in. And if we think about how software has changed over the years, you can see the trend in probably any graph that you will look up online: there's a huge trend of companies making the shift from monolithic applications to distributed systems. Lambda adoption has been very high.
B
Serverless has been highly adopted, and so have containers. A lot of companies made the switch to containers, which makes their systems much more highly distributed. And that shift makes a lot of sense, because it's much easier to develop: as a developer, I don't need to worry about the host that my code is running on.
B
I only need to worry about the business logic, and I'm able to use as many third-party APIs as I want in order to query different types of data sources around the internet. If I need some service, I can just use an existing one instead of implementing it myself, and that has significantly improved the pace at which we can develop. As a startup company, it's extremely useful to know you have those tools available online: if I want to implement payments, I can just use Stripe.
B
If I want to run something and not worry about the host, I can just run it on a Lambda function and pay as I go. This allows me to move my business much faster, but it also comes with a few downsides when it comes to actually understanding what's going on. It sounds a little bit shiny at the beginning, but once you run it in production, and I'm pretty sure most of you are familiar with that...
B
It becomes pretty hard to track those things down. To continue with the challenges that engineers and DevOps face: troubleshooting becomes extremely hard, because you're not sure whether just using those logs and metrics can be very efficient for a highly distributed application. And if you look here on the right, we have a few services.
B
Each line represents a service, and the only thing that is actually visible is the communication throughout the services. As an engineer, to be able to actually identify a trend throughout my system, I need to correlate between different types of logs, and that is something that is not possible using the basic logging capabilities. And if I can't really properly monitor my system, how can I actually develop new things?
B
This just leads us to what we call the three pillars of observability. You probably saw it around the web, sometimes referred to as MELT, which stands for metrics, events, logs, and traces. What it means is that, in order to gain full observability of distributed systems, I need to be able to take the metrics, take the traces, and take the logs, and combine them all together.
B
That gives me a clue of what's going on end to end through my system. That is the only efficient way to not only know what's going on in production, but also to be able to see what's going to happen once I push new services through my development process. And when it comes to troubleshooting, it's going to be extremely useful to have all those capabilities at hand.
B
Now I want to start off with the first part. A lot of this talk is going to be pretty technical, just showing you how we can take a Python application that is running on Flask, which is probably the most common framework today, and try out how we can implement logging best practices. Hopefully you can actually leave this talk with something useful, something that you can try out with your own teams. So I'm going to talk a little bit about logging.
B
Logging is, I think, the number one tool for engineers, the number one debugging tool. It allows you not only to display appropriate information about your system, but also to assist you whenever you want to debug something: you just print something to the screen. Now I'm just going to list a few best practices, and then we're going to run through a few examples.
B
For me, the top best practice for logging is, first of all: print things in JSON; print things structured. The reason I'm saying that is that if you're just going to print some plain text that tells you service A called the database, it's going to be extremely hard for you to scale up with that and put it into a log aggregator in a way that will actually make sense. But if you're going to put it in JSON-structured data...
B
Then you'll be able to go to your log aggregator, whether it's Elastic or not, and once it's indexed, you can actually filter all the events that contain that field without getting any false results. If you're going to use something that is not structured, I'm going to share what that actually looks like. And I think the most important thing is to actually automate that, because let's remember that everything we print to the screen is stuff that we actually wrote in advance.
B
So if we are looking to improve, let's say, our logging capabilities or observability capabilities, we must be able to not think all the time, while we code, about what we are actually going to print. We need to have that process fully automated, because otherwise it will just be prone to errors. We don't want a situation where every time we write one line of code, we then write another line of logging code.
B
That just makes your code look bad and also very hard to maintain, because for every change you make, you also need to change the prints, and eventually you're going to start to have some inconsistency. So I'm just going to jump into the demo. We're going to use Python and Flask; hopefully it will work, since I haven't done it for a while. Let me see here, so I clear this. There it is, so let me just actually jump to the code.
B
Okay, so I'm just going to run here. What we actually have here is a Flask application; it's pretty straightforward. You import Flask, and what you can do with it is actually pretty neat, because I am able to use the logger, if you are familiar with it.
B
This is just an example for Python, but if you're familiar with the logging library, you're also able to override the default formatter, which means that every time I print to the screen, even if it's a warning or anything like that, it will put it in a structured manner, meaning that I no longer need to format every print myself.
B
If you only change and add an additional class, called StructuredLog here, it will be able to automatically change the way it prints to the screen. And what we have here is just a simple hello endpoint that is accepting anything coming from the default website, and it's just printing a warning that a user got in at some endpoint. Now, actually, I think this is going to work better, like this.
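As a minimal sketch of the idea described here, a custom formatter on Python's standard `logging` module can turn every log call into a JSON document. The class name, service name, and stage values below are illustrative assumptions, not the exact code from the demo:

```python
import json
import logging

class StructuredFormatter(logging.Formatter):
    """Render every log record as a single JSON object."""
    def format(self, record):
        payload = {
            "service": "orders-api",   # hypothetical service name
            "stage": "production",     # hypothetical deployment stage
            "level": record.levelname,
            "message": record.getMessage(),
        }
        return json.dumps(payload)

# Attach the formatter once; every later log call is structured automatically.
logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(StructuredFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.warning("user got in")
```

Because the output is valid JSON, a log aggregator such as Elasticsearch can index the `service`, `stage`, and `level` fields and filter on them without false matches.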
B
Yeah, I'm probably breaking the code now, but I just want to show you how easy it is to actually implement that kind of thing. I'm just making a minor change, because I just noticed that this screen doesn't really print anything.
B
Now we're just going to curl it. Perfect. So I just ran a request to my Flask application that has been running, and now we can see the structured log. You can see the log here is pretty structured; I can put it into Elastic and easily index it, and this means that for every request, I know the service name, and I know the stage, whether it's production or not.
B
I know the level, and I know the message. Now, if you want to know how many errors I had in production from that service, you can just filter by service, stage, and the level of the message, which is pretty useful. It's not just printing something to the screen that will later on be pretty hard to follow. Moving on, the next thing is monitoring best practices.
B
And what I mean by that is that I need to have a single dashboard where I can view all of the resources in my services, whether I'm currently monitoring my database, my Flask application, or anything else that comes to mind. I don't want to switch between screens, because every time there's an issue, I'm going to have to ask: okay, where do we monitor our database?
B
And if I want to actually take those things and put alerts on them, it's going to be extremely hard to maintain once you have different types of dashboards. So if you're choosing one dashboard, just go with it, and make sure that all the metrics are being pushed to that dashboard. The second thing is: define the critical metrics that you have. You don't want to be alerted late at night on anything that breaks in your system.
B
You want to define the critical metrics where you say: at that point, it's something that you have to fix. Meaning that if my database is a little bit loaded, I don't want to get any alerts; but whenever the database reaches the point of no return, when it's going to be stuck and no longer available for my customers, then that's the threshold I want to be pointing at. The third thing is: don't think about metrics only as infrastructure metrics. Think about metrics as something that can help your business, so make sure you use custom business metrics. Take the example of a website that is currently, I don't know, shipping things to customers and has some orders in line.
B
If I have a dashboard that tells me that this month, or this week, or in the last few days, we got 20 orders, or 10 orders, or something like that, I can also understand what the state of the business is. And it's something that's pretty cool to share with everyone who is not specifically an engineer; it can be shared with the sales engineers.
B
They can also use those dashboards, and I also highly encourage you to take all those groups in your system and put them on the same dashboard; obviously give each one of them their own view, but everything needs to be in the same place. It will be much more useful and efficient for your company. And if we think about what we want to monitor at the application level: take, for example, every call that we are making to some API.
B
We want to see what the average duration is. If we push messages onto a queue, we want to see how many messages are currently sitting in the queue. It's just making sure that you know which type of resource in your system is prone to errors; for example, I might have a queue that is currently handling all my deliveries.
B
If that queue breaks, for example, then I want to be alerted on that. But I also want to be aware whenever there is a business issue: for example, if that message queue has received no orders at all in the last hour, and that's typically something that doesn't happen, then I also want to put alerts on that. And there are a lot of other things you can go through, like the HTTP codes that you want to monitor.
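To make the custom-business-metrics idea concrete, here is one hedged sketch: emitting a business metric as a structured log event, which a log aggregator can then count and alert on. The `emit_metric` helper and the metric names are hypothetical, not part of any specific tool:

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("metrics")

def emit_metric(name, value, **dimensions):
    # Hypothetical helper: a business metric is just a structured log event,
    # tagged with dimensions so the aggregator can filter and threshold on it.
    payload = {"metric": name, "value": value, **dimensions}
    logger.info(json.dumps(payload))
    return payload

# e.g. one order placed, tagged with service and stage for later filtering
emit_metric("orders.created", 1, service="orders-api", stage="production")
```

An alert rule in the aggregator could then fire when the sum of `orders.created` over the last hour is zero, matching the "no orders in the last hour" example above.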
B
But I think the most important thing when it comes to those metrics is to have your services emit them and print them, for example as the structured log that we saw, and to do that automatically, which is something I'm also going to jump in and show now.
B
So I'm just going to jump to this example; I'm using Flask in this example as well, so I'm going to continue with it. Let's say we want to measure every request coming to my server and know what the average duration is. I need to be able to calculate that, and my first best-practice tip was to not do things manually, so we're going to use what's called a middleware in Flask. It's just the ability to hook in every time there's a request coming in to Flask.
B
The Flask application is pretty standard, just something that returns hello world. The part that is pretty juicy comes here in the middleware, which actually hooks into the before-request and after-request with those two functions. What we do is, first, whenever the request comes in, we time it; we take the current time. The second thing we do is print to the screen what the duration was, along with some additional information. Let's see what that actually looks like.
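A minimal sketch of the middleware described above, assuming a plain Flask app (the logger name and the JSON field names are assumptions for illustration):

```python
import json
import logging
import time

from flask import Flask, g, request

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("apm")

@app.before_request
def start_timer():
    # take the current time when the request comes in
    g.start_time = time.time()

@app.after_request
def log_request(response):
    # print the duration plus some additional context as a structured log
    duration_ms = round((time.time() - g.start_time) * 1000, 2)
    logger.info(json.dumps({
        "path": request.path,
        "status": response.status_code,
        "duration_ms": duration_ms,
    }))
    return response

@app.route("/")
def hello():
    return "Hello, world!"
```

Because the hooks run for every route, the timing data is collected automatically: no handler has to remember to log its own duration.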
B
It's called apm. Cool, so our Flask application is currently running; we're going to send another request. Awesome. So now, if I just go back to the code, you can see that we're also still using the structured log that we saw. That structured log gets a record to print, and if you pass something in under the message, it also makes sure to add that as additional information, along with the file name and the service.
B
The stage that we saw before still keeps the same structure, so it can also be extremely useful for indexing later on. And if I go back to the example: we ran it, and we saw the level, which is warning; the stage, which is production; and the new thing that we just added, which is how much time it actually took. I can see the duration was around nine milliseconds here, and the status code was 200.
B
The fact that it's JSON means I can actually parse it, especially in Elastic or anything that supports log aggregation. So I can filter out all the requests, or do aggregations like how many requests were 200 and how many were 500, and set that as my metric. Or even duration: if this API is going to take more than 30 seconds, then I want to be alerted, because the AWS API Gateway times out around that point.
B
Now, what we have so far is pretty nice, but there are a lot of things that we know are missing. It's really nice if you have an application that is currently running at a very low scale; but if we think about a very distributed system, we need to be able to correlate those metrics and logs.
B
We need to be able to take different services in our system and make sure we correlate between them, and this just leads me to explaining distributed tracing. I'm sure you guys have heard about this; probably a lot of you are actually practicing it and using it in your teams. I'm just going to show a few kinds of do-it-yourself approaches.
B
That is, distributed tracing, and how to use open source tools to implement it. Like I said, open source tools today have made a huge improvement when it comes to that, so it can be pretty easy to do, and I'll also show that in a few minutes. But before that, let's just step back and understand: why do we even need distributed tracing, and what is it? A distributed trace is basically storytelling.
B
For example: a client went to my website and pushed an order, and that triggered a service that talks to a database, and that database put a message on a queue, and that queue triggered another thing, and then I returned, and then I sent an email to my customer. So I'm closing a full loop of asynchronous events, but they're all initiated from the same place. Being able to draw that and understand how it flows is how distributed tracing actually helps us.
B
It's just the ability to understand what is happening from end to end in my system. Doing that yourself is easy, but it requires a lot of maintenance. If we're talking about small companies, it might work; but if we think about scaling up, it means that your entire engineering team needs to be aware of it and needs to do it.
B
I think only very tech-savvy companies, like Uber or companies like that, can allow themselves to have full teams dedicated to observability code who share those tools with the other teams. And the truth is that today, when a company is just getting started, it's pretty hard to do that yourself, and it doesn't even make sense to do it yourself, because you want to focus on your business.
B
You want to be able to deliver products to your customers rather than have a niche team that does solely observability. And if I just talk about what we have today in terms of the landscape of tools that allow you to do distributed tracing: as part of the CNCF, we have OpenTracing and OpenCensus, which have now joined into OpenTelemetry, and also Jaeger and Zipkin.
B
These allow you not only to generate traces, but also give you the tools to ingest and visualize them. When it comes to OpenTelemetry, that's the name of both the standard and the set of libraries that allow you to do some sort of automated instrumentation.
B
If you are not familiar with them, I strongly invite you to try them out; they're pretty easy to set up, and we're just going to do that in the example very soon. Now, talking about how to actually do it: in this example, we're going to use OpenTracing, and in order to generate traces, we need to understand what we want to have inside those traces.
B
A trace means that every time a service runs, like we saw with the request going through my Flask app. If we just track back here, I can see that the request, once it ran, just printed a message to the log, and for me, that message can be what I call a trace. A trace is just a log print that tells me what this application did.
B
Which resources it actually used, how much time it actually took, whether there were exceptions or not; just some raw information about what this operation actually did. And OpenTracing is basically a standard that allows you to code and create those kinds of traces; in their lingo, such a record is called a span. The first thing you want to do is to be able to instrument every call you have.
B
So if I have a Flask request that is also accessing the AWS SDK and putting an object on S3, or putting a message on SQS, or calling a third-party API, or even calling my Postgres database, I want to be able to automatically trace those without the need to manually create those spans. This means that for every request and response I have, I'm going to create a span, and I also want that span to carry some context, meaning that if I'm calling a database, it's not enough for me to understand how much time it took.
B
I also want to understand which table, which query, and some additional information that can help me debug later on. And the most important thing when it comes to creating spans is to be able to think about a span not only as part of a single service, but in an end-to-end scope: like I said before, a request can go and put a message on a message queue that will later asynchronously trigger another service.
B
I'm sorry; do say if I'm going too fast, but you can also just watch it later and put me on half speed. That's what people usually do, right?
B
Okay, so let me just jump into the tracing. I'm using the OpenTracing library, and I'm just using their tracer and their format here. If we remember the Flask middleware from before, with the before-request and after-request hooks: this is my Flask application, still the same hello world; it doesn't do anything fancy, but the middleware here is now creating spans. Before the request actually happens, I want to extract the trace context from the HTTP headers of the Flask request.
B
I can see some additional information, and I can also use that context to create a new span. In this example, if, for example, someone called my Flask application and that calling service was also traced, there will be a correlation ID in the headers that will tell me: hey, you belong to this distributed trace, and now I can continue it. Every time you create a span, you're either continuing a distributed trace or you're creating a new one.
B
So here, we're just going to create one, or be a continuation of another one. Now we also want to add some additional context, like what the URL is and which user actually requested it, so I can later on filter by user or by that URL, which will allow me to understand things much more powerfully. I also want, after the request, to be able to record the status code and the duration that it actually took. So let's see how it actually looks.
B
Good, yeah. So now we have the captured span, and everything that is printed is just the actual span; it's not printed in a very structured way, but the span itself is being sent to Jaeger, and we can show later what that actually looks like. Now we have the traces: that Flask application created a span, and that span is being ingested.
B
So I can use Jaeger, for example, which can be pretty useful for ingesting traces, and it can handle any scale you want; it all depends on where you actually run it. It also allows you to do some searches around the traces and do some visualization. There are also other tools that allow you to set up alerts, meaning that every time there is a span that contains an exception, you want to be alerted on it, and a good example of a tool for all of this is Jaeger.
B
So if you know that you have a kind of flow in your system that is currently handling orders, like the example that I'm using throughout this talk, you can understand how much time is being spent in each part of your system. And the most important thing about creating spans is tagging them and adding some context, so that once you troubleshoot, you can see not only identifiers, like the user ID, the customer ID, the device ID, but also things that are relevant to your flow control.
B
You can see what type of event is actually happening, things that are related to your business logic, and you can also add business metrics to each span. You can record how many items I currently have in the cart, how many minutes have actually been watched in my video offering, or anything of that sort. So adding context to spans can be extremely useful, not only for troubleshooting, but also for later on searching, filtering, and then creating metrics around that.
B
It's another thing that can be extremely useful and life-saving for a business. The next thing that is super important to add to a span: I'm saying, let's not stop there. Let's not only add the things that we think we need; let's just add everything. For example, if I'm calling a database, I don't only want to record what type of table, or what the length of the query was, or what the result is.
B
I want to see the entire query to that SQL or NoSQL database, because I can use that query to see what my most frequent query to that database is, and I can do optimizations around the database. I can see exactly which query ran and caused an error, because getting a notification that my database failed is not enough for me to troubleshoot; I want to understand exactly what happened. And let's say, for example, I'm now querying Stripe for a payment.
B
They also give you very informative details about what actually went wrong, and think about the time this will actually save you. If, for example, something actually broke and you have a 500, you know the exact service, you know the customer it happened to, you know everything; but now you're trying to understand what actually happened. Without that, the engineer will go to your dev environment and will try to, you know, reproduce it, and reproducing takes time.
B
It's not accurate. And if you're already printing, in advance, all the information that you need, then you'll be able to troubleshoot on the exact events that happened; you can see the exact event that happened with all the relevant information.
B
Now, this just leads me to a mindset that I think you guys should actually use: use tracing not just as one of the pillars alongside the logging and metrics pillars, but as the glue between them. I can go from traces to the exact log; I can print exactly where the logs reside through the trace that I print; I can print things like the container ID, or whether it's running on a Lambda.
B
What the Lambda function is; I can easily correlate between them, or even print the request ID of that Lambda function. Then I can easily correlate between the trace and the environment, and also vice versa: I can go to the environment and to that invocation, search for that request ID throughout my logging, and then I will find the exact traces that happened. And that is possible because I'm using structured and automated logs, so I don't need to worry about whether my team actually does that.
B
So we're coming very close to the end. I'm just going to talk very briefly about what are, for me, the best practices to gain full observability of a system. When it comes to that, I think teams just need to have things automated for them. You don't want your engineers focusing on doing it themselves: creating spans, printing things to the logs.
B
You don't even want them thinking about that. You want to have a tool that you just plug in, and make sure that everything is already automatically captured; you don't even need to maintain it. Because if you need to make the decision whether you're going to focus on fixing your delivery system or fixing your logging library, you're always going to choose the delivery system, and then it's going to be extremely hard for you to track back and actually do it. So you're going to end up with some legacy observability code.
B
That is pretty hard to maintain, and, especially for small startups but even for huge enterprises, it's not something you want your team to actually focus on. You need to think about it exactly the way I would not implement my own payment system and would instead use the third-party tools that are currently available.
B
I don't want to implement observability in my team either, because there are companies that do exactly that (side note: Epsagon), but I'm saying this as an engineer, and it's something we also encourage in our own company. We also use Epsagon to monitor our own Epsagon environment, and this means that our engineers can focus on actually building the system and not worry about monitoring it. And you want that observability tool to support any environment.
B
So if you're going to switch between serverless, Kubernetes, ECS, AKS, Azure, GCP or anything on-premise, you don't want to worry that you need to change the way your code operates. You want to have a tracing library or tracing tool that will work in any environment, because companies often make this shift, and the first thing they are curious about is how they will be able to monitor once they move to Azure or something like that.
B
So you need to eliminate that from your thinking, because you can use a tool that will support any environment. And you want to be able to connect every request that belongs to a transaction. For example, you might have a Lambda function that is putting a message on a database, which triggers another function, which is talking with one of your containers or one of your legacy services.
B
You want to be able to connect all of those together into one transaction, and then have the ability to take that data and actually search and analyze it, so you can gain business-related metrics and also be able to search very quickly and troubleshoot. That alone will help you quickly pinpoint those problems.
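The chain described above can be illustrated with a minimal sketch (this is not a real tracing library; in practice a tool such as OpenTelemetry handles this for you): each hop injects the trace context into the outgoing message, and the next hop extracts it, so every span ends up sharing one transaction ID:

```python
import uuid

SPANS = []  # stand-in for a tracing backend

def start_span(name, carrier=None):
    # Continue an existing trace if the carrier holds one, else start fresh.
    trace_id = (carrier or {}).get("trace_id") or uuid.uuid4().hex
    span = {"name": name, "trace_id": trace_id}
    SPANS.append(span)
    return span

def inject(span):
    # Attach the trace context to an outgoing message (e.g. queue attributes).
    return {"trace_id": span["trace_id"]}

# Hop 1: a function writes a record that triggers the next service.
producer = start_span("lambda:enqueue-payment")
message = {"body": "charge customer 42", **inject(producer)}

# Hop 2: the triggered service extracts the context and continues the trace.
consumer = start_span("container:process-payment", carrier=message)

# Both spans now share one trace id, so a backend can stitch the full
# transaction together and let you search it end to end.
assert producer["trace_id"] == consumer["trace_id"]
```

The names and message shape here are invented for illustration; the point is only that the context travels inside the same carriers (queue messages, database records, HTTP headers) that connect the services.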
B
So, just to summarize: modern applications today are very distributed. They are using a lot of abstraction layers. You don't need to worry about servers; if I'm talking about Kubernetes, you don't need to worry about creating those containers. Everything is being managed automatically and everything is abstracted away from you, because you want to focus on your business logic. And in order to monitor that, it's not the standard monitoring and logging that will assist you.
B
You need to use something much more advanced that will also inherently implement distributed tracing, and this is exactly why distributed tracing becomes a much more crucial component in any environment. And just as a side note from me: if you're running a small company and you want to try all those things, those open source tools are definitely available, and they are pretty good.
B
But if I'm thinking about scale and production, don't implement something yourself unless you actually need to. If you want to be professional in payments, then yeah, implement your own payment system. But if you only want to use it as a tool, then you need to choose the best tool for you to use, and obviously you don't want your engineering team focused on building stuff like that. So that's it for me. Now it's time for the Q&A session.
A
Thanks. Thank you, Hen, for a wonderful presentation. We now have some time for some questions. I believe we already have one here in the Q&A box: quote, "support any environment". Would you be able to provide an example of a tool that is limited to a set of environments?
B
Yeah. So if you think about running your own, if I take Jaeger as a service, for example, that you can run in your own environment: if you're going to do the switch to another environment, then you're going to have to also copy that over to the other environment.