VMware The Podlets - A Cloud Native Podcast, 22 Nov 2019

From YouTube: Understanding Observability (The Podlets, Ep 4)

Description

Observability - what the term means, how it relates to the process of software development, and the importance of investing in a culture of observability.

For the show notes and transcript: https://thepodlets.io/episodes/004-observability/

Feedback and episode suggestions:
https://twitter.com/thepodlets
https://github.com/vmware-tanzu/thepodlets/issues
info@thepodlets.io

Hosts
https://twitter.com/carlisia
https://twitter.com/kris-nova
https://twitter.com/mauilion

A

Welcome to The Podlets podcast. Hi, everyone, welcome back to The Podlets show, episode 4. Today we're going to talk about observability. I am Carlisia Campos. Today here on the show with me are Duffie Cooley.

B

How are you doing, folks? I'm Duffie Cooley, staff field engineer here at VMware, and I'm looking forward to this topic.

A

Also with us is Kris Nova.

C

Hey, everyone. I'm Kris Nova. I don't know, what do I do? I'm a developer advocate, I code a lot, I hang out in Kubernetes.

A

I don't want to be left out. I'm an engineer on the open-source project called Velero, which does backup and recovery for your Kubernetes applications. So what's going on with observability? Why do we care?

C

Oh, that's the million-dollar question. I don't know. I have a lot of thoughts on observability. I feel like it's one of those words that's kind of like DevOps: depending on which day of the week you ask a specific person what observability means, you'll get a different answer.

B

Yeah, I agree with that, but it seems like one of those very hot topics. It feels like people often conflate the idea of monitoring and logging of an application with the idea of observability and what that means. So I'm looking forward to digging into the details of that.

C

What does observability mean to you?

B

So, my take: observability is a set of tools that can be applied to describe the ways that data moves through a distributed system, whether that data is a particular request or a particular transaction. In this way you can actually understand the way all of these distributed parts of the system that we're building are interacting. As you can imagine, things like monitoring and metrics are a part of it, right? Being able to understand how the code is operating for this particular piece of the system is definitely a key part of understanding how that system is operating. But when we think of it as a big distributed system with terrible network daemons in between, and lots of other stuff in between, I feel like we need a higher level of context for what's actually happening between all those things, and that's where I feel the term observability fits.

C

Yeah, I think I generally agree with that. I've got a few nuances that I like to pick out, and I'm highly opinionated. But yeah, I hear a lot about it. I have my own ideas of what it means, but why do we need it?

B

I want to hear your ideas of what it is. How would you define it?

C

I mean, we have an hour. Okay, listen to me. Basically, I'm an infrastructure engineer. I wrote this book, Cloud Native Infrastructure. Everything to me is some layer of software running on top of infrastructure, and observability, to me, solves this problem of how do I gain visibility into something that I want to learn more about. My favorite analogy for observability is this: have you all ever been to a gas station or a convenience store where there's a height chart on the front door? It'll say 4 feet, 5 feet, 6 feet, 7 feet. I always wondered what that was for, and I remember I went home one day and googled it. It turns out it's there in case the place ever gets robbed: as the person runs out the front door, you get a free measurement of how tall they are, so you can help identify them later. To me, that's the perfect description of observability. It's cleverly sneaking things into your system that can help you with a problem later downstream.

B

I like that, yeah.

A

So observability is sort of a new term, because it's not necessarily something that I, as a developer, would jump in and say, "Oh gee, my project doesn't do observability. I need it." I understand metrics, and I understand logging and monitoring, and now I hear observability. Of course, I read about it to talk about it on the show, and I have been running into this word everywhere. But why are people talking about observability? That's my question.

C

Well, I think this goes back to the gas station analogy again, right? What do you do when your metaphorical application gets robbed? What happens in the case of a catastrophic problem, and how do you go about preparing yourself in the best way possible to have an upper hand at solving that problem? Some guy robbed a store and ran out the front door, and then we realized: oh, we have no idea how tall he is. He could be 4 feet tall, he could be 6 feet tall. And then we learned the hard way that maybe we should start putting markers on the door. I feel like observability is the same thing. But I feel like people just kind of wake up and say, "I need observability, I need all of these bells and whistles, because my application, of course, is going to break." And I feel like, in a weird way, that's almost a cop-out. We should be working on hardening our application before we work on preparing for catastrophic failure.

A

But why didn't I hear the word observability ten years ago, or even five years ago? I think it was about two years ago.

B

I'll argue that the term observability is coming up more frequently, and it's certainly a hot topic today, because of, effectively, context. It still comes back down to context. When you have built a cloud native architecture for your application, you've got a bunch of different services that are intercommunicating, or maybe all communicating with some particular shared resource, and things are misbehaving, you're going to need the context to understand how it's breaking, or at what point it's breaking, or where in the tangled web that we weave the problem is actually occurring, and whether we can measure it at that point.

Traditionally, in a monolithic architecture, you're not really looking at that. Maybe you break up the monolith, you set up a couple of set points, you look at the way particular code paths work. Or, if you're on top of your game, you might instrument your code in such a way that it will emit events when particular transactions or particular things happen, and you're going to be looking at those events in logs and looking at metrics to figure out how this one application is performing or behaving. With observability, we have to solve that problem across many systems.

A

That is why I put in the show notes that it has something to do with the idea of cattle versus pets. I'm saying this because Duffie was asking me, before we started recording, why that was in the show notes. Correct me if I'm wrong, but I think you were going in the direction of saying you don't see the relation. The relation I was thinking about was exactly what you just said. If I have a monolith, I'm looking at one thing, I'm looking at one log; I can treat it as my little pet. As opposed to when I have many microservices interacting: I can't even try to treat them as pets, because it's too much. So the reason why observability is necessary sounds to me like a problem of scale and complexity.

C

And I think that explains why we're just now hearing it, too. I'm trying to think of another metaphor here. I guess today is going to be a metaphor day for me. Oh, got it. Okay, so I just got back from London last week. I had gotten off the Tube, and I remember I came up to the surface with the blinding light in my eyes, and all of a sudden I saw a sign for Scotland Yard. I was like, "Whoa, I remember this from all the detective sleuth stories of my childhood." It dawned on me that the entire point of this part of London was to help people recover from disasters.

Then I thought about why we don't have Scotland Yard type places anymore, and it's because we have security systems and different things in place that we had to learn the hard way we needed, and we had to develop technology to make that easier for us. I feel like we're just at that cusp of our first wave of metaphorical security cameras with observability. We're at that first wave where we can instrument our code and start building our systems out with this idea of "I want to be able to view it or observe it over time," in the case of trying to learn more about it or debugging a problem.

A

How do people handle this? I'm asking this question because, truly, I have yet to have this problem on my project, that I need to do observability in my project, that I need to make sure my project is observable, other than the bread-and-butter metrics and logging. That's what we do. We don't do anything further than that, and I don't know if those things constitute observability. But on what Nova just said, my question is this: we want to look at this stuff later, but we're also talking about cattle. Supposedly, your servers are ephemeral. They can go away and come back. How do we observe things if they have gone away?

C

Yeah, we get into this exciting world of how long do we persist our data and which data do we track. There are a lot of schools of thought and a lot of different opinions around what the right solution is here, but I think it just boils down to this: every application and its set of concerns is going to be unique, and you're just going to have to give it some thought.

A

Should we talk some more about that? Because that sounds very interesting.

C

Yeah. I guess we should probably just start off with: given a simple application, concretely, what does it mean to build out, quote unquote, observability for that application?

B

There's this idea, in a book called Distributed Systems Observability by Cindy Sridharan (I'm probably slaughtering her name), that there are three pillars. The three pillars are events, metrics, and tracing. These are the three pillars of observability. So if we were going to lay out the way that those things might apply to just any old application, like a monolith, then we might look at how...

C

Can we just use a WordPress blog as an example? It's got a datastore, it's got a thin layer of software, and an API.

B

Sure. So, a WordPress app. The first thing we might try to do is figure out what events we would want to get from the application, and figure out how to instrument our application such that we're getting useful data back as far as the event stream. In my experience, the things you want to instrument in your application are any calls that your application is going to make that might represent a period of time. If it's going to make a call to an external system, that's something you would definitely want to emit an event for if you're trying to understand where things are going sideways. How long it took to make a query against the database in the back end of a WordPress blog is a great example.

C

Question: you said the word instrumentation. My understanding of instrumentation is that there's a bit of an art to it, and you're actually going in and adding lines of code to your application. On line 13 we say "starting transaction," on line 14 we make an HTTP transaction, and on the next line we say "the event is now over." Then we can discover that we made this HTTP transaction and see where it broke, if it broke at all. Am I thinking about that right?

B

I think you are. But what's interesting about the reporting after line 14, where you're actually saying the event is over, is that I think we end up measuring this in both an event stream and in a metric, so that we can understand, over the last hundred transactions to the database, whether we are seeing any increase in the amount of time the process takes. Is this something we can measure with metrics and understand whether the value is changing over time?

Then, from the event perspective, that's where we start tying in things like: contextually, in this transaction, what happened? For this particular event, is there some way that we can correlate the event with, perhaps, a trace? We'll talk a little bit more about tracing, too. That way we can understand: okay, at two o'clock we see that there is an incredible amount of latency being introduced when my WordPress blog tries to write to the database, and it happens every day at two o'clock. I need to figure out what's happening there. To even get to the point where I understand it's happening at two o'clock, I need things like metrics, and I need things like events, which specifically give me that time correlation to understand: oh, it's at two.
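The instrumentation discussed in this exchange can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not anything shown in the episode: the names `timed`, `events`, `recent_durations`, and `save_comment` are hypothetical, and the in-memory list and deque stand in for a real log pipeline and metrics backend.

```python
import time
from collections import deque
from contextlib import contextmanager

# Hypothetical in-memory sinks (assumptions for this sketch).
events = []                           # event stream: start/end records
recent_durations = deque(maxlen=100)  # metric: durations of the last 100 calls


@contextmanager
def timed(name, clock=time.monotonic):
    """Emit a start and an end event around a block and record its duration."""
    start = clock()
    events.append({"name": name, "phase": "start", "at": start})
    try:
        yield
    finally:
        end = clock()
        events.append({"name": name, "phase": "end", "at": end,
                       "duration": end - start})
        recent_durations.append(end - start)


def mean_recent_duration():
    """Average duration over the most recent (at most 100) timed calls."""
    if not recent_durations:
        return 0.0
    return sum(recent_durations) / len(recent_durations)


def save_comment(text):
    """Stand-in for the blog's database write (hypothetical)."""
    with timed("db.write_comment"):
        return len(text)  # pretend this line is the slow external call
```

The event records answer "what happened in this transaction, and when," while the rolling average answers "is this call getting slower over the last hundred transactions." Both come from the same two timestamps.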

C

This is where we get into what Carlisia just asked about, which was how do we solve this problem of what we do when it goes away. Take the case of our 2:00 p.m. database latency; for lack of a better term, let's call it the heartbeat, the 2:00 p.m. heartbeat. What happens when the server that was experiencing that latency mysteriously goes away? Where does that data go? Then you look at tools: I know Prometheus does this, and Elasticsearch has the capability to do this. You look at how we start managing time series data, and how we start tracking and recording it. It's a fascinating problem, because you don't actually record "at 2:00 p.m., to this second and this fraction of a second, this thing happened." You record how long it has been since the previous event, so you're just constantly measuring deltas. It's the same way that Git works: every time you do a git commit, you don't actually write all 1,000 lines of software. You just write the one line that changed.
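The idea of recording the gap since the previous event, rather than every full timestamp, can be shown with a toy delta encoder. This is a sketch of the general technique only, assuming timestamps in whole seconds; it is not how any particular time series database implements its storage, and the function names and sample values are made up for illustration.

```python
from itertools import accumulate


def delta_encode(timestamps):
    """Keep the first timestamp, then only the gap to the previous one."""
    if not timestamps:
        return []
    gaps = [cur - prev for prev, cur in zip(timestamps, timestamps[1:])]
    return [timestamps[0]] + gaps


def delta_decode(deltas):
    """Rebuild the original timestamps by summing the gaps back up."""
    return list(accumulate(deltas))


# Hypothetical samples scraped every 15 seconds.
samples = [1574431200, 1574431215, 1574431230, 1574431245]
encoded = delta_encode(samples)  # [1574431200, 15, 15, 15]
assert delta_decode(encoded) == samples
```

With a regular scrape interval the gaps are small, repetitive numbers, which is what makes this representation cheap to store.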

B

I think both of you highlight a really good point around this whole cattle versus pets thing. This is actually something I spent a little time with in a previous life. The challenge is that, especially in systems like Kubernetes and other systems where your application is being scaled out or scaled down dynamically based on load, you have all of these ephemeral events. You have all these events from pods, or from particular instances of your application, that are ephemeral; they're not going to be long-lived.

This highlights a new problem that we have to solve, I think, when we start thinking about cloud native architectures: we have to be able to correlate that particular application with information that gives us the context to understand that, perhaps, this was this version of the application, and these events are related to that particular version of the app. When we made a change, we saw a great reduction in the amount of time it takes to make that database call, and we can correlate those new metrics with the new version of the app. Because we don't have a long-term entity that we can measure, because this isn't a single IP and a single piece of software that is not changing but any number of instances of our application deployed, you have to think fundamentally differently about this problem and about how you store that data. This is where the cardinality problem that you're highlighting comes in.
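One way to picture correlating measurements with an application version instead of a long-lived host is to attach labels to every sample. The sketch below is illustrative only: the label names (`app`, `version`, `pod`), the pod names, and the latency values are all invented, and the helper functions are hypothetical rather than any real metrics library's API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (labels, seconds spent in the database call).
# Pod names change constantly; the version label is what stays meaningful.
samples = [
    ({"app": "blog", "version": "1.0", "pod": "blog-a1"}, 0.40),
    ({"app": "blog", "version": "1.0", "pod": "blog-b2"}, 0.44),
    ({"app": "blog", "version": "1.1", "pod": "blog-c3"}, 0.10),
    ({"app": "blog", "version": "1.1", "pod": "blog-d4"}, 0.12),
]


def mean_by(samples, label):
    """Average the measured value per distinct value of one label."""
    groups = defaultdict(list)
    for labels, value in samples:
        groups[labels[label]].append(value)
    return {key: mean(values) for key, values in groups.items()}


def series_count(samples):
    """Count distinct label sets, i.e. how many separate series exist."""
    return len({tuple(sorted(labels.items())) for labels, _ in samples})


per_version = mean_by(samples, "version")  # compare old vs. new release
total_series = series_count(samples)       # grows with every new pod name
```

Grouping by `version` shows the improvement after the change, while `series_count` hints at the cardinality cost: every short-lived pod adds another distinct label set to store.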

C

Okay, I have an open question for the group. What is the scope here? To build on our WordPress analogy, let's say that every day at 2:00 p.m. we notice there's latency, and we've spent the last two weeks endlessly digging through our logs, trying to come up with some sort of hypothesis of what's going on, and we just can't find anything. Everything we've talked about so far has been at the application layer of the stack: instrumenting our application, debugging our application, making HTTP requests. What happens, what should we do, or does observability even care, if one of our hard drives is failing every day at 2:00 p.m. when the cleaning service comes by and accidentally bumps into it or something? How are we going to start learning about these deeper problems that might exist outside of our application layer? In my experience, those are the problems that really stick with you and cause a lot of trouble.

B

Agreed. Or somebody has scheduled a backup of your database every day at 2:00, so it locks the database for the duration of the backup. You know, like, wait, when did that happen?

C

Yeah, why did that happen? Somebody commented out the line in the crontab, and then the server got reset, and there's some magical bash script somewhere on the server that goes and rewrites the crontab.

B

These are the needles in the haystack that we've all stumbled upon one way or another.

C

Are we responsible for instrumenting the operating system layer, the hardware layer?

A

Isn't that what monitoring is? Some sort of testing from the outside, external testing that, of course, only gives us the information after the fact: the server already died, my application is already not available, so now I know. But isn't monitoring what would address a problem like that?

B

I think it definitely helps. I think what you're digging at, Kris, is correlation: being able to identify, in a particular period of time, what's happening across our infrastructure, not just to our application. The important part is how you even got to that time of day. How do you know that this is happening? When you're looking for those patterns, how did you get to the point where you knew that it was happening at two o'clock? If you know that it's happening at two o'clock because of the event stream, that gives you a time correlation. Now I have a time, and I need to scoot back to a macro level and...

C

Crank it up at 2:00 p.m.

B

Yeah. Globally, at two o'clock, what's going on in my world? I know that these are the two entities that are responsible. I know that I have a bunch of pods running on this cluster. I know that I have a database that may be external to my cluster, or maybe on the cluster. I need to really understand what's happening in the world around those two entities as it correlates to that period of time, to give me enough context to even troubleshoot.

C

So, I mean... Sorry, go ahead, Carlisia.

A

Yeah, no, it's: how do you do it, though? I'm going to go back to monitoring. Say I'm using an external service to ping my service, and my service is down. I'm going to get the timing right; I can go back and look at the information, the log stream. But would I know that it was because of the server? No. Should I be pinging the server, too? Should I be pinging every layer of the infrastructure? How do people do that?

C

That's kind of what I was alluding to. Where does observability at the application level stop and systems observability across the entire stack start? What tools do we have, and where are the boundaries there?

B

So I think this is actually where we start talking about that third pillar we were referring to earlier, which is tracing: the ability to understand, from the perspective of a particular transaction across the system, what entities that particular transaction will touch and where it spends its time across that entire transaction.

Say what I was trying to do was submit a comment on a WordPress blog. If I had a way of implementing tracing through that WordPress blog, I might be able to leave myself little breadcrumbs throughout the entire set of systems and understand where in this particular web transaction I am spending time. I might see that I begin my trace ID at the load balancer, and that load balancer terminates to this pod, and inside of that pod I can see where I'm spending my time: a little bit of time to load the assets and stuff, a little bit of time pushing my comment to the database. Identifying what that database is is an important part of that trace, so that I understand where that traffic is going to go next and how much time I spent in that transaction.

So again, this is down at that code layer. We should have some way of producing an event that is related to a particular trace ID, so that we can correlate the entire lifecycle of that transaction, that unique trace ID, across the entire process.
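The breadcrumb idea described here, where every event carries the same trace ID so one transaction can be followed across systems, can be sketched as below. This is a simplified illustration, not a real tracing library: the `Span` class, the span names, and the timings are all hypothetical, and a real system would propagate the trace ID over the network between services.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Span:
    """One timed step of a transaction, tagged with its trace ID."""
    trace_id: str
    name: str
    start: float  # seconds since the transaction began
    end: float

    @property
    def duration(self):
        return self.end - self.start


def new_trace_id():
    """Generate a unique ID, here imagined to be minted at the load balancer."""
    return uuid.uuid4().hex


def slowest_span(spans, trace_id):
    """Return the step where one particular transaction spent the most time."""
    in_trace = [s for s in spans if s.trace_id == trace_id]
    return max(in_trace, key=lambda s: s.duration)


# Hypothetical spans for one "submit a comment" request, plus an unrelated one.
trace_id = new_trace_id()
spans = [
    Span(trace_id, "load_balancer", 0.000, 0.005),
    Span(trace_id, "pod.load_assets", 0.005, 0.030),
    Span(trace_id, "db.write_comment", 0.030, 0.930),
    Span(new_trace_id(), "db.write_comment", 0.000, 0.020),
]
culprit = slowest_span(spans, trace_id)  # the database write in this example
```

Filtering by trace ID is what narrows the field: spans from other requests are ignored, and the remaining ones show which system to look at first.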

C

Interesting.

B

It helps us narrow the field and understand what all the bits are that are actually being touched, that are part of the problem. Otherwise we're looking at the whole world, and obviously that's a much bigger haystack, right?

C

So, one of the things that I've learned about Kubernetes, as I've been working with Kubernetes and explaining it to people and going out on the road doing public speaking, is that it's very easy for users to understand Kubernetes if you break it down into three things: compute, network, and storage. What I'm getting at here is that the application layer is probably going to be more relevant to the compute layer; that's observability. Storage is going to be more monitoring: what is my system doing, where am I storing my data? And network is kind of related to tracing, which is what we're looking at here. These aren't necessarily one-to-one, but it's a kind of distribution of concerns. Am I thinking about that kind of the same way you are, Duffie?

B

I think you are. I think what I'm trying to get to is identifying the tools that I need to be able to understand what's happening at two o'clock and all of the players involved in that. For that, I'm relying on tools that are pretty normal, like the ability to monitor all the systems and have real timestamped stuff that describes what happened. Say I got a Nagios alert or what have you that says my backup for the MySQL database started at 2:00 and ended at 2:30. I'm relying on things like an event stream to give me some context of time for when my problem is happening. And I'm relying on things like tracing, perhaps, just to narrow the field, so that I can understand what's happening with this particular transaction and what systems I should be looking at. If there's a bunch of time being spent on the network, what's going on with the network at two o'clock? If there's much time being spent on persisting data to a database, what's going on with the database? That gives me, I think, enough context to get into troubleshooting mode.
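The time correlation described above, going from "my latency happens around two o'clock" to "what else was running then," amounts to an interval overlap check. The sketch below assumes a hypothetical list of timestamped infrastructure events; the event names, the date, and the `overlapping` helper are invented for illustration.

```python
from datetime import datetime


def overlapping(events, window_start, window_end):
    """Names of events whose time range overlaps the problem window."""
    return [
        e["name"]
        for e in events
        if e["start"] < window_end and e["end"] > window_start
    ]


def at(hour, minute):
    """Helper: a time on one arbitrary example day."""
    return datetime(2019, 11, 22, hour, minute)


# Hypothetical timestamped events gathered from monitoring and alerts.
infra_events = [
    {"name": "mysql backup", "start": at(14, 0), "end": at(14, 30)},
    {"name": "log rotation", "start": at(3, 0), "end": at(3, 5)},
    {"name": "node upgrade", "start": at(16, 0), "end": at(16, 20)},
]

# The event stream says comment writes were slow from 14:02 to 14:10.
suspects = overlapping(infra_events, at(14, 2), at(14, 10))
```

Here the only overlapping event is the database backup, which is the kind of lead that moves the investigation from the application to the system around it.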

C

And I don't want to take away from this lovely definition you just dropped on us, but I'm going to take a stab at trying to summarize this. Observability spans the whole stack. If you look at the OSI reference model, it's going to cover every one of those layers, and all it really is is a fancy word for all the tools that help us solve a problem. Sorry, I'm not trying to take away from your...

B

No, that's right.

C

I just kind of want to simplify it so that I can grok it a little bit better.

A

How about people? Does culture factor into it, or is it just tools?

C

I think culture is a huge part of it.

A

What would this culture be? Would it be tremendously different from what we get now, at least with modern teams and modern companies that are doing modern software? That's what I meant to ask: what looks different?

C

I mean, I definitely think you can always tell. Somebody once asked, "What's the difference between an SRE and a senior SRE?" and the answer was, "Patience." You can always tell folks who've been burned, because they take this stuff extremely seriously. I think with that culture, there's a commodity there; people are willing to pay for it. If you can actually do a good job at going from a chaotic problem, "I have no idea what's going on," to making sense of that noise and coming up with a concrete, tangible output that humans can take action on, that's huge.

B

I was recently discussing this in another medium. We were having a conversation around doing chaos testing, and I think that this relates. The interesting thing that came out of that for me was the idea that I spent a pretty good portion of my career teaching people to troubleshoot, which is kind of weird. Teaching somebody to have an intuition about the way that a system works, and giving them a place to even begin to troubleshoot a particularly complex problem, especially as we build more and more complex systems, is really a weird thing to try to do.

I think that, culturally, when you have embraced technologies like observability and chaos engineering, you are not only enabling your developers, your operators, and your SREs to experiment and understand how the system breaks at any point, you're also enabling them to better understand how to troubleshoot and characterize these distributed systems that they're building. If that is a cultural norm within your company, think about how many miles ahead you are of the other people in your industry. Through adopting these technologies, you have enabled your engineering teams, whether they be the people who are writing the code, the people who are operating the code, or the people who are just trying to keep the whole system up or provide you feedback, to experiment, to develop hypotheses around how the system might break at a particular scale, and to test that. Giving them the tools with which to actually observe this is critical. It's amazing.

C

Yes. In my mind, and I'm on my metaphor kick again, I think of the bank robber movies where they take dust and blow it, and all of a sudden you can see the lasers. I kind of feel like that's what's happening here. Chaos testing would be the practice of intentionally breaking the lasers to make sure our security system works, and observability is the practice of actually doing something to make those lasers visible, so we can see what's going on.

A

The two of you spend time with customers, maybe Duffie more so than Nova, but I definitely spend zero time with them. So I'm curious to know: if someone, let's say an SRE, wants to implement a set of practices that comprise what we are talking about, and says it's observability, okay, but they need to get buy-in from other people, how do you suggest they go about doing that? They might know how to do it or be willing to learn, but they may need to get approval, and they need to get buy-in from their managers and from their colleagues. There is a benefit and there's a cost. How would somebody present that? Duffie just gave us a laundry list of benefits, but how do you articulate it in a way that proves those benefits are worth the cost? And what are the costs? What are the trade-offs?

C

I think this is such a great question, because in my career I've worked at the world's most paranoid software-as-a-service shop, where we baked emergency disaster recovery into every layer of everything we did. I've also worked at shops that are like, "Nah, we ain't got time for that. Hurry up and get your code pushed to production." I think there are pros and cons to each. But as you look at the value you have in your application, you're going to come up with some way of concretely measuring that, of saying, "This is an application that brings in five hundred bucks a month," or whatever. How seriously you take it is going to depend, I think, on how much your application is worth to you. For instance, a WordPress blog is probably not going to have the same level of observability concerns that maybe a bank routing system would have. So as your application gets more and more valuable, your need for observability and for these tools is going to go up more and more.

B

Agree, I think that, from the perspective of like how do you convince- and you know maybe an existing engineering culture to to make this jump to introduce these ideas- I think that I think that's a tricky question, because effectively what you're trying to do is kind of enable that cultural shift that we were talking about before right, like what tools would set up the culture to succeed as they build out these these applications and distributed systems that are going to make up or they're going to comprise the basis of what your product is right.

B

And getting to the meat of it, coming at that from an SRE perspective, you sort of need air cover to be able to actually have that conversation, to have those tough conversations with your developers and say, look:

B

This is why we do it that way, and this is something I can help you do, but fundamentally we need to instrument this code in a way that lets us actually observe it and understand how it's operating before we can open the front door and let the internet in, right? We need to be able to understand how and when the doors fall off.

A

Yeah.

B

You know, we're working with developers who are more focused on understanding "does this function do what it says on the box?" rather than "is this function implemented in a way that might emit events or metrics?" That is a completely different set of problems from the developer's perspective.
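Editor's illustration, not part of the conversation: the contrast between "does this function do what it says on the box" and "does this function emit events or metrics" might look like the sketch below. It is a minimal example using only the Python standard library; the `instrumented` decorator, the in-memory `METRICS` store, and the `apply_discount` function are all hypothetical names invented here, and a real system would likely export to a metrics backend rather than a dictionary.

```python
import functools
import json
import logging
import time
from collections import defaultdict

logger = logging.getLogger("app.events")

# Hypothetical in-memory metrics store, used here only for illustration.
METRICS = defaultdict(float)


def instrumented(func):
    """Wrap func so each call updates counters and emits a structured event."""
    name = func.__name__

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        outcome = "ok"
        try:
            return func(*args, **kwargs)
        except Exception:
            outcome = "error"
            METRICS[f"{name}.errors"] += 1
            raise
        finally:
            # Runs on both success and failure, so every call is counted.
            elapsed = time.perf_counter() - start
            METRICS[f"{name}.calls"] += 1
            METRICS[f"{name}.seconds_total"] += elapsed
            logger.info(json.dumps(
                {"event": name, "outcome": outcome, "seconds": round(elapsed, 6)}
            ))

    return wrapper


@instrumented
def apply_discount(price, percent):
    """The business logic: the part that 'does what it says on the box'."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (100 - percent) / 100
```

The function body stays focused on the business rule, while the wrapper answers the operational questions: how often it is called, how often it fails, and how long it takes.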

B

I've seen a couple of different ways to implement this within an organization, and one of them is Facebook's idea of production engineering. I think it's called product engineering or production engineering, one of the two. The idea is that you might have somebody who's similar in some ways to an SRE, somebody who understands the infrastructure and understands how to build applications that will reside upon it, and who is actually embedded with your developer team.

B

They say, before we can legit sign off on this thing, here are the things that this application must have and wire into to enable us to operate this app, so that we can observe it and monitor it and do all the things that we need to do. And the great part about that is that it means you're teaming with the developer teams. You have some engineering piece that is teaming with the developer team.

B

It enables them to understand why these tools are there and what they're for, and really promotes that engagement.

A

And getting to that place is an interesting proposition, isn't it? Because, even as a developer, I see the world moving more and more towards the developer taking ownership of the apps and knowing more layers of the stack. And if I am a developer and I want to incorporate these practices, I need to convince someone, either a developer or whoever is in charge of monitoring and making sure the system is up and running, right? So, I don't want to lose my train of thought.

A

So one way to go about quantifying the need for that is to say, "Well, over the last month we spent X hours trying to find a bug in production," and that X is a huge number. So you can bring that number and say, "This is how much it cost in engineering hours." But on the other hand, you don't want to be the one to say that it took you a hundred hours to find one little bug in production.

C

Yeah, I mean, I feel like this is why agile teams are so successful.

C

Sorry for cutting you off, Carlisia.

A

Yeah, no.

C

Anyway, I was just going to say, I feel like this is why agile teams are so successful: baked into how you do your work is this implicit way of tracking your time and your progress. So, at the end of the day, if you do spend a hundred hours trying to find a bug, those are the team's hours, not your hours, and you sort of get this data for free at the end of every sprint.
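Editor's illustration, not part of the conversation: the idea of bringing a number for hours spent chasing production bugs, counted as the team's hours rather than one person's, can be sketched as a small tally. Everything below is hypothetical: the entry format, the names, the hours, and the hourly rate are made up for the example and do not come from the episode.

```python
from collections import defaultdict

# Hypothetical time entries: (sprint, engineer, category, hours).
ENTRIES = [
    ("sprint-1", "alice", "feature", 30.0),
    ("sprint-1", "bob", "prod-debugging", 12.5),
    ("sprint-1", "alice", "prod-debugging", 7.5),
    ("sprint-2", "bob", "prod-debugging", 40.0),
    ("sprint-2", "carol", "feature", 25.0),
]


def debugging_hours_by_sprint(entries, category="prod-debugging"):
    """Sum hours per sprint for one category, ignoring who logged them."""
    totals = defaultdict(float)
    for sprint, _engineer, entry_category, hours in entries:
        if entry_category == category:
            totals[sprint] += hours
    return dict(totals)


def debugging_cost(entries, hourly_rate):
    """Convert the team's total debugging hours into a single cost figure."""
    return sum(debugging_hours_by_sprint(entries).values()) * hourly_rate
```

Because the engineer's name is deliberately dropped when summing, the result reads as a team-level cost, which is the framing the hosts argue for.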

B

That's a good point. I mean, what you've brought up is actually another cultural piece that I think is a problem.

B

Let me put this differently. I've seen companies where the culture is somewhat damning for people who spend a lot of time trying to troubleshoot something that they wrote, and that is terrible, because it means that the people who are out there writing the code, who are just trying to get across the finish line with the thing that needs to be in production, now have this incredible pressure on them to not make a mistake. That is not okay.

B

We are all here to make mistakes. What we do professionally is make mistakes, and the rest is gravy. You know what I mean?

B

Yeah, it makes me nuts that there are organizations that are like that. What's awesome, though, is that I see that narrative rising up within the ecosystem around cloud native architectures and other things like that. It's like, you were hired to do a hard job, and if we come down on you for thinking that that's a hard job, then we're messing up. You're not messing up.

A

Building software is very hard and complex, so if you're not making mistakes, you're either not human or you're not making enough changes. And in today's world we still have humans making software, not robots. We're not there yet. And it's a very risky proposition not to be making continuous changes, because you will be left behind.

C

Yeah. I feel like there's definitely something to be said about empathy for software engineers. It's very easy to be like, "Oh my gosh, you spent a hundred hours looking into this one bug to save $20. How dare you?" It's a lot harder to be like, "Oh, you poor thing. You had to dig through a hundred million lines of somebody else's code in order to find this bug, and it took you a hundred hours, and you did all of that just to fix this one little bug. How awesome are you?"

C

And I feel like that's where we get into the team dynamic: are we a blame-centric team? Do we try to assign blame to a certain person, or do we look at this as the team's responsibility? Like, this is our code, and poor Carlisia over here had to go dig through this code that hasn't been touched in ten years, or whatever.

A

Another, sorry, Duffy, another layer to that is that, in my experience, I have never done anything in software, looked at any code, or brought up any system where the end result was trivial, especially in relation to the time spent. It has never happened that it wasn't a huge amount of education that I got to reuse in future work.

B

Yeah, that's true.

B

That's kind of what I was referring to: being able to build up the intuition around how these systems operate, right? The more time you spend in the trenches working on those things, if you are enabled to leverage technologies like observability and chaos engineering to troubleshoot, you come up with a hypothesis about how this will break when this happens, test it, view the results, come up with a new hypothesis, and continue down that path.

B

You will automatically, by your nature, build a better intuition yourself around how all of these systems operate. It doesn't matter whether it's the application that you're working on or some other application; you're going to be able to build a better intuition for how to understand and characterize systems in general. You'll be a better person.

B

You'll be a better engineer for distributed systems if you are in a culture that is blameless, that gives you tools to experiment, and that gives you tools to validate those experiments and come up with new ones.
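Editor's illustration, not part of the conversation: the loop described here (form a hypothesis about how the system breaks, inject the failure, observe the result, revise) can be sketched as a toy experiment. This is a simplified simulation under stated assumptions, not a real chaos engineering tool: `FlakyDependency`, `handle_request`, `run_experiment`, and `hypothesis_holds` are hypothetical names, and the "fault" is just a seeded random failure.

```python
import random


class FlakyDependency:
    """Stand-in for a downstream service with an injectable failure rate."""

    def __init__(self, failure_rate, rng):
        self.failure_rate = failure_rate
        self.rng = rng

    def call(self):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return "ok"


def handle_request(dependency, retries):
    """Code under test: call the dependency, retrying on failure."""
    for _ in range(retries + 1):
        try:
            return dependency.call()
        except ConnectionError:
            continue
    return None  # every attempt failed


def run_experiment(failure_rate, retries, requests=1000, seed=0):
    """Inject faults at the given rate and return the observed success ratio."""
    dependency = FlakyDependency(failure_rate, random.Random(seed))
    successes = sum(
        handle_request(dependency, retries) == "ok" for _ in range(requests)
    )
    return successes / requests


def hypothesis_holds(failure_rate, retries, min_success):
    """Hypothesis: with this many retries, success ratio stays >= min_success."""
    return run_experiment(failure_rate, retries) >= min_success
```

A typical round would state something like "with three retries we stay above 95% success even if 30% of downstream calls fail", run the experiment, and then tighten or change the hypothesis depending on what was observed.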

A

I'm gonna challenge you and then I'm gonna agree with you. So hang on. Okay!

A

So I'm going to challenge you. We are saying that observability actually boils down to using automated tools to do all this work for us, so that we don't have to dig in manually on a case-by-case basis.

B

No, no. I'm saying observability is a set of tools that you can use to observe the interactions and behavior of distributed systems.

A

Okay, but with automated tools, right?

B

No. The automation pieces are really... I mean, you want to take this one, Kris?

C

Yeah. I mean, I think they certainly can be automated. I just don't think there's a hard bit of criteria that says everything needs to be automated. There ain't nothing wrong with SSHing into a server and running a debug script or something if you're having a really bad day.

A

Okay. Yeah, okay, but let me go back to my theory. Let's just pretend it is, because it will sound better.

A

Right. So, not to exclude the option to do it manually too, if you want, but let's say we have these wonderful tools that can automate a bunch of this work for us, and we get to look at it at a high level. What I'm thinking is, if we are not using those tools, we have to do a lot of that work manually.

A

We have to look in a lot more different places. And I would challenge you that (but hang on, let me finish the whole spiel) we develop even more intuition that way. So we are potentially decreasing the level of intuition that we develop by using the tools. Now I'm going to agree with you.

A

That was just a rationale that I had to follow. I agree with you that it definitely helps you develop intuition, and it is a better quality of intuition, because now you can hold these different pieces in your hands, because you're looking at it at this higher level. When you need to look at things piecemeal, at least I am like this, it's like, okay, I can hold this one thing here.

A

It's big already in my head, and then when I switch context and go look at something else, I've forgotten what I looked at over there. It's really hard to keep track, and really wasteful, and it's impossible to keep all of it in our minds, right? And let's say I have to go through the whole debugging process all over again.

A

If I don't have notes, it will be just like the first time, because I can't possibly remember. I mean, I've been in situations of having to debug different systems, and I'm like, okay, now, the third time around, I'm taking notes, because the fourth time is just going to be so painful. So having tools that let us look at these at a higher level, I think, has the additional benefit of helping us understand the system and hold it together in our heads.

A

Because, okay, we don't know the little details of how things are happening behind the scenes. But how useful is that anyway? I'd much rather know how the whole system works together and where the points of failure are, so that I can visualize it, right?

C

I have a question.

C

Following up on how Carlisia challenged you and then agreed with you, I really want to ask this question, because I think Carlisia's answer is going to be different from Duffy's, and I think that's going to say a lot about the different ways that we're thinking about observability here. It's really fascinating if you think about it. So, have either of you worked in a shop before where you had, like, "the guy"?

C

You know, that one person who just knew the code base inside and out. He had been around forever. He was a dinosaur, and whenever something went wrong, you were like, "We've got to get this guy on the phone," and he would come in like, "Oh, it's this one line and this one thing that would take you six months to figure out. But let me just fix this really quick." Bam, bam, and production's back online.

A

The code base guy, yeah, or the systems admin guy. Something that is not my app, but the system broke: get that person, who knew everything and could take one second to figure out what the problem was.

C

Have you seen that before? There's, like, that one person.

A

There was just so much...

C

Tribal knowledge, yeah, absolutely. Duffy, what about you?

C

What I'm getting at here is, I think observability, and I mean this in the nicest possible way to all of our folks at home who are actively playing the role of "the guy", I think observability kind of makes that problem go away, right?

B

I think it normalizes it, to your point. I think you're on to it, and I think I agree with you, but I think that, fundamentally, what happens is that through tooling like chaos engineering and tooling like observability, you are normalizing what it looks like to teach anybody to be that person, right? And that's the key takeaway. You know, to Carlisia's point, she might...

B

Actually, you know, Kris and I, I promise that we will approach some complex distributed systems problem fundamentally differently.

C

Yeah.

B

Right? If somebody has a broken Kubernetes cluster, Kris and I are both going to approach that same problem, and we will likely both be able to solve that problem. But we are going to approach it in different ways.

A

Yeah.

B

And I think that the benefit of having common tooling with which to experiment, understand, and observe the behavior of these distributed systems means that we can normalize what it looks like to be a developer and have a theory about how the system is breaking or would break, and have some way of actually validating that through the use of observability and perhaps chaos engineering. And that means that we're turning the keys over to the crew.

B

Turning the keys to the castle over. There are no more bus tests. You don't have to worry about what happens to me. At the end of the day, we all have this common goal.

A

This is a most excellent point. I'm glad you brought it up, because what both of you said is absolutely true. I mean, give me better documentation and I don't need you anymore, because I can be self-sufficient.

B

Exactly.

A

Give me a way to observe where things went wrong. And again, going back to what I said, some developers are actively taking on the ownership, and in other cases they're being asked to take more ownership of the whole stack, from the application level down the stack. And if you give me tools to observe where things went wrong beyond my code, as a developer I'm not going to call the guy.

C

Yeah.

A

A level of self...

C

The guy doesn't want you to call him.

A

Oh. So the benefit it provides, we could say, is that it gives the engineer an additional level of self-sufficiency.

C

Yeah. I mean, teach someone to fish, give someone a fish.

B

All right, well, that was a great conversation on observability, and we talked about a bunch of different topics. This is Duffy, and I had a great time in this session. Thanks.

A

Yeah, I'm super glad to be here today. Thanks for listening. Come back next week.

C

Thanks for joining, everyone. And I apologize again to all of our guys at home listening. Hopefully we can help you with observability along the way, to make everybody's job a little bit easier.

A

All right. And I want to say, for the girls, we know that you're out there too, though.

A

Just a joke.

C

Oh yeah, I mean, I was totally.