Description
Don’t miss out! Join us at our upcoming event: KubeCon + CloudNativeCon North America 2021 in Los Angeles, CA, from October 12-15. Learn more at https://kubecon.io. The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.
Keynote: Fireside chat with Martin Mao, CEO of Chronosphere - Anurag Gupta, Production Manager, Calyptia; Martin Mao, CEO/Co-Founder, Chronosphere
In this session, join Martin Mao, CEO of Chronosphere, in a fireside chat with Fluent maintainer Anurag to discuss the evolution of logging in relation to observability and answer questions about the broader Fluent ecosystem.
A
Okay, well, hey Martin, thanks so much for joining me on this fireside chat. You know, I'd love maybe a quick intro for all the folks who are joining us at FluentCon: maybe who we are, what Chronosphere is, yeah, just a little bit about yourself.
B
Yeah, for sure, thanks for having me, Anurag. I'm really excited to be here today chatting with you. A little bit about myself: my name is Martin, and I'm currently the CEO and co-founder of Chronosphere. We provide a hosted monitoring solution to companies adopting cloud native. We help these companies monitor their infrastructure, so primarily Kubernetes.
B
These days that means monitoring their applications, which are generally microservices-oriented, and monitoring their business as well, in real time. The core technology of our product really came out of Uber, and a lot of the open source observability projects came out of Uber too. That's actually where I spent four years of my career before founding Chronosphere. I led a core part of the observability team there, where we created projects such as M3, which is a distributed and scalable metric storage engine that's compatible with Prometheus as long-term storage for metrics. We also created Jaeger, which is the CNCF-graduated distributed tracing project.
B
We actually completed the trifecta and created a logging platform internally as well, but unfortunately we never open sourced it. So yeah, I've spent a lot of time in my career in the observability space, both solving problems for Uber with these direct solutions, but also for the broader community via the open source channels. So that's a little bit about myself, and again, I'm really excited to be here today and looking forward to our chat.
A
Awesome, yeah. And you know, Jaeger sits alongside Fluentd as a graduated CNCF project, so it's awesome to hear about all this observability journey at Uber; what a small app, right? As probably one of the leading folks in the observability space, with Jaeger and all these other projects, I'm curious: what should companies, or really what should users, actually be thinking about when solving for observability?
B
Yeah, that's a great question, and I love the fact that you put users in there, because that is probably where the focus should be: what should the users be thinking about? Even the question of who the users of observability are has been changing fairly rapidly, I'd say. Historically, perhaps, the users or practitioners of observability and monitoring were isolated to the SRE department or, you know, perhaps a core infrastructure team.
B
However, if you think about modern development and the application developer, they not only have to write and develop their application. They also have to test it, then they deploy it, they have to monitor it in production, and they have to remediate issues when it goes wrong. So really, for us, observability is more than just a practice.
B
It's probably more like a cultural mindset, much like how DevOps is, and it really is something we're seeing all developers embrace and adopt, this culture and this mindset, more and more. So really, the end users of observability are all the developers out there, and if you look at it from that perspective, they're really trying to optimize for one outcome: to know when something is wrong and to remediate that issue as quickly as possible, ideally before end customers find out.
B
Or before other engineering teams do, perhaps. And I think, optimizing for that outcome of remediating particular issues in their applications, there are really three questions we're trying to answer here. The first is: can I get notified, and how quickly can I get notified, when something goes wrong? Because if you don't even know when something goes wrong, or your customers find out before you, that's really not a great place to be. I'd say the second one is triage.
B
So once you do get notified that something is wrong, figuring out: what is the impact? Is it impacting all of my customers or just a subset of my customers? Is it one cluster or another? You know, if you get woken up in the middle of the night, is this something I have to deal with now, or can it wait until the morning?
B
So triaging the issue and knowing the impact is a fairly important question that we need to answer. And then the third step, or phase, or question that we need to answer is: can I root-cause this issue? Can I find the underlying root cause of the issue and really provide a fix for it?
B
So I think those are probably the three steps or phases, I'd say, that developers go through in achieving their outcome, which is to remediate the issue as quickly as possible. For myself, and for the team here at Chronosphere, that's what we think about when we think about observability, and that's what we think end users should focus on. We do hear a lot of definitions out there that are, I would say, concentrated more around other things.
B
Perhaps the data types, like metrics, traces, and logs, the three pillars per se. Those data types are definitely important, for sure, and they are the types of data we need to answer the questions and arrive at our outcome.
B
But the data types by themselves don't really give you observability, or better observability. Just ticking the three of them off and saying "hey, I have logs, I have metrics, I have traces" doesn't necessarily mean you have observability, or great observability, and producing more of each of those data types doesn't lead to greater observability either. So we do think that an outcome-based approach for the end user, which is the developer, is perhaps a better way to think about observability as a whole, as opposed to a data approach.
A
Yeah, it makes a ton of sense. I think everyone's getting hammered with these three pillars, logs, metrics, traces, you've got to check the box on all of the above, and sometimes we just forget about the user in those cases. And you mentioned the three steps or phases; are there things you'd recommend to users for going about meeting those? Or maybe you can help clarify those pieces a little bit, yeah.
B
You do have to get notified that something is wrong before you can go and fix the issue, for sure. I'll say that the best way to think about these three steps or phases is that it's still, again, about optimizing for the outcome, which is remediation, and it's not necessary to go through all three phases to remediate an issue. So if you think about it: if you're mid-deploy of your service and you get notified that something is wrong...
B
The first course of action is probably to roll back that deploy instantly, and that could be your way of remediating the issue. You don't know the root cause there, but you've remediated the issue and you've avoided customer impact, I'd say, and that's really what you're trying to optimize for. So perhaps at the notification phase you can remediate instantly, and there's no need to go triage and root-cause the issue.
B
You can sort of do that after the fact, and I think that's important as well; you don't really want to be doing root cause analysis live during the incident, with the pressure of knowing that the business is down or impacted or anything like that. I'd also say that, generally, issues are introduced when we change a system; when we leave a system alone there generally aren't as many issues, and I'd say the highest percentage of causes of issues is when we introduce change to a particular system.
B
So perhaps when you get notified there, that resolution, or step to remediation, is fairly quick. The second one is triage. There are definitely a bunch of situations where just being notified isn't enough; you're not actively introducing a change to the system, so you do want to triage the issue in the sense of knowing what is impacted. What is the impact to, you know, subsets of my customer base, or perhaps all of my customer base?
B
Perhaps you can isolate the issue down to one cluster, or one availability zone, or one region, and I think that helps you know how bad the issue is and how much urgency you need to put into resolving it. But often we find that at that step you can also remediate fairly easily, without root cause analysis. And you can imagine, you know...
B
Most of our modern architectures are spread across multiple clusters, multiple availability zones, multiple regions, so a quick path to remediation, perhaps, is to route your requests around those impacted zones or clusters. If you know the issue is isolated to cluster A, or to zone A, route your requests away from it, such that, again, you've remediated the issue and your customers are not impacted.
B
Yet you have time to really figure things out, not under that time pressure. And then there are definitely, occasionally, those issues where you can't do either of the first two and you really have to get dug in and dig at what the root cause is in production. Again, this is probably not the preferable one.
B
I'm sure developers would prefer not to have to debug things live in production, under the time pressure of actual impact to the business, but that does need to happen sometimes, and when it does, you've really got to get dug in there, figure out what the root cause is, roll out a fix, and remediate the issue that way. So the phases are sort of sequential and dependent on each other.
A
Yeah, and I think that helps a lot. Now, as you talk about remediation, and going back a little bit to the three pillars and the three data types: are there certain data types that lend themselves to making it easier to remediate? Like, are metrics or traces going to help you move faster? How do the data types relate back to all of this?
B
That's a great question. I think the short answer is yes, though there's obviously a lot of detail in that. When we think about the data types, as I mentioned earlier, just having all three checked off doesn't lead to better observability in any way, and just having more of each data type doesn't really accomplish or achieve that either.
B
I think, if you look at the outcome, which is remediation, and at the phases, there are particular data types that are better suited to particular phases. So if you think about notification: generally, when you're trying to be notified of something, that notification is done on an aggregate view of all of your data, or all of your requests, perhaps. You're looking at: how many requests am I getting per second? How many errors am I getting per second? What's the aggregate latency of those particular requests?
B
So I think, to help with the notification phase of the problem, metric data is generally a more optimal type of data to go and solve that problem. It doesn't necessarily mean it's the only one, or the necessary one, but if you think about what you're trying to measure there, it's really an aggregate view of numerical data: you're counting things, or you're measuring latency, and that's generally metrics, which is, you know, values over time.
B
And if you think about notification and alerting, you're generally checking those numerical values against a particular threshold. So I think metrics are perhaps better suited for the notification phase of things, and perhaps triage a little bit as well.
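The threshold check he describes can be sketched in a few lines. This is an illustrative stand-in, not any particular product's alerting engine; the metric names and the 5% threshold are assumptions for the example:

```python
# Minimal sketch of the "notification" phase: compare an aggregate
# metric value against a fixed threshold and decide whether to alert.
# The rate inputs and the 5% threshold are illustrative assumptions.

def error_rate(errors_per_sec: float, requests_per_sec: float) -> float:
    """Fraction of requests that failed in the sampling window."""
    if requests_per_sec == 0:
        return 0.0
    return errors_per_sec / requests_per_sec

def should_alert(errors_per_sec: float, requests_per_sec: float,
                 threshold: float = 0.05) -> bool:
    """Fire a notification when the error rate crosses the threshold."""
    return error_rate(errors_per_sec, requests_per_sec) > threshold
```

For example, 12 errors per second out of 100 requests per second is a 12% error rate, which crosses the 5% threshold and would fire the notification.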
B
If you think about triage, you're really trying to dig one level deeper into the error count, or into the latency, a little bit, and then perhaps having labels or tags on your metric data, to better slice and dice by a particular cluster or an AZ or a region, can help you with triage for sure. The actual individual request itself is perhaps not as required for those phases, but as you shift later into the phases in the process, into deeper triage and root cause analysis...
B
You want data types that are by default a little bit more verbose, that have a little bit more information there, and I think as that transition happens, logs and traces are perhaps, I would say, more efficient at solving those particular phases of the problem. So, you know, I think all three types exist for a particular reason.
B
You know, there was the announcement earlier today about Fluent Bit sort of extracting metric data off of logs, because quite often you don't need to go and instrument for all three types of data; converting between one type and another to optimize for the use case is a great advantage, I guess. And, you know, again, I'm really happy to see Fluent Bit and Fluentd go down that path in enabling those use cases.
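The conversion he describes, turning a stream of log lines into an aggregate counter metric, can be sketched roughly as follows. This is an illustrative stand-in, not the Fluent Bit implementation; the `level=error` log convention, the `service` field, and the metric name are all assumptions:

```python
# Sketch: derive a counter metric from a stream of log lines, so the
# aggregate value can feed an alerting pipeline instead of raw logs.
# The key=value log format and "log_errors_total" name are assumptions.
from collections import Counter

def logs_to_error_counter(log_lines):
    """Count error-level log lines per service, yielding metric samples."""
    counts = Counter()
    for line in log_lines:
        # Parse whitespace-separated key=value pairs.
        fields = dict(kv.split("=", 1) for kv in line.split() if "=" in kv)
        if fields.get("level") == "error":
            counts[fields.get("service", "unknown")] += 1
    # Emit one (metric_name, labels, value) sample per service.
    return [("log_errors_total", {"service": svc}, n)
            for svc, n in sorted(counts.items())]
```

The point of the design: once the logs are reduced to a counter, notification becomes the cheap threshold check discussed earlier, rather than a scan over raw log volume.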
A
Yeah, let's talk a little bit about that. You know, we have a couple of sessions at FluentCon where you're going to hear a little bit more on the tracing side with Fluent Bit, as well as the metrics side. I'd love to just get your thoughts about metrics and logs, how these things all stem together, and Fluent Bit's new announcement there.
B
Yeah, for sure. So, you know, I think there will be a session later today from Mike. If you look at that session, and I don't want to ruin his session by any means, he was already generating metrics off of logs via a custom plugin, right?
B
So this is already happening. Even though there is first-class support now in Fluent Bit, which I think is great, users were already doing this out of necessity. And if you look at Mike's use case, it really is to extract metric data off of the logs so that he can alert off of it and get faster notifications, because again, metrics are perhaps a more optimal data type for that phase than logs are, right?
B
So, you know, I think this is already something that end users are leveraging; the need is already there. I think what's great about the announcement, and correct me if I'm wrong here, Anurag, is that I believe the capability of extracting metrics from logs has existed in Fluentd and Fluent Bit for quite a while, but the big announcement today is what's happening in the metric extraction component.
B
It's happening in Prometheus format, and I think that is also really great for the industry as a whole. If you take a step back and look at the monitoring and observability industry, there's been a huge shift towards open source standards. So, you know, Fluentd is the graduated CNCF project for logging, and the same goes for Prometheus for metrics, and I think the best part of that isn't just that there is a solution.
B
If we take metrics as an example: emitting metric data in the Prometheus format, I think, is hugely advantageous for the industry as a whole, because it means, from an end user perspective, from a developer perspective, you're not locked into one storage solution, whether that's a storage solution you're hosting yourself or, say, a vended storage solution. You can instrument in one way, in one protocol, and every solution out there sort of supports it, right? And again, that's from the metrics perspective.
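For context, the Prometheus text exposition format being discussed is a simple line-oriented standard; a small renderer sketches what it looks like (the metric name and labels below are invented for illustration):

```python
# Sketch: render counter samples in the Prometheus text exposition
# format, the de facto standard the discussion refers to.
# The metric name and label values used in examples are illustrative.

def to_prometheus_text(name, help_text, samples):
    """samples: list of (labels_dict, value) pairs for one counter."""
    lines = [f"# HELP {name} {help_text}",
             f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)
```

Because every Prometheus-compatible backend (Prometheus itself, M3, Cortex, Thanos, or a vendor) can scrape this same text, instrumenting once in this format is what avoids the lock-in he describes.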
B
If you look at the back ends, there are a lot of different solutions out there in addition to Prometheus. I mentioned M3, which is one that we open sourced out of Uber, but Cortex and Thanos are there and available as well, and I think the sort of movement to these standards, and the power of that movement, is also seen in all the vendors that are providing monitoring and metrics-based solutions, in the sense that they all have to support Prometheus as a protocol now as well.
B
Right, and again, all of this, I think, is great for the end user and the industry at large, because you're not locked into one technology as a back end, and you're not locked into one vendor as a back end, which I think is great. So it's great to see that that ability is supported as first class now in Fluent Bit, and also that the exposition format is the industry standard, which is Prometheus. That's, you know, great to see, I would say.
A
Awesome, awesome, yeah. I think that's how we were thinking about it from the Fluentd side and the Fluent Bit side: how do we just conform with the standards? And I think a big upcoming project in this space is definitely OpenTelemetry, and we announced some earlier stuff today where we're saying, hey, we're going to have some integrations going on with the protocols they're building. I'd love to get your take on the approach of that project, these projects together, and just maybe your take on OpenTelemetry.
B
I assume most folks watching this are somewhat familiar with OpenTelemetry, but if not: it is a collection of APIs and SDKs, again with the goal of standardizing the protocols and the clients that generate all of this observability data. So overall, as a whole, as a project, I love it, because it's pushing the industry towards more open source standards, for sure. If you look at OpenTelemetry, the project really started off around having sort of standard client libraries for disparate trace data; it expanded over time to include metric data, and I believe the natural progression would be to expand it even further over time to include log data as well.
B
If you look down that path, they have, again, similar APIs and SDKs across all the major programming languages, which standardizes the instrumentation and the production of the data, which I think is great. If you look at it as a project, it has support for tracing right now, metric support is being added actively right now, and I think log support may be coming soon, and I think that's going to be great for the industry moving forward.
B
I think, perhaps in a year or two, or perhaps even sooner than that, you'll see a lot of the applications we are writing instrumented in OpenTelemetry from the beginning, and that's great. In fact, it's not only just the three protocols there; it's a single client for all three types of data, or at least two types of data right now, and perhaps a third type down the line.
B
I think that's great for new applications moving forward, but if you look at it from a practical lens, at the things that we need to monitor today, there is so much existing instrumentation that it is pretty impractical, I would say, from a company's perspective, to go back and re-instrument existing applications customers have written themselves. Sometimes it's impossible, because you're pulling in, you know, a dependent library or an upstream project that you're using, and you don't really even have control over how those things are instrumented.
B
So I do think that, you know, hopefully OpenTelemetry is the future and eventually the standard there. But looking at it from a practical perspective, I think projects like Fluentd and Fluent Bit are great here, because there is a lot of sort of backwards support for existing protocols and existing instrumentation that exists today, and they're tackling the problem not from a client perspective but from a processing perspective, outside of the application itself.
B
So I think that is a very different way of solving the problem, and I think it's one that's going to be required as we handle a transition that's going to take multiple years. And I think this design of Fluent Bit and Fluentd, where you're processing outside of the application itself, lends itself to other advantages as well. So, one of the companies we work with, called Tecton: when we talked to them about their use of Fluent Bit...
B
What they were using it for was to actually augment the stream of log data coming out of the application itself with additional metadata about the environment it's running in. So they were augmenting it with the cluster and the namespace of the Kubernetes cluster that they were running in, and I think that is a hugely powerful thing to be able to do. I can't remember the exact feature name they were using.
B
I believe it's called the rewrite tag feature, or something like that, in Fluent Bit, but I think that adds a bunch of fairly powerful additional value as well, in the sense that now you can sort of standardize the additional metadata that you add to the streams, which is always a hard problem to solve. You can imagine, if you ask every developer to emit the environment or the cluster name...
B
Who knows which way they're going to go? Are they going to do it in the standard format? There's going to be weird camel casing and all sorts of other things in there, right? So I think being able to do it in one centralized location is important. And again, sometimes, if you look at it from the end user perspective, if you're an application developer, when you're writing and instrumenting your application, it's actually really hard to even know: hey, which cluster am I going to be running in?
B
How am I going to go get that data? It's actually something that may not even be possible from the application developer's perspective. So I think this approach from Fluentd and Fluent Bit, of sort of processing all of the existing streams of data coming out, adds additional value there and unlocks a bunch of use cases for sure. And as you mentioned, I believe there will be support for all the protocols that OpenTelemetry is going to be crafting and standardizing as well, so it's not an either-or thing.
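The centralized enrichment pattern described above, attaching cluster and namespace metadata in the pipeline rather than in every application, can be sketched roughly like this. It is a stand-in for what a log processor does, not Fluent Bit's implementation; the field names and environment variables are assumptions:

```python
# Sketch: enrich every log record with deployment metadata in one
# central place, instead of asking each application to emit it.
# The "cluster"/"namespace" keys and env var names are illustrative.
import os

def enrich(record: dict, env=os.environ) -> dict:
    """Attach metadata known only to the runtime environment."""
    enriched = dict(record)  # leave the original record untouched
    enriched["cluster"] = env.get("CLUSTER_NAME", "unknown")
    enriched["namespace"] = env.get("POD_NAMESPACE", "unknown")
    return enriched
```

Because the processor runs next to the workload, it can answer "which cluster am I in?" uniformly, which the application developer often cannot.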
A
Awesome, yeah. And I think both of us would probably agree that observability has changed significantly in the last three years: new projects, new protocols, new standards. As someone who's at the forefront of this, what do you think the next three years hold? What is, maybe, the future in Martin Mao's mind?
B
You know, I think, if you look at the future, what I believe is that this trend I talked about at the beginning, where every developer adopts this observability mindset, will continue. And I think there will be a huge transfer of both knowledge and skill set from that core SRE team, from the experts in these practices today, to all developers everywhere, and I really hope that that transition continues to happen.
B
I think, as that transition happens, hopefully there is also a focus on the outcome, as opposed to the inputs, all the data types, as well. So I do see that happening over the next three years: having the developers optimize for the outcome, which is remediation as quickly as possible. And if you assume that that is the direction things are moving, I think there are a few implications, or a few outputs, of that.
B
One of which is, I think, that you're going to see a lot more of what we're seeing already, where there is conversion between the three data types to optimize for the various phases, because really the developers are going to be optimizing for the various phases here. So I do think you'll see a lot more of what we're seeing today already, of transferring between the data types to solve a particular phase and to remediate as quickly as possible. But not just that.
B
I do also think that, as part of this shift, there is also going to be, I think, better context being passed between each of the phases there as well: going from notification to triage to root cause analysis, going through the three phases there.
B
I think there'll be a focus, and sort of innovation, on passing more context throughout each of those. One example I can give here: my co-founder Rob Skillington gave a talk a couple of years ago at KubeCon where we showed how you could jump from a metric data point on a dashboard, which you would use for notification and triage, straight into the underlying request in the distributed tracing system, which you would use for root cause analysis. So really trying not to begin your search again as you go through the phases, but to take the effort you put in at each phase and use it more effectively in the next phase.
B
I do think we'll see more things there, and actually I think we'll also see better integration between the tiers as well, and by tiers I mean the infrastructure and the application tiers. So you can imagine, and I think you may have alluded to this earlier today, that there are some plans for Fluent Bit to also sort of collect infrastructure stats, or infrastructure metrics, from the hardware itself.
B
We see this in big data as well, and in other industries as well. And I'd say that has implications for the central observability team, or the SRE team, that is managing and running all of the infrastructure and all the observability tooling that the rest of the developers use and depend on, and there are probably two large implications there.
B
The first of which is, I think, that as the observability tooling becomes more of an important tool in the tool set of developers, the reliability of that system is going to become more important. And this is coming from my experience at Uber, where we built a hugely powerful metrics backend storage, yet we couldn't prevent a single developer from writing a single line of code that inadvertently emitted high-cardinality metrics.
B
Just because there's going to be a larger dependence on those tools. And the second of which is, you know, I do think that the monitoring data, as I mentioned earlier, is going to outpace and grow at a much faster rate than our spend on, or our use of, infrastructure, and I think, at a certain point, the central observability team or the SRE team is going to have to focus on how to implement best practices for the developers, so they understand the implications of the instrumentation, and sort of optimize.
B
This data that's being produced should still, you know, solve the problem and optimize the outcome, but perhaps not in a "hey, just produce as much data as you can and sort of hope for the best" way. I think there will be a lot of focus on how to deal with that side of the problem as well, looking forward. But yeah, that's probably my best guess at what we're going to see in the next three years here.
A
Awesome, awesome, yeah. I think everyone's going to be a part of it, right? If you're watching this, you are probably at the forefront of observability, so I really appreciate your answer and your honesty; it's maybe an expensive future. So with that, I think, yeah, we can go ahead and close up. Thank you so much again, Martin, for your time and your insights, and we'll chat again soon.