From YouTube: Anurag Gupta - Calyptia - Wrangling Data to Multiple Places With Fluent Bit - Percona Live 2021
Description
As more and more users move to #Kubernetes, they may also start using multiple backends and analytics tools.
How do you collect once and send everywhere? In this talk, Anurag covers Fluent Bit, a Cloud Native Computing Foundation (#CNCF) graduated project, and how you can collect once and send to all the backends you want. Additionally, Anurag discusses some of Fluent Bit's advanced capabilities, such as enrichment, parsing, and data reduction, that help users get the most out of their backends.
So today, what we're going to talk about is a little intro to why wrangling data even matters in the first place; how you do that with Fluent Bit; some of the use cases where wrangling data can be very effective for your business, for your enterprise, for your startup, whatever it may be; and then, last but not least, we're going to talk a little bit about advanced use cases and walk through a quick demo.
So first, let's talk about why we even care about wrangling data. What's the challenge, really? What's the problem? First, I think everyone here would agree that data is just growing at a tremendous rate, and whether or not we want to collect all of it, the truth is it's out there and there are insights to be had. So how do we go about collecting all of it? We have to collect from all these different sources and send to all these different destinations.
Every few months we're seeing brand new backends, brand new databases, brand new places where folks are telling us, "Hey, if you send us your data, we'll give you the best insights." So this problem of having so many sources, so many different destinations, and so many different ways to get that data in is really a challenge. We don't want to have a thousand agents on a server. We don't want to have 50,000 agents deployed in Kubernetes just to route data to all of the various destinations.
Now, last but not least is formatting. We have all these different applications and new languages coming up. They all have their different stack traces, their different formats, their different log styles. How do we make sure we're being effective in catering to that, and ensure that when we're looking at it from an insight perspective, doing analytics on top of it, we're able to get the most information out of all these various formats?
To solve all of these challenges, about ten years ago a project was created called Fluentd, and that ecosystem spurred vendor-neutral, open source data collectors. What that means is they're not tied to a single backend, and they're part of a foundation, the Cloud Native Computing Foundation, so they sit right next to Kubernetes, right next to Prometheus.
Fluentd is a graduated project, and Fluent Bit is part of that ecosystem. It's Apache 2 licensed, it's deployed 2.4 million times a day, and it really builds off these challenges: one, data is growing, so how do we make sure we can grab data from all these sources and send to all these destinations so you can get to your analytics faster; and last but not least, do some formatting, do some processing in between.

Now, as for who's using Fluent Bit, we have a ton of cloud providers that are utilizing it in their clouds: AWS, Google Cloud, Microsoft. A lot of folks are utilizing Fluent Bit in the enterprise today, so you can look at this project as having a lot of maturity, a lot of scale, and something that you can build plug-ins for.

So how does it work? Fluent Bit is really based on two big components: one is a plug-in system, and the second is tagging. The plug-in system allows us to say there are many sources, so each of these sources can have different inputs, and there are many outputs.
Many of these destinations, say Elasticsearch, Grafana Loki, Postgres, Splunk, can each have outputs, and as we define routes, we say that one thing is tagged with "a", and then Splunk and Elastic will have matching "a"s assigned to them. These records are then routed to one or more outputs. And this project, Fluent Bit specifically, is written to be very, very performant: it's deployed on embedded systems, and it's built for the container age.
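To make that tag-and-match routing concrete, here is a minimal sketch in Fluent Bit's classic configuration format; the file path, hosts, and token are placeholders, not values from the talk:

```
[INPUT]
    Name  tail
    Path  /var/log/app.log       # placeholder log file
    Tag   a                      # records from this input carry tag 'a'

[OUTPUT]
    Name   es                    # Elasticsearch output
    Match  a                     # receives everything tagged 'a'
    Host   elasticsearch.internal
    Port   9200

[OUTPUT]
    Name          splunk         # Splunk output also matches 'a',
    Match         a              # so the same records go to both backends
    Host          splunk.internal
    Port          8088
    Splunk_Token  YOUR-HEC-TOKEN
```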
Here's a small benchmark that we run as part of our integration test suite, at about 10,000 events per second. The CPU time here, in CPU-seconds, is extremely small, and from a memory perspective there's 3.79 MB and 7.48 MB for 10,000 events per second; that's not too bad. So it's really focused on extremely high performance, really high portability, and making sure that you're able to plug in what you need.
And when we look at the latter side, there are really four places where this comes into play, and we'll walk through each of them individually. The first is reducing costs. I think all of us have gone through this notion that we're collecting more and more data, everything's becoming more and more expensive, but the insights are just not necessarily lining up with that data curve.
So how can we make that more effective and send data to where it needs to go to maximize that insight curve? Then there's enriching data: taking data and adding context, context for analytics; we'll walk through a Kubernetes use case. Then unifying the format: how can you do processing, how can you do things that aren't based on any specific schema? And then decreasing vendor lock-in: a lot of vendors will provide agents, proprietary or open source, that only allow you to send data to one specific backend, and the truth is, backends just keep evolving.
So let's look at the first one: reducing cost. How does this work? There are a couple of ways you can reduce costs. The first, which I think is the most basic: say on the right-hand side we're tailing a log and we're sending it to this $$$ expensive backend. Now, what we can do is add a filter. We can just say: let's remove the noise; remove null values, remove empty values, remove values that might contain debug or info logs. Let's get rid of that.
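As a sketch of that kind of noise filter, Fluent Bit's grep filter can drop records by regex; the tag and key name here are illustrative:

```
[FILTER]
    Name     grep
    Match    app.*               # apply to records tagged app.*
    # Drop any record whose 'log' field mentions debug or info
    Exclude  log (debug|info)
```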
The second way is, of course, multiple backends; we're talking about wrangling data here. There are, you know, cheaper backends that are available. There's file system storage that's available, and this can be something where I might be a financial institution that needs to archive data for a period of seven years.
I might need to take data and just hold it for compliance reasons, from the security side. So what I can do is take that data, remove the noise, and send it over to the expensive backend, but also send a copy to a cheaper backend. I can also use specific pieces in the log to say I want to send error logs to my super fast, expensive backend, and I want to send info logs to my cheap backend, and then build on top of that.
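One way to sketch that error/info split is with the rewrite_tag filter, assuming records carry a "level" key; the key and the backends here are illustrative:

```
[FILTER]
    Name   rewrite_tag
    Match  app.log
    # Re-tag error records as error.app.log; 'false' drops the original copy
    Rule   $level ^error$ error.$TAG false

[OUTPUT]
    Name   es                    # fast, expensive backend gets errors only
    Match  error.*
    Host   elasticsearch.internal
    Port   9200

[OUTPUT]
    Name    s3                   # cheap archive gets the remaining records
    Match   app.log
    bucket  log-archive
    region  us-east-1
```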
One thing that Fluent Bit offers is really smart snapshotting. This comes with our stream processing feature, which we'll talk a little bit about later, but what it allows you to do is say: if I encounter a specific event, don't just send me those 404 events or those error events; send them to me with context. I want to see the path before and I want to see the path after. And that is really awesome when you get into "I want to send snapshots based on certain metric values."
I want to send snapshots based on certain error values. It can really help you reduce noise but keep context, which is something a lot of tools don't allow you to do.

Now, let's talk about enriching data. One of the major places that Fluent Bit is deployed is within Kubernetes, and when you deploy within Kubernetes, there's a lot of context that each log should have: the pod, the namespace, the container. These are pieces of information that are helpful when you go to debug.
So when you're doing this enrichment, we can do things like AWS lookup filters that give you things like, hey, these are the cloud resources, this is the region, this is the AZ. You can use GeoIP lookup, so as those messages stream in, it can tell you inline what country or what city you're in, in case you're subject to certain privacy regulations. And then the last one, of course, which we just talked about, is the Kubernetes filter. Now, these are things that are continually evolving, and you can enrich with custom data.
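A minimal sketch of that Kubernetes enrichment, using the conventional kube.* tag prefix; the paths are the usual defaults, not specifics from the talk:

```
[INPUT]
    Name    tail
    Path    /var/log/containers/*.log
    Parser  cri                  # or 'docker', depending on the runtime
    Tag     kube.*

[FILTER]
    Name       kubernetes
    Match      kube.*
    # Talks to the Kubernetes API and attaches pod name, namespace,
    # container name, labels, and annotations to each record
    Merge_Log  On
```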
We have folks who build plugins that do security lookups for certain IPs, to tell whether or not they're malicious. So the really great part here is that, because this is open source, there's a lot of extensibility around enriching your data.

Now, unifying format. We sort of talked about this in the beginning, but to dive a little deeper: one goal is to take all this unstructured data and give it some structure, so we can run metrics on top of it and perform aggregations.
We can send it to a place where it will be indexed correctly and most optimally. One thing Fluent Bit includes is a list of parsers: Docker and CRI for container-based environments; logfmt, JSON, CSV; things like syslog over TCP, RFC 3164, RFC 5424. All these various RFCs and formats come out of the box with Fluent Bit, making it very, very simple to just connect in the data source, parse it, and have some level of key-value pairs, making that data more useful.
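For illustration, wiring one of those stock parsers to an input looks roughly like this; the path is a placeholder, and syslog-rfc3164 is one of the parsers shipped in the default parsers.conf:

```
[SERVICE]
    Parsers_File  parsers.conf   # ships with Fluent Bit

[INPUT]
    Name    tail
    Path    /var/log/syslog
    Parser  syslog-rfc3164       # turns raw lines into key/value pairs
    Tag     sys
```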
What we did about five or six years ago was say we want to make something very, very lightweight, deployable at the edge, and as folks started deploying Fluent Bit at the edge more and more, what we found is that, because we were using so few resources, we could add a decent amount of power in the middle with SQL stream processing.
So what does this allow you to do? Essentially, within the pipeline that Fluent Bit has, its inputs, its parsers to build that formatting, the filtering to remove or enrich data, the storage layer, and the router, we've added this brand new stream processor, and we use ANSI-compliant SQL here. We're not trying to invent a brand new language; I think there are quite enough query languages out there. So if we can use standard SQL, we can use it to allow for aggregations, and we can build predictions, we can build functions.
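A sketch of what such a task can look like: a streams file (referenced via Streams_File in the [SERVICE] section) holding a tumbling-window aggregation; the names and the cpu_p key are illustrative:

```
[STREAM_TASK]
    Name  cpu_avg_5s
    # Every 5 seconds, emit one record with the average of cpu_p,
    # re-tagged 'cpu.avg' so the router can send it anywhere
    Exec  CREATE STREAM cpu_avg WITH (tag='cpu.avg') AS SELECT AVG(cpu_p) FROM TAG:'cpu.*' WINDOW TUMBLING (5 SECOND);
```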
The great part about this: no database required. It runs in memory and is still very, very performant, so even if you're collecting thousands of logs per second, we can run those aggregations. This can be very beneficial when you're trying to wrangle data to multiple backends, because you might not want to send a thousand metric messages per second; instead, you can aggregate that down to one message per second that just includes all the details, the summary, the average. The other big part about this is that it's schemaless, so we don't require any format.
So, some use cases that come with stream processing, and we're going to demo a bit of this in a little bit. The first is aggregating and routing data effectively: again, summarize those events before sending; you can use max, min, average, sum. Then there's sending only the events that matter; we talked about this a little bit, where you can send data with context using the snapshot parameter. And then time series predictions: Fluent Bit does offer some time series prediction capabilities, where you can say, let me predict out to the next 10 seconds and then alert faster.
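The prediction piece is exposed as a stream-processor function; a hedged sketch, with the window size and key chosen for illustration:

```
[STREAM_TASK]
    Name  cpu_forecast
    # Forecast the value of cpu_p 10 seconds ahead over a 10-second window
    Exec  CREATE STREAM cpu_pred WITH (tag='cpu.pred') AS SELECT TIMESERIES_FORECAST(cpu_p, 10) FROM TAG:'cpu.*' WINDOW TUMBLING (10 SECOND);
```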
So if we find that CPU or memory, or some metric derived from a log, is triggering at some level, we can alert directly to an output like Slack; we can send a message to Splunk; we can send a message to Elasticsearch, or Loki, or Postgres.
We can do all sorts of really fun things when we're doing a little bit of this analysis at the edge layer from the stream processing side. Now, this is not meant to replace your backend analytics by any means, but instead to be an augmentation, where you can say: instead of having to run these alerts and cause all this havoc and processing cost, or trying to do schema-on-read or schema-on-write type operations in my pipelines, why don't I bring some of that logic to a distributed layer?
So instead of centralizing your backend analytics, why don't we distribute that across thousands and thousands of edges: your pods, your nodes, your Kubernetes nodes, things that are ephemeral, that are growing, that are distributed, and add a little bit of load to each of them? It's the same benefit we see from the cloud computing side of going distributed; we're trying to bring that to the analytics space here. Again, not a replacement, but something that can really augment your entire pipeline.
Now, something I'm very excited about is some new features that are coming out, and this is really a place where we've been investing to make sure this process is a little easier and can benefit the entire Fluent Bit community. The first is dynamic stream processing: being able to dynamically switch my endpoint from one backend to another.
Next, you can imagine using perhaps a billing alert on a certain cost threshold to say: switch my backend from a very expensive backend to very cheap block storage. And this is something that's configurable via an HTTP endpoint: you can view, list, and create stream processor tasks. For example, on the right, we have what a stream processor task in Fluent Bit looks like, and then there's the live query perspective.
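Roughly, the interaction looks like the sketch below. Fluent Bit's built-in HTTP server is a real [SERVICE] option; the task-management paths themselves are illustrative of the demo, not a documented, stable API:

```
# Enable the HTTP server in the main config:
#   [SERVICE]
#       HTTP_Server  On
#       HTTP_Listen  0.0.0.0
#       HTTP_Port    2020

# List stream processor tasks (illustrative path)
curl -s http://fluent-bit-host:2020/api/v1/stream_processor/tasks

# Create a new task by POSTing its SQL (illustrative path and payload)
curl -s -X POST http://fluent-bit-host:2020/api/v1/stream_processor/tasks \
     -d "CREATE STREAM errors WITH (tag='err') AS SELECT * FROM TAG:'app.*';"
```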
We can actually flush that data directly to the lens of what we're operating on, so we're able to very quickly go ahead and see what live data looks like and how we can most effectively bring it to a place that's going to be really useful for us. So with that, let's go ahead and switch into a quick demo. We're going to look at live stream processing and creating some dynamic stream processing.
I have three windows open. The second window I have here is really just another window where I can run logger, so we can create some test messages; you can see one just popped up here. And the last one is my local laptop. This server is running in the cloud, so I'm actually going to be performing all the stream processing and live query jobs remotely, not on the same machine.
The second piece is the SQL statement. I create a snapshot, tail_snapshot, and I'm doing this with a window of five seconds, with the tag tail-snapshot, which will be used for routing, in case I have another event that only matches tail-snapshot. And I'm going to select every single piece of the record that has the initial tag of tail. The other piece that I've added on here is that I'm limiting this to five records, because we're doing a live query.
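A sketch of the statement being described, following the stream processor's CREATE SNAPSHOT / FLUSH SNAPSHOT pattern; the property names are approximate, per the Fluent Bit stream-processing docs:

```
-- Keep a rolling 5-second buffer of everything tagged 'tail'
CREATE SNAPSHOT tail_snapshot WITH (seconds = 5) AS SELECT * FROM TAG:'tail';

-- When flushed, emit the buffered records under tag 'tail-snapshot'
-- so a matching output can route them
FLUSH SNAPSHOT tail_snapshot WITH (tag = 'tail-snapshot');
```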
Okay, looks like we got a 201 Created. Let's go ahead and list that again, so run the same command again, and here we can see that our stream processor task is now listed. So we have that available to us. Now let's go ahead and run the live query. So here's the live query; we're going to go ahead and grab that.
We're going to go ahead and again switch our endpoint here to the perf-test one. Now, if I go back really quickly, there might have been a couple of messages since my "hi". But let's look just in case; my guess is this will be empty. Okay, so here we don't really have anything; it looks like it is still pretty empty.
Now, if I were to run this again, because it's flushed the context, I would actually just get an empty message. So that's what shows up right here.
Once we are below a threshold, or if we're above a threshold, it keeps sending all the most relevant data, and then we can live query that data. Now, I'm querying this just with my curl command, but I think, as everyone here can attest, you can imagine the power of being able to query this via remote APIs, via automation, and really plug this into your entire data pipeline.
So with that, I'm going to go ahead and stop sharing. For those who are interested in additional stream processing functionality or hands-on stream processing activities, I have attached in the slide deck a GitHub repo that includes four examples. Those four examples walk through very basic stream processing; some advanced stream processing with some GROUP BYs, some max, min, and average; as well as a time series prediction. So it serves as a really nice basis.
So anyway, thanks. Thank you all for giving me the opportunity to speak. I do hope that you'll be able to join us in the Fluent community on Slack at slack.fluentd.org, participate in any of our community events, and give us feedback. We're always looking to improve the projects and build a better ecosystem for our users. So with that, thank you so much and have a good one.