Description
Modernizing to the Fluent Stack: An Asana Story - James Elías Sigurðarson, Asana
In 2021, Asana migrated its event emission infrastructure to use Fluent Bit, from an old logging system based on Facebook's Scribe. At Asana, event emission requires high flexibility, performance, and durability, as it powers important Asana features, such as mobile notifications. In this session, we'll take a look at how we modernised our infrastructure, and used the Fluent Node library to meet these challenges, eventually scaling up to 3.5 billion events per day.
Hello, good afternoon everyone. My name is James, and I'm here to give a talk about a project where we moved a sort of interesting system we had over to Fluent logging.
Just a quick background about myself: I work at Asana, where I'm a technical lead on the infra topology pod (our word for a team or group). I've been there since 2019, mostly working on stability and scalability. And a quick background on the system we had to work with here.
It wasn't quite logging per se; it was event logging, which is sort of the same thing. Basically, the goal was to allow developers working on the application to send events and reliably get them to where they wanted them to go, so they could process things elsewhere. Think notifications, for example: when you create a task in Asana, you want to send a notification out to a bunch of people, and that happened through this system. In effect, the system worked like this.
That's the component we're talking about here. This sink would basically handle writing the events to Kinesis. The reason we did it this way initially is that these Node processes didn't have a very long lifetime; we cycled them relatively often. So having an external process here helped us get durability, and helped us get some resilience during network outages and so on. The API we exposed to developers for this interface was simple: you call the event logger with the category you want and pass the data, and additionally you could configure where you wanted your events to go.
So in our system, you basically passed in the name of the Kinesis stream, and then you could set the event category to direct events there; otherwise everything just went to the default stream, which was pretty cool.
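To make the shape of that API concrete, here is a minimal hypothetical sketch. The interface, names, and stream names are my invention for illustration, not Asana's actual code:

```typescript
// Hypothetical sketch of the developer-facing API described above.
interface EventLoggerConfig {
  // category -> Kinesis stream name; unlisted categories use the default.
  categoryStreams: Record<string, string>;
  defaultStream: string;
}

class EventLogger {
  constructor(private config: EventLoggerConfig) {}

  log(category: string, data: Record<string, unknown>): void {
    const stream =
      this.config.categoryStreams[category] ?? this.config.defaultStream;
    // Hand the event off to the local forwarding process (Scribe, later
    // Fluent Bit), which buffers it durably and writes it to `stream`.
    console.log(`forwarding to ${stream}`, { category, ...data });
  }
}

// Usage as described in the talk: log a category plus data, with routing
// configured separately.
const eventLogger = new EventLogger({
  categoryStreams: { notifications: "notifications-stream" },
  defaultStream: "default-stream",
});
eventLogger.log("notifications", { taskId: "123" });
```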
This system as a whole processes 3.3 billion events per day, and almost all of those events are actually really important.
So we needed to be really sure to get the durability aspect right. Out of these needs, we gathered four requirements for the system. It had to be flexible: we needed to be able to send basically anything we wanted to it without too much work or too much effort.
We wanted minimal-effort logging for developers, but we also wanted really high performance: it had to be able to handle the load, and we wanted to get those events into Kinesis as fast as possible. And ideally, this system was meant to survive, as I mentioned, Node crashes and outages as well.
That means the EC2 node going down, or even the availability zone going down, and things like that. And then we also needed to be able to configure the system easily, because we expected really anyone at the company to be able to do it. So, in the past we had this architecture built out with Scribe.
And it was finally time to get rid of it. We had built this with Scribe, which basically handled the buffering part and the receiving-messages part, but it didn't have the ability to pump things into Kinesis. What you could do was implement a custom Scribe server that supported the protocol, so we had implemented a custom JVM application, the Scribe Kinesis Sink, which handled taking those log messages and pushing them out into Kinesis.
But as I mentioned, we needed to move away from this. Mostly this was part of a bigger project moving Asana's systems into Kubernetes, and Scribe being dead since 2014 made that very hard: it was a C++ binary, it needed to be compiled specially for our environment, it needed a bunch of work. And the JVM application doesn't really help with running very small services and small pods, so we ended up deciding to get rid of it.
There were a few alternatives. There is a Kinesis library for Node; we could also have written log files out and then used Fluentd or Fluent Bit to parse those files and write them into Kinesis; and then we could also use the forward port. We looked at the pros and cons of these approaches and ended up selecting Fluentd with the forward port.
Mostly the reason was that writing direct to Kinesis doesn't survive Node process crashes: if there was something left in the process that you hadn't gotten out to Kinesis, because of a network outage or something, it was going to be lost. So that was more or less off the table. Given that, the log-file-plus-Fluentd approach was a little bit more enticing.
It was arguably simpler, but the main thing that made it hard for us to work with was its low ease of use. Going back to this slide, with this configuration of the categories, we really wanted to be able to do anything here, like being able to send various tags and so on; and with file tailing the tag effectively came from the file, so at the time you would have needed a separate file for each tag.
So that's why we went with the forward port. This has some cost to durability, because now you need to make sure the message gets to Fluentd before the process dies. But hopefully that's mitigated a little bit by the Fluentd process being local to the EC2 node.
So that's the approach we went with. Let's talk about how we built it. We set this up initially using Fluentd.
The reason for this was just that we were already using Fluentd for logging, so it was pretty natural to drop in a forward-port config and see what happens. We set this up with disk buffering, of course, so that when it couldn't flush events, it would save them and try again later. And basically this turned out to be a pretty good swap for the existing system.
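As a rough illustration, the setup described here looks something like the following Fluentd config: a forward input plus a Kinesis Streams output with a file buffer. The stream name and paths are made up, and the output assumes the fluent-plugin-kinesis plugin:

```
# Illustrative only: forward input feeding the Kinesis Streams output
# (from the fluent-plugin-kinesis gem) with a disk buffer.
<source>
  @type forward        # speaks the fluent forward protocol
  port 24224
  bind 127.0.0.1
</source>

<match events.**>
  @type kinesis_streams
  stream_name default-stream
  region us-east-1
  <buffer>
    @type file         # disk buffering: save failed flushes, retry later
    path /var/log/fluentd/buffer/events
    flush_interval 1s
  </buffer>
</match>
```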
So we tagged the records: as you called log, the event got tagged and became a fluent record, a format according to the fluent forward protocol.
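For reference, per the Fluentd forward protocol specification, an encoded event is essentially a msgpack-encoded array. The two simplest shapes look like this (the tags and values are illustrative):

```typescript
// Message mode: one [tag, time, record] array per event.
type MessageMode = [tag: string, time: number, record: object];
const single: MessageMode = ["events.notifications", 1620000000, { taskId: "123" }];

// Forward mode: one tag with a batch of [time, record] entries in a single
// request; PackedForward is the same idea with the entries pre-serialized.
type ForwardMode = [tag: string, entries: Array<[time: number, record: object]>];
const batch: ForwardMode = [
  "events.notifications",
  [
    [1620000000, { taskId: "123" }],
    [1620000001, { taskId: "456" }],
  ],
];
```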
In order to test this, we really didn't want to lose any events. So what we decided to do was set it up side by side: we still had Scribe and everything running, but we added this config to Fluentd, and we also made a test Kinesis stream where we wrote everything. So basically every single event the Node process had, it now wrote both to Scribe and to Fluentd, and all Fluentd did with it was put it into a throwaway Kinesis stream, where it would get discarded.
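Here is a sketch of that double-write arrangement, with stand-in interfaces rather than Asana's real clients:

```typescript
interface EventSink {
  emit(category: string, data: object): Promise<void>;
}

// Wrap the old and new paths so production still depends only on the old
// one while the new one receives full traffic.
function doubleWrite(oldSink: EventSink, newSink: EventSink): EventSink {
  return {
    async emit(category, data) {
      // New path: mirror to the throwaway Kinesis stream; a failure here
      // must never affect callers, so just count it.
      newSink.emit(category, data).catch(() => {
        // increment a "mirror failed" metric here
      });
      // Old path: this is what the product actually relies on.
      await oldSink.emit(category, data);
    },
  };
}
```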
What this did for us was allow us to roll everything out as the system was actually supposed to work, but without relying fully on it, and through this we caught quite a few bugs. The big one we ran into was that once we were at about 500 events per second, we just saw the graph drop events like this.
What this metric is: it's emitted by the Node process, so it's how many events, as perceived by the Node process, it didn't manage to send out. We saw random spikes of around 40 events, which meant that when the process died, there were 40 events left in the queue that we weren't even able to get to Fluentd.
Inspecting this, one thing we realized was that this system had really spiky load. Some developer would write an action that suddenly generated 10,000 events, so all of a sudden this one Node process would need to send thousands of events into one Fluentd instance, and then it would drop to zero right after. So then we looked into why this was happening.
It turned out the client library would wait for the operating system to report that each write had fully been written to the socket, and that had quite a huge performance cost here. So basically, we just decided to make a faster library.
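To illustrate the bottleneck (this is not the actual library code): in Node, awaiting the write callback for every single event serializes all sends behind the OS handoff, while letting the stream buffer internally and only pausing on backpressure keeps the socket busy.

```typescript
import { Socket } from "net";

// Slow pattern: await the OS-level write confirmation per event.
async function sendOneByOne(sock: Socket, payloads: Buffer[]): Promise<void> {
  for (const payload of payloads) {
    await new Promise<void>((resolve, reject) =>
      // The callback fires once the data has been fully handed to the OS.
      sock.write(payload, (err) => (err ? reject(err) : resolve()))
    );
  }
}

// Faster pattern: let Node buffer writes internally and only pause when the
// internal buffer is full (write() returns false), resuming on 'drain'.
async function sendBuffered(sock: Socket, payloads: Buffer[]): Promise<void> {
  for (const payload of payloads) {
    if (!sock.write(payload)) {
      await new Promise<void>((resolve) => sock.once("drain", resolve));
    }
  }
}
```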
We implemented a different library, with a little bit of a different architecture, to be able to control how many events you write at the same time and which forwarding mode you use. The fluent forward protocol supports three to four different ways of sending events: you can batch them together, or you can send them one by one, and we wanted to be able to test which one worked for us, and also things like acknowledgements and a bunch of other stuff.
If you're interested in this, I highly recommend checking it out.
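I believe this is the open-source @fluent-org/logger package Asana published. A minimal usage sketch based on my reading of its README follows; the socket options and emit call match the README, but treat the eventMode and ack option names as unverified assumptions:

```typescript
import { FluentClient } from "@fluent-org/logger";

// Tag prefix "events" plus the label below yields the tag
// "events.notifications".
const logger = new FluentClient("events", {
  socket: {
    host: "localhost",
    port: 24224,
    timeout: 3000, // ms
  },
  // Assumed options: batch events per request and request server acks.
  eventMode: "PackedForward",
  ack: {},
});

logger.emit("notifications", { taskId: "123" });
```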
Anyway, we did this, and now Node was actually able to keep up, so we didn't see as many dropped messages. We were able to go from 500 events per second on the last slide to 30k events per second during peak time, which was about 240 events per second per host.
But then Fluentd itself started using about two gigs of memory and started losing a lot of events, effectively immediately as we rolled out to 30,000 events per second. When we were at 20,000 this graph was at zero, and then we went up to 30,000 and this graph went up to 80,000, at times over 300,000, dropped events. So we were dropping a relatively huge number of events beyond 20k, and this just turned out to be, I guess, the lower performance ceiling of Fluentd.
A
So
yeah
like
there
were
a
bunch
of
approaches
that
we
could
have
taken
here.
We
could
fluently.
Has
this
thing
called
workers,
so
we
thought
about
setting
that
up,
but
before
we
did
that
someone
just
said,
let's
try
out
fluent
pit
and
see
what
happens
so.
That's
we
just
did
that
it
actually
took
less
than
a
day
to
just
throw
it
out
to
our
infrastructure
and
try
it
and
we
set
up
like
we
had
the
same
setup
file
system
buffering.
All of our events were getting double-written to both Scribe and Kinesis, and we saw no dropped messages from the perspective of Node, so it managed to flush out all of its events into Fluent Bit. At this point we were a little bit sad that we had to introduce a new system, so now we have both Fluentd and Fluent Bit,
but we decided to go with what we had. There's still actually a pending item to dig into, whether we can improve the performance of Fluentd, but I'm now actually pushing for Fluent Bit. So now that we'd done this, we needed to actually roll it out and start relying on it.
What we did was make it a per-category thing: for category A, we could set our configuration to write it to Fluent Bit and write everything else to Scribe, and then slowly we increased the size of the set that gets written to Fluent Bit, as sketched below.
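A hypothetical sketch of that per-category switch (all names invented):

```typescript
// Categories in this set take the new Fluent Bit path; everything else
// stays on Scribe. The rollout consisted of growing this set over time.
const fluentBitCategories = new Set<string>(["category-a"]);

function routeFor(category: string): "fluent-bit" | "scribe" {
  return fluentBitCategories.has(category) ? "fluent-bit" : "scribe";
}
```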
Then we worked with the consumer teams to ensure the rollout went well: they were monitoring their downstream systems to make sure the log volume didn't drop massively all of a sudden, and so on. This went pretty well; we had a couple of issues with Fluent Bit.
For example, turning checksumming on for Fluent Bit when you're using disk buffering is really cool. What it does is verify the CRC checksum of each chunk, asking: is this chunk still okay? And if it's broken, it just throws it away.
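Illustrative Fluent Bit settings for that behavior follow; the paths are made up, but storage.checksum is a real Fluent Bit option for CRC-checking filesystem-buffered chunks:

```
[SERVICE]
    storage.path      /var/log/fluent-bit/buffer
    storage.checksum  on    # CRC-check buffered chunks; discard corrupt
                            # ones instead of crash-looping on them

[INPUT]
    Name              forward
    Listen            127.0.0.1
    Port              24224
    storage.type      filesystem   # disk buffering, as with Fluentd
```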
Without that, what we saw happening was just a crash loop. We also fixed some upstream bugs in how Fluent Bit computed partition keys for KPL aggregation, and with that we managed to roll it out fully.
A couple of learnings from this work, for me personally. Being able to do this double-write thing allowed us to be really calm during the rollout: oh no, it broke? We just turned it off. And getting to actually high performance surfaced a lot of fringe cases, and those fringe cases ended up being, I guess, where things were broken.