Description
Modernizing to the Fluent Stack: An Asana Story - James Elías Sigurðarson, Asana
In 2021, Asana migrated its event emission infrastructure to use Fluent Bit, from an old logging system based on Facebook's Scribe. At Asana, event emission requires high flexibility, performance, and durability, as it powers important Asana features, such as mobile notifications. In this session, we'll take a look at how we modernised our infrastructure, and used the Fluent Node library to meet these challenges, eventually scaling up to 3.5 billion events per day.
Hello, good afternoon everyone. My name is James, and I'm here to give a talk about a project where we moved a sort of interesting system we had over to Fluent logging.
Just a quick background about myself: I work at Asana, where I'm a technical lead on the infra topology pod (our word for a team or group). I've been there since 2019, mostly working on stability and scalability. And a quick background on the system we had to work with here.
It wasn't quite logging per se; it was event logging, which is sort of the same thing. Basically, the goal was to allow developers working on the application to send events and reliably get them to where they wanted them to go, so they could process things elsewhere. Think notifications, for example: when you create a task in Asana, you want to send a notification out to a bunch of people, and that happened through this system. In effect, the system worked like this.
That's the component we're talking about here. This sink would basically handle writing the events to Kinesis. The reason we did it this way initially is that these Node processes didn't have a very long lifetime; we cycled them relatively often. So having an external process here helped us get durability, and helped us get some resilience during network outages and so on. The API we exposed to developers for this interface was simple: you call the event logger with the category you want and pass the data, and additionally you could configure where you wanted your events to go.
So in our system, you basically passed in the name of the Kinesis stream, and then you could set the event category to direct events there; otherwise everything just went to the default stream, which was pretty cool.
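To make the shape of that API concrete, here is a minimal hypothetical sketch. The interface, names, and stream names are my invention for illustration, not Asana's actual code:

```typescript
// Hypothetical sketch of the developer-facing API described above.
interface EventLoggerConfig {
  // category -> Kinesis stream name; unlisted categories use the default.
  categoryStreams: Record<string, string>;
  defaultStream: string;
}

class EventLogger {
  constructor(private config: EventLoggerConfig) {}

  log(category: string, data: Record<string, unknown>): void {
    const stream =
      this.config.categoryStreams[category] ?? this.config.defaultStream;
    // Hand the event off to the local forwarding process (Scribe, later
    // Fluent Bit), which buffers it durably and writes it to `stream`.
    console.log(`forwarding to ${stream}`, { category, ...data });
  }
}

// Usage as described in the talk: log a category plus data, with routing
// configured separately.
const eventLogger = new EventLogger({
  categoryStreams: { notifications: "notifications-stream" },
  defaultStream: "default-stream",
});
eventLogger.log("notifications", { taskId: "123" });
```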
This system as a whole processes 3.3 billion events per day, and almost all of those events are actually really important.
So we needed to be really sure to get the durability aspect right. Out of these needs, we gathered four requirements for the system. It had to be flexible: we needed to be able to send basically anything we wanted to it without too much work or too much effort.
We wanted minimal-effort logging for developers, but we also wanted really high performance: it had to be able to handle the load, and we wanted to get those events into Kinesis as fast as possible. And ideally, this system was meant to survive, as I mentioned, Node crashes and outages as well.
That means the EC2 node going down, or even the availability zone going down, and things like that. And then we also needed to be able to configure the system easily, because we expected really anyone at the company to be able to do it. So, in the past we had this architecture built out with Scribe.
And it was finally time to get rid of it. We had built this with Scribe, which basically handled the buffering part and the receiving-messages part, but it didn't have the ability to pump things into Kinesis. What you could do was implement a custom Scribe server that supported the protocol, so we had implemented a custom JVM application, the Scribe Kinesis Sink, which handled taking those log messages and pushing them out into Kinesis.
But as I mentioned, we needed to move away from this. Mostly this was part of a bigger project moving Asana's systems into Kubernetes, and Scribe being dead since 2014 made that very hard: it was a C++ binary, it needed to be compiled specially for our environment, it needed a bunch of work. And the JVM application doesn't really help with running very small services and small pods, so we ended up deciding to get rid of it.
There were a few alternatives. There is a Kinesis library for Node; we could also have written log files out and then used Fluentd or Fluent Bit to parse those files and write them into Kinesis; and then we could also use the forward port. We looked at the pros and cons of these approaches and ended up selecting Fluentd with the forward port.
Mostly the reason was that writing direct to Kinesis doesn't survive Node process crashes: if there was something left in the process that you hadn't gotten out to Kinesis, because of a network outage or something, it was going to be lost. So that was more or less off the table. Given that, the log-file-plus-Fluentd approach was a little bit more enticing.
It was arguably simpler, but the main thing that made it hard for us to work with was its low ease of use. Going back to this slide, with this configuration of the categories, we really wanted to be able to do anything here, like being able to send various tags and so on; and with file tailing the tag effectively came from the file, so at the time you would have needed a separate file for each tag.
So that's why we went with the forward port. This has some cost to durability, because now you need to make sure the message gets to Fluentd before the process dies. But hopefully that's mitigated a little bit by the Fluentd process being local to the EC2 node.
So that's the approach we went with. Let's talk about how we built it. We set this up initially using Fluentd.
The reason for this was just that we were already using Fluentd for logging, so it was pretty natural to drop in a forward-port config and see what happens. We set this up with disk buffering, of course, so that when it couldn't flush events, it would save them and try again later. And basically this turned out to be a pretty good swap for the existing system.
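As a rough illustration, the setup described here looks something like the following Fluentd config: a forward input plus a Kinesis Streams output with a file buffer. The stream name and paths are made up, and the output assumes the fluent-plugin-kinesis plugin:

```
# Illustrative only: forward input feeding the Kinesis Streams output
# (from the fluent-plugin-kinesis gem) with a disk buffer.
<source>
  @type forward        # speaks the fluent forward protocol
  port 24224
  bind 127.0.0.1
</source>

<match events.**>
  @type kinesis_streams
  stream_name default-stream
  region us-east-1
  <buffer>
    @type file         # disk buffering: save failed flushes, retry later
    path /var/log/fluentd/buffer/events
    flush_interval 1s
  </buffer>
</match>
```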
So we tagged the records: as you called log, the event got tagged and became a fluent record, a format according to the fluent forward protocol.
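For reference, per the Fluentd forward protocol specification, an encoded event is essentially a msgpack-encoded array. The two simplest shapes look like this (the tags and values are illustrative):

```typescript
// Message mode: one [tag, time, record] array per event.
type MessageMode = [tag: string, time: number, record: object];
const single: MessageMode = ["events.notifications", 1620000000, { taskId: "123" }];

// Forward mode: one tag with a batch of [time, record] entries in a single
// request; PackedForward is the same idea with the entries pre-serialized.
type ForwardMode = [tag: string, entries: Array<[time: number, record: object]>];
const batch: ForwardMode = [
  "events.notifications",
  [
    [1620000000, { taskId: "123" }],
    [1620000001, { taskId: "456" }],
  ],
];
```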
In order to test this, we really didn't want to lose any events. So what we decided to do was set it up side by side: we still had Scribe and everything running, but we added this config to Fluentd, and we also made a test Kinesis stream where we wrote everything. So basically every single event the Node process had, it now wrote both to Scribe and to Fluentd, and all Fluentd did with it was put it into a throwaway Kinesis stream, where it would get discarded.
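Here is a sketch of that double-write arrangement, with stand-in interfaces rather than Asana's real clients:

```typescript
interface EventSink {
  emit(category: string, data: object): Promise<void>;
}

// Wrap the old and new paths so production still depends only on the old
// one while the new one receives full traffic.
function doubleWrite(oldSink: EventSink, newSink: EventSink): EventSink {
  return {
    async emit(category, data) {
      // New path: mirror to the throwaway Kinesis stream; a failure here
      // must never affect callers, so just count it.
      newSink.emit(category, data).catch(() => {
        // increment a "mirror failed" metric here
      });
      // Old path: this is what the product actually relies on.
      await oldSink.emit(category, data);
    },
  };
}
```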
What this did for us was allow us to roll everything out as the system was actually supposed to work, but without relying fully on it, and through this we caught quite a few bugs. The big one we ran into was that once we were at about 500 events per second, we just saw the graph drop events like this.
What this metric is: it's emitted by the Node process, so it's how many events, as perceived by the Node process, it didn't manage to send out. We saw random spikes of around 40 events, which meant that when the process died, there were 40 events left in the queue that we weren't even able to get to Fluentd.
Inspecting this, one thing we realized was that this system had really spiky load. Some developer would write an action that suddenly generated 10,000 events, so all of a sudden this one Node process would need to send thousands of events into one Fluentd instance, and then it would drop to zero right after. So then we looked into why this was happening.
It turned out the client library would wait for the operating system to report that each write had fully been written to the socket, and that had quite a huge performance cost here. So basically, we just decided to make a faster library.
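To illustrate the bottleneck (this is not the actual library code): in Node, awaiting the write callback for every single event serializes all sends behind the OS handoff, while letting the stream buffer internally and only pausing on backpressure keeps the socket busy.

```typescript
import { Socket } from "net";

// Slow pattern: await the OS-level write confirmation per event.
async function sendOneByOne(sock: Socket, payloads: Buffer[]): Promise<void> {
  for (const payload of payloads) {
    await new Promise<void>((resolve, reject) =>
      // The callback fires once the data has been fully handed to the OS.
      sock.write(payload, (err) => (err ? reject(err) : resolve()))
    );
  }
}

// Faster pattern: let Node buffer writes internally and only pause when the
// internal buffer is full (write() returns false), resuming on 'drain'.
async function sendBuffered(sock: Socket, payloads: Buffer[]): Promise<void> {
  for (const payload of payloads) {
    if (!sock.write(payload)) {
      await new Promise<void>((resolve) => sock.once("drain", resolve));
    }
  }
}
```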
We implemented a different library, with a little bit of a different architecture, to be able to control how many events you write at the same time and which forwarding mode you use. The fluent forward protocol supports three to four different ways of sending events: you can batch them together, or you can send them one by one, and we wanted to be able to test which one worked for us, and also things like acknowledgements and a bunch of other stuff.
If you're interested in this, I highly recommend checking it out.
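I believe this is the open-source @fluent-org/logger package Asana published. A minimal usage sketch based on my reading of its README follows; the socket options and emit call match the README, but treat the eventMode and ack option names as unverified assumptions:

```typescript
import { FluentClient } from "@fluent-org/logger";

// Tag prefix "events" plus the label below yields the tag
// "events.notifications".
const logger = new FluentClient("events", {
  socket: {
    host: "localhost",
    port: 24224,
    timeout: 3000, // ms
  },
  // Assumed options: batch events per request and request server acks.
  eventMode: "PackedForward",
  ack: {},
});

logger.emit("notifications", { taskId: "123" });
```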
Anyway, we did this, and now Node was actually able to keep up, so we didn't see as many dropped messages. We were able to go from 500 events per second on the last slide to 30k events per second during peak time, which was about 240 events per second per host.
But then Fluentd itself started using about two gigs of memory and started losing a lot of events, effectively immediately as we rolled out to 30,000 events per second. When we were at 20,000 this graph was at zero, and then we went up to 30,000 and this graph went up to 80,000, at times over 300,000, dropped events. So we were dropping a relatively huge number of events beyond 20k, and this just turned out to be, I guess, the lower performance ceiling of Fluentd.
A
So
yeah
like
there
were
a
bunch
of
approaches
that
we
could
have
taken
here.
We
could
fluently.
Has
this
thing
called
workers,
so
we
thought
about
setting
that
up,
but
before
we
did
that
someone
just
said,
let's
try
out
fluent
pit
and
see
what
happens
so.
That's
we
just
did
that
it
actually
took
less
than
a
day
to
just
throw
it
out
to
our
infrastructure
and
try
it
and
we
set
up
like
we
had
the
same
setup
file
system
buffering.
All of our events were getting double-written to both Scribe and Kinesis, and we saw no dropped messages from the perspective of Node, so it managed to flush out all of its events into Fluent Bit. At this point we were a little bit sad that we had to introduce a new system, so now we have both Fluentd and Fluent Bit,
but we decided to go with what we had. There's still actually a pending item to dig into, whether we can improve the performance of Fluentd, but I'm now actually pushing for Fluent Bit. So now that we'd done this, we needed to actually roll it out and start relying on it.
What we did was make it a per-category thing: for category A, we could set our configuration to write it to Fluent Bit and write everything else to Scribe, and then slowly we increased the size of the set that gets written to Fluent Bit, as sketched below.
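A hypothetical sketch of that per-category switch (all names invented):

```typescript
// Categories in this set take the new Fluent Bit path; everything else
// stays on Scribe. The rollout consisted of growing this set over time.
const fluentBitCategories = new Set<string>(["category-a"]);

function routeFor(category: string): "fluent-bit" | "scribe" {
  return fluentBitCategories.has(category) ? "fluent-bit" : "scribe";
}
```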
Then we worked with the consumer teams to ensure the rollout went well: they were monitoring their downstream systems to make sure the log volume didn't drop massively all of a sudden, and so on. This went pretty well; we had a couple of issues with Fluent Bit.
For example, turning checksumming on for Fluent Bit when you're using disk buffering is really cool. What it does is verify the CRC checksum of each chunk, asking: is this chunk still okay? And if it's broken, it just throws it away.
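Illustrative Fluent Bit settings for that behavior follow; the paths are made up, but storage.checksum is a real Fluent Bit option for CRC-checking filesystem-buffered chunks:

```
[SERVICE]
    storage.path      /var/log/fluent-bit/buffer
    storage.checksum  on    # CRC-check buffered chunks; discard corrupt
                            # ones instead of crash-looping on them

[INPUT]
    Name              forward
    Listen            127.0.0.1
    Port              24224
    storage.type      filesystem   # disk buffering, as with Fluentd
```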
Without that, what we saw happening was just a crash loop. We also fixed some upstream bugs in how Fluent Bit computed partition keys for KPL aggregation, and with that we managed to roll it out fully.
A couple of learnings from this work, for me personally. Being able to do this double-write thing allowed us to be really calm during the rollout: oh no, it broke? We just turned it off. And getting to actually high performance surfaced a lot of fringe cases, and those fringe cases ended up being, I guess, where things were broken.