Description
Monitoring Fluent Bit in Production - Pandu Aji, Microsoft
Fluent Bit comes with built-in features to allow you to query internal information and monitor metrics of each running plugin. But how do you decide which metrics to use for measuring the reliability of your pipeline? How do you leverage the storage metrics, and what are the challenges? How do you detect if your logging pipeline is unhealthy and if there is any congestion? This talk will describe our Fluent Bit monitoring story. It will briefly touch on some of the design choices but will primarily focus on how to monitor Fluent Bit in the production clusters.
I work for the Azure Machine Learning infrastructure team at Microsoft. We run and maintain Kubernetes clusters for the Azure Machine Learning team, and we build tools to improve the speed and reliability of deployments and to increase the observability of the applications running in the clusters.
There are two open source projects that caught our attention: Fluentd and Fluent Bit. We went with Fluentd at the time because Fluent Bit was missing some key features, namely file system buffering and the scripting plugin. But fast forward to about a year ago: our business was growing, and we were looking for a more efficient logging solution.
So this is the current state of our production environment. Some of the interesting numbers: the overall log volume is somewhere between 750k and 850k logs per second in our busiest region, the log volume per cluster is about 80k per second, and the log volume per node varies between 1k and 2k per second.
In our case, the logging agent sidecar is a Microsoft internal agent, but technically you could replace the logging agent sidecar with Fluent Bit to send the logs to the backend.
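As an illustration of that swap, a minimal sketch of a Fluent Bit sidecar that tails the application's log file from a shared volume and forwards it; the path, tag, and service name are hypothetical:

    # Tail the application's log file from a shared emptyDir volume
    # (path and tag are illustrative)
    [INPUT]
        Name   tail
        Path   /var/app-logs/*.log
        Tag    app.*

    # Forward to the aggregator (hypothetical Service name)
    [OUTPUT]
        Name   forward
        Match  *
        Host   log-aggregator.logging.svc.cluster.local
        Port   24224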
We instrument the applications with a logging library. One downside of a logging library is that when you have applications written in different languages, you have to support multiple logging libraries. But we needed the logging library for two reasons. One is that we want all our application logs to have a predefined schema, just to simplify the log queries. Two is that we want the applications to have the option either to write to stdout or to write directly to the sidecar, so the logging library can say, hey, there is a sidecar on my pod, and write to it.
So the aggregator then writes those logs to the storage backend, which in our case is Azure. The aggregator actually does all the business logic to transform the logs and then sends them to the sidecar.
So let's talk about Fluent Bit metrics. The HTTP server in Fluent Bit exposes multiple endpoints.
Two of them that we are going to talk about: the metrics endpoint, which is in Prometheus format and exposes the metrics for each running plugin, and the storage endpoint, which exposes the storage information, but in JSON format. And this is an output of the storage endpoint.
So we see chunks. What are chunks? Chunks are how Fluent Bit groups and stores the logs in the file system. There are two types of chunks. The up chunks exist in the file system as well as in memory; they are the chunks that are currently being processed, and you can configure the maximum number of up chunks allowed in memory. The down chunks only exist in the file system.
So if your logging pipeline is healthy, typically the down chunks count is zero or very close to zero. If your pipeline is congested or slow, then the up chunks start to grow, and when they reach the maximum limit, the down chunks start to accumulate.
So these are actually the metrics that we use to monitor our logging pipeline. Before moving to the next slide, pay attention to the JSON path of the fs chunks up and down: it is storage_layer.chunks.fs_chunks_up or fs_chunks_down.
So this is an example of our config. The name here, fluentbit_storage_layer, is the metric name prefix, and the values field lists the metrics that you want to export, in this case fs_chunks_up and fs_chunks_down.
To get the metric value, the exporter will follow the JSON path; those are the parts in the curly braces, and if you look at it, it is storage_layer.chunks, then fs_chunks_up or fs_chunks_down.
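A minimal reconstruction of that kind of config, assuming the prometheus-community json_exporter; the module name is illustrative:

    modules:
      fluentbit:
        metrics:
          # "name" becomes the metric name prefix
          - name: fluentbit_storage_layer
            type: object
            # JSONPath the exporter follows, in curly braces
            path: "{ .storage_layer.chunks }"
            help: Fluent Bit filesystem chunk counts
            values:
              # Exported as fluentbit_storage_layer_fs_chunks_up / _down
              fs_chunks_up: "{ .fs_chunks_up }"
              fs_chunks_down: "{ .fs_chunks_down }"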
So, while it's loading... okay, it loads now. The top panels here are the global summary of the down chunks across all the clusters. It's kind of nice to be able to see the overall health in a single pane, and you'll see that the down chunks are pretty much zero.
There is a small spike, most likely to five, and then they drop back down to zero. These are from Japan East, and you can see this one is from Germany West Central. Because this panel is global, the dropdown here doesn't apply to it.
So the panels in the file storage group are the down chunks and the up chunks by pod. This is our busiest region, which is East US, so we can maybe switch to Brazil South and see the change. There are no down chunks; it's super healthy. And if you switch back to East US...
Oh, actually, why does it show that? Can you switch... excuse me. Since we cannot switch this, can we switch to the other one? And... okay, never mind, I'll just go ahead and use the screenshots I prepared.
So if you remember, we looked at the overall log volume; that's actually taken from this graph. It doesn't include the volume from the logging agent sidecar, so it's only from the forwarder.
The next two panels, on the I/O row, are the input rate and the output rate by pod. The last one is the errors and the retries; we seldom use it, but it's nice to have as an additional diagnostic tool.
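A sketch of the kind of PromQL that could back panels like these, assuming Fluent Bit's built-in Prometheus metric names and a pod label added by your scrape config:

    # Input and output record rates per pod
    sum by (pod) (rate(fluentbit_input_records_total[5m]))
    sum by (pod) (rate(fluentbit_output_proc_records_total[5m]))

    # Errors and retries per output plugin instance
    sum by (name) (rate(fluentbit_output_errors_total[5m]))
    sum by (name) (rate(fluentbit_output_retries_total[5m]))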
Next, we'll take a look at a couple of case studies.
The last panel on the bottom right there is actually taken from another dashboard; I just shuffled it here and took a screenshot. It is the CPU usage of the forwarder, and it looks similar to the input rate.
So what happened here? There is one tenant that generated too many logs, and it has very few replicas, so it only happened on one or two nodes. It generated too many logs and was holding on to resources, so as a result, our forwarder was struggling and failed to keep up.
Now, if you look at the input rate, there is a very big jump at around maybe 16:45, from maybe 20k-30k to 800k. Then, if you look at the aggregator replicas panel, the aggregator seems to try to scale up, but then it scaled back down again. And if you look at the crashing pods, most of the forwarder pods are crashing. This was actually a pretty bad incident.
So what happened here is that for some reason the applications were scaling up rapidly. While the aggregator was adding replicas and waiting for the pods to get ready, the existing replicas were overwhelmed and the pods were crashing, and when your pods are crashing, the HPA is not going to work properly. So the mitigation for us, at around maybe 18:45, was to manually scale up the aggregator, and if you then look at the down chunks, the forwarder starts to recover.
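The manual mitigation is a one-liner; the namespace, deployment name, and replica count here are hypothetical:

    # Bump the aggregator replica count directly, bypassing the HPA delay
    kubectl -n logging scale deployment log-aggregator --replicas=60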
So there are a couple of fixes. One of the fixes is to make the HPA more aggressive, so the aggregator will scale up earlier. But this just alleviates the problem; it's not actually bulletproof, because there is always a time delay from the time the load starts to increase until all the new aggregator pods are ready.
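As a sketch of what "more aggressive" could mean with an autoscaling/v2 HPA; the target, thresholds, and names are illustrative, not the actual production values:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: log-aggregator
      namespace: logging
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: log-aggregator
      minReplicas: 4
      maxReplicas: 100
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              # Lower utilization target -> scale up earlier
              averageUtilization: 50
      behavior:
        scaleUp:
          # React to spikes immediately instead of waiting
          stabilizationWindowSeconds: 0
          policies:
            - type: Percent
              value: 100
              periodSeconds: 15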
So theoretically we can still hit this problem, so we also made a second change. The second change is to onboard the few applications that are both very chatty and have a tendency to scale up rapidly onto the sidecar solution. And what are the key takeaways here? There is actually no perfect solution for everyone, so choose the solution that works best for your environment.
Your solution must also evolve and adapt to the changes in Kubernetes. For example, we added the sidecar solution just about a year ago, and we are also exploring adding persistence to the aggregator, as well as the idea of merging the forwarder and the aggregator.
If you merge the forwarder and the aggregator together, it's expensive on the node, because your node will then have to do all the business logic that the aggregator does, so it takes resources away from the node. So we're still exploring it. Does that answer your question? Yeah.
But I also want to mention that the sidecar, although it is expensive, I think works in certain scenarios, like for us in the second case study. When you have the logging sidecar inside your pod together with your application, if your application scales up, your logging agent also scales up together at the same time. So I think that's the advantage, but it is expensive.
Okay, thanks for attending my session. If you have any further questions, feel free to talk to me afterwards; I'll be around today. Thank you.