From YouTube: Anurag Gupta - Calyptia - Wrangling Data to Multiple Places With Fluent Bit - Percona Live 2021
Description
As more and more users move to #Kubernetes, they may also start using multiple backends and analytics tools.
How do you collect once and send everywhere? In this talk, Anurag covers Fluent Bit, a Cloud Native Computing Foundation (#CNCF) graduated project, and how you can collect once and send to all the backends you want. Additionally, Anurag discusses some of Fluent Bit's advanced capabilities, such as enrichment, parsing, and data reduction, that help users get the most out of their backends.
So today, what we're going to talk about is a little intro to why wrangling data even matters in the first place; how you do that with Fluent Bit; some of the use cases where wrangling data can be very effective for your business, for your enterprise, for your startup, whatever it may be; and then, last but not least, we're going to talk a little bit about advanced use cases and walk through a quick demo.
So first, let's talk about why we even care about wrangling data. What's the challenge, really? What's the problem? First, I think everyone here would agree that data is just growing at a tremendous rate, and whether or not we want to collect all of it, the truth is it's out there and there are insights to be had. So how do we go about collecting all of it? We have to collect from all these different sources and send to all these different destinations.
Every few months we're seeing brand new backends, brand new databases, brand new places where folks are telling us, "Hey, if you send us your data, we'll give you the best insights." So this problem of having so many sources, so many different destinations, and so many different ways to get that data in is really a challenge. We don't want to have a thousand agents on a server. We don't want to have 50,000 agents deployed in Kubernetes just to route data to all of the various destinations.
Now, last but not least is formatting. We have all these different applications and new languages coming up. They all have their different stack traces, their different formats, their different log styles. How do we make sure we're being effective in catering to that, and ensure that when we're looking at it from an insight perspective, doing analytics on top of it, we're able to get the most information out of all these various formats?
To solve all of these challenges, about ten years ago a project was created called Fluentd, and that ecosystem spurred vendor-neutral, open source data collectors. What that means is they're not tied to a single backend, and they're part of a foundation, the Cloud Native Computing Foundation, so they sit right next to Kubernetes, right next to Prometheus.
Fluentd is a graduated project, and Fluent Bit is part of that ecosystem. It's Apache 2 licensed, it's deployed 2.4 million times a day, and it really builds off these challenges: one, data is growing, so how do we make sure we can grab data from all these sources and send to all these destinations so you can get to your analytics faster; and last but not least, do some formatting, do some processing in between.

Now, as for who's using Fluent Bit, we have a ton of cloud providers that are utilizing it in their clouds: AWS, Google Cloud, Microsoft. A lot of folks are utilizing Fluent Bit in the enterprise today, so you can look at this project as having a lot of maturity, a lot of scale, and something that you can build plug-ins for.

So how does it work? Fluent Bit is really based on two big components: one is a plug-in system, and the second is tagging. The plug-in system allows us to say there are many sources, so each of these sources can have different inputs, and there are many outputs.
Many of these destinations, say Elasticsearch, Grafana Loki, Postgres, Splunk, can each have outputs, and as we define routes, we say that one thing is tagged with "a", and then Splunk and Elastic will have matching "a"s assigned to them. These records are then routed to one or more outputs. And this project, Fluent Bit specifically, is written to be very, very performant: it's deployed on embedded systems, and it's built for the container age.
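To make that tag-and-match routing concrete, here is a minimal sketch in Fluent Bit's classic configuration format; the file path, hosts, and token are placeholders, not values from the talk:

```
[INPUT]
    Name  tail
    Path  /var/log/app.log       # placeholder log file
    Tag   a                      # records from this input carry tag 'a'

[OUTPUT]
    Name   es                    # Elasticsearch output
    Match  a                     # receives everything tagged 'a'
    Host   elasticsearch.internal
    Port   9200

[OUTPUT]
    Name          splunk         # Splunk output also matches 'a',
    Match         a              # so the same records go to both backends
    Host          splunk.internal
    Port          8088
    Splunk_Token  YOUR-HEC-TOKEN
```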
Here's a small benchmark that we run as part of our integration test suite, at about 10,000 events per second. The CPU time here, in CPU-seconds, is extremely small, and from a memory perspective there's 3.79 MB and 7.48 MB for 10,000 events per second; that's not too bad. So it's really focused on extremely high performance, really high portability, and making sure that you're able to plug in what you need.
And when we look at the latter side, there are really four places where this comes into play, and we'll walk through each of them individually. The first is reducing costs. I think all of us have gone through this notion that we're collecting more and more data, everything's becoming more and more expensive, but the insights are just not necessarily lining up with that data curve.
So how can we make that more effective and send data to where it needs to go to maximize that insight curve? Then there's enriching data: taking data and adding context, context for analytics; we'll walk through a Kubernetes use case. Then unifying the format: how can you do processing, how can you do things that aren't based on any specific schema? And then decreasing vendor lock-in: a lot of vendors will provide agents, proprietary or open source, that only allow you to send data to one specific backend, and the truth is, backends just keep evolving.
So let's look at the first one: reducing cost. How does this work? There are a couple of ways you can reduce costs. The first, which I think is the most basic: say on the right-hand side we're tailing a log and we're sending it to this $$$ expensive backend. Now, what we can do is add a filter. We can just say: let's remove the noise; remove null values, remove empty values, remove values that might contain debug or info logs. Let's get rid of that.
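As a sketch of that kind of noise filter, Fluent Bit's grep filter can drop records by regex; the tag and key name here are illustrative:

```
[FILTER]
    Name     grep
    Match    app.*               # apply to records tagged app.*
    # Drop any record whose 'log' field mentions debug or info
    Exclude  log (debug|info)
```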
The second way is, of course, multiple backends; we're talking about wrangling data here. There are, you know, cheaper backends that are available. There's file system storage that's available, and this can be something where I might be a financial institution that needs to archive data for a period of seven years.
I might need to take data and just hold it for compliance reasons, from the security side. So what I can do is take that data, remove the noise, and send it over to the expensive backend, but also send a copy to a cheaper backend. I can also use specific pieces in the log to say I want to send error logs to my super fast, expensive backend, and I want to send info logs to my cheap backend, and then build on top of that.
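One way to sketch that error/info split is with the rewrite_tag filter, assuming records carry a "level" key; the key and the backends here are illustrative:

```
[FILTER]
    Name   rewrite_tag
    Match  app.log
    # Re-tag error records as error.app.log; 'false' drops the original copy
    Rule   $level ^error$ error.$TAG false

[OUTPUT]
    Name   es                    # fast, expensive backend gets errors only
    Match  error.*
    Host   elasticsearch.internal
    Port   9200

[OUTPUT]
    Name    s3                   # cheap archive gets the remaining records
    Match   app.log
    bucket  log-archive
    region  us-east-1
```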
One thing that Fluent Bit offers is really smart snapshotting. This comes with our stream processing feature, which we'll talk a little bit about later, but what it allows you to do is say: if I encounter a specific event, don't just send me those 404 events or those error events; send them to me with context. I want to see the path before and I want to see the path after. And that is really awesome when you get into "I want to send snapshots based on certain metric values."
I want to send snapshots based on certain error values. It can really help you reduce noise but keep context, which is something a lot of tools don't allow you to do.

Now, let's talk about enriching data. One of the major places that Fluent Bit is deployed is within Kubernetes, and when you deploy within Kubernetes, there's a lot of context that each log should have: the pod, the namespace, the container. These are pieces of information that are helpful when you go to debug.
So when you're doing this enrichment, we can do things like AWS lookup filters that give you things like, hey, these are the cloud resources, this is the region, this is the AZ. You can use GeoIP lookup, so as those messages stream in, it can tell you inline what country or what city you're in, in case you're subject to certain privacy regulations. And then the last one, of course, which we just talked about, is the Kubernetes filter. Now, these are things that are continually evolving, and you can enrich with custom data.
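A minimal sketch of that Kubernetes enrichment, using the conventional kube.* tag prefix; the paths are the usual defaults, not specifics from the talk:

```
[INPUT]
    Name    tail
    Path    /var/log/containers/*.log
    Parser  cri                  # or 'docker', depending on the runtime
    Tag     kube.*

[FILTER]
    Name       kubernetes
    Match      kube.*
    # Talks to the Kubernetes API and attaches pod name, namespace,
    # container name, labels, and annotations to each record
    Merge_Log  On
```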
We have folks who build plugins that do security lookups for certain IPs, to tell whether or not they're malicious. So the really great part here is that, because this is open source, there's a lot of extensibility around enriching your data.

Now, unifying format. We sort of talked about this in the beginning, but to dive a little deeper: one goal is to take all this unstructured data and give it some structure, so we can run metrics on top of it and perform aggregations.
We can send it to a place where it will be indexed correctly and most optimally. One thing Fluent Bit includes is a list of parsers: Docker and CRI for container-based environments; logfmt, JSON, CSV; things like syslog over TCP, RFC 3164, RFC 5424. All these various RFCs and formats come out of the box with Fluent Bit, making it very, very simple to just connect in the data source, parse it, and have some level of key-value pairs, making that data more useful.
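For illustration, wiring one of those stock parsers to an input looks roughly like this; the path is a placeholder, and syslog-rfc3164 is one of the parsers shipped in the default parsers.conf:

```
[SERVICE]
    Parsers_File  parsers.conf   # ships with Fluent Bit

[INPUT]
    Name    tail
    Path    /var/log/syslog
    Parser  syslog-rfc3164       # turns raw lines into key/value pairs
    Tag     sys
```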
What we did about five or six years ago was say we want to make something very, very lightweight, deployable at the edge, and as folks started deploying Fluent Bit at the edge more and more, what we found is that, because we were using so few resources, we could add a decent amount of power in the middle with SQL stream processing.
So what does this allow you to do? Essentially, within the pipeline that Fluent Bit has, its inputs, its parsers to build that formatting, the filtering to remove or enrich data, the storage layer, and the router, we've added this brand new stream processor, and we use ANSI-compliant SQL here. We're not trying to invent a brand new language; I think there are quite enough query languages out there. So if we can use standard SQL, we can use it to allow for aggregations, and we can build predictions, we can build functions.
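A sketch of what such a task can look like: a streams file (referenced via Streams_File in the [SERVICE] section) holding a tumbling-window aggregation; the names and the cpu_p key are illustrative:

```
[STREAM_TASK]
    Name  cpu_avg_5s
    # Every 5 seconds, emit one record with the average of cpu_p,
    # re-tagged 'cpu.avg' so the router can send it anywhere
    Exec  CREATE STREAM cpu_avg WITH (tag='cpu.avg') AS SELECT AVG(cpu_p) FROM TAG:'cpu.*' WINDOW TUMBLING (5 SECOND);
```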
The great part about this: no database required. It runs in memory and is still very, very performant, so even if you're collecting thousands of logs per second, we can run those aggregations. This can be very beneficial when you're trying to wrangle data to multiple backends, because you might not want to send a thousand metric messages per second; instead, you can aggregate that down to one message per second that just includes all the details, the summary, the average. The other big part about this is that it's schemaless, so we don't require any format.
So, some use cases that come with stream processing, and we're going to demo a bit of this in a little bit. The first is aggregating and routing data effectively: again, summarize those events before sending; you can use max, min, average, sum. Then there's sending only the events that matter; we talked about this a little bit, where you can send data with context using the snapshot parameter. And then time series predictions: Fluent Bit does offer some time series prediction capabilities, where you can say, let me predict out to the next 10 seconds and then alert faster.
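The prediction piece is exposed as a stream-processor function; a hedged sketch, with the window size and key chosen for illustration:

```
[STREAM_TASK]
    Name  cpu_forecast
    # Forecast the value of cpu_p 10 seconds ahead over a 10-second window
    Exec  CREATE STREAM cpu_pred WITH (tag='cpu.pred') AS SELECT TIMESERIES_FORECAST(cpu_p, 10) FROM TAG:'cpu.*' WINDOW TUMBLING (10 SECOND);
```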
So if we find that CPU or memory, or some metric derived from a log, is triggering at some level, we can alert directly to an output like Slack; we can send a message to Splunk; we can send a message to Elasticsearch, or Loki, or Postgres.
We can do all sorts of really fun things when we're doing a little bit of this analysis at the edge layer from the stream processing side. Now, this is not meant to replace your backend analytics by any means, but instead to be an augmentation, where you can say: instead of having to run these alerts and cause all this havoc and processing cost, or trying to do schema-on-read or schema-on-write type operations in my pipelines, why don't I bring some of that logic to a distributed layer?
So instead of centralizing your backend analytics, why don't we distribute that across thousands and thousands of edges: your pods, your nodes, your Kubernetes nodes, things that are ephemeral, that are growing, that are distributed, and add a little bit of load to each of them? It's the same benefit we see from the cloud computing side of going distributed; we're trying to bring that to the analytics space here. Again, not a replacement, but something that can really augment your entire pipeline.
Now, something I'm very excited about is some new features that are coming out, and this is really a place where we've been investing to make sure this process is a little easier and can benefit the entire Fluent Bit community. The first is dynamic stream processing: being able to dynamically switch my endpoint from one backend to another.
Next, you can imagine using perhaps a billing alert on a certain cost threshold to say: switch my backend from a very expensive backend to very cheap block storage. And this is something that's configurable via an HTTP endpoint: you can view, list, and create stream processor tasks. For example, on the right, we have what a stream processor task in Fluent Bit looks like, and then there's the live query perspective.
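Roughly, the interaction looks like the sketch below. Fluent Bit's built-in HTTP server is a real [SERVICE] option; the task-management paths themselves are illustrative of the demo, not a documented, stable API:

```
# Enable the HTTP server in the main config:
#   [SERVICE]
#       HTTP_Server  On
#       HTTP_Listen  0.0.0.0
#       HTTP_Port    2020

# List stream processor tasks (illustrative path)
curl -s http://fluent-bit-host:2020/api/v1/stream_processor/tasks

# Create a new task by POSTing its SQL (illustrative path and payload)
curl -s -X POST http://fluent-bit-host:2020/api/v1/stream_processor/tasks \
     -d "CREATE STREAM errors WITH (tag='err') AS SELECT * FROM TAG:'app.*';"
```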
We can actually flush that data directly to the lens of what we're operating on, so we're able to very quickly go ahead and see what live data looks like and how we can most effectively bring it to a place that's going to be really useful for us. So with that, let's go ahead and switch into a quick demo. We're going to look at live stream processing and creating some dynamic stream processing.
I have three windows open. The second window I have here is really just another window where I can run logger, so we can create some test messages; you can see one just popped up here. And the last one is my local laptop. This server is running in the cloud, so I'm actually going to be performing all the stream processing and live query jobs remotely, not on the same machine.
The second piece is the SQL statement. I create a snapshot, tail_snapshot, and I'm doing this with a window of five seconds, with the tag tail-snapshot, which will be used for routing, in case I have another event that only matches tail-snapshot. And I'm going to select every single piece of the record that has the initial tag of tail. The other piece that I've added on here is that I'm limiting this to five records, because we're doing a live query.
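A sketch of the statement being described, following the stream processor's CREATE SNAPSHOT / FLUSH SNAPSHOT pattern; the property names are approximate, per the Fluent Bit stream-processing docs:

```
-- Keep a rolling 5-second buffer of everything tagged 'tail'
CREATE SNAPSHOT tail_snapshot WITH (seconds = 5) AS SELECT * FROM TAG:'tail';

-- When flushed, emit the buffered records under tag 'tail-snapshot'
-- so a matching output can route them
FLUSH SNAPSHOT tail_snapshot WITH (tag = 'tail-snapshot');
```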
Okay, looks like we got a 201 Created. Let's go ahead and list that again, so run the same command again, and here we can see that our stream processor task is now listed. So we have that available to us. Now let's go ahead and run the live query. So here's the live query; we're going to go ahead and grab that.
We're going to go ahead and again switch our endpoint here to the perf-test one. Now, if I go back really quickly, there might have been a couple of messages since my "hi". But let's look just in case; my guess is this will be empty. Okay, so here we don't really have anything; it looks like it is still pretty empty.
Now, if I were to run this again, because it's flushed the context, I would actually just get an empty message. So that's what shows up right here.
Once we are below a threshold, or if we're above a threshold, it keeps sending all the most relevant data, and then we can live query that data. Now, I'm querying this just with my curl command, but I think, as everyone here can attest, you can imagine the power of being able to query this via remote APIs, via automation, and really plug this into your entire data pipeline.
So with that, I'm going to go ahead and stop sharing. For those who are interested in additional stream processing functionality or hands-on stream processing activities, I have attached in the slide deck a GitHub repo that includes four examples. Those four examples walk through very basic stream processing; some advanced stream processing with some GROUP BYs, some max, min, and average; as well as a time series prediction. So it serves as a really nice basis.
So anyway, thanks. Thank you all for giving me the opportunity to speak. I do hope that you'll be able to join us in the Fluent community on Slack at slack.fluentd.org, participate in any of our community events, and give us feedback. We're always looking to improve the projects and build a better ecosystem for our users. So with that, thank you so much and have a good one.