Description
This talk will cover setting up a multi-tenant observability pipeline for metrics and traces using entirely open-source projects such as OpenTelemetry, Prometheus, Jaeger, Kafka, and Cassandra.
So today I'm going to talk about something I guess a lot of people do in their respective organizations, which is handling observability.

But today I'm going to be talking about a very specific thing that I helped build for cygnus.io, one of my clients, using a fairly new entrant to the observability block since around 2019, which is OpenTelemetry. It's also good to note that cygnus.io recently got selected for the YC winter batch, and I hope most of my work will make it through.
So, just before I get into the construction of this observability pipeline, let me describe what I personally understand by observability. This is my mind map of how we consume observability services in a cloud-native environment. Of course, we all know about the three pillars, which are logs, metrics and traces, but the way I visualize it, they are also interrelated whenever a particular event happens in any application.

We do not consume events from only a single one of these three pillars; we need information from all of them, and they intertwine in various ways for folks like us, DevOps engineers or infrastructure engineers, to make sense of what's going on inside the system. That is the point of the term observability: a system which is essentially a black box is exposing data, and with that data, with those sensors, we can make sense of what's going on inside the system.
What's the state of the system as seen from its external metrics? So we have logs, metrics and traces, but our objective is to get to the point where we can start doing RCAs, detecting SLI and SLO violations, and doing anomaly detection. What we see in our daily work is that having these different event sources isolated in different dashboards, in different data visualizations, doesn't really work. What we usually do is have two or three different screens open.
We have multi-screen monitor setups: in one place we are visualizing logs in Kibana, we are using Grafana to visualize, let's say, Prometheus metrics, and we have, say, a Jaeger dashboard to visualize the trace events. The actual correlation of all these events, to understand what is going on with the system during, let's say, a downtime or a particular outage, happens in our minds, because most of these tools have until now been dealing with these three pillars in a very, very isolated manner.

There are explicit tools that handle only one of these verticals, and so we have to mentally map all this data before we can make any sense of it. This is where OpenTelemetry steps in. This is a slide I sometimes keep for an audience that is not familiar with distributed tracing, but I spoke to Ashish today, and I might skip it because I believe most of you folks are familiar with distributed tracing, what it is and how it works.

So I'm going to skip over this. Cool.
So the reason we chose OpenTelemetry as the core pillar of this product we were building is, first, that the three pillars of observability were under one single roof for almost the first time. Before this, every project tried dealing with individual pillars. What OpenTelemetry also brought to the table was a vendor-neutral data format.

As you can see on the right-hand side of the slide, the current state of the monitoring landscape shows quite some fragmentation, or diversity, of vendors. What happened with each of these vendors and their earlier non-interoperable data formats was that once you set one thing up, you were locked into it, and you couldn't really plug and play the different formats.

You couldn't interoperate with the data; you couldn't, let's say, take New Relic data and use it to push into Jaeger. You couldn't convert it.

So this is where OpenTelemetry steps in, and many of you might know this: they created an interoperable data format called OTLP, and each of these vendors, very interestingly, stepped up and created data format converters to OTLP, so that all of this data becomes interoperable. That created a huge opportunity: suddenly you could basically take Zipkin data and convert it to Jaeger. You could take Prometheus metrics and convert them to a different type of metrics.
Let's say StatsD, for example. That created a lot of independence in our daily practices, in how we handle observability data. Suddenly we didn't have to build things in a very, very opinionated way. We could run something on the cluster but visualize it in an entirely different data format. We were no longer captive to one single particular product.

So this is one of the reasons we chose to go with OpenTelemetry, and OpenTelemetry is not a very old project; it started somewhere in 2019, and we'll come to that. The other part that we found very interesting was that there's a particular component within OpenTelemetry which lets you do more than data format conversions from one vendor type to another, or from vendor formats to open data formats.

It let us process that custom data with very minimal Golang code: I could consume certain data, add to it, enrich it and then pass it on in the OTLP format, and the exporters would automatically interpret it into the target format. We'll see how OpenTelemetry does this, but this is one of the key things that made us choose OpenTelemetry over other open formats. So yeah, we'll come to the lineage of OpenTelemetry, and this is sort of a personal thing.
Every time I work with a new ecosystem, or start studying something new, I try to go back to the point in time when the standards were written and see how the whole thing started, because every single piece of tech that we use daily is not really novel. They all have a long lineage of history that stays hidden, because not everyone digs through it, but they almost always have a decades-spanning history before they appear in front of us as a suddenly new thing.

The same goes for OpenTelemetry. If you look at this particular chart on our screen right now, it started way back in 2002, 2003 and 2007, almost two decades back.
So everybody talks about the Google Dapper paper as the source, but what people don't talk about is that the Dapper paper was itself an assimilation of three prior papers: the IEEE Pinpoint paper, the 2003 Magpie paper and the 2007 X-Trace paper. While I cannot claim to have read all these papers completely, I have glanced through them and compared them to the Dapper paper. Google does borrow a lot of concepts.

It reconciles a lot of competing concepts between these three papers to arrive at what we have come to know as the Google Dapper paper, which is central to the observability ecosystem. So what happens is that Google publishes the Dapper paper in 2010.

It releases it to the public and, as usual with any Google white paper (think of the ones that inspired HDFS), another company picks it up and tries to implement it. In our case this is Twitter. Twitter implements the Dapper paper into something called Zipkin internally, and at this time Twitter was using the Finagle system in their infra.

They used Finagle, took Cassandra as the back-end storage, and created this implementation of the Dapper paper called Zipkin, and in 2012 they released this particular implementation to the public as an open-source project and called it OpenZipkin. Sometime down the line, Uber starts doing the same thing. They already had an internal tracing system called Merckx, a bunch of you might know about this, but they adopted OpenZipkin's principles.
This is the first year or two when the CNCF was just coming up, the post-Kubernetes era. This is when the OpenZipkin and Jaeger ideas get merged into this thing called OpenTracing, and this was led by the people who were leading those particular projects: we had Ben Sigelman from the Google Dapper paper.

We had, I guess, Adrian Cole, who was at that point leading Zipkin development, and we had Yuri Shkuro from Jaeger. These three core maintainers of three very different projects decided that we need an interoperable, central format for how to do traces, and they created this thing called OpenTracing and donated it to the CNCF. It's probably one of the earliest projects to get donated to the CNCF. Parallel to this, Google, after releasing the Dapper paper, was also working on another internal product for observability called Census, and they released it to the public in 2018 and called it OpenCensus. For quite some time, OpenTracing and OpenCensus remained competing open standards, till 2019, when Ben Sigelman again steps up and says that both of these projects have very, very common goals in mind: we are doing a lot of things similarly, so why don't we merge this into something that's going to benefit the community? And unlike in other segments, where everyone was forking off and doing their own implementations, we saw a merger of two very, very popular projects into something called OpenTelemetry, as late as, I think, November or December 2019.

Don't quote me on that, but late 2019. It also borrows from the 2018 W3C standard headers for tracing, which are the Trace Context and Correlation Context headers. So it borrows certain concepts from the W3C specification, uses OpenTracing and OpenCensus as core ideas, and creates a new project called OpenTelemetry.
Development starts and, if you folks look here — and I guess our organizer is one of the contributors to OpenTelemetry — we have some of the biggest names in the industry, and not only the industry but the observability ecosystem, like New Relic, Datadog, a bunch of companies working together to make this interoperable data pipeline and format. That also contributed to why we wanted to step up and use this to build a good product. Cool, so we'll now go slightly into the architecture of this, and certain folks here might be familiar with it.
But I'll still try and explain it as best I can. The way OpenTelemetry is structured is through three core items: receiver, processor and exporter.

A receiver is basically a component which receives data in either a proprietary or an open data format. So if you have Jaeger, there is a Jaeger receiver; if you have Zipkin data, there is a Zipkin receiver; and it also has the open format called OTLP.

The project has SDKs in most of the popular languages out there, so you can integrate the OTLP SDK and create traces and metrics using it. You instrument your code, and the traces and metrics start flowing. For Ruby and Java, as the picture states, they have auto-instrumentation libraries, but for the others the instrumentation has to be done by the developer. So we basically have one receiver each for each of these different languages and data formats.

Then there is batching, where you can say: okay, batch up this many events and then send them on. Or you can do a queued retry, where it will send but wait for an acknowledgement, and if it doesn't receive an acknowledgement, it will re-queue the events and send them again later. So this is sort of the state manager of this whole pipeline.
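As a rough sketch, this is what those two processors look like in a collector config of that era; the specific numbers here are illustrative values for the example, not the ones from our deployment:

```yaml
processors:
  batch:
    # flush a batch once it reaches this size, or after the timeout
    send_batch_size: 8192
    timeout: 5s
  queued_retry:
    # hold events in an in-memory queue and retry failed exports
    num_workers: 4
    queue_size: 5000
    retry_on_failure: true
```

(In later collector versions the standalone queued_retry processor was folded into per-exporter queue and retry settings, but at the time it was a regular processor in the pipeline.)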
The processor holds the events for a certain time and processes them based on certain queuing or batching logic. But in addition to that, the processor also allows you to write custom processing logic. This could be proprietary or open: you can process the data that the receivers have provided to you and enrich it in certain ways. One of our ideas was: hey, we have trace data, so why don't we start deriving certain inherent metrics from the trace data itself? Why don't we derive latency and requests-per-second (QPS) metrics directly from the traces? We could do that with a processor implementation. And finally, once this data processing and queuing is done, it's all sent to the exporter component.
The exporter implementation is very specific to the target where we are exporting the data. So you could instrument your code using OTLP or using Jaeger, but if you plug this whole pipeline into, let's say, an OpenCensus exporter, your final output format is going to be OpenCensus. And this is true even for vendor formats, because there is a contrib repo where the vendors are actually creating these receivers and exporters, per vendor, to allow for data conversion between the different formats.
So this is again a brief overview of the internal component layout, and I would say the repo layout too. If you can see, the green components are on the core repo, and OpenTelemetry has a separate contrib repo where the vendors contribute support for their formats.

They write either receivers or exporters for their particular proprietary formats. So most of the vendor stuff, as you can see, is in the contrib repo, but most of the open-source implementations are in the core repo. For traces we have standard receivers like OpenCensus, Jaeger, Zipkin and OTLP, plus proprietary ones like SignalFx. For metrics we again have the standard ones — OTLP, host metrics and Prometheus, of course, one of the core pillars of our observability and metrics community — and then proprietary implementations like Carbon.

There is also an open Kubernetes metrics implementation which is not reliant on Prometheus; it directly grabs metrics from the Kubernetes cluster using the API server APIs. And then we have the different processors. We have the attributes processor, where you can add or remove attributes from the trace data or the event data you have received. You can batch, as I talked about, you can filter data out, you can do queued retry, and you can do sampling. We'll have a brief segment about sampling, whether it's tail sampling or head sampling, and the various types of it. Tail sampling is hard, by the way; head sampling is fairly easy to do, but while OpenTelemetry does provide implementations of tail sampling, it's still somewhat limited.

So these are basically some of the processors.
I don't have enough time to talk about each of them in detail, but some of them are fairly interesting. Sampling is probably one of the most important processors, and the attributes processor, and the filter too. Again, as you see on the exporter layer, we have exporters for each of the different formats. We have the standard OTLP, we have Jaeger and Zipkin, which are both open formats, and then we have a bunch of proprietary exporter implementations for proprietary data sinks.
What if you want to consume data in Jaeger, or instrument your code with Jaeger, but you want to send it to AWS X-Ray? That pipeline can be created through this particular structure. If you have your receiver as, let's say, Jaeger, but your exporter as AWS X-Ray, you can visualize your traces in AWS X-Ray. This is what the different exporters enable you to do. The same goes for metrics: OTLP and Prometheus, and then Carbon, SignalFx, Stackdriver, etc. Cool.
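A minimal sketch of that Jaeger-in, X-Ray-out idea; the endpoint and region are placeholder values, and the awsxray exporter lives in the contrib distribution:

```yaml
receivers:
  jaeger:
    protocols:
      # accept spans on the standard Jaeger collector HTTP port
      thrift_http:
        endpoint: 0.0.0.0:14268

exporters:
  # contrib-only exporter; the region is a placeholder
  awsxray:
    region: us-east-1

service:
  pipelines:
    traces:
      receivers: [jaeger]
      exporters: [awsxray]
```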
So one important point to note here is that the binaries for the contrib repo and the core repo are not the same. If you want to use the features, exporters and receivers marked here in red, you'll have to use the Docker image or the compiled code from the contrib repo, which is a superset of the core repo: it has the core components plus the extra components from the vendors.

But if you end up using the core repo's Docker image in your deployment, you will miss out on the components marked in red; you don't get any of the contrib components there. This was a bit of a tricky thing for me when I started building this, because I was expecting more of a plug-in system. But finally we saw that it had to be recompiled, or different Docker images need to be used, if you want different features. Cool. So now we come to our first product objective: how do we consume the client's data?
Thankfully, OpenTelemetry to the rescue. To build our internal demo — and at this point we did not have clients, we were just creating our product from scratch — we needed to emulate client data, in various formats, because we wanted data interoperability and variety as one of our core principles. So we created a load-generator application out of various existing load generators.

We had OpenCensus traces and metrics from a simple one-file Golang app; then we had the Omnition synthetic load generator for both Jaeger and Zipkin traces; and we used FreshTracks' fake-Prometheus-metrics generator called Avalanche, which basically generates fake Prometheus metrics that emulate an application. We combined all of this into a single deployment and said: hey, this is our load generator. We can turn certain dials to scale it up or down and test our internal implementation out.
So this is our load generator, and then we had to deploy the OpenTelemetry agent on the client side. The way we did it is that we had two separate clusters deployed: one emulating a client cluster, and our platform cluster separately. So we deployed the OTel agent with receivers mapped to the event emitters that we had configured.

We had OpenCensus, Jaeger and Zipkin receivers, and then we had the Prometheus receiver, which works in a different way. All the trace receivers directly receive data from the actual instrumented code, but Prometheus scrapes data. It works the way normal Prometheus works: you provide the Prometheus receiver a scrape config and a service endpoint, and it will go and scrape that endpoint and get the data out.
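A minimal sketch of that receiver block, with a placeholder job name and target (the scrape config from our setup appears again in the full pipeline YAML later):

```yaml
receivers:
  prometheus:
    config:
      # a standard Prometheus scrape config, embedded in the receiver
      scrape_configs:
        - job_name: load-generator
          scrape_interval: 15s
          static_configs:
            - targets: ['load-generator:9001']
```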
And then we fan this data in to the batch and queued-retry processors and finally out again. Initially we were not using the OTLP receiver and exporter, we were using OpenCensus, so this image shows that, but eventually we moved to the OTLP exporter.

So we fanned this into the OTLP exporter, and that is where our client-side implementation ends. And this is so awesome, because we did not have to write even a single line of code, yet suddenly we were able to consume at least four to five different types of data formats, representing different clients that we might have in the future, using a simple open-source component deployment. We consumed this on the platform cluster using the collector's headless service, and this is where the fan-out started.
So we had an OpenCensus receiver initially, but later on we moved to OTLP. We had the receiver, which again routed through the batch and queued-retry processors, and finally to three different exporters. We consumed from around four formats, and we were also experimenting to see whether we could fan that out, and whether the data would be received correctly in all the different formats. This experiment turned out to be true: when we fanned out, our Jaeger exporter had data not only from Jaeger, but also the generated data from Zipkin and OpenCensus. We had the combined data from all three different sources in a single Jaeger format, and the same went for Zipkin. We also saw that Prometheus stayed linear, because we were scraping with Prometheus and also exporting to Prometheus, so we got the data we were expecting.
So this is stage one, how we started building our prototype, and this slide shows you the YAML view of the pipeline; the previous one was the architecture view. If you see, this is how we write it: we say, hey, these are the receivers, a map of receivers. We have OpenCensus and Zipkin; these are the endpoints for OpenCensus, and if I don't mention them, it goes to the default endpoints. And for Prometheus I provide a scrape config.

We give it the job name: okay, go scrape the load generator on port 9001 and scrape out the metrics. All of this goes to the processor definitions and finally to the exporter definitions. And finally, we organize this into the pipeline schema, where we say: okay, these are my trace pipelines, and these are my metrics pipelines. The trace pipeline goes like this: the receivers are OpenCensus and Zipkin, the processors are batch and queued retry, and finally my exporters are, let's say, OpenCensus and logging. For metrics, the receivers are OpenCensus and Prometheus, and the exporters are logging and OpenCensus.
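Put together, the config on that slide looks roughly like this; the endpoints are placeholders I'm filling in for the sketch, not our actual addresses:

```yaml
receivers:
  opencensus:            # default endpoint unless overridden
  zipkin:
    endpoint: 0.0.0.0:9411
  prometheus:
    config:
      scrape_configs:
        - job_name: load-generator
          static_configs:
            - targets: ['load-generator:9001']

processors:
  batch:
  queued_retry:

exporters:
  opencensus:
    endpoint: otel-collector.platform.svc:55678   # placeholder target
  logging:
    loglevel: debug

service:
  pipelines:
    traces:
      receivers: [opencensus, zipkin]
      processors: [batch, queued_retry]
      exporters: [opencensus, logging]
    metrics:
      receivers: [opencensus, prometheus]
      processors: [batch, queued_retry]
      exporters: [opencensus, logging]
```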
The reason I have the logging exporter in both cases is that we were in debug mode and, of course, I needed to ensure that everything being received was also finally being consumed in the final data sinks. But it's probably ideal not to have the logging exporter enabled in any production deployment; it's only good for debugging, because otherwise you're basically dumping a client's data, which could be sensitive, into your stdout. So that's not recommended, but we were just building our MVP, so yeah.

Now we come to one of the most important processors, which is tail sampling, which we talked about previously. The way you do tail sampling with the given processor is that you define a policy.
You say: okay, if a particular attribute type with a certain key falls within a certain value range, then sample, or do not sample. Basically, what this particular config keys on is the HTTP status code, because we do not want to have a lot of 200 OK data in our traces; we are not worried about success cases, we are more worried about failure cases.

So that's where the first policy goes: it matches on the status code attribute's min and max values. This can also be inverted, and we can explicitly say: hey, if the error matches, or if the status code matches, let's say, 5xx or 4xx, certain HTTP status codes, we want to sample those traces individually. That configuration can be written down under the policies.
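A sketch of the inverted, errors-only variant using the tail_sampling processor's numeric_attribute policy; the wait time and trace count are illustrative values:

```yaml
processors:
  tail_sampling:
    # buffer traces in memory before making a sampling decision
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: keep-errors
        type: numeric_attribute
        numeric_attribute:
          key: http.status_code
          min_value: 400    # keep 4xx and 5xx traces
          max_value: 599
```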
And one thing about tail sampling is that it does not work in a clustered mode of OpenTelemetry, given that tail sampling happens in memory after you have received a significant chunk of traces inside one particular deployment. To do tail sampling across, let's say, different replicas of a single OpenTelemetry deployment, all those traces, all those particular events, have to be shared across replicas. That means we need some sort of consensus mechanism.

A friend at Grafana Labs had actually built one of these forks, and it's still under development. Grafana Labs also published a blog post, way back when I was building this, saying: okay, this is how we are doing tail sampling with consensus on top of OpenTelemetry. But it had forked off pretty far from the core branch, and the merge back was not possible at that point in time. I think I should connect back with my friend at Grafana to discuss what they're planning; I'm pretty sure they are still working on it, even though it was paused for a bit. So this is how you define how to sample particular traces.
So now we have solved one part of the problem. The client had data, in various formats, and we have converted it into our desirable data formats. One part of the problem is solved: we are not worried about how the client is instrumenting their code. They could instrument with anything; our solution is to give them a simple YAML which they can deploy to their cluster with the correct receiver format configured, and we will receive the data as we intend to, and then we can manipulate it.

But now we come to the second problem: we are, let's say, targeting multiple clients. That means multi-tenancy on our cluster, and the question of how we create long-term data storage which is resilient and which will thrive in a very high-scale environment. This is where we came up with a plan: we'll again use completely open-source components to build this data pipeline out.
So we needed data sinks: a per-tenant sink for metrics and one for traces. Which means we would deploy a particular Prometheus instance, in the form of a Prometheus custom resource, in the tenant's namespace, and the same goes for Jaeger: we'd just deploy a Jaeger custom resource in streaming mode, with Kafka, so that data isn't lost. We did not want to lose trace data if, let's say, one of our deployments goes down. And we planned on integrating the long-term storages which were already supported by both Prometheus and Jaeger at that point of time (and still are): we planned to use Cortex with a Cassandra back end for the metrics data, and we'd again use a Cassandra back end with Jaeger.
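A sketch of what such a per-tenant Jaeger custom resource could look like with the Jaeger Operator; the names, broker address and keyspace are placeholders for the example:

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: tenant-01-jaeger
  namespace: tenant-01
spec:
  strategy: streaming        # collector -> Kafka -> ingester -> storage
  collector:
    options:
      kafka:
        producer:
          topic: jaeger-spans-tenant-01     # placeholder topic
          brokers: kafka-bootstrap:9092     # placeholder brokers
  storage:
    type: cassandra
    options:
      cassandra:
        servers: cassandra.storage.svc      # placeholder address
        keyspace: tenant_01_traces          # placeholder keyspace
```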
So this arrived at a very nice solution where all of this data, for both metrics and traces, was coming to a single Cassandra data sink. We basically had, per tenant, a tenant-01 traces keyspace, and for the same tenant a tenant-01 metrics keyspace; each tenant's data was isolated in their own keyspaces, and we could scale it out very, very fast. So this plan worked.

What we needed to build this out was a couple of existing open-source Kubernetes operators — the Prometheus Operator, the Jaeger Operator, and a Kafka operator for the per-tenant Kafka topics we were creating — plus a deployment of Cassandra itself. Now, this is where we got stuck a little bit, because there were multiple offerings for running Cassandra on Kubernetes. There is Scylla, which is a very popular project, but it's not truly Cassandra; it's Cassandra-compatible.

It adheres to the APIs, but it's not purely the same implementation. We initially started with Scylla; I was pretty excited about using Scylla for this purpose, because it had an operator. That means we could easily create one custom resource per tenant namespace; we didn't have to have a central thing there. But this plan did not work out for us: Scylla's current methodologies still leave certain scalability things to be desired, and it did not work for us, so we fell back to using a central Cassandra cluster with Bitnami's Cassandra. And we of course had to deploy Cortex. Now, Cortex itself is a pretty big architecture to deploy and scale.
But given this talk covers a lot of other things, I'm not going to go into that much; there is a lot of documentation out there on what Cortex is.

I was discussing with our organizer just before we started that Berlin is probably the observability capital of the world right now, with all the maintainers there, at Red Hat and in other organizations, and you folks are familiar with how Cortex and the other pieces work, so I'm not going into that.

So what we needed to do was basically use these components and build a multi-tenant, isolated architecture where one tenant's data can in no way be consumed by another tenant. The data access had to be very, very controlled, and to do that we again followed standard practices. We keep the data consumer, which is our OTel collector, per tenant.
We created one OTel collector per tenant namespace, one Prometheus per namespace, one Jaeger custom resource per namespace, and one Kafka topic per namespace, and we secured this boundary with network policies and with RBAC. Since we had multiple Prometheuses running, we did not rely on a ClusterRole; each Prometheus and each Jaeger had their own Role and RoleBinding.

So, instead of trusting a single service account token or a single role, each had their own service account, trusting their within-namespace RBAC implementation, and we also used network policies to ensure that calls could only be made on certain ports within the namespace; in no way could my Prometheus go out and scrape another tenant's namespaces. We did this through both network policies and proper label-based scraping, and additionally we secured the deployments themselves with pod security policies.
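A sketch of the kind of NetworkPolicy that enforces that boundary — restricting ingress in the tenant namespace to pods from the same namespace; the namespace name and port are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-isolation
  namespace: tenant-01          # placeholder tenant namespace
spec:
  podSelector: {}               # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}       # only pods within this same namespace
      ports:
        - protocol: TCP
          port: 9090            # e.g. the tenant's Prometheus port
```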
On top of that, we ensured resource quotas were correct — pretty standard practices, nothing fancy. And then, finally, what happens is the data comes into the OpenTelemetry collector and goes on to Cortex; Cortex dumps the data into the metrics keyspace, and Jaeger dumps the data into the traces keyspace.

And finally we have the central data sink across all our tenants, two keyspaces per tenant, where we were getting all this data, and we could finally build a query layer on top, which is, of course, going to be our proprietary query layer. But our main core pipeline was getting, if you want to call it in big data terms, the three-Vs data: variety, volume and, I guess, velocity.

So we could consume data with variety, volume and velocity, and have a very stable pipeline.
Using these particular components, we dump it all to a single data sink which is queryable using standard SDKs, and we could build an application on top of it. So this slide sums up what I had gone through in my previous slides; it sums up the whole architecture and what is happening, and it's a combination of my previous slides.
We had the load-generator application and, if you follow the colored lines, you'll be able to identify how the trace data is flowing and how the metric data is flowing. The blue lines are trace data. You can see there are three sources representing our clients: the OpenCensus generator, the Jaeger emitter and the Zipkin emitter. All of this goes through the OTel agent, then finally the OTel collector, and we route it through the Jaeger exporter.

That goes to a Jaeger instance which is running in streaming mode, uses a Kafka topic for resiliency, and dumps it to Cassandra in a large distributed storage format, so that we have the data in long-term storage. Prometheus metrics also come in, through the Prometheus receiver, go out through the Prometheus exporter, and finally get dumped into the Prometheus custom resource, which remote-writes to Cortex, and Cortex again uses Cassandra as the main back-end storage. So we have both metrics and trace data in a single place.
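For the remote-write leg, a per-tenant Prometheus custom resource could look roughly like this; the Cortex URL is a placeholder, and the tenant header assumes a Prometheus version that supports remote-write headers (Cortex identifies the tenant through X-Scope-OrgID):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: tenant-01
  namespace: tenant-01            # placeholder tenant namespace
spec:
  replicas: 1
  serviceMonitorSelector: {}      # defaults to within-namespace targets
  remoteWrite:
    - url: http://cortex-distributor.cortex.svc:8080/api/prom/push  # placeholder
      headers:
        X-Scope-OrgID: tenant-01  # Cortex tenant identifier
```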
So this is two out of the three pillars of observability inside a single application, visualizable through the API layer and dashboarding in a single combined view. When we were building this — I think in the last week of my engagement with my client cygnus.io — the log integration was also added to OpenTelemetry, but unfortunately I haven't had the chance to look into that. You could build it out in the same way, though.

You could take the logging data in the same way and finally dump it into a central data store. And then we have a proper pipeline set up, on top of which you can build a unified dashboard with all three pillars of observability there, and you can build correlations on top. So the problem of multiple dashboards — frantically looking around to figure out what is going on, looking at trace data, seeing whether spikes are happening on the metrics dashboard, digging through logs in Kibana and figuring out which particular application is misbehaving, whether it's a particular pod or not — all of that frantic searching becomes much more peaceful, and now we have consolidated data in one single place. Cool, so this is the last slide.
I have a small demo prepared where I demonstrate whatever I talked about. This is a pre-recorded demo and, as Stefan said earlier in the talk, we all learned during 2020 that doing live demos during online talks is risky; it's probably the best idea to have a recorded demo. But before I go to that, I will ask you to scan this particular QR code; it will take you to the link mentioned above, where the base code is dumped.

Apologies that the video is a bit choppy at the very beginning. I was using an inherent Linux tool to record this, instead of a proprietary recorder, and the video turned out choppy; we lost pixels on the way. Cool, so let's start there. This is what we have: a config map, and we're just looking at the load generator.
We have a deployment, we have a config map, we have a Golang application, and we have a service to expose it. And finally we're looking at the actual Golang application. All of this, by the way, is available in the repo that I mentioned earlier, so you folks can go ahead and look at how the whole thing is organized. What you can see from this particular Golang code is that we are writing out how to generate traces, which traces to generate, and certain metrics as well.
Cool, so if you look at the previous frame, under the ctx tag, a new background context: this is where we are creating the trace, and on top of that we are creating the OpenCensus metrics. So the name, description, measure, aggregation and tag keys — the multiple maps that you see within the view, from around line number 6 to 35 — that is the OpenCensus metrics part.

This is just a small thing to generate fake traces and fake metrics from a single small application. So again, repeating: line number 6 to line number 35 is metrics, and then from line 41 downwards is how I'm generating the traces for OpenCensus using the OpenCensus SDK. And I have no idea why I am traversing the code backwards, from bottom to top; very unusual. So anyway.
So, as you can see, we are actually configuring the OTel agent endpoint in this particular application, and we are configuring it so that we can send both trace and metrics data to this particular endpoint using the OpenCensus SDK.

Cool, now we come to the rest of the deployment. As you can see, we have the Golang application there already; we are just mounting it inside a container and using a go run command to run it. We have the Prometheus Avalanche generator, which generates metrics; I used small numbers for this demo.
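For reference, a sketch of what that Avalanche container spec might look like inside the load-generator deployment; the image tag and flag values are placeholders matching the numbers I mention next:

```yaml
containers:
  - name: avalanche
    image: quay.io/freshtracks.io/avalanche:latest   # placeholder tag
    args:
      - --metric-count=1     # one metric
      - --series-count=50    # with 50 series
      - --port=9001          # exposed for the Prometheus receiver to scrape
```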
That's because I did not want my deployment inadvertently crashing due to high stress, but the whole setup does hold up once scaled in the correct ways. So here I have only a metric count of one — a single metric with 50 series — being exposed out of this particular load generator, and then we have, of course, the synthetic load generator, once running with Jaeger and once running with Zipkin. In both cases, if you see, we configure the OTel agent's receiver port as the target from our load-generator application.
In the case of Jaeger, we target port 14268, which is one of the Jaeger collector ports, and the same goes for Zipkin: we send the data to the Zipkin port, which is 9411, in the Zipkin format. So the receiver itself acts, or masquerades, as if it's a Zipkin instance. The application is completely unaware whether it's sending to the OTel agent or an actual Zipkin instance; to the application it's the same, because the APIs are exactly the same.
Cool, and of course we just exposed the load generator with a service. And this, which was a big part of my slides, is how the OpenTelemetry agent is configured. As you can see, we had three data sources — OpenCensus, Zipkin and Jaeger — and we have receivers for those; we have Prometheus with the scraping config; and exporters, which again point towards our targets.

Again, for this demo I was using OpenCensus, but the ideal way to do it is to use OTLP, which is the internal format. Try not to use OpenCensus; fan in to OTLP and fan out from OTLP. This was at the beginning, when I was still discovering the ecosystem, so the video shows the OpenCensus version. And this is finally the OpenTelemetry agent deployment, which I think you can find in the repo.
So let's actually move on, because just browsing through YAMLs in a demo is not going to be interesting. What we have is the OTel agent deployed and, later on, this is the logging part which I was talking about: the logging exporter basically lets me verify that all the data from the load generator is actually coming into the OTel agent, and I can verify that.

Yes, all the metrics and traces, all of that, is actually coming through, and it helps when you are initially setting stuff up; once it's done, you can just forget about it. The same goes for the OpenTelemetry collector. I am not going through the config again, because the slides covered that, but as I mentioned regarding tail sampling: this demo has tail sampling enabled, and we will have a small look into that.
Cool, so this is our final result, where we actually dump our data into the data sink. We expose the service and open up the Jaeger service. This is our final Jaeger, where our exporters from the OpenTelemetry collector are sending data to, and this is technically our platform cluster, where we have received customer data already. As you can see, we have all the data that we received from, let's say, OpenCensus and the other different emitters.

All of the data is together, in a single format, in a single place, and we can browse each service separately. So we have the data here in our first data sink, and this is all happening end to end; this is the visualization through Jaeger. So in this demo, all the data is flowing through what we discussed: the OpenTelemetry agent on the client's cluster and the OpenTelemetry collector on the production cluster, which is our platform.
Cool. We have talked about the Prometheus metrics, which were generated using the load generator Avalanche, and we can see that Avalanche has started generating the metrics, which are being consumed at the platform end, in a different cluster altogether, and we can see all the metrics coming in through the OpenTelemetry collector.

And I think with that, the small demo ends; there is not much more to it. This demo does not have the whole Cortex and multi-tenant setup, because I did not have time to get the whole thing done for today, on very short notice. But the idea is that the concepts we discussed in the slides are going to hold, and this is the backbone — the metrics backbone — of the product that we built. And yeah, with that, I think my talk is over, and I'm open for questions.