From YouTube: Tempo Community Call 2022-03-10
Description
- Metrics Generator Details
- Search improvements for 1.4
- Looking forward to Parquet!
A
Cool, so welcome to the community call, March edition, 2022. I can't believe it's the year 2022, but it is; here we are. We've got a couple of things today, and I think we even have a fancy presentation. We're going to talk about the metrics generator, and we have some good news there: that will be in 1.4, which, I guess, maybe we'll target for next month — or is it this month?
A
I don't know, I'll go back and look. We just generally do a release every other month, so maybe that's next month. And then we have some backend search news. I was mentioning this earlier, but I really feel like we've crossed a hurdle. We added a couple of features there that have seriously improved query times, and I think we have one more coming soon, so 1.4 is going to see way faster search speeds. And Marty and Annanay can talk about some of our Parquet work.
A
So I think — I don't really know, but I think metrics are first. Is that right? Oh, before we get into it: feel free to put anything in the agenda doc, and you're also welcome to jump in with questions, either in chat or you can unmute and ask. This is very casual.
A
We do have this fancy slide presentation thing, but you know, this is a conversation; don't be afraid to jump in, ask what you need to ask, and interrupt whoever's chatting. So with that, I think — is it metrics first? Koenraad, is it you?
B
So I'll start by talking a bit about the metrics generator. We already discussed this a couple of times in previous community calls, so this will just be an update, and I'll share a bit of the performance we're seeing in our internal clusters.
B
So I think last time we discussed the design document. Going further from that, we have now also merged the metrics generator code into the main branch, so you can run it already, if you dare to run the latest commits from Tempo. We run it internally and it's been going pretty well, but we're doing a lot of improvements at the moment.
B
I just wanted to do a quick recap of the architecture. The way the metrics generator works is we're adding this new component next to the ingester, which will receive a second, asynchronous request from the distributor. So the distributor writes to the ingester first, and then it does an async request to the metrics generator. It also has a ring to do load balancing, and the metrics generator will process the incoming spans and write metrics to Prometheus.
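For orientation, here's a minimal sketch of what those moving pieces might look like in config form. The code had only just landed on main, so the key names below are assumptions for illustration, not a documented config surface:

```yaml
# Sketch: the generator is its own component with a hash ring, and the
# distributor forwards spans to it asynchronously after writing to the
# ingesters. Key names are illustrative.
metrics_generator:
  ring:
    kvstore:
      store: memberlist   # same ring mechanism used by the other components
```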
B
I also have an architecture diagram of the internals of the metrics generator. That's been moving around as we discover stuff and figure out better ways to design this, but this is the architecture we're ending up with now, and it feels like a very good base. What it looks like internally is you have this pipeline from left to right, and there are three modules, or stages, in this pipeline.
B
So the first stage in this is the processors. The metrics generator receives all these spans that are being written to Tempo, and we use these metrics processors to convert the spans into metrics. A processor can, for instance, count how many spans there are with an error: it just keeps an updated metric — "oh, I saw a span with an error, okay, plus one" — over and over. The processor then updates the metrics and stores them in the registry.
B
We intend to make the processors kind of dynamic, so you can change them at runtime: you can add processors, you can disable them, and you can also change their config. Right now you can only enable and disable them, but we also want to make some configuration dynamic, so you can adjust it on the fly in case you're generating too many metrics or too many series. We currently have two processors.
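The two processors that existed at this point were service graphs and span metrics. Enabling them per tenant looks roughly like this — a sketch based on Tempo's overrides mechanism; exact keys may differ by release:

```yaml
# Sketch: processors are toggled per tenant at runtime via overrides.
overrides:
  metrics_generator_processors:
    - service-graphs   # edges and request/failure rates between services
    - span-metrics     # call counts and latency histograms per span
```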
B
So the next stage in this is the registry. The registry is this store of metrics: it just keeps track of all the metrics that are generated by the metrics generator. The processor uses the registry to store the metrics, and it will update them, increment them, and so on. The registry will then scrape itself, kind of like Prometheus scrapes other instances; it's a similar operation. It gathers the state of all the counters in the registry and then writes a sample to the storage layer.
B
And in this registry we can also enforce limits on the number of active series and some other properties. Then the final stage in this pipeline is the storage layer. What it does is buffer the samples as they get written to the downstream time series database. We use the Prometheus remote write protocol to send metrics to, for instance, Prometheus, Cortex, or whatever other database supports this protocol.
B
The storage component will buffer samples before sending them, and it also handles retry logic, queueing, and those things. What's actually pretty fun is that we didn't write this ourselves — this whole storage component with a WAL. We're using the Prometheus Agent: the Prometheus Agent is a specific mode to run Prometheus which has a minimal WAL implementation without the querying and alerting capabilities.
B
It's just a WAL to store samples and remote-write them, and we were able to use that code directly and plug it into Tempo, which is, you know, really cool, because it's a very good implementation which has been battle-tested already. It will also operate the same as Prometheus, so you have the same metrics and the same behavior as if you were running a Prometheus instance. Yeah, and that's kind of the architecture we have right now.
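Putting the storage stage together, a sketch of what pointing the generator's WAL at a Prometheus-compatible remote-write target might look like; the URL and path below are placeholders:

```yaml
# Sketch: the storage layer keeps a Prometheus-Agent-style WAL and
# remote-writes samples to any compatible TSDB.
metrics_generator:
  storage:
    path: /var/tempo/generator/wal               # local WAL directory (example)
    remote_write:
      - url: http://prometheus:9090/api/v1/write # Prometheus, Cortex, etc.
```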
B
So we still have a couple of PRs up to get this into place, but it's slowly taking shape. I can also share some performance stats. We've been running this in our internal clusters for, I guess, two or three weeks, and I can already share some stats from what we have observed.
B
So the first thing that we observed is that by adding the metrics generator, the distributor has to do more work, right? It has to send the second request to the metrics generator, and we saw that this currently has an especially high impact on CPU.
B
We saw a 30 to 40 percent increase in CPU usage across all the instances, and this is caused by the extra work that we have to do: we have to send a second request, we have to marshal this request again, buffer it for a bit, and stuff like that. We think we can still optimize this a bit with some better buffering logic.
B
The largest deployment we have it running in right now is a cluster that is ingesting about 1.2 million spans per second, about 200 megabytes per second, and we can handle this load by running eight metrics generators next to each other — so just eight replicas. That's, I think, about half the number of distributors and about a third of the number of ingesters.
B
So this component is already proving to be more efficient than the distributor, which is nice, because then you need fewer instances. In our cluster each one was consuming about two CPUs and about 1.6 gigabytes of memory, just to give an idea of what this uses.
B
We noticed that the CPU usage is correlated with the number of spans we're ingesting, but memory not so much. The CPU usage is expected: if you ingest more spans, you have to do more work to deserialize the data, to unmarshal the protobuf.
B
We noticed that if you go higher, the instance starts to struggle and starts to crash and stuff like that. But memory is not really impacted by the number of spans you ingest, which is also nice, because it means we can optimize the code a bit more.
B
So it's more about CPU efficiency; memory is not a big issue right now, it seems okay. And then, oh yeah, what I can also share: an issue with generating metrics from traces is that traces typically have higher cardinality, right? You can have a lot of span names and a lot of different tags within your traces, and when you instrument your code you might not be thinking about cardinality — you might not consider that you shouldn't put, say, a random number into your span name. That can happen.
B
And by generating metrics from traces, we have this risk of just causing too many series, which would bring down your time series database. So something we're working on is adding controls to limit the number of active series we're generating within the generator. I have two screenshots here showing the number of active series in the generator. The first one is without any controls: every time there's a new span coming in with new data, you'd start a new time series.
B
We would just keep sending them all the time, which means that if this instance runs for multiple days, this number always increases. You can see this always goes up: it starts at about 10,000, and it goes up to 15,000, 20,000, and you just keep sending all these time series all the time. So what we've now integrated into the registry is the ability to drop stale series.
B
This means that a series that hasn't been updated in, say, 15 minutes, you just delete: you stop sending it, and then you get a nicer, flatter line and a more stable system. So this is one way to deal with cardinality: if there's a spike in cardinality, there will be a moment in which there are more active series, but after a while it goes back down again once those series stop being updated.
B
A second thing we're adding is the ability to set limits. You can, for instance, limit the generator: hey, you can only emit 10,000 active series, and when there's one more, you just stop — you don't generate a metric for it. That would be a hard stop to protect the downstream database, or the bill you're paying for it. Yeah, and that's kind of it for what I have on the metrics generator.
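A sketch of those two safety valves — staleness-based eviction and a hard cap on active series — expressed as per-tenant overrides; both key names here are my assumptions rather than confirmed config:

```yaml
# Sketch: cardinality protections in the generator's registry.
overrides:
  # drop a series that hasn't been updated recently (name is a guess)
  metrics_generator_registry_stale_duration: 15m
  # hard cap: once this many series are active, new ones are not created
  metrics_generator_max_active_series: 10000
```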
B
I think we have a couple of big changes coming in the next week or two, and then we should be at a very good spot to release this with Tempo 1.4. So we'll probably be working more on documentation and sharing operational information, alerting, stuff like that. Cool.
A
Go back a few slides to your metrics — one of your distributors hit 40 gigs of working set. Is there really a limit over 40 gigs? How did that not go boom? What happened there, do you know?
B
Yeah, I don't know, I didn't really look into it. I thought it was just something to do with a rollout. Maybe it's the bucketing that didn't match well.
B
Oh yeah, these graphs are not from the million-spans-per-second cluster — this is from the smaller cluster — but in the cluster in which we were ingesting about a million spans per second, I think we had closer to a hundred thousand active series, and we noticed
B
it also increased all the time. It started at about a hundred thousand, and if you run it for a couple of days, it goes up to five hundred thousand. So every instance is sending 500,000 active series, which is just too much. We still have to figure out this ratio between the number of spans ingested and the number of active series; it's not always correlated.
B
If you're sending a lot of spans but they're all from the same system, you might have a small number of metrics, while if you're sending a couple of spans with very high cardinality, you might be generating a ton of series. So that's something interesting that we'll have to learn.
D
And we'd have to do some sort of analysis to figure out which active series are just enough to create the kind of dashboards and graphs we need.
B
All right, cool. I think we can move on to the next part of the community call, which is backend search.
A
That's me, yeah: backend search. So we're going to talk a little bit about some of the settings, or tunables, for search, and I'll share our internal settings as well as some new ones coming out in 1.4. So we'll talk a little bit about what's in 1.3.2, which is out now, then what we've changed in 1.4, and what kind of performance we're seeing because of those changes.
A
I'll share some internal metrics, kind of like Koenraad did with the metrics generator: given our spans per second, what kind of search rates are we seeing. And then we're going to talk about Grafana Cloud Traces a little bit. Coming soon — it's actually already turned on for a handful of customers. If you're interested in this feature and you're using Cloud Traces, you're
A
welcome to ask me on this call or DM me on Slack, and I'll turn it on for you as well. Across the board, we're targeting the week of March 21st. We need some things from the hosted Grafana team: they need to set a feature flag for us, and they're working out how they're going to do that.
A
So I think around the week of the 21st we'll finally have this enabled in Cloud Traces for everyone, which means everyone will have the full backend search experience. Well — my bad — if you're in Google Cloud, which almost all of our customers are. We're looking to roll that out to other clouds, probably in the next few months. So yeah, we can move forward.
A
So, in 1.3.2 — we've shared this diagram before. You have the query frontend: a query comes in and it's broken up into thousands of jobs, which the queriers then do their best to service, and they also pass these on. You can also configure it to use serverless, which is what we do given our volume. So in 1.3.2, the querier is a proxy for serverless.
A
Basically, the query frontend makes — not thirty thousand, not that many — three or four thousand jobs. The queriers consume those jobs one at a time and pass them on to the serverless functions, which then serve each one, one at a time. I've highlighted some settings: if you look in our docs, under the operational section there's a doc that outlines everything you're seeing here, but I'll just talk through it real quick. We need to just make some adjustments.
A
You'll notice we've changed the server timeouts, the HTTP timeouts — we're still operating relatively slowly, not as fast as we want, but we're improving — and then the concurrent jobs and the max outstanding per tenant just let you push a whole lot more through the query frontend. The defaults in the query frontend are still kind of tuned for trace-by-ID search.
A
We might want to actually update that and make the defaults work with search generally, as we've built it now. But the important ones are things like the target bytes per job, which divides your entire search request into individual jobs that are, in our case, about 10 megs each, which are then served by the queriers or the serverless functions. And then you can see some other options on there — concurrent jobs and max outstanding — for your queriers to service.
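A sketch of those frontend-side tunables in config form. The ~10 MB job size mirrors what was just described; the other values are placeholders, and key names may differ slightly per release:

```yaml
# Sketch: query-frontend tunables for backend search.
server:
  http_server_read_timeout: 2m      # raised from defaults for long searches
  http_server_write_timeout: 2m
query_frontend:
  max_outstanding_per_tenant: 2000
  search:
    concurrent_jobs: 2000
    target_bytes_per_job: 10485760  # ~10 MB of trace data per job
```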
A
At the querier level, the search external endpoints setting is what actually tells it to use serverless, and then there's max concurrent queries. By default, max concurrent queries is something like two — again, everything is tuned for trace-by-ID search — but if your queriers are only doing two at a time each, that's not going to do you any good; you need them to do way more. So we have that set to 100 right now in our largest cluster.
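And a sketch of the querier side; the endpoint URL is a placeholder and the key names are close to, but may not exactly match, the docs for your release:

```yaml
# Sketch: querier-side serverless settings.
querier:
  max_concurrent_queries: 100   # default (~2) is tuned for trace-by-ID
  search:
    external_endpoints:
      - https://<region>-<project>.cloudfunctions.net/tempo-search
```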
A
If we move forward — so in 1.4 we've added a few settings that are really changing the rate at which we can search the backend, quite a bit. One is we added hedging to the query frontend, actually just yesterday or something. In fact, it's kind of embarrassing that it took me this long to add this feature, because it tripled the rate at which we could search the backend.
A
Our latencies were just dominated by the long tail. I should have seen that weeks ago, but didn't; I was kind of caught up in some other details I was trying to unravel. Once I added hedged requests to the frontend, we just instantly tripled the rate at which we were searching the backend, by simply repeating requests once they exceeded a certain threshold. The defaults there are kind of tuned to search, which is five seconds for an individual job to be outstanding, and we'll try it up to three times.
A
So if it doesn't come back in five seconds, we'll just try again; if that happens again, we'll try again. These are tuned to roughly our p99 on search. Another feature we added: the queriers in 1.3.2 just proxy to the serverless, like we talked about, but we added a feature so the queriers will actually do some work themselves — there's no reason for them to just sit there and do nothing. So with search prefer self,
A
the querier will always be doing two jobs itself while continuing to proxy to the backend. It adds a little bit of throughput, and the queriers tend to have more horsepower than the serverless functions, so they can churn through the jobs just a bit faster. It helps some in reducing your latency and better utilizing your queriers and your resources while you're doing search. And then finally, in the ingester, we added this ability to — well, to take a step back.
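Roughly what those two knobs look like together in config. These features had only just landed on main, so treat the names and their placement as assumptions:

```yaml
# Sketch: hedged requests plus querier self-service.
querier:
  search:
    prefer_self: 2                    # querier keeps two jobs for itself
    external_hedge_requests_at: 5s    # re-issue a job still out after ~p99
    external_hedge_requests_up_to: 3  # at most three attempts per job
```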
A
We've made a change in 1.4 where the time range on the blocks is based on the span times. It used to be based on ingestion time, which wasn't great; now it's based on the span times instead.
A
This allows us to better set the range for the blocks, especially when we replay the blocks when the ingester restarts. But the problem we found in basing it on span time is that people send us spans from, like, three weeks ago, which didn't quite work, because basically every single block covered most of our retention, so any search attempted to access every block on the backend — they all said they had spans in those time ranges. So this ingestion time range slack will force the ingester to only update the block time range
A
if it's within a certain slack of ingestion time. The main gain here: when you used to replay the WAL, you would basically reset those block time ranges to whenever the WAL was replayed, which basically lost them in search. With these updates, when the WAL replays, it will use the span times instead, so you won't lose those; blocks will have authentic min/max time ranges, which is the big gain there.
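A sketch of that setting; the value shown is an example, and the exact name and location in the config tree are illustrative:

```yaml
# Sketch: only spans whose timestamps fall near "now" can move a block's
# min/max time range, so a stray three-week-old span can't stretch a
# block across most of retention.
storage:
  trace:
    ingestion_time_range_slack: 2m
```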
A
If you go forward — so, progress so far. We fixed the min/max time range in the block, which we talked about; this will make our search more authentic. I really wanted this in cloud before we rolled it out to everybody. And we added that ingestion time range slack setting, so it doesn't accept times from three weeks ago or something silly.
A
We've added a new object encoding — actually only the second ever — which is increasing the rate at which we can search. It's an improvement on 1.3.2: we've added the start and end times outside the proto. One of the biggest bottlenecks in search is unmarshaling the proto and then piling through every span.
A
Also, we use that start and end time for duration. So if you have a duration in mind and you specify that in your search, we can skip quite a few traces as well: if you say one to two seconds or something, anything under one second won't even be unmarshaled. And our current throughput at the tip of main, with all these improvements, is about 30 to 35 gigs a second.
A
We just immediately got 30 to 35 gigs a second, which is great. And then I have an experimental branch with some proto-unmarshaling improvements that's seeing 60 gigs per second on search pretty regularly — it's consistently at about 60 gigs a second. The improvements, unfortunately, use gogo proto, which is not maintained, so we're looking at possibly still doing that, using gogo proto and some features in gogo proto.
A
We're also looking at vitess — PlanetScale has a proto marshaller. Actually, just today I was spending some time getting our proto generated using their generator. So we might try the vitess proto generator as well, and we're looking to get around 60 gigs a second.
A
I think at that point, at 60 gigs a second, I'm probably going to let this rest. Maybe there'll be some small improvements, but I'll be happy with those speeds as long as we can get them consistently, and then look forward to maybe the next generation of the Tempo backend — maybe Parquet. We'll have Marty and Annanay talk about that a little bit next: our thoughts on Parquet and our chances of moving there. And to give some metrics here, at 220 megs a second, which is what we're doing internally,
A
it takes about 20 to 25 seconds to search one hour. Not great, but not terrible. That's an exhaustive search: if you actually specified some parameters, it'd come back a lot faster. That is searching every single trace in the backend in that hour and failing to match anything, basically. And it's pretty stable in those time ranges — 20 to 25 second searches for these larger installations.
A
For, you know, hundreds of megs a second, search will continue to feel like a batch job for a while. But for smaller installations — double-digit megabytes a second — you could definitely see significantly faster search times, I think. Cool. And then some metrics from our serverless.
A
If you want to go forward — yeah, so just to share some of this, because it's fun to look at. Earlier this week, two or three days ago, you can kind of see my pattern here: I'd make a small change and then run a test, then make a small change and run a test.
A
I was trying to unlock the combination of parameters that really helped us get to this 30 gigs a second. So here you can see, just to share some of this, the number of instances of Google Cloud Functions when I'm doing my searches: they're hitting five, six, seven thousand instances active at one time, and then you can see the idle functions too. It's kind of fun to see.
A
Next: this is the gigabytes per second as reported by GCS. This is compressed, so it's going to be less than the 30 we were talking about, because the 30 is uncompressed proto. So we're seeing about 10 gigs a second, spiking up to 18 gigs a second, coming out of GCS compressed.
A
We are currently using LZ4 on our backend. The default is actually zstandard, but LZ4 costs less to decompress, so we've been experimenting with it, because we're doing search. Honestly, maybe for 1.4, or perhaps 1.5, we'll probably revisit our defaults.
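For reference, the backend block encoding is a one-line setting; the specific lz4 variant below is an example choice:

```yaml
# Backend block encoding: zstd (the default) compresses better, while lz4
# decompresses much more cheaply, which matters during search.
storage:
  trace:
    block:
      encoding: lz4-1M   # one of the lz4 variants; default is zstd
```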
A
On the backend it's just way cheaper to decompress, which is a lot nicer for the serverless functions: with much less CPU and memory, they can churn through these pages a lot faster. And then finally, this is the number of requests per second we're hitting GCS with — these are reported by Stackdriver here — and we're seeing about 12,000 to 15,000 requests per second, on average, that GCS is returning. These are tiny read ranges over the block. Each of these serverless functions is slamming GCS as fast as possible to slurp out data. You can even see our polling cycle way down here: these tiny bumps every five to ten minutes, a very small bump — that's our polling cycle, when all the different compactors and all the different components are asking for the index.json or building the index.json. So yeah, that's serverless. I'd say things are feeling good after feeling rough in 1.3.2; I didn't really like where it was, but it was technically working.
A
I really feel like things are entering the territory where I feel comfortable with the community running it a little bit more, and I feel comfortable with it being in Grafana Cloud Traces. It's still not amazing at extremely high volume, but it's functional and usable, and I like where it's at. And then hopefully we'll look to the future for some better formats, to just reduce the amount of data we pull — we're pulling gigs and gigs and gigs from GCS to answer these questions, which is hopefully going to be unnecessary in the future.
E
Hey, can we go back a couple slides, to the number of active and idle instances? I remember that being a lot lower — that's almost 8,000, that was really cool. Was that the trick of having multiple sets of functions, to get past kind of the throttling?
A
Right, so for the serverless endpoints we actually have the ability to have more than one, and the reason is to get past Google Cloud quotas: Google Cloud will only run 3,000 instances of any one function at a time, so we run 10, which is unnecessary — I've just been trying every combination of parameters I can to see how fast I can get data out of GCS. We could probably do this with three or four. Basically, this is the sum across 10 different Google Cloud functions.
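That fan-out is just a list of endpoints in config; a sketch with placeholder URLs:

```yaml
# Sketch: several copies of the same search function, fanned out to get
# past the per-function instance quota.
querier:
  search:
    external_endpoints:
      - https://<region>-<project>.cloudfunctions.net/tempo-search-1
      - https://<region>-<project>.cloudfunctions.net/tempo-search-2
      - https://<region>-<project>.cloudfunctions.net/tempo-search-3
```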
A
It's still, depending on your query load, a fraction of the cost of running Tempo as a whole, but it can add up if you sat there and just beat it to death. Constant querying would cost quite a bit, but Tempo is not built for that: Tempo is built to ingest as cheaply as possible, because that's almost all you'll be doing with a tracing backend, and to query in a more expensive way.
E
Okay, cool, hey, yeah. So, if you were here a couple of months ago at a community call, or saw it, we kind of talked about how we had been looking into a new block format for the backend — we touched on that a lot here — and at the time it was flatbuffers.
E
We dug pretty deep into that, and we had some numbers and some interesting findings, but ultimately it didn't feel like it panned out. Since then we've been looking into columnar formats, and Parquet in particular, so we'll walk through that with some thoughts, findings, and status, and just kind of what we're thinking there. Cool. So, okay, basic stuff first: why columnar? Well, we kind of touched on that.
E
With protobuf, you have to deserialize the whole thing — you can't get past it — and that's not great. Flatbuffers was one way to jump around, but you know, it didn't work out. Columnar: you only read the columns you need, and naturally our filtering, our search behavior, is very columnar. So as an example: in a 600-megabyte block, did you know that just the span IDs are 160 megs? That's fascinating to me, because for searching we never use that data for anything.
E
So why Parquet? There are a lot of other columnar technologies out there, so why Parquet in particular? Well, it's a file, which is great; that works really well for us. As far as Tempo's blocks go, it'd be very easy to put a file in the backend and treat it as a file — it doesn't require heavy state or heavy installations, things like that. And it already has so many awesome things in it, which we would want anyway: different encodings for different data types — delta encoding, dictionaries, run length —
It
has
different
types
of
compression,
page
statistics,
so
you
can
skip
around
and
skip
over
blocks
of
data
balloon
filters.
These
are
all
super
cool,
and
so
they
all
come
together
and
make
it
so
super
fast,
and
hopefully
really
you
know
you
can
read
a
lot
less
data.
So
let's
go
sorry.
Look
around
here.
E
Okay, column design. There are kind of two top approaches to columns. Would you do a nested design, like a trace — a trace is a graph of batches, spans, attributes, things like that — or would you flatten it into just a list of spans? Both options are really interesting. We actually tried both, and it just ended up that the nested approach seems better. It seems to come down to the things that are at the resource or batch level,
E
right: flattening those down onto every span just kind of outweighs any benefits you might get from the other design. Examples of that are cluster, namespace, and pod attributes. So here's an example schema, just to show the graph: you have the trace root at the top, which has a trace ID; inside that — this is kind of matching the OpenTelemetry protocol (OTLP) — resource spans, and a resource that has attribute keys and values.
E
So the actual columns in here — some of these are virtual — are things like the trace ID, the resource attribute keys and values, and the span attributes. Cool, okay, that's cool, but obviously we've got to go way further; there are so many cool things you could do.
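To make that shape concrete, here's a rough YAML-style sketch of the nested column tree being described, loosely mirroring the OTLP structure; the leaf types are guesses for illustration, not the actual schema:

```yaml
# Illustrative only: nested Parquet columns roughly following OTLP.
Trace:
  TraceID: byte_array
  ResourceSpans:                  # repeated
    Resource:
      Attrs:                      # repeated generic key/value attributes
        Key: string
        Value: string
    InstrumentationLibrarySpans:  # repeated
      Spans:                      # repeated
        SpanID: byte_array
        Name: string
        StartTimeUnixNano: int64
        EndTimeUnixNano: int64
```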
E
So one is: the attributes that we're searching on — don't put them all in one column; let's put them into their own columns. That's really cool, because when you're searching you're targeting a lot less data, but the block itself also gets a lot more efficient, because you're putting the repeated values together into their own columns, versus generic keys and values.
E
So what would that look like? Something like this, as highlighted: the resource would have dedicated cluster, namespace, pod, and container columns. These are just attributes that we commonly use in our own stuff, so we can target those directly. That would be really cool.
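And a sketch of the same resource node once those well-known attributes are promoted — again, hypothetical names for illustration:

```yaml
# Illustrative only: well-known attributes promoted to dedicated columns,
# with the generic key/value pairs kept as a fallback.
Resource:
  Cluster: string      # dedicated columns dictionary-encode very well
  Namespace: string
  Pod: string
  Container: string
  Attrs:
    Key: string
    Value: string
```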
E
Okay — hey, Annanay, do you want to touch on these slides?
D
So under the spans, in the fields, you can see there are columns like alert ID, attempt ID, block — and each of these was actually a tag on the span. We figured out a way to sort of blow these out into their own columns, and that way you can actually search them in parallel and perform joins across them.
D
So you only end up reading a really small section of the block, and then you can actually see the traces that match, and perform joins across these columns to finally filter out the traces that match multiple conditions. So this is really cool. One of the problems that we ran into — can we go to the next slide, Marty? — was this: when your app has auto-instrumentation, you sort of end up
D
with this bizarre schema, where you have a million tags on your spans, and these get blown out into their own columns, and then it starts looking something like this. So we're trying to come up with an opinionated way of saying: hey, we're not going to blow out attributes that have a prefix like http.request. We're kind of playing around with that.
D
We don't know what our approach is going to be, but if this turns out to be too expensive, then we might just put these into the old way of doing it, where you put each of these into one column. So we won't provide search on them, but you can still reconstruct the trace out of them. This is still up in the air, but it's just something to think about: the writing of the schema gets a little expensive as we grow the number of columns.
D
So this might help reduce that a bit, but our results are still inconclusive around that. And yeah, I think that's about it for the dynamic schemas. If we go to the next slide, I just have one last thing to point out: all of this is possible thanks to these folks. The segmentio team has come up with a really cool Parquet SDK for Golang, and we were able to do a lot of these things.
A
Go back to the — oh, my bad! You had some more stuff.
D
Also, it's cool that we have so many languages internally and somehow all of the SDKs have converged around these — or maybe it's our internal OpenTelemetry conversion thing, so that could be true.
E
Yeah, okay, so we have some super early but cool benchmarks; we just wanted to share them, you know, just to get on the hype train. A super simple query: one cluster value and a min duration. There are two test cases here, comparing against the original proto and the new Parquet stuff.
E
So you can kind of see, maybe, in here it's reading 600 megabytes — so the entire block — and finding 55 traces. After converting that to Parquet and doing the same thing, we're reading two megabytes, same 55 traces. Super cool, and it's way faster: 22 seconds versus — this is 140 milliseconds. Yeah, so we've got some cool numbers here. Now, a more realistic query, just for comparison.
E
I actually forgot to paste in the time here, but it's like 200 milliseconds or something. It's looking for all of these — and of course they're not all in their own columns — cluster, namespace, pod, service name, duration. So this is filtering down a lot further; we're only reading 11 megabytes, so that's pretty cool. Yeah, that is sweet. So what else do we have? Okay, so there's a lot more to figure out; these are just current findings and thoughts, and kind of our current directions.
E
So, things to figure out. Dynamic columns add a lot of moving parts, but they seem like they're worth it, because it's just fantastic speeds and things like that; the results look really promising. If you go back, we just had strings everywhere, so we need to figure out every other data type. We have some progress on that, but there are still things to figure out — things like the fact that you can actually have attributes with the same name but different types, so we need to figure out stuff like that.
E
How does that even work? I'm not even sure how that would work with our current search UI, so we need to get together on things like that. Parquet column naming is actually more strict than the tag names in the data, so there has to be some sort of transformation — what would that look like? The example is http.status_code, and then just any collision in general.
E
That maybe has to do with the structure of the thing that we're creating. And of course, we're reading a lot less data, but the I/Os are still kind of high, so that's just something we need to dig into — that's a potential thing we need to figure out.
D
Also, each column can have its own compression type, so I don't know how we'll figure that out.
D
Yes, and the block takes longer to create. It's not in the critical path, because we do it async, but it still needs to be fast, because otherwise our rollouts will become slow; it'll just generally be a little sad.
A
Compression types and all that — so we'll probably pay a little bit more on write when we move to Parquet, but the benefits, I suppose, would definitely be worth it. Not only those awesome search metrics, but I love that if we just put raw Parquet in S3, it suddenly unlocks the ability to use any of a million tools that already exist to work with Parquet and object storage — and there are a lot of those. You can query it with Athena; you could write a Flink thing, a Spark one.
A
Right, I've never worked with those technologies, but you could write some of that to take your data and do anything you want with it. I think it really unlocks a world of awesomeness when we connect to, or use, this open standard, if we can get there.
E
So I honestly don't know if we're compatible with those tools yet — I don't think we are — but I haven't used them a whole lot, so we'll see. If we are, please let us know; try it.
D
I think there are Parquet tools, visualizers and stuff, that we should probably try using — try querying the Parquet file outside of Tempo.
A
Cool. Any questions, concerns, thoughts? I think we've been talking about some of these things for a while, but I also think we've talked about some very future things with Parquet, which is very cool.
A
I think the metrics generator coming in 1.4 will be an awesome feature addition for anyone running it, and we're going to get some of those features into Grafana Cloud. I think 1.4 is going to see way better performance on search, so those of you who are using search, or wanting to experiment with it, are going to see far better results with 1.4. I still owe the community a guide on setting this up in AWS.
A
Technically we have a proof of concept running in AWS, and I have not documented this anywhere. If you check our docs, there's Terraform for deploying to Google Cloud, but nothing for AWS. So I do owe you all better documentation on the serverless path, as well as some specific docs around how to get this running with AWS Lambda and S3. But other than that —