Description
Lightning Talk: Troubleshoot Compactor Backlog with Ease - Ben Ye, ByteDance
This talk covers a common problem when running Thanos and Cortex at large scale: compactor backlog. The compactor is a core component, so it is important to make sure that compactors run smoothly and are well scaled. In this talk, Ben Ye will explain why compactor backlog happens and how to prevent it. He will walk through ways to identify and troubleshoot it using existing metrics and tools.
First, let me introduce the Thanos compactor. The Thanos compactor compacts blocks on object storage in order to improve query performance. Besides that, it also handles block downsampling and data retention. From the implementation perspective, the compactor is just a cron job. For example, it runs every five minutes, and each run is called an iteration. In each iteration, the compactor performs the three tasks here in order, which means that if there is too much compaction work to finish, it can't start downsampling and retention.
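The ordering constraint described here can be sketched in a few lines. This is a toy model, not the compactor's actual code: the function name and the per-phase durations are hypothetical, and the only thing it shows is that a long compaction phase pushes back the start of downsampling and retention.

```python
from itertools import accumulate

# The three tasks run strictly in this order within one iteration.
PHASES = ("compaction", "downsampling", "retention")


def phase_start_times(durations_min):
    """Return the minute offset at which each phase starts within one iteration.

    durations_min is a hypothetical list of how long each phase takes,
    in the same order as PHASES.
    """
    if len(durations_min) != len(PHASES):
        raise ValueError("expected one duration per phase")
    # Each phase starts when all earlier phases have finished.
    starts = [0, *accumulate(durations_min)][:-1]
    return dict(zip(PHASES, starts))
```

With a six-hour compaction phase, `phase_start_times([360, 30, 5])` shows downsampling starting only at minute 360 and retention at minute 390.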
So usually the backlog happens in phase one, which is the compaction phase. Why does this happen? We can think of it as a message queue scenario. Here the Thanos compactor is the message queue consumer, and the producers are the Thanos sidecars, rulers and receivers, which upload blocks to object storage. In this case, object storage is the message queue. A backlog builds up when the producers upload blocks faster than the compactor can consume them.
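The message queue analogy can be made concrete with a small simulation. This is a minimal sketch under simplifying assumptions: upload and compaction rates are constant, and the function name and all numbers are made up for illustration.

```python
def simulate_backlog(uploaded_per_hour, compacted_per_hour, hours, initial=0):
    """Return the number of uncompacted blocks at the end of each hour.

    Producers (sidecars, rulers, receivers) add blocks to the queue;
    the single consumer (the compactor) removes as many as it can handle.
    """
    backlog = initial
    history = []
    for _ in range(hours):
        backlog += uploaded_per_hour                 # producers upload blocks
        backlog -= min(backlog, compacted_per_hour)  # consumer drains what it can
        history.append(backlog)
    return history
```

When uploads outpace compaction, for example `simulate_backlog(12, 10, 3)`, the backlog grows every hour; when the consumer is faster, an existing backlog drains to zero.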
So the key thing here is to identify the backlog issue, and there are several ways to do so. First, the compactor itself exposes some very useful metrics. These two metrics tell us how many iterations and downsamplings have been performed so far. If these two counters stay at the same value, or increase only slowly, then a backlog might be building up. And if you don't see any retention happening for very old blocks, then the compactor might be busy compacting your blocks and cannot start applying retention. This last point might not be that obvious.
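As a rough illustration of the "counter is not moving" check, the sketch below computes how much a counter grew over a window of scraped samples. The talk does not name the two metrics in this passage; they are assumed here to be `thanos_compact_iterations_total` and `thanos_compact_downsample_total`. The helper names and the threshold are hypothetical.

```python
def counter_increase(samples):
    """Total increase of a monotonically increasing counter, tolerating resets.

    samples is a list of counter values scraped over a time window, e.g.
    values of a metric assumed to be thanos_compact_iterations_total.
    """
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again at zero.
        increase += cur - prev if cur >= prev else cur
    return increase


def looks_stalled(samples, min_expected):
    """Flag a possible backlog when the counter grew less than expected."""
    return counter_increase(samples) < min_expected
```

A flat series such as `[5, 5, 5, 5]` yields an increase of zero, so `looks_stalled([5, 5, 5, 5], 1)` is true. In practice the same idea would normally be written as an alerting rule on the counter's rate rather than in application code.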
Another way to identify the backlog issue is to use the progress metrics. Since the Thanos v0.24 release, four new metrics have been introduced. They are very good signals of whether your compactor has hit a backlog or not, because they represent the compaction progress. Please do give them a try; they are very useful in alerts as well.
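One way such a progress signal could feed an alert is sketched below. It assumes a gauge that reports the amount of outstanding compaction work (for example a metric like `thanos_compact_todo_compactions`, a name not given in the talk) and fires only when the backlog is both large and not shrinking. The function names and the threshold are illustrative.

```python
def todo_trend(todo_samples):
    """Average change per sample interval of a 'to do' gauge.

    A positive value means outstanding work is growing.
    """
    if len(todo_samples) < 2:
        raise ValueError("need at least two samples")
    return (todo_samples[-1] - todo_samples[0]) / (len(todo_samples) - 1)


def should_alert(todo_samples, max_todo):
    """Alert when the latest backlog exceeds max_todo and is not shrinking."""
    latest = todo_samples[-1]
    return latest > max_todo and todo_trend(todo_samples) >= 0
```

Requiring both conditions avoids paging on a large backlog that the compactor is already working through.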
Next, let's talk about the solutions for the backlog. To solve the backlog problem, we definitely want to scale the compactors, and the easiest way is to simply scale vertically: we can add more compute resources to the compactor instances. Another way is to just increase the compaction concurrency.
There are two flags provided by the Thanos compactor. One is the compaction concurrency and the other is the downsampling concurrency. We can tune these flags to make a compactor instance more powerful. Another way is to scale horizontally, and for horizontal scaling there are actually two ways to go. One way is to just shard by time.
For example, we can have two compactors: one compactor takes care of blocks produced last week, and the other takes care of blocks produced, say, last month. In this way we can distribute blocks to different compactors by time. Another way is to shard the blocks by their external labels, so that we can group blocks from the same cluster together on the same compactor. In this way we achieve the same goal and successfully distribute blocks to different compactor instances.
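Both sharding strategies come down to a stable rule that maps each block to exactly one compactor. The sketch below illustrates such rules in plain Python; it is not how Thanos implements sharding (in a real deployment the split is expressed through each compactor's configuration), and the function names, the hash choice and the boundaries are assumptions made for illustration.

```python
import hashlib


def shard_by_time(block_age_days, boundaries_days):
    """Return the index of the first age range that contains the block.

    boundaries_days is an ascending list of upper bounds; [7] means
    compactor 0 handles blocks younger than 7 days, compactor 1 the rest.
    """
    for index, upper in enumerate(boundaries_days):
        if block_age_days < upper:
            return index
    return len(boundaries_days)  # older than every boundary


def shard_by_labels(external_labels, num_compactors):
    """Map a block's external labels to a compactor index in a stable way."""
    # Sort so that label order does not change the result.
    key = ",".join(f"{k}={v}" for k, v in sorted(external_labels.items()))
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_compactors
```

Because the label-based rule depends only on the label set, every block from the same cluster lands on the same compactor, which is what allows those blocks to be compacted together.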