From YouTube: Scaling content routing - Andrew Gillis
Description
How many CIDs are there, and what are the open scaling questions in content routing?
So here's a rough idea of the kind of scalability we're talking about, our goal being 10^15 indexes. To give you an idea of what that means: if you look at this small red square in the corner, that represents 1 billion indexes, which is about what we are currently getting in 12 to 24 hours.
So a thousand times that, 10^12, would probably, given the rate of acceleration of indexing data coming in, be something we'd start seeing in maybe less than a year. And then looking at the large gray block, our eventual goal, that's a little bit harder to estimate the time on, but as a really rough guess: maybe five years.
Here's where we are now: we are at less than 10^12, which is basically the whole blue block, or one small bit in that big gray block.
It's going to take a number of steps to get there, and we're going to have to proceed in a way that allows us to course-correct as we go forward, accommodate new features, and understand how the growth velocity changes as new patterns of usage, technologies, and things like that are experienced by the indexer. But we're taking steps to eventually get there, and the next steps we're going to take should hopefully get us a couple of orders of magnitude closer to our end goal.
So what needs to be done first? Well, to figure out what we need to do first, we have to look at what the indexer struggles with the most. Currently, the vast majority of the indexer's work comes from ingesting all the published index data. It's a huge amount of data being published every day, and it's increasing. We see sustained rates of twenty thousand to sixty thousand new indexes per second, and that's normal: over 7 billion new indexes per week, which is increasing every week.
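As a sanity check on these figures, here's a quick back-of-the-envelope calculation. It assumes a constant ingest rate, which the talk notes is not the case (the rate is accelerating, so real times to the larger milestones would be shorter):

```python
# Back-of-the-envelope check of the ingest rates quoted in the talk.
# Projections assume a constant rate; the talk says the rate is
# actually accelerating, so these are upper bounds on the time needed.

SECONDS_PER_DAY = 86_400

def days_to_reach(target: float, rate_per_sec: float) -> float:
    """Days of sustained ingestion needed to reach `target` indexes."""
    return target / (rate_per_sec * SECONDS_PER_DAY)

low, high = 20_000, 60_000  # sustained new indexes per second

# 1 billion indexes: consistent with the "12 to 24 hours" figure.
print(f"1e9 at 20k/s: {days_to_reach(1e9, low):.2f} days")   # ~0.58 days
print(f"1e9 at 60k/s: {days_to_reach(1e9, high):.2f} days")  # ~0.19 days

# 10^12, the "less than a year" milestone (acceleration helps here),
# and 10^15, the end goal, which clearly depends on acceleration.
print(f"1e12 at 60k/s: {days_to_reach(1e12, high):.0f} days")
print(f"1e15 at 60k/s: {days_to_reach(1e15, high) / 365:.0f} years")
```

At a constant 60k/s, 10^12 takes about 193 days, which lines up with "less than a year"; the five-year guess for 10^15 only works because the rate keeps growing.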
So let's understand the ways that that overloads the indexer. A couple of ways: one is the sheer volume of data. The indexer just can't keep up with processing it: pulling it all over the network, and dealing with processing and storing it. The absolute volume of data is too much for it, and when the indexer gets backed up, it starts falling behind on its backlog of work. That means we need a scaling strategy so that we can divide all of that incoming data among multiple indexers, so that no one indexer is forced to handle all that data and become overloaded.
So we have to have a strategy that accounts for a way of handling when indexers reach maximum capacity. We have to be able to do something to hand off responsibility, or redistribute data, or something, so that the storage load is not too much for any one indexer to handle.
So this gives us some ideas for basic requirements. We know we need to be able to handle adding indexers at any time, as we need to increase the capacity of our indexing deployment.
Even if it's from a single data source, we need to have some way of handling the case where an indexer can't absorb any more data. We also need redundancy, because if we're indexing huge amounts of data, it can be very expensive to lose that data if an indexer goes down, so we do want the ability to have redundancy within our deployment. And since we don't know exactly which indexer is going to have a particular piece of data, we'll need a way of sending queries to all indexers so that we can merge the results, because the result that we want may come from any of them, or all of them.
So the decision we've made, to split up the work of ingestion, is to divide the publishers over the indexers in a pool of indexers. The publisher is the entity that publishes indexing data to indexers; it may or may not be the same as the actual content provider.
This strategy is very simple, and it allows us to assign an indexer a publisher. After it's assigned a publisher, the indexer can interact independently and directly with the publisher. It doesn't require any use of any sort of upstream service to parse and redirect data to indexers; the indexer can just interact directly with the publisher. And in order to decide how we assign publishers to indexers, we're going to use a coordination service.
A coordination service is a simple service that listens for announce messages, over pubsub or coming directly over HTTP. An announce message is just a publisher saying it has new data to index. The coordination service will look at the publisher in the announce message and see if it's already been assigned. If it has not been assigned to an indexer, the coordination service will decide on an indexer to assign it to.
After assignment, the indexer will be told to sync with that publisher, and from then on out the indexer will act independently with that publisher; the coordination service doesn't need to do anything more for that publisher.
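As a rough illustration (not the actual implementation; the class, method names, and the least-loaded policy here are all hypothetical), the assignment flow just described might look like this:

```python
# Illustrative sketch of a coordination service assigning publishers
# to indexers when announce messages arrive. All names are hypothetical.

class CoordinationService:
    def __init__(self, pool):
        self.pool = pool          # list of indexer ids in the pool
        self.assignments = {}     # publisher id -> indexer id

    def handle_announce(self, publisher: str) -> str:
        """Called for each announce message (via pubsub or HTTP).

        If the publisher is already assigned, there is nothing to do:
        the assigned indexer syncs with the publisher on its own.
        Otherwise pick an indexer (here: the least-loaded one).
        """
        if publisher in self.assignments:
            return self.assignments[publisher]

        load = {ix: 0 for ix in self.pool}
        for ix in self.assignments.values():
            load[ix] += 1
        indexer = min(self.pool, key=lambda ix: load[ix])

        self.assignments[publisher] = indexer
        # A real deployment would now call the indexer's admin API to
        # tell it to sync with the publisher; from then on the indexer
        # interacts with the publisher directly.
        return indexer
```

Repeated announces from the same publisher are cheap no-ops, which is why the service can go offline without disrupting already-assigned indexers.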
The coordination service was chosen because it's a very simple way of making the decision for the indexers. It doesn't require any sort of consensus mechanism shared amongst the indexers to find some way to agree on who gets which publishers. And it's also really not a single point of failure, certainly not in the short term, because the coordination service can go offline and the indexers will all continue to operate just fine without it.
And then, when the coordination service comes back online, it can resume assigning publishers to indexers, and everything will catch up as they go and index the content. Nothing will be missed.
The pool is also very simple: it's just a group of indexers that a coordination service is configured to control. All that means is the indexers are on the same network and their administrative interface is available to the coordination service. Indexers can be put into the pool by updating the coordination service's configuration, and can be removed from the pool the same way. When an indexer is added to the pool, any publishers that are in need of replication, in other words, publishers that don't already have enough indexers to replicate as per the configured replication settings, will be assigned to that indexer right away.
The coordination service can do some other useful things. One of those is polling the indexers: when an indexer is at 80% capacity, in other words it only has 20% of its storage remaining, the coordination service will inspect the other indexers to make sure that there are others available to take on the reassignment of that indexer's publishers when the almost-full indexer finally gets full. And if there aren't, it can alert administrators until something can be done about it. So if it predicts that there aren't going to be enough indexers to take on the additional load at some point, it gives enough time to send out alerts and have a person react to the need for more resources in the pool.
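A sketch of that capacity check (the 80% threshold is from the talk; the data shapes, the free-space heuristic, and the function names are assumptions for illustration):

```python
# Illustrative sketch: poll indexer capacities, and when one crosses
# 80% used, verify the rest of the pool could absorb its publishers;
# otherwise alert an operator while there is still time to act.

WARN_THRESHOLD = 0.80

def check_capacity(pool, alert):
    """`pool` maps indexer name -> {'used': bytes, 'capacity': bytes}.
    Calls `alert(message)` when an almost-full indexer's remaining
    growth could not be taken over by the other, non-full indexers."""
    for name, ix in pool.items():
        used_frac = ix["used"] / ix["capacity"]
        if used_frac < WARN_THRESHOLD:
            continue
        # Growth left before this indexer freezes and hands off.
        remaining = ix["capacity"] - ix["used"]
        # Free space on indexers that are themselves below the threshold.
        free_elsewhere = sum(
            o["capacity"] - o["used"]
            for n, o in pool.items()
            if n != name and o["used"] / o["capacity"] < WARN_THRESHOLD
        )
        if free_elsewhere < remaining:
            alert(f"{name} at {used_frac:.0%}: pool lacks capacity for hand-off")
```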
So what happens when an indexer does get full? This is where we have what's called freeze mode, or frozen mode, and publisher hand-off. When an indexer reaches 90% capacity, it'll automatically go into frozen mode. It's a special mode where the indexer stops ingesting any new data, and only responds to removal advertisements and metadata updates from its assigned publishers.
The indexer continues to respond to any queries for index data while it's frozen; it just can't accept any more new data, so it's not going to grow in size anymore. After it's frozen, when the coordination service sees that the indexer has become frozen, it will then assign that indexer's publishers to other non-frozen indexers in the pool, and the new indexers will sync starting from the last CID that was handled by the frozen indexer.
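A minimal sketch of freeze mode and hand-off as described (the 90% threshold is from the talk; the classes and the round-robin reassignment are illustrative assumptions):

```python
# Illustrative sketch of freeze mode and publisher hand-off.
# A frozen indexer keeps serving queries and still accepts removals
# and metadata updates (not modeled here); it just stops growing.

FREEZE_THRESHOLD = 0.90

class Indexer:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.used = 0
        self.frozen = False
        self.publishers = {}   # publisher -> last CID synced

    def ingest(self, publisher, cid, size):
        """Ingest new data unless frozen; freeze at 90% capacity."""
        if self.frozen:
            return False
        self.used += size
        self.publishers[publisher] = cid
        if self.used >= FREEZE_THRESHOLD * self.capacity:
            self.frozen = True
        return True

def hand_off(frozen, pool):
    """Reassign a frozen indexer's publishers to non-frozen indexers.
    Each new indexer resumes syncing from the last CID the frozen
    indexer handled -- no stored data is moved anywhere."""
    targets = [ix for ix in pool if not ix.frozen]
    moves = {}
    for i, (pub, last_cid) in enumerate(frozen.publishers.items()):
        target = targets[i % len(targets)]
        target.publishers[pub] = last_cid   # a resume point, not a data copy
        moves[pub] = target.name
    return moves
```

The key property is in `hand_off`: only the sync position travels between indexers, never the indexed data itself.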
So one of the results of this is that the indexing data gets spread across multiple indexers, even if it's from the same publisher.
So, for example, say we had an advertisement chain on a publisher running from A to Z. The first indexer gets A, B, C, D, then freezes because it's full; the next one gets E, F, G, then gets frozen; the next one gets H, I, J; and so on. You can see how that would be spread all over the indexer pool, and that's fine. It allows us to just keep adding indexers as more and more data pours in.

Then, as deletions occur, there may be sufficient space to consolidate some of the frozen indexers, or possibly re-initialize them into the pool if their data is no longer useful. So there are things that can be done with them. But the point is that freezing provides a way of handing off the data. Basically, freezing is a graceful way of letting the data that's coming into the indexers overflow from the full ones onto ones that aren't full, and it doesn't involve moving any data around.
If you think of other strategies, where adding indexers, or adding a node to a network, would cause a redistribution of data, even a minor one, that could be very expensive in an indexing situation just because of the amount of data involved: it could be expensive and very time-consuming. So having this freezing strategy allows us to handle full indexers without moving any data around.
What this requires is scatter-gather queries, because any indexer, or all of the indexers, could have the data that's being looked for. A client doesn't know which indexer to ask, and there's no idea of which indexer might have an answer for any key. Multiple indexers might, since multiple providers could provide the data for a key. So we have to have the ability to do scatter-gather queries.
We already have a service that does that; it's called indexstar. And since we have a homogeneous network of indexers right now, we already have to do that, so this service is already available and working well. It sends a client query to all the indexers, collects their responses, and merges them into a single response back to the client.
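A toy sketch in the spirit of what's described here (this is not indexstar's actual code; the function and its skip-on-failure behavior stand in for the real HTTP fan-out and circuit-breaker logic):

```python
# Illustrative scatter-gather front end: fan a lookup out to every
# indexer, tolerate failures, and merge the provider results.

from concurrent.futures import ThreadPoolExecutor

def scatter_gather(key, indexers, timeout=1.0):
    """Query every indexer for `key` and merge their results.

    `indexers` is a list of callables, each taking a key and returning
    a list of provider records (stand-ins for real HTTP queries).
    Unresponsive or failing indexers are simply skipped -- a crude
    stand-in for circuit-breaker behavior.
    """
    results = []
    with ThreadPoolExecutor(max_workers=len(indexers)) as pool:
        futures = [pool.submit(ix, key) for ix in indexers]
        for fut in futures:
            try:
                results.extend(fut.result(timeout=timeout))
            except Exception:
                continue  # skip failed or slow indexers
    # Merge: deduplicate providers while preserving response order.
    seen, merged = set(), []
    for provider in results:
        if provider not in seen:
            seen.add(provider)
            merged.append(provider)
    return merged
```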
It also does things like circuit-breaker logic: if an indexer becomes unresponsive, it's not going to wait for the response from that indexer until it becomes responsive again. And indexstar, like I said, is currently working; it's a necessary part of this network.
So, scatter-gather queries: yes, they do multiply all of the incoming queries by the number of indexers in the pool. This may be some cause for concern if the query load is sufficiently heavy, but right now it's not.
The cache takes care of about 95% of the query load, and what does get through is actually fairly small. Queries themselves are small, and so are the responses. So it's not taking a huge amount of processing power or network bandwidth to deal with the query load at this point. That may change at some point in the future, but currently we don't feel that we need to address it in this step. But as the query load does increase, what are the next steps?
One of the eventual strategies that we'll get to, and this depends on a number of factors which we don't know yet, which is why it would be considered a next step, is that we would want to distribute the query load by sharding over the key space.
In other words, different indexers would be responsible for handling a different portion of the key space, and we'd do something like that with a consistent hash. We'd split the ingestion layer and the query layer into two separate layers, so that ingestion could be handled separately. The nodes handling ingestion would then write the results back to the query layer, which is key-sharded, and queries would then be directed, based on their key, to specific nodes.
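A minimal consistent-hash ring of the kind mentioned here (the node names, virtual-node count, and hash choice are illustrative assumptions, not the indexer's design):

```python
# Illustrative consistent-hash ring for sharding a key space.
# Each node gets many virtual points on the ring so keys spread
# evenly, and adding a node remaps only about 1/N of the keys.

import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted(
            (self._h(f"{n}-{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _h(s: str) -> int:
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        """First ring point clockwise from the key's hash."""
        i = bisect.bisect(self.points, self._h(key)) % len(self.ring)
        return self.ring[i][1]
```

With a sharded query layer, a lookup for a key would go only to `node_for(key)` instead of fanning out to every indexer; the trade-off discussed next is that changing the ring can still force data to move.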
In the query layer, that may prove to be critical in the future, but it could also be problematic: if there's ever any need to redistribute data, even though a consistent hash minimizes it, the sheer amount of data could be a problem.
So it's a matter of weighing cost versus benefit, and we don't really have all of those factors yet at the scale where we'd start being concerned about it.
So that concludes my presentation. Thank you very much. Here are some links where you can find more information about the indexers or get in contact with us; feel free to reach out. Thank you very much.