From YouTube: 2022-11-10 Scalability Team Demo
A
Good. Mark, would you like to take us through the first item?
B
Yeah, sure. Okay, let me share my screen.
B
Yeah, so I guess this is just a share-out of what happened. We had a regression in Sidekiq for self-managed users, so let me open up the issue here. What happened is that two months ago, back in the 15.4 release, we tried to change the default routing rules: we wanted to change the default from the per-queue worker configuration to just two queues, one being "default" and one being "mailers", for typical self-managed users. So we did.
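For illustration, here is a minimal sketch of what that kind of default might look like in gitlab.rb (the setting name and values are assumptions based on the discussion, not a verbatim copy of the change):

```ruby
# Hypothetical /etc/gitlab/gitlab.rb sketch (setting names assumed).
# Before 15.4: each worker class had its own named queue (400+ queues in total).
# The 15.4 default described here routes jobs to just two queues instead:
sidekiq['routing_rules'] = [
  ['*', 'default'], # every job goes to the "default" queue (mail jobs stay on "mailers")
]
```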
B
This was safe back then, because the default queue groups on the Sidekiq side are still the catch-all star anyway. So we went ahead and deployed this. Then, starting from last week, the support engineers pinged us: basically, ever since the release, there have been a lot more tickets regarding background job performance.
B
What happened is actually pretty simple. Imagine you have these queue groups, already defined prior to 15.4. Each of these processes is only listening to certain queues, and none of them is the default queue.
B
So what happens in this case is that the only process doing any work is this one; the other two just sit idle. That's when they start to notice their Sidekiq latency getting higher. I've listed some of the cases here: this one was fine, and this one is even more serious, where you only have one working process instead of the seven or eight here.
B
Yeah, so basically, as I mentioned, the root cause is just that they had defined custom queue groups, and we unknowingly assumed that not many people would be using the queue selector anyway; that was the miscommunication there. So then, after they upgrade to 15.4, all the jobs are pushed to the default queue, and some of the Sidekiq processes become idle.
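To make the failure mode concrete, here is a hypothetical sketch of such a configuration after the 15.4 upgrade (queue names and setting keys are invented for illustration):

```ruby
# Hypothetical gitlab.rb sketch: custom queue groups defined before 15.4.
sidekiq['queue_groups'] = [
  'urgent_authorized_projects,urgent_other', # process 0: named queues only
  'mailers',                                 # process 1: named queue only
  '*',                                       # process 2: catch-all, includes "default"
]
# With the 15.4 default routing in effect, all jobs land in the "default" queue,
# so only process 2 has any work to do; processes 0 and 1 sit idle and
# Sidekiq latency climbs.
```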
B
The fix is actually pretty simple if they manage to contact our support; we have already briefed the support engineers. There are two ways. First, they can keep their queue selectors, but we override the routing rules so that all jobs are routed to the named queues again, so it'll be back to the 400-plus queues.
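A sketch of what that override might look like (assumed syntax; as I read it, an empty destination routes each job to its worker's own named queue):

```ruby
# Hypothetical gitlab.rb override (sketch): restore the pre-15.4 behaviour.
sidekiq['routing_rules'] = [
  ['*', nil], # nil destination: route each job to its worker's own named queue
]
# Existing queue selectors keep working, since the 400+ named queues
# are being populated again.
```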
B
So this restores the previous state. Other than that, we can start advising customers to use routing rules instead of the queue selector. A quick example: if they had something like this previously, we can advise them to translate the queue groups into the routing rules format. They can then define that jobs with these worker attributes will be routed to this "urgent_other" queue, and so on and so forth.
B
Then, in the queue groups, they can simply define the specific queues they want each process to listen to.
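For instance, a minimal hypothetical translation might look like this (attribute and queue names are illustrative assumptions):

```ruby
# Hypothetical gitlab.rb sketch: queue selector replaced by routing rules.
sidekiq['routing_rules'] = [
  ['urgency=high', 'urgent_other'], # route high-urgency workers to one named queue
  ['*', 'default'],                 # everything else goes to "default"
]
# Each Sidekiq process then lists exactly the queues it should listen to:
sidekiq['queue_groups'] = [
  'urgent_other',    # process 0
  'default,mailers', # process 1
]
```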
B
So that covers the affected customers. But we also noticed from the support tickets that there are a few more customers that have yet to upgrade to 15.4, and there are some big names among them, so we want to avoid this breaking change for them as much as possible. Our plan is for 15.6, the upcoming release, I mean two or three weeks from now.
B
We would also like to add logic to Omnibus and the Helm chart so that we fall back to the default routing to the 400-plus queues unless they have set routing rules or their queue groups consist only of stars; all-star queue groups are basically the default queue groups, per se.
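A rough sketch of how I read that planned fallback (pure pseudocode; the helper and conditions are assumptions about the plan, not the shipped logic):

```ruby
# Hypothetical sketch of the planned 15.6 Omnibus/chart fallback logic.
def effective_routing_rules(routing_rules, queue_groups)
  return routing_rules unless routing_rules.nil? # user configured them explicitly

  if queue_groups.all? { |group| group == '*' }
    [['*', 'default']] # stock all-star queue groups: the two-queue default is safe
  else
    [['*', nil]] # custom queue groups: keep the 400+ named queues, avoid the breakage
  end
end
```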
C
Mark, I don't know if I missed something there, and thanks for sharing, by the way; that was a nice, detailed walkthrough of the process. As of 15.7, does that mean we can then go ahead and deprecate the old behavior, or are we waiting on customers or users to make changes to their configuration?
D
Yeah, so, admittedly, a bit of a clickbaity title there. There's been a whole bunch of effort around efficiency in Kubernetes, and some of the changes that have been made have had kind of unintended consequences, in particular putting more load on some of the shared resources. I just want to show some specific cases of that.
D
So we had an incident, actually two incidents, related to PgBouncer client connection limits. Let me see if I can find the second one... I think it was this one. Yes, so we have one for max client connections on our replicas, and we have one for the same thing on the primary. Our PgBouncer setup is a little different depending on whether you talk to the replicas or the primary.
D
So PgBouncer takes incoming connections from all of the clients, which is going to be mostly the Sidekiq and web services, and that's a lot of connections, because we have one or maybe even more than one connection per pod coming in. It sort of does a fan-in and then multiplexes that onto far fewer backend connections when it talks to Postgres itself, because Postgres has a pretty hard limit on how many connections it can support.
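To put rough numbers on that fan-in, here is a back-of-the-envelope sketch (every figure is invented for illustration; the variable names mirror PgBouncer's max_client_conn and pool-size settings):

```ruby
# Hypothetical sketch of the PgBouncer fan-in (all numbers invented).
pods                = 300  # web and Sidekiq pods
connections_per_pod = 2    # client connections each pod opens to PgBouncer
max_client_conn     = 1000 # PgBouncer's limit on incoming client connections
backend_pool_size   = 100  # multiplexed connections PgBouncer opens to Postgres

client_connections = pods * connections_per_pod       # => 600 incoming
saturation = client_connections.fdiv(max_client_conn) # => 0.6 of the limit
puts "client-connection saturation: #{(saturation * 100).round}%"
# Scaling out pods pushes client_connections toward max_client_conn,
# while the Postgres side stays at backend_pool_size.
```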
D
But each PgBouncer process can only handle a certain number of incoming connections, so, going back to this, as we increase the pod count here, we're putting more connections into our PgBouncers. Let's see... yeah, we can see a chart here.
D
So
this
is,
let
me
open
the
actual
chart
if
I
can
get
it
to
load
and
it
looks
like
it's
not
loading,
not
sure.
What's
up
with
that,
but
it's
it's!
It's
our
saturation
ratio
for
how
many
connections
PG
bouncer
can
handle,
and
we
were
awfully
close
to
the
Limit.
Actually
I
think
we
even
hit
the
limit
in
the
end,
which
then
caused
some
some
user-facing
impact
because
they
get
500s
because
the
pods
can't
connect
to
postgres.
D
In Scalability we've been projecting our usage over time and doing sort of capacity planning, and when we make changes to how the infrastructure runs, that can actually drastically change the usage. I've been thinking about how we can approach this in a better way, so that we can better predict the outcome of those types of changes: when we make a change like that, where we're seeking to gain more efficiency, what would the other consequences of that be?
A
So increasing the pods increased how many connections needed to be made. Is increasing the pods something that an engineer chooses to do, or is it an automated thing, where once a certain level of usage is reached, it automatically creates new pods?
D
Yeah, so we use a horizontal pod autoscaler for all of our web service workloads, and that means we've got kind of two variables that affect how many pods we get.
D
We have the utilization target, which we set on the horizontal pod autoscaler, and we have the CPU requests. The CPU requests affect how big each pod is, and the utilization target is relative to that size: how much we fill it up. Both of those, indirectly, based on the dynamic behavior, result in getting more or fewer pods. Maybe to illustrate this, because it's kind of the main thing I wanted to show, I've been putting something together.
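For reference, the core horizontal pod autoscaler rule works roughly like this (a simplified sketch of Kubernetes' documented scaling formula, with invented numbers):

```ruby
# Simplified sketch of the HPA scaling rule (numbers invented).
current_replicas    = 20
current_utilization = 0.90 # measured CPU usage relative to each pod's CPU request
target_utilization  = 0.75 # the utilization target configured on the HPA

desired_replicas = (current_replicas * current_utilization / target_utilization).ceil
# => 24: pods running hotter than the target trigger a scale-out.
# Raising the target (or the CPU request) changes this ratio, and with it the pod count.
```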
D
This is still kind of early stages, but I've tried to put together a diagram to show some of these interactions, and this is sort of a model, if you will. We can look at it and try to reason through some of the scenarios. There's a good chance that there's stuff missing from here, right? This is an abstraction; it's simplified. But let's say we want to increase efficiency.
D
So if we were to, say, increase this utilization target, that means we drive each one of these pods harder; each one now has more work to do.
D
Pod count goes down. However, this has other potential consequences, because now each of these pods is running hotter. That can increase contention on the host, and it can also affect contention on the Ruby global VM lock, because, well, Ruby global VM lock contention is based on how many processes you have per pod. That's the Puma workers tunable, and we didn't change it.
D
So that stayed the same, but this went up, which means the CPU-to-Puma ratio changed in a way that each Puma worker is now doing more work, and that then drives this contention metric. So, you know, it's complex, right? That's kind of the point I'm trying to make, and it can be kind of tricky to really predict what exactly the effects are. Even with something like this we can try and reason through it, but chances are we'll get it wrong.
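A toy illustration of that ratio effect (all numbers invented; this is a deliberately crude model, not a real measurement):

```ruby
# Toy sketch: a higher utilization target means more work per Puma worker.
puma_workers_per_pod = 4   # unchanged tunable
cpu_request          = 4.0 # cores per pod, unchanged

[0.6, 0.8].each do |target| # utilization target before and after the change
  cores_per_worker = cpu_request * target / puma_workers_per_pod
  puts "target #{target}: ~#{cores_per_worker} cores of work per Puma worker"
end
# Same worker count, more work per worker: contention inside each
# process (threads competing for the Ruby GVL) tends to rise.
```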
D
So I think there's also an element of the scientific method involved in making these types of changes.
A
Everything that you're describing here sounds very similar, in terms of the class of problem, to when we were trying to decide, I think it was, the connection pool size for the database. It felt like we would do a certain amount of reasoning through the problem, we would set it, we'd get it wrong, and then we would do another round of reasoning, set it a different way, and see what that did. It seems like what you're talking about here is the same class of problem, where there's only so much reasoning we can do, and we're going to have to set the values and see what happens, to see how that actually plays out in production.
D
Yeah, I mean, I think we can also use those experiments to refine our models, so we can potentially get closer to being able to actually predict it. But yes, I think it's very much in the same realm as what you described.
E
What about a kind of copy of a production environment where we could run synthetic tests with load? Would that be a reasonable approach, or is the volume and number of moving parts so massive that it becomes impractical?
D
Yeah, I think it's also just technically pretty challenging to do. You either do synthetic load, and then you're likely not going to match what is actually happening in production, or you try to replicate real production traffic, and then you have the question of how you deal with state changes and writes, because those are kind of tricky to replicate in a way that you can really replay properly; you get out-of-order effects and whatnot. So I'm a bit of a skeptic when it comes to that approach.
D
One approach that I do think also has some limitations, but can be pretty effective, is making changes on a subset of the fleet, and that's what we're doing right now. We have the three zonal clusters, and we're making the change in only one of the three. There's still potentially some interplay between them, so I don't think this necessarily solves for all combinations of changes; it could still tip over once we roll out to everything. But it gives us at least a bit of an idea of how the changes are behaving, in a way that mitigates some of the risks.
A
In a controlled way, with a plan: you know, if we see these certain things happening while we're making these adjustments, then we alter the plan and change it to keep the system safe. It seems like doing these experiments on production is a reasonable thing to do.
A
It seems like the easiest way to see exactly what will happen in production is to change it, just doing it within a risk tolerance.
C
It's the failure thing that we've discussed before, but obviously controlling failure at the same time. I don't know if this is what you were describing in your diagram or something different, but is it possible or feasible that our autoscaling strategy is dependent on certain hard limits in the system? I guess what I mean by that...
D
I mean, the question is: what do we do if the pods are at 100% utilization? So I guess one answer to your question is: yes, we have some controls on how far we scale out. For the horizontal pod autoscaler we have min and max replicas, and that places an upper bound. It's not necessarily directly informed by what upstream limits might exist, though.
D
We can sort of model those upstream limits and say this depends on the number of pods times the number of Puma workers per pod, and then set the limit accordingly. That's something we could semi-statically calculate and then set. But I think even if we were to do that, it just means the pods won't grow above a certain limit; once we reach that limit, those pods are still having a bad time, right?
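A sketch of that semi-static calculation (names and numbers are invented for illustration):

```ruby
# Sketch: derive an HPA max-replicas bound from an upstream connection limit.
pgbouncer_max_client_conn = 1200 # upstream limit on incoming client connections
puma_workers_per_pod      = 4
connections_per_worker    = 1    # client connections each Puma worker holds

max_replicas = pgbouncer_max_client_conn / (puma_workers_per_pod * connections_per_worker)
# => 300 pods at most before the upstream limit would be hit.
# As noted, this caps growth but does nothing for pods that are already
# saturated once the cap is reached.
```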
A
What I was going to say is, I guess that's the problem of having a limited arrangement with the database: there is a hard limit, because there's only one writable database, and if you scale beyond what that database can handle, everything's going to have a bad time. At some point that limit gets reached, and then the question is: well, what do you do at the point that the limit is reached?
C
It comes back to the saturation forecasting, which, as in the demonstration, is the way that we work, and it works until it suddenly doesn't, right? You get a change that happens so quickly that the system starts behaving in a different way, which invalidates a forecast that we've made previously.
D
Yeah, so I think one other sort of orthogonal axis here is the isolation side.
D
If we think about isolation patterns and patterns for handling overload, the two that come to my mind are bulkheading and circuit breakers. Bulkheading is effectively functional partitioning, or partitioning by some kind of failure domain or some kind of group that you want to isolate.
D
And so, if we think about the database: if we had, say, dedicated PgBouncer nodes or dedicated PgBouncer pools, we could allocate them in a fine-grained way and say this group of pods can use up to this many connections. If it tries to use more, it'll break, but because it's no longer a shared resource, it will not affect the rest of the consumers. So there's that kind of overload protection there. And then circuit breakers are really for when we do reach overload: how do we deal with that in a way that we can recover from it? It's more about detecting that the upstream system is not able to respond and having the clients back off, and that can also help stabilize things during situations where that overload does occur.
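For illustration, a bare-bones version of the circuit breaker pattern in Ruby (a generic sketch of the pattern, not code from any of the systems discussed):

```ruby
# Minimal generic circuit breaker sketch (not production code).
class CircuitBreaker
  def initialize(failure_threshold: 5, reset_after: 30)
    @failure_threshold = failure_threshold # consecutive failures before opening
    @reset_after = reset_after             # seconds to back off once open
    @failures = 0
    @opened_at = nil
  end

  def call
    if @opened_at && Time.now - @opened_at < @reset_after
      raise 'circuit open: backing off instead of hammering the upstream'
    end

    result = yield # attempt the protected call (e.g. a database query)
    @failures = 0  # success closes the circuit again
    @opened_at = nil
    result
  rescue StandardError => e
    @failures += 1
    @opened_at = Time.now if @failures >= @failure_threshold # trip the breaker
    raise e
  end
end

# Usage: breaker = CircuitBreaker.new
#        breaker.call { run_database_query }
```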
D
Yeah, that's more or less what I wanted to share. I don't have any real conclusions just yet, but it's something I've been thinking about, and I do think it's relevant to what we do in Scalability. So yeah, I wanted to share that.
A
Just heading back to the agenda: I don't see anything else on there yet. Is there anything else anyone would like to demo or chat about?
A
Alrighty, well, thank you so much for joining the call. Thanks for the conversation. I'll upload the video, and I hope you have a good rest of your day. Bye.