Description
Multi-cluster management is hard. Technology, teams, and culture clash in a race to deliver clusters and applications in a secure and compliant way. Red Hat Advanced Cluster Management for Kubernetes (RHACM) provides the capabilities to address common challenges that administrators and site reliability engineers face as they work across a range of public and private cloud environments. Clusters and applications are all visible and managed from a single console—with security policy built-in.
B
Good morning, good afternoon, good evening, wherever you're hailing from. Welcome to another episode of Red Hat Advanced Cluster Management Presents. I am joined by the team behind RHACM, one of my favorite products here at Red Hat, because it does such an amazing job at multi-cluster management. So I'm going to hand it over to Scott Behrens to tell us what we're talking about today.
C
I'm a product manager and I love to solve problems in the multi-cluster management space. That's why we're here: to talk about what we're doing with RHACM and management at the edge, a very exciting topic. I'm also going to introduce my colleague Brad. He's new to the team, but his focus is in this telco edge-scale space. Brad, go ahead and introduce yourself.
C
Just soaking in the sunshine. In terms of the actual technical horsepower and the real brains behind this, I'm going to turn it to Hal, and Hal can introduce himself, and we'll pass it around the team.
E
Hello, my name is Hal. I've been on the ACM team for a long time; I was there since the initial POC of the product. I've done a lot of random things on the team: I helped design the cluster lifecycle bits with the Hive integration, I'm also on the CI/CD team, and now I'm focused on getting ACM to scale and helping expand into the edge arena.
A
Hi everybody, my name is Chris Doan, and I've been with ACM for quite a long time, but I'm actually from the SRE squad. Somehow Hal was able to wrangle me onto this far-edge effort, and I try to contribute wherever I can. But yeah, glad to be here. I'll pass it on to Alex.
H
Yep, hi, I'm Alex Cross. I'm the one member of the team that's actually on a different team: I'm on the telco 5G performance and scale team based in Raleigh. I'm actually on my second tour of duty with Red Hat, which makes me a boomerang employee.
C
And we understand that the notion of a cluster being this gigantic thing, with hundreds of nodes and just a large footprint, a multi-tenant cluster, still exists, and we do see that, but less and less. We're starting to see smaller clusters. We have new topologies coming out, like compact clusters, with a shared three-master, three-worker kind of scenario, and then, as that gets smaller, you see something like single-node OpenShift.
C
We don't even really want to call that a cluster; maybe that's a whole different debate over a pitcher of beer. But we're in this space where we need to have a smaller footprint, and you need to be able to manage it. So there has to be enough tooling, enough componentry in place to manage that thing out on the edge, and that's what we call single-node OpenShift. That's been introduced as part of the 4.8 release that's coming out, and our team has been working with it day and night.
C
So let's just say, arbitrarily: you've got to finish this in 10 hours, and you have to be able to deploy a thousand of them. Ready, set, go. How would you solve that? What we're here to talk about today is some of the growing pains, some of the learnings, some of the stories that we've gone through, and why we have the gray hair we do, to get to the point that we're at, which is incredibly awesome: we can deploy a thousand clusters.
E
First of all, a little bit of background. Scott came to me with this around the end of last year, and I was like, you want what now? For a little piece of history: at that point we had only tested ACM, or were only able to test ACM, up to 50 clusters, and that was with us still begging and borrowing clusters to be managed by ACM.
E
With the resources that we have, we were only able to test ACM managing up to 50 clusters for a short period of time. So, understanding that a thousand is an order of magnitude higher than 50, I was like, okay, that sounds fun. Let's do it!
E
Let's do it, right? So there are a couple of early lessons that we learned that I think are just fascinating and generally applicable for any web app that we build. The first thing I tried after Scott approached me was: okay, I'm going to go stand up an OpenShift cluster and see how many namespaces I can create, because in ACM every single cluster has its own namespace, which serves as our RBAC border to contain the resources that the managed cluster can access.
E
So, let's see how OpenShift responds to a thousand or two thousand namespaces. We started with a simple script, looping through creating namespaces, and I found that, what the heck, after about two thousand namespaces the control plane crashes. At that time I started to panic a little bit: oh crap, am I setting myself up for something that's not doable? So we learned our first lesson, and let me show you a little bit of a graphic about this.
E
I took the weekend, did some reading, and ran across this document, a comparison of the different storage types on AWS. At that time we were mostly testing on AWS, because that's just the most highly available resource that we have. gp2 is the default storage that we use, and one thing I found out is that it's got a burst budget, meaning that when you first provision it...
E
It performs fantastically, right? IOPS comparable to io1. But as we exhaust that burst budget, the IOPS tank. By default, I think we carry 300 GB of storage, and that's only about 900 IOPS after we exhaust the burst budget. So that's what we saw. The first lesson here, when you build a cloud-native application that deploys on a cloud provider, is that there's a hidden wall. You wonder why your wonderfully built application doesn't scale? Storage.
E
Storage, I guess, should be the first thing that we take a look at. Once we replaced our storage with io1, with a reasonably high IOPS, like 3,000, I ended up being able to work with tens of thousands of namespaces without a problem.
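Hal's numbers line up with the published gp2 behavior: baseline throughput is 3 IOPS per GiB, and a fresh volume can burst to 3,000 IOPS only until an initial bucket of 5.4 million I/O credits drains. A small stdlib Go sketch of that arithmetic (the 300 GiB size is the figure from the episode; the constants are from AWS's gp2 documentation):

```go
package main

import "fmt"

// gp2Baseline returns the baseline IOPS for a gp2 volume: 3 IOPS per GiB.
func gp2Baseline(sizeGiB float64) float64 { return sizeGiB * 3 }

// burstMinutes returns how long a freshly provisioned gp2 volume can
// sustain its 3,000-IOPS burst at full load, draining the initial
// 5.4 million I/O credit bucket at (burst - baseline) credits per second.
func burstMinutes(sizeGiB float64) float64 {
	const bucket, burst = 5.4e6, 3000.0
	return bucket / (burst - gp2Baseline(sizeGiB)) / 60
}

func main() {
	fmt.Printf("300 GiB gp2 baseline: %.0f IOPS\n", gp2Baseline(300))
	fmt.Printf("burst window at full load: ~%.0f minutes\n", burstMinutes(300))
}
```

At 300 GiB the baseline is 900 IOPS, and a fully loaded volume exhausts its burst in roughly 43 minutes, which matches the "performs fantastically, then the IOPS tank" experience.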
E
And you did that.
E
Now, the checkbook actually didn't help that much, because AWS throttles your API. You can't actually create a thousand clusters with a snap of your fingers on AWS, because certain APIs are very limited, like the ones where you create DNS zones; those are really limited.
E
Yes, exactly. Hi, I'm a seasoned performance engineer, I'm here to help. Wonderful. So I would like to pass it to Alex to talk a little bit about how we ended up approaching this problem.
H
Sure. So when I joined to help here with ACM, it was probably around November or so. Having been seasoned with the scale lab and the hardware that we have available at Red Hat for scale testing and whatnot, I pretty much knew what our capabilities were. The first test that was asked of me was: let's just see how many OpenShift clusters we can get off of a certain chunk of hardware. The first iteration of that actually involved OpenStack.
H
So we had requested some hardware, and we got around 32 nodes or so. We deployed OpenStack on top of that, deployed a hub cluster, and I tried to deploy as many spoke clusters as I could, after sizing everything down as small as I possibly could. We actually only got about one hub cluster plus 55 spoke clusters, so we really got no further for testing than what Hal had done in the past, begging and borrowing everybody's clusters he could find out of AWS and whatnot, which...
C
Was also a fun exercise, as we were cobbling together clusters from every different line of business that we could find. Oh, you've got three, you've got two, here's five over here. Anyway, that's a story for another day. But Alex, you're on the right path; you were the shining light that figured out how to actually get these resources in place.
H
About that time, that's when I was playing with ACM as well, and seeing that when you manage clusters, you're creating a namespace for each one. I had obviously heard the goal there of a thousand clusters being managed by ACM, and my first thought was, well, shoot, that's a thousand namespaces right there, and previous performance testing of OpenShift has stressed the namespace dimension, of course.
H
OpenShift and Kubernetes are multi-dimensional in terms of where the scalability limits are. You could create a ton of namespaces, and if they don't have a ton of other resources in them, that might work fine. It's really multi-dimensional, so you really have to test with a real environment, closer to what a customer would actually have deployed.
H
We had actually asked for more nodes once we got to the 55 clusters, and throughout that we also had to manage through various infrastructure issues: this cluster is not working, or this build was not working. But anyway, we improvised, we added more hardware, and then we actually got to the point where we could decrease the size of the OpenShift clusters themselves. So, rather than 55 full clusters of three masters and two worker nodes each, we shrunk that down into SNO clusters. At that point in time, however, because we were still using OpenStack, we ran into other scaling issues, issues that had nothing to do with how SNO, or single-node OpenShift, was supposed to be represented as a far-edge cluster.
H
One of those was that OpenStack would still create a bootstrap node, so we had to plan the capacity of our cloud around that. But after we worked through all that, with about 64 nodes, carving off a few pieces of hardware that may have failed, we actually got up to about 320 clusters, and that's when we started hitting the scaling limits of what we were doing there, with having that infrastructure-as-a-service layer, OpenStack.
H
So that's when we made our last pivot, which has been our most recent test: we actually just removed the OpenStack layer and went completely bare metal. We made our hub cluster completely bare metal. We had to do something similar to what I mentioned about passing through NVMe: the NVMe is right there for the hub cluster, but you still have to allocate it, so what we do is use Ignition configuration to get etcd mounted on the NVMe.
H
We also use the NVMe on our worker nodes to serve as local storage; that's how we solved storage for our bare-metal cluster. And then for our spoke clusters, for management, we actually have just pure RHEL with libvirt hypervisors. Depending on which piece of hardware we have from the lab, we can fit up to 17, or seven, depending on the sizing that we saw previously with the capacity analysis that we did on the hardware.
C
A lot of gymnastics, a lot of head banging and head scratching to get to that point. That was, what, a couple of months of just learning what we have access to and how we can maneuver basically more clusters into a more dense test...
H
Bed, yes. And one of the other big things with that pivot, when we changed over and started to use SNO on libvirt, is that we actually had new technologies integrated into OpenShift at the time. So, instead of having ACM working with Hive to provision clusters, there's now ACM with the assisted installer with Hive, boom, and that's...
C
The next generation of technology that's coming out. It's actually already available at cloud.redhat.com as a SaaS, and it is a tech preview offering to start carving out bare metal in your data center, with discovery ISOs and all this magic. But that pivot point is key, because now we're bringing that technology into the on-prem space. So, Alex, talk us through what that looked like and how the team responded.
H
Yeah. So the biggest savior there: one of the other scaling limits that I neglected to mention a little bit earlier was that when we were on top of OpenStack, we had to do much more planning for the hub cluster, and not just etcd on NVMe. Hive would create an installation pod, and that pod required 800 megabytes of memory, almost a gig. So if we wanted a high concurrency of installations, we had to create enough worker nodes to host all of those pods.
H
In addition to that, it would actually download an image file that it would then serve, and that would consume ephemeral disk space. So we had to plan around memory and ephemeral disk space on those nodes. In reality, though, all the installation is happening on the remote machine, so why can't it just happen there?
H
Well, thankfully, we had the assisted installer, and that's what got us there. That really shaved down the resources for our hub cluster, and actually, once we moved to bare metal, where we had originally planned for extra nodes, we ended up getting extra hypervisors instead. That's what allowed us to end up scaling up to greater than a thousand clusters, with roughly a hundred or so nodes in the lab.
E
Yeah, the amazing guidance from Alex enabled us to just go ahead and test our system, to see where the pain points are. We found a lot of design choices and implementation choices that we made that can be improved, as well as this new assisted-installer technology that helps us address the spike of resource utilization during cluster provisioning time.
E
So the customer doesn't have to plan for excess resources that only get used during provisioning time and afterwards just sit there doing nothing, which is wasteful.
E
Yeah. Well, one of the interesting assumptions: sometimes people think scalability is kind of linear. Not actually true. Occasionally there's just one of these points where, at this number, stuff just completely disintegrates, and these are the things that we are really not able to see until we have the resources, until we have the actual clusters to play with. But Han is one of my favorite software developers (please follow him on GitHub, it's awesome), and there are a lot of things...
E
There are a lot of lessons that we learned during this journey, and that helped us tune our operators to break through these bottlenecks.
C
And this is a big moment, because this is the mentor sharing kudos with the protege. I remember seeing Han, who's been growing under your leadership, Hal, and seeing him take off in this space. So take it away.
G
There you go, yeah, cool. Thank you, Hal. I did prepare some slides this morning, and, as I mentioned, this release we achieved the thousand-cluster goal and we learned a lot. As I mentioned, my job is basically developer; I've been working on controllers for several releases, and this release...
G
They just kept crashing when we had a thousand clusters, and the reason is basically out of memory. It's pretty easy to just increase the memory limit, but that is not elegant, and that's not the solution we want. So we did some investigation, and there are some gotchas there that I want to share. And another thing is about performance.
G
I mean, the speed was too slow, and it's always possible that we can refactor our logic, but again, there are some very easy solutions we can choose, and I also want to share those. First, let's talk about the memory: we were getting OOM-killed, and after some investigation it turns out it's because of the cache. Some background about the cache: if you're using a Go client to contact Kubernetes, and here we are using controller-runtime, most of the Go clients...
G
They actually have a cache in the background. So when you're using a client to do a watch (the Kubernetes design: you watch some resource, and when it changes you do some reconciliation), there are actually background goroutines doing the caching, and they will save every change in the cache. Yeah, and...
G
If you're doing a list or a get with the client, they actually also use a cache: they will copy everything into the cache and save it all. That's something we didn't fully appreciate before. We actually knew, but we didn't realize how badly it can affect our performance, especially if you are just getting one result. Like, you get one secret in the cluster, you just want one Get call, and in the background it will cache every secret in the cluster.
G
So that's not something we want. We figured out the solution, and it's pretty easy: since the cache is the problem, and we actually don't need to cache everything, we just cache the results we care about. For secrets, in each namespace we only have one secret we care about; we don't need every secret in the cluster. So we just don't cache anything we don't need.
G
So, first, I want to recommend that if you can, you choose a namespace-scoped client. Second, if you can use labels to select the resources and reduce the cache, just use some labels. And the third thing: don't ever cache all the secrets of the whole cluster. That's a lot of memory, and I will show some examples; we had most of our controllers crash because they were caching the secrets.
G
The secrets get super large in these clusters. And another exciting piece of news just happened this week: controller-runtime, which is a very popular library for controllers, released 0.9, and in this release there's a builder with options. With this configuration, a user can easily configure the cache to add label selectors, or use any selectors they want, to cache only what they need. So this is...
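As a pseudocode-style sketch of the release Han mentions: controller-runtime v0.9 added `cache.Options.SelectorsByObject`, so a manager's cache can be limited to objects matching a selector instead of caching every Secret in the cluster. The label key below is a hypothetical example, not ACM's actual label, and the snippet is an illustrative fragment rather than a complete program:

```go
// Sketch only: restrict the manager's cache so Secrets are cached
// only when they carry a specific label, instead of every Secret.
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

func newManager() (ctrl.Manager, error) {
	// Hypothetical label: only Secrets carrying it get cached.
	sel := labels.SelectorFromSet(labels.Set{"example.io/watched": "true"})
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		NewCache: cache.BuilderWithOptions(cache.Options{
			SelectorsByObject: cache.SelectorsByObject{
				&corev1.Secret{}: {Label: sel},
			},
		}),
	})
}
```

The same `cache.Options` struct also accepts a `Namespace` field, which corresponds to Han's first recommendation of namespace-scoping the client.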
C
So here, this is awesome, because, Chris, I know we get on your show and it's a bunch of smoke and mirrors, but this is real, bona fide development stuff. Right now I'm just in awe of this team and the way they brought this together. Sorry, I didn't mean to throw you off your game, but yeah.
G
It's okay; we're also really excited about this. And here's the example of one of our controllers. It caches secrets, and because we didn't realize we should use labels or any other technique, we just cached everything. So this is before, when we were caching everything. You can see we have a thousand clusters, and on OpenShift each cluster will create a namespace, and in each namespace there will be three service accounts. Each service account...
G
They each have two secrets: one for the service account token and the other for the Docker config. All those secrets, that's like 6,000 secrets, add up; they can be several hundred megabytes, let alone the other secrets actually created for each component or controller. We don't actually need those secrets, so we just used the...
G
Filter with labels, and boom, we just reduced memory by 500 megabytes. Now it's just about nothing; before it was 500 megabytes, and now, nothing. So, considering we have a couple of controllers caching secrets, we actually reduced memory by several gigabytes. That's a lot for us, and we are super happy with this result. And another thing is about performance.
G
Basically, performance was too slow, because we have a thousand clusters, and we should know that there's no one-size-fits-all solution for performance tuning. Sometimes the only solution is just refactoring, but sometimes it can be very easy, because there is always some configuration available. The examples here are all for controller-runtime. First, there are the client QPS and burst settings. The QPS applies when you're using the Kubernetes client to do a get or a list, or mainly an apply, update, or patch.
G
Something like that. The QPS is the limit, and the default is just 20; the burst is a buffer on top of it, defaulting to 30, so you can momentarily do up to 30. If you are doing a lot of requests through controller-runtime, you will see a lot of throttling messages in the logs, and at that point maybe you can consider just scaling up the QPS and see if it helps solve the problem. The example is...
G
We apply a thousand manifests to one cluster, the hub cluster, and we want to apply them in one reconcile. Because of the QPS, it took like 30 or 40 seconds for one reconcile, which is super slow, and after we changed the QPS to 200, it's just several seconds; it's super fast now. Another thing is the workqueue rate limiter: every time you watch a resource, it will trigger a reconcile, and there's a rate limiter there whose default is 10.
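Han's timings are consistent with simple token-bucket arithmetic: issuing N requests through a limiter with rate q and burst b takes roughly (N - b) / q seconds. A stdlib Go sketch (the 1,000 manifests are the figure from the episode; QPS 20 and burst 30 are the client defaults he cites, and burst 300 alongside QPS 200 is an assumed proportional bump):

```go
package main

import "fmt"

// estimate returns the rough wall-clock seconds needed to issue n API
// requests through a token-bucket rate limiter with the given qps and
// burst: the first `burst` requests are free, the rest pay 1/qps each.
func estimate(n, qps, burst float64) float64 {
	if n <= burst {
		return 0 // absorbed entirely by the burst buffer
	}
	return (n - burst) / qps
}

func main() {
	const manifests = 1000
	fmt.Printf("QPS 20,  burst 30:  ~%.0fs per reconcile\n", estimate(manifests, 20, 30))
	fmt.Printf("QPS 200, burst 300: ~%.0fs per reconcile\n", estimate(manifests, 200, 300))
}
```

That gives roughly 50 seconds per reconcile at the defaults versus a few seconds at QPS 200, in line with the 30-to-40 seconds versus "several seconds" Han observed.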
G
If you think it may help you speed up your controller, then maybe you can tune that one. Another one is MaxConcurrentReconciles; the name of this configuration explains everything. Basically, you can add concurrency. The default is always 1, so you're running one thread, and if your task is very time-consuming and can be done in parallel, I think this configuration can be helpful. Our example is that we apply a manifest when we are importing clusters. By importing...
G
Actually, we are just applying some manifests on the remote clusters, and after we apply the agent on a remote cluster, the agent comes up and gets the cluster imported. So we apply the manifests on the remote clusters, on every one of the thousand clusters, and because we only have one thread, it happens linearly. And because a remote cluster can be super busy, it takes a long time, and it adds up. We have a really fresh example here, from last month.
G
We ran an experiment with a thousand clusters. The orange line means the cluster install is complete, the SNO, single-node OpenShift, install finished. And after every cluster is finished, we expect our controller to automatically import the cluster so that ACM can manage it. So the green line is managed, but the managed process is just applying manifests, so we were expecting it to be super fast.
G
It shouldn't take very long. But let's see: the cluster install only takes three hours, while the import actually takes four and a half hours. There's a one-and-a-half-hour difference, which we didn't expect, and after some investigation we found it's because we only have concurrency 1, and also because these are remote clusters that are just finishing install and have a lot going on. So when we apply a manifest, it takes a while, like 10 or 20 seconds, and that adds up, because we only have a single thread.
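The gap Han describes is what you would predict from serial execution: a thousand applies at 10 to 20 seconds each, on one thread, is hours of wall-clock time, while concurrent workers divide it by the worker count. A stdlib Go sketch (15 seconds per apply and 10 workers are illustrative assumptions; in controller-runtime the knob is the `MaxConcurrentReconciles` controller option):

```go
package main

import "fmt"

// wallClockHours estimates the total wall-clock hours to run `tasks`
// reconciles of `secsEach` seconds spread over `workers` concurrent
// workers, each handling a near-equal share of the tasks in sequence.
func wallClockHours(tasks, workers int, secsEach float64) float64 {
	perWorker := (tasks + workers - 1) / workers // ceiling division
	return float64(perWorker) * secsEach / 3600
}

func main() {
	fmt.Printf("1 worker:   ~%.1f hours\n", wallClockHours(1000, 1, 15))
	fmt.Printf("10 workers: ~%.1f hours\n", wallClockHours(1000, 10, 15))
}
```

With one worker the estimate is about 4.2 hours, close to the unexpected 1.5-hour overhang Han measured on top of the 3-hour install window; ten workers bring the same load to roughly 25 minutes.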
G
Yeah, that's super cool, and we're super happy. Let me draw some conclusions. Refactoring is always good if you have time, but we don't. So before you try to increase the memory limit, maybe think about the cache; and before you refactor, maybe think about QPS and concurrency. That's everything I want to share. Thank you.
E
Yeah, and clearly in the community that uses controller-runtime, the cache problems have definitely been observed, or else we wouldn't have seen that change to implement the filtered cache come up. It's serendipity; it just happened at exactly the same time we needed it. Han would probably have gone and contributed it himself, if...
E
But he was too slow; he didn't get that PR in in time. But it's just wonderful now. The last graph Han brought up shows how fast we were able to provision clusters. Holy crap, that was a thousand clusters within three hours. That would not have been achievable without a significant amount of resources if we were using infrastructure provisioning...
E
Sorry, installer-provisioned infrastructure, IPI, which is essentially what you do when you run the OpenShift installer, in conjunction with Hive, just because of the sheer amount of resources that we would need to pre-provision and pre-prepare in order to achieve the concurrency that we need.
E
Well, we mentioned the assisted installer already, and Crystal has a really well-written document that kind of describes what the magic is here that makes it different, and that reduces the resources we need to prepare, and that allowed us to achieve this thousand-plus cluster provisioning in three...
D
Yeah, no, that was beautiful. The DC compute dropped there; you could see it go from yellow to green on the graph, and that was just a beautiful thing. I know there's a lot of hard work behind the scenes, lessons learned, and it's just optimizing the bits before they're out there, so great, phenomenal work. Scott, would that stuff show up in cost management, you know, knowing it's a SaaS offering?
C
Yeah, I don't know if they've connected the dots. That's a great idea, though. If it's in AWS, I think they would probably already have that. You're talking about OpenShift 4.8, which is not GA yet.
E
Oh, sorry. Han showed how we were able to provision a thousand clusters within three hours, and I just wanted to spend some time to let Crystal kind of show us what's going on, what the magic is here that allowed us to achieve that.
C
So, from the SRE perspective, Chris, you've had your eye on metrics, data gathering, usage graphs, all that kind of stuff. Enlighten us: what are we missing out on here?
A
About data gathering: as we've been doing these tests, we've always been collecting the metrics for our provisioning time, like the graph that Han showed. I think one of the things that we'll have to roll back into the release is that we're generating these metrics today, but it would be even better if these metrics were captured and stored within our platform so that we can query them.
A
I think we query a bunch of metrics today already, but these metrics aren't that easily accessible, so that could be one set of metrics that we could roll into the product. And if we can roll them into the product, then, in my mind, if customers want to replay the work that we've done here in their own environment, they could re-qualify our results, and that could bolster their confidence in our platform, as we present or make public, for example, the automation that we constructed to get to this point. That could, or should, also be made public, so that customers...
A
It's really high; it's three percent failures or issues.
A
That may be attributed to the environment. We are using a scale-lab environment, but these are still virtual bare metal as well, so there could be some nuances in the environment that lead to some...
C
We intend to deliver that as a dev preview in version 2.3, coming in July. So that gets us to this point of: I'm deploying clusters, and I'm going to come back and say, so what? Okay, you did some good work, but so what? I want to manage at the edge, and I need tools to do that. I need policy, I need compliance, I need to be able to configure something centrally. So: policy.
A
That's kind of what the slide that Han was showing covers as well: the fact that we can provision these SNO, single-node OpenShift, clusters using the assisted installer. But then the next part is that we actually import those managed clusters into the hub, and once you have the managed clusters imported, that opens up the window for the rest of our RHACM capabilities: policy management and application lifecycle management. Focusing on policy, that's the day-two configuration that you were mentioning.
A
As long as the configurations are controlled by OpenShift operators, you can pretty much define any kind of policy to modify or constrain those behaviors, and by creating a policy you can distribute it across those ten thousand, or one thousand, managed clusters. You can, and I'm jumping the gun there, consistently keep your fleet consistent.
C
You mentioned operators, but this would be, you know, Kubernetes resources, really anything that you can describe within a piece of YAML. You can now start to define it as a desired-state model across your fleet, and in this case these could be dev clusters that operate differently from prod clusters, and those might operate differently on the west coast versus the east coast. Excellent.
F
So the magic in that is that policy comes from what RHACM deploys, known as the GRC framework, and it's had a lot of great work done to it, in that it's not only scalable across all these thousand managed clusters, but it's also able to deploy all these policies very fast and very efficiently.
F
So, in that sense, you get to manage your clusters and know whether they're compliant or non-compliant very, very fast. In our initial testing, we found that we started off with 100 policies deployed over these thousand SNO clusters, and it took about 90 minutes to propagate all those policies to all the managed clusters.
F
So you have about a hundred thousand objects from that, but after some tuning done by the rest of our team, and some efficiency scaling, we were able to get that down to about 10 minutes for propagating all those policies. This is a huge improvement; shout out to Ian, who's on our team as well.
F
He's the one who did the QPS tuning for that, the same tuning that Han mentioned earlier. So, with that performance in mind, it's an incredible improvement over something that was already very scalable in the first place, and that's kind of the magic of it: it was built up in the first place to be scalable, and then from there it was moved towards something more efficient.
C
The fine-tuning of the configuration to quickly ingest that policy definition and ensure compliance across the end-to-end fleet. And, like I think Han pointed out, that was a one-line change, and then we did the concurrency magic. So show me a picture; do you have something that kind of describes the journey that you went through with that policy?
F
Here we go. So this is kind of our initial findings document.
F
As you can see, when we first tested this out, we created a hundred policies on our hub, which then propagate to all of the managed SNO clusters, about a thousand or so, and that took about 1.5 hours. With this testing we wanted to see, (a), how long it would take the policy to propagate, and then, (b), once we switch it from inform to enforce, how long it would take to show up as compliant from all the managed clusters. That's why you see the bullet points.
C
So the difference there is subtle, but let's hit that for just a second. One of the things that our customers have told us is that they love the ability to check, and kind of use an audit type of framework, to see what is compliant and non-compliant, and we call that inform. So there's an inform mode, which is a YAML verb that says: just inform you of what's going on in terms of the compliance spec. But you're telling me you can actually enforce, so change that verb to enforce, and now I can make changes, right?
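The "verb" being described here is the `remediationAction` field on an RHACM `Policy` resource. A minimal sketch (the names and namespace are illustrative):

```yaml
apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  name: example-policy          # illustrative name
  namespace: policies
spec:
  remediationAction: inform     # audit only: report compliant/non-compliant
  # remediationAction: enforce  # flip this one verb and RHACM remediates
  disabled: false
  policy-templates:
    - objectDefinition:
        apiVersion: policy.open-cluster-management.io/v1
        kind: ConfigurationPolicy
        metadata:
          name: example-config
        spec:
          remediationAction: inform
          severity: low
```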
F
But of course this was just the initial investigation, and I have a graph right here that kind of shows the testing that we did for that. And as you can see, the amount of time: 1.5 hours to propagate, and then 1.5 hours to fully switch from inform to enforce. But after the efficiency QPS tuning, it dropped down to 10 minutes for each of those things, so 10 minutes to propagate initially, and then 10 minutes when switching from inform to enforce. And I do not have a picture of that right now, but take my word for it.
E
E
C
How many different clusters do I have to log into, and where do I need to set the context? I'm like: no, no, no, that's the problem we're solving, is that you don't have to jump into context on all these different clusters. We provide one interface for you to do all of that, to set those controls from one spot. And I forget, I think it was Chris Doan who was mentioning the GitOps part of this, where these policies are actually stored.
C
In a repository, you know, and being able to have a code source and a source of truth for what that policy should look like, and then designating that policy as what you want to distribute to the fleet and what they should all be compliant towards. I mean, that part of this story is the super powerful part: I don't have just one model or one way to introduce a policy. I have multiple ways; I can kubectl apply it.
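A minimal GitOps-flavored sketch of that workflow (the repo URL and directory layout are hypothetical): the policy YAML lives in Git as the source of truth, and applying it on the hub is a plain CLI operation, after which RHACM handles propagation to the placed clusters:

```shell
# Hypothetical repo layout: policies/ holds the Policy, PlacementRule,
# and PlacementBinding YAML that define what the fleet must comply with.
git clone https://example.com/org/fleet-policies.git
cd fleet-policies

# Applying on the hub is enough; RHACM propagates to the placed clusters.
oc apply -f policies/
```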
C
E
I really wanted Crystal to spend some time showing us the magic of assisted installer, and what's the difference between that and IPI, but I don't know if we have enough time for that. I...
C
Yeah, well, you know, we were put on this planet to help create clusters, right? We want OpenShift to be everywhere, and so IPI, installer-provisioned infrastructure, was the first model, the first tool that we really started with. Chris, you know more about that than anybody, so take it away. Tell us, tell us that.
E
Story. Crystal, not Chris Doan, sorry, the two names get a little close to each other. Oh, go ahead!
A
Yeah, it was Crystal, right, Hao.
C
F
Yeah, so assisted installer is the service that, like Alex mentioned, came in at the right time, at the right moment, that kind of helped funnel all these things that we're doing. And as I mentioned before, and Hao has mentioned before, they were using IPI with Hive in order to actually create all these clusters that they wanted to scale, you know, being at the far edge.
F
They found SNO clusters, but of course that came with a lot of disadvantages, like they mentioned: the ephemeral storage that was needed, or just the extra memory that was needed. So assisted installer kind of came in and was able to take on all of these installation procedures that are required to run on SNO clusters and move that away from the hub. So that way, you don't need to have these extra storage spaces, and you don't need to plan for any extra concurrency failures that would happen with IPI.
F
You just have the assisted installer kind of take it over to the cluster you want to provision and run everything on its own, therefore kind of increasing the success rate of these clusters, because with IPI there were failures due to unexpected, you know, memory issues. But with the assisted installer we got so many more clusters, and it was able to help us kind of provision all these thousand SNO clusters. So another thing with assisted installer, and I think, Hao, this is what you wanted me to show, was, I'll...
F
Give you a sneak preview, exclusively for this. This will come out in a doc, probably a little different, in our official RHACM docs. But this is the sneak preview of how, or what, assisted installer comes with, which is fantastic. Assisted installer enables something called zero touch provisioning, ZTP for short. So with zero touch provisioning, we just have these five simple steps. That is, once you put in everything you need to configure for assisted installer, and the configuration is fairly simple, but once that's going, it has this great feature, zero touch provisioning.
F
That is where the assisted installer takes over and provisions your cluster for you. So for your managed cluster, you don't need to actually go into the managed cluster at all, or onto the actual machine, to do anything. It just handles everything for you. And these are the five steps, which hopefully are very simple. So first, it generates the discovery ISO, which is an image used to boot the managed cluster, which you can see on the right side.
F
F
It boots this ISO for you, and then afterwards, once it's successfully booted, it will report hardware information back to your hub cluster. And when your hub cluster is aware of all the hardware information, it will then proceed to install OpenShift Container Platform on the bare metal machine, thus giving you the SNO cluster, with the single node running on that bare metal machine and then OpenShift on top of it. And then, after that, you have OCP. When it finishes installing, the hub will then, or the hub as Red Hat...
F
Advanced Cluster Management, will then take on that new single node OpenShift cluster as a managed cluster, and from there, that's where you get all the good stuff from RHACM, which is all the deployments of the add-ons and all the management that we previously talked about, with policy application, etc. So that's kind of the basic flow of ZTP, and of course all you need to do is log into your hub cluster, just do the provisioning, and let assisted installer take it away for you.
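The hub-side trigger for this flow is a handful of custom resources consumed by the assisted installer. A minimal, illustrative `InfraEnv` sketch, the resource that yields the discovery ISO mentioned above (names and namespace are hypothetical, and the full flow also involves `ClusterDeployment`, `AgentClusterInstall`, and `BareMetalHost` resources):

```yaml
apiVersion: agent-install.openshift.io/v1beta1
kind: InfraEnv
metadata:
  name: sno-site-1              # illustrative name
  namespace: sno-site-1
spec:
  clusterRef:
    name: sno-site-1            # points at the ClusterDeployment for this SNO
    namespace: sno-site-1
  sshAuthorizedKey: "ssh-ed25519 AAAA... admin@hub"   # placeholder key
  pullSecretRef:
    name: pull-secret           # image pull secret on the hub
```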
E
This really does abstract away a lot of complexity from how to provision a cluster on a bare metal machine, right? Before, you had to set up a provisioning network, to hold a provisioning server, to host a bootstrap; the setup was complicated. This one bootstraps in place: you don't need anything external, right? This is it. It reaches out, boots the machine, it forms a cluster, done. ACM manages it. You deliver whatever configuration that you want to.
C
Deliver your compliance model; from that point forward, it's under management. And you make it sound so easy, Crystal. Your team has worked beautifully too. I know this has been development under, you know, the pressure of creation, and you've created a diamond here out of the rough. But seeing your team work together with assisted installer, with metal, ZTP...
C
All the componentry that's come together into this package that ACM is delivering, it's, it's awesome. It's just really awesome; the way this team has performed has been brilliant. So anyway, I should stop sharing the kudos here in the last stretch. But, Chris, do we have any questions that have come forward?
E
Well, a presentation for that much information, it is kind of hard, and we are, it's still kind of under investigation a little bit, UX improvements, but we function pretty darn well with a thousand clusters at this moment. Now, we did just scope it to a couple of components and a couple of features for now; like, we focus heavily on policy, just because Scott says so.
E
Really true, that's really true. Right, right: monitoring and alerting, so that no one has to actually stare at the dashboard for a thousand clusters to figure out what's going on, like the centralized monitoring.
C
D
It's a good cast of 20, 28 or so by my count, right, on the far edge squad. And anyone we want to name drop and thank, I know even today we saw Emily demoing some new, some new stuff, and Randy, George, and others. Crystal, anyone who you want to name drop on the squad, or Hao, here, it's this special moment here as we're debuting some of this great stuff.
H
D
The additional three, but it's, I'm just proud to be part of the far edge effort, with the scale and performance, and then connecting with all the other components, as Scott was mentioning: observability and GRC. Thank you.
C
That's a great story that we're pushing for. So we'll have the bits in there as a dev preview; we'll be moving towards tech preview in the fall, and I think by that time, who knows, maybe that number is bigger. Maybe we'll be back here.
E
The general improvement that we have done to ACM is generally applicable, so in 2.3 you should expect ACM to use less memory and less resources in general, leaner and meaner.
B
Yes, fantastic work, team, seriously. Thank you so much. I can't wait to hear more about this journey and just pushing the edge further, if that makes sense.