From YouTube: SIG - Performance and scale 2022-06-30
Description
Meeting Notes:
https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.tybh
A: Okay, welcome to SIG Scale. It's June 30th, 2022. I'll link to the notes in the chat.
A: Okay, please add yourself as an attendee. Okay, let's go over the performance periodic tests. All right, so Marcelo, you had the fix for the unexpected number of VMIs. This is strange; I don't know why this only showed up now. I found that surprising. Maybe it's because the test changed and now it's catching it, or something.
B: Yeah, as I mentioned, I did a quick fix, and I'm trying to find the notes to put my name in. So, I did a quick fix, because we are creating, you know, in the end, 101, so we should be fine.
B: Yeah, it should be density, yeah, and here.
A: Doesn't delete... yep, yep, okay, makes sense, makes sense. And then where's the wait 100 happening? Which... it's in your PR, where you changed this?
A: Okay, so VM... okay, so test performance density, okay. I was right there; it was right there. There's your change, right there, okay, makes sense. Okay, got it, got it; yeah, that makes sense. Okay.
B: Cool, yeah. Okay, maybe I should put some comments here, but I did it very, very fast.
A: No, no, that's fine! I think this is actually really good. So I think before, we were maybe not waiting for the multi... I think this might just confirm that we might have had a bug previously, and now, with the waiting, it's fixed. Because if we were consistently seeing this, then we are consistently checking that we have the right number of VMs. So I think that's right; I think that makes sense.
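The fix being discussed, waiting until the cluster actually reports the expected VMI count before the test asserts, can be sketched as a generic polling helper (a minimal sketch; the helper name, the injected clock, and the count of 101 are illustrative, not KubeVirt's actual test code):

```python
import time

def wait_for_count(get_count, expected, timeout=300.0, interval=1.0,
                   now=time.monotonic, sleep=time.sleep):
    """Poll get_count() until it returns `expected` or `timeout` elapses.

    Returns True once the expected count is observed, False on timeout.
    `now` and `sleep` are injectable so tests can use a fake clock.
    """
    deadline = now() + timeout
    while True:
        if get_count() == expected:
            return True
        if now() >= deadline:
            return False
        sleep(interval)
```

A density test that creates 101 VMIs would call `wait_for_count(count_running_vmis, 101)` before checking results, instead of asserting the count immediately, which is the race the quick fix closes.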
A: Well, we had some successes since... what was this? I don't know when this merged, but we had a bunch in here, so that's good. I bet this is the... oh no, this is something else. This is...
A: This failed immediately. This probably has something to do with an endpoint being down, I'm guessing.
A: I want to see if there's some... oh, here we go, here's our answer. So this is the one from the bug, from...
C: Yeah, sorry. If you want to assign it to me, go ahead, feel free to do so. I'll try to dig into that.
B: And the other thing that I was thinking is, you know, when I create the range, like, 400... I don't know if 200 and 400 really make sense. I think 600 is fine, because I see more issues with 600, for example when I look at the graph on the dashboard. Anyway, what I want to mention is: maybe it's better to have separate jobs for each number.
B: Instead of having one job that creates 200, 400 and 600, we have different jobs. You know, it would fix the delete issue, but of course we don't want to hide the issue; I'm just thinking that, because if we put in, you know, a bigger interval... I don't know. Can you open the Grafana dashboard? Let me send it to you here.
B: Okay, so did you see these bumps here? We can see the VM counter; maybe it's the easier one, the lower... okay, yeah.
B: So first of all, you can see that the previous tests, the 600, were failing; now the 600 is working for the newest ones. But you see there was the test at 100, and the other test that creates 200, 400 and 600, and they are very tight together, you see. So it's really hard to analyze like that.
B: Yeah, exactly, so it's easier to visualize. You know, I'm just saying that if it fails, it's easier to know exactly which test, instead of digging through the logs to see which one failed. And also, maybe we don't need 200: we have 100, maybe 400, 600. You know, 600 is the maximum; we cannot have more than that, we have only three nodes, right? And maybe 400... I don't know, we can check.
B: I don't know if it's really valuable to see 400, because we have 100, which is important, and then 600, the top of our range. And if we want to introduce more tests, it can be something else, for example the steady state. You know, we can save time here: we need to timeshare, we cannot run parallel tests, otherwise they will impact each other. So we can cut down to only 100 and 600 and then include another test.
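Splitting the combined 200/400/600 job into one job per count, as proposed above, could look like this (a sketch; the job-name pattern and fields are hypothetical, not the project's real CI configuration):

```python
def make_density_jobs(counts, base_name="performance-density"):
    """Expand a list of VM counts into one independent job definition each,
    so a failure or a dashboard spike maps directly to a single count."""
    return [
        {
            "name": f"{base_name}-{n}-vms",
            "env": {"VM_COUNT": str(n)},
        }
        for n in counts
    ]

# Dropping 200 and 400, as suggested, leaves two focused jobs:
jobs = make_density_jobs([100, 600])
```

Each job then shows up as its own series on the dashboard, instead of three bursts packed tightly together in one run.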
A: Okay, yeah, maybe we can tweak this a little bit. I think, yeah, just a 600 in isolation would be interesting, yeah, and then...
A: And we still want to do... this is burst; we still want steady state, which we could do. I mean, this 600 max is very valuable, we've got a lot of information; it'd be cool to see 600 steady state, and then we can configure the rate, and there's an infinite number of tests.
A
We
can
configure
there,
which
is
the
rate
of
deletions
and
recreation
as
long
as
we
can
do
yeah,
okay
yeah,
but
I
like
the
idea
that,
like
they're
likely
conceptualized
that
these
are
two
separate
things
and
they
deal
definitely
two
different
results.
So
yeah.
Maybe
if
we're
getting
a
lot
of
results
from
600,
then
maybe
we
just
do
a
600
in
isolation
and
and
we
do
we
maybe
do
some
measurements
off-
that
we
do
the
measurements
off
100
and
then
we
can.
A: Yeah, makes sense, okay, yeah. But at some point... well, maybe we'll come back to it. I mean, my guess is that steady state will kind of take over this, because steady state is kind of like this, right? We're just creating and deleting, creating and deleting, but instead of starting at what we started with, you know, the higher value is probably what we'd do. So I mean, this is kind of like a steady-state test, I guess.
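The create/delete/recreate loop described here can be sketched as a steady-state churn driver with a configurable replacement rate (a sketch; `create` and `delete` stand in for whatever client calls actually make and remove VMIs):

```python
import itertools
import time

def churn(create, delete, population, rate_per_s, cycles, sleep=time.sleep):
    """Hold the population size constant while replacing one member per tick,
    at `rate_per_s` replacements per second, for `cycles` ticks."""
    period = 1.0 / rate_per_s
    ring = itertools.cycle(range(len(population)))
    for _ in range(cycles):
        i = next(ring)
        delete(population[i])      # remove the current occupant of the slot...
        population[i] = create()   # ...and immediately refill it
        sleep(period)
    return cycles
```

The two axes mentioned later in the discussion map directly onto `len(population)` (where you start: 100 up to 600) and `rate_per_s` (how fast you delete and recreate).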
B: Yeah, and I think it's good to replace it somehow, because, you know... we can even, you know, increase the 200, the 100.
B: If we see that 100 is not showing enough data for us, and have the steady state. And I wouldn't break it into too many tests, because otherwise, as you saw in the Grafana, it might start to be very hard, you know, to compare things, yeah.
A: You know, for us at NVIDIA, our VM count is limited by the number of GPUs, so we don't have incredible density; whereas this, on three nodes, is incredible density. So I mean, this is a good range; I think it's pretty valuable, we'll get data from both. And then, yeah, then the steady state... and then I think with steady state we'll have to play around a lot with, you know, the right...
A
The
right
I
mean
we
kind
of
want
to
find
a
balance
like
this.
Something
like
this
where
I,
but
I
think
we
sort
of
have
two
different
axes
here
like
where
we
start
from,
is
one
like
100
600
and
then
the
rates
that
we
do
the
delete
and
the
recreates
is
this
or
the
other
axi
axis,
so
it
we'll
have
to
we'll
plan
to
play
with
a
little
bit.
A
So
I
think
that's
probably
how
we'll
go
with
this.
Okay
still
still
works
to
do
that
on
the
steady
state
test.
That
is,
it's
not
fully
complete,
but
we'll
have
to
yeah
something
we
can
we'll
do
in
the
future.
Okay
makes
sense,
and
this
and
this
grafana
is
this.
Something
like
is
this.
If
I
is
this
like
this
is
publicly
accessible
right,
like
I
think
it
lasts,
yeah.
Okay,
let
me
make.
B: ...recreation or something: the VM latency, the API request latency, the number of requests. Because we have here, you know, requests per second, for example, for the different components, and you can see the request duration. Oh, you can see one metric that is interesting if you go down... yeah, oh, but up a little bit now, yeah. Just the rate limiter duration; can you click on the virt-controller?
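The requests-per-second panels being read here are typically computed from monotonically increasing counters; a minimal version of that calculation, mirroring what Grafana's `rate()` does over a window (the sample format is illustrative):

```python
def per_second_rate(samples):
    """Average per-second increase of a counter, computed from
    (timestamp_seconds, counter_value) samples, first to last."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need a positive time window")
    # Counters only go up (barring restarts), so clamp at zero.
    return max(v1 - v0, 0) / (t1 - t0)
```

The same idea applies to the request-duration and rate-limiter-duration panels, just with histogram sums and counts instead of a single counter.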
A: Yeah, no, that's really good. Let's see... so did you... I think there were some bugs that you had associated with... I think, like, the work... this one, right? Mm-hmm.
A: Completed... I thought maybe... did I mark it closed? Oh.
A: That definitely was... okay, that's cool, and then, yeah. Oh, and then there was one thing we talked about last time, I just remembered: you mentioned... did Andrew... I don't know if we have Andrew here, but did you ever speak with him about the virt-controller workqueue?
B: We just need to be aware of that, but it seems to me something weird, because it's scaling with the number of created VMs, and, you know, requeueing a key that many times... I don't know if it's too many, but compared to the other controllers it seems to be very high, isn't it? So it's definitely something we need to keep an eye on.
A: Okay, I think we'll... yeah, if you've got some time, maybe... so.
A
Lay
we'll
have
it
so
you
can
do.
How
do
we
take
we'll
do
this?
Let's,
let's
fix
the
performance
job
first
just
so
we
get
that
one
out
of
the
way.
So
have
you
look
into
that
one
first
and
then
we'll
next
meeting,
let's
I'll
book
some
time,
because
I
need
to
do
some
research
on
this
one
myself.
Maybe
we
can
all
do
some
research
and
we
can.
We
can
have
this
as
a
discussion
topic
for
next
meeting
and
see
what
we
find
and
then
we
can
update
the
card.
A
Okay,
all
right
yeah-
I
just
I
wanted
to
have
it
here
just
a
week,
so
I
remember
to
come
back
to
this.
Okay,
so
yeah
this
one's
definitely
interesting
we'll
have
to
so
anyway,
like,
like,
I
said
we'll
follow
up
on
this
one.
We'll
do
some
investigation
we'll
come
back
to
it.
Okay,
let
me
see
I
just
here
so
from
last
time:
marcelo
did
you
get
a
chance
to
do
any
tracing,
or
did
you
get
any
of
the
tracing
results?
You
have
it
available.
B: In a way I'm in doubt, you know, about the results, because it's not showing any of the other tracing points. And then I was thinking: maybe I ran the wrong KubeVirt build, or it's fine, and the other trace points were simply not higher than one second, and that's why they were not appearing in the log. So I'm sure that I deployed it in the right way, but I'm in doubt because I didn't see any other trace points.
B: I think I don't have the cluster anymore, and it would be hard to do that again, but a way to do it would be to remove the one-second threshold, or lower it to 100 milliseconds, for example; then everything would appear, and we could see exactly the time at each point where we put the trace, but...
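The one-second cutoff being described, where trace points below the threshold never reach the log, can be sketched as a context manager; lowering `threshold_s` (say, to 0.1) makes the quieter points appear, as suggested (the names and the injected clock are illustrative, not the actual tracing code):

```python
import time

class TracePoint:
    """Time a step and log it only when it exceeds `threshold_s`,
    mirroring the one-second cutoff described above."""

    def __init__(self, name, threshold_s=1.0, emit=print,
                 clock=time.perf_counter):
        self.name, self.threshold_s = name, threshold_s
        self.emit, self.clock = emit, clock

    def __enter__(self):
        self.start = self.clock()
        return self

    def __exit__(self, *exc):
        elapsed = self.clock() - self.start
        if elapsed >= self.threshold_s:
            self.emit(f"trace {self.name}: {elapsed:.3f}s")
        return False  # never swallow exceptions
```

With the default threshold a 0.5-second step logs nothing, which is exactly the "missing trace points" symptom; at 0.1 the same step shows up.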
E: I was away from these discussions here, unfortunately. Since NVIDIA already released the source code of their drivers, my team and I are trying to start developing GPU live migration. I sent a link in the chat window so you can understand what the unknowns from NVIDIA are.
E
What
is
the
correct
forum
to
talk
about
what
have
been
done
already
regarding
live
migration
before
we
understanding
what
have
been
done
so
far
and
discuss
who
have
done
that
work?
How
to
implement
the
gpu
live
migration?
Also,
since
we
have
already
the
the
open
source
drivers
of
nvidia.
A: Probably the Wednesday meeting, for the... the KubeVirt...
B: Yeah, there is also the Slack, because, you know, this is regarding a feature, isn't it? I didn't play with live migration for GPUs, so I'm not aware of it.
E: Yes, I asked there already; nobody answered. Well, let's wait then, give it some time.
B: Things get lost there; maybe just rephrase and send it again, you know, and simplify. Just say: I have an issue with GPU live migration, who's in charge of that, can you help? Because maybe sometimes, when the message is too big... I don't know if you did that, but people get lazy to read. You know, just go straight to it. Okay.
E: Okay, thank you so much, guys. Okay... I was away also: is there a date yet for release version 1.0?
A: I don't think there's been a publicly stated target at this point. I mean, all I can say is that we've had a lot of discussions about it, and I think it's moving along; I would say it's a lot closer than it was, you know, a few months ago, and it's definitely something that's in focus for a lot of people. But yeah, there isn't really a hard date.
A
I
mean
there's
like
I
think,
there's
a
document
floating
around
somewhere
with
like
the
with
the
remaining
items.
I
think
the
last
from
what
I
recall.
The
last
remaining
item
was
a
policy
for
how
we
handle
decker
decrementing
apis
and
when
you
started
like
alpha
apis
or
things
like
that,
that's
that's
already
being
worked
on,
there's
already
a
document
for
it
somewhere.
It's,
I
think
it's
in
the
community
repo
and
I
think
after
that
merges.
I
think
that
was
the
last
item
to
then
say
that
we
have
everything
for
v1.
A
So
it's
coming
up,
but
there
isn't.
There
isn't
a
date
that
I
know
of
that's.
You
know
say
when
it's
going
to
be,
but
I
I
expect
it'll
do
some
more
info
soon,
because
I
mean
I
can
tell
you
that
at
least
you
know.
From
my
perspective,
I'm
definitely
very
interested
as
well
like.
A
This
is
something
you
know
we
also
want
to.
We
also
want
to
get
to
so
it
should
be
soon.
I
mean
it's
just
I
can't
I
don't
know
can't
say
when
yeah
there's
another
release
date.
The
feeling
is,
that's
that
that
is
this
year,
correct,
yeah,
my
feeling
is:
is
this
year
my
feelings
is,
I
I
feel
I
think
just
based
on
conversations
and
what's
remaining,
I
I
feel
confident
and
yeah.
The
thing
this
year
will
give
you
one.
A
Okay,
all
right,
let
me
go
back
to
let's
see,
look
back
at
what
we
had
last
meeting.
So
so
that's
good
marcelo.
It's
awesome
to
see
that
this.
How
much
that
that
impacted
the
work
you
that's
really
good.
Did
you
did
you
ever
rerun
the?
Have
you
rerun
the
test
recently
or
you
actually
know
you
did
right.
You
ran
it
yeah
here
you
go,
you
ran
it
after
and
you
already
saw
you
recorded
the
performance
improvement.
So
that's
really
good
at
some
point.
A
I
still
want
to
do
that
that
test,
where
we
can
like
because,
like
you
have
here
like
a
really
good
measurement
of
like
how
much
of
an
improvement
that
you've
made,
but
we
have
like
we
still
don't
have
anything
to
measure
against
so
that
we
can
publish
you
know
our
data
still
something
that
bothers
me.
I
wish
we
had.
You
know
like
this
patched.
It
had
this
much
of
an
effect
on
performance
or
something
you
know
what
I
mean.
B: Exactly, this one. So this is just showing what the improvement is: it's like a 62-times improvement, you know, in the latency to create 1000 VMs, something like that, yeah.
A: But I guess... I mean, sorry, what I mean is: this is really good, but with your scenario you were able to induce this problem. I guess maybe what I'm saying is that we need specific scenarios that we can measure against, to say: we tested a thousand VMs on a 12-node cluster, and this PR made this improvement.
A
But
we
don't
have
that
standard
to
like
measure
across
releases
to
say:
okay,
here's
what
it
was
two
or
three
five:
zero.
Four,
two:
zero:
five:
zero,
zero,
five!
Three
and
you
can
see
the
improvement
like
we
don't
we
don't
quite
have
that
measurements
just
because.
Well
I
mean
it's
it's
difficult
to.
Maybe
it's
something
we'll
have
to
do.
We
can
work
on
the
performance
clusters
like
we
can
yeah.
A
You
know
as
we
can
use
to
like
get
to
get
the
standardization
or
something
because
I
think
like
I
mean
it
really
just
because
this
is
because
I
mean
this
is
great
work
and
it
just
needs
to
be
highlighted
like
and
after
even
at
the
release
level
like.
There
is
a
massive
improvement
here.
E: I would like to ask you something here: we are also doing some tests, but at much higher volume: 10,000 VMs across a 1250-node cluster.
A
Yeah
that
that's
what
I'm
saying
is
that,
like
your,
your
your
environment
is
different,
and
so
your
performance
is
going
to
vary,
but
what
we're
saying
is
like
so
marcelo's
got
this
pr
here
that
that
greatly
improves
performance
on
a
12-0
cluster.
When
you
create
high
amounts
of
density,
which
is
you
know,
a
thousand
vms,
it
eventually
basically
shows
here
like
it.
The
the
vmi
creation
latency
just
is
very
high
with
with
that,
before
this
change
and
then
with
it
it
gets
it
gets
much
faster.
A
So
I
don't
know
if
you're
seeing
this
on
your
clusters,
it's
it's
hard
to
say,
but
I
mean
at
least
for
you
know
a
cluster
that
that
gets
this
much
density.
A
We
should
see
an
improvement,
so
I
mean
you,
yours
might
be.
Yours
might
be
a
little
bit
less.
So
it's
hard
to
say,
but
the
test
is
different
is
really
what
the
point
is.
B: Okay, and the VM automatically starts the VMI?
E: Because these VMs change size over time, and with that we also change the number of nodes in the cluster. The same cluster, for 10,000 users, can reach anywhere between 157 nodes and 1250 nodes, because all the VMs are different: one has two virtual CPUs and four gigabytes of RAM, another one has four virtual CPUs and eight gigabytes of RAM, and things like that.
E: Also, there are some clusters in Brazil, clusters in the US, clusters in Europe, clusters in Asia, and we are doing everything to be, let's say, scalable all over the world, which is the only way our solution works. But your numbers seem to be worse than what we are getting; that's why I'm asking you about the PVCs behind the scenes as well.
E: We grab 300 gigabytes of RAM to have a RAM disk, and we expose those 300 gigabytes of RAM on every node as a cluster file system to the VMs; that's why we are sometimes reaching speeds a million times faster than NVMe storage, and that's why we are getting better numbers. Behind the scenes, how we make everything happen: so you understand, we were working before with Rook and Ceph, but with Rook and Ceph the deduplication part is at the alpha stage, so we rolled back to GlusterFS. That's, you know, how we are exposing these RAM disks to the VMs and why we are getting better numbers than yours.
B: And what's the number of, you know, parallel VM creations per cluster?
E
We
have
in
each
cluster
ten
thousand
users,
and
we
are,
we
have
hundred
thousand
co-current
users
logging
in
at
the
seven
a.m.
Before
you
understand
these
are
like
spread
all
over
the
clusters.
We
have.
Every
cluster
handle
ten
thousand
users,
only
okay,
and
we
have
more
than
hundred
thousand
concurrent
users.
A: Well, there's also... you said this, so I think what I heard was 10,000 users per cluster, and then, I think, a thousand nodes; is that what it is, per cluster?
A: Your speech is cutting out; can you say that again? Oh: how many Kubernetes API servers do you have per cluster? Three.
B: Yeah, yeah... that's very interesting, you know: how the Kubernetes control plane actually impacts the KubeVirt control plane. This is a test that I didn't have the time to do, and it would be interesting, you know, to increase the data.
E: But, you know, we are building infrastructure for one million concurrent users to be in production in the second semester; so you understand, this is going to be, like, hundreds of clusters.
B
And
are:
do
you
have
any
plan
to
you
know
maybe
show
some
metrics
from
the
cluster.
I
don't
know
if
you,
if
that's.
A: Do you see any PVC... so you said you have high-performance PVCs, but do you see any latency when actually creating them, when Kubernetes actually goes in and creates them and deletes them? Do you see any issues at all with that? Because I don't think that has anything to do with this on the creation side.
A
Yeah,
this
is
something
we've
at
least
observed:
there's
a
there's,
a
pvc
protection
controller
in
kubernetes
that
builds
up
quite
a
work,
cue
on
deletion,
and
it's
we've
at
least
from
some
some
of
our
testing
that
we
have.
This
is
one
thing
we
run
into
a
lot
as
the
work
queue
grows
very
large
and
it
can
cause
latency
issues
during
creation,
vmi
creation,
just
because
we're
deleting
the
the
pvcs
and
this
controller
is
using
a
lot
of
the
api
services
resources.
A: Yeah, makes sense; it's a similar use case. So what we do too... because the point is, right, you delete the PVCs after each VMI is finished, right? So you have a lot of delete requests, which is what leads you to this, the cleanup being very slow. Yeah, exactly, okay. Very similar problems to what we've been dealing with internally. Okay, interesting.
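The dynamic being described, a burst of PVC delete requests outrunning the protection controller so its workqueue balloons, can be modeled with a simple per-tick simulation (a toy model, not the controller's actual rate-limiting logic):

```python
def simulate_workqueue(arrivals, capacity_per_tick):
    """Track queue depth per tick: arrivals[i] keys are enqueued at tick i,
    and the controller reconciles at most capacity_per_tick keys per tick.
    Returns the queue depth after each tick."""
    depth, history = 0, []
    for enqueued in arrivals:
        depth = max(depth + enqueued - capacity_per_tick, 0)
        history.append(depth)
    return history
```

A burst of 1000 deletions against a controller clearing 50 keys per tick leaves a backlog that takes roughly 20 ticks to drain; while it drains, the controller keeps consuming API-server budget, which is where the VMI-creation latency comes from.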
E: Can I ask what kind of CPU you are using: Intel, AMD, ARM? AMD, yeah? I would like to know. We are using Intel.
A: Okay, well, anyway, this exercise is kind of... I mean, it's interesting. Like I was saying, with this scenario you definitely hit a pressure point here, Marcelo, with the density. Maybe it's just because you went over, like, 80 or 90, or something, and that's just based on the configuration of your cluster, of three API servers or something like that; that's when we run into this kind of latency.
A
It
might
be
something
like
that,
but
it's
still
it's
just
it's
beside
the
point,
because
it's
it's
a
valid
use
case.
So
it's
something
we
need
to
address
and
you're
totally
right
that
the
the
qps
should
be
higher,
like
just
based
on
what
your
analysis
was.
So
it
makes
sense,
but
it's
interesting
how
it
affects
people
differently.
B: Yeah, I was doing that, but, okay, you know... I'm changing projects right now; I talked to Ryan, so I'm, you know, smoothly going to leave KubeVirt... you know, a couple of projects, unfortunately.
B
But
we
were
we're
improving
the
cold
and
trying
to
you
know
brian
also
put
some
traces
in
the
code
and
create
also
a
sequence
diagram.
You
know
to
understand
how
the
workflow
that
the
vmi
goes
at
least
for
some
part
of
it.
It
has
more
things
that
we
were
discussing
before,
but
it's
the.
I
think
the
whole
goal
here
is
to
understand
bottom
ax
and
then
try
to
you
know
to
identify
that
in
the
code
and
improve
the
code.
A: Well then, that's exactly right: that's what Marcelo has been testing.
E: ...the number of VMs per node... the Kubernetes number of pods per node, per VM, on Kubernetes now?
B: You can increase that; for OpenShift, by default, it's already increased to 400, 500, I think. And I did some tests; for example, in ours I did some tests where I was creating 400 VMs on one node, so I changed something in the KubeVirt code to do that also. You can increase that, but...
B: Yeah, I think officially... not OpenShift; officially, Kubernetes says that we can support 250 per node in a safe way. Okay, of course it depends, because the problem when you have too many pods per node is that the kubelet starts to be overloaded, okay, and, you know, the container runtime starts to be overloaded and things get nasty. But 250 is okay; as I mentioned to you, I could run 400 very tiny VMs, okay, without big problems.
B
More
than
400
is
start
to
be
like
too
many,
and
and
if
you,
if
you
aim
to
go
to
200,
it
should
be
fine,
we
are
actually
creating
200.
Now,
in
our
you
know,
perform
steps.
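The per-node limits quoted here (250 pods per node as the stated safe figure, 400 tiny VMs as a practical ceiling) can be turned into a quick feasibility check, since each running VMI is backed by one virt-launcher pod (a sketch; the per-node system-pod overhead is an illustrative guess):

```python
import math

def pods_per_node(vm_count, nodes, system_pods_per_node=10):
    """Estimate the per-node pod count for an evenly spread VM fleet:
    one virt-launcher pod per VMI plus fixed system-pod overhead."""
    return math.ceil(vm_count / nodes) + system_pods_per_node

def fits(vm_count, nodes, max_pods=250, system_pods_per_node=10):
    """True when the spread stays within the node's maxPods setting."""
    return pods_per_node(vm_count, nodes, system_pods_per_node) <= max_pods
```

By this estimate, the 600-VM burst test on three nodes (about 210 pods per node) stays inside the 250 figure, while 1000 VMs on the same three nodes would not.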
E: It's not what's...
A: Yeah, I was going to ask you, because, with your GPU workload, do you run into any issues with the way you slice your GPUs? Do you run into issues on any of the smaller configurations, like an eight-to-one or a four-to-one? Do you slice that small, and do you run into any issues with performance?
A: I don't know... do you use CPU...
E: 16 gigabytes: sixteen of one gig, eight of two gigs, four of four gigs, or two of eight gigs. It's as simple as that, okay.
A: But do you... so I guess, from here: on your node you have what appears to be one physical GPU, which you then pass through to one customer's VM? Because you said you do slice, but it gets passed through as a physical GPU.
A: Yeah, yeah, I... well, I know it is, but I mean... okay, so you're slicing it up into an eighth, and you're using, like, VFIO or something to... yeah, okay, correct.
A: So, in the case where you have, like, the eight-to-one vGPU or something, on the eight-to-one... are you at all reaching six-to-one? Okay. And the sixteen-to-one GPUs, are you at all having any performance issues with the...
A: So then, each... so then I guess: how do you allocate the CPUs for each of them? So you give one a 1/16 GPU, and do you allocate a whole CPU?
A: So you don't do anything with pinning, or any memory bandwidth allocations?
A: Exactly; it isn't supported right now, which is essentially what I'm talking about. So I was wondering if you had a solution that was... because KubeVirt doesn't support it. So I was wondering if you had some other solution that you could publish, because it's something that's interesting too; I mean, there are a lot of things in that area that would be interesting to see in the community.
E: What we are also working on, though this is one or two years of work that we plan to have finished, is in the video I sent you here: to finalize the paravirtualized virtio 3D GPU for Linux. We plan to have it also for Windows and Mac.
A
Yeah,
well,
that's
that's
cool.
I
mean
you
should,
at
some
point,
it'd
be
cool
to
share
your
some
of
your
the
phase
transition
times.
It'd
be
cool
to
see
like
how
you
guys
perform
with
those
transition
times
to
in
your
cluster.
If
you,
maybe
you
have
some
optimizations
that
we
can,
we
might
be
able
to
publish
so
that
others
can
can
copy
it.
So
it
would
be
cool
to
see
at
some
point
yeah.
That's
the
goal,
great,
okay!
Well,
all
right!
A
Well,
thanks
for
sharing,
so
any
more
any
more
topics,
some
people,
I
think,
already
covered.
Quite
quite
a
few
things.
I
think,
there's
nothing
else
yeah.
If
there's
anything
else
going
on
in
the
left
side.
E: If you have any further scalability issues... that's why I plan to always be here, to share what we are already reaching. We are doing a stress test for 100,000 concurrent users across multiple clusters, across multiple regions, behind the scenes.
A: ...work like Marcelo's, a lot of work on the load generator, and we have the burst test. Eventually we want to get to doing steady state, which is probably more in line with your use case, and it'll be interesting. I mean, we obviously don't have as much hardware as you do, but maybe we can try and simulate some of the pressure at a lower scale and see if we can find...
B: Yeah, yeah, great. And again, for some of the bottlenecks that you see, if it's possible, open an issue and, you know, refer to it, describing the bottleneck a little bit, so we can work on that, and discuss that there also.
B: And when you join the meeting, yeah, if it's okay for you, put your name here in the meeting notes; I think it's...
E: I'm trying to find this PDF... the Google Doc, where is it? So I understand... normally...
E: When I enter late, it doesn't show what I have sent before. I...
B: Oh, you got it now. It's just because it kind of, you know, draws attention to how many people are attending this meeting, and more people is better; it brings more attention to performance, yeah.
E: And hr, where are you working? Are you from IBM also?
A: Yeah, it's just... it's nvidia.com! It's my nick here.
A: Yeah, well, I mean, the reason, Andre, I was so interested in a lot of what you're doing is because it's actually a lot like what we're doing at NVIDIA. We have almost identical use cases on the infrastructure side; not the end-user side, not exactly that side, but it's very similar on the infra side. You know, so that's cool, your scale and everything. So yeah, definitely share the problems that you're seeing, you know. I bet you there...