From YouTube: Kubernetes SIG Scheduling Weekly Meeting for 20230309
Description
Kubernetes SIG Scheduling Weekly Meeting 2023-03-09T17:52:05Z
B
Hi everyone. Today is March 9th, and this is the SIG Scheduling bi-weekly meeting. This meeting is recorded and will be uploaded to YouTube, so please adhere to the Kubernetes code of conduct.
B
We've got a few topics on the agenda. First one, Aldo: we have a race condition in pod info and the nominated pods.
C
So basically, Patrick was working on dynamic resource allocation, and his new tests triggered a data race in scheduling. Wait one second, I just want to share my screen now.
C
All right, so yeah, this is the issue. Patrick found this data race while adding this new test, but after further analysis it was not exclusive to the new tests; the new test just made it easier to trigger.
A
We hit it in preemption, because the pod is unschedulable. So basically the race is between two goroutines: one claims the pod as unschedulable, and the other way, the event handler is updating the pod, so these two goroutines can race. Some of the existing tests — although not too many — like the preemption ones, can also hit this kind of code path. That is why he can reproduce it using the preemption test suite.
C
This goroutine that is writing comes from the scheduling events — the event handlers — and, at the same time, there is a read from the scheduling cycle, which obtains the information about the nominated pods.
C
So
that's
that's
the
the
race
that
was
found
in
both
both
scenarios.
Well,
this
changes
in
in
Patrick's
scenario,
but
the
the
update
is
the
same.
The
the
same
grid
so
after
some
debugging
well
Patrick
proposed
one
solution
which
was
adding
a
lock
or
a
an
atomic
pointer,
but
we
figured
that
might
not
be
enough
because
there
are
other
fields
in
the
in
the
bot
info
that
could
have
also
hit
the
data
race,
but
we
discovered
that
we
have
two
objects.
What
is
this?
C
So
we
have
two
objects
in
I.
Think
it's
this
one.
Yes,
we
have
two
locks,
different
locks
between
two
objects,
the
port
nominator
and
the
scheduling
queue,
and
they
both
share
the
same
pointers
towards
the
the
cute,
the
port
info.
That
holds
the
information
about
about
queuing.
C
So
so,
basically,
the
The
Proposal
that
I
came
to
is
that
we
somehow
make
those
two
objects
use
the
same.
Lock,
here's
a
PR
it
way
was
able
to
test
it
with
the
preemption
path
and
it
it
solves.
It
seems
to
solve
the
problem.
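The shape of the race and the shared-lock fix described above can be sketched roughly as follows. All type and field names here are illustrative stand-ins, not the actual kube-scheduler types: the point is only that two structs holding pointers to the same pod info must guard them with one lock instance, not two.

```go
package main

import (
	"fmt"
	"sync"
)

// podInfo stands in for the queued pod info shared by both objects.
type podInfo struct {
	nominatedNodeName string
}

// nominator and schedulingQueue both hold pointers to the same podInfo.
// If each had its own lock, an event-handler write and a scheduling-cycle
// read could race; sharing a single RWMutex serializes them.
type nominator struct {
	lock *sync.RWMutex // same instance as the queue's lock
	pods map[string]*podInfo
}

type schedulingQueue struct {
	lock      *sync.RWMutex
	nominator *nominator
}

func newSchedulingQueue() *schedulingQueue {
	lock := &sync.RWMutex{}
	return &schedulingQueue{
		lock:      lock,
		nominator: &nominator{lock: lock, pods: map[string]*podInfo{}},
	}
}

// update simulates the event-handler goroutine writing the pod.
func (q *schedulingQueue) update(name, node string) {
	q.lock.Lock()
	defer q.lock.Unlock()
	if p, ok := q.nominator.pods[name]; ok {
		p.nominatedNodeName = node
	}
}

// nominatedNode simulates the scheduling cycle reading the pod.
func (q *schedulingQueue) nominatedNode(name string) string {
	q.lock.RLock()
	defer q.lock.RUnlock()
	if p, ok := q.nominator.pods[name]; ok {
		return p.nominatedNodeName
	}
	return ""
}

func main() {
	q := newSchedulingQueue()
	q.nominator.pods["pod-a"] = &podInfo{}
	var wg sync.WaitGroup
	// Concurrent writers and readers: with the shared lock, `go run -race`
	// reports no data race on the shared podInfo.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) { defer wg.Done(); q.update("pod-a", fmt.Sprintf("node-%d", i)) }(i)
		wg.Add(1)
		go func() { defer wg.Done(); _ = q.nominatedNode("pod-a") }()
	}
	wg.Wait()
	fmt.Println("done, nominated on:", q.nominatedNode("pod-a"))
}
```

Running this under Go's race detector with two separate mutexes instead of one shared instance reproduces the kind of report Patrick hit.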
C
Patrick is still going to test it against his new tests, but I also want to write a specific integration test, hopefully to trigger this reliably before the fix. That's the only thing holding me back from merging this PR, so I'll add the integration test and then it should be ready for merging.
C
There was some discussion about whether there are other possible data races — for example, when popping a pod from the queue, we use that same pointer in the scheduling cycle. So there was a question about whether that could trigger another data race, but it turns out it's not a problem, precisely because we're popping it out of the queue.
C
So if there is an update that comes from outside — that comes from the event handlers — while we are scheduling, well, there is no object in the queue, so there wouldn't be an in-place write. That should be fine, and also, if that were the case, we would probably have found the data race way earlier. So I feel pretty confident about this fix, but, as I said, I will add the integration test.
C
If you have any questions or any other concerns, please don't hesitate to put them in the PR. We will cherry-pick this — I think all supported versions are affected by this data race. And that's it about this point, unless there are any questions.
A
Okay, so the second one: we have a new alpha feature. Basically the background is similar to the matchLabelKeys that we introduced in pod topology spread; we want to introduce it to pod affinity and pod anti-affinity as well. So basically the logic is pretty much the same. I reviewed the logic and there's a bunch of tests, so basically it looks good. I think sometime next Tuesday will be code freeze.
C
Quick question about the API: previously, I suppose the label selector was mandatory?
A
A
I
think
it's
much
all.
For
example,
if
you
specify
a
pipe
and
the
I
said
no
op.
C
A union — okay, and then the two queries are ANDed. Okay, I think I can give a review from the API point of view, at least a first pass.
B
Okay, thanks. Just related: there is also another PR, related to sidecar containers — there's a refactoring.
B
Overall it's okay. It's always scary, because these kinds of refactorings can make unnoticed changes to some assumptions and semantics.
B
Hopefully it's okay! It will be useful in the future as well, because we had pod overhead and pod — what was the other one — the resize, where we had to change things in multiple places. So hopefully this will help us in the future to centralize it. I don't think it went all the way — there's still more room to improve — but I think they did enough.
B
That's the one where it makes things a bit ugly — you know, the non-zero thing, right? Yeah, that's the one that still sticks out like a sore thumb in the way that we calculate things, but overall I think they at least unified it.
B
The way we iterate over the containers — we have the main containers, the init containers, and most likely, in the next PR that they have, there are going to be the sidecar containers — at least the iteration is now unified over all places. But for that specific one, yeah, they had to do a bit of quirky logic to take into account the non-zero defaults for scoring.
B
Sergey mentioned there is a very slight chance — I don't know if they're going to get it in in the next two days. The main PR is up for review; I think this one is a refactor that's supposed to make things easier for them.
B
Okay: PodGroup objects for the MPI operator.
C
Yeah. So — I don't want to spend too much time on this, but I wanted to share a discussion that we were having in the MPI operator. So I guess, let me share my screen again.
C
So the MPI operator, and in general Kubeflow, currently has support for the PodGroup from Volcano, and the main training operator — the Kubeflow training operator — also added support for the coscheduling plugin: the PodGroup from the scheduler-plugins repo. And similarly, there is another open issue to add support for...
C
What's this other one — Koordinator, with a K. So basically, I'm not liking the idea that we dump a lot of APIs into Kubeflow to support different schedulers, so I was kind of playing devil's advocate, asking for a more unified solution that doesn't involve importing all these dependencies — at least in the MPI operator, but the same thinking extends to the rest of Kubeflow.
C
The main problem is that if we move the dependencies out of the MPI operator, then it means that, on the other side, the schedulers have to add the code to support the pod groups — sorry, the schedulers need to add code to handle the specific job objects, in this case MPIJob, TFJob, and so on and so forth.
C
So, in a way, the dependency nightmare just switches from one repo to the other repos, and in general, I guess this is an N-to-M problem, right? You have N schedulers and you have M jobs, and there's going to be this dependency nightmare somewhere. So I was advocating for not having this nightmare in Kubeflow.
C
But then it's unclear what the correct solution is. It's a long thread, but one of my proposals was: okay, if we look at the Kubeflow objects — let me just share that real quick. So if we look at the Kubeflow objects...
C
They all look similar — that's the good thing about it; they all look the same, rather. There's this object called RunPolicy, which is in the spec — the MPIJob spec, and similarly the other specs have this runPolicy field — and then, within the RunPolicy field, there is, I think it's called, the schedulingPolicy field.
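The shared shape being described can be sketched roughly like this. The field names are abbreviated from memory as an illustration of the idea — a controller-neutral schedulingPolicy embedded in every job spec — and should be checked against the actual kubeflow/common types rather than taken as the real API.

```go
package main

import "fmt"

// SchedulingPolicy is the gang-scheduling knobs shared by the Kubeflow
// job kinds in this discussion (field set is illustrative, not exhaustive).
type SchedulingPolicy struct {
	MinAvailable  *int32
	Queue         string
	PriorityClass string
}

// RunPolicy is the common policy object each job spec embeds.
type RunPolicy struct {
	SchedulingPolicy *SchedulingPolicy
}

// MPIJobSpec sketches one of the job specs; TFJob, PyTorchJob, etc.
// would carry the same RunPolicy field.
type MPIJobSpec struct {
	RunPolicy RunPolicy
	// ...role-specific replica specs omitted
}

func main() {
	min := int32(4)
	job := MPIJobSpec{RunPolicy: RunPolicy{SchedulingPolicy: &SchedulingPolicy{MinAvailable: &min}}}
	// A separate controller (or the scheduler plugin itself) could read
	// this common field to build a PodGroup, without the operator
	// importing any scheduler-specific APIs.
	fmt.Println("minAvailable:", *job.RunPolicy.SchedulingPolicy.MinAvailable)
}
```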
C
But this is just for Kubeflow, right? If we think of Ray, or we think of Spark, likely they don't have this unified spec.
C
So that's still a problem. So, again, to summarize: my overall idea was that, well, we have these similar specs, so there could be a separate controller — or even the scheduler plugins themselves, sorry — that can use this API to build the PodGroup.
C
Then there is no dependency from the MPI operator on scheduler-plugins, and with that I would probably also advocate for removing the Volcano code from the MPI operator — pushing Volcano to do the same, to own the problem, basically: the problem of having to deal with this integration. And my hope is that — well, of course, this is not maintainable, right?
C
For example, the idea of having a suspend subresource — or you can rename it to a job-queueing subresource, or there can be other names — so CRDs would be able to advertise that they're schedulable as a group, and things like that, in a unified manner.
C
Sorry — yeah, I guess, in a unified manner. Or the other alternative, which is to actually upstream the scheduler-plugins PodGroup API into mainline Kubernetes, which, last time we discussed it, was pending on a better understanding of autoscaling and of the usage of the PreEnqueue extension point.
C
So anyways, I don't expect that we will hold off from doing just what we have been doing, which is adding the coscheduling plugin dependency in the operator. But at least I wanted everybody to think about the bigger picture, and that what we are trying to push for is not maintainable for any project — whether it's Kubeflow or scheduler-plugins or Volcano.
C
Yes. So that's the summary of the discussion. If you have any thoughts now, or if you want to go into the issues and share your thoughts, that would be great. Or if anybody wants to start designing or thinking about these solutions — about a subresource, or upstreaming the PodGroup — that would be great. Anyways, I'll stop here for questions.
A
How many operators are there supporting — using the same spec, like schedulingPolicy, and leveraging the PodGroup? The MPI operator, the training operator — anyone else?
C
In terms of repos, it's two; in terms of APIs, there are more — I think there are around five or seven. There are TFJob, PyTorchJob — I don't remember the other names, but...
C
They all share the same code, except for the MPI operator, which is in a separate repo; the rest share the same code, so they look the same. Yeah, I guess the good thing about this schedulingPolicy struct is that, well, it's kind of serving as a playground. Maybe that's all the fields we need; maybe we will discover that we need more fields, and then maybe this schedulingPolicy can actually also be upstreamed, or be part of this subresource that we define for jobs.
B
That would create, you know, potential unification, right? But again, all the solutions need convincing and collaboration from different repos: either, for example, these repos change in the way that you described — advertising a subresource or whatnot — or we build a unified API that handles all of this for everyone.
A
Yeah, one second, let me share my screen. I think this is one that maybe you or Aldo mentioned a little bit before.
A
So, yeah, the background is that we do have score plugins, and one user asked: how can I make this behavior win over the other score plugins? Our official answer is to use the weight for the individual plugin, right? That's correct, but don't forget that we have percentageOfNodesToScore.
A
So let's think about a simple case. Suppose we have a 20-node cluster and percentageOfNodesToScore equals ten percent. To simplify the real logic — we actually have a minimum of 100 nodes to score, but just forget about that — simple math says 20 multiplied by 10% equals two. Okay. Now we have a deployment with two pods, and they have a preferred pod affinity that says they want to be collocated on the same node. Pod A1 is scheduled, and only two nodes are scored, because of percentageOfNodesToScore: it evaluates node 1 and node 2, and, let's say, randomly lands on node 2.
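The sampling arithmetic above can be sketched as follows. This deliberately drops the real kube-scheduler's 100-node floor and its adaptive default percentage, exactly as the speaker does, so the function name and shape here are a simplification rather than the actual implementation:

```go
package main

import "fmt"

// numNodesToScore is a simplified version of the feasible-node sampling
// discussed above: score only percentage% of the cluster's nodes.
// (The real scheduler additionally enforces a minimum of 100 feasible
// nodes and an adaptive default percentage — ignored here on purpose.)
func numNodesToScore(numAllNodes, percentage int) int {
	n := numAllNodes * percentage / 100
	if n < 1 {
		n = 1 // always score at least one node
	}
	return n
}

func main() {
	// 20 nodes at percentageOfNodesToScore=10 → only 2 nodes are scored
	// per scheduling cycle.
	fmt.Println(numNodesToScore(20, 10)) // 2
}
```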
A
So when pod A2 comes, it does want the node that pod A1 landed on, but internally we have a scoring index that moves by one position each cycle, so as time goes on, scoring can start at any index.
A
Let's say, okay, this time when pod A2 comes, the evaluation scope is node 6 and node 7. In this case the preferred pod affinity directive is a no-op, because it cannot evaluate node 1 and node 2 — they're crossed out — so pod A2 lands somewhere randomly, and the scoring plugin doesn't work at all.
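The moving start index in this scenario can be sketched as a sliding window — names and the 1-based node numbering here are illustrative, but the behavior mirrors the described round-robin offset over the node list:

```go
package main

import "fmt"

// window returns the 1-based node numbers evaluated in one scheduling
// cycle: `size` nodes starting at the round-robin offset `start`,
// wrapping around the node list of length numNodes.
func window(start, size, numNodes int) []int {
	out := make([]int, 0, size)
	for i := 0; i < size; i++ {
		out = append(out, (start+i)%numNodes+1)
	}
	return out
}

func main() {
	// Pod A1's cycle started at offset 0: nodes 1 and 2 were evaluated.
	fmt.Println(window(0, 2, 20)) // [1 2]
	// By the time pod A2 arrives, the offset has advanced to 5:
	// only nodes 6 and 7 are evaluated, so the affinity score that
	// wanted node 2 never gets a chance to run — a no-op.
	fmt.Println(window(5, 2, 20)) // [6 7]
}
```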
A
This issue is pretty obvious in an idle cluster, because in an idle cluster, once you start a pod, you reach the sample size right away; but in a very busy cluster, you may need to go through all the nodes just to reach the sampling size. That is what I observed in our internal production cluster.
A
So this is one case, and there is another. This case happens with a scheduling directive we provide on the pod, but there can be other scenarios where the directive is applied on the node — for example, the PreferNoSchedule taint that can be applied to the nodes.
A
Maybe there are customized situations where they don't want incoming pods to land on a node, but they don't want to put a strict NoSchedule taint on it — just a preference. In this case, again, when the scheduler's internal scoring index lands such that it picks up exactly the nodes with the PreferNoSchedule taints, the pod lands there anyway. So that is also not ideal, and it makes PreferNoSchedule just a no-op.
A
So what I want to propose is to have a way for the scheduler plugins to offer options to sort the candidate nodes. In our pre-filter right now, we enable the user to provide a list saying: okay, you must try these nodes, right? That is the current situation — there's basically no sorting there; it's just a mandatory node list. But in this case it would be a prioritized list: some nodes I want to prioritize, and some nodes I may want to deprioritize.
A
So
providing
a
sorting
function
out
there
can
make
this
make
the
if
sampling
logic
more
accurate.
So
this
is
basically
my
idea
and
this
logic
should
be
applied
to
the
prefilter,
either
through
a
parameter
or
through
some
like
preamp,
so
pre-filtered
extension,
something
so
that
can
provide
a
sorting
function.
So
the
pre-filter
button
can
Implement
that
so
that
is
Sovereign
basic
idea.
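One possible shape for that contract is sketched below. The `nodePreference` type and `sortCandidates` helper are purely hypothetical names for the proposal being discussed, not an existing scheduler-framework API: a plugin expresses which nodes it would like evaluated first, and the sampler orders its candidates accordingly instead of starting at a random offset.

```go
package main

import (
	"fmt"
	"sort"
)

// nodePreference is a hypothetical plugin-supplied hint:
// a higher value means "evaluate this node earlier".
type nodePreference func(node string) int

// sortCandidates stably orders the candidate nodes by descending
// preference, leaving equally-preferred nodes in their original order.
func sortCandidates(nodes []string, pref nodePreference) []string {
	sorted := append([]string(nil), nodes...) // don't mutate the input
	sort.SliceStable(sorted, func(i, j int) bool {
		return pref(sorted[i]) > pref(sorted[j])
	})
	return sorted
}

func main() {
	nodes := []string{"node-6", "node-7", "node-1", "node-2"}
	// The inter-pod-affinity plugin would prefer the node(s) where
	// pod A1 already landed.
	preferred := map[string]bool{"node-1": true, "node-2": true}
	pref := func(n string) int {
		if preferred[n] {
			return 1
		}
		return 0
	}
	fmt.Println(sortCandidates(nodes, pref)) // [node-1 node-2 node-6 node-7]
}
```

With this ordering, the two-node sampling window from the earlier example would evaluate node-1 and node-2 first, so the affinity score is no longer a no-op.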
A
Also, there's this obvious issue — I think, Abdullah, you filed one a long time ago where there is a similar idea.
B
There's an issue where people are basically assuming that, if a node exists with the preferred assignment, the scheduler will actually pick it — but for a very long list of reasons, the scheduler doesn't. One of them is obviously this one: we are always looking at a subset of the nodes, right? Exactly. I mean, we tried to fix the weighting issue between scoring plugins, but this one, I think, is a bigger problem. Right, yeah.
C
But I think, intrinsically, this is going to hurt performance, because basically you need to evaluate all nodes in a way. So...
B
I understand what you're proposing; I'm just saying that the most extreme case, where you want to handle all the corner cases — like the one that Aldo mentioned — is basically to run the score before, and then apply the filter.
A
There's definitely no conflict; they work together. It's just that, with this percentage of nodes to score, I want to prioritize some nodes; otherwise I cannot find the desired nodes that I want. Basically, the node sampling is pretty random — that's the key issue here.
A
We add this to the scheduling framework, so plugin authors can choose how to implement it. For this case, when pod A2 comes, it carries the preferred pod affinity, so the inter-pod affinity plugin can implement the contract to say: okay, I want to prioritize evaluating node 1 and node 2, and I'm fine if these two nodes don't fit pod A2 — then I go to the others, I don't know. But if they do fit, I want to give them a pretty high score.
C
Let's simplify and assume there exists only one score plugin. Sorting-wise, why is sorting any different from scoring 100% of the nodes?
B
For this one, you could potentially have a heuristic that iterates over all the nodes but doesn't really need to be as expensive as the scoring — something much simpler, potentially, to do the sorting — depending on the plugin, of course.
B
I guess my point is: is it possible that we can sort at a lower cost than scoring all the nodes?
A
Yeah, there can be that kind of optimization. That is why another possible option is to ask the scheduler plugins to return a list of prioritized nodes, and maybe a deprioritized list as well. There can be other implementation options, so basically, in the proposal, I will list all the options: sorting is one, and simply giving a prioritized list is also one.
A
If we give this option to choose which nodes are prioritized, it depends on the implementation and on the interface we give to them. If we say, okay, just name the nodes — that is one contract — then they can basically use some kind of selector, a node selector, to give us the node list. On the other hand, if we want the contract to be more generic, maybe sorting is just a big O(n log n) operation cost.
B
A contract like the pre-filter, or whatever extension you're going to run before Filter, that returns whatever prioritization or subset of nodes — yeah.
A
I would say sometimes there can still be some performance gains compared to brute-force 100% scoring. So yeah, let's see how the proposal can resolve your concern, comparing its benefit to the score-100%-of-nodes solution.
C
I think the design has to include specific examples, and not just what the API looks like — because if it's going to end up costing the same amount of CPU as just scoring 100 percent, I don't see a reason why we should do it.