From YouTube: Kubernetes SIG Scheduling Weekly Meeting for 20220324
A
Hi everyone, welcome to today's SIG Scheduling meeting. The meeting is being recorded and will be uploaded to YouTube, so be aware of what you are saying and be respectful to each other. Okay, today we have two agenda items. The first one is an introduction to the KubeFlux scheduler plugin. Claudia, if you are ready, I can hand it over to you.
B
All right, great, thanks. Hi everyone, thanks for the opportunity to speak about the scheduler plugin we've been working on. I'm Claudia. I work for IBM Research in New York, where I am part of the hybrid cloud infrastructure software team. I mainly do a few things related to Kubernetes topics, and one of them is KubeFlux. I'm also working in collaboration, so it's not just a one-man band: this is a collaboration with Red Hat, I think Eduardo is here too, and Lawrence Livermore National Laboratory, which is one of the national labs in California and the creator and implementer of Flux. But let's get into that.
B
The reason we started working on this project is that we want to target more HPC-like scheduling in Kubernetes. I think the community would agree that these are the features we want when we target HPC applications: batch or group scheduling, topology awareness in the scheduling, and optimized placement for different workloads within one cluster.
B
Another desired feature is having different placement algorithms, so that the user can decide how to better map their application onto the cluster. To target all of those features, the best way, or probably not the best but a good way, is to use schedulers that do this for a living. So why don't we use an HPC scheduler to do this job for us, and why don't we try to integrate it into Kubernetes as a scheduler plugin?
B
So we implemented KubeFlux. It is based on Flux, which is a framework: a job manager, workflow manager, and job scheduler implemented at Livermore. They run it on their big machines, and it specifically targets HPC workloads. This Flux framework has different components, and one of them is the scheduler.
B
Its name is Fluxion, and we use that specific component in the scheduler plugin as a library. It is wrapped into a container that we use as a sidecar to the scheduler plugin pod. Fluxion is a graph-based scheduler, and the entire Flux infrastructure targets the representation of resources and jobs as graphs. Matching a workload that wants to be scheduled to the cluster where it wants to run is basically solving a graph matching problem, and having a graph also gives us topology awareness of the cluster for free; in this case we are of course talking about the Kubernetes cluster. As a very high-level overview of how the KubeFlux plugin is implemented: we make use of the scheduling framework, and we use the PreFilter and Filter extension points. In the PreFilter extension point, we get the information about the pod that wants to be scheduled, and we map the pod specification into a job specification.
B
The jobspec is what Flux is able to understand. So we do this mapping from pods to jobs, and we ask the sidecar container to give us an allocation for that pod. If the allocation exists, that is, it is possible to allocate the job, we get back a node name, and then we move on to the Filter extension point and get the pod allocated and executed.
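A rough Go illustration of that PreFilter flow follows, assuming the ~v1.23 scheduling framework signature. This is a sketch, not KubeFlux's actual code: podToJobSpec, fluxClient, and MatchAllocate are invented names standing in for the mapping and the gRPC call to the Fluxion sidecar.

```go
package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const stateKey framework.StateKey = "kubeflux-node"

// nodeState carries Fluxion's decision through the scheduling cycle.
type nodeState struct{ name string }

func (s *nodeState) Clone() framework.StateData { return &nodeState{name: s.name} }

// KubeFlux holds an assumed client interface for the Fluxion sidecar.
type KubeFlux struct {
	fluxClient interface {
		MatchAllocate(ctx context.Context, jobspec string) (string, error)
	}
}

// podToJobSpec stands in for the pod-to-jobspec translation described in the talk.
func podToJobSpec(pod *v1.Pod) string { return "" }

// PreFilter maps the pod to a jobspec, asks the sidecar for an
// allocation, and stashes the chosen node for the Filter stage.
func (kf *KubeFlux) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) *framework.Status {
	jobspec := podToJobSpec(pod)
	nodeName, err := kf.fluxClient.MatchAllocate(ctx, jobspec)
	if err != nil {
		return framework.NewStatus(framework.Unschedulable, "fluxion returned no allocation")
	}
	state.Write(stateKey, &nodeState{name: nodeName})
	return framework.NewStatus(framework.Success)
}
```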
B
I would like to go back to the list of things that we want to enable when we talk about HPC applications on Kubernetes, because with KubeFlux we are trying to address them. So, as an exercise, let me also try to explain how and why we are able to target those desired features. First, about batch or group scheduling: for HPC workloads running on a Kubernetes cluster, consider an MPI application as the easiest example.
B
We know that having that workload seen as a unique, single unit or group is very important, because of how an MPI application works: you need to have all the ranks ready before starting. We all know that, but the ability to treat that kind of workload as a single unit is just the starting point. You have that, but then what do you do with it?
B
Within our plugin we also use the PodGroup CRD to manage the creation of a group of pods, but that is not mandatory in a sense; pod-by-pod scheduling is of course possible too. But once you have the group that you want to schedule, we can enable some optimizations in the scheduler, like placing the group while being aware of the topology of the cluster. Also, given that we know what the cluster looks like, we can better pack or spread the workload on top of it, and the user is able to decide how to place the workload with respect to the topology.
B
So, as I mentioned at the beginning, the Flux resource model is graph-based: the cluster where it runs is defined as a graph, which can be more or less detailed. That depends on how deep the user wants to go with the cluster representation, but of course that cannot be done by hand all the time.
B
So within the plugin we create a representation of the cluster that is as detailed as we can make it, but it also has to be generic enough that it can apply, let's say, to any Kubernetes cluster. We could go deep into the features of single nodes, or into the entire topology of the cluster, whether it is on a single rack, multi-rack, or multi-cloud; how much detail to include is an implementation choice.
B
But we found that it is possible, with what Kubernetes provides, to build a graph that is already good enough to let us do a better job of scheduling, for instance when we talk about MPI.
B
Since I mentioned that, I would like to bring this example over. Jobs are also mapped to graphs, so that the scheduler can run its graph matching algorithm, and this is done internally by Flux.
B
We don't really have to do anything there other than the translation from the pods to the job specification file that Flux understands. For instance, if we have an MPI workload that has four ranks, and each of them wants one core, this gets translated into a single job that has a count of four requests of one core each.
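To make that translation concrete, here is a hedged sketch of what such a jobspec could look like, written as Go structs; the field names are illustrative, not the exact Fluxion jobspec schema.

```go
// Illustrative jobspec shapes; not the exact Fluxion schema.
type Resource struct {
	Type  string     // e.g. "slot", "core"
	Count int        // how many of this resource are requested
	With  []Resource // nested resources, expressing topology
}

type JobSpec struct {
	Version   int
	Resources []Resource
}

// Four MPI ranks wanting one core each become a single request:
// count=4 slots, each containing one core.
var fourRanks = JobSpec{
	Version: 1,
	Resources: []Resource{{
		Type:  "slot",
		Count: 4,
		With:  []Resource{{Type: "core", Count: 1}},
	}},
}
```

The pod-by-pod variant described next would instead be four separate requests, each with a count of one.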
B
In this way we are able to do group scheduling, let's say, and that is submitted as a single request to Flux. In the same way, we can also do it pod by pod: instead of a count of four we have a count of one, and we make four requests. The result would be the same in terms of what the scheduler gives back.
B
Given the placement algorithm that is chosen, the result would be consistent. But having the possibility to make the entire group a single request also helps optimize the time it takes to get the entire allocation, because instead of four different requests you have just one. So this is an optimization that can help in certain cases, and pod by pod the result would be the same. I saw a raised hand.
E
Yeah, just a quick question on the translation you mentioned between the pod spec and the jobspec. The kube-scheduler framework is a pod scheduler, so how do you translate that to a count of four, for example? How does that work? How do you determine the job size?
B
So let's consider the two cases, pod-by-pod or group scheduling. Pod by pod is just a single pod, so we get the requests, if any, for CPU, memory, accelerators, whatever is requested by the pod, and we create a simple, basic jobspec, basically a YAML file, that represents the single pod. If instead it is a group of them, the better way, at least in my opinion, and I can be wrong, of course, is, when we want to do group scheduling, to have the CRD, the PodGroup, which already has the information about the members of the group. Given that number, we can define the count, so it is given to us, like, for free. If you don't have the PodGroup, you can still get that information: if the pod is part of a ReplicaSet, you know the number from that. But that is workload dependent; with the PodGroup you are kind of abstracting that away.
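As a hedged sketch of that job-size logic: the PodGroup stand-in and the function are assumptions for illustration (the real CRD lives in the scheduler-plugins project).

```go
package main

import appsv1 "k8s.io/api/apps/v1"

// Stand-in for the scheduler-plugins PodGroup CRD; only the field
// that matters here is shown.
type PodGroup struct {
	Spec struct {
		MinMember int32
	}
}

// jobSize derives the jobspec count as described above: the PodGroup
// membership if one exists, else the owning ReplicaSet's replicas,
// else 1 for plain pod-by-pod scheduling.
func jobSize(pg *PodGroup, rs *appsv1.ReplicaSet) int32 {
	if pg != nil {
		return pg.Spec.MinMember // comes "for free" from the CRD
	}
	if rs != nil && rs.Spec.Replicas != nil {
		return *rs.Spec.Replicas // workload-dependent fallback
	}
	return 1
}
```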
E
Okay, so I thought this was already implemented and you already had an opinion in the KubeFlux plugin about how to define a group of pods as being a single unit of scheduling. But I guess at this point you still have this model where there is a job controller, for example the MPI Job operator, that actually creates the pods, right? And then you need to look at these pods and identify that they are all related to each other and need to be scheduled together.
E
You would hold them and not schedule any of them, but create a single resource in KubeFlux, or Fluxion, sorry, maybe I'm missing the terminology, a job, in your system, and then your system responds to you with the scheduling, like which nodes these four pods, for example, should land on. Is that the workflow?
B
Yeah, roughly; it is about pods, it is not a workflow manager, so the job, even if it is a group, can be treated one by one or all together. Let's consider this example here, where we have a single workload with four pods and we want to make a single request. We pack that into this thing, this is highly simplified, and we request the allocation from the sidecar container, and what we get back can be either one single node or more nodes. So we get the entire result, we save that information for the pod group, and as the pods arrive, we assign a node to each pod, and we go on until we have handed out all the nodes.
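A sketch of that bookkeeping, with assumed names (fluxionClient, MatchAllocateGroup): the group's allocation is requested once, and nodes are handed out one per arriving pod.

```go
package main

import (
	"context"
	"errors"
	"sync"
)

// fluxionClient is an assumed interface to the Fluxion sidecar.
type fluxionClient interface {
	MatchAllocateGroup(ctx context.Context, jobspec string) ([]string, error)
}

// groupAllocations caches the nodes Fluxion chose for each pod group.
// byName must be initialized before use.
type groupAllocations struct {
	mu     sync.Mutex
	byName map[string][]string // pod group name -> remaining node names
}

// nodeFor returns the node for the next arriving member of a group,
// asking the sidecar for the whole allocation on the first call.
func (g *groupAllocations) nodeFor(ctx context.Context, flux fluxionClient, group, jobspec string) (string, error) {
	g.mu.Lock()
	defer g.mu.Unlock()
	nodes, ok := g.byName[group]
	if !ok {
		var err error
		if nodes, err = flux.MatchAllocateGroup(ctx, jobspec); err != nil {
			return "", err
		}
	}
	if len(nodes) == 0 {
		return "", errors.New("no nodes left in group allocation")
	}
	node := nodes[0] // hand out one node per pod until exhausted
	g.byName[group] = nodes[1:]
	return node, nil
}
```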
E
So which extension point do you use for that? Sorry, I'm drawing a blank. What is the extension point where you hold the pods pending: reserve, no, permit?
B
Right, so we can do this before that, in the PreFilter stage. Let me go back, so, yeah: we are using those two extension points just for Flux. When it comes to group scheduling, we also enable the other ones and we can exploit the coscheduling extension points, so we kind of combine the two. But the information, the list of nodes where the pods have to land, comes here in the PreFilter extension point, because that is where the pod arrives first and where we ask Flux for the node or the group of nodes. So at this point we get the entire allocation, and then we filter in the Filter extension point. It is a way to just skip the filtering in any further extension points, though now I have a doubt about that.
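Continuing the earlier sketch, a Filter that only admits the node recorded in PreFilter would effectively skip the rest of the framework's filtering for this pod; again these are assumed names, not KubeFlux's actual code.

```go
// Filter admits only the node that Fluxion chose in PreFilter.
func (kf *KubeFlux) Filter(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	data, err := state.Read(stateKey) // written by PreFilter
	if err != nil {
		return framework.NewStatus(framework.Error, "no fluxion allocation recorded")
	}
	if nodeInfo.Node().Name != data.(*nodeState).name {
		return framework.NewStatus(framework.Unschedulable, "not the node fluxion allocated")
	}
	return framework.NewStatus(framework.Success)
}
```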
A
So it sounds to me like you are not leveraging the scheduling framework that much; you just use the scheduling cycle as a hook to invoke your external Fluxion server, and the Fluxion server is doing the heavy work. So my question is: does Fluxion watch all the events related to scheduling, so that it can make a good scheduling decision?
B
When the scheduler is deployed, we get the entire cluster and we build the graph of the resources out of that. So whatever is built there, we know we can match on that graph. Also, assume we have two different workloads that have to be placed.
B
The first one arrives, you do the math, you do the allocation, you get a set of nodes, and you run the first batch of pods. If another one arrives and the previous one is not completed, the resources are still marked as being used.
B
So the Fluxion component knows that those pods are still there and the resources are not free, and once that previous job, or any job that was scheduled by Flux, is completed, the resources are freed. So actually there is a watcher on pods here on the Flux side, but it is not really related to enabling the scheduling; it is just there to free the resources after the jobs are completed.
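That watcher might look roughly like this client-go event handler; the cancel call and the pod-to-job lookup are assumed names for the sidecar interaction.

```go
package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// watchCompletions frees Fluxion's graph resources when a
// flux-scheduled pod finishes; it plays no part in placement itself.
func watchCompletions(informer cache.SharedIndexInformer,
	cancel func(ctx context.Context, jobID uint64) error,
	jobIDFor func(*v1.Pod) uint64) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			pod := newObj.(*v1.Pod)
			if pod.Status.Phase == v1.PodSucceeded || pod.Status.Phase == v1.PodFailed {
				// Tell the sidecar the job is done so its resources are freed.
				_ = cancel(context.Background(), jobIDFor(pod))
			}
		},
	})
}
```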
F
A question: so are you saying that your scheduler maintains its own resource calculation, its own resource database?
B
Correct, yeah. As of today that is static, in the sense that it is easier to avoid this split-brain over the resources if you kind of reserve the nodes to the Flux scheduler, so you partition your cluster into a set of nodes and assign that to the scheduler. It is in our plan to make sure that the internal state is kept up to date with whatever else is running on the cluster, if the nodes cannot be reserved.
F
Okay, but your scheduler is running as a plug-in, right? I understand it as a plug-in to the current Kubernetes scheduling framework. If that is the case, then there could be scheduling conflicts, right? Because it is like two brains working to make decisions while looking at the same cluster resources, and you have different data, and the resource view might not be synchronized very well, because there are two schedulers running in parallel.
B
Yeah, that is correct, that is a problem. In one direction, if you are the Kubernetes scheduler, you know what resources Flux is scheduling, because you see the pods, all the workloads; that is not true the other way around, because of the internal resource representation.
B
That is not an unsolvable problem, because you can update at every scheduling cycle, as you usually do: you update the node list, you can see what is actually running, and update the internal state of the Flux scheduler. So that is possible to do.
F
You would have to partition and manage more than one scheduler, okay. But anyway, going down this path, I think there are quite a few issues that need to be considered. I have another question about topology awareness. Could you go to your next slide, the one that shows topology awareness? Yeah. So your graph there goes down to GPUs and cores; are you going to have the nodes' GPU and core layout, the topology information, in your topology database or whatever? Do you have that, or do you just record how many GPUs and how many CPU cores there are, or do you have the real hardware topology?
B
We can get the real hardware topology. We were doing that with the node-feature-discovery operator, so we already have some integration with that, and it is possible to go as deep into detail as we want. If we have the NFD operator giving us more information about the hardware, we can do that, yes, and then the scheduler knows everything about the infrastructure and can take advantage of that information. But to keep it easy, here it is just cores; maybe "CPU" would have been better wording.
B
This is the kind of representation that we create by default; it is what Kubernetes gives you.
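A toy version of that default cluster-to-graph mapping, with an illustrative vertex type: one cluster root, a vertex per node, and core vertices from each node's allocatable CPU, with deeper levels possible when NFD labels are present.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// vertex is an illustrative graph node; Fluxion's real resource
// graph is richer than this.
type vertex struct {
	kind     string // "cluster", "node", "core"
	name     string
	children []*vertex
}

// buildGraph derives a default cluster -> node -> core graph from the
// node list, the level of detail Kubernetes gives you out of the box.
func buildGraph(nodes []*v1.Node) *vertex {
	root := &vertex{kind: "cluster", name: "kubernetes"}
	for _, n := range nodes {
		nv := &vertex{kind: "node", name: n.Name}
		for i := int64(0); i < n.Status.Allocatable.Cpu().Value(); i++ {
			nv.children = append(nv.children,
				&vertex{kind: "core", name: fmt.Sprintf("%s-core%d", n.Name, i)})
		}
		root.children = append(root.children, nv)
	}
	return root
}
```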
B
Oh yeah, sure, yeah, I don't have a lot more slides. We mentioned this already: another feature that having a well-oiled scheduling machine gives you is that it can implement more matching policies.
B
In this case there are a few of them, like four or five if I remember correctly, and more can be plugged in, so it is also composable from that perspective. In this way we can also enable packing or spreading when doing the scheduling. This has to be selected at startup time, so when you create the scheduler you have to define it, but it could also be customized and decided at runtime.
B
You choose what kind of algorithm you want and what kind of allocation you want for your nodes. This can indeed make a difference when you have workloads that are either compute intensive or network intensive. And since you can go with the topology specification in the graphs, you can go fine grained or coarse grained.
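A sketch of what that startup-time policy selection could look like as scheduler plugin arguments; the struct is illustrative, not KubeFlux's actual Args type, and the policy name echoes the packing policy mentioned in the comparison later.

```go
// Illustrative plugin arguments decoded from the scheduler
// configuration at startup.
type KubeFluxArgs struct {
	// MatchPolicy selects Fluxion's placement algorithm, e.g. a
	// packing policy such as the low-node policy used in the LAMMPS
	// comparison below, versus a spreading alternative.
	MatchPolicy string `json:"matchPolicy"`
}
```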
B
Optimized placement is a feature that can be very useful if you have a large cluster and you want to partition it and run different workloads on it, in a soft way, not partitioning the nodes, just having a soft partition of the cluster where you want to run, for instance, two molecular dynamics simulations and some other stuff. I'm using an example, but we also have some results for it. LAMMPS is an MD simulator, and these are network intensive, and they perform better if the ranks are close to each other.
B
So instead of putting the ranks here and there in the cluster, you would really want to have them close to each other. And if you have two workloads, you want to pack each of them onto a subset of the nodes so that they don't interfere with each other, and with the pack algorithm you can do that with the scheduler.
B
We are doing some performance evaluation of how those workloads run on Kubernetes. This is very preliminary, so more optimizations can be done, and we also have to learn more about the benchmark itself, LAMMPS, to make it run better.
B
Here we are kind of playing with the input parameters, the number of MPI workers and ranks and how you want to run this. But as is, in this graph we compare KubeFlux using a low-node match policy that does packing, because we know that LAMMPS is network intensive, so we don't want to spread the pods all over, against what the default scheduler does. This is a cluster with 24 nodes.
B
So you can have a one-to-one mapping up to about here, and then performance can start degrading. But what I want to show, and this is very naive, is that better scheduling can make a whole lot of difference in the performance of the application when we talk about traditional HPC workloads. Here you see something like a 5x improvement, higher is better, in terms of timesteps per second.
B
So this is very promising and we are very enthusiastic about the results. Oh yeah, another bunch of use cases: these come from the national labs and are more like traditional HPC workloads.
D
Your previous graph, the one before, that one. I'm wondering if the comparison is fair, because kube-scheduler doesn't know anything about the application, right? But on the other hand, there are certain APIs in the pod spec that you can use to...
B
Absolutely correct; indeed, we are trying to be as fair as possible with the MPI Operator. This is an MPI application and we use the MPI Operator to deploy it, and as far as we got with the testing, this is the best configuration of workers and MPI ranks per worker that we found so far. There is a slots-per-worker setting that you can set in the MPI Job spec, and we also tune the mpirun command so that it does the mapping of ranks per node and tries to do its best. But those are application...
D
Application-level things. I'm wondering more about scheduling configuration, for example using pod affinities to suggest to the scheduler that the pods should be placed closer, in the same topology.
B
Okay, yeah, we didn't try that; we were focusing more on tuning the application regardless of the scheduler. Those are optimizations that an HPC person would do to run on bare metal, regardless of the scheduler. But yeah, you definitely have a point; we have to investigate on that side. That is very preliminary, I collected this data yesterday evening.
D
I'm also very concerned with the architecture you have, because you basically have two schedulers, so I wonder why you didn't consider just running your own scheduler entirely, instead of having these two states. And I'm not really sure.
D
There is another project within this SIG, it's called the coscheduling plugin; it's not in core Kubernetes.
D
And it actually runs using the framework, not splitting the work into two places. So I wonder if you investigated that route and if you plan to merge those. Or, on the other hand, did you consider writing an entirely new scheduler, which is what other platforms like Volcano are doing, for example? This seems like a weird middle place, so I just want to know: what's the advantage of this architecture?
B
I get it. The comparison with the coscheduling plugin is not completely fair, because they are solving two different problems, but I get what you are saying, meaning: why don't you implement the entire algorithm within the plugin? For the same reason as before: there is an existing thing that we can use. This of course introduces other problems, like the double resource state, but that is probably the price to pay for this specific case. So that is the trade-off that we decided to take.
A
Sorry, yeah. Because it is a blocking operation, and if that overhead adds too much latency, I think it would be better to incorporate, sort of migrate, your Fluxion internal data structures into the scheduling plugin itself. Then you do everything in memory instead of calling your external gRPC service; I think that is better for performance.
B
Right. The one with the gRPC is the second version; we implemented one where everything was in the same container, so there was no gRPC in between.
B
There is really no difference, and we are talking about milliseconds, so that is definitely something we took into consideration. It is TCP now; another optimization that can be done is to move to Unix domain sockets. We didn't do that yet, but that is in the plan. But really, it is not introducing any latency: we have been measuring the startup time and the scheduling time between the two, the coscheduling plugin and KubeFlux, and there is really no difference.
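The planned Unix-domain-socket move might look like this on the client side; the socket path is an assumption.

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Dial the Fluxion sidecar over a Unix domain socket instead of
	// localhost TCP; both containers in the scheduler pod would share
	// the mounted socket path.
	conn, err := grpc.Dial("unix:///var/run/fluxion/fluxion.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("cannot reach fluxion sidecar: %v", err)
	}
	defer conn.Close()
}
```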
B
Yeah, I agree, indeed. We also had discussions about that with our Red Hat counterparts, asking: how would you solve this problem, and does it even make sense to solve it? And the feedback we got is that having the cluster partitioned, with nodes reserved to this kind of workload, would definitely be the preferred way. That is also why we were sticking to this idea of reserving nodes to the specific scheduler.
F
Yeah,
but
if
you
have
like
partition
right,
you
have
each
partition.
Has
you
know
it's
dedicated
scheduler?
Then
it's
not
optimal
resource,
optimal
right,
because
you
know
one
partition
might
run
off
resource,
but
the
other
causation
there's
still
resources
in
the
other
partition.
But
you
cannot
use
that
scheduler.
B
Right, yeah, absolutely, but this depends, I guess, on how you need to use the cluster, because of course there might be occasions where you run out of resources and you would really like to have more, stealing them from the other subset of nodes.
B
But if you are dividing the nodes into two or more groups, and you want a certain set of applications to run there and to have that number of nodes just for yourself, then this is, I think, acceptable. But really, this depends on the cluster admin and all that stuff, I guess.
D
All right, one quick note: I don't know if you are aware of the Working Group Batch that we created, so yeah, we could continue the discussion there. What you said about not reinventing the wheel is something we definitely want to encourage.
D
This architecture needs to be discussed further if we want to have a single solution for HPC. We definitely want the knowledge of the HPC community, so we need to find a good way of attaching this knowledge to, perhaps, the existing scheduler, instead of trying to partition the decision making. So yeah, maybe we can have another session in the working group. Sure. And there we also have people working on topology for the nodes specifically, so there could...
A
So the next item I want to mention, I haven't created an issue yet, is the pod resource metric that was introduced in 1.21 by Clayton. Basically, the background is that different components may compute the pod request differently; some are kind of buggy, like not taking the init containers' requirements into consideration, or calculating it the wrong way. For example, yesterday I checked the metric in kube-state-metrics: they just iterate over the regular containers and add them up.
A
That is not correct, because you have to consider the init containers and also the pod overhead. So what Clayton wants to propose is to have a single source of truth for the view of the pod request across all the Kubernetes components. That is pretty good; I really like it, because then we can use this metric to come up with an aggregated usage.
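For reference, the computation being standardized is roughly the following sketch: the effective request is the maximum of the summed regular containers and the largest init container (init containers run sequentially), plus the pod overhead.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// podCPURequest computes the effective CPU request of a pod:
// max(sum of container requests, largest init container request)
// plus the pod overhead.
func podCPURequest(pod *corev1.Pod) resource.Quantity {
	total := resource.Quantity{}
	for _, c := range pod.Spec.Containers {
		total.Add(*c.Resources.Requests.Cpu())
	}
	// Init containers run one at a time, so only the largest matters.
	for _, c := range pod.Spec.InitContainers {
		if c.Resources.Requests.Cpu().Cmp(total) > 0 {
			total = c.Resources.Requests.Cpu().DeepCopy()
		}
	}
	// Pod overhead (e.g. from a RuntimeClass) is added on top.
	if q, ok := pod.Spec.Overhead[corev1.ResourceCPU]; ok {
		total.Add(q)
	}
	return total
}
```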
A
Per node, for the aggregated pod requests. But the problem is that yesterday a colleague came to me to say that the cardinality is pretty large, especially the pod dimension, because this is a multiplying relationship. In a large cluster, if you have four or five thousand nodes and each node has like 50 or 100 pods, that is a huge multiplier.
A
So this puts a lot of pressure on the Prometheus scraper, even for a single scrape of the metric. So I'm considering, basically, that I don't see a need for a metric that involves the pod level; it is basically a high-level metric, and I think namespace and node are probably all fine, it's just that the pod level is too low-level. So I do want to eliminate the pod dimension, and I want to hear what you think.
D
And Clayton's defense was that, for one, this is not maintained in memory, so only the scraper would pay the cost, not kube-scheduler. But I think the need is to actually know how many resources one pod is using, and if you don't have the pod label, then there is no point.
A
Okay, I can talk to Clayton about this. Basically, it's just practical, not ideal, because if you are troubleshooting some performance problem, metrics usually give you a high-level overview. If you want to pinpoint a specific pod, metrics are not the standard way to do that, and that's why I don't want to have the pod in the metric's labels. I just wanted to raise this. Anything else, other than that?
A
If not, I will close today's meeting. And yeah, I want to mention that Abdullah raised extending the recurring meeting from 30 minutes by 15 minutes, to 45 minutes. So basically, that will still be the rule for regular meetings, and we can accommodate more items in the future.