From YouTube: Kubernetes WG Batch Bi-Weekly Meeting for 20220623
A: Good morning, good evening, good afternoon, depending on where you are. Today is June 23rd, my name is Mati and I'll be your host today, and this is another of our bi-weekly Working Group Batch calls. Quick announcement. So with that out of the way, we have William, who will be giving us a Volcano overview. Take it away from here, William.
B: But the major problem is that all these platforms are using different technical stacks, building different ecosystems, which makes it hard to share resources between these different kinds of workloads. Our resource utilization is lower than expected, and maintaining multiple technical stacks at the same time takes a lot of cost and effort. So more and more organizations are leveraging Kubernetes to build a unified platform for all their workloads, but there are still some gaps in Kubernetes to support batch.
B: The first gap is about job management: it's a common requirement to have different pod templates and fine-grained lifecycle management for the job. The second gap is about the scheduling, such as job-based priority, gang, preemption, resource reservation, backfill, topology and so on. And the third gap is about the CRDs: to support different workloads in a common job CRD, to reduce the complexity and the maintenance effort.
B: As we know, there are too many CRDs so far, such as the MPI operator, the PyTorch operator, the TF operator.
B: So that's why we started the Volcano project several years ago, and here is the overview of Volcano. Volcano includes several components. For the multi-cloud scenario we have a federation sub-project in the pipeline to balance resources between different clusters, and in each individual cluster we introduce several CRDs, such as Job for common batch workloads and Queue for resource sharing, and the controller to help manage the lifecycle of these CRDs. And the Volcano scheduler, vc-scheduler, provides rich advanced scheduling policies and performance enhancements, especially to have those batch workloads run better on top of Kubernetes. And from the graph we can see Volcano engages deeply with upstream computing frameworks like Spark, Flink, TensorFlow.
B: So, as we know, in batch systems there are several concepts which are important to the high-level design. The first concept is the job: a batch system usually has a common job specification for all kinds of workloads, such as AI, big data and so on. Unlike the native Kubernetes Job, a batch system is required to introduce multiple pod templates and fine-grained error handling.
B: The second one is about the tenants. Traditionally, batch systems introduce the user as a tenant, and Kubernetes uses the namespace as the tenant, based on my understanding. And in order to avoid system overloading, resource limits are required in a batch system. For example, Slurm has job queues and quality of service, and Kubernetes supports quota to control how many resources can be created in the system for each tenant.
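For reference, the Kubernetes side of this maps to the core ResourceQuota API. A minimal sketch, with a hypothetical tenant namespace and illustrative limits:

    # Cap the aggregate resources one tenant (namespace) can create.
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: team-a-quota
      namespace: team-a          # hypothetical tenant namespace
    spec:
      hard:
        requests.cpu: "100"      # total CPU requests allowed in the namespace
        requests.memory: 200Gi   # total memory requests allowed
        pods: "500"              # cap on the number of pods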
B: The last one is about the queue. Queue is a common concept and it is widely used in batch systems such as Slurm and LSF. It lets administrators manage resources in the cluster and share resources between different tenants. In addition, some systems support different scheduling configurations for each queue, which is a useful feature for users. So in the following slides I'm going to introduce the details of Volcano. The first section is about the job management in Volcano.
B: We can restart the whole job when one task in the job fails. We also build job plugins to help users customize enhancements. For example, the ssh plugin is used to configure SSH automatically for MPI jobs, and the service plugin is used to create a headless service for communication between the pods in each job.
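For reference, a Volcano Job of the shape described here, with two task templates, a whole-job restart policy, and the ssh/svc plugins, might look like the following sketch (batch.volcano.sh/v1alpha1 API; image names are placeholders):

    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: mpi-example
    spec:
      schedulerName: volcano
      minAvailable: 3                # gang: run only when 3 pods can be placed
      plugins:
        ssh: []                      # auto-configure passwordless SSH for MPI
        svc: []                      # headless service for pod-to-pod communication
      policies:
        - event: PodFailed
          action: RestartJob         # restart the whole job when one task fails
      tasks:
        - replicas: 1
          name: master
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: master
                  image: example/mpi-master:latest   # placeholder image
        - replicas: 2
          name: worker
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: worker
                  image: example/mpi-worker:latest   # placeholder image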
B: As we know, queue is a common concept in batch systems; it is used to share resources between different tenants. In Volcano we follow this practice and make the Queue cluster-level, so a queue shares resources between multiple tenants, and the quota is considered as the resource limit of a tenant. Currently Volcano supports first-in-first-out, fairness and priority algorithms for each queue, and the configuration is global for all queues.
B: Here is another example showing how to share resources between queues. Suppose we have six CPUs in the cluster and two queues; one queue is Q1, the other one is Q2, which map to two teams whose weights are two and one accordingly. At the beginning we submit one job with six cores in Q1, and Q2 is empty.
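A sketch of the two queues in this example, using Volcano's Queue CRD; the weights 2 and 1 drive the proportional split of the six CPUs when both queues have pending jobs:

    apiVersion: scheduling.volcano.sh/v1beta1
    kind: Queue
    metadata:
      name: q1
    spec:
      weight: 2        # team 1 gets 2/3 of shareable resources under contention
    ---
    apiVersion: scheduling.volcano.sh/v1beta1
    kind: Queue
    metadata:
      name: q2
    spec:
      weight: 1        # team 2 gets 1/3 under contention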
B: The next one is about the fair-share scenario. Fair share is a common requirement for elastic or streaming jobs like Spark and Flink. However, in Kubernetes, the more pods submitted, the more possibility to get more resources; there is no fair share. Volcano provides fair share between jobs and between namespaces for users.
B: So we can see from the graph that user 1 and user 2 submit a small job and a big job to the same queue. The small job may get starving without fairness; with fairness scheduling, Volcano ensures the big job and the small job get resources fairly with DRF algorithms. But job-level fair share alone is not enough. Suppose one namespace over-submits a lot of jobs compared to other namespaces: it would possibly occupy most of the queue's resources.
B: This is important in a multi-tenant environment for customers. And here are some scheduling policies in Volcano; I will show some of them in the following slides.
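For reference, these policies are switched on in the volcano-scheduler configuration. This is roughly the shape of the commonly shown default config, where drf provides job-level dominant-resource fairness and proportion does the queue-level sharing (the exact tier layout may vary by version):

    actions: "enqueue, allocate, backfill"
    tiers:
      - plugins:
          - name: priority     # job-based priority
          - name: gang         # all-or-nothing placement via minAvailable
          - name: conformance
      - plugins:
          - name: drf          # dominant-resource fairness between jobs
          - name: predicates
          - name: proportion   # weight-based sharing between queues
          - name: nodeorder
          - name: binpack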
B: So this one is about the elastic training scenario. A machine learning workload has higher demand for GPU compared to traditional workloads, and GPU is an expensive resource, so how to improve the GPU utilization is a hot topic and has great value. Elastic training can dynamically adjust the number of instances involved in the training, greatly improving the utilization of GPU resources. Especially on the public cloud, it can work with spot instances to get a lower cost and improve the training progress.
B: Firstly, let's see what an elastic job is like. The left figure shows a Volcano job. The minAvailable field refers to the job having 5 pods at least, and the replicas field refers to the pods the job has at most, 10. The job gets running when five pods get allocated, and then the job will extend to more pods if there are more free GPU resources.
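A sketch of that elastic job: minAvailable is the floor and the task replicas are the ceiling, mirroring the 5-and-10 numbers from the slide (the image name is a placeholder):

    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: elastic-training
    spec:
      schedulerName: volcano
      minAvailable: 5              # job starts once 5 pods are allocated
      tasks:
        - replicas: 10             # may elastically grow up to 10 pods
          name: worker
          template:
            spec:
              restartPolicy: OnFailure
              containers:
                - name: trainer
                  image: example/trainer:latest   # placeholder image
                  resources:
                    limits:
                      nvidia.com/gpu: 1           # one GPU per worker pod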
B: Here is another scenario. As you know, inference services always have a lower GPU utilization compared to the training workload, so people tend to collocate the inference service with the training workload to improve the overall utilization. The right figure shows an example with the inference job 2.
Here is another useful policy for users. This page shows how the task topology and the IO awareness help distributed training. For some GPU training cases, the data exchange between tasks costs a lot of time and becomes the bottleneck of the training in our performance tests. If the time of the data exchange could be reduced, the training performance can be improved.
B: We also did a test with the default scheduler. There are three nodes in the test cluster, and we submit a training job with two PS and four workers. We got three different placements in the end; the results are random with the default scheduler. As you know, group C is the best placement that we want. With the task topology scheduling, we are able to get stable results as group C.
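A sketch of how this is expressed: Volcano's task-topology plugin is driven by job annotations. The keys below follow the plugin's documented affinity/anti-affinity pattern, but treat them as assumptions and check the docs for your Volcano version:

    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: tf-training
      annotations:
        volcano.sh/task-topology-affinity: "ps,worker"   # pack ps with its workers
        volcano.sh/task-topology-anti-affinity: "ps"     # spread ps pods apart
    spec:
      schedulerName: volcano
      minAvailable: 6            # 2 ps + 4 workers, as in the test above
      tasks:
        - replicas: 2
          name: ps
          template:
            spec:
              containers:
                - name: ps
                  image: example/tf:latest      # placeholder image
        - replicas: 4
          name: worker
          template:
            spec:
              containers:
                - name: worker
                  image: example/tf:latest      # placeholder image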
B: As far as I know, some users use the affinity and anti-affinity features to achieve this goal. However, in tests the complexity increases with the cluster scale, and the performance is not good enough. We also did some research on IO-awareness scheduling: with the task topology info and the IO information, we can minimize the max data-transfer latency and get even better performance for some kinds of models. This figure shows the VGG-16 model training results with the default scheduler, task topology, and IO-awareness scheduling.
B: The IO-awareness scheduling gets us a 30 percent performance increase compared with the default scheduler, and the results depend on the data exchange and the models.
B: So in a real production cluster, users often submit multiple kinds of workloads, for example small jobs and big jobs. How to avoid the big job or the small job getting starved is very important. Traditional HPC systems usually support this kind of feature, but in Kubernetes there is no such capability. The left figure shows an example: at the moment T1, users submit a big job 1, with gang scheduling, and a small job 2.
B: The small job gets allocated, and big job 1 keeps pending due to the insufficient resources. At the moment T2, a new small job 3 is submitted and gets allocated; big job 1 keeps pending. As time goes on, the big job will get starving if the released resources can never satisfy its gang and users submit small jobs continuously.
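One way Volcano addresses this is the sla plugin: a job pending longer than a configured sla-waiting-time gets resources reserved for it so it cannot starve forever. A sketch of the scheduler configuration, with an illustrative one-hour threshold:

    actions: "enqueue, allocate, backfill"
    tiers:
      - plugins:
          - name: priority
          - name: gang
          - name: sla
            arguments:
              sla-waiting-time: 1h   # reserve resources for jobs pending over 1h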
B: Spark started to provide Kubernetes support in the Spark 2.3 version, and later the Spark community provided another way, the Spark operator, to help run Spark on Kubernetes as well. However, for a very long time, Spark on Kubernetes lacked batch scheduling abilities, so last year we started to work with Spark contributors to support custom batch schedulers.
B: The batch scheduler support for Spark with Volcano provides the common batch scheduling abilities like job-based priority, queue, fair share, resource reservation and so on, and it has been released in Spark 3.3 this week. Users can use Spark 3.3 with Volcano to enable the batch scheduling policies.
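For reference, Spark 3.3 exposes this through configuration such as spark.kubernetes.scheduler.name=volcano plus a pod group template file passed via spark.kubernetes.scheduler.volcano.podGroupTemplateFile. A sketch of such a template, with an illustrative queue name and priority class:

    apiVersion: scheduling.volcano.sh/v1beta1
    kind: PodGroup
    spec:
      queue: spark-queue          # hypothetical Volcano queue for Spark apps
      minMember: 1                # gang floor for the application's pods
      priorityClassName: normal   # optional, hypothetical priority class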
B: This is another use case in Huawei Cloud. As the amount of data continues to grow and the capacity of the business increases, our users required Kubernetes clusters to support larger scale. So we spent a lot of effort to support 10,000 nodes and nearly one million pods in a cluster, by optimizing the container network, scheduling, the container engine, etcd and the API server; the parts in red point to some specific optimizations. Take the scheduler for example: we improved the scheduling throughput to around 1.5 thousand pods per second by adopting caching, batch binding and async binding, and code-level optimization.
B: So here I will show several use cases for Kubernetes and Volcano. The first is one of the top social media and e-commerce companies in China. Many people use their apps on mobile phones; they have one million active users per month, and the main workload they are running is to provide recommendations for end users. They need to refresh the model every several minutes, and they have some online services to react immediately when users refresh their notes.
B: So the challenge is, they have a large cluster with nearly 2000 nodes, the model has 100 million parameters, and one TensorFlow job has more than hundreds of PS and worker pods. They want the best resource allocation and utilization, so the user adopted the task topology scheduling and gained around a 20 percent performance increase, and they also used the SLA-based scheduling to prevent large jobs from starving. Another case is about a financial sector user.
B: They initially used Hadoop YARN to schedule the batch jobs. As the company grew, the environment policies also changed: different research teams required container deployment to avoid environment conflicts and dependency issues. However, the Kubernetes default scheduler lacks fair-share scheduling between multiple teams. They also have the requirement of using different frameworks such as TensorFlow, PyTorch and MPI.
B: They are a small, fast-growing startup company, and the user looked for a solution and found that Volcano can satisfy their requirements and offers diverse scheduling abilities. What's more, they could use the Volcano job to unify the TensorFlow, PyTorch and MPI jobs. They decided to migrate from YARN to Kubernetes and Volcano.
B: So here is the release journey of Volcano. We have released more than 16 major versions since it was open sourced. At the early stage we developed a set of scheduling policies to support batch workloads, and then integrated with the ecosystem such as Kubeflow, Spark operator, Flink operator, Argo, MindSpore, Paddle and so on. And then we found that there were a lot of gaps in the job management, so we spent quite a lot of time enhancing the job management to deeply support the upstream computing frameworks, and...
A: Yeah, sorry to interrupt you; we are halfway, well, two-thirds through our reserved time, and I would like to use the remaining 15 minutes for the questions.
A: Okay, if you don't have any further closing comments: Abdullah, you had your hand raised for quite a while, go ahead.
C: Thanks, this was a really great rundown of the Volcano ecosystem. I have a couple of questions; I'll start with one related to the Job API. You mentioned that you have multiple components in Volcano, like the scheduler, the job orchestration and management, and whatnot.
C: I'm wondering, since Volcano was started, I don't know, in 2017, three or four years ago: why didn't you think about improving the existing Job API to satisfy your needs? I would imagine that by now the Job API could have evolved to fix all these gaps that you've noted earlier in the journey.
B: So for the job management, I think the biggest gap is the multiple templates, because most of the batch workloads have several roles in one job.
C: Right, but was this proposed, for example, to the K8s community and rejected? For example, I could imagine a v2 Job API that includes multiple templates to fix this gap. My concern here is that, with introducing a new Job API, if you had instead invested in the core Kubernetes one, you might have gotten more traction with Volcano, because it would be easier for users to, you know, migrate to Volcano without changing too much and without being concerned about fitting their workload to yet another custom API. But I totally get it if you have a concern about velocity, how fast you can iterate over it. I was just asking because you started three or four years ago.
C: That is okay. I think I just hope that if we manage to evolve the Job API to a stage where it fixes these gaps, maybe Volcano can migrate to use that common API; this could be one way of moving forward with this. But my second high-level question is about the scheduler: how do you keep updated with the new K8s features? For example, WaitForFirstConsumer volumes, which are baked into the core Kubernetes scheduler, or this new proposal, I don't know if you've been following it in Kubernetes, which is dynamic resource allocation; this is a really complex proposal that involves many components, including the scheduler. Do you import any code from the core Kubernetes scheduler to achieve these requirements, and similar things with affinity and whatnot? I don't know if you reimplemented them yourself in Volcano or you just import them.
B: Oh, that's a good question. Kubernetes supports multiple schedulers, so in Volcano we just import the default scheduler framework package and use it to bring the default scheduler abilities into Volcano, to support the affinity, the toleration and all these policies.
B: Yeah, we have planned for the Volcano federation to balance resources among different clusters, and this feature will be ready this year. We also found that there are already several multi-cluster projects, like Karmada and other projects, but all these projects aim to provide HA for microservices. Our community users always have multiple clusters in their environment, so a lot of users require this.
B: Currently I just know a little about Karmada; some of the users use Karmada.
B: Maybe; I haven't done a deeper investigation and comparison so far. Maybe after the investigation I will.
E: Okay, so I have one more question. One of the worries with something like Volcano, to me at least, would be the load on etcd. And no worries if you haven't; I'm just curious.
F: Yes. So a common complaint I've heard from customers about Volcano is about its interactions with the cluster autoscaler. Do you have any plans to, you know, solve this issue? Or do you plan to also fork the cluster autoscaler?
B: So we have a plan and a basic idea, but it is not materialized yet. The basic idea is we might create a CRD to connect the autoscaling and the scheduler; we would reuse the CRD to communicate the detailed information to make the scheduling more intelligent. But so far I have no detailed design.
F: It would be good if you can bring that idea once you have it, because I think it would be a very bad thing if you end up forking, let's say, the cluster autoscaler, because then, you know, it's harder for people to migrate, to maintain two systems, and so on and so forth.
B: Yes. So far we fork the cluster autoscaler and do some enhancements internally. We did submit some patches upstream, but sometimes they can't get in.
D: Yeah, so in the presentation I noticed that you mentioned Volcano supports scheduling policies, and one of them is NUMA awareness; I'm just curious how that support is enabled for batch workloads in, say, Kubernetes. One of the reasons I'm asking is because we've been working on enabling NUMA awareness in Kubernetes. So, is there something we need to do to converge those efforts? Is there a dependency? Is there reuse of some of the work that we have been doing, or is there something we should be doing to reuse?
B: Yes, yes, this is a good question. We first did the NUMA-awareness scheduling in the last year: one of our customers wanted this feature, so we communicated with the community. According to the community, the developers said this is a long-term feature and could not meet our deadline, so we started the work to support NUMA-aware scheduling in Volcano.
D: If you can point us to some of the code that has been done to implement that in Volcano, we can take a look, because I'd be curious to see how the APIs look and if there's something that maybe we need to take into consideration when we are enabling NUMA awareness in Kubernetes natively.
G: I do have a question about Volcano: did you do any kind of benchmarking, meaning published numbers, like this can support this many jobs in a burst or this cluster size, so that we can have an idea of what the scalability limit is?
B: For benchmarks, we test some of the scheduling policies, several scheduling policies, like the minimal resource reservation for Spark, for the performance increase; we test each important scheduling policy for the performance improvement, and we also measure the scheduling throughput. But we don't have a complete benchmark to test all of what you just said.
B: Sorry, I didn't catch you. Okay.
B
It's
it's
it's!
It's
it's
one
kubernetes
cluster
and
in
in
huawei
huawei
cluster.
A: Okay, thank you very much all, and sorry for taking an extra five minutes above the usual time. Thank you very much, William, for presenting Volcano, and see you all in two weeks. Bye all.