From YouTube: Kubernetes WG Batch Weekly Meeting for 20220707
B: I was just — I wanted to try. Should we start? I'll wait for a couple of moments.
B: Hello, everyone. This is the Batch Working Group, July 7th. This meeting is recorded and will be uploaded to YouTube, so please adhere to the Kubernetes guidelines when commenting. Today we have Antanas, and he'll be giving us an overview of Slurm.
C: It's telling me the host disabled screen sharing. If you can enable it?
D: Okay, yeah. Thank you, everybody, for joining this talk today. I participated in some of the previous calls in the batch working group, and I heard interesting topics with respect to batch job submission on Kubernetes. My background, before coming to the cloud native side, was HPC, and in high performance computing we have similar problems to what people have to deal with in Kubernetes nowadays: we have distributed computing, a lot of the workloads are processed in a batch job manner, and there are different solutions for how to do the scheduling and resource management — a very popular one, of course, is Slurm. So I thought it would be a nice thing to share what was done in HPC with the community here.
D: If you are developing new tools, maybe this will give you an idea for nice interfaces, or an idea of how to extend existing tools, make them cloud ready, and so on. So I hope it will be valuable for everybody.
D: So let's jump into the details a little bit. What is Slurm? Slurm has a long history. It started development in 2002 at Lawrence Livermore National Laboratory, and you already see a little bit of the history in the name: it was the Simple Linux Utility for Resource Management.
D
Actually,
the
first
version
of
slurm
was
a
resource
management
tool.
It
was
still
not
having
a
lot
of
scheduling
capabilities,
but
this
was
the
start.
It's
a
c
code,
quite
a
good
size,
c
code,
500
000
lines
of
code.
D: If you compare it to other tools — I don't know how big Kubernetes is, for example — it's quite compact for the tasks it has to deal with, and very efficient. If you have used it in the past on HPC systems, you are most probably aware that the tool can scale to thousands of nodes: you can run jobs on thousands of nodes, and the time to start a job on such clusters is usually within a minute, from what I have heard. So it's really quite an optimal solution. You can find more details about the actual software in the link below, but I tried to capture some key points and give you an overview of what Slurm is and what it can do.
D: So, I just have to deal with PowerPoint, right. So, yeah, Slurm had a very simple original idea, coming from how normal applications run on a PC. When it was developed in 2002, this was just the time when parallel applications were getting more important. We were entering the era of the first multi-core processors and so on. So basically the idea is: how can we run parallel applications in a similar way to how we do it on a standard PC? You see an example: you have a Linux application, some application A, and you want to execute it on Linux. This is very simple — you just call the application. And in Slurm, you can execute the parallel version of such an application with a single command, srun, and you're running a parallel application; the 8 tells it that you want to run eight copies of that application. So it's quite an intuitive idea, let's say, and in HPC, behind the A there is usually an MPI parallel application.
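As a rough sketch of that idea (the program name `app_a` here is illustrative, not taken from the talk):

```shell
# plain Linux: run the application once
./app_a

# Slurm: run eight copies of the same application in parallel
srun -n 8 ./app_a
```

On HPC systems the application behind `srun` is typically MPI-aware, so the eight copies cooperate rather than merely running side by side.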
D: This was the main idea of what they wanted to achieve originally, and, yeah, the tool became quite big and offered a lot of additional features. It's a fault-tolerant, full-scale cluster management tool. Basically it does three main things. You can use it as a resource manager, so it can deal with all the resources available on the compute hardware — allocate them, free them.
D: Right. I found this information from SchedMD — SchedMD is the company, or community, which currently drives the development of Slurm — and they gave a short overview of the difference between resource management and job scheduling.
D: I think most probably everybody in this community is aware of that. Resource management is how you allocate resources on the compute nodes — resources can be cores, memory, caches, networks, GPUs, switches; many, many things are seen as resources. And job scheduling is, yeah: how can you schedule your jobs more optimally if you have complex network topologies, if you have time slices, if you want to limit the execution of jobs? Let's say we have that case in my company.
D: Actually, there are more solutions out there for that — you see a small table below. Slurm covers both the resource management and the scheduling aspects, and if you ask yourself where Kubernetes stands today: Kubernetes also covers both resource management and scheduling, so it's another bar in this table.
D: Looking behind the architecture: on this diagram, let's start with the box with the lilac cover. Those are the main components — there is a control plane.
D: Basically, this is the main component that the users speak to in order to issue commands, but it also speaks with the database. This is the Slurm database, which mostly manages the accounting values: if you want to introduce users, policies, time limits, those will be stored in the database, and the control daemon will ask for them, or they will be pushed back.
D: Then you can have multiple copies of the control daemon: if there are more users on the cluster and a single process cannot handle that, you can replicate it — have an optional backup daemon — to handle more work. And then the yellow boxes basically run on the compute hardware.
D: You have a Slurm daemon which is responsible for — basically, a job request comes from the users, it's registered by the control daemon, and the control daemon sends it to the actual nodes. The actual nodes running the Slurm daemon will then take care of resource allocation and job execution.
D: So the Slurm daemon is something like the kubelet, if you are searching for references or connections to the existing Kubernetes architecture. In terms of the commands the users are using, you can see them in this blue box. Speaking with the control daemon is done with scontrol. With squeue you can ask what queues are available on the system; you can view the queues and see whether the jobs are queued.
D: How long you have to wait until your job gets running — you can see that through squeue. Enqueueing a job happens with sbatch, so sbatch is just fire and forget: at some point you get an output from the job. And then srun is a more interactive, blocking process. So you actually enqueue your job and you block — or the control daemon is blocking the user request — until your job gets up and running.
D: It's more like an interactive experience. And there are some further functions for accounting, to monitor or to see the accounting values which are stored in the database.
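The user-facing commands mentioned above can be summarized roughly like this (the job id 42 and the file names are made-up examples):

```shell
sbatch job.sh          # enqueue a batch job: fire and forget
srun -n 8 ./app_a      # interactive, blocking: waits until the job runs
squeue                 # view the queues and the state of your jobs
scontrol show job 42   # speak with the control daemon, inspect a job
sacct -j 42            # accounting values stored in the database
```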
D: Right — they are residing on different servers. So the database can be on a different server, the yellow boxes are all on different servers, and you will have one daemon per compute node, right.
B: I guess what they're trying to say here is that in Kubernetes you don't have these calls being explicitly made between components — components don't call each other directly. Here they do: for example, srun, which is running on the client, actually makes an RPC call to slurmd — in our case the kubelet — explicitly, to start a job on a node.
D: Yes, so there are still some differences, as you see. So, yeah.
D: This is a slightly more detailed description — some bullet points for each component and what it does. The most important components are basically the Slurm controller, then the database and the Slurm daemon; we captured those already.
D: The basic utilities used to submit jobs are sbatch and srun. Jobs are basically text files — simple text files. I will show an example of what an sbatch file looks like, but you just call sbatch with this job file, then everything goes to the controller and the job is executed. It's very simple, and you can also have something like interactive jobs.
D: With a job allocation, basically, it's waiting until the resources are allocated; when you get the resources, you get unblocked. And this is the difference with sbatch: you shoot the job off — you enqueue it, put it in a queue — and the command line gets unblocked.
D: Right, right. You can request — to block — to allocate 10 nodes, let's say, and run srun with an argument to allocate the nodes. It will wait until the 10 nodes are available from the queue for a certain time span; you can provide the time information: I want to allocate those nodes for 10 minutes. So if the nodes are available in the queue and you're the next user who can get them, then you get unblocked at some point and you can do your work with them.
D: What's the state of my job? Through the squeue command you can see if it's running or if it's still in the queued state, so you might need to wait until it's executed.
B: One more thing here: you mentioned a lot of commands, and one thing with Kubernetes, for example, is that people sometimes build automation — they build controllers to do a lot of the work. I'm wondering if this is a usage pattern that you also have with Slurm, or is it mostly physical end users that are actually starting the jobs, etc.? Or is there a pattern where automation fires jobs at some higher level?
D: I did not see such a thing. Usually there are users submitting jobs, so you don't have some sort of connection to events — it's not really an event-driven concept. Most probably you could do some implementation which does that, but maybe Kubernetes and its components have better ways to do that, if you want to react to events and stuff like that.
F: I would just say users will use frameworks — you know, workflow engines and stuff like that — that interact with the batch system. So it's a different model than maybe Kubernetes, but, you know, they're sort of an analog.
D: The workflow engines are going a little bit in the same direction; they have some elements of cron jobs and stuff like that. Yeah, yeah.
D: Right. So there are some alternative commands: you can just allocate nodes and attach to them later. This can be done with salloc and sattach, and you see some additional arguments you can use. We have accounts associated with all the commands, so there are components for permission models and so on; you can give time spans and so on. So quite, yeah —
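A sketch of that allocate-then-attach pattern (the account name and the job/step ids are hypothetical):

```shell
# allocate 4 nodes for 30 minutes, charged to account "myproj"
salloc -N 4 -t 30 -A myproj

# later, attach to a running job step (job 42, step 0)
sattach 42.0
```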
D: A lot of features are available through the API. Very interesting is to look at the sbatch files. I saw some examples of how you are thinking of defining jobs for Kubernetes, and the nice thing about Slurm jobs is that they're very compact. You can see what you need to express a batch job through this example. Basically, you have the name of the batch job, you can tell how many nodes you need, and you can have a different number of tasks.
D: The tasks can be — you can also assign how many CPUs you want per task and how much memory you want per CPU. So you do similar things to what you do in your pods with requesting resources. It's a little bit different here: the resource requests are inside the batch job. So if you are thinking about how to implement this in Kubernetes, you will need to find a way to translate.
D: If you want a similar kind of interface, it becomes interesting: how do I translate that to pod requests? So it's a somewhat difficult problem. And again, you have the time you can specify after that. Usually this part is the job description part, where you have resource requests, some definition of time and so on. After that, it's usually shell scripting that follows, and you can execute any kind of Linux application inside.
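A minimal sbatch file along the lines described — job-description header first, then ordinary shell — with all names and sizes made up for illustration:

```shell
#!/bin/bash
#SBATCH --job-name=example     # name of the batch job
#SBATCH --nodes=2              # how many nodes you need
#SBATCH --ntasks=8             # number of tasks
#SBATCH --cpus-per-task=4      # CPUs per task
#SBATCH --mem-per-cpu=2G       # memory per CPU
#SBATCH --time=00:30:00        # time limit

# after the job-description header, ordinary shell scripting follows
srun ./my_app
```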
D: Basically, it's a normal shell script following that. In many cases it was an HPC application, and in the HPC community they usually use these module environments, which are a way to configure.
D: There are extensions for GPUs. And then, after that, the nodes also get different states — a node basically switches between those states: idle, mixed or allocated.
D
So
it's
a
little
bit
different
in
the
compared
to
cloud
world
where
you
have
virtualized
resources,
and
you
might
be
not
the
only
one
on
that
note.
So
then
the
other
interesting
things
are
partitions.
D
I
will
spend
a
little
bit
more
time
on
partitions,
because
I
find
this
concept
very
nice
how
you
could
distinguish
between
different
types
of
resources,
or
so
you
could
group
basically
pieces
of
your
cluster
in
partitions.
You
can
tell
this
part
of
the
cluster
is
partition.
One.
This
part
is
partition,
two,
where
it's
useful,
you
could
imagine.
Let's
say
you
have
a
hybrid
cluster.
D
You
have
some
cpu
only
part
of
the
cluster,
and
then
you
have
gpu
only
or
a
cluster
with
gpus,
so
you
could
build
two
partitions
and
instead
of
giving
access
or
scattering
to
all
the
nodes
and
basically
allocating
some
drop
on
which
which
uses
more
cpu
on
the
gpu
node,
you
could
control
that
through
the
partitions
and
actually
in
in
slurm.
This
can
be
done
automatically,
so
you
could
reset
a
request,
a
resource,
let's
say
a
gpu
resource,
and
then
the
scheduler
should
find
out
automatically.
D
What's
the
best
suited
partition
or
you
will
see
later,
you
can
specify
also
explicitly
what
partition
you
want
to
use
and
under
the
hoods
for
each
partition,
you
have
a
queue
which
is
basically
then
responsible
for
inquiring,
the
jobs
or
getting
the
jobs,
and
then
they
are
served
in
the
order.
Accordingly,.
D: Okay, this is a summary of the resource requests you will usually see in the jobs. This is more on the CPU side; there are further things you can control with plugins, for GPUs and so on, but this is how you usually control how many cores you get, how much memory per CPU and so on. Now back to the partitions.
So, for example, take this example, which I took from NREL, I believe.
D: It was with two islands. You had an island of roughly 100 nodes which had, let's say, hard drive capacities above one terabyte, so you could group them, and every job request which needs a file system bigger than one terabyte will be served by this partition. Another example: you could group by memory, so you could make a queue — make a partition, basically — which will cover all requests with —
D
For
jobs
having
more
than
requiring
more
than
96
gigabyte
of
memory-
and
you
in
that
you
can
have
two
clusters,
as
you
see
so
one
with
192.,
it
doesn't
have
to
be
homogeneous,
but
you
can
bind
them
in
one
partition.
One
queue
both
both
will
fulfill
the
requirements.
This
is
just
a
simple
example,
and
this
is
extended
further.
You
can
have
a
third
one
with
gpus
and
the
way
how
you
specify
your
your
requests.
D
You
add
these
parameters
to
the
as
patch
commands
or
s-run
commands.
You
tell
explicitly.
I
want
500
gigabyte
of
memory
or
I
want
20
terabytes
of
storage
and
then
slurm
will
find
out
which
one
of
the
partitions
is
best
suited
for
this
job.
Basically,.
D: Yeah, they actually recommend doing explicit resource requests, but you also have the possibility to choose the partition explicitly. You can list them — you can always see what the available partitions are and how many nodes you have inside — and then there is an argument, -p I think, that you can pass to sbatch to use an explicit partition. But they don't usually advertise that.
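For example (the partition name "gpu" here is hypothetical):

```shell
sinfo                  # list the available partitions and their nodes
sbatch -p gpu job.sh   # submit explicitly to the partition named "gpu"
```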
D: Yeah, this is the example. Basically, you have the standard srun command, you have a time parameter, how many nodes you want — four nodes — basically with that capacity of memory. So this is how you request it. Or if you want a batch job, what you need to change is srun to sbatch; then it will basically push it onto the queue and give you the control back, right. And here is another example.
D: Basically with GPUs: you want to use two GPUs on eight nodes — eight nodes with two GPUs each. So this is how you can request it, and in the background it will choose the partition; or you can specify -p here explicitly.
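Put together, those two examples might look like this (the exact sizes and file names are illustrative):

```shell
# interactive: four nodes with 500 GB of memory each, for one hour
srun -t 01:00:00 -N 4 --mem=500G ./app

# the same request as a queued batch job: just swap srun for sbatch
sbatch -t 01:00:00 -N 4 --mem=500G job.sh

# eight nodes with two GPUs each, requested through the GRES mechanism
sbatch -N 8 --gres=gpu:2 job.sh

# or pin the partition explicitly
sbatch -p gpu -N 8 --gres=gpu:2 job.sh
```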
D: Yeah, I think I'm being a little bit quick. So this is the basic interface — as you see, it's very simple, very intuitive. This is a bit of a difference from Kubernetes, where you have descriptions through YAML; those are a little bit verbose, and for batch job processing, at least, this kind of Slurm approach is very easy to use: it's basically a client, then you have your shell script, and it's executed. Yeah, and I have some open discussion at the end. So, yeah, maybe I have questions for the community — or if somebody has already had thoughts or some experience on whether both of those worlds can be combined, the whole batch side.
G: I have a question: can a job specify multiple partitions? Let's say I would say the GPU one and the memory one — and Slurm is going to, you know, take the intersection of the two partitions and find the common nodes, and stuff like that?
D: If you have multiple parameters — yeah, you can specify, basically, a minimum memory and combine resources. This is possible here.
F: Maybe — this is Shane Canon from NERSC, Berkeley Lab — I just wanted to comment that one thing I'm not sure was covered, and maybe it would be worth having a separate presentation on it, is that a lot of what Slurm is designed to do is give a policy framework where the resource provider can put priorities on how jobs are scheduled. So you might favor large jobs over small jobs, for example, or you may give certain types of jobs priority over others. You know, really what it's designed to do is —
F
Typically,
you
have
a
backlog
of
work
that
exceeds
the
amount
of
resource
you
have
at
any
point
in
time,
and
so
it's
trying
to
make
decisions
about
how
to
schedule
that
workload
and
where
it
can
get
challenging
is
you
might
have.
If
you
need
to
schedule
a
really
large
job,
then
you
have
to
start
putting
aside
resources
so
that
you
can
run
that
right.
D: Yeah, it has very sophisticated schedulers, and the reservation systems are also nice. Yes, you can basically specify as an argument that you want to use a reservation — this is one of the nice features, yeah. So there are a bunch of very, very good ideas inside Slurm: the partitions, the limits. I think the whole accounting is very, very sophisticated and already very mature.
B
One
question
I
have
here
is
in
a
typical
lucky:
you
you
did,
I
think,
get
examples
from
a
an
actual
deployment
like
how
many
partitions
do
you
have
like
again
in
my
mind,
partition
it
resembles
a
queue
usually
like
have
like.
I
don't
know
tens
or
thousands
of
these
partitions
or
a
lot
less
things.
D: Yeah, from what I saw in the past on some of the big HPC systems, they try to group the resources. Usually you don't get a lot of partitions — you have maybe five to ten partitions. On these big clusters they usually organize them in islands, because if you stay within an island you might be on a single rack or something like that. So in some cases you have the possibility to access one entity of the data center through a partition — so, basically, by taking this partition,
D
I
will
be
running
on
an
island
which
is
very
well
connected,
let's
say
and
will
will
give
me
very
good
latency
in
terms
of
communication
right,
so
you
can
have
this
as
one
one
thing,
then
you
might
have
different
types
of
hardware,
so
you
might
want
to
have
partitions
dependent
on
memory,
so
you
will
have
let's
say,
standard
nodes
with
having
128
gigabyte
memory
available,
and
then
you
have
a
fat
island
or
a
fat
partition
which
does
one
terabyte
memory.
D: That's another example. So there are not so many — usually it's within tens of partitions, not thousands or something like that. So —
D: The administrator has some level of control. There is most probably a default pool where the jobs will land, so you will have a default — the administrator has the power to define the default pool where all the jobs will go. You can also limit access to the partitions — not give access to everybody for the fat nodes or something like that. This is —
G
Possible,
okay,
but
the
job
still
might
can
get
scheduled
on
heterogeneous
kind
of
nodes.
Like
somebody
with
high
memory
and
somebody,
that's
all
that's
possible.
That's
not
explicitly
denied.
D
If
the
administrator
did
not
disallow
it,
it
can
be
scheduled,
but
yeah
you,
the
administrator,
might
turn
it
off.
So.
B
We're
almost
out
of
time
any
other
questions
from
the
community.
H: So I want to touch on these questions in the slides. The slides are suggesting that Slurm is the resource manager and Kubernetes is the interface, but I was thinking about whether that makes sense, or whether it would make more sense to do it differently. Yeah.
D: This is — I wanted to — I did not speak about it too much, but, see, in my company:
D
We
are
looking
at
benchmarking
and
and
how
to
approach
cloud
native
benchmarking
so
and
the
team
had
the
idea
to
look
to
classical
ci
tools,
white
jenkins,
ron,
jenkins,
job
and
stuff
like
that
and
yeah.
For
me,
I
don't
know
if
this
is
a
good
model,
because
benchmarking
usually
can
be
done
as
a
bachelor.
D
So
if,
if
the
benchmarks
are
mature-
and
you
know
that
they
are
stable
and
running,
you
can
shoot
them
in
a
queue
right,
go
to
sleep
and
get
back
come
back
and
get
the
results,
so
it's
very
well
suited
for
for
bad
job
processing.
D
So
I
was
thinking
basically,
how
can
we
do
benchmarking
in
my
company
for
kubernetes
by
maybe
reusing
some
ideas
from
slurm
and
the
idea
what
we
are
exploring?
Can
we
use
slurm
basically
to
spawn
kubernetes
clusters,
as
basically
the
benchmarks
can
vary?
D
We
want
to
benchmark,
let's
say
a
cluster
of
four
nodes,
eight
nodes,
so
you
can,
you
can
expose
all
the
nodes
make
them
available
through
sloan
first
and
then
slurm
allows
you
to
to
start
a
job
and
in
some
sort
of
prologue
script
you
can
provision
a
kubernetes
cluster
on
the
located
nodes.
This
was
the
idea
and
then
you
just
run
the
the
benchmark,
gather
the
results
and,
at
some
point
the
result
or
the
job
is
completed,
so
you
could
theoretically
combine
both
worlds.
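A minimal sketch of that provisioning idea, assuming k3s as the throwaway Kubernetes distribution; the helper scripts `provision-k3s.sh` and `run-benchmark.sh` are hypothetical site scripts, not anything from the talk:

```shell
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --time=02:00:00

# expand the allocated node list and pick one node as the server
NODES=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
SERVER=$(echo "$NODES" | head -n 1)

# hypothetical site script: k3s server on one node,
# agents on the remaining nodes, joined to that server
srun -N 1 -w "$SERVER" provision-k3s.sh server
srun -N 3 -x "$SERVER" provision-k3s.sh agent "$SERVER"

# run the benchmark against the freshly provisioned cluster
run-benchmark.sh --kubeconfig /etc/rancher/k3s/k3s.yaml --out results/
```

When the Slurm job completes, the allocation — and with it the throwaway cluster — goes away.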
D
You,
at
least
in
this
example,
for
benchmarking
and
make
it.
But
this
is
in
the
case
where
you
want
to
test,
maybe
the
whole
kubernetes
system
in.
In
other
cases,
you
might
have
just
one
kubernetes
cluster
and
you
just
want
to
submit
bad
jobs.
You
don't
need
all
this
overhead.
So
most
probably
it's
not
very
efficient
way
in
in
terms
of
provisioning
and
so
on,
but
it
should
be
possible.
B
Are
there
like
times
where
you
would
have
resources
being
set
aside,
while
them
is
trying
to
accumulate
the
amount
of
resources
required
for
the
all-or-nothing
drug?
And
it's
like
for
how
long
does
it
do
that,
like
I'm,
just
trying
to
understand
how
it
implements
all
or
nothing?
By
trying
to
minimize
maximize
you
know
or
minimize
the
case
where
you
have
as
resources
set
aside,
it
can
be
used
while
accumulating.
F: Maybe I could comment on that. Yes, it'll definitely do that — that's what I was talking about earlier. If it's trying to schedule a large job, it can automatically create a sort of reservation out in time, when it thinks it's going to be able to have those resources available, based on the time limits that have been specified for all the different jobs. And so it'll say, you know: I think at six o'clock I'll have all the resources to run a thousand-node job, and it'll make that reservation.
F: A wall time — they don't have to, but typically a wall time is either specified by the job or a default is applied based on what partition they're going to. What is typical? It can be hours to days, but, you know, many days starts to get toward the limit of what a typical HPC center might allow.
F: They will try to game the queues to get their jobs in as fast as they can, but it's within the limits of how the policies have been configured, so there are constraints on that. You know, a smart user will sit there and say: I can fit my job in that big backfill slot, let me size it the right way so it'll jump in. Others just want it to run at some point; they'll submit it and wait for it to come out the other side.
F
The
resources
tend
to
be
very
consistent,
so
you
kind
of
have
a
feel
for
that,
but
you're
right
it
can
vary.
It
can
vary
more
because
of
things
like
io
congestion
or
things
like
that,
but
they'll
typically
put
in
you
know
the
expected
time
and
then
they'll
put
in
some
safety
factor.
They
won't
get
charged.
They
only
get
charged
for
what
they
use.
Typically,
so
they're,
it's
more
about
just
trying
to
get
the
time
right
so
that
it'll
it'll
schedule
sooner
than
otherwise.
D
Yeah
yeah:
this
is
another
thing
which
I
did
not
cover
the
charging,
so
they
are
usually
in
the
hpc
data
centers.
They
connect
that,
with
with
some
sort
of
budget,
the
users
have
budgets
of
hours
compute
hours,
so
you
can
make
it
automatically
that
the
compute
hours
are
then
deducted
based
on
the
on
the
actual
used
compute
types.
A
Question
does
slurm
save
the
state
of
the
job,
let's
say
I'm
running
a
job
and
by
this
time
the
job
needs
to
be
completed.
If
the
job
is
not
completed,
then
save
the
state
so
that
I
can
resume
tomorrow
the
same
same
time
and
finish
the
job.
D
Yeah
this
goes
into
checkpointing,
usually
what
of
the
applications
are
doing
or
the
hpc
applications
are
trying
to
support
checkpointing.
It
depends
a
little
bit
also
on
your
code.
If
your
code
supports
checkpointing,
you
can
make
checkpoints
in
time
and
basically
start
from
from
a
certain
checkpointed
state
later.
So
it's
not
completely
automatic
that
you
make
a
copy
of
the
memory,
and
then
you
can
restart
it's,
not
the
virtual
machine
or
something
which
is
running
so
your
application
has
to
support
it.
B: Yeah, thank you very much, Antanas — and Shane, also, for backing us up with some of these questions. We're five minutes late; I guess that's really great. If you have more questions, maybe tag Shane or us on the working group Slack channel. And maybe, as someone mentioned in the chat, we can also invite people from SchedMD. I did meet with them before and mentioned to them that we have a working group, so we will try.
B: Right — like, how is Slurm planning to support things like auto scaling, for example in cloud environments? I guess there is a ton of work happening there right now, so that would also be interesting, because everything that we've discussed so far kind of works well on an on-prem cluster.
B
All
right.
Thank
you
meet
in
a
couple
weeks,
thanks.