Description
This was an unconference session and therefore has no proper description.
A
All right, hi everyone, thank you for coming. I am Andrew Chen, and that is Dominik Tornow. Just to give you some context: this kind of came out of a project from SIG Docs, where we're using systems modeling to try to explain better how Kubernetes works. So this is sort of a case study with the scheduler. Then let me hand it over.
B
All right, just a quick question, because I'm curious. For myself, I had the hardest time when I got started with Kubernetes to wrap my head around it: I had the hardest time understanding what it actually is, and I had the hardest time understanding how it works. Does anybody else share this feeling? All right, cool, yeah, a few of us. So I would argue that Kubernetes does have a problem with complexity.
B
However, I would actually argue that Kubernetes has a problem with its perceived complexity. The perceived complexity of Kubernetes is very high; the conceptual complexity of Kubernetes, however, is actually fairly low. It has a few basic patterns that it applies over and over, and the understanding of Kubernetes relies on the understanding of these patterns. So I believe that the problem is a problem with communication, and not a problem with engineering.
B
So, as Andrew already said, this presentation is part of a larger collaborative effort between the CNCF, Google, and SAP to advance the understanding of Kubernetes and its underlying concepts using a systems modeling approach. So at the end of this presentation, I would be pleased if you shared your feedback about the model and about the modeling approach that we use in this presentation. So, we talked about modeling, so let's model the scheduler. This diagram depicts the high-level architecture of Kubernetes. Like every other component of Kubernetes, the scheduler monitors and modifies objects in the Kubernetes object store.
B
The sequence of events and actions around the scheduler can be summarized as follows: after a user or controller creates a pod, the scheduler, monitoring the object store for unassigned pods, will assign this pod to a node. Subsequently, the kubelet, monitoring the object store for assigned pods, will execute this pod.
B
Please note in this diagram: there is a node name in the pod spec, which we will come to later, that assigns, or actually pre-assigns, a node to a pod before it is actually being scheduled. And please also note the scheduler name in the pod spec, used to assign a custom scheduler to a pod. We will entirely ignore custom schedulers in this presentation.
B
So, equipped with this knowledge of Kubernetes objects, we can now define that a pod p is assigned, or bound, to a node n via a binding b if and only if the binding's name equals the pod's name, the binding's namespace equals the pod's namespace, and the binding's target name equals the node's name. The existence of this binding signals to the kubelet on this node to execute this pod.
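The binding condition above can be sketched as a predicate. This is a hypothetical illustration, not the actual Kubernetes source: the dataclasses below are simplified stand-ins whose fields mirror the Binding, Pod, and Node objects (metadata.name, metadata.namespace, target.name).

```python
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    namespace: str

@dataclass
class Node:
    name: str

@dataclass
class Binding:
    name: str
    namespace: str
    target_name: str

def is_bound(pod: Pod, node: Node, binding: Binding) -> bool:
    """Pod p is bound to node n via binding b iff all three fields match."""
    return (binding.name == pod.name
            and binding.namespace == pod.namespace
            and binding.target_name == node.name)
```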
B
So, equipped with this knowledge, we may now formally describe the task of the scheduler s: the scheduler, for a pod p, selects a node n and creates a binding b so that the binding's name equals the pod's name, the binding's namespace equals the pod's namespace, and the binding's target name equals the node's name.
B
So this is a specification of the control loop of the scheduler in TLA+, the Temporal Logic of Actions. The control loop of the scheduler is actually fairly straightforward. Now, let me add this: the scheduler is a complex piece of software. It has many, many lines of code. It is not easy to understand. However, the specification of the scheduler is actually fairly straightforward, fairly simple: it is one loop, it has one if statement, and it selects a node for a pod.
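The control loop just described can be sketched in a few lines: one loop, one if statement, one selection step. This is a hypothetical, in-memory illustration; the dict-based "object store" and the placeholder select_node() are stand-ins, not the Kubernetes API.

```python
def select_node(nodes, pod):
    # Placeholder selection: pick any node. The real two-step
    # filter/rate process is discussed later in the talk.
    return nodes[0] if nodes else None

def scheduler_pass(store):
    """One pass of the scheduler's control loop over the object store."""
    for pod in (p for p in store["pods"] if p.get("node_name") is None):
        node = select_node(store["nodes"], pod)
        if node is not None:  # the single "if": a node was found
            # Creating the binding is what assigns the pod to the node.
            store["bindings"].append({"name": pod["name"],
                                      "namespace": pod["namespace"],
                                      "target_name": node})
            pod["node_name"] = node
```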
B
So the complexity of the scheduler is apparently in the selection process of the node for a pod. So, let's dive into that. Selecting a node is a two-step process: first, the scheduler selects a subset of nodes that are qualified to host this pod, and then, second, the scheduler selects a node from this subset with the highest rating for this pod.
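The two-step selection can be sketched as follows. This is a hypothetical simplification: the filters and raters arguments are assumed lists of predicate and scoring functions, not the scheduler's real plugin interfaces.

```python
def select_node(nodes, pod, filters, raters):
    # Step 1: keep only the nodes that pass every filter for this pod.
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None  # nothing qualified: the pod stays pending
    # Step 2: sum up the individual ratings and take a highest-scoring node.
    return max(feasible, key=lambda n: sum(r(pod, n) for r in raters))
```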
B
So, let's start with a simple one: let's start with a few sanity checks. This diagram depicts the relevant attributes of a node for the sanity checks. Most importantly, we have unschedulable, we have running, and we have ready. So here is the filter function under which the scheduler may assign a pod p to a node n.
B
A quick highlight: the tolerations are defined on the pod, and the taints are defined on the node. And just like before, we can define what taints and tolerations do with an actually simple expression: the scheduler may assign a pod p to a node n if and only if, for each taint that is an element of the node's spec taints, there is a toleration that is an element of the pod's spec tolerations so that the toleration matches the taint. Once again:
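The taint/toleration rule just stated can be sketched as a predicate: the scheduler may assign pod p to node n if and only if every taint on the node is matched by some toleration on the pod. The matches() helper here is a deliberate simplification; real Kubernetes tolerations also carry operators and effects.

```python
def matches(toleration, taint):
    # Simplified matching: exact key and value equality.
    return (toleration["key"] == taint["key"]
            and toleration["value"] == taint["value"])

def tolerates(pod, node):
    # Every taint on the node must have a matching toleration on the pod.
    return all(any(matches(tol, taint) for tol in pod["tolerations"])
               for taint in node["taints"])
```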
B
If this is not the case, the node is not considered qualified for hosting this pod. It is similar, always the same pattern, when it comes to affinity. Personally, I have to say, I had a harder time wrapping my head around taints and tolerations and an easier time understanding affinity. I was actually surprised to see that the data structures for taints are actually very little and innocent, while the data structures around affinity, and the formulas surrounding them, are actually much harder.
B
So once again, we have a formula: the scheduler may assign a pod p to a node n if and only if node affinity, pod affinity, and pod anti-affinity hold true. The node affinity holds true if and only if there is at least one node selector term in the pod's spec affinity node affinity so that the node selector matches the node. Same with the pod affinity: the pod affinity holds true if and only if all pod affinity terms in the pod's spec affinity pod affinity match the node. And last, for the pod anti-affinity: the pod anti-affinity holds true if no pod affinity term in the pod's spec affinity pod anti-affinity matches the node.
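The three affinity rules reduce to three quantifiers, which a short sketch makes visible. This is a hypothetical simplification: the term_matches argument stands in for the real nodeSelectorTerm / podAffinityTerm matching logic.

```python
# node affinity     -> at least one term matches (any)
# pod affinity      -> all terms match (all)
# pod anti-affinity -> no term matches (not any)
def affinity_allows(pod, node, term_matches):
    aff = pod["affinity"]
    node_ok = (not aff["node_affinity"]
               or any(term_matches(t, node) for t in aff["node_affinity"]))
    pod_ok = all(term_matches(t, node) for t in aff["pod_affinity"])
    anti_ok = not any(term_matches(t, node) for t in aff["pod_anti_affinity"])
    return node_ok and pod_ok and anti_ok
```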
You see the repeating pattern: as soon as a node passes this filter, it may be considered for scheduling; if it does not pass this filter, it is taken out of the list of candidates. So the filter functions select nodes, right, and then the rating functions assign a score to the node, very similar to the filter functions.
B
The rating function applies individual ratings to pods and nodes, right, sums up these ratings, and then, from the set of nodes, takes the highest-rated ones. So, rating functions: anybody want to see more formulas? Anybody want to see more graphs, anything like that? All right, enough graphs, enough math. For more details, please do visit our blog post. We will release a blog post after KubeCon that talks about the scheduler and will include the nitty-gritty details, but for now we can do a simple case study, right?
B
So we have a cluster, and on this cluster we have nodes with a GPU and nodes without a GPU. I'm pretty sure half of you are already sick of this example, but we're going to stick with it. And we have a set of pods that do not require the GPU, and we have a set of pods that require the GPU. Now, what is our objective here? Well:
A
B
So the first thing we're going to do is add a taint. As soon as we add the taint to the nodes with the GPU, none of these pods is eligible to run on any of the GPU nodes; they are only eligible to run on the nodes without the GPUs, since they do not have any toleration specified yet. Now, this is the first trip-up: as soon as you add a toleration to the pods that require GPU, you are not done.
B
The pods that require GPU may be scheduled on the nodes with the GPU, but nothing in the formula, if we remember the formula, tells the scheduler that they have to be scheduled on the nodes with the GPU. You need a different method, basically a different filter, for that. So the second thing you do is label these nodes in preparation for node affinity. Again, this didn't change any of the possible assignments just yet; you have to add affinity to the pods that require GPU.
B
So as soon as you actually do add this affinity, we have reached our mission statement, right? The pods that do not require GPU are only eligible to ever be scheduled on nodes without the GPU, and the pods that require the GPU are only eligible to be scheduled on the nodes with the GPU. Now, we did not venture into the rating functions. So, for example, we did not venture into preferred affinity and preferred anti-affinity, if you want to say something like: spread my workload of GPU pods evenly, and not just on one machine.
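The whole case study can be sketched end to end: the taint keeps ordinary pods off the GPU nodes, and node affinity, via a node label, keeps GPU pods off the ordinary nodes. The object shapes below are simplified, hypothetical stand-ins, not real Kubernetes manifests.

```python
def eligible(pod, node):
    # Taint/toleration filter: every taint on the node must be tolerated.
    tolerated = all(t in pod["tolerations"] for t in node["taints"])
    # Node affinity filter: every required label must be present on the node.
    affine = all(node["labels"].get(k) == v
                 for k, v in pod["required_labels"].items())
    return tolerated and affine

gpu_node = {"taints": ["gpu"], "labels": {"accelerator": "gpu"}}
cpu_node = {"taints": [], "labels": {}}
gpu_pod = {"tolerations": ["gpu"], "required_labels": {"accelerator": "gpu"}}
web_pod = {"tolerations": [], "required_labels": {}}
```

With both mechanisms in place, the web pod fails on the GPU node because of the taint, the GPU pod fails on the plain node because of the affinity, and each pod is eligible only on its own kind of node.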
C
B
OK, I actually do not have the ranking functions included in the presentation, so let me try to basically paint a picture without a slide. For the ranking functions, a common user-facing example is pod affinity and pod anti-affinity. You can require that a pod does not run on the same node as another pod, right? You can require that, and that is a filter function. So if the node is already inhabited by such a pod, the scheduler will not take this node into account.
B
However, when you use the preferred qualifier, it has the quality of a rating function. So if the scheduler finds a node that is uninhabited by the pod in question, it will automatically rank the node higher, and you also have the possibility to add a custom weight to that. However, if the node is pre-populated with the pod in question, the node will be ranked lower; but that will not stop the scheduler from scheduling it on this node if it doesn't find any other node that actually ranks higher.
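The difference can be sketched as follows: preferred (soft) anti-affinity acts as a rating function, so an inhabited node scores lower, optionally by a custom weight, but is never filtered out; the scheduler can still fall back to it. The object shapes are hypothetical stand-ins.

```python
def soft_anti_affinity_score(pod, node, weight=1):
    # Uninhabited nodes rank higher; inhabited ones score zero but stay eligible.
    return weight if pod["avoid"] not in node["pods"] else 0

def pick_node(pod, nodes, weight=1):
    # Highest score wins; an inhabited node can still win if it is the
    # only candidate.
    return max(nodes, key=lambda n: soft_anti_affinity_score(pod, n, weight))
```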
D
B
In this case, I could add labels to these pods, I'm sorry, to these nodes, and then specify a pod affinity for both of the sets of nodes. So you do have this possibility. There are some reasons why you would not want to do that. One, for example, in this case, would be:
B
You would actually have to label each and every node, and you would have to label each and every pod, and that means you have to alter the pod template of any deployment you have, any replica set you have, any cron job you have, any job you have. And yes, you could actually come to the same result. I'm not sure this is the case in all cases and under all circumstances, but in this case you could come to the same result; it just gets unwieldy very quickly.
E
B
Yes, this is absolutely true. So, from the point of view of the scheduler, GPU means nothing to it: the taint is an opaque value, but this taint matches that toleration, and in that case, in the label, there would probably be a mention of GPU. But for the scheduler, you are right: this means absolutely nothing. So this is entirely in your domain and in your responsibility. Yes:
B
This is actually a tricky question. So, number one, I believe yes; number two, I am not entirely sure which ones there are; and most importantly, I am not entirely sure about the rating behavior: what part of the rating behavior is actually part of the contract that the scheduler gives you, and what part is basically inside the scheduler and may change without notice from release to release. I have not uncovered that yet, but I do believe: yes, the scheduler does have a tendency to spread out.
G
B
Now, in the second step, the scheduler loops over the individual nodes and establishes the ranking for the node and the pod, and then will select a node from the set of nodes that scored the highest. And, if I'm not entirely mistaken, the contracted behavior is: if you have a set of nodes with the same ranking, like, for example, four or five nodes with the same ranking, the contracted behavior is that one is randomly chosen. I do not think you can...
B
So when we come to priorities, we actually also venture into a topic that we did not include in this presentation, and that is that the scheduler actually has the chance of preempting pods. Sometimes there is somewhat of a confusion, and I'm not entirely sure I cleared that all up, but there is some kind of confusion between eviction and preemption. So eviction is defined as the termination of a pod by the kubelet; it would do so under pressure. Take all of this with a grain of salt.
B
So if a pod has a higher priority but it cannot be scheduled, the pod will, I'm sorry, the scheduler will examine the workload across the cluster, may find pods with a lower priority, and if the scheduler determines that, if it terminates these pods, then the pod with the higher priority can actually be scheduled, it will preempt the pods. Now, preemption in this case, I believe, means that the scheduler sets the deletion timestamp of the pod.
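The preemption decision just described can be sketched as follows. Take this with the same grain of salt: it is a hypothetical, heavily simplified illustration, and the fits() argument stands in for real resource checks. If a pending pod fits nowhere, look for a node where terminating strictly lower-priority pods would make it fit.

```python
def find_preemption(pending, nodes, fits):
    for node in nodes:
        victims = [p for p in node["pods"] if p["priority"] < pending["priority"]]
        survivors = [p for p in node["pods"] if p["priority"] >= pending["priority"]]
        if victims and fits(pending, node, survivors):
            # Preempting means marking the victims for deletion (setting
            # their deletion timestamps), freeing room for the pending pod.
            return node, victims
    return None, []
```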
A
D
B
If you don't mind me asking you a question: what do you think about this style of modeling, this style of communicating about Kubernetes and about its individual components or functionality? If you have any thoughts, if you have any feedback on this style of modeling, I am super happy to hear it. It doesn't have to be now; we're going to stick around for a few minutes, so if you want to come around back, that would be cool. Thank you very much.
A
Yeah, related to that: currently the documentation is a bunch of descriptions, and you kind of have to read it all and piece it together yourself. So the idea is to have a much more rigorous way to describe the behavior, and this is why we're using this modeling approach, which will include these formulas as well as diagrams. So please, yeah, give us some feedback on whether this is a much better approach in terms of explaining how things work. Thank you.
B
If you're curious, it is a formal specification language that is called PlusCal, which translates to TLA+. That is a language designed by Leslie Lamport; it's a formal specification language, and it comes with a model checker, so you can actually check if your statements and invariants hold.
B
So, when it comes to that, it is part of what is called invariant-based design. You have an invariant about the state and the state transitions in your mind when you design an algorithm, and then you specify the algorithm and you check if your invariants actually hold true. So for this one, I previously had an invariant.
B
That said: if I start with a set of pods from the very beginning, then after the Kubernetes scheduler reaches a steady state, that is, there is nothing else to do, if there is a set of nodes that can host the pods, eventually every pod will be assigned to a node. I am not entirely sure if the way I modeled the Kubernetes scheduler is incorrect, or if this is actually part of the Kubernetes scheduler, but this is not a guarantee the Kubernetes scheduler gives you.
B
The model checker showed clearly that it can run into situations where, even though with a holistic view of the system every pod would fit on a node, it basically starves itself, because it made a few bad decisions in the beginning. And it was clear after that that I had to relax the invariant. I am fairly certain that this is how Kubernetes works, but I need to have it peer-reviewed. But I had to relax this invariant, and so Kubernetes does not give me any guarantee.
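A small, self-contained example illustrates the kind of starvation the model checker surfaced. This is a hypothetical toy, not the actual TLA+ model: a greedy one-pod-at-a-time scheduler can strand a pod even though a holistic assignment of all pods exists.

```python
def greedy_schedule(pods, capacity):
    """First-fit: place each pod on the first node with enough free space."""
    free = dict(capacity)
    placed = {}
    for name, size in pods:
        for node in free:
            if size <= free[node]:
                free[node] -= size
                placed[name] = node
                break
    return placed

# Two nodes with capacity 3 each. A holistic packing exists: {a, c} on one
# node (1 + 2) and {b, d} on the other (1 + 2). Greedy first-fit instead
# puts both small pods on n1 and strands d.
pods = [("a", 1), ("b", 1), ("c", 2), ("d", 2)]
placed = greedy_schedule(pods, {"n1": 3, "n2": 3})
```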
A
G
A
B
Correct. So, overall, no question about it: these processes, or these activities, are asynchronous. So you would either have to put a review process, like you just mentioned, into place, or, and this is also the reason why I usually strongly emphasize what is part of the contract: because as soon as it's part of the contract, I'm going to include it in the model and I'm going to include it in the invariants.
B
If it's not part of the contract, I will formulate it so that the model checker can actually choose randomly. Like, for example, when I said the highest-ranking pod: it will choose randomly. Of course, there is no random in Kubernetes, at least I didn't find one; I think it always takes the first one in the list or something like that, but for the sense of the model, this is now random.
B
So, as long as the contract stays in place, the model adheres to it, and if the contract changes, you have to change the model. Yes. If you are actually interested in this detail, I'll give a little bit of detail: there is something that is called model refinement. So you model behavior on a very, very high level, and then, step by step, you add refinements, and also prove that each individual step on a lower level is actually a behavior that is allowed on a higher level.
B
So
the
higher
level
is
the
most
abstract
one
right
where
you
could
go
so
far
and
just
say
the
invariant,
a
pod
shall
find
a
node
right
and
you
go
further
and
further
down
and
add
more
information
in
that
case
also
the
higher
your
abstraction.
If
is
the
more
its
longevity
right,
the
further
you
go
down
and
come
closer
and
closer
to
the
implementation.
Then
it
is
really
tied
to
the
lifecycle
of
this
component.
C
So I, for one, have found your presentation very helpful, so thank you, and I think it should be extended to other components too, for example, like the kube-proxy or the kubelet. But I totally agree with the gentleman about keeping these in sync and making sure that, you know, the formulas are there, put in the documentation.
H
The current scheduler architecture schedules pods one by one, right? But in other scenarios, like big data or machine learning, it's better to schedule all the pods in a group. So what's the recommendation for this use case? I'm not sure: will the next scheduler version cover those use cases, or is it better to extend the scheduler ourselves?
B
I am actually, unfortunately, not qualified to speak to that, so I strictly limit myself to modeling how the scheduler works right now. There is a very interesting detail to your question that, since I don't want to kill everybody's time, we can talk about after this presentation, but yeah.