From YouTube: GMT 2018-07-26 Containerization WG
C
Yeah, thank you for allowing me to talk about this on such short notice. I pretty much decided to talk about this yesterday, mostly because I needed to talk with the people working on the containerization side, so this seemed like a pretty good time slot. I didn't really get too much time to prepare this.
C
So this time we were a little bit lucky in the log analysis, and I found a very good correlation with the problem. Let me explain the problem first. The issue we observed here is that it seems like there is some logical possibility that the Nvidia GPU isolator kind of leaked a GPU card that was previously allocated.
C
That is why we see log lines like "allocate failed: requested 1, but only 0 available". The "1" there is the number of GPU devices the agent thinks the isolator has, so the agent asks it to allocate, but the isolator itself cannot allocate anymore because there are only zero available. And this prevents any container with a GPU card from being created and properly provisioned. We have even seen cases where it requested 2, but only 1 was available.
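To make that mismatch concrete, here is a minimal sketch in plain Python with invented names (this is not the actual Mesos Nvidia GPU isolator code) of the bookkeeping being described: if a cleanup never runs, the agent's view of free GPUs and the isolator's own free set diverge, which is when a "requested 1 but only 0 available" style error shows up.

```python
# Hypothetical sketch of the bookkeeping described above; names are invented
# for illustration and do not come from the Mesos source tree.

class GpuAllocator:
    def __init__(self, gpus):
        self.available = set(gpus)   # devices the isolator can still hand out
        self.allocated = {}          # container_id -> set of devices

    def allocate(self, container_id, count):
        if count > len(self.available):
            # The agent still believes a device is free, but the isolator
            # never got it back, so the allocation fails.
            raise RuntimeError(
                f"Requested {count} but only {len(self.available)} available")
        devices = {self.available.pop() for _ in range(count)}
        self.allocated[container_id] = devices
        return devices

    def deallocate(self, container_id):
        # If destroy short-circuits before cleanup, this is never called and
        # the devices held by the dead container are leaked.
        self.available |= self.allocated.pop(container_id, set())
```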
C
Even that kind of thing has happened. Anyway, in the last episode, which was roughly Wednesday, I think it was Tuesday, we analyzed the logs in our cluster and we found a very good correlation. Only, this is not my log; this is the log attached to the original bug report, and if you scroll down to the bottom of the comments, the bottom of that ticket, I saw this line.
C
"Termination of executor ... of framework ... failed: Failed to kill all processes in the container: timed out after 1 minute." And if you look at the next comment, the person who replied there is the previous author, I believe, and they actually acknowledged that this condition might be problematic, because we are short-circuiting all the other isolator cleanups in this case. So it seems like there is a pretty strong logical explanation here, in that the containerizer, the launcher, was not able to terminate the other processes.
B
Makes sense, because if you fail on launcher destroy... So the way we destroy a container: we call launcher destroy, and then we call isolator cleanup, and then we call provisioner destroy. So if launcher destroy returns a failure, that short-circuits, and then we will return a failure for the destroy, and you can no longer clean up the resources in the isolators. So I think right now we need to revisit the way we resolve this issue.
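A minimal sketch of the destroy sequence just described, with invented names and plain synchronous Python rather than the real future-based code: a launcher failure returns early, so the isolator cleanups (including the GPU one) and the provisioner destroy are never reached.

```python
# Simplified, hypothetical control flow for container destroy; it mirrors the
# sequence described in the discussion, not the actual Mesos implementation.

def destroy(container_id, launcher, isolators, provisioner):
    # Step 1: ask the launcher to kill every process in the container.
    if not launcher.destroy(container_id):
        # A failure here looks like "Failed to kill all processes in the
        # container: timed out after 1 minute", and returning at this point
        # short-circuits everything below.
        raise RuntimeError("launcher destroy failed")

    # Step 2: isolator cleanup, only reached when the launcher succeeded.
    for isolator in reversed(isolators):   # cleanup runs in reverse order
        isolator.cleanup(container_id)     # GPU devices are released here

    # Step 3: tear down the provisioned root filesystem.
    provisioner.destroy(container_id)
```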
B
So if I remember correctly, we have an order of isolators to do the prepare, and then we use the reverse order for the isolators to do the cleanup. Which means, I forget whether or not we put the host volume support in as a separate isolator and moved it out from the Linux filesystem isolator. So we need to identify whether the host volume is cleaned up first, or whether we do the GPU isolator cleanup first. So it might be possible that a failure in another isolator blocks the GPU isolator from cleaning up.
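To illustrate the ordering question, here is a small hypothetical sketch: isolators prepare in a fixed order and clean up in the reverse order, so which of the host-volume and GPU isolators cleans up first depends entirely on that list. The list below is an assumed placeholder, not the verified order in any Mesos release.

```python
# Illustrative only: this order is an assumption for the sketch, not taken
# from the Mesos source; checking the real order is exactly the open question.

ISOLATORS = [
    "filesystem/linux",    # built-in isolators come first...
    "volume/host_path",
    "gpu/nvidia",
    "custom/module",       # ...custom modules are appended after the built-ins
]

def prepare(container_id):
    for name in ISOLATORS:
        print(f"prepare {name} for {container_id}")

def cleanup(container_id):
    # Reverse order: with this assumed list, gpu/nvidia cleans up before the
    # host-volume isolator, and a failure earlier in this loop could keep the
    # GPU isolator from ever running its cleanup.
    for name in reversed(ISOLATORS):
        print(f"cleanup {name} for {container_id}")
```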
C
Those mounts are actually NFS mounts, prepared and created out of band, but occasionally the code, the Puppet code which sets up these NFS mounts, fails out of band, without the Mesos agent being aware of it. Then these NFS mounts are not visible, not present on the agent host, but our users, who create these containers using those mounts, didn't know that they were broken.
A
Gilbert, do you know whether those mounts, like all of the host mount isolators and such, are ordered after the GPU isolator? I forgot the exact order, but I think so.
B
In 1.6, I think in 1.6 we have a fixed order for the isolators, and all the custom modules, they go after all those built-in isolators. So if you go to the containerizer's create method, you can find the right order for those isolators. There was some refactoring in 1.6, so we need to go back to the right version of Mesos, but they are in some specific order, and the GPU isolator is among them.
B
We need to make that mount, as I said, a shared mount, and we also need to create a separate mount namespace. So basically the GPU isolator depends on those two things from the Linux filesystem isolator, but I forget which position it is located in. Since there is some dependency, we need to double check on the particular version, okay.
B
So I think for this issue, I don't know whether you guys see the error messages from the other isolators' cleanup, but the thing I would like to make sure of is: if it is stuck at launcher destroy, then you guys should not see those messages, and no isolator cleanup will be called at all. So we need to figure out why we fail to kill the process.
B
So if it is stuck at any host volume mount, then we will still see those messages, and then we need to understand whether it is related to the NFS mount, and we can figure that out. So basically, for this issue, we need to spend more time to do the triaging and then collect more agent logs. Do you guys have any consistent way to reproduce it?
No, it is not really consistent.
C
The way it happens, the spread of how it happens: it is roughly every month. I don't think it is evenly distributed; it seems like occasionally this happens on one of the machines, and then quickly, within a couple of hours, we'll see maybe half of the cluster. I'd say within a couple of hours we see about 50 percent, half of the cluster's machines, run into a similar problem.
C
If those containers are not killable, even by a SIGKILL, it is possible that the destroy cannot finish within a minute. But then we still consider the task is, I think, in a terminal state, and we tell that. Pretty much the issue here is that the containerizer tells the agent a failure response.
B
I think, yeah, I think initially I expected we could find the root cause; even if it is from another component, we should find the root cause, or maybe a bug on our side. And instead of, like, going ahead and adding some retry logic, I think we could consider that for sure, but we should first understand what the problem is, alright.
C
Yeah, I don't disagree with that, but what I want to discuss here is more general, just to confirm some design principles here. Say, do we kind of expect that we will be able to reliably terminate any process in a container, so that we know cleanup should always be called? That seems like a little bit of a dangerous design principle, because for a lot of things it is difficult to assume they can always be reliable. I mean, the timeout, the one-minute timeout thing: it feels like there is no room to configure it, that is point one. And also, if you cannot do so for any particular reason, when serving all of these heterogeneous workloads, I think there should be some sort of error being emitted that is carefully charted, so people can act on those.
C
The error, the log line we were reading above, you mean we try to monitor that? Yeah, we can monitor that log line for sure. We are probably going to do some temporary remediation: if we see that log line, we just do a hard restart, just restart the Mesos agent from systemd. We are considering that approach in house, but I don't think it is a scalable solution, yeah.
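For reference, the temporary remediation being described could look roughly like the following watcher; the log path, systemd unit name, and message text are assumptions for illustration, not values taken from the deployment discussed here.

```python
# Rough sketch of the in-house remediation idea: tail the agent log and restart
# the systemd-managed agent when the kill-timeout message shows up. The path,
# pattern, and unit name are assumed; adjust for the actual installation.

import subprocess

PATTERN = "Failed to kill all processes in the container"
LOG_PATH = "/var/log/mesos/mesos-agent.log"     # assumed log location
AGENT_UNIT = "mesos-agent.service"              # assumed systemd unit name

def watch():
    with subprocess.Popen(["tail", "-F", LOG_PATH],
                          stdout=subprocess.PIPE, text=True) as tail:
        for line in tail.stdout:
            if PATTERN in line:
                subprocess.run(["systemctl", "restart", AGENT_UNIT], check=False)

if __name__ == "__main__":
    watch()
```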
B
I think, I think that is totally up to the operator; that is an option for the operator. Basically, I have never seen this error message before, and I suspect it might be related to some internal kernel problem. And basically, I do not see where the timeout is configured; I think this might be a default timeout somewhere in the system, yeah.
C
So actually, if you ask me, the fact that the agent cannot kill this process within a minute is not that big a deal to us, if the agent kind of retries with some interval and eventually manages to kill it. At least that is okay in our stack. What is really bothering us right now is that we give up trying after the first failed attempt, and things get leaked.
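As a rough illustration of the behaviour being asked for, a hedged sketch with a hypothetical kill_all_processes helper (this is not a proposed patch): keep retrying the kill on an interval instead of giving up, and skipping cleanup, after a single one-minute timeout.

```python
# Hypothetical retry wrapper; kill_all_processes() is an assumed callback that
# returns True once every process in the container is gone.

import time

def destroy_with_retries(container_id, kill_all_processes,
                         attempts=5, interval_secs=60):
    for attempt in range(1, attempts + 1):
        if kill_all_processes(container_id, timeout_secs=interval_secs):
            return True                      # safe to run isolator cleanups now
        print(f"attempt {attempt}: kill timed out, retrying in {interval_secs}s")
        time.sleep(interval_secs)
    return False                             # still surface the failure eventually
```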
B
I understand your motivation, like, we want to resolve this issue, but it feels like adding a retry is just a workaround; we still don't understand why. I think we could consider that option, but we need to figure out what the root cause is, okay?
Well, yeah, I think right now we have been lucky.
C
In the sense that we can access people's code. There will be operators who would not even have access to the code of the people running things in the container; it is going to be even harder for them to figure out what was wrong. Unless we have high confidence that there is only one way of leading to such issues, which I kind of doubt, with the different combinations of Linux kernel versions and systems.
C
Actually, this is the first time I started to focus on the incorrect termination of the previous container; we were not looking at this previously. Our hypothesis was about an internal bug of the isolator, the GPU isolator itself, so we were focusing our limited attention there, and we could not really find any log that backed those assumptions. So that is why. And we had to recover the cluster quickly, because our customers want to use the GPU cards. This is the first time we are starting to focus on this.
B
Yeah, because personally, personally I am hesitant to go ahead without knowing the root cause and then go ahead and add the retry logic; I think it is like some random choice, and it might happen that the retry fails in the same way if we don't understand what the root cause is. So would you guys maybe keep watching, and once you get an alert, try to collect the running process IDs and see what kind of thing it is?
C
Sorry, let me process that. We were never focusing on the failed destroys, and our users specifically rarely care about that; nobody was monitoring that. So if we do that, we would probably need to deploy some infrastructure, say run some utility scripts, basically one that watches for the container destroys. Alright, I see.
C
The mounts are mostly, so previously we were loading both the executables and the data, for, say, a lot of workloads similar to machine learning training. It may not be deep learning training exactly, but kind of pretty similar code, like offline data processing pipelines; the code actually loads both the executable as well as the data from the NFS mount and then runs, executes them directly.
C
This past week, at least, we can say it is likely this thing is related to either the containerizer or the internals of other isolators, which is kind of internal to the containerizer either way, yeah. Previously I was working with a colleague on this, but he is not that familiar with the internals of the containerizer. Eh, okay.