From YouTube: CAPI e2e Deep Dive - 2023-05-11
Cluster API e2e test deep dive session discussing CAPI issue 8641
A: Okay, it's the 11th of May 2023. This is a Cluster API e2e test deep dive session. We're going by the Kubernetes code of conduct, which comes down to: really, be nice to each other, yeah. Today we're just going to go through a couple of tests that are flaking on the Cluster API CI, and see if we can get to the bottom of why.
A: So first, I was just looking at some of the issues we have here. The first one: GitHub rate limiting.
A: I think we've spoken about this one before, but we're hitting the rate limit at GitHub, because clusterctl calls GitHub when it's trying to find out what versions there are and to download manifests to use in our tests. I have an issue open over at test-infra; it's not linked there. Maybe I need to check, yeah. Let me check to link it back. Basically, we're trying to get a GitHub token created that will hopefully allow us to not hit that rate limit in clusterctl.

A: If we can't get a GitHub token, I think our other option is cert-manager; that's mostly what causes us to hit that limit. So, yeah, we may need to vendor that YAML, but that's not ideal for a couple of reasons. That's this one.
A: This one we did a deep dive on last week. Christian has, I think, probably a fix, yep, and he's probably waiting for me to review it. So if anybody else wants to review this, please do, and, yeah, hopefully we can get this merged soon; that'll help us with flakiness. And then we come to what I want to talk about today. I just created this issue earlier.
A: We see this issue fairly regularly. It's probably the most common single error in our flake triage right now, with some intermittent spikes of the rate limits, which I guess are correlated with each other, because when one happens, a bunch of them probably do in the same time period. So: "No Control Plane machines came into existence."
A: This means we're failing at creating a cluster at all. No control plane machines means absolutely no cluster: no API server and so on. So let's look at it in, I think, the triage dashboard.
A: Oh, here are two of them from triage. Let's look at the overall triage first. So the two failures today look like "No Control Plane machines came into existence."
A: These are the jobs it's been failing on. It's failed in six jobs in the last... I'm not sure... oh, since the 4th, or since the 27th of April, I think. These jobs are all on 1.4, but they're split between the e2e tests and the main e2e tests. And then, if we scroll down here, we have them.
So this is, sorry, another very similar error. I'm not sure exactly how this triage works, but it makes some of these errors, which look very similar, sometimes be split up, yeah.
A: This is obviously the same type of error, and it's also happening here on 1.3 and on main. So this is dispersed across all of our end-to-end jobs.
If you look at the actual tests it's failing on, it's also failing all over the place: a couple of upgrades, node drain timeout, upgrading a workload cluster, testing on self-hosted clusters. This is just happening in general across all of our tests.
A: We need to look at the commonalities between the tests. Some of these tests, such as the upgrade tests, create multiple clusters during their run, but most of the tests only create one cluster. And, by the way, the way I have my screen share set up, I cannot see the chat and I can't see if people are putting their hands up. So please just interrupt me if you have any questions or anything to say, or ideas, or anything to contribute at all.
A: So we need to look at the commonalities. We use a single function, ApplyClusterTemplateAndWait, across nearly all of our tests, probably all of our tests, to create the initial cluster we're going to test. So it could be at the code level, in that function, or it could be something to do with the test infrastructure itself, Prow and GCE, or it could be part of CAPD: we use CAPD, the Docker infrastructure provider, underlying all these tests.
A: So this issue of the control plane machine not coming up could be an issue in CAPD, or it could be some combination of the above. I just picked out four of these, or five of them; these correspond to the first five here. And I think the first thing I wanted to show was just how to actually download the artifacts.
A: This pulls the artifacts from the kubernetes-jenkins bucket. I think I already have it here. So basically, this uses gsutil, the Google Cloud Storage utility. You can look it up; I'll also paste this into Slack. We're just copying from the Google Cloud bucket, which is of this form, into a temporary directory. I've already done this, so I'm going to skip it now. Or... what are we seeing? Is it working? It's not working.
A: I don't think so. The Kubernetes stuff is all available on the web; I don't think I've logged into Google Cloud. But if there were any authentication on the bucket, you'd use the same approach as with S3 or, I guess, any other object storage: you can authenticate with your account. But I think the Kubernetes stuff is freely available.
B: Sorry, these buckets are world-readable, read-only readable, so you don't need authentication for that. It would be super helpful to get that command pasted to the Slack, yeah.
A: We should also probably put that command into the book somewhere, because, yeah, I had a hard time putting it together as a snippet, because I haven't done that in a while and I didn't have it noted anywhere obvious.
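A concrete sketch of that copy command (the job name and build ID below are placeholders, not the actual run from this session; substitute the real ones from the Prow link):

```shell
# Placeholder job name and build ID, for illustration only.
BUCKET="gs://kubernetes-jenkins/logs"
JOB="periodic-cluster-api-e2e-main"
BUILD="1857600000000000000"
DEST="/tmp/${BUILD}"
mkdir -p "${DEST}"
# -m parallelizes the copy, -r recurses into the artifacts tree.
# Printed rather than executed here; drop the echo to run it for real.
echo gsutil -m cp -r "${BUCKET}/${JOB}/${BUILD}/artifacts" "${DEST}"
```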
A: Yeah, whatever was going on there, where it was chunking the file or something. I'm just going to cancel this, because it can take a while, depending on how fast the Google bucket serves you.
A: So I put it under... there's a tmp directory in the CAPI repo, which is a very useful place if there are YAMLs and stuff you're working with, because it's in the gitignore and everything. So it's a place to just dump stuff you're using while you're in the repo, but you don't necessarily want git to know about it.
A: So I've just gone to the artifacts for one of them. This folder is 18576.
A: Yes, that corresponds to this run, and we want to find "No Control Plane machines came into existence." So the first thing here is the build log, which is what comes up here.
So this is the build log itself. Search: "No Control Plane machines came into existence." Let me make this bigger for a second. Is the text okay for everybody, or do you want me to make it bigger?
A: So the build log is here: "No Control Plane machines came into existence." That's what we're looking for. What we want to find, in order to drill down on this, is what test it happened in and what the name of the cluster is. So here's our namespace, here's our cluster name.
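Once the artifacts are downloaded, that drill-down can be done locally with grep. The log file below is a fabricated two-line sample standing in for the real build-log.txt, and the cluster and namespace names in it are made up:

```shell
# Fabricated sample standing in for the downloaded build-log.txt.
cat > /tmp/build-log-sample.txt <<'EOF'
INFO: creating cluster "md-scale-abc12" in namespace "md-scale-xyz34"
[FAILED] No Control Plane machines came into existence
EOF
# -n prints line numbers; -B1 keeps one line of leading context,
# which is where the namespace and cluster name usually show up.
grep -n -B1 "No Control Plane machines came into existence" /tmp/build-log-sample.txt
```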
A: Let me explain this. We have this artifacts folder. "clusters" here contains... let me just open another one. "clusters" contains items from the actual clusters, particularly logs from the machines. So from the machine pool test we have machines; we only have one machine that we logged from. There's containerd log information, stuff like the containerd config, the journalctl log. I'm not sure what current.log corresponds to.
A: And the kubelet log, which can be super useful, and again kubelet version stuff, and a lot of the normal artifacts we know. That's what we usually have here, but we couldn't get anything, because we got no control plane machines. Our CAPI objects are in the bootstrap cluster.
A: So this "zero zero cube hook" one; that's the cluster template we applied.
A: Sorry, this is the namespace. So in our bootstrap cluster, the resources, which is where our CAPI resources are... they're under this namespace, and there we have our cluster, "zero zero cube hook." This is the cluster state that we take sometime towards the end of the test.
A: Okay, if anybody has a question... okay, we'll continue. This cluster state is picked up towards the end of the test. It's not always the exact final object, and sometimes, we saw this last week, it's actually scraped after the test fails, so you end up with a slightly misrepresentative state here. But the first thing we check here is maybe the status conditions: scaling up control plane to one replica.
A: Yeah, there's not much else to look at there, to be honest. Presumably we did get machine deployments, we did get machine sets, we did get a control plane. Let's have a look at the control plane and see what's going on there. Same message, right? It's just scaling up to one replica, ready: false, no other action there. Maybe look in the machine itself.
A: So at this point we could start going into the logs here, which is kind of what we were doing last week, and, let's say, search for the cluster name.
A: But we do have this super cool tooling in Tilt. I have it running here in the background; I set it up sometime last year, when we worked on logging, so I might link that issue in the thread later on.
A: We still have open issues around logging; we're still trying to improve it across release cycles, and we have a bunch of tasks up there to kind of improve our logging. But one thing that Stefan, I think, did last year was allow you to take a link like this. So this is the Prow job that we failed on, the run ending in 576, and I went to Tilt, so let me just show that for a second, sorry.
A: Right, yeah, I'll make sure to link that as well. We can deploy observability in Tilt; we also have, like, Prometheus and stuff, but I just need Grafana and, locally, Loki. And then, once we do, we can put a link to any GCS file result there. You can also link, I think, directly to the artifacts page and press "import logs."
A: So I'm running on a local cluster, running tilt up. This is what I run, like, every morning: delete clusters, then hack/kind-install-for-capd.sh, which is a helper script in the CAPI repo, then tilt up. In my tilt-settings file I have grafana and loki under the observability section, and I think the addendum here is you want to also run promtail if you want to import logs. There is a section of the book that we can link in the thread, but yeah, that's all built into the CAPI repo.
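For reference, the relevant fragment of a tilt-settings file might look like this (a sketch based on the CAPI book's Tilt docs; check the book for the current option names):

```yaml
# tilt-settings.yaml (fragment)
deploy_observability:
- promtail   # ships container logs into Loki
- loki
- grafana
```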
A: We get the controller logs. Well, let's do a query. It's slow. The name of our cluster was...
A: Wait, here it is. I think they keep kind of improving this and changing it, but these are all the label/value pairs in our logs from klog. So there's app, controller, node name, stuff like that, and I was just looking for the raw name of the cluster. But, sorry, the logs are decorated. So this is a single log line in JSON format: controller, controllerGroup, controllerKind.
A: So this, for example, has the cluster. Most of our logs should be decorated with the cluster they relate to. So normally you look at whatever controller; you can also look across controllers, and you can query Cluster.name and put in your cluster name. You can also look at the namespace. That's pretty useful for the e2e tests, because, for the most part, there's only one cluster per test, so the namespace is roughly equivalent.
A: The only different type is the upgrade tests, which create multiple clusters, because they create, yeah, a second bootstrap cluster. So, just to show this.
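The kind of query being described might look like this in LogQL (the app label and cluster name are assumptions for illustration; Loki's `json` stage flattens nested keys with underscores, so `Cluster.name` becomes `Cluster_name`):

```logql
{app="capd-controller-manager"}
  | json
  | Cluster_name = "md-scale-abc12"
  | line_format "{{.msg}} {{.err}}"
```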
Some of these are erroring out; I'm not sure why. These are JSON logs, so this red here means that they're incorrectly formatted as JSON. Let's just click into one of them.
So you can do this in the query box, or I can do it down here through the UI. I don't care about most of that stuff; what I want is the message.
A: Perfect: message and error. So now, for the lines that go down here, I only have the error: "failed to preload images into the docker machine." Okay, there's our first clue: "the container is not running."
A: I appreciate this might be pretty small. Are people able to follow along here, or should I go back to maybe the IDE window, where things might be a bit bigger? I don't think we're going to get further than this, so let's go back to the IDE window.
We don't have Docker machines here, but possibly they get cleaned up, because...
A: I think we've got a machine, but yeah, I guess KCP... if this fails... this is on 1.4, so the KCP remediation thing should be there, right? What am I not seeing? But previous to the KCP remediation feature, if the first control plane machine failed, then KCP just doesn't retry; it just says, well, your cluster bootstrap failed, and it doesn't try to bring the control plane up again.
A: And we can look for... okay.
A: Okay, that's actually useful, so let me go back to this one. Thanks! I never looked at that. I didn't realize those loggers were there.
A: Okay, so from the server reconcile... I'm watching DockerMachines, which should be created by the machine... sorry, by KCP. In this case, KCP should look at the template that's in the KCP object's machine template and create the DockerMachine out of that, for the DockerMachine controller to pick up. It'll add owners, which is fine... this is just the owners for the log, actually, and then it has owner references, probably.
A: Then the unfortunately named docker.Machine, which I think is just an abstraction that we use inside the Docker infrastructure provider's DockerMachine controller to pass information around. So it has the cluster name, the machine name, the IP family, the container, which again I think is just a helper... yeah, it's just another helper, and the node creator, which is an interface to actually create the node from the machine.
We just do that here, and if that's not equal to nil, we return early. So after everything is set up, we normally return here; in our case, with this error, we're not properly set up.
A
We
got
down
to
if
not
external
machine
don't
exists.
This
is
probably
the
only
time
this
happens
so
once
we
create
our
first
control
plane
machine,
we
clear
our
first
control
plane,
Docker
machine.
We
check.
If
it
exists,
it
doesn't
exist.
We
call
our
external
machine,
which
again
is
this
Docker
dot
machine,
which
is
our
abstraction
we
do
create
and
we
get
failure.
A
B: But this is picked by kind; this is not picked inside Docker, right? We just set it: if it's zero, it just gets determined here by net.Listen.
B: Just to recap, there are two ways to go down now: one is to check why this overlap happens; the other way would be to try to find a workaround, if we can't get a better fix, right?
A: Okay, but we definitely have a place to drill down on. That's actually much more than seemed likely from the "No Control Plane machines" message, and it's likely an issue somewhere between CAPD and the OS, not actually in Cluster API code or in our testing code, which is nice.
A
Because,
like
it's
still
as
flake
as
it
is,
but
at
least
we
don't
believe
that
people
who
are
actually
running
this
in
production
are
suffering
from
the
issue.
E: I have a question. In the first issue that you opened, can you open the Prow job? I want to ask something. Are these errors just, like... does it recover itself, or is it an actual error?
A: So it is an actual error. Let me show you, one second. Yes, just on Testgrid; I'm gonna look at some of this stuff.
A: We've enabled... Ginkgo, which is our test runner, has fail-fast on, which means that if it detects one failure, it stops all the other jobs. We've disabled this on, for example, the CAPI e2e main job, so we get that coverage, because before, if one test failed, we got much less coverage of what was actually flaky. And actually, I'm planning to make it so that all of our jobs disable fail-fast, but I just haven't finished that yet.
A: Thanks for reminding me. But what's happening here is that we're, like, randomly deleting all of our Kubernetes resources while tests are still running, so you're getting a bunch of these issues: "failed to list pods," "certificate signed by unknown authority." But they're basically because the API server might be down; these are just noise, yeah.
B: So in our tests we also have some informers open, which try to... or things which try to, for example, stream logs, and these will also cause error messages if, yeah, the resources are not there anymore or something like that. In this case, it seems like the API server got deleted or something like that.
A: I need to dive into the API of net.Listen and Listener.Close to find out if there's a way to get a signal that something has actually been released, if we can wait for that, or if we can just wrap it in a retry and see if that's okay. It's not an ideal solution to just accept that we can't ever get exclusive access to a port, but maybe there's some check we can run: is the port free, or something. That's this one.
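The "wrap it in a retry" idea could be sketched like this in shell; the real fix would live in the Go code around net.Listen, so this is just the shape of it, and the port number is an arbitrary example:

```shell
# Retry until nothing is listening on the port, up to a retry budget.
wait_port_free() {
  port=$1
  retries=$2
  i=0
  while [ "$i" -lt "$retries" ]; do
    # A failed connection attempt means nothing is listening,
    # i.e. the port is (probably) free.
    if ! (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

wait_port_free 54321 3 && echo "port appears free"
```

This is only a probe, not exclusive access: another process can still grab the port between the check and the use, which is exactly the race being discussed.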