From YouTube: Kubernetes SIG Network Bi-Weekly Meeting for 20220203
A: There are a whopping four for us to look at today. There were 13 earlier; there's been some triage there.

B: Yes, I noticed many people doing triage, so I just filtered those out. Anybody who went through, pinged theirs, and asked for original-poster feedback, I filtered those out of this list, and I think I closed one that was just long idle.
B: So here are the four that I think are left worth talking about. The first one is pretty clearly a bug in the describer for Ingress. I actually added the help-wanted label and I'm going to take triage off. I left it open here just to throw out that this seems like it should be a relatively easy bug, if people are looking for things they can contribute to.
B: This is maybe a good starter bug: just go look at the describer and figure out if we can maybe delete those lines and update the test, if it was even tested. Hopefully a small one, an easy way for somebody to score a point.
B: I looked around and I can't find anywhere that we actually enforce this. It's in some of the implementations, and that's fine. If the describer is true to the resource that it's reading from the API, then the describer will show the correct thing if the implementation made that choice; but we don't need to pretend the implementation made the choice when it didn't.

A: Okay, fair enough.
B: And this one — seeing Prashant's name on this brings back some memories. That was a long time back; 2015 feels like forever ago. So I will leave this one open; I accepted the triage on it. It should be a good one for someone who wants to jump in, if you've got someone new in your org or in your orbit that wants a good bug to start on. Here's a fun one.
B: Second was a flaky test. This one is clearly a GCP thing; it looks like we leaked a load balancer. Bowei is assigned — I'm not even sure why it didn't close, why it didn't just close the window, but Bowei is assigned, I think it was. I left it open just to ping Bowei, to remind him that he's assigned to this one.
B: Okay, and the last one, then, was "better availability of deployments with replicas equals one." I suggested that it's actually a dupe of the external and internal traffic policy issues, so I will take the action to mark it as a dupe and close it out. That's it.
A: All right, on to the agenda then. Let's see — we first have updates from the kpng APAC working group. Who wants to take that one?
D: Hi everyone, I'm Rajesh Kapoor. This is actually my first time joining the SIG Network meeting. (Welcome, welcome!) Thank you. I work at VMware, and I've contributed to SIG Testing before; of late I've been helping with the kpng working group, so I have a couple of updates on where we are with kpng. I'll share my screen and walk you all through it.
D: Are you all able to see some yellow slides? (Yep, looks good.) Cool, okay. So what we did with kpng was: a bunch of us who happened to be in the APAC time zone got together and formed this working crew, apart from the kpng main group, to work during APAC hours. This happened somewhere around December last year, and the goal was to get CI working for kpng.
D: You know, focus on a couple of backends — iptables, IPVS, userspace — and there are a bunch of us in this group who are new to the stuff. So what we do is meet every Wednesday at around 7 p.m. IST, which happens to be about 5:30 a.m. PST, and we hack on stuff. Most recently we've been trying to fix test cases and test failures on the iptables and IPVS backends caught in the CI, and things like that, over a period of two months or so.
D: This is where we are currently: we have GitHub Actions working for kpng. Thanks to folks like Neha, Friedrich, Anusha, and Douglas — they've been super helpful in getting these hooks around CI to test backends like iptables. This basically runs e2e conformance and SIG Network tests as of now. Apart from that, if someone wants to try out kpng without running e2e tests and things like that, they should probably use this script.
D: Okay. Apart from focusing on the backend side of things, we got the Windows userspace backend in. It's not yet integrated with the CI, but I think we're getting there, from the IPVS and iptables point of view.
D: We caught a bunch of e2e failures, and Neha and Anusha have been working with Vivek and Hanuman on fixing these issues. Of late we have some things around session affinity, and these backends still don't have unit tests. So if anyone wants to start contributing to kpng and see how these backends are written in kpng, I think this is a great opportunity to get started.
D: Apart from that — sorry, these are the old slides, yeah. Apart from that, on the userspace backend front, there's a PR in flight. We have the changes in, but there are a couple of e2e tests failing, and Jay is still trying to fix that; I'm basically helping Jay get userspace into shape. Most recently Pallavi joined us to help out, and Mikael also introduced this DiffStore to us in the nft backend.
D: So if we look at the backends, nft is using sort of a full-state model here, as opposed to how iptables and the others, in the current implementation of kube-proxy, use the ServiceChangeTracker data structures and things like that. nft is doing something different, wherein it gets the entire state of the cluster, with all the services and endpoints, and uses this DiffStore, which basically happens to be a B-tree under the hood.
D: It's trying to get a diff of what has changed and then write the nftables rules. Jay, feel free to pitch in, because I may not get the details right here. So this is something new that has been introduced, and it's super fun code to go through. If anyone out there is looking to contribute to documentation and things around DiffStore, that's also a great opportunity.
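The full-state model described above can be sketched roughly as follows. This is only a toy illustration of the idea, not kpng's actual DiffStore (which is a B-tree-backed store in the kpng repo); all the type and function names here are invented for the sketch. Each sync receives the complete desired state and diffs it against the previous snapshot to produce create/update/delete events, instead of tracking incremental changes the way ServiceChangeTracker does.

```go
package main

import "fmt"

// State maps a key (e.g. a service name) to a serialized value.
type State map[string]string

// Event describes one change needed to move from the previous
// snapshot to the new full state.
type Event struct {
	Op  string // "create", "update", or "delete"
	Key string
}

// Diff compares two full snapshots and emits the minimal set of
// change events; rule writers (e.g. an nftables backend) then only
// touch what actually changed.
func Diff(prev, next State) []Event {
	var events []Event
	for k, v := range next {
		old, ok := prev[k]
		switch {
		case !ok:
			events = append(events, Event{"create", k})
		case old != v:
			events = append(events, Event{"update", k})
		}
	}
	for k := range prev {
		if _, ok := next[k]; !ok {
			events = append(events, Event{"delete", k})
		}
	}
	return events
}

func main() {
	prev := State{"svc/a": "v1", "svc/b": "v1"}
	next := State{"svc/a": "v2", "svc/c": "v1"}
	for _, e := range Diff(prev, next) {
		fmt.Println(e.Op, e.Key)
	}
}
```

The appeal of this design is that the backend never has to reason about event ordering from the API server; it only ever reconciles two complete snapshots.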
D: So yeah, that's all I had. If you want to add anything to this, Jay, feel free to go ahead.
A: All right, let's see. Next up: FQDN support for network policy.
G: What I wanted to get into right now was basically to get a thumbs-up or thumbs-down on whether we want to add FQDN support in-tree, as part of one of our built-in APIs, and potentially get some clarity on which API that might be. We'll have some proposals and we can talk through those.
G: Some discussion of the details of the API spec is obviously inevitable, but I'm hoping that we won't get too ratholed on that; we can maybe iterate on that in the KEP and just get some high-level yes/no concerns that we can dig into deeper. So, a couple of user stories right off the bat. These are things that I think I've seen, and I shared them with the NetPol working group, and people have seen the same things.
G: The first story — we're targeting application developers in both of these examples. The first one is a pretty standard complaint from people who want to write network policies that target external services. They want to say "allow egress to some service," whatever it may be; but managing IP blocks can be tedious, and it's not always obvious which IP block you've configured just by looking at some random IPv4 address. And so the request here is: I want to be able to write down an FQDN instead and have my CNI do the resolution.
G: So I can just point you to wikipedia.org and you'll handle the rest for me; I don't need to worry about resolving that. Another story that we've run into is a bit more of an advanced use case. This one is for application developers who want to be able to use a wildcard syntax when specifying FQDNs.
G: So either they have a cloud provider and they want to say, hey, star-dot-my-cloud-provider, just let me talk to all of my cloud provider's APIs; or they have some sort of on-prem hosted services and they want to batch a lot of those. Just a convenience API that lets you allowlist a bunch of things instead of having to enumerate them one by one. Those are the overall stories that we're trying to target here, and this brings us to the proposal for bringing FQDN in-tree.
G: Right now, the ideal proposal would be to extend NetworkPolicy as it currently stands. There are a couple of benefits to doing this. The first is just from a readability and understandability perspective: users already know what NetworkPolicy is, and they're already configuring it.
G: So it's a convenient place to come and inspect what traffic is allowed to leave your clusters, or allowed to leave your pods, without having to dig too deep. And the second is that the allowlist-with-implicit-deny model actually works really well for FQDNs, because you don't have guarantees on the completeness of DNS records and all that stuff. It's good to be able to just say: here's what we know is a good IP; you're allowed to egress there.
G: We can get away without making guarantees about the completeness of resolving DNS. And this is a sample of what extending the API would look like: adding a top-level field — egressFqdns, as an example — that lets users specify, alongside the ingress and egress blocks today, additional FQDN matches by either name or pattern; and then you have the port specification as well, and we do a cross-product type thing. It's a little bit dense.
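As a rough illustration of the proposal being described, the extension might look something like the following. This is a sketch based only on the discussion above: the `egressFqdns` field name and its schema come from the draft proposal and are not part of the upstream NetworkPolicy API.

```yaml
# Hypothetical: egressFqdns is the proposed (not upstream) field.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-external-egress
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
    - Egress            # defaulted by the API server when egressFqdns is set,
                        # so older CNIs fail closed (deny-all egress)
  egressFqdns:
    - matches:
        - name: wikipedia.org        # exact FQDN
        - pattern: "*.example.com"   # wildcard / prefix-match expansion
      ports:
        - protocol: TCP
          port: 443
```

The matches-by-ports cross product mirrors how the existing `egress` rules combine `to` peers with `ports`.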
G: So it's very similar to the standard NetworkPolicy; you're just swapping out what would have been an ipBlock for this FQDN. And in another version of this, you can still select pods much as you would in NetworkPolicy today, and you can specify either a name itself or a pattern, and have sort of a prefix-match expansion.
G: An important point to talk through here is how we fail closed. This is something that we talked about for a bit, and what I'm proposing right now is to reuse the policyTypes field that we originally added to help extend with egress, I believe — or ingress, rather. The idea is that we would use the defaulting in the API server to add the Egress policy type to any policies that use the egressFqdns field. And so, with any old CNI — if you haven't updated your CNI but you try to use this —
G: — this egressFqdns field, you'll basically get a deny-all egress policy, and so we'll be able to use the existing mechanisms within the API to just fail closed and lock down your workload. It's not great from a user-smoothness perspective: you'll realize that your traffic isn't going out and you'll have to dig into why. But at least it's not a security bug; it's not a hole where we're now egressing a bunch of traffic that we didn't intend to.
G: So that's the idea around how we'd like to extend the API. There are more details in a doc that we have written; I think I've linked it in the meeting notes. As for the alternatives — I guess we can have just a quick look at where this should be added into the API spec.
G: I did some digging on how the API is defined, and there are a few other places — like NetworkPolicyPeer or NetworkPolicyEgressRule — that are contenders for adding FQDNs, either alongside ipBlock and podSelector as a peer, or even alongside the list of "to" rules. But in both of those cases you either end up with undefined behavior or with failing open as part of the API spec, neither of which sounds particularly exciting or fun. So I'm proposing a top-level field addition for integrating this into NetworkPolicy.
G: Yeah, great question. A couple of thoughts that we had. One idea we were kicking around was adding a CoreDNS plugin, which would act as maybe a side channel: once it resolves IPs matching your wildcard, it can send the resolved IPs out to your network policy provider, so that you're also aware of the resolutions as they're happening.
G: Another option, which is more heavyweight, is to just put in an Envoy proxy or something like that that's intercepting your DNS requests, and then update your allowlists based on that.

I: Well, how do you know that it's addressed by DNS? Because the pods you put the policies on can still use IPs directly, right? Yeah.
G: Well, so, since it's allow-only, the thing that we can get away with is that we just won't allow other traffic to egress. I think the problem that you'll run into is that if you don't do a DNS resolution from your workload, you won't update the allowlisted IPs; so if you just tried to hit the IP without doing any DNS lookup, you would not have populated the network policy enforcer.
G: I'm open to other opinions. My thought is that if you want to use only IPs, you have ipBlocks and you can do that. I figured that if you're using FQDNs for ease of use, that extends equally to when you're writing your application — it might be two different persons, right?
I: On the first option — where you have something that looks at the policies and produces the set of addresses based on what the patterns in the policies are — I like that part; I could see that being doable, making that list available.

M: That won't work. We do that in OpenShift; it's terrible. You have race conditions, always, and then time-to-live and all of this.

I: Exactly, oh yeah.
M: But I think that's a feature, not a bug, because — like I said, we have this in OpenShift, where we implement it by looking up the name, checking the TTL, doing a new DNS lookup any time the TTL expires, and trying to deal with all of that — and it never works. Every customer who's ever used it has filed bugs against it. You don't want to do that; you have to intercept DNS, there's no other way.
N: But the thing — my question is: you shut down all other DNS in the cluster and you force the resolution in the pods, because you can force resolv.conf in the pods. You force all the pods to resolve through your CoreDNS, or whatever DNS, and this CoreDNS has a policy, and then you put the rules there.
N: How do they do that? For example, right now, with all these DNS filtering services that you can use in your home — Umbrella and the like — what they do is they have a list of DNS names and they send you to a block IP, because otherwise you run into what that question raised: it's impossible to keep up with the IPs being resolved. I mean, yes.
M: The way that I had thought about this before is that the network policy implementation tells CoreDNS: I need to know if anybody asks for the IP address of github.com. And then, whenever anybody does, the CoreDNS plugin will tell the network policy implementation: okay, somebody asked, and this is the result that I'm about to give them. That can update the network policies, and then the pod gets the DNS results and its traffic goes through.
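The side-channel flow described here can be sketched in a few lines. This is purely illustrative, not a real CoreDNS plugin: all names are invented, and the key property it shows is the ordering — the enforcer is handed the resolved IPs (and can program its allowlist) before the DNS answer is released to the pod.

```go
package main

import (
	"fmt"
	"strings"
)

// Enforcer stands in for the network policy implementation. It registers
// the names/patterns its policies reference, and records allowlisted IPs.
type Enforcer struct {
	watched []string            // e.g. "github.com", "*.example.com"
	allowed map[string][]string // name -> IPs programmed into the dataplane
}

// matches checks a resolved name against exact names and "*." wildcards.
func (e *Enforcer) matches(name string) bool {
	for _, w := range e.watched {
		if w == name {
			return true
		}
		if strings.HasPrefix(w, "*.") && strings.HasSuffix(name, w[1:]) {
			return true
		}
	}
	return false
}

// OnResolve is the side channel: the resolver calls it with each answer
// *before* returning that answer to the client pod.
func (e *Enforcer) OnResolve(name string, ips []string) {
	if e.matches(name) {
		e.allowed[name] = ips // in a real system: program dataplane rules here
	}
}

func main() {
	e := &Enforcer{watched: []string{"github.com"}, allowed: map[string][]string{}}
	e.OnResolve("github.com", []string{"140.82.112.3"})
	e.OnResolve("unrelated.test", []string{"203.0.113.9"})
	fmt.Println(e.allowed)
}
```

Because the allowlist update happens before the answer is released, the pod cannot race ahead of the enforcer, which is exactly the property the TTL-polling approach lacks.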
A: I mean, the other problem is that it doesn't prevent the application inside the container from trying to use its own DNS server somewhere out there. You would also have to make sure that you block any other port-53 traffic, or —
O: Yeah, I think everybody talks with the background of how they think network policy should be implemented; hence a lot of the discussion assumes that the implementation is iptables. So, at the time a packet arrives to connect to github.com, some iptables rule should be there to do the allow.
O: But the thing is, not every network policy implementation uses iptables. Some people have firewalls; there are many, many ways to do it. A lot of the problems — and they are valid problems — and the race conditions described, arise because everybody assumes this is an iptables-style implementation.
O: The way this works on traditional firewalls — think of F5-type firewalls — is that they're packet interceptors. So at packet arrival you know the packet; even if it's TLS, you know where the packets are going, and you can do reverse DNS.
P: I mean, all of this, though, is kind of in reference to implementations. I know people have doubts that it can be implemented, but — I don't know, I'm speaking for the whole group now — isn't this more just: is this an API that the community thinks we could utilize? And then we can tackle the "can this be implemented" question.
Q: To add my two cents: in Antrea we actually implemented a policy similar to this, and the way we do it is, if a policy selects some pod, we intercept any traffic going to that pod from source port 53 and look at what DNS name it's trying to resolve. We basically wait until the rule is realized on the dataplane before we actually pass the packet back to the original requesting pod, so that we know no pod will even get a DNS response if that's the FQDN we're actually filtering.
N: I work with telcos, right, and the telcos have been living off the fact that they can intercept DNS requests and basically sell services based on that — they know where the users are going — and that is going away. So the whole thing with deep packet inspection, when we look at 5G, where almost everything is encrypted, becomes useless. I mean, then you have encrypted bit streams, and you can only look at how they behave to try to identify what it is.
H: And this seems good, what you're describing.
I: Seems good; I have no problem with it. I mean, we refused to do something for China, where they wanted to push basically a root record everywhere so we could intercept all the DNS at the firewalls. But the problem, when you start doing that, is that you're going to become dependent on snooping, and I don't think that is a good idea. For the layer-7 stuff we have URLs, right? There this makes more sense; but for layer 3, layer 4 —
O: But I've seen a lot of clusters — actually the majority of the clusters I've seen — do split DNS, where anything under .cluster.local goes to CoreDNS, and anything that's not under .cluster.local goes to whatever environment DNS exists outside the cluster.
O: I think the API is good. I honestly think the API needs to progress, and we're going to have to see the implementation as a separate thing. Just to set expectations: without intercepting the packet, this can be worked around entirely. I can run a pod first that has dnsmasq, download the records I want, and then run my pod and resolve using that dnsmasq.
O: I'm kind of inclined to separate the API, because it's easier, it's better, it's optional, and a lot of people can say, "not gonna do that."
G: Discoverability and readability, in that the semantics are very much the same as standard NetworkPolicy — it's a list of allowed egresses, in this case — and users know where to go for that today. You write your network policies that target your workloads, and you say what can leave. If you add a new resource, it just adds a layer of friction: if I need to look at why my traffic is leaving, or — yeah.
E: I think that's also debuggability. It's really hard to figure out why network policies aren't working — with an L7 thing it's easy, because you just can't access your thing — so it's already confusing. And Antrea and Cilium and everybody else, and Calico, are already implementing an FQDN policy, I think, right? I know Antrea does, and I think Cilium does — Casey can tell me if I'm wrong — but —
O
B
But
describe
code
assumes
that
I
can
read
it
right,
like
I
doubt
very
much
that
a
change
to
cube
cuddle
describe
to
go
off
and
read
all
the
other
resources
at
the
same
time
would
be
accepted,
and
that
still
assumes
that
I
could
even
read
the
admin
network
policies
which
I
probably
can't
so
like.
I
think
I
forget
it
was
on
this
call
or
on
one
of
the
other
discussions.
But
we
talked
about
something
like
a
like
a
trace
route
resource
where
you
you
create
it
and
have
the
implementation,
tell
you
yes
or
no.
B: I don't think that's the only concern, at least not on my behalf. There's the question of whether we would predicate this on the ability to self-describe capabilities: if we did that, then you might not be looking at integrating this until 28, because it would take a while to get that work done, right? Maybe we could pipeline them together, but still, it'll be some time. The second part of it, though, is just that API growth is a problem all around; NetworkPolicy is not the best-factored API.
B: Is this something that we want to put into that API? Or is it better to say, you know, it's a different concept — put it in a different space, even though it has a lot of overlap?
B: Oh, Bridget raises a great point: we've been talking about this for 25 minutes or something. Do you have a proposal doc that you want to bring to the mailing list? Did you already do that?
G: I have a proposal doc I can send; I can send that out to the mailing list. I haven't sent it out yet.
A: Thanks. All right: Rob, "EndpointSlice controller fails to sync."

R: Hey, yeah. I have a somewhat interesting bug that involves a series of potential race conditions — I'm not quite sure yet — but I wanted to raise it, because I think there's potential to change behavior here, and I want to make sure that if we changed any of the behavior, it wouldn't break other things.
R: So right now, what the EndpointSlice controller does, when it's running through pods and trying to convert them to endpoints, is that it looks up the pod's node, to try to determine what zone that pod is in. Great. If it doesn't find that node, it bails out entirely and triggers another sync: it returns an error, we get into the exponential backoff, it keeps on syncing, and eventually it tails off if we run out of retries.
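The behavior being described can be boiled down to something like the sketch below. This is not the real EndpointSlice controller code — the names and shapes here are invented for illustration — but it captures the failure mode under discussion: a single missing node aborts the conversion of the whole service, so the existing slices go stale until the lookup succeeds or retries are exhausted.

```go
package main

import (
	"errors"
	"fmt"
)

// Node carries the topology info the controller copies into endpoints.
type Node struct{ Zone string }

// Endpoint is a simplified stand-in for an EndpointSlice endpoint.
type Endpoint struct {
	PodName string
	Zone    string
}

// endpointsForPods converts pods to endpoints, looking up each pod's
// node to fill in the zone. Current behavior: any cache miss aborts the
// whole sync with an error, which requeues the service with backoff and
// leaves previously written slices untouched (i.e. stale).
func endpointsForPods(pods []string, nodeOf map[string]string, nodes map[string]Node) ([]Endpoint, error) {
	var eps []Endpoint
	for _, pod := range pods {
		nodeName := nodeOf[pod]
		node, ok := nodes[nodeName]
		if !ok {
			return nil, errors.New("node " + nodeName + " not found for pod " + pod)
		}
		eps = append(eps, Endpoint{PodName: pod, Zone: node.Zone})
	}
	return eps, nil
}

func main() {
	nodes := map[string]Node{"node-a": {Zone: "us-east1-b"}}
	nodeOf := map[string]string{"pod-1": "node-a", "pod-2": "node-gone"}
	_, err := endpointsForPods([]string{"pod-1", "pod-2"}, nodeOf, nodes)
	fmt.Println(err) // one orphaned pod blocks updates for the whole service
}
```

The alternatives discussed later in this section map onto this sketch directly: skip the offending pod (risking dropped endpoints if the node informer is stale) or keep the endpoint and leave `Zone` empty.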
R: What I've observed is that in some cases, especially if you have some level of churn, you can run into a scenario where the node just doesn't exist, and therefore the service is never updated. The endpoint slices for that service just become stale as long as we're in that state, which does not feel great.
R: There are a few real cases where that happens. The most likely one that I can think of is garbage collection: when the node lifecycle controller says, "hey, I don't see this node anymore, I'm going to get rid of it," and there are still pods attached to that node, the node is gone before pod garbage collection even starts. So it is a real thing.
R: The alternatives that I can think of are: either drop endpoints where we can't find a node, because something weird is going on — but then we risk dropping all endpoints if our node informer is crazily out of date or whatever — or, the other option, if we don't have a node, continue on and just don't populate the zone.
R: Right, right, so that's it. If we don't have a zone, maybe someone actually depends on it; or — maybe more likely — if we don't have a node, maybe we are in that state where the pod is about to get cleaned up anyway.
R: I don't know — there's probably not enough time to go in depth on this issue; I just wanted to raise it, if you have ideas or thoughts. Yeah, Andrew, you had a comment?
C: There are dynamic fixes, like, okay, maybe we can make it last a shorter time; but there's also just the case of making it more robust. There could be a bug or something that just causes one node to become invalid, and it just kind of gets stuck there — and should the system be able to get over that, or at least kind of limp along?
R: So there are two potential reasons we don't have a node. One: we legitimately don't have a node. Two: the EndpointSlice controller has lost connectivity — its node watcher failed, or some other connectivity piece failed — and we do have a node; the EndpointSlice controller just doesn't know about it.
R: So in this specific case, what was happening is that the VM itself was gone from the cloud provider; the node lifecycle controller cleans up the node, and then the pod garbage collector sees, "oh, there's no node attached to this pod, let me delete the pod." But each of those steps takes time, and there's a period where the pod is just orphaned.
O: So do we have a history of why the node garbage collector does not just garbage-collect the pods as it deletes the node? Because at that point in time, it knows what those pods are. It seems silly to me — like, "oh, I'm just going to clean this side of the room, where I live; I don't give a damn about the rest of the room, which is where everybody else lives."
O: Yeah, Rob, before we drop off, there's a comment you made that lit a light bulb in my head: failures in the watchers should be discussed outside the context of this problem.
O: Yeah, we can't code defensively against failure of the watchers. Let's say you lost connection to the watcher and you're not getting the updates, and you decide to bail out — bail out and restart everything, that's fine — but you can never really make any assumption about the watchers working or not.
B: Keepalives and all that stuff — but yeah, the problem there is that it's a question every watcher has to answer independently: what happens if I wrongly stop getting updates, right? This is a pattern that our internal systems at Google handle in various ways, like: "I don't believe that 50% of my cluster disappeared at the same time, so I'm going to ignore this update; there's something else wrong."
C: Yeah, and I think, even ignoring the watcher issue, there's this issue of consistency. Let's say the pod just sticks around with a bad node name: you should probably toss it at some point, or just ignore it.
B: Yeah, I agree; that's a great point, Bowei, because nodeName is actually user-settable, right? It's a request for scheduling. So if I want to cause havoc — this would be a good one for the clustered podcast — I create a pod that has a wrong node name and watch all your endpoints disappear.
R: So yeah, that's fair, yeah. I have a pretty detailed description of the bug itself in the issue, so please, if you have time, chime in there with thoughts, perspectives, different approaches. My biggest concern is that I don't want to break something else in the process of trying to fix this, which it seems could likely be the result. So anyway, I know we're at time, so I don't want to take too much more. But yeah, thank you.