Description
Don’t miss out! Join us at our next event: KubeCon + CloudNativeCon Europe 2022 in Valencia, Spain from May 17-20. Learn more at https://kubecon.io. The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.
eBPF & Cilium at Sky - Sebastian Duff & Anthony Comtois, Sky
Sebastian: Hey everyone, welcome, and thanks for joining our talk about eBPF and Cilium at Sky. For some context, first of all: who is Sky? Sky started out as a satellite broadcast company headquartered in London, UK, and has expanded into a much larger multinational company with a presence in many other countries.
Likewise, Sky has expanded into many areas beyond satellite broadcast, and one of those areas is OTT, over-the-top, which is online video streaming. OTT is the part of Sky which we are part of. There are going to be two of us involved in the presentation today; I'll quickly introduce us and then get on with the content, because 20 minutes goes by quite quickly. So, first of all, I'm Sebastian Duff, or Seb.
I've been with Sky for just over six years, originally as a software engineer before moving into delivery, and I'm now responsible for the Core Engineering department, which we'll cover a bit more in the introduction. Presenting with me is Anthony Comtois, who is a principal engineer in Core Engineering at Sky. Anthony has a strong background in both SRE and software engineering and joined Sky in 2016.
I'd also like to mention CCG, a consultancy who have played an important role in the journey to build a mature platform-as-a-service offering and in our work with Cilium and eBPF. Joseph Samuel from CCG was originally going to be part of the presentation with us, but due to timing he was only able to be involved in the preparation.
So this talk might be a bit different from the others: rather than going into the deep technical details of how we're leveraging eBPF, we'll be focusing on the delivery aspect and how we leverage the technology to gain a higher level of confidence, mitigating risk to the platform and to the business. In the first section of the presentation I'll give a brief overview of what we in Core Engineering do, and then I'll hand over to Anthony to talk first about why we chose Cilium and eBPF, and then about pipelining as a form of risk mitigation.
So: in Core Engineering we build a multi-tenanted, Kubernetes-based platform-as-a-service offering which hosts about 90% of the application workload. The platform is built as a white-label product, so that it can be built once and deployed many times to support the different organizations and propositions. As the underlying platform for high-profile propositions, we have very large and complex requirements, including being highly available, hybrid cloud, multi-region and active-active, and all of these at high scale and low latency.
To be able to operate efficiently at scale, we have a number of important engineering principles which we follow for everything we do. I won't go into all of them in this presentation, but I will mention two of the golden rules we follow, as they have a very large impact on the way we do things and on how things have been designed. The two golden rules are: tenant A cannot negatively impact tenant B, and no tenant can negatively affect the platform.
And although we are a platform team, we measure our success by the success of the tenants, who are the teams using the platform. Our view is that one might have the best, most perfect platform in the world, but if people are struggling to adopt it, then it really isn't a successful platform. That really comes through in how we actually implement a lot of the capabilities we have. On this slide we have some interesting stats which give a brief view of the scale we're working at.
The multi-tenanted platform currently supports just over 13 departments, with over 90 teams, which is about 1,000 engineers using the platform. These teams are using a wide variety of technologies, so our goal is to provide a consistent interface for everyone. We largely achieve this through Kubernetes, but we also build custom tooling and libraries for teams to leverage. On this slide we have a bit of a snapshot of some of the interesting technical stats.
We have over 300 unique applications deployed to the platform, with more than 60,000 replicas running across all environments, and to support the required scale we have performance-tested our central services, such as ingress, to 1 million TPS. And that's enough from me as an introduction to the platform; I'll hand over to Anthony to talk about why we chose Cilium and eBPF.
Anthony: Hi everyone. I'm going to talk about why we chose Cilium at Sky and how we've been mitigating the risk, with the help of CCG, the consultancy working alongside Core Engineering.
So, first of all, for context: there are a lot of applications running on top of the platform, on top of Kubernetes, with a multi-tenanted architecture. By default on Kubernetes you've got a flat network where every single pod can talk to every other pod. So, with the help of Kubernetes network policies and Cilium network policies, we want to restrict network communication within the cluster, from pod to pod. We also want to allow and block our tenants' access to external endpoints, for example databases; a minimal policy sketch follows below.
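As an illustration only (not Sky's actual policy; the namespace, labels, and address here are hypothetical), a CiliumNetworkPolicy of roughly this shape restricts a tenant's workload to in-namespace traffic plus a single external database endpoint:

```yaml
# Hypothetical sketch: pods labelled app=checkout in namespace tenant-a
# may receive traffic only from their own namespace, and may reach only
# one external database address.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: checkout-least-privilege
  namespace: tenant-a
spec:
  endpointSelector:
    matchLabels:
      app: checkout
  ingress:
    - fromEndpoints:
        - matchLabels:
            "k8s:io.kubernetes.pod.namespace": tenant-a
  egress:
    - toCIDR:
        - 203.0.113.10/32   # placeholder external database address
```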
So a specific tenant is going to be able to talk to a specific database, and so on. We also want to block malicious IPs defined by the security team; that's defined at the cluster level, and we want to make sure a tenant cannot override it. And we want to move towards a least-privilege access model, where every single tenant defines their full network flow, because of the high scale and the requirements at Sky.
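For the cluster-level blocking, Cilium's deny rules are the relevant mechanism, since deny rules take precedence over any allow rule a tenant might write. A minimal cluster-wide sketch (the CIDR here is a placeholder, not a real blocklist entry):

```yaml
# Hypothetical sketch: deny egress from every endpoint in the cluster
# to a security-team-supplied CIDR; namespace policies cannot override it.
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: deny-malicious-ips
spec:
  endpointSelector: {}      # selects all endpoints in the cluster
  egressDeny:
    - toCIDR:
        - 192.0.2.0/24      # placeholder blocklist CIDR (TEST-NET-1)
```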
So Cilium is essentially going to inject some eBPF programs inside the kernel to interact with the network stack, and it's Kubernetes-aware: it has the full topology and the IPs, which it can inject into the BPF maps, sharing that data between the eBPF programs and the Kubernetes context. That allows us to have more efficient load balancing and network-policy propagation, and we heavily rely on the deny functionality of Cilium to block at the cluster level.
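The eBPF-based load balancing is something you opt into when installing Cilium; as a sketch only (option names as of roughly the Cilium 1.11 era of this talk, and the API server endpoint is a placeholder), the Helm values look something like:

```yaml
# Hypothetical Helm values fragment: enable Cilium's eBPF kube-proxy
# replacement so service load balancing happens in BPF maps.
kubeProxyReplacement: strict
k8sServiceHost: api.cluster.example.com   # placeholder API server host
k8sServicePort: 6443
```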
So, in order to embrace this kind of new technology, we want to mitigate the risk before going to production, and we're going to show how we've been mitigating that risk by automating the tests. You've got a git repository; you push, and then you've got a build which is going to be deployed.
So you commit, the build is started on our CI agent, and we run all the localized tests, for example linting and vulnerability scanning, and when everything has passed, the Docker image is built and published to the test repository; a sketch of that pipeline shape follows.
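The talk doesn't name the CI system, so purely as an illustration of the flow just described (GitLab-CI-style syntax; the stage and job names are ours, not Sky's):

```yaml
# Hypothetical pipeline sketch: lint and scan on commit, then build
# and publish the Docker image to the test repository.
stages: [lint, scan, build, publish]

lint:
  stage: lint
  script: [make lint]

vulnerability-scan:
  stage: scan
  script: [make scan]   # placeholder for an image/vulnerability scanner

build-image:
  stage: build
  script: [docker build -t registry.example.com/platform-cilium:$CI_COMMIT_SHA .]

publish-to-test-repo:
  stage: publish
  script: [docker push registry.example.com/platform-cilium:$CI_COMMIT_SHA]
```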
So, as part of those tests, we've got two main kinds of test. The first is functional testing, where we rely heavily on the Cilium connectivity test suite, which is provided by Cilium. It's essentially a bunch of pods which are deployed into the cluster, doing DNS checks, HTTP checks, and connectivity checks, and also making sure the Cilium network policies and Kubernetes network policies behave as expected on a running cluster.
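The connectivity suite can be driven from the Cilium CLI; a minimal CI step might look like this (the job and stage names are ours; `cilium connectivity test` is the upstream cilium-cli command):

```yaml
# Hypothetical CI fragment: run the upstream Cilium connectivity suite,
# which deploys its own test pods and performs DNS/HTTP/policy checks.
functional-test:
  stage: test
  script:
    - cilium connectivity test
```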
We've also been adding some additional functional testing, for example making sure namespace network policies cannot override the cluster-wide deny policies. We also want to make sure the identities are limited to some specific labels, to limit the number of identities inside the cluster. For us, we only key on the namespace label, which means every single namespace has a one-to-one mapping with a Cilium identity.
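Limiting identity-relevant labels is a documented Cilium agent option; a sketch of the idea via the cilium-config ConfigMap (key name per Cilium's docs on limiting identity-relevant labels; the value shown is only illustrative of keying identities on the namespace):

```yaml
# Hypothetical cilium-config fragment: compute security identities from
# the namespace label alone, giving a one-to-one namespace/identity mapping.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  labels: "k8s:io\\.kubernetes\\.pod\\.namespace"
```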
So that's one of the main kinds of test, making sure we've got the functional requirements covered. Then we've got the non-functional testing, where we're trying to have a 30-minute fast feedback loop. What we want to exercise is the full network path and make sure everything works, because Cilium is interacting with the network stack. Exercising the full network path means having a load injector sending load through to a backend, with multiple kinds of communication happening: pod to pod using the service IP, but also cross-cluster communication using the internal and external ingresses, and obviously we rely on Kubernetes hostnames to target these services.
So we're trying to target the worst-case scenario: we have some numbers, like the maximum number of identities we want to have on a cluster, and then we try to reproduce that in our test environment.
Obviously it's very hard to test every single use case. That's why we automate all the tests, making sure we can scale, and every time we get an issue reported, we add regression testing to make sure the issue is not going to happen again. As part of the non-functional testing, we've got four different tests.
The first one simulates identity churn: it's essentially creating and deleting pods, which produces identity churn, and we watch the identities being injected into the BPF policy maps. We simulate, or create, 5,000 identities, and we noticed a small edge case doing this, which I'll come back to. Running all four of those tests obviously exercises the whole Cilium stack: the Cilium agents, the Cilium operator, and the cilium-etcd members acting as the backend.
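For the churn check itself, identity and policy-map state can be read straight off a Cilium agent; a hypothetical verification step (the job wrapper is ours; `cilium identity list` and `cilium bpf policy get` are the agent's own CLI) could be:

```yaml
# Hypothetical CI fragment: exec into a Cilium agent pod and inspect the
# identity count and BPF policy maps after the churn phase.
identity-churn-check:
  stage: nft
  script:
    - kubectl -n kube-system exec ds/cilium -- cilium identity list | wc -l
    - kubectl -n kube-system exec ds/cilium -- cilium bpf policy get --all
```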
When we started running this, we noticed a very small edge-case scenario with Cilium agent restarts: when you restart the agent, you get a small increase in drops in the metrics, but it doesn't affect clients, thanks to TCP retries. We've been working closely with Cilium and Isovalent to get a fix merged upstream. Then you've got the second test, which is exactly the same but without the Cilium agent restart, and there we tolerate zero drops.
B
The
second
test
might
go
away
and
we're
just
gonna
have
the
first
one
but
totally
zero
drop
when,
when
we're
gonna
release
a
new
new
version,
the
third
test
is
simulating,
like
the
cm
network
policies,
recreation
which
is
so
he's
gonna
exercise
like
flushing
the
bbf
map
and
all
the
information
inside
when
you
delete
the
cm
network
policies
but
and
when
you
create
it,
it's
just
gonna.
Splitting the scenarios like this gives us a chance to isolate which scenario is having an impact. To give you a bit more insight: we rely heavily on metrics while the load is running, we've been creating monitors on top of them, and we fail the test if there is any alert generated.
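As an illustration of failing the run on any alert, a Prometheus-style rule over Cilium's drop counter could look like this (the metric name is Cilium's; the threshold, window, and labels are placeholders):

```yaml
# Hypothetical alerting rule: flag the NFT run as failed if Cilium
# reports packet drops while the load test is in flight.
groups:
  - name: cilium-nft
    rules:
      - alert: CiliumPacketDrops
        expr: sum(rate(cilium_drop_count_total[5m])) > 0
        for: 2m
        labels:
          severity: test-failure
        annotations:
          summary: "Cilium dropped packets during the NFT load run"
```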
So, as you can see on the left, you've got the pod creation: we create pods to generate identities, and at the peak you've got about 10 pods created per second. You can see the pod count, which is roughly around 1,500, and which matches the identity delta, that is, the number of identities. We also delete some of them, so you've got identity churn; you can see it go up and down.
You've got the BPF map operations, showing all the operations happening on the BPF maps, and the Cilium drops, and you can also see the four tests with the load injector at 2,000 kTPS. We monitor the load-injection latency, making sure there is no increase, and the CPU and memory figures gave us the ability to properly size the worst-case requests and limits on the DaemonSet.
So when both tests have passed, we promote the artifact, the Docker image, to the external test repository. Then every single day at 6:00 PM we run what we call extended NFT, or extended non-functional testing, which is essentially the same non-functional tests that normally run for 30 minutes, but this time over a longer period, up to a maximum of 16 hours.
When everything has passed, the artifact is promoted to the different organizations, one for NBCU and one for Sky, and they get deployed on different clusters; that's why we've got different organizations. We deploy to what we call predev, which is one specific environment.
Obviously we gather all the metrics and define some alerts, for example latency increases, HTTP errors, and packet drops, and that's how we gate promotion from one environment to the next, through the alerts. At 8:30 we've got the promotion mechanism, which promotes from predev to dev, and so on.
If there is no alert firing, then, because we've got multiple regions, we can stagger the deployment across multiple clusters: at 10 AM you've got the first region, and then the second one at 12, for the same environment. And if there's anything happening on the first one, then we can obviously stop the deployment on the second one. A sketch of that schedule follows.
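The talk doesn't show the promotion configuration itself, so purely as a sketch of the cadence described (the times come from the talk; all key names are hypothetical):

```yaml
# Hypothetical promotion schedule: nightly extended NFT, then alert-gated,
# time-staggered rollouts through the environments and regions.
promotion:
  extended-nft:   { at: "18:00", max-duration: 16h }
  predev-to-dev:  { at: "08:30", gate: no-active-alerts }
  dev-region-1:   { at: "10:00", gate: no-active-alerts }
  dev-region-2:   { at: "12:00", gate: no-active-alerts, halt-if: region-1-alerted }
```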