From YouTube: Secure, Efficient API Plane Traversal for Compute Resources on Exascale Super Comput... Tim Pletcher
Description
Don’t miss out! Join us at our next event: KubeCon + CloudNativeCon Europe 2022 in Valencia, Spain from May 17-20. Learn more at https://kubecon.io The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.
Secure, Efficient API Plane Traversal for Compute Resources on Exascale Super Computers Using SPIRE - Tim Pletcher, HPE
Cray Shasta exascale supercomputers use SPIRE for securing machine-to-machine control plane communication. In this presentation, we’ll discuss the requirements for securing critical communication on these supercomputers and why SPIRE was a good choice for this application.
I'm Tim Pletcher. I was at Cray for a little over two years; during that time, HPE acquired us. While I was at Cray, I was the software security architect for the Shasta systems. There were a few of us who were systems architects there, and my remit was the access control framework that currently sits over the Shasta systems. I've since moved over to the security engineering team at HPE, where I work with Dan, Sunil, and a bunch of other folks, and I'm still involved with the HPC side of the house in a few dimensions. While I'm here, I'd like to say that there are a whole bunch of other folks behind the scenes who worked on this particular dimension of the implementation: Zach Chrisler, Kevin Burns, a whole bunch of folks. So it's not a one-man show over there by any stretch.
These machines are capable of over an exaflop of computational activity, and I always like to write it out as 10 to the 18th, because it makes me think about how fast, and how many, numbers these things are crunching at any one particular time.
Historically, these machines will scale to tens of thousands of compute nodes, and different installations will have different node counts associated with them. Right now, the Slingshot network that runs the Shasta systems is 400 gigabit, going to 800 gigabit with the next-generation switches and cards that are coming down the road in a few years. It's a dragonfly network topology, which means that the nodes are never more than a fixed number of hops away from each other, and that allows for a highly consistent and manageable high-speed network.
There are truckloads of storage behind it. They're just crazy: they're data-center-sized computers, they consume power measured in megawatts, and it's kind of funny when you think about it, because there are scenarios where the machines will have their own substation, or you have to be careful about when you run them, because they'll brown out the neighborhood, or the city. So they're really neat machines.
One thing I would like to say about the Cray systems, and I'll get to this: our system management model basically is going to look fairly familiar to a lot of you, or all of you.
There's a bunch of functions around the compute nodes themselves: you need to manage the images, you need to manage boots, you need orchestration of boot, configuration management for the nodes themselves, and whatnot. Then there's a network management side. There are obviously two networks: one is the high-speed network and one is the management network, which is a fairly standard configuration, and so you've got all the interaction that goes on there.
Hardware management is as you would see in a lot of large data center operations: you've got to deal with power management, you've got to deal with all the BMC endpoints, firmware, the whole nine yards. And then, of course, there's security, so personas and non-person entities all have to be accommodated in the authorization, authentication, and key management context.
So how does that work? The CSM software, the Cray System Management software, is, I would say, a fairly generic and vanilla CNCF implementation, and that was by design. We run on Kubernetes, we use Istio, we have Vault, we have cert-manager. I think you would take a look at any of this and say: oh yeah, of course, that makes perfect sense if you're going to run an API plane in front of a big machine like this. If you look around the side, there are a whole bunch of different networks present in this system; they're basically VLANs, and they deal with the hardware management side to get to the BMCs.
So, along the way, we have a gen one. Basically, when you think about the context of how you need to transit the API plane from a compute node, the side where you're going to have administrative traffic in and out is fairly straightforward. I say fairly straightforward because our access control framework is basically standalone: we ship Keycloak with this thing, so you could stand the system up and run it completely in an air-gapped environment by itself, without talking to anything else, and that was one of the original requirements, obviously, because of where these machines run. So our story around basic API interaction as an individual is pretty straightforward and good.
A
We
use
opa
to
deal
with
the
aussie
topic
and
key
cloak
issues
of
standard
oidc
tokens
and
away
you
go
where
that
breaks
down
is
that
there
are
applications
that
run
on
the
compute
nodes.
Obviously,
that
deal
with
platform
operations,
and
so
our
gen
1
implementation
of
this
was
to
basically
use
what
key
cloud
calls
a
service
account
which
is
effectively
just
issuing
a
long-running
oidc
token
and
handing
it
out
right,
and
it
was
a
horrible
implementation.
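To make that gen-1 pattern concrete, here is a minimal sketch of what minting such a service-account token looks like with Keycloak's standard client-credentials grant; the gateway URL, realm, and client names are hypothetical, and the token endpoint path varies by Keycloak version:

    # Hypothetical gen-1 flow: a platform component fetches a
    # service-account token from Keycloak via the OIDC
    # client-credentials grant and reuses it as a bearer token.
    curl -s "https://api-gw.example.com/keycloak/realms/shasta/protocol/openid-connect/token" \
      -d grant_type=client_credentials \
      -d client_id=compute-platform-svc \
      -d client_secret="${CLIENT_SECRET}"
    # The response JSON carries "access_token"; configure a long
    # token lifetime and hand that one credential around, and you
    # have exactly the weakness described here.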
I'm still maybe a little bit embarrassed by it at the end of the day, but we had to get started somewhere, and we knew that we were going to have to build something specifically to accommodate this. As it turned out, Scytale and Cray were acquired at roughly the same time, and I started interacting with Sunil and Emiliano, and it became pretty clear pretty quickly that, while we were very familiar with SPIFFE because of our use of Istio, we hadn't really considered SPIRE; it wasn't on our radar screens. After a little bit of discussion internally, the aha moment came, the dots were connected, and we realized we didn't have to build anything; we really just needed to pick up the ball with SPIRE and get going. So we did, and it was a great collaboration. I would say it was probably one of the easier implementation cycles I've been through.
It took less than 90 days and we were up and running in the system, and that included the whole nine yards of applications talking across the API plane.
So this is where we sit today with the Shasta CSM. As you can see, Keycloak is there for the individual users, and then the NPEs are covered by SPIRE.
Today we use join tokens to attest the compute nodes, and I'm going to talk a little bit more about that as we go forward; then the X.509 SVIDs get issued to the applications and away we go.
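For anyone who hasn't seen it, the join-token flow is roughly the following sketch; the trust domain, node name, and paths are placeholders, not our exact CSM tooling:

    # On the SPIRE server: mint a single-use join token that is
    # pre-bound to a SPIFFE ID for the node being attested.
    spire-server token generate \
      -spiffeID spiffe://shasta.example.org/compute/node0001

    # On the compute node: start the agent with that token; once
    # attested, workloads on the node fetch X.509 SVIDs over the
    # Workload API.
    spire-agent run -config /opt/spire/conf/agent/agent.conf \
      -joinToken "${TOKEN}"
    spire-agent api fetch x509 -socketPath /run/spire/sockets/agent.sock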
So this is our world, and I'd say again, it's a fairly straightforward and standard implementation in most respects, with the addition of SPIRE making our lives a lot easier.
The state of the control plane, or the access control framework, is good for a 1.0 implementation, but you can always improve, and we will seek to do that. The big place where we have a challenge is really node attestation in the compute plane: when the original specifications were cut for these Shasta systems, TPMs were not specified on the compute blades, and so we find ourselves in an awkward position.
They will be going forward, but we have machines that are fielded that do not have them, and so objective number one is to get a better story than we have today with the join token. Then, we really like the work that is going on in the community with the effort to get SPIRE and Istio talking together.
We already have a central PKI issuer system that runs behind the platform, and we're in the process of basically putting an operator in place to roll the certs from that issuer for Istio. It would be nicer if we could just bolt SPIRE into that task and call it done; that would be awesome.
The other thing that we're looking forward to is API-driven workload registration: as platform components come on, the ability to register them automatically, as opposed to manually with a config file, which is what we do today (a sketch of that manual flow follows below), will be a good upgrade for the engineering teams. The other topic that we're looking forward to is federation.
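First, on the registration side: the manual flow we ship today looks roughly like this single entry, where the trust domain, parent ID, and selectors are illustrative; an operator driving the registration API would create the same entries automatically as components come online.

    # Static workload registration: map a workload (selected here by
    # Kubernetes namespace and service account) to a SPIFFE ID under
    # the agent it runs beneath.
    spire-server entry create \
      -parentID spiffe://shasta.example.org/compute/node0001 \
      -spiffeID spiffe://shasta.example.org/csm/heartbeat \
      -selector k8s:ns:services \
      -selector k8s:sa:heartbeat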
On the federation side, there are certain components that we have in the platform that are not baked into the CSM software itself. One of those is the fabric controller: the Slingshot software is a standalone system, and it can run without CSM, but the ideal scenario is that its access control framework can be federated into ours, and we do that through SPIRE. That has been proposed; I don't know where that team is at on it, but it takes advantage of the federation capabilities in SPIRE, which are excellent and pretty straightforward to implement.
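As a rough sketch, SPIRE expresses this kind of relationship directly in the server configuration; the trust domains and endpoints below are made up, and the exact bundle-endpoint syntax depends on the SPIRE version:

    # Hypothetical server.conf fragment: the CSM SPIRE server exposes
    # its own bundle endpoint and fetches the fabric controller's
    # trust bundle, so SVIDs from either trust domain can be verified.
    server {
        trust_domain = "csm.example.org"

        federation {
            bundle_endpoint {
                address = "0.0.0.0"
                port    = 8443
            }
            federates_with "slingshot.example.org" {
                bundle_endpoint_url = "https://spire.slingshot.example.org:8443"
                bundle_endpoint_profile "https_spiffe" {
                    endpoint_spiffe_id = "spiffe://slingshot.example.org/spire/server"
                }
            }
        }
    }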
As I said, on node attestation today, the astute observer might have noticed the use of join tokens. Join tokens are not ideal: they require you to create an issuing mechanism, and it can be a little bit challenging to have that mechanism reach the security level that you'd really like it to. We knew going in that this wasn't going to be our ultimate end state.
So let's look at this. Cole was kind enough to steal my thunder on the TPM attestation, so I'm going to skip through this one real quick, because I think you will look at this diagram and see something very similar to what he presented. It's basically the flow for doing the TPM-based attestation from the compute node.
Recall that these are diskless machines for the most part, and so we have a process where we can inject into the boot and the initrd phase, and we do that with other payload components today. This will just end up adding the xname certificate into the mix, and then we start down the process of cert verification, generate the nonce, go back and forth, and you end up with the SPIFFE ID issued using the intermediate that the SPIRE server holds, which it has acquired from our PKI-as-a-service plane.
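In SPIRE terms, a flow like that maps onto the TPM DevID node attestor; the sketch below shows roughly what the plugin configuration looks like, with all paths and CA names assumed rather than taken from our deployment:

    # Server side: verify the presented DevID certificate chain
    # against the DevID issuing CA and the TPM endorsement CA before
    # issuing the node's SVID.
    NodeAttestor "tpm_devid" {
        plugin_data {
            devid_ca_path       = "/opt/spire/conf/devid-ca.pem"
            endorsement_ca_path = "/opt/spire/conf/endorsement-ca.pem"
        }
    }

    # Agent side, on the compute node: present the DevID credentials
    # provisioned into the TPM; the server's nonce challenge proves
    # possession of the key, as in the flow just described.
    NodeAttestor "tpm_devid" {
        plugin_data {
            devid_cert_path = "/var/lib/spire/devid/devid-cert.pem"
            devid_priv_path = "/var/lib/spire/devid/devid-priv-blob"
            devid_pub_path  = "/var/lib/spire/devid/devid-pub-blob"
        }
    }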
It's still not where we'd like to be on the TPM side, but while we get this process as good as we can, there's something else we've started to look at: we're watching the step-ca ACME server project. That's something we hope gets there pretty quickly, and then we'll take a gander at that.
So that's SPIRE in a supercomputer, in a nutshell. We have time for questions, and I want to leave time for questions if anybody has any. We appreciate the chance to share the experience that we've had. I would say again: if you haven't started working with SPIRE, it truly is a Swiss Army knife, and it becomes more and more of one as the plug-in universe gets bigger.
[Audience question, inaudible]
It does kind of change things: the fleet management problem is real in large footprints with any of this type of thing. So we want to find the approach, in the short term, that is most manageable for the system administrators, because they have a big job and these machines run a lot. We want to keep that overhead low.
We are working on some things internally around that at HPE. To me, that's the killer app right there.
Historically, TPM operations, when you mix them in with anything related to fleet management or maintenance events, are just really painful, which is why a lot of people don't do them. To me, in the context of security engineering, especially as we start to really focus on hardware-up attestation, that is going to have to be dealt with. And when you look at some of the other things that are going on with platform certs and whatnot in SPDM, the need is going to come for every component in a box to be validated. So I suspect this is going to come to the forefront more and more, and we do have people looking at that problem internally.
I will say that it's improved. There's one scenario that dings us on the boot cycle. We've run up to, I think, probably close to 8,000 nodes at this point, give or take a little bit, and one of the applications that runs in the context of our world is a heartbeat mechanism. So what happens when you run through this boot cycle?
A
All
these
things
start
heartbeating
almost
immediately
and
it's
kind
of
a
thundering
herd
problem,
and
we
see
we
supercomputers
have
thundering
herd
problems
in
a
few
different
areas.
So
we
end
up
actually
with
the
heartbeat
thing.
Turning
that
off
during
the
boot
cycle-
and
I
think
we're
we're
meeting
most
of
our
targets
with
the
with
the
boot
time
is
today.
It
is
a
contractual
requirement
in
the
super
real
world
to
boot
in
a
certain
speed.
So
it's
a
it
fixes.
It's
prominent
in
the
discussion.
[Moderator] She would like to liaise with you following the talk.

[Audience question, inaudible]
A good way to think about these machines today is that they're big IaaS implementations. From an analogy perspective, it's just like: hey, you boot up, here's your VPC, and then workload manager software comes in and dispatches applications into the compute plane at runtime.
That's not to say that we haven't had requests from the customers for PaaS-type services on this. I think, as you see more modern approaches to running jobs up in the compute plane itself, you'll start to see SPIRE make its way in there, but we don't provide that as a service in the core platform.
[Audience question, inaudible]
And I'd be surprised if there wasn't interest from some of the labs community around this as well. Some of the labs actually expose these machines to a lot of researchers, who come in and do their thing, and so secrets management has been a topic of discussion, and we kind of said:
Well, all right, these are all viable, and as we get a little bit more mature in the platform, we can start to look at that PaaS topic. There's also the topic of some type of Git service, so they could do GitOps up in the compute plane with whatever they're going to run. So I think it's only a matter of time before that starts to show up, but I think you're starting to see different approaches to workload management come in too.