From YouTube: Lockheed Martin Customer Case Study: AI from Training to Edge with MicroShift (OpenShift Commons 2022)
Description
Customer Case Study: AI from Training to Edge with MicroShift at Lockheed Martin
Red Hat OpenShift Commons 2022 @ KubeCon NA
Detroit, Michigan
October 25, 2022
Speakers:
Ian Miller & Matt Wittstock (Lockheed Martin)
https://commons.openshift.org/gatherings/kubecon-22-oct-25/
Matt Wittstock: Good afternoon, Detroit, and thanks for having us. My name is Matt Wittstock, and this is Ian Miller. We are from a team at Lockheed Martin called AI Factory, and we're going to talk a little bit about AI from training to the edge.

If you jump forward here: the AI Factory, as you can see up here, is not actually a factory. We are a team that's focused on the end-to-end machine learning lifecycle. We're doing everything from data ingest, data cataloging, and versioning everything on the data side, so you have clean, reproducible work there; moving into the training side; deploying models, which we're going to show you a little bit about here today; and then actually maintaining those models: sustainment, monitoring, and retraining. And then, as you can see here, all of the hardware that underlies all of the work we do for AI Factory.
Ian Miller: Yeah, so a lot of AI Factory's use cases are built around the concept of MLOps, a reasonably new field, but really it's just taking DevOps principles and bringing in machine learning, because the lifecycle for machine learning is a little different from DevOps. You have a lot more complexity in your pipelines. Your pipelines might not actually put out the same thing every single time; they're not necessarily completely reproducible. There's inherent stochasticity in the middle of it.

So there's a new discipline forming around that called MLOps, and really the need, from Lockheed Martin or any other company that's employing ML, is to be able to trust what you're putting in the field with machine learning, which becomes really challenging once you can't see inside of why the decisions are being made all the time. So a lot of MLOps is really about keeping track of every little step of that pipeline.

How can we trust this thing that is new, when we don't really know why it's making decisions the way that it is? There is a field of explainable AI, which is the study of specifically why a model makes decisions, but what we found is that, in addition to that, you really just need a robust process to get your models to production, and then you can actually trust them based on all these other inputs as well. And of course, all the stuff I'm describing really needs a lot of elasticity and scaling, and GPUs are super expensive. So you might do that in the cloud, but you also might be doing that on-prem, and from the Lockheed perspective...
Matt Wittstock: You go; nice. So, talking a little bit about the platform and why we are here: we believe very much in open source first, and we use a lot of open source tools. One of the big things that we are all about is making sure we don't go anywhere near a monolith. We're really a lot of small, composable modules, pieces of software that we put together into what really makes up AI Factory. That also allows us, when we work with the different programs and different teams that we have across Lockheed Martin, to let them pick the components they need for their business, versus taking some very large piece of software, bringing it all in, and getting that through a number of different processes that happen to exist. So we really focus on that: it's all open source, we can bring in what we need, we can see how it was built, and then, like I said, we modularize it. And of course, everything runs on top of Kubernetes.
B
So,
of
course,
the
platform
is
a
little
bit
of
a
loaded
term,
but
I'm
going
to
talk
about
on
this.
Both
the
platform,
as
in
the
the
software
side,
but
also
the
platform,
and
where
do
we
actually
run
this
it's
a
little
bit
of
both
as
Ian
said,
we
have
to
be
massively
scalable,
which
means
both
scale
to
zero,
sometimes
because
the
cost
is
crazy
if
we're
doing
anything
outside
of
a
cloud
world
to
scaling
to
whatever
our
need
is
for
a
very
large
data,
set
very
large
training
jobs
that
we
might
have
out
there.
We also have to always be changing. Especially in the AI field, new tools come in all the time; there are always new things coming in. So we're always taking a look at what is out there, what is available, and what we can bring into our stack that meets a business need for us. Is there maybe a new tool that meets a need we had, where we already have something out there and there's a better version of it now? The same thing goes for all of the underlying hardware infrastructure and the Kubernetes layer.
As far as where we deploy, we're all over the place. We've got environments that we run out on our public clouds, as well as our government clouds. Of course, we have very large central infrastructure; in fact, the one that we tend to use the most is one that we've built.
We have an NVIDIA SuperPOD that we do the vast majority of our work on, 160 A100 GPUs, and that's just one of the many GPU HPC environments that we have. And then, of course, we've got a couple of different things for edge work. One that we use quite often is an HPE Edgeline; we'll talk about it more, and it's actually what we're running a little bit of our demo on. We didn't bring it with us. It's very loud.
B
You
wouldn't
be
able
to
hear
us
on
stage
but
fantastic
piece
of
hardware,
and
then
we
do
a
lot
of
work
on
embedded
now.
So
this
is
where
we've
really
brought
in
microchip.
Moving
from
you
know,
a
large
scale,
openshift,
that's
running
on
the
rest
of
our
environments,
to
a
very
light
edge,
kubernetes
that
we're
running
on
devices
like
these
right
here,
so
we'll
jump
into
that
a
little
bit
more
as
we
go
into
a
demo.
So
you
can
jump
forward.
Ian Miller: ...and how that kind of process goes. We do a lot of our training, just because of the resources necessary, on the large clusters; the NVIDIA SuperPOD is a great example there, where we have all these GPUs to bring to bear. But then in most cases, at least in the embedded or edge world that we live in, your model is not going to be able to run on those cloud resources. You need it deployed elsewhere.

Maybe it's to the Edgeline, which is a four-node server box that you can take with you to places, or even to embedded spaces like the NVIDIA devices here. So really, we need to be able to move those models that we train to these different environments, and they might be optionally connected or not. The way that we've been able to make that the smoothest is to standardize our platform layer
across those different form factors. So we're running OpenShift or Kubernetes in the data centers, and also on these edge boxes that we can take to different places with us, and now we're also running it on our embedded devices as well. Because you have that standard stack and that standard underlying platform, it gives you a lot of flexibility to move between those environments, and that's the success we're seeing.
Let's talk a little bit about the actual underlying tech stack that we're using for those model serving environments. We lean heavily on a lot of the projects that we'll see here at KubeCon; I think in the last week or two there have been announcements of several of these projects even getting donated to the CNCF, which is really cool.

So really, our stack right now is some AI Factory inner-source pieces, Red Hat OpenShift on our large devices, and now MicroShift running on our edge devices. A lot of the training and a lot of the AI work in the cloud is done with Kubeflow, and then KServe, a project that came out of that, which we use for model serving both in the cloud and at the edge. We'll be running KServe on some ARM devices, which, I don't know if it's the first time it's been done, but certainly we had to build a lot of it for ARM.
Matt Wittstock: Yeah, so let's actually talk a little bit about an edge use case. In fact, I don't have any links or anything in front of me, but there was quite a bit of press that came out recently, just this morning: we did fly MicroShift on top of one of our Lockheed Martin Stalkers. So that put MicroShift out in the air, very exciting, running a bunch of containerized AI workloads.
Talking a little bit about the use case for deploying this kind of AI (we're going to show off the actual AI inference): some of our different use cases often involve a lot of large data collects. In this case, you can see we've got a little Stalker flying, shown here taking a lot of video footage. We collect a lot of this footage, bring it somewhere central, and then we train a bunch of models to build a better
B
Do
you
know
understanding
of
what
is
actually
being
seen
in
that
video,
and
then
we
deploy
that
out
into
the
field
on
these
stalkers
and
now
we're
actually
getting
live
immigrants.
In
this
case
of
what
I'm
kind
of
showing
here,
so
this
is
sort
of
a
computer
vision
use
case
now.
What
we
really
focus
on
there
as
well
is
not
only
then
just
deploying
that
out
very
important,
but
now,
as
you're
out
there
doing
inference
you're,
actually
both
learning.
How
is
this
running?
B
Is
it
actually
performing
very
well
and
you're
capturing
a
lot
of
new
data?
So
every
time
we
do
some
sort
of
flight,
we're
capturing
a
lot
of
new
data,
and
so
then
we
have
all
of
this
to
bring
back
again
back
into
our
Central
environment.
With
all
of
this
new
data
we've
collected.
All
of
that
new
data
then
has
to
be
labeled
stored,
cataloged
versioned,
slightly
out
of
order
there,
but
all
of
these
different
things
to
the
week
and
then
take
and
train
the
model
better
again,
you
know
use
all
this
new
data.
We might have some new things that we saw in the field that we're able to train the model to better understand, deploy it back out again, and continue that loop over and over and over again. That's one of the big things we really focus on: all the pipelines, all the training, everything involved in making this as seamless as possible, so we can just collect data, train, send it back out again, and complete that loop. I don't know if there's anything...
Ian Miller: ...this kind of space, to be fair, has only recently really become possible, with some of the more capable compute you can now deploy to those embedded devices. But really, if you think about it, a lot of our use cases need to be able to change their mission mid-flight.

That goes for the software side, but it's even greater on the AI side, because you might need to deploy a new model, or several models, at a moment's notice, and be able to swap between them depending on what your mission is or what that end device is doing. In dynamic environments like that, the AI needs to be just as dynamic. And what we've already found from running these in the use cases
C
That
Matt
is
talking
about,
is,
is
we're
able
to
like
have
seamless
transition
of
models
with
no
downtime.
But
switching.
You
know
how
the
AI
is
interpreting
the
data
that's
coming
in,
or
maybe
we
can
deploy
a
new
capability
from
the
AI
perspective
all
seamlessly
without
without
taking
down
the
asset
to
go
like
re-flash
it
and
then
send
it
back
up,
and
so
you
know
it's
certainly
possible
to
do
without
the
container
orchestration.
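The no-downtime swap described here maps naturally onto a Kubernetes rolling update. A minimal sketch, assuming a hypothetical Deployment named vision-model and an invented image tag (this is not Lockheed's actual setup):

```python
# Hypothetical sketch: swap a containerized model with no downtime by
# patching the serving Deployment's image and letting Kubernetes roll it out.
# Requires: pip install kubernetes
from kubernetes import client, config

config.load_kube_config()  # load_incluster_config() when run on the device

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "model-server",  # invented container name
                        "image": "registry.example.com/vision-model:v2",
                    }
                ]
            }
        }
    }
}

# Kubernetes starts v2 pods, waits for them to pass readiness checks, and
# only then retires the v1 pods, so inference traffic never sees an outage.
client.AppsV1Api().patch_namespaced_deployment(
    name="vision-model", namespace="edge-ai", body=patch
)
```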
Matt Wittstock: So this is leading up to the potentially fun part here. We'll talk a little bit about the demo that we're going to be doing; I had to decide if I'm going to talk very slowly to preload the demo.
We've got a couple pieces of hardware that we're doing some of this prototyping on and trying to show off here. Back in Denver, we've got one of our Edgelines that is running some image generation and, I guess, a little bit of NLP, all that.
Ian Miller: ...from OpenAI; it takes speech to text. And then we have another model called Stable Diffusion. We've pulled that from Hugging Face, another open source model (I think it's Stability AI who trained that one), and that one takes text to image.
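The transcript cuts off the name of the speech-to-text model, but OpenAI's open source Whisper matches the description. A rough sketch of this kind of speech-to-image pipeline, assuming Whisper plus Stable Diffusion through Hugging Face's diffusers library (model IDs and file names are placeholders, not the demo's actual code):

```python
# Hypothetical speech-to-image pipeline: a speech-to-text model transcribes
# a spoken prompt, then Stable Diffusion renders it.
# Requires: pip install openai-whisper diffusers transformers accelerate
import torch
import whisper
from diffusers import StableDiffusionPipeline

# Speech to text (the "base" Whisper checkpoint keeps memory use modest).
stt = whisper.load_model("base")
prompt = stt.transcribe("spoken_prompt.wav")["text"]

# Text to image, pulled from the Hugging Face hub; assumes a CUDA GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
image = pipe(prompt).images[0]
image.save("generated.png")
```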
Matt Wittstock: So we'll see how that goes here in just a second when we switch to that tab, because we might be preloading a lot of data. But before we jump there: as I mentioned earlier, up here we are running two NVIDIA AGX dev kits, the new Orin devices. We've got these running up here with MicroShift on them. We were initially going to run a lot of this demo on them, but switched to just showing how to do a lighter inference on these devices as we go along.
Ian Miller: It's hard to see, but the text is what it's interpreting from what we say, and then the picture will be whatever it is that we're talking about. We've been talking about tech stuff, so that makes sense.
A little. To be fair, these models are meant to run in huge data centers, so it takes a little bit of processing and time to pick them up. "Cats in space." We'll give it a few more, and then we can turn our attention to the edge boxes.
Ian Miller: Again, we're not taking credit for the models; the ability to run them on some edge boxes is the part we'd claim. But yeah, so that's one of them, and then let's take a look at... I've got to drop off the...
So really, this is what's running on these devices right now. We have MicroShift running on there; you can see a lot of the control plane pods. If you can't see, I'll narrate it for you: there are maybe six or seven containers, I think, running the core MicroShift piece of it, and then we have a full cert-manager running on there, and an Istiod.
The reason for that is that it's really a dependency for the type of KServe deployment that we did. KServe is pretty cool if you haven't checked it out. It basically gives you custom resources for deploying models, and so there's a KServe controller on here, and it will actually interact with Istio for you, it'll interact with cert-manager, and it'll build you all the stuff you need. So, in the base case, all you have to do is say "here's my model,"
and it'll actually go spin that up for you. So, pretty cool. We have all that control plane running here, and then our data plane, which is our CIFAR-10 predictor. I did make the namespace, if somebody can see that.
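A minimal sketch of what "here's my model" can look like with KServe's InferenceService custom resource, created via the official Kubernetes Python client; the namespace, model name, and storage URI are invented placeholders, not the demo's actual manifest:

```python
# Hypothetical sketch: deploying a model through KServe's InferenceService
# custom resource. Requires: pip install kubernetes
from kubernetes import client, config

config.load_kube_config()

isvc = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "cifar10-predictor", "namespace": "models"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "pytorch"},
                # Placeholder location; KServe pulls the model from here.
                "storageUri": "s3://example-bucket/models/cifar10",
            }
        }
    },
}

# The KServe controller reconciles this resource, wiring up Istio routing
# and certificates (via cert-manager) so you don't have to.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="models",
    plural="inferenceservices",
    body=isvc,
)
```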
And then, as an example of inference: a model, at the end of the day, is software, right? So really, what you're doing is deploying it with a REST endpoint. KServe also supports things like a Kafka ingest or gRPC, but the base case is just to deploy a REST endpoint.
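For that REST base case, KServe's v1 data-plane protocol is a plain JSON POST to a :predict path. A hedged example; the hostname and model name are placeholders:

```python
# Querying a KServe v1 REST endpoint. Requires: pip install requests
import requests

url = ("http://cifar10-predictor.models.example.com"
       "/v1/models/cifar10-predictor:predict")

# A CIFAR-10 predictor expects a 32x32 RGB image; zeros stand in here.
payload = {"instances": [[[[0.0] * 3] * 32] * 32]}

resp = requests.post(url, json=payload, timeout=30)
print(resp.json())  # plain JSON back, e.g. {"predictions": [...]}
```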
Again, the result is just JSON, because it's an endpoint, but you can see we're running the CIFAR-10 model on the ARM boxes. And what that entails: I think, as we start to see more ARM, both on...
...at the Westin. So yeah, that's just an example of the same technology that we can run in our data center, running on MicroShift right in front of you. One other thing I'll say, since I think we both talk fast and we've got a little bit of time to vamp here: one thing I think is ripe in this space is that, even though we can run these control planes on the edge devices,
there are processes that you don't have to have on the device. A lot of these frameworks are built to run in cloud environments, and one place that I think is ripe for innovation is being able to have the control plane running even on an Edgeline-type device, maybe not in the data center at all, but able to manage a bunch of these embedded, deployed models that are out in these optionally disconnected environments. So I'm going to put that out there into the world.
Most of the collaboration for the data scientists and machine learning engineers happens on our training platforms, so we deploy multi-tenant Kubeflow on there, and I think it's just the ability to get an environment that has all of their dependencies and easy access to GPUs that has really accelerated their ability to train a bunch of models. And then, speaking of other things that we have deployed in that training environment, we have a number of distributed training tools.
For instance, Kubeflow comes with one called Katib. The Ray community is awesome and has built some really cool distributed training pieces, and then there's also Determined AI, another open source tool that we use there. That gives them, again, easy access to distributed training, and they can collaborate on building these larger training jobs. So far they've been sharing our GPU clusters (they're scarce resources), and we've been able to have them share those clusters to pretty good effect.
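To give a flavor of the Ray piece of that, here is a minimal sketch of fanning trials out across a shared cluster; the objective function and values are invented for illustration:

```python
# Minimal Ray sketch: run hyperparameter trials in parallel on a shared
# cluster. Requires: pip install ray
import ray

ray.init()  # use ray.init(address="auto") when attached to a real cluster

@ray.remote  # on a GPU cluster you might request @ray.remote(num_gpus=1)
def train_trial(learning_rate: float) -> float:
    # Stand-in for a real training loop; returns a fake validation score.
    return 1.0 / (1.0 + abs(learning_rate - 3e-4))

# Launch all trials at once; ray.get blocks until every result is back.
scores = ray.get([train_trial.remote(lr) for lr in (1e-4, 3e-4, 1e-3)])
print(max(scores))
```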
Ian Miller: But it's really just the ability: what a data scientist or a machine learning engineer needs is quick, easy access to large-scale GPU and distributed training environments, and that's not necessarily easy to build. So, from our perspective, we've built a platform team at Lockheed Martin to deliver that to our array of engineers. I know that's not the reality for every company, to have the resources to do that, but it's worked very well for us.