Description
Diego 2019 Project Update - Sunjay Bhatia & Amin Jamali, Pivotal
In this talk, the Diego team will survey how the Diego components interact inside of CF to run application instances and tasks and then dive into how those interactions have evolved over the past year to improve system stability, security, and scale. This talk will also review how recent work in Diego supports powerful platform capabilities such as first-class support for app developer specified sidecars, improved reliability of the CF routing tiers, and other features that the core Cloud Foundry teams are working on or considering for development today.
For more info: https://www.cloudfoundry.org/
A
Alright, so: speed. First, we expect to move quickly, deliver features to our users, and reduce feedback loops. That's how we think of it as part of the Diego team. We want to execute on our backlog quickly, so we're continuously providing value to our users, and so they can in turn move quickly.
A
Some examples: OCI mode, a collaboration with the Garden team, in order to increase the speed at which you can scale your applications and create containers. Instead of having your application droplet, the tar file, copied into a container, there's a feature where you can make that droplet a container image layer.
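To make the droplet-as-a-layer idea a bit more concrete, here is a minimal Go sketch of the general shape: instead of extracting the droplet tarball into the container filesystem, it is described by digest and size as an image layer. The struct and function names are illustrative, not the actual Garden or Diego APIs.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// LayerDescriptor is an illustrative stand-in for an OCI image layer
// descriptor: a media type, a content digest, and a size in bytes.
type LayerDescriptor struct {
	MediaType string
	Digest    string
	Size      int64
}

// describeDropletAsLayer treats an existing droplet tarball as a layer
// blob instead of something to copy into the container root filesystem.
func describeDropletAsLayer(dropletPath string) (LayerDescriptor, error) {
	f, err := os.Open(dropletPath)
	if err != nil {
		return LayerDescriptor{}, err
	}
	defer f.Close()

	h := sha256.New()
	size, err := io.Copy(h, f)
	if err != nil {
		return LayerDescriptor{}, err
	}

	return LayerDescriptor{
		MediaType: "application/vnd.oci.image.layer.v1.tar+gzip",
		Digest:    fmt.Sprintf("sha256:%x", h.Sum(nil)),
		Size:      size,
	}, nil
}

func main() {
	desc, err := describeDropletAsLayer("droplet.tar.gz")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", desc)
}
```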
We've also worked with the Garden team to calculate more accurate CPU metrics, and with the Garden Windows team on route integrity for Windows, so that communication from the Gorouter to your applications is over TLS; we're bringing that feature as a beta, experimental feature for users on Windows. There's also more parity for Windows with CF SSH, with external network plugins, so you have the same experience across Linux and Windows. We've also worked with the Buildpacks and CAPI teams on bringing sidecar support to the platform, and with the CAPI and observability teams on tagging logs and metrics for your applications with more information, so you have more to work with as a downstream consumer.
B
Next up we have stability and scalability. As a team, we expect Diego components to keep running and to keep your apps running. We've come to expect Diego to perform well in larger installations running thousands of apps, and we should continue to expect that from Diego and keep refining and improving on this experience. To be a bit more specific, issues that happen at scale can often be attributed to environmental faults, which the system should continue to become more resilient to.
B
We worked on various features over the past year that helped solidify and improve the Diego components as far as stability and operating at scale go. To name a few things, we'll start with NATS. NATS failures, where messages are somehow missed, are common in installations. Specifically, there was an issue where deleted apps remained routable in that scenario, so we've lowered the chance of running into it by keeping a cache of route removals.
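As a rough illustration of that fix (not the actual route-emitter code), a small cache can remember recent unregistrations so they can be re-sent for a while; all names here are illustrative.

```go
package main

import (
	"sync"
	"time"
)

// removalCache is an illustrative sketch of remembering recent route
// unregistrations so they can be re-sent, in case the original NATS
// message was lost and a deleted app would otherwise stay routable.
type removalCache struct {
	mu       sync.Mutex
	ttl      time.Duration
	removals map[string]time.Time // route -> when the removal was recorded
}

func newRemovalCache(ttl time.Duration) *removalCache {
	return &removalCache{ttl: ttl, removals: make(map[string]time.Time)}
}

// Record notes that a route was unregistered.
func (c *removalCache) Record(route string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.removals[route] = time.Now()
}

// Pending returns removals still within the TTL that should be re-emitted,
// dropping entries that have expired.
func (c *removalCache) Pending() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	var out []string
	for route, at := range c.removals {
		if time.Since(at) > c.ttl {
			delete(c.removals, route)
			continue
		}
		out = append(out, route)
	}
	return out
}

func main() {
	cache := newRemovalCache(30 * time.Second)
	cache.Record("deleted-app.example.com")
	for _, route := range cache.Pending() {
		_ = route // re-publish the unregistration message here
	}
}
```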
B
There has also been an issue in some of the larger installations where Locket held on to some of its SQL connections, and we've improved on that as far as connection lifetime goes, without actually impacting performance. There are also new metrics to help monitor that connection usage over time.
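For the connection-lifetime part, Go's standard database/sql pool already exposes the relevant knobs; the sketch below shows the general idea with placeholder settings and an arbitrary driver, not Locket's actual configuration.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql" // driver choice is illustrative
)

func main() {
	// The DSN is a placeholder; a real deployment would get this from config.
	db, err := sql.Open("mysql", "diego:password@tcp(127.0.0.1:3306)/locket")
	if err != nil {
		log.Fatal(err)
	}

	// Bound how long any single connection may live, so connections are
	// recycled instead of being held indefinitely, and cap pool sizes so
	// a large installation doesn't exhaust the database.
	db.SetConnMaxLifetime(10 * time.Minute)
	db.SetMaxOpenConns(20)
	db.SetMaxIdleConns(10)

	// db.Stats() exposes pool counters (open, in-use, idle) that can be
	// emitted as metrics to monitor connection usage over time.
	log.Printf("open connections: %d", db.Stats().OpenConnections)
}
```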
A
All right, now let's take a deeper look into one of the problems that we addressed in the past year. Before we start, just for anyone who's not familiar: we're going to be talking about Locket, which is Diego's distributed locking component, backed by a SQL database. The BBS and auctioneer use it to maintain which of their many nodes is the active one, and the Diego cells also periodically check in with Locket so that the locking component can keep track of which cells are available for the control plane.
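A simplified sketch of the cell-presence side of that, with an illustrative client interface rather than Locket's real gRPC API:

```go
package main

import (
	"context"
	"log"
	"time"
)

// PresenceClient is an illustrative interface for a Locket-like service:
// a cell registers itself and must keep checking in before the TTL expires.
type PresenceClient interface {
	RegisterPresence(ctx context.Context, cellID string, ttl time.Duration) error
}

// maintainPresence checks in with the lock service on every tick so the
// control plane keeps considering this cell available for placement.
func maintainPresence(ctx context.Context, client PresenceClient, cellID string) {
	ttl := 15 * time.Second
	ticker := time.NewTicker(ttl / 3) // heartbeat well inside the TTL
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := client.RegisterPresence(ctx, cellID, ttl); err != nil {
				// A single missed check-in is tolerated; convergence decides
				// what to do if the cell stays missing.
				log.Printf("presence check-in failed: %v", err)
			}
		}
	}
}

func main() {
	// Wiring up a real client is out of scope for this sketch.
	_ = maintainPresence
}
```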
A
The BBS ends up using this information from Locket to make some decisions during convergence, which is the process of reconciling the desired state from the Cloud Controller with the actual state of the world that's running in your foundation. When the BBS makes decisions about changing the state of an LRP, or if an LRP crashes or whatnot, it sends events to the route emitters that run on each cell. They communicate with the Gorouter about existing routes, and they get this information from the BBS via events.
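Conceptually, a convergence pass looks something like the following heavily simplified sketch; the real BBS models and logic carry far more state than this.

```go
package main

import "fmt"

// Illustrative, heavily simplified types.
type DesiredLRP struct {
	ProcessGUID string
	Instances   int
}

type ActualLRP struct {
	ProcessGUID string
	Index       int
	CellID      string
}

type Event struct{ Description string }

// converge compares desired state against actual state and the set of
// cells still checking in with Locket, and returns the events that would
// be sent on to the route emitters.
func converge(desired []DesiredLRP, actual []ActualLRP, presentCells map[string]bool) []Event {
	var events []Event

	running := map[string]int{}
	for _, a := range actual {
		if presentCells[a.CellID] {
			running[a.ProcessGUID]++
		} else {
			events = append(events, Event{fmt.Sprintf("instance %s/%d is on a missing cell", a.ProcessGUID, a.Index)})
		}
	}

	for _, d := range desired {
		for have := running[d.ProcessGUID]; have < d.Instances; have++ {
			events = append(events, Event{fmt.Sprintf("start a new instance of %s", d.ProcessGUID)})
		}
	}
	return events
}

func main() {
	events := converge(
		[]DesiredLRP{{ProcessGUID: "app-1", Instances: 2}},
		[]ActualLRP{{ProcessGUID: "app-1", Index: 0, CellID: "cell-1"}},
		map[string]bool{"cell-1": true},
	)
	for _, e := range events {
		fmt.Println(e.Description)
	}
}
```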
A
So here we have a diagram of what the expected steady state of a system would look like. We have a BBS that knows about three LRPs: two running on cell one, one on cell two. Locket has a record of each of these cells, and they're both checking in with Locket periodically. However, in real environments running at scale, there are environmental faults that happen: network issues, latency, et cetera.
A
What happens when a cell is unable to check in with Locket for some amount of time? Cells are periodically checking in with Locket to make sure the state of the system stays updated, so what happens to the work running on a cell that is unable to check in with the control plane?
A
In the past, before the changes that we made recently, when cell one was unable to talk to Locket, the BBS would notice this during convergence, which runs periodically, and the LRPs running on that cell were replaced: the BBS tried to replace all the LRPs on this missing cell. What actually ends up happening is that you get downtime in your application, because those new replacement instances shadow the old instances. So why does this shadowing happen?
A
Fundamentally, we always represented LRPs as groups in the API and in all of our internal logic: all of the endpoints to fetch LRPs and all of the event endpoints grouped LRPs together. A group is basically the potentially evacuating instance and the ordinary, regular instance that correspond to an individual application instance.
A
So what did we do to fix it? Instead of grouping LRPs, we're now dealing with them individually. Instead of adding another field to the group, we thought it made more sense to move forward with a flattened structure, so we can be more flexible in the future if we need to make more changes. This was kind of a big change.
A
Ordinary means that the LRP is running normally; it's kind of a combination of the cell state and the LRP state together. Evacuating means the LRP is on a cell that's evacuating. Suspect means it's on a cell that we don't know what's actually happening with, because it hasn't checked in with Locket, so it might be gone, or it might not be.
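A rough sketch of what the flattened shape looks like compared to the old grouped one, with the three presence values; the types are illustrative, not the exact BBS models.

```go
package main

import "fmt"

// Presence captures the three states described above.
type Presence string

const (
	Ordinary   Presence = "ordinary"   // running normally on a healthy cell
	Evacuating Presence = "evacuating" // on a cell that is being evacuated
	Suspect    Presence = "suspect"    // on a cell that has stopped checking in
)

// FlatActualLRP is an illustrative version of the flattened record: one
// record per instance, with its presence carried alongside it.
type FlatActualLRP struct {
	ProcessGUID string
	Index       int
	CellID      string
	Presence    Presence
}

// The older, grouped shape paired an instance with its potentially
// evacuating counterpart; adding a third, suspect member would have meant
// widening this struct everywhere it was used.
type ActualLRPGroup struct {
	Instance   *FlatActualLRP
	Evacuating *FlatActualLRP
}

func main() {
	lrp := FlatActualLRP{ProcessGUID: "app-1", Index: 0, CellID: "cell-1", Presence: Suspect}
	fmt.Printf("%s/%d on %s is %s\n", lrp.ProcessGUID, lrp.Index, lrp.CellID, lrp.Presence)
}
```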
A
So with this change, what happens now? LRPs on the missing cell aren't immediately replaced: they're marked as suspect, and replacements are created but not immediately used. They're kind of waiting until we know that we should use them. Since we don't technically know the state of the cell, we're optimistic and expect the work to continue running, to ensure maximal routability to your apps during small environmental flakes, like one missed check-in with Locket.
A
If convergence notices that a cell is missing, we don't want to drop routing to your applications; we want to maintain routability as much as possible. So if a cell goes missing, once the replacements start running, they take over the routes to that application, and the original instances are deleted. So you should have a seamless transition if a cell kind of blips out of presence: if a cell is unable to talk with Locket, you should have a seamless transition and not have any downtime for your applications running on that cell.
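The takeover decision can be sketched roughly like this (illustrative types, not the actual BBS code): routes stay pointed at the suspect instance until a replacement reports that it is running.

```go
package main

import "fmt"

type instance struct {
	guid    string
	running bool
	suspect bool
}

// resolveSuspect decides which instance should receive routes: the suspect
// instance keeps its routes until the replacement is actually running, at
// which point the replacement takes over and the suspect is removed.
func resolveSuspect(suspect, replacement instance) (routable instance, remove *instance) {
	if replacement.running {
		return replacement, &suspect
	}
	return suspect, nil
}

func main() {
	suspect := instance{guid: "old-instance", running: true, suspect: true}
	replacement := instance{guid: "new-instance", running: false}

	routable, _ := resolveSuspect(suspect, replacement)
	fmt.Println("routing to:", routable.guid) // still the suspect instance

	replacement.running = true
	routable, remove := resolveSuspect(suspect, replacement)
	fmt.Println("routing to:", routable.guid, "removing:", remove.guid)
}
```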
A
Here we're just seeing the removal of those LRPs from the database; they're no longer going to be routable. However, if the cell actually ends up coming back, so that the next time it's able to check in with Locket the control plane is able to find it again, we just delete the replacements. We know that the original instances are already running and already in a good state; we don't need to penalize the cell for missing a heartbeat. We just keep those running and use them, and get rid of the replacement instances.
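And the other branch, sketched in the same illustrative style: when the missing cell checks back in, the suspect instance is promoted back to ordinary and the unused replacements are discarded.

```go
package main

import "fmt"

type instanceRecord struct {
	guid     string
	presence string
}

// cellReturned handles the case where the missing cell checks in with Locket
// again before any replacement has taken over: the suspect instance is kept
// and promoted back to ordinary, and the unused replacements are deleted, so
// the cell is not penalized for a single missed heartbeat.
func cellReturned(suspect instanceRecord, replacements []instanceRecord) (keep instanceRecord, discard []instanceRecord) {
	suspect.presence = "ordinary"
	return suspect, replacements
}

func main() {
	keep, discard := cellReturned(
		instanceRecord{guid: "original", presence: "suspect"},
		[]instanceRecord{{guid: "replacement", presence: "ordinary"}},
	)
	fmt.Println("keep:", keep.guid, "as", keep.presence)
	for _, d := range discard {
		fmt.Println("delete:", d.guid)
	}
}
```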
B
We want to focus on the right things within our team so that we can realize the goals of the community and the foundation, and that means prioritizing and de-prioritizing work to ensure future savings in effort for the core CF teams. This means that refining and fixing the current feature set that's in Diego is a high priority for us, and introducing new features is not as high a priority. If a feature set can be realized as part of the evolution of the platform, it might be best to start it there.
B
Another area that we are continuing to focus on is the Diego operator experience: reducing the time to debug component and app failures. That's an area where we have heard from users and customers that they would like to have more visibility, and an easier time debugging these failures when and if they happen.
A
You can actually now disable those endpoints if you are not using TCP routing, and your app traffic and CF SSH traffic will go over TLS ports that are proxied by the Envoy proxy that comes with each of your application instances, so everything will be over TLS in that regard. It's not currently compatible with TCP routing, but if you just have HTTP apps, everything should be fine. And of course, we're also trying to work on updating dependencies in a more automated fashion.
B
That's all we have for you. Shoutouts that we want to give out: to our PM Josh and the developers on the team; to the CF component teams that we worked closely with, Buildpacks, CAPI, CF Networking, Garden, Garden Windows, and Percy; and most importantly, the CF community, every one of you. A lot of these issues have been discovered via GitHub issues and Slack, so please keep them coming.
B
This is how we can make the system more robust and more stable. If you're experiencing any issues at scale, whether with scheduling or with your apps crashing, please let us know; and if you'd like to have a better experience debugging these failures, please let us know as well. With that, we have a few minutes for any questions, if you have any.