A: I'm Dave Zolotusky, a principal engineer at Spotify, and today I have Rajan and Dave from Fidelity Investments with me as our guests. In these live streams, we bring in end user members to showcase how their organizations navigate the cloud-native ecosystem to build and distribute their services and products. Join us every fourth Thursday at 9 a.m. Pacific time. This is an official live stream of the CNCF, and as such you're subject to the CNCF code of conduct; please don't add anything to the chat or questions that would be in violation of the code of conduct.
B: Yeah, sure, I'll go first. My name is Rajarajan Pudupatti; people call me Rajan. I'm part of an organization called ECC within Fidelity, and within that I'm part of the cloud platform team. I primarily work on the Kubernetes-based projects, and the goal of our team is basically to set up the next-generation application platform for our users. To put it in the right words: if, let's say, a developer within Fidelity wants to do something to achieve a particular business objective, we want to make it as simple and as easy as possible for that developer. So that's me; very happy to represent the cloud platform team.
C: Hi, my name is Dave Batello. I'm in charge of the private platform squad within Fidelity, where we're primarily focused on building out Kubernetes platforms to run our container workloads on premise. We have probably around 40 production Kubernetes clusters on premise running at any one time, supporting a variety of workloads with close to about 10,000 containers.
A: Cool, that sounds like a lot. I'm sure it feels like a lot sometimes. Before diving deeper into the Kubernetes part, I'm curious: can you tell us a little more about the infrastructure setup at Fidelity, and what prompted you to start adopting cloud-native tools and Kubernetes?
B: Yeah, I can start with that. So basically we have a mix of on-prem and cloud, and we are on multiple cloud providers as well. Just to give an idea: instead of just setting up clusters and then opening them up for the users, our goal was to come up with a platform which is more Fidelity-specific, in the sense that we want all the best features available from the CNCF technologies to be available for the users, but at the same time we have some hard constraints from an enterprise standpoint. For example, today, if I'm a developer within Fidelity, it's not very easy to just go and spin up your own Kubernetes cluster, deploy an application, and take it to production. You have to take into account a lot of security aspects that come into the picture, in terms of the image that you are using, the AMIs; there's a long list. And this particular list keeps changing as well. For example, there could be a security event, not just within Fidelity but anywhere outside, that could trigger a different policy or a change to a policy. These constraints are not so easy for a developer to keep up with.

B: So the goal we set out with was to build an application platform where we take all these constraints into account, a platform where, as a developer, you get to experience all the best features from the CNCF technologies, while at the same time you're guaranteed that you are running in a secure Fidelity environment. Security is one aspect of it; there are other aspects like compliance and a lot of other things which we'll get into in detail. But that is the goal: as a developer, we want to make it really easy for you to quickly deliver on the business objectives. So Dave mentioned 40 clusters today; that's on the on-prem side. Totally put together, we have EKS clusters and AKS clusters on AWS and Azure, so the total crosses 300 plus now. We have a dashboard, and sometimes we look at it and it jumps from one number to another, so I'm pretty sure we've crossed 310 or something like that. So that's the high-level setup. Dave, if you want to add.
C: Yeah, for infrastructure from an on-prem perspective: we primarily built out our platform around vCenter, around vSphere infrastructure. For our on-prem services there's also a proprietary API platform which we built on prem that sits in front of our vSphere infrastructure, and we leverage that for a lot of the provisioning of our virtual server instances, which we lay the foundation upon for building Kubernetes. So a large portion of the responsibility for building out the platform entails also understanding how the infrastructure works behind the scenes and tightly coupling our integration deployments, our Kubernetes build-outs, to that infrastructure.
B: Yeah, and I just wanted to add on the CNCF technology standpoint: we use Kubernetes, of course, but we also use Helm, so Helm is our standard for packaging. Most of the time when I mention something like Helm, keep in mind we always have this approach where we don't want to restrict users; we have this famous saying that it's batteries included, but swappable. So as a user you have the option to switch to something else as well, but Helm is one of the most widely used packaging mechanisms. This platform, which we'll talk about in more detail, is kind of a base platform on top of which you can run a lot of things. In that perspective we have Envoy, which is running on top of EKS clusters as part of an API gateway and things like that. So it's a combination of all these technologies.
B: We try to look at where the community is going, and we want to stick with the community. So almost always we look at the landscape and make sure that we pick something from the landscape. For example, we use some, if not all, of the components from Flux CD: the Helm Operator, which is part of Flux, now the helm-controller, and now I think it's the GitOps Toolkit. So we use that extensively; we have actually built an open source project called Kraan on top of it. So basically we took these technologies from the CNCF and built something on top of them, just to extend them a little bit for our use case, and we have open-sourced that as well. So the GitOps Toolkit is one example, and Kraan is the open source project that we have built on top of it. We use containerd as well. These are the major ones, and we're always constantly exploring projects on the telemetry side too. We are looking at Fluent Bit and things like that. Yeah, we use...
C: ...Fluent Bit for a lot of our log collection. We're also pretty heavily invested in OPA from a governance and compliance perspective, so we're using OPA to build policies and constraints around how we govern the platform itself. There are all different types of policies that we've implemented to enforce specific things, like the metadata that is associated with namespaces. We're also looking to make the migration very shortly from the native PSP policies within Kubernetes over to OPA to do that policy enforcement.
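To make the namespace-metadata idea concrete, here is a minimal sketch in Python of the kind of rule such a policy encodes; in practice this would be a Rego policy enforced by OPA/Gatekeeper at admission time, and the required label names here are invented for illustration, not Fidelity's actual schema.

```python
# Illustrative sketch of a namespace-metadata rule of the sort OPA could
# enforce at admission time. The required label names are invented.
REQUIRED_LABELS = {"cost-center", "business-unit", "environment"}

def validate_namespace(manifest: dict) -> list:
    """Return a list of violation messages (an empty list means admit)."""
    labels = manifest.get("metadata", {}).get("labels", {})
    missing = REQUIRED_LABELS - set(labels)
    return ["namespace is missing required label: %s" % l for l in sorted(missing)]

ns = {"metadata": {"name": "team-a", "labels": {"environment": "dev"}}}
violations = validate_namespace(ns)
print(violations)
```

An admission webhook would reject the request whenever the returned list is non-empty; here the sample namespace is missing two of the three invented labels.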
A: That makes a lot of sense, and I think both of you alluded to or touched on security for the environment. I don't think you necessarily said it, but I assume there's a lot of regulation as well, so I'm curious how being in such a regulated space, one required to be highly secure, impacts the way you look at all of this infrastructure and open source tooling.
B: Yeah, so we have very strict regulations, being part of the financial industry, and that's actually important as well; the way we look at it, it's very important to have. So we have built the platform in such a way that, from a user standpoint, if you talk to any of the Fidelity developers today, and let's say they are in AWS, they don't look at it as an EKS cluster or a Kubernetes cluster; they look at it as a Fidelity platform. Internally we call it Fid-EKS, Fid-AKS, and so on, so they always refer to it as, hey, Fid-EKS version one. We have our own Fidelity platform versioning, so they usually say Fid-EKS 1.0, 2.0 and so on.
B: So coming to the security point, what we've done is package all these things as part of the platform. Whenever we make a release, there are all these add-ons; I'll give you an example. The OPA that Dave was mentioning is something which is part of the Fidelity platform. So from a Fidelity user's standpoint, they don't look at it as one standalone add-on which is running in a cluster; he or she is basically exposed to the features that come out of the add-ons.
B: So the way we try to portray it is: we have this platform, and you have all these features that are available to you; don't focus on the add-ons aspect of it, because behind the scenes, between Fid-EKS 1.0 and 2.0, we might actually switch an add-on to something else, or we can combine two add-ons. We can do a lot of things as part of the platform, but from a user standpoint they just look at it as a feature.
B: So from this perspective, we have built the security features into the platform. Let's say we are going from one Kubernetes version to another: we have this rigorous process where we check every single add-on that is part of the platform. It goes through a process to make sure that, between the versions of Kubernetes as well as between the versions of the add-ons,
B: there is nothing that has changed that impacts our security guidelines. It could be as simple as a particular add-on version's base image: let's say the base image that is part of the new add-on version is not compliant with some of the current security policies. That's one good example. Those are things that we will actually validate as part of our rigorous validation process before we release our platform version. So what happens is, whenever the users get a notification that, hey, we have this Fid-AKS 2.0 or Fid-EKS 2.0, most of these things are actually already handled for them, and we do make it part of our internal release notes so that they are aware of what all has changed.
B: Sometimes you have to take a slightly different approach. For example, let's say we want a particular change because something is not in compliance with one of our security policies, but at the same time we are unable to get that change from the open source project immediately. Those are the cases where we will come up with certain workarounds, for a small period of time.
B: We could release the feature as part of the platform version, but at the same time do some workaround for a period of two months or so, until the actual change comes from the open source project. These are the things that we do as a platform team, but from a user standpoint they're unaware of them; from their standpoint, you have a feature that is very stable and working.
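The platform-versioning idea described above can be sketched as a release manifest: each platform version pins a Kubernetes version plus every bundled add-on version, and a release step can diff two platform versions to drive the internal release notes. All names and version numbers below are invented for illustration; this is not Fidelity's actual release format.

```python
# Hypothetical platform-release manifests: one platform version pins the
# Kubernetes version and every bundled add-on. Names/versions are invented.
RELEASES = {
    "fid-eks-1.0": {"kubernetes": "1.19", "addons": {"opa": "0.24", "fluent-bit": "1.6"}},
    "fid-eks-2.0": {"kubernetes": "1.20", "addons": {"opa": "0.26", "fluent-bit": "1.7"}},
}

def diff_addons(old: str, new: str) -> dict:
    """List add-ons whose pinned version changed between two platform releases."""
    a, b = RELEASES[old]["addons"], RELEASES[new]["addons"]
    return {name: (a.get(name), b[name]) for name in b if a.get(name) != b[name]}

print(diff_addons("fid-eks-1.0", "fid-eks-2.0"))
```

The point of pinning everything under one platform version string is exactly what Rajan describes: users track "Fid-EKS 2.0", and which individual add-ons moved underneath is an internal release-notes detail.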
A: Yeah, and so what does that make versioning and upgrading look like from a user standpoint? Like, you release Fid-EKS n plus one...
B: So basically, what happens is, maybe a little bit on the Fidelity structure first: we're not one central team which manages all the clusters. Fidelity is a large organization and we have a lot of sub-organizations. Each business unit is a company by itself, with a lot of developers, and they have their own dedicated DevOps teams and SRE teams. So the way it works is that a business unit will have a DevOps team and an operations team.
B: So when we release, it's actually them taking the platform version and then upgrading the cluster. It's not as if we do it for them; we put the tools in place, we have a UI where they can actually go and do it, but we don't do it for them. They have their own timelines and it's up to them. So basically it goes like this: we release a Fid-EKS version, and we'll have a call.
B: We have a release call, and then the users get to see what new features are coming in and what all the breaking changes are, and then they get to decide when they can actually do it.
B: We do have a time frame where we support something like the n minus three or n minus four versions, so it's not as if a particular business unit can stick with a particular version forever. That window is there, but within it they pick the version and they actually upgrade the clusters themselves. And if there are any issues, that becomes a platform issue; it's an issue happening with platform 2.0, and then we jump in, and we have all this monitoring.

B: We have dashboards and everything set in place, so people know up front: if somebody's doing a cluster upgrade, a platform upgrade, and something is going wrong, we would automatically jump in. So that's how it works. From the standpoint of a DevOps team member who's picking up the version: we might have packaged 20 different add-ons within the platform, and each could be on its own version, but they're not worried much about that.

B: They look at the entirety of it as one single version. So even if one of the add-ons is not working, let's say a particular feature is not working, from their perspective it's basically the platform version that is unstable, and we just release a patch version for it. That's how it goes, actually.
C: Yeah, so on the private cloud side, on prem: a lot of what Rajan was talking about was how we manage things out in the public cloud, and a large portion of that is self-service. They provide the release out to the customer base, the customers consume it, and they pretty much have the ability to roll it out on their own schedule.
C
On-Prem
we're
a
little
bit
more
prescriptive
over
it.
We
take
a
little
bit
more
control
over
it.
It's
probably
more
of
a
managed
offering
more
than
anything,
and
so
what
we've
been
trying
to
do
over
the
last
you
know
year
is
try
to
obviously
keep
up
with
the
kubernetes
versions.
That's
that's
a
challenge
right.
So
this
past
year
we
had
a
target
to
try
to
do
four
upgrades
in
one
year,
and
so
we
did
four
upgrades
this
this
year.
C
What,
by
the
end
of
this
year,
we'll
have
done
four
kubernetes
upgrades
with
a
target
of
one
per
quarter,
so
hopefully
by
the
end
of
this
year,
and
maybe
my
team's
listening
will
get
120
out
the
door
by
the
end
of
this
year,
and
so
that's
a
substantial
amount
of
workload.
That's
just
to
keep
up
with
the
versions
that
doesn't
include.
C
You
know
all
the
work
that
we
do
around
like
add-ons
things
that
you
know,
rajan
was
mentioning
around
maintaining
conversions,
for
you
know
all
the
different
charts
that
we
roll
out
to
support
the
environment
and
provide
other
capabilities.
B: Yeah, I see a question from maywork: how do you keep up with the upgrading? Dave, if you're okay, I just wanted to touch upon that, because it's something where I think some of our learnings could be useful for the users.
B: So there is this problem, especially when you are multi-cloud. Imagine we are trying to have a platform which provides certain features, so from a user standpoint they're looking at this unified platform, and it is supposed to run irrespective of where you are: whether you are on-prem, or natively on Azure. It's a very difficult thing to do, especially when, for example, on the on-prem side, let's say we are using Rancher there.
B: The versions that Rancher would support will be slightly different. I'm referring to the n minus four versus n minus three problem: for example, one vendor might say, my current is 1.20 and I follow the n minus three model, so at any point in time 1.17 is the oldest supported. At the same time another vendor on the cloud, AKS for example, could be in a situation where they are doing 1.19 and n minus four, so their least supported version is 1.15. So how do you
B: deal with this? It's a very tricky problem, and there's no clean solution to it, let me put it that way. That's where we, the platform leads, constantly meet, and sometimes we ask Dave, for example, to slow it down a little bit so that we can catch up, things like that. But one thing that we always put first is the stability of the platform.
B: Even if, let's say, 1.20 (I'm just taking a number) has some really important feature, and one of the teams, or some of the teams, are waiting for it: if we think that we won't be able to provide this uniform experience, say Azure wants to move forward while AWS lags behind, and that is going to be the situation, we really evaluate it and we try not to do that. We try to wait. So stability becomes more and more important than releasing new features, and sometimes we'll actually tell the application teams: hey, this is a feature that you want, but can you live without it for another few months?
B: Is it absolutely essential? Because the point here is that stability always comes first. That is one thing. And it took some time for our internal users too; I think it's a mindset. For example, upgrading: these are big clusters, and a lot of critical applications are running in them. So a year back we went back to them and said, hey, the community moves very fast.
B: If you look at Kubernetes as a project, the developers are amazing; they come up with all these features very quickly, so the versions move very fast, and so we release versions as well. A year back it was very difficult for our internal customers to digest the fact that every two months or so they have to do a major upgrade.
B: Now, looking at how stable the whole thing has been (it's more of a perception thing), we have taken it to a point where it's okay to do the upgrade; it'll be stable. Building that sort of trust is very important. If you can put all your efforts towards making the upgrades really, really stable, then the users build on it.
B: It's this confidence that builds in the user. For example, look at our version-upgrade validation process. We have a mix of add-ons: a lot of community add-ons, plus some of our custom builds, including a lot of operators that we write ourselves. Every single add-on gets a rigorous walkthrough.
B: We have a separate set of smoke tests and integration tests that is very well maintained, and it will almost always catch an issue that's mapped to a particular version or something like that. So there is a rigorous amount of work that goes into validating each of the add-ons that are part of the platform.
B: So we put a lot of effort towards the stability aspect of it, and that in turn increases the confidence of the users, and now it's the new normal. It's not the way it was back when upgrades happened a few times a year for major platforms; that's not the case right now. So building that confidence in your users is very important. I just wanted to add that.
A
Yeah
now
that
makes
sense,
I
think,
of
a
question.
That's
kind
of
building
on
that
exact
kind
of
stability
and
confidence
in
the
user.
Part
they're,
asking
about
how
you
make
sure
that
upgrades
or
updates
to
any
of
these
components
are
safe
to
apply
and
on.
B
That
that's
a
good
point,
so
one
of
the
things
that
we
have
we
do
is
we
sort
of
have
a
structure
where
I
know
it
differs
from
company
to
company.
But
we
have
to
follow
an
approach
where
we
have
certain
engineering
clusters.
We
call
it
as
test
clusters,
platform
engineering
clusters,
so,
for
example,
I'll
give
an
example.
Let's
say
between
a
development
and
the
testing
and
the
production
environment
itself
like
there
are.
B
Usually
there
are
differences
in
terms
of
policies
and
stuff
like
so
we
make
sure
that
our
testing
clusters,
the
platform
engineering
clusters,
are
on
all
these
spaces.
So
when
we
start
out
first
of
all,
before
even
going
to
the
platform,
nearly
everything
starts
from
your
local
right.
B
We
have
a
very
strong
set
of
test
cases
that
are
very
well
maintained
right,
so
it's
it's
it's
it's
based
on,
of
course,
combination
of
cucumber
and
all
different
sort
of
things,
so
we
have
a
very
strong
set
of
integration
test
smoke
test
that
is
very
well
maintained.
I
keep
stressing
on
the
very
well
very
well
maintained
because
it's
easy
to
come
up
with
the
first
set,
but
sometimes
like
over
a
period
of
time.
You
can
easily
like
you
know,
not
maintain
it
very
well.
B
Then
it
loses
its
purpose,
so
we
use
we
sort
of
rely
on
that
which
will
actually
catch
a
lot
of
issues,
and
even
after
that
there
is
a
rigorous
testing
on
an
environment
basis
we
sort
of
tested
in
like
platform,
engineering,
dev
platform,
engineering
production.
B
These
are
efforts,
but
for
for
for
our
scale,
these
are
like
you
know
massive
for
for
supporting
300
clusters.
We
cannot
afford
to
make
mistakes.
We
do.
A
B
Mistakes
here
and
there,
but
we
do
everything
possible,
like
we
sort
of
things
to
do
it,
so
that
is.
That
is
one
thing
right.
We
have
like
strong
integration
test,
the
test
suite
that
is
like
well-maintained,
we
test
through
every
you
know,
environment
type,
and
after
that
also
when
we
release
it
again.
This
is
something
where
we
don't
upgrade
all
the
300
clusters,
as
I
said
before,
it's
more
of
the
users,
picking
it
and
picking
picking
up
the
release.
B
So
we
also
try
to
see
if
we
can
actually
work
with
some
of
the
business
units
who
are
usually
they're,
okay,
to
pick
up
something
first
right,
so
there
we
worked
them
very
closely
to
see
if
there
are
any
issues
in
the
in
the
development
clusters
when
they
upgrade,
they
usually
start
with
development
clusters.
So
we
sort
of
monitor
that
very
closely.
B
We
have
very
strong
logging
in
telemetry,
which
sort
of
helps
us
that
if
somebody
is
picking
up
a
release
and
putting
it
in
their
dev
clusters,
let's
say
that
is
the
first
of
300
clusters
that
is
getting
up
upgraded,
like
all
our
eyes
are
on
this
right.
So
we
we
watch
it
very
carefully
and
if
we
see
an
issue
then
we
sort
of
quickly
you
know
revert
to
it.
B
Sometimes
we
even
you
know
it's
rare,
but
we
can
even
like
pull
out
the
whole
release
and
say
that
you
know
what
like
it.
You
know
we
will
come
up
with
the
patch
fix
and
stuff
like
that.
So
no
straightforward
answer,
but
one
one
good
thing.
If
one
point
one
take
away,
if
you
want,
I
would
say,
maintaining
a
strong
set
of
you
know:
integration
test,
suite.
C
Yeah,
I
could
add
on
to
that
a
little
bit
I
mean
from
from
the
on-prem
perspective.
Like
rajan
said,
you
know,
we
we
definitely
have
spent
a
lot
of
time,
building
out
these
test,
suites
unit,
testing,
functional
testing
and
ensuring
that
we're
not
just
doing
this
testing.
C
You
know
at
the
end
of
a
release
cycle,
but
we're
doing
these
types
of
tests
all
the
time,
and
so
some
of
the
strategy
behind
it
really
is
around
building
that
end-to-end
testing,
something
that
we
can
run
on
a
daily
basis,
something
that
is,
you
know,
bringing
issues
to
our
attention
on
a
daily
basis
versus
you
know,
finding
out
right
before
the
end
of
the
release.
C
I
think
the
second
piece
of
that
for
us
is
really
rolling
out
these
releases
in
a
little
bit
of
of
a
you
know,
canary
fashion,
if
you
want
to
call
it
that,
where
we'll
do
you
know
in
in
our
area,
we
have
multiple
zones
in
multiple
regions,
we'll
do
one
at
a
time.
C
From
a
non-production
perspective,
we
give
our
business
partners
adequate
time
to
cycle
through
that
environment,
ensure
that
they've,
you
know,
maybe
deployed
workloads
multiple
times
become
comfortable
with
it
and
then
subsequent
scheduling
of
of
the
upgrades
to
our
production
clusters
happening.
You
know
during
tech
windows.
You
know
during
times
when
there's
the
least
chance
for
impact
to
our
production
running
workloads.
C
So
that's
a
lot
of
the
method
behind
the
madness.
That's
for
sure.
B: Yeah, there's another question on the chaos engineering stuff, so I want to take that; it's a very good question. We've been doing the chaos engineering stuff since early 2021, but the point I want to stress is that even two years back, and I remember this very clearly, even in 2019, we made sure that if, say, we want to add a feature to the platform and the feature comes from a particular community-maintained add-on, there is testing associated with it. Even when we bring that in, we make sure that before you can plug it into the platform, you have to add your test case to it. We run helm test against all the add-ons, so there is no add-on that can go into the platform without a test case associated with it. We also take it a step further: we have this open source project called Kraan, where we came up with the idea of something called layers.
B: So what happens is, basically, you have a collection of add-ons. Look at our case: we have clusters running in Azure, AWS, and then on-prem. There are certain add-ons that run everywhere, but there are certain add-ons that run only in one cloud, say Azure, and there are certain add-ons which run only on-prem, for example. So we came up with the idea of layers.
B: The reason I bring up this layer concept is that even two years back we were very clear that each add-on should have a test associated with it, and this layer, which is a collection of closely related add-ons, will have an integration test associated with it, which is basically another Helm chart. So imagine a layer which has five add-ons; each is a Helm chart.
B: So each of them has a test, and then there will be the last add-on in the layer, which is a Helm chart that basically does the integration testing of all those add-ons. These were significant efforts, but they paid off really well in the long term. So helm test is extremely important: even if you're picking up a community project which doesn't have it, please add it, add that to your list, and at the same time come up with an integration-test Helm chart.
B
You
know
that
can
actually
validate
like
how
certain
add-ons
how
they
work
together.
I'll
give
an
example,
for
example,
as
a
part
of
our
onboarding
process
right,
so
we
created
an
extension
to
namespace
called
namespace
groups,
so
the
users
typically
they're
not
exposed
to
namespace.
They
always
start
with
something
called
namespace
group.
So
as
soon
as
you
create
a
namespace
group,
there
are
certain
things
that
happen.
So
your
your
reading
groups
are
automatically
you
know
created.
B
There
are
certain
things
that
happen
and
it's
it's
basically
a
work
done
by
a
few
add-ons
together.
So
there
is
like
a
an
integration
test,
helm
chart
which
basically
checks
this
particular
thing
right
so
yeah.
These
are
some
of
the
things
that
you
know
we
have
been
doing
like
even
from
the
beginning.
At
the
same
time
recently
we
have
early
2021
starting
early
to
21.
We
have
started
focusing
a
lot
on
the
gas
engineering
stuff,
so
we
that
that's
part
of
our
suite
as
well
now.
C
Yeah
and
the
chaos
engineering
aspect
of
it
right,
so
you
know
we.
I
think
that
we're
dabbling
in
that
right
now
I
have
you
know
I
have
definitely
looked
at.
You
know
integrating
chaos
mesh
into
some
of
our
pipelines.
That
would
not
only
handle
you
know
building
you
know,
building
these
these
clusters
running
through
unit
tests,
but
also
knocking
things
over
you
know
and
then
ensuring
that
the
platform
continues
to
function
as
as
we
expected
to
so
we're
still.
A
Yeah
that
makes
a
lot
of
sense
and
then
for
the
specific
tests.
I
think
there's
kind
of
the
question
about
case
engineering,
but
also
post
mortems
and
things.
Do
you
do
things?
Do
you
have
ways
of
ensuring
that
times
when
it
does
go
down
or
you
do
run
into
issues
that
doesn't
happen
again
like
how
you
bring
that
back
into
your
testing
frameworks?.
B
Yes,
yes,
that
that
that
actually
happens.
So,
let's
let
me
think
about
it.
So,
basically,
typically
what
happens
is
like
when
we,
when
we
we
sort
of
prioritize
our
it
could
be
an
obvious
thing,
but
I
think
that's
something
I
just
want
to
stress
upon,
because
it
really
works.
Well,
we
we
prioritize
stability
over
features.
That's
that's
that's
that
might
sound
obvious,
but
it's
something
which
is
very,
very
important.
B
If
we
find
there
are
certain
things
that
we
have
not
done
wrong,
it
actually
feeds
back
and
then
we
sort
of
focus
on
that
first
before
the
next,
the
new
features.
So
the
reason
I
say
it's
obvious
thing
is
it
takes
effort
when
you
bring
that
back
when
you
discuss
in
your
you
know,
sprint
meetings
and
stuff
like
that.
This
is
given,
like
you
know,
very
high
importance,
so
I
think
it's
it's
part
of
our.
B
Maybe
I
don't
know
it's
the
same
culture
now
that
we
have
to
focus
more
on
the
stability.
I
think,
if
you
have
like
a
small
team
with
a
few
clusters,
then
it's
a
different
thing,
but
especially
when
you
are
holding
all
these
like
300
plus
clusters
for,
like
you
know,
thousands
of
developers
and
a
big
organization
like
fidelity
stability
comes
first,
so
we
sort
of
immediately
take
that
and
then
put
it
back
to
our.
You
know
sprint
to
make
sure
that
the
changes
are
done
to
the
the
test.
C
Yeah
I
mean
added
on
to
that
version.
I
think
that,
like
it's
pretty
much
ingrained
into
our
fidelity
dna,
that
root
cause
analysis
is
the
de
facto
method
for
us
coming
to
conclusions
of
what
needs
to
be
fixed
right.
So
I
mean
my
team
is
very
well
versed
in
the
fact
that
you
know
when
we
find
problems
that
we
need
to
come
to
that
root,
cause
to
understand
how
we
can
resolve
that.
So
it
doesn't
happen
again
so
that
our
application
partners
don't
run
into
these
types
of
problems
down
the
road
so
yeah.
C
We
spend
a
lot
of
time
tracking
trying
to
ensure
that
we
are
opening
up
stories
and
and
and
understanding
when
we
haven't
figured
things
out.
You
know
that
we
get
back
to
those
and
we
drill
into
those
things.
You
know.
As
a
matter
of
fact,
we
were
talking
about
some
of
those
things
this
morning
before
we
lucky
enough
to
join
your
your
your
broadcast
here
so
yeah
and.
B
And maybe one more important point: we look at things in a slightly different way. For example, let's say something is happening on the customer's side. We have three different environments which are supposed to mimic the customer environment in terms of security profiles and everything. So if there's something they are catching that we are not catching, we try to look at why that difference exists, which means
B
there is a mismatch in terms of how the environment is configured versus ours. We even try to look at the fundamental process that was broken to let this happen, and we go and fix that, so that not only will this problem not happen again, but many similar types of problems will not
B
occur. So we go down to that level where, even if it's a very basic, Fidelity-specific process, we try to
B
push to make changes or automate it. Basically, we try to analyze the base cause: not just at a high level of why this happened, from this one particular problem's perspective, but to the extent of how we prevent not just this problem but similar types of problems from occurring. One example, maybe a couple of years back, was the way our IAM roles were managed in AWS.
B
We took a big step and came up with our own framework based on stack sets and things like that, so we changed the whole process. It was a lot of effort to actually do that, but now, when I look back, that is one of the most important things we did. We had to get a lot of approvals, because that was already a hardened process, but we got the approvals and we changed it.
B
And after that, we've not seen not just that issue: that whole space is now very stable. So you have to go to that level, if that makes sense.
A
Yeah, that does make sense. One more potentially quick thing on testing before we move along. You talked a lot about testing, and it sounded to me like you were mostly talking about testing the Fidelity platform, the pieces you've built. I'm curious whether your automation and your tests also catch potential issues in the tooling, like if there's a change in Kubernetes that breaks something for you. Does that get caught here, or is there a different process for catching things like that?
B
It's baked in, actually. For example, the integration test piece that I talked about: it includes test cases for the community add-ons, not just our stuff. Sometimes we actually go and raise issues upfront, and that benefits the community as well. So it covers all the Kubernetes stuff as well as the community add-ons.
A
Cool, that makes a lot of sense.
A
So then, I just wanted to take a bit of a step back and hear a little more about the overall architecture. You've mentioned multiple clouds and you've mentioned 300 clusters, but that's about as far as we know, so I'm curious to dig a little deeper. How big are these clusters?
B
Absolutely, absolutely. I'll start, and then I think you can also jump in. Most of our clusters are in the medium-size range; medium in the sense that I don't think any of them has more than 75 nodes or so.
B
Not huge, for sure, but at the same time not small; most of them fall in that range. Everything is multi-tenant; that is one of the key things we decided back when we started in 2018. So there are a lot of processes built around it to support the multi-tenancy.
B
One of the examples was the stuff I was mentioning earlier around the extension of namespaces. For example, when a team onboards, instead of just mapping it to a particular namespace, we created an entity around it: every team, when they onboard, gets something called an NS group. That is the Kubernetes aspect, but there is a Fidelity-specific aspect in terms of how it gets integrated; for example, how do the AD groups get created, right?
B
In terms of the number of teams, it varies, but you could easily find 50 to 75 teams working in a cluster, and I'm talking about teams of, say, eight people or so, and they cannot step on each other. There are frameworks built around resource quotas, limit ranges, and things like that; for example, the NS group concept includes a section for the resource quota.
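The resource quotas and limit ranges Rajan describes are standard Kubernetes objects; a minimal sketch of what a platform could stamp into each tenant namespace (names and numbers here are illustrative, not Fidelity's actual values):

```yaml
# ResourceQuota: caps aggregate consumption for one tenant namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a-dev        # hypothetical tenant namespace
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "200"
---
# LimitRange: per-container defaults so one pod cannot grab the whole quota
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-limits
  namespace: team-a-dev
spec:
  limits:
    - type: Container
      default:                 # applied when a container sets no limits
        cpu: 500m
        memory: 512Mi
      defaultRequest:          # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
```

Together these mean tenants can deploy freely inside their namespaces while staying inside a bound the cluster admin set once at onboarding.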
B
So as a cluster admin, I can say this team gets this NS group with certain resource quotas and so on, which means the team themselves can go and add and delete namespaces within that NS group, but it is bound to a particular set of constraints. So it is multi-tenant. Most of them follow the typical approach of having a cluster admin, and usually there is little dependency on the cluster admin.
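Fidelity's NS group is an internal construct, but conceptually it resembles a custom resource that a controller reconciles into namespaces, quotas, and directory-group bindings. A purely hypothetical sketch (the `NSGroup` kind, API group, and every field name here are invented for illustration):

```yaml
# Hypothetical custom resource: one per onboarded team.
# A controller would reconcile this into namespaces, RBAC
# bindings to AD groups, and per-namespace resource quotas.
apiVersion: platform.example.com/v1alpha1
kind: NSGroup
metadata:
  name: team-a
spec:
  owners:
    - adGroup: APP-TEAM-A-ADMINS   # maps to a directory group
  namespacePrefix: team-a-         # members may create team-a-* namespaces
  quota:                           # aggregate bound set by the cluster admin
    requests.cpu: "20"
    requests.memory: 40Gi
    pods: "200"
```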
B
What I mean by that is you go to the cluster admin for the initial stuff, where they onboard you and set the constraints; after that, we try our best not to create a process where you have to go to the cluster admin again and again. So mostly teams are independent. We do have a pipeline setup, but a pipeline is something that can become opinionated, so teams can bring their own; most of them use their own deployment pipelines and things like that.
C
Yeah, from the on-prem side of things, again, we've adopted, or prescribed to, the notion of smaller clusters. Rajan mentioned medium-sized clusters; initially the thought behind it was maybe we'd go with larger clusters, but ultimately it comes down to blast radius.
C
We've learned that the automation involved with maintenance around these, like rehydration, gets lengthy: it takes a lot of time to rehydrate a thousand nodes. By breaking it down into smaller pieces, we're really able to create decision points where we can decide whether to move on with other clusters if, say, we ran into a problem, or just continue, or stop dead in our tracks and revert or pause things, so to speak.
C
So: smaller clusters, multi-tenant, mostly business-aligned. A lot of our clusters are specifically business-unit aligned, so that creates separation between our business partners, and then within those clusters they're definitely multi-tenanted, where all the different development groups work within that cluster, separated through namespaces. And in non-production
C
we see a lot of those namespaces delineated by their various development cycles.
C
And for the most part, those non-production clusters tend to be, on average, at least two times larger than our production clusters, just based on the number of workloads in the various environments they cycle through during application development before they get to production. From an architectural perspective on-prem, I mentioned earlier that we're predominantly on vSphere; on top of that, we front all of our clusters with AVI load-balancing services, and that AVI load-balancing service sits
C
pretty much as an L4 proxy down to the nginx ingress controllers, which handle the path-based routing, and then we also allow them to use node port ranges so that they can do direct pod traffic. So yeah.
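The L4-proxy-in-front-of-nginx pattern Dave describes ends at a standard Ingress object for the path-based routing; a minimal illustrative example (host, namespace, and service names are invented):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: team-a-routes
  namespace: team-a-prod          # hypothetical tenant namespace
spec:
  ingressClassName: nginx
  rules:
    - host: apps.example.com      # VIP terminated on the external L4 load balancer
      http:
        paths:
          - path: /orders         # path-based routing to a backing Service
            pathType: Prefix
            backend:
              service:
                name: orders-svc
                port:
                  number: 8080
```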
A
So I guess the big thing here is just around monitoring and observability of your clusters: what tools do you use, how do you collect metrics, logs, traces, all of the normal observability things, and how do you monitor the health of the workers?
B
Yeah, we'll talk about it; I think there's also one question on the ownership of the cluster. Basically, as I said, the business units are our internal clients. We provide this platform, which is basically a framework with a collection of tools to manage it, but the ownership of the cluster is actually with the business units. The platform has defined its own set of roles.
B
There's something called the global admin, the platform admin, and so on. A business unit will have a DevOps or SRE team, and they'll actually be the cluster admins. We do have overall access, but they are the actual cluster admins.
B
So for upgrades, or if, for example, you're having issues with a resource quota and things like that, that's where users actually go. That is the ownership of the cluster, and the ownership is basically divided that way. And whenever I say platform: it's more than a collection of namespaces, though from a Kubernetes standpoint, it is a collection of namespaces.
B
If you open up a Fidelity cluster, you have this set of management namespaces and system namespaces. Anything with "-system" is a system namespace, similar to kube-system, where all the critical add-ons run, and then we have a collection of management namespaces where everything from the cluster autoscaler to the ingress controller runs. That set of things put together is the platform. If any issue happens there,
B
we are responsible for it. The cluster admin doesn't even have to look into it; it comes straight to us, because it's a platform issue: your platform is unstable. Anything other than that, say an issue with a particular resource quota or limit range within a user namespace, that's where the cluster admin will come into the picture. Beyond that, we have another role called the namespace admin.
B
So if I'm the owner of a namespace group, I have a collection of namespaces and admin access to them, which means I can do whatever I want within them. For example, say I'm trying to install some sort of CRD-based operator: there is an automated process where you can submit a request, and a particular add-on within the platform
B
will create the custom resource definition for you; from then on, you're on your own. That's how we've done it. So, coming back to the monitoring strategy: it's actually a mix, but we have a combination of Datadog, Splunk, and tools like that, and we have a very good collection of pre-built dashboards. At any point in time, when you have 300 clusters, look at it
B
this way: each cluster has this collection of namespaces, which I said is the platform. So across these 300 clusters, if the platform is unstable on any of them, we will get to know; that's how we set it up. From our standpoint, we just look at a particular platform version: when we release platform version 1.0, we know which clusters are upgraded and which are not, and at any point we would get to know if platform
B
1.0 has issues in any of the clusters. So we use Datadog in combination with Splunk, and we have all these pre-built dashboards. We use metrics heavily. As for metrics, logs, and traces: I think some of the community projects have tracing, and some of them
B
don't, nor do some of the internal tools we've developed. I think we're still in the process of making the best use of tracing, but we're getting there. In terms of monitoring, again, it's separated: anything platform-related comes to us, anything application-related goes to the namespace admin, and anything else goes to the cluster admin. That's how we've separated it.
B
So if an application team has an issue with their deployment, it doesn't come to us. At this point I just wanted to touch a little bit on something we are actually working on, just so users can maybe think along these lines. Take the problem where you have these deployment pipelines: today, if a deployment is having an issue, we have an SRE team.
B
It comes to us sometimes, but most of the time what usually happens is: if I'm a mid-level developer with four or five years of experience, I usually go to the team lead first, and then the team lead will go to the business unit DevOps teams, right. So what we are trying to do now, as part of the platform, is come up with another sort of system. Imagine this; this is what we are trying:
B
You have a Jenkins pipeline, for example. Imagine we give you a Jenkins plugin where, any time your Jenkins build fails, it prints out a link; you click on the link, and it tells you what the problem is. That is something we are actually working towards, and hopefully we'll open source it. We're trying to build some machine learning models and then do some analysis on top of them to come up with these answers.
B
The reason I mention it is that we are now trying to take it a step further, so that for each developer, who is the actual user of the platform, we focus on the pain points they have and try to solve them. I just wanted to touch on that a little bit, but going back to monitoring again: maybe, Dave, do you want to add something?
C
Sure, yeah. One of the questions on the board was: how do you monitor worker health? We do have some basic things for workers that just ensure the virtual machine exists and is responsive.
C
That's really just basic monitoring; the monitoring itself really comes from Datadog. The Datadog monitors we've set up are specifically looking at the components within the clusters. For instance, if your kubelet is down, then your node is not going to work, right? From that perspective, the node is inoperable; it's not functioning. So that's kind of how we prescribe to it.
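A component-health monitor of the kind Dave describes can be defined against Datadog's monitor API as a service-check monitor; a hedged sketch, where the check name, tags, notification handle, and thresholds are all illustrative rather than Fidelity's actual configuration:

```json
{
  "name": "[platform] kubelet failing on {{host.name}}",
  "type": "service check",
  "query": "\"kubernetes.kubelet.check\".over(\"cluster_name:prod-team-a\").by(\"host\").last(2).count_by_status()",
  "message": "kubelet is failing its health check; the node is effectively inoperable. @slack-platform-oncall",
  "options": {
    "thresholds": { "critical": 1, "warning": 1, "ok": 1 },
    "notify_no_data": true,
    "no_data_timeframe": 10
  }
}
```

The idea is that the alert fires on the in-cluster component (the kubelet check) rather than only on VM liveness, matching the distinction drawn above.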
C
So a lot of the monitors we've written are really around that service-component health status. From a logging perspective, as Rajan mentioned briefly, we do use Splunk. We have kind of a mixed bag of logging: we use Datadog in some areas, we use Splunk, and we also have a team internally that has built out some really interesting architecture around an aggregation tier based on fluentd, publishing into Kafka.
C
Those Kafka topics are then read by ELK, and that's how we're able to use Kafka almost like a traffic manager: where do I send logs for these specific clusters? Because there are different requirements from the business lines around where they want their logs to land. So yeah, I hope that answers some of those.
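A fluentd aggregation tier that fans container logs out to Kafka topics, as Dave sketches, would typically use the `fluent-plugin-kafka` output; a minimal illustrative fragment in which the broker addresses, paths, and topic names are invented:

```
# Tail container logs and ship them to a per-cluster Kafka topic;
# downstream consumers (e.g. Logstash/ELK) subscribe per topic.
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.pos
  tag kube.*
  <parse>
    @type json
  </parse>
</source>

<match kube.**>
  @type kafka2                      # from fluent-plugin-kafka
  brokers kafka-1.example.com:9092,kafka-2.example.com:9092
  default_topic logs-prod-team-a    # routing key: one topic per cluster/business line
  <format>
    @type json
  </format>
  <buffer>
    @type file
    path /var/log/fluentd-kafka-buffer
    flush_interval 5s
  </buffer>
</match>
```

Routing by topic is what lets Kafka act as the "traffic manager": each business line's ELK (or Splunk forwarder) consumes only the topics it is entitled to.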
B
I think the Datadog and Splunk agents we have include these checks by default, but on top of that, as part of the platform, we have deployed, if I'm not wrong, the node-problem-detector add-on; I think it's part of the Kubernetes project itself. We are using that, and it actually helps as well. But I just wanted to mention one problem which we still have today. Look at it this way:
B
let's say there is an application deployment that failed. If you look at the logs, it will say the Helm release timed out. If you run kubectl get pods, it will say the pods are pending. Then you figure out why the pods are pending, and you'll see your nodes are unstable. Then you figure out why the nodes are unstable: it will be something to do with the cluster autoscaler, or something happening on the network side that is affecting the AWS autoscaling groups.
B
There is a chain of things. So one of the pain points developers have today is that when something like this happens, even with the current solutions, even if you set up alerts, you open the mailbox and you have a flood of them. It's not as if someone tells you, "hey, all these problems are happening; forget about them, just fix this one, this is what you need to focus on, and everything else will resolve automatically."
B
This is the problem that the project I mentioned earlier, based on AI and machine learning and so on, is trying to solve: when a deployment fails and the developer clicks on the link, we want to tell them, "hey, there is this network issue happening, and your autoscaling group is having an issue."
B
These are problems we still have, even though we have a pretty sophisticated monitoring setup today. Sometimes, say there is a network outage going on that is affecting a lot of things: for a developer who's just looking at a Jenkins pipeline, it takes hours to get that information. Sometimes he raises a ticket, someone has gone for lunch, they come back, they look at it,
B
they raise something in the team's chat, and then somewhere along the way they get to know that networking is working on it. That correlation is missing. But the way we look at it is: they are users of our platform, and this is part of the platform experience, so we want to enhance that.
B
So, alongside the existing monitoring, we are investing effort in the area of how we use the latest ML techniques to make this better for them. And there was a question: how does one differentiate between logging and tracing?
B
For most of the add-ons, I'm not sure they do a lot of tracing at this point. Most of our monitoring starts with metrics, and then from metrics we try to correlate to logging. We have seen that whenever you have tracing, that is the best thing you have, right?
B
You start with metrics, which is where you get the alerts; then you go to the trace, and then you get the logs. At this point, though, everything starts with metrics and then goes to the logging. But some of the latest stuff we are trying to do based on machine learning is actually the reverse.
B
You start with the logs instead. It's interesting; it's for the future, but that's one thing we've been doing. And there was a question around which component sends the message that actually reaches the kube API.
B
I didn't fully get that, but basically, the communication between the control plane and the kubelet differs by distribution: for example, EKS is slightly different from, say, Rancher. "How do you do inter-cluster networking?" That's a very good question.
B
That is still one of our pain points, let me tell you. It's not a problem that we've solved; we are working on it. There are solutions we are still looking at, but I can say that's a problem we've not solved yet.
B
Yeah, so the stuff that I mentioned earlier: there are things that are getting built on top of our platform. For example, there's an ML platform that we're trying to build on top of it.
B
Similarly, there's an API gateway that gets built on top of it. The Fidelity platform that we've built, when I look back over the period of two years at whatever we have done, is now like a solid foundation that people can actually build on top of, so internal clients, the business units, can build on it. The layers concept that I talked about earlier is basically a collection of YAMLs.
B
They can now contribute and say, "hey, I have this set of machine learning features available; I'm packaging it as a layer; apply it on top of your platform and that becomes your ML plugin." At the same time, it's an ML platform with all the Fidelity constraints set on it.
B
So the stuff I mentioned was along those lines: there's an API gateway that is actually getting built on top of our platform, and that is where it is actually used.
B
As for cluster pod-to-pod networking: at least on the cloud side (on-prem, maybe, the answer is different), we stick to the native CNI drivers; we don't use an overlay, at least at this point. We don't enforce it, but we do have teams using Calico within our clusters; it's not part of the platform yet, but they can install Calico on top of our platform and then do network policy and things like that.
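The network policies that teams enforce with Calico are expressed through the standard Kubernetes NetworkPolicy API; a minimal illustrative example (namespace name invented) that restricts a tenant namespace to in-namespace traffic:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cross-namespace-ingress
  namespace: team-a-prod          # hypothetical tenant namespace
spec:
  podSelector: {}                 # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}         # allow traffic only from pods in this same namespace
```

This is a common multi-tenancy guardrail: once any NetworkPolicy selects a pod, all other ingress to it is denied by default, so cross-namespace traffic has to be opened up explicitly.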
B
Yeah, and as I said, cluster-to-cluster traffic is still something that goes out and then comes back in through the ingress; we don't have any global service mesh or anything like that at this point.
A
All right, that makes sense. I guess with that we can wrap up. I just want to thank everyone for joining us today for this episode of the Cloud Native End User Lounge. It was great to have both of you, Rajan and Dave, on to talk about Fidelity, and we had some great interaction and great questions from the audience. We bring this end user lounge to you on the fourth Thursday of every month at about 9:00 a.m.
A
Pacific time, so we hope to see you at the next one. Don't forget to join us for KubeCon + CloudNativeCon North America, October 12th through 15th, to hear the latest in the cloud native community. Also, if you'd like to showcase your usage of cloud native tools as an end user, join the end user community; there are a lot more details on cncf.io under end user. Again, thanks everyone for joining us today, and hope to see you next time.