From YouTube: Reducing your Kubernetes Cloud Spend
A
There we go, all right. Welcome to today's CNCF webinar, Reducing Your Kubernetes Cloud Spend. I'm Libby Schultz; I'll be moderating today's webinar. We'd like to welcome our presenters today: Webb Brown, CEO at Kubecost, and Nico Kovacevic, CTO at Kubecost. I hope I didn't just butcher that again. A few housekeeping items before we get started: during the webinar, you are not able to talk as an attendee. There's a Q&A box at the bottom of your screen.
B
Excellent. Well, thank you so much for the introduction, and I just want to say welcome, everyone. Thanks for joining us today. We're going to talk about one of our all-time favorite subjects, one that's near and dear to our hearts: helping you effectively manage or reduce spend when running workloads on Kubernetes.
B
What we're going to do today is, first, present an overall general framework for how to think about different optimizations, or opportunities to reduce spend, and then go into some very practical examples, or war stories, that we've picked up over the past several years while working in this area. So let me first start with just a little bit of background on us. My name is Webb; I'm joined by my esteemed colleague Nico.
B
We are both part of the founding team at Kubecost. We build cost monitoring and cost optimization solutions for teams running applications on the Kubernetes platform. We have more than a thousand teams using our product today, across all major cloud providers as well as on-prem, and we're going to talk about some of the lessons we've learned by working directly with hundreds of them. A lot of what we're going to cover is aimed at cloud environments, but a fair amount of it does actually apply to on-prem environments as well.
B
So why are we here? First and foremost, we very much believe that Kubernetes as a platform presents an amazing opportunity to deliver applications more cost effectively.
B
We
strongly
believe
this
and
feel,
like
we've,
seen
this
in
many
migrations
and
many
production
environments,
but
we
also
believe
that
there
are
certain
things
that
can
nudge
teams
towards
a
risk
of
kind
of
overspending
or
increasing
spin
if
they
don't
focus
on
these
areas,
and
so
there's
kind
of
three
core
reasons
that
why
we
like
believe
this
to
be
true
first,
is
that
when
we
see
teams
move
or
fully
embrace
kubernetes
oftentimes
their
decision-making
process,
how
they're,
actually
deploying
or
updating
applications
is
more
decentralized.
B
This
is
again
a
great
thing,
but
it
oftentimes
leads
to
more
just
moving
parts
to
to
monitor
and
more
dynamic
systems,
and
this
is
again
due
to
you
know
faster
release
cycles,
but
also
things
like
you
know,
auto
scaling,
which
are
programmatically
modifying
releases
as
well,
and
then
part
three.
You
know
now
developers
are
empowered
with
the
ability
to
spin
up.
You
know
all
kind
of
resources
whenever
they
need
them.
You
know
this
can
be.
You
know
hundreds
of
gpus
in
any
region
in
the
world
from
any
major
cloud
provider.
B
Today, again, this is an amazing thing, but it also means that mistakes or oversights can be more expensive when they're not caught. So these are just three things that we want to keep in mind as we think about immediate optimizations, but also about ongoing governance of running workloads in Kubernetes.
B
So now that we have the overall problem framed, we want to present a high-level function, or framework, for thinking about making optimizations, for providing a solution to that challenge. Any time we make an optimization, we're going to be touching at least one of these variables: specifically, we're going to be impacting the amount of time a resource is provisioned, the quantity of that resource, or the price of that particular resource.
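As a minimal sketch, the framework Webb describes here is just a product of three terms. The function below is our illustration of that mental model, not anything from the Kubecost API, and the rate used is a made-up placeholder:

```python
def resource_cost(hours_provisioned, quantity, unit_price_per_hour):
    """Cost = time provisioned x quantity x price per resource-hour.

    Each optimization lever discussed in the talk shrinks exactly one
    of these three factors.
    """
    return hours_provisioned * quantity * unit_price_per_hour

# e.g. 4 vCPUs provisioned for 24 hours at a hypothetical
# $0.035 per vCPU-hour:
print(round(resource_cost(24, 4, 0.035), 2))  # -> 3.36
```

Cutting the bill means reducing hours (autoscaling, turndown), reducing quantity (right-sizing), or reducing the unit price (spot, reservations), as the rest of the talk walks through.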
B
If we look at each one of these: the amount of time something is provisioned is the amount of time your cloud provider is actually billing you for that particular resource. If you think about managing this effectively, something like cluster autoscaling would allow you to shorten that time, or adjust it to just the period where you actually need those resources.
B
Part two of this is the quantity of the resources you're provisioning. Optimizing this at a high level is a right-sizing equation, which we'll talk more about: getting just the right amount of, say, RAM or storage for your particular needs, based on your applications. And then part three is really where the finance component comes into play, specifically around looking at the cost of each CPU, the cost of each GB of RAM, et cetera.
B
That
really
thinks
I
like
about
the
bigger
picture
here,
and
it's
a
thorough
quote,
which
is
the
price
of
anything.
Is
the
amount
of
life
you
exchange,
for
it
definitely
meant
to
be
a
little
tongue-in-cheek
but
kind
of
two
big
points
here.
One
is
that
you
know
we're
not
just
talking
about
the
price
of
cloud
resources.
B
Your
time
is
also
really
expensive.
As
a
you
know,
an
infrastructure
engineer,
so
we
really
want
to
think
about
getting
the
like
biggest
impact
for,
like
you
know
your
time
dedicated
to
it
and
then,
secondly,
is
this.
These
changes
can
also
oftentimes
be
like
hard
to
estimate
in
terms
of
the
amount
of
time
that
it
will
take
for
you
to
optimize
it
that
day
and
then
also
like
manage
it
going
forward.
So
we
want
to
try
to
give
you
a
sense
for
difficulty
estimates
knowing
that
there
will
always
be
contextual.
B
So with that, I want to turn it over to Nico. He's going to touch on a super important part, a kind of precursor to optimizing your infrastructure, specifically around measuring allocation, efficiency, et cetera. Nico, you want to take it away?
C
Sounds good, yeah. Thanks, Webb. I'll share my screen really quick, and then we'll step through basically a concrete version of the framework that Webb just introduced, and familiarize everyone a little bit with some of the metrics that our open source Kubecost project scrapes, computes, and provides, which give teams the ability to do some of this cost optimization.
C
So here we're looking at a dashboard of aggregated metrics from our Kubecost open source project, and you can see that things are aggregated by namespace here over the last day.
C
I'd like to basically just unpack the little heuristic equation that Webb just presented (time, quantity of resources, price per resource) and also touch on this efficiency metric. Efficiency is something we think is one of the most important ways of thinking about this problem.
C
Looking at this dashboard, an example of how you might step through this is noticing that we've got 2.2 percent efficiency here. As you might guess, we would consider that to be pretty low. So we can step through why that is, get to the root of what metrics underlie it and how to think about them, and that will lead us, later in the talk, into how to resolve some of those issues: improve that efficiency number, reduce spend, etc.
C
So we're in the default namespace, looking at memory and CPU cost. If we just drill down into this, we'll see that within this namespace, over the last day, we've got a number of containers running in different pods. The three columns A, B, and C, plus the total cost, correspond directly to the equation that Webb covered. The first is time running: here we've got 24 hours for all of these over the last day, which means they've all been running the whole time.
C
So, predictably, this corresponds to the node these are running on, but it's broken down to specifically how much you're paying for CPU for that workload, and then that yields your total cost. So basically, these are your levers: if you want to be paying less, you've got to reduce your hours running, which may or may not be possible.
C
You've
got
to
reduce
your
the
amount
of
resources,
the
quantity
of
the
resource
which
may
or
may
not
be
possible,
but
commonly
commonly.
This
is
this
is
a
big
one
and
then
price
per
cpu
hour,
which
is
also
a
big
one,
but
again
we'll
step
through
that
later.
For
now,
we
can
actually
drill
down
one
step
further.
C
So
if
we
just
look
at
basically
a
grafana
dashboard
of
the
raw
metrics
for
one
of
these
we're
looking
at
basically
a
test
deployment
pod
that
isn't
really
doing
much.
So
what
we're
seeing
here
is
basically
on
the
left.
You
see
cpu,
cpu,
usage
and
request
and
on
the
right
you
see
memory
usage
in
request.
C
So
this
brings
us
to
back
to
the
idea
of
efficiency
and
when
we
talk
about
allocation
at
kubecost,
we're
talking
about
the
max
of
cpu
usage
and
request,
because
what
we're
trying
to
do
is
with
the
metrics
that
we're
emitting
we're
trying
to
share
with
our
users
what
you're
actually
being
billed
for.
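A rough sketch of that allocation idea as we understood it from the talk, reduced to a single resource; the helper names are ours, not Kubecost's:

```python
def allocation(usage, request):
    # You are billed for whichever is larger: what you requested
    # (the scheduler reserves it on a node) or what you actually used.
    return max(usage, request)

def efficiency(usage, request):
    # Fraction of the billed allocation that was actually used.
    alloc = allocation(usage, request)
    return usage / alloc if alloc > 0 else 1.0

# A pod requesting 1.0 CPU but using ~0.02 CPU is ~2% efficient.
print(round(efficiency(0.02, 1.0), 2))  # -> 0.02
```

Taking the max of usage and request is what makes over-requesting show up as low efficiency: the requested capacity is paid for whether or not it is used.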
C
So
when
we
look
at
these,
we
can
see
basically
from
the
cpu
perspective,
we've
got
a
request,
but
we
have
essentially
zero
usage
and
from
a
memory
perspective,
we've
got
you
know,
30
percent
usage,
but
cpu
is
actually
the
primary
cost
of
this
pod.
If
we
go
back
to
the
top
level,
we
will
see,
14
cents
has
been
sent
spent
on
cpu
and
only
one
cent
on
memory.
C
So,
even
though
our
memory
has
you
know,
30
usage,
it's
such
a
miniscule
amount
of
what
is
being
spent
and
cpu
has
zero,
so
we're
hovering
really
close
to
zero.
We're
we're
basically
a
two
percent
efficiency,
so
this
is
basically
the
way
that
we
think
about
our
core
metrics
and
then
this
provides
us
the
three
levers
to
say:
okay,
what
should
we?
What
actions
can
we
take
now?
Who
can
be
alerted?
C
How
do
we,
how
do
we
basically
start
here
and
end
up
in
a
position
where
we're
spending
less
so
I
think
the
plan
right
now
is
to
pause
briefly
and
take
any
questions
from
this
before
we
move
into
specifics
on
how
to
fix
some
of
these
issues.
So.
B
This is such a critical part to lead into the next part of the discussion. The example Nico gave, I think, is great: looking at the default namespace, 14 out of every 15 cents is spent on CPU, so it actually makes very little sense to look at optimizing your memory or storage or network in that namespace, just because you're able to see the biggest cost drivers.
B
Whenever you're allocating your very costly time, we want to steer toward the areas where you can have the biggest impact. So, you've got a handful of questions here, Nico. The first one is: how did you fix the cost per CPU? You want to take that one?
C
Sure, yeah. Thanks for all the questions, definitely some good ones. Basically, if we go back to this screen, the question is about this column, and the simple answer to the first question (and the second question, I guess) is that we integrate with cloud provider billing APIs and we're aware of which nodes your workloads are running on; thus we know how much each node costs and which node a given workload was running on.
C
So we can come up with the price you were paying at the time, for that workload, per resource. And then I believe we also had a question pop up about on-prem environments: in that case, we just allow users to input how much their hardware is costing them. So you can input custom pricing and sort of override this, or provide whatever makes sense for your on-prem situation.
B
Yeah
and
in
there
nico
maybe
worth
hiding
like
we
actually
have
two
different
pipelines
to
how
we
support
that.
We
have
a
really
simple
pipeline
where
you
can
just
say
you
know:
cost
per
like
core
cost
per
gb
of
memory,
et
cetera.
We
also
have
like
a
more
advanced
pipeline
for
teams
that
have
a
lot
of
heterogeneous
assets
and
want
to
actually
you
know,
go
through
and
have
an
individual
like
asset
id
for
each
vm
disk
et
cetera,
and
you
can
actually
again
kind
of
tag
those
with
each.
B
You've
got
a
couple
more
here,
I
think
you've
hit
on
the
first
four
there.
One
next
question
is
you
have
a
product
or
service
business
model
or
both?
So
like
you
know
everything
we're
showing
is
you
know
showing
metrics
from
our
open
source
project?
We
do
have
a
business
and
enterprise
product
with
a
lot
of
like
extra
functionality
available.
B
B
So
next
question
for
you,
nico
is
if
the
workload
is
distributed
across
multiple
nodes.
Will
you
take
out?
Take
the
average.
C
Right,
so
this
is
a
great
question.
I
think
we
get
questions
like
this.
A
lot
and
part
of
part
of
our
answer
here
is
that
I
think
we're
looking
at
the
problem
from
a
slightly
different
perspective
than
this.
So
in
a
sense
we
are
yes,
but
really
what
we're
doing
is
taking
each
instance
of
each
running
container
separately,
so
we
are
instead
of
trying
to
average
things
and
break
them
down,
we
think
of
it
as
aggregating.
C
So
if
we
look
at
this
example
again,
we've
got
price
per
cpu
hour
and
you'll
notice
that
they
actually
aren't
the
same,
even
though
we're
talking
about
cpu
usage
in
the
same
name
space.
Basically,
what
is
probably
happening
here
is
that
these
are
running
on
different
nodes
and
those
nodes
might
have
different
prices
associated
with
them.
C
So
when
we
look
at
it
from
the
top
level,
we
are
saying
basically
like
we've
aggregated
every
every
running
instance
of
every
container
and
every
pod
in
each
of
these
namespaces
that
have
their
own
individual
situations
and
individual
pricing
perhaps
to
arrive
at
this
price.
So
we
don't
given
that,
like
an
individual
instance
of
a
container
can't
be
running
across
multiple
nodes,
there
isn't
any
average
necessary
if,
if
we're
looking
at
the
problem
from
that
perspective,
so
I
hope
that
I
hope
that
helps.
B
Yeah,
no
that's
great,
and-
and
one
thing
I
would
just
add,
is
like
that's
the
exact
model
we
we
implement,
which
is
truly
building.
You
know,
from
the
container
level
up
teams
do
have
the
ability,
if
they
wanted
to
like
override
that
pricing,
they
can
always
use,
provide
like
custom
pricing.
Sometimes
we
see
situations
where
teams
want
to
create
like
an
internal
economy
which
may
not
like
perfectly
reflect
what
their
cloud
provider
is
billing.
So
we
do
have
that
capability.
It's
actually
really
similar
pipeline
to
what
nico
just
mentioned
for
on-prem
environments.
B
So
you've
got
a
couple
more
questions
here.
All
all
great
questions.
Thank
you.
Everyone
next
question
is:
is
there
a
range
of
savings
based
on
your
experience
with
various
customers
which
services
have
larger
potential
per
savings
on.
A
B
B
As an example? So yeah, I may just say that we're actually going to get into this some more with five really practical examples after this. It is not uncommon for teams that haven't focused in this area to be able to reduce spend by 70 or 80 percent. I would say that's really common for teams that are able to devote real engineering resources to doing this optimization.
B
We can talk more about specific examples, though. The next question is: could you compare Kubecost to FinOps efforts, complementary or somewhat overlapping? So, we are part of the FinOps organization; we're actually a founding vendor with the recent launch. We're huge fans; we think it's doing great things, and we fully support all the openness it's bringing, from a training and certification perspective as well as a general education perspective. So definitely complementary, and we're involved in things going on there.
B
So
in
case
of
aws,
are
you
considering
only
aws
fargate
pricing?
I'm
not
sure.
If
I
fully
so,
we
would
basically
just
be
reflecting
the
cost
of
the
node
where
these
workloads
are
being
run.
We
also
have
and
we
can
share
more
resources
on
it.
But
if
you
look
at
this
notion,
there's
one
column
here
called
external
cost.
In
the
view
that
miko
is
showing,
if
you
had
say
let's
say
you're
not
running
like
you
know,
eks
on
fargate,
you're
running
just
other
workloads
in
fargate
or
you're
running.
B
You
know,
like
rds
instances
you
have,
you
know,
like
s3
storage,
buckets
et
cetera.
We
would
allow
you
to
like
allocate
those
cost
back
to
the
actual
kubernetes
tenant.
So
you
can
get
just
kind
of
a
centralized.
You
know
unified
view
again,
whether
that
be
fargate
or
anything
else
outside
of
kubernetes.
B
Okay,
so
that's
lots
of
great
questions
from
q.
A
I
see.
I
also
have
some
here
in
chat
I'll.
Maybe
we
can
split
these
up
niko.
I
can
take
the
first
one
really
quickly
because
there's
another
one
that
we're
going
to
touch
on
in
a
second
sachin
s.
B
I
can
imagine
how
this
would
help
downsize
in
the
cluster
etc,
but
real
workloads
are
burstly.
So
isn't
this
head
room
for
quality
of
service
super
super
relevant?
This
is
where
not
only
like
quality
of
service
comes
into
play,
but
also
the
like
nature
of
your
workloads
and
specifically
around
like
usage
patterns.
B
This
can
be
true
in
production
environments,
but
is
typically
less
true,
so
you
know
again,
this
should
absolutely
factor
into
your
decision
making
process
when
going
through
right
sizing,
and
this
can
impact
if
you're
doing
dynamic,
right
sizing
using
like
a
cluster,
auto
scaler
or,
if
you're,
doing
more
static,
which
we'll
we'll
talk
a
little
bit
more
about
later
in
the
presentation
next
question
is:
does
kubecos
run
as
a
separate
component
in
the
same
cluster,
or
can
you
run
it
outside
the
cluster
nico?
You
want
to
take
that
one.
C
Sure, yeah. Today it runs in your cluster. We've found, from the teams that we work with, that it's actually been really valuable to be able to run it entirely within your own cluster, because there's a whole slew of issues with egressing data, data privacy, etc. Teams can run this product and get all the metrics without having to worry about privacy concerns.
C
We do have some people who have wanted to run this as a SaaS solution, but for now, yeah, it runs in your cluster, right alongside things.
B
Yeah,
and,
and
so
that
that
presents
a
number
of
like
really
interesting
behaviors,
as
well
as
like
nico,
showed
you
with
grafana.
These
metrics
are
written
directly
to
like
a
local
prometheus
instance,
so
you
can
do
a
bunch
of
cool
things
like
create
custom
dashboards
in
the
cluster.
You
can,
you
know,
set
up
alert
manager
for
custom
alerts
from
these
prometheus
metrics,
all
like
miko
said,
while
owning
and
controlling
your
data
and
not
having
to
egress
any
of
this
right,
but
you
can.
B
Alternatively,
we
have
a
number
of
teams,
do
this
that
take
these
metrics
and
send
them
to
some
external
like
bi
tool
or
some
like
hosted
solution
like
say
a
data
dog
where
they
like
monitor.
You
know
other
infrastructure
metrics
right,
so
a
couple
others
and
then
we'll
go
through
these
really
quickly
and
we'll
have
time
for
questions
at
the
end.
So
so
does
the
does
the
price
take
into
account
savings
plans?
Absolutely
we'll
talk
more
about
this.
B
"So what should be the best solution you recommend?" I'm not sure; if you have any more information you can share there, I'm not sure I fully understand that one, but happy to come back to it. There's also one about running it outside the cluster to use it for multiple clusters.
B
So, tons of great questions. I haven't gone back to the Q&A, but I'll circle back to that toward the end of our presentation. For now I'll jump back and let Nico share some of these very practical examples of implementing optimizations, now that we have this newfound visibility and framework for cost allocation.
C
Cool, yeah. Thanks, Webb. So basically, we're just going to step through five of these, but there are many more.
C
If we had all day, we could keep going, but these are some of the top five anti-patterns for overspending that we see teams routinely either not know how to solve, or not even be aware there's a problem, until they start analyzing some of these metrics and it becomes painfully obvious where the problems are. We'll also, for each of these, give our take, from a general perspective, on how we think it's best to solve them.
C
So
the
first
one
is
orphaned
resources.
We
would
categorize
this
as
fixing
pulling
lever
one
you
could
think
of
in
this
equation,
which
is
time
running,
and
this
is
actually
a
pretty
easy
one.
Once
you
see
the
problem,
which
is
just
that
there
are
often
resources
cloud
resources
in
your
infrastructure
that
aren't
doing
anything,
they
don't
have
an
owner,
they're,
just
sort
of
sitting
there
and
you're
paying
for
them.
So
basically
this
could
be.
You
could
think
of
this
as
ips
persistent
volumes
is
probably
the
biggest
one.
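As a rough illustration of hunting for orphaned persistent volumes: the field names below follow the shape of `kubectl get pv -o json` output, but the helper itself is our sketch, not a Kubecost feature:

```python
def orphaned_volumes(pv_list):
    """Return PV names whose phase suggests nobody is using them.

    'Released' means the claim was deleted but the volume (and its
    backing cloud disk) still exists; 'Available' means it was never
    bound to a claim at all. Both keep billing until deleted.
    """
    return [
        pv["metadata"]["name"]
        for pv in pv_list["items"]
        if pv["status"]["phase"] in ("Released", "Available")
    ]

# Example shaped like `kubectl get pv -o json` output:
sample = {"items": [
    {"metadata": {"name": "pv-active"},   "status": {"phase": "Bound"}},
    {"metadata": {"name": "pv-orphaned"}, "status": {"phase": "Released"}},
]}
print(orphaned_volumes(sample))  # -> ['pv-orphaned']
```

In practice the output of a check like this would feed the alerting-to-owner workflow described below, rather than being deleted automatically.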
C
We've certainly had teams install our product and quickly figure out that they had tens of thousands of dollars of disks just sitting idle without owners. Load balancers are a big one. With load balancers and IPs, it's really easy to think, "oh yeah, I'll just expose this," and then maybe the project gets handed off to a different team, or during teardown you eliminate the deployment but forget to eliminate the load balancers.
C
Over the course of months or years that piles up, and you can basically just find a treasure trove of things you can eliminate and stop spending money on. So we consider this one pretty easy in the grand scheme of things.
C
The
impact
certainly
can
be
high,
it's
probably
not
quite
as
high
as
some
of
the
ones
that
we're
going
to
get
into,
but
we
consider
it
easy
because
by
definition,
these
things
are
just
not
being
used.
So
normally
it's
a
it's
a
straightforward
solution.
C
C
If
this
thing
is
sitting
in
a
name
space
and
it's
not
being
used
and
it's
exceeding
a
certain
amount,
someone
gets
alerted
and
then
they
can
come
come
loop
around
so
yeah,
that's
it's
a
pretty
straightforward
solution,
but
the
solution
really
is
just
delete
the
resource
and
stop
paying
for
it
and
how
you
implement
that
to
some
extent
is
up
to
you,
but
it
generally
revolves
around
having
this
mechanism
of
identifying
an
owner
and
then
communicating
to
that
owner.
This
is
you're
spending
money
on
this.
C
All
right
next
one
is
abandoned
workloads,
so,
as
you
see
on
the
slide,
workloads
that
do
not
provide
real
business
value,
we'll
talk
about
heuristics
for
this.
How
we
think
about
this-
and
you
know
some
of
the
teams
that
use
our
product,
how
they
think
about
it
and
what
we
think
should
be
done
about
it.
But
again,
this
is
sort
of
like
a
category.
One
thing
which
is
time
running
you've
got
a
workload
running
on
your
infrastructure.
That's
chewing
up
resources,
but
so
it's
maybe
it
maybe
is
doing
something.
C
It's
you
know
we're
not
saying
that
usage
is
zero,
but
usage
could
be
through
the
roof,
and
if
it's
not
providing
a
real
solution,
then
it
for
all
intents
and
purposes,
might
as
well
not
be
doing
anything
so
sort
of
like
one
step
more
complex
than
than
orphaned.
C
Resources,
so
what
you're
seeing
here
is
a
dashboard
that
was
built
on
top
of
our
some
of
our
open
source
metrics,
and
this
gets
back
also
to
the
throw
quote
from
earlier,
which
is
like
your
your
time
is
also
something
we're
trying
to
help.
You
optimize
just
a
quick
glance
here
at
this
dashboard
you'll
notice
that
we
have
basically
one
workload
that
is
causing
you
know
whatever.
That
would
be
90
90
to
93
of
this
of
this
overspend.
C
This
107
overspend
is
just
in
one
workload,
so
something
that
we're
really
trying
to
help
teams
with
and
that
I
think
the
teams
using
our
our
products
and
our
metrics
have
been
successful
with,
is
just
finding
that
that
low
hanging
fruit,
that's
a
big
win,
and
for
not
a
lot
of
your
time.
You
know
like
this,
this
last
one
on
the
list
25
cents
a
month.
C
Maybe
maybe
you'd
be
fine,
if
you,
if
you
let
that
one
run
but
100
bucks
a
month,
you'll
want
to
take
care
of
so
and
and
to
talk
briefly
about
the
heuristics
here
we
can
also
field
questions
on
this
later
if
people
are
interested,
but
the
way
that
we
recommend
teams
measure
whether
or
not
something
is
abandoned
is
with
network
traffic.
C
So
if
a
pod
is
chugging
away
chewing
up
resources,
maybe
it
even
has
a
request,
that's
higher
than
the
resources
that
it's
it's
actually
using,
maybe
not,
but
if
it's
not
egressing
any
data
anywhere
else,
we
use
that
as
a
heuristic
for
saying
like
is
this
thing
really
being
used?
You
know
it
might
be
computing
things,
but
if
it's
not
sending
that
result
anywhere,
then
it's
at
least
a
flag
of
like
you
might
want
to
revisit
this
and
again
as
we
move
into
solutions
for
abandoned
workloads.
C
This would be basically a medium difficulty, because it's tougher to know. It's not as easy as "this thing doesn't have an owner at all"; we're talking now more like "this looks a little fishy, but if we contact the owner, they should be able to justify it," and often they don't even realize it's still running. So, as listed here, common examples: deprecated deployments.
C
Maybe this is something where responsibility shifted from one team to another, and the new team wasn't even aware it was still running, and the original team didn't tear it down. Dev environments are a huge one. This is one where we have some other open source projects, related to cluster turndown, that could address it, but essentially your dev environment, on nights and weekends, is sitting there with a request.
C
If
you
don't
have
it
turned
down
and
you're
spending
that,
and
it's
really
not
doing
anything
so
again.
The
general
theme
here
lack
of
awareness,
organizational
changes,
things
like
this,
where
things
fall
through
the
cracks,
but
we
can
see
like
huge
impacts
from
from
abandoned
resources
and
then
again
the
solution
is
basically
to
it's
very
similar
to
orphan
resources,
set
up
some
sort
of
alerting
rule
dashboard.
C
Where
there's
a
point
of
communication
who
is
an
owner
for
you
know,
let's
say
like
a
common
one:
is
people
will
have
owners
by
namespace,
so
you
assign
ownership
by
a
namespace,
and
then
we
can
go
in
and
say:
okay
like
here
are
all
the
abandoned
resources
in
this
namespace.
B
Yeah, I'll take it from here. So those are two of the five. Number three is kind of a catch-all; we've seen a lot of war stories, or unfortunate circumstances, here. We say these are workloads that are behaving in unexpected ways. A common example would be an application bug.
B
We've seen, actually, a pretty recent story of essentially an infinite loop that autoscaled resources and cost tens of thousands of dollars. We also had a user that had a Bitcoin miner installed in their Kubernetes cluster, and that, plus autoscaling, led to a huge burst in resource consumption.
B
So
these
are.
These
are
kind
of
those
like
long
tail
of
unexpected
events
that
when
they
happen,
can
be
even
like
in
a
relatively
short
amount
of
time,
fairly
costly
and
so
there's
the
problem
of
kind
of
addressing
that
particular
event
and
then
there's
the
also
the
problem
of
kind
of
like
having
monitoring
or
governance
in
place
to
where
you
kind
of
minimize.
Those
events
happened
when
these
are
kind
of
you
know
present
often
times
they're
meaningful.
B
I
think
there's
a
little
bit
of
like
selection
bias
here
on
our
part,
but
like
when
teams
present
them
to
us.
They're
oftentimes,
like
you,
know,
a
real
part
of
their
their
spin.
This
definitely
crosses
into
the
like
medium,
if
not
like.
You
know
medium
hard
category
just
because
there
is
a
really
long
tail
of
things
to
monitor.
For
you
know,
part
of
the
solutions
that
we've
seen
are
like
really
just
monitoring
for
kind
of
unexpected
changes
in
spend
or
like
spend
anomalies.
B
A common pattern would be looking at, say, the moving seven-day average for the cost of a namespace, or the cost of a cluster, etc., and then having a mechanism to notify team members and being able to take action quickly.
B
One of the beautiful things about Kubernetes, and a real change here, is that with Kubernetes metrics you can truly have real-time cost monitoring and alerting, and not have to wait until you get a bill from your cloud provider, which may be days, or many hours, later. So Kubernetes metrics, whether that's Prometheus metrics directly integrated with Alertmanager or another solution, can get you this visibility in real time, or near real time.
B
So I'll leave that aside. Number four starts to get into the third input in that equation, which is managing the price of resources, and this touches on usage type. Specifically, when we talk about usage type, we talk about selecting across on-demand versus spot or preemptible, versus making reservations, whether that's committed-use discounts, reserved instances, or savings plans. It's really about going above and beyond just using basic on-demand instance types.
B
This can be hard, just because it really involves an effort of predicting, or forecasting, the future. Oftentimes finance will get involved if you're part of a bigger organization, so managing that across teams can be difficult, as can accurately predicting the future. But it oftentimes can yield really big benefits for teams that do have some predictability in their spend going forward. And this is a high-level visual we want to present, because we think it's a really powerful framework.
B
We
want
to
present
because
we
think
it's
a
really
powerful
framework-
and
this
is
you
know,
looking
at
using
on-demand
versus
reserved-
and
this
is
with
a
cluster,
auto
scaler,
helping
you
kind
of
dynamically
adjust
for
different
kind
of
workload,
demands
or
usage
patterns
in
your
product,
and
what
you
generally
see
here
is
that,
as
you
have
more
and
more
predictability
into
that
kind
of
base
level
load
that
you
know
will
always
be
there.
Whether
this
is
from
a
compute
and
memory
standpoint
or
gpu
standpoint
or
something
else,
you
know:
data,
storage,
etc.
B
You
can
start
layering
on
more
of
these
reservations
and
again
have
major
savings,
and
then
that,
coupled
with
like
auto
scaling
on
demand,
nodes
can
be
a
super
powerful
framework
and
again
you
know,
yield
70,
plus
percent
savings
and
then
a
very
similar
framework
when
looking
at
spot
or
preemptable
usage
would
again
to
be
stacking
on
those
reservations
as
you
as
you
have
like
more
and
more
predictability
and
baseline
load
and
then
letting
spot
availability
scale.
You
know
naturally,
given
those
kind
of
marketplaces.
B
This
is
a
really
big
one
and,
and
you
know,
kind
of
considered
hard
just
because
it
requires
architecting,
your
kubernetes
workloads
in
a
way
that
they
are
resilient
to,
like
you,
know,
node
termination
or
like
regular
node
failure.
So
that's
definitely
one
thing
and
that
could
be
touching
on.
You
know
like
managing
replicas,
and
you
know
pod
disruption,
budgets
et
cetera,
but
for
again
for
teams
where
that
is
a
potential
fit.
These
can
yield
huge
benefits
and
then
our
last
example
is.
B
You
know
one
that
we
had
a
question
about,
which
is
you
know?
It
is
really
easy
to
come
to
kubernetes
and
say:
there's
just
so
much
complexity,
I'm
just
gonna
start
by
over
provisioning
resources,
and
so
that
way
will
minimize
the
risk
of
you
know:
downtime
performance,
you
know,
bottlenecks
etc,
and
this
is
makes
you
know,
total
sense
and-
and
we
actually
recommend
for
teams
that
are
brand
new
to
to
kubernetes,
to
just
follow
that
pattern
right.
B
It's only when you really reach production at scale, where these dollars become meaningful enough, that it makes sense to make these investments and really go through a right-sizing exercise. And this is not an uncommon scenario: when we first start working with teams, they regularly have up to 80 or sometimes even 90 percent idle or slack capacity. Just as we mentioned, this is oftentimes simply making sure they have ample headroom for bursts. But teams that go through the exercise of actually measuring peak usage, or say p99 usage,
B
can oftentimes reduce this in a major way without any sort of autoscaling, just doing it statically.
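Concretely, the static version of this exercise often just means replacing guessed requests with values derived from measured usage. The numbers below are purely hypothetical, sized near a measured p99 plus some headroom:

```yaml
# Illustrative container resources: requests set near measured p99
# usage plus headroom, instead of a defensive over-provisioned guess.
resources:
  requests:
    cpu: 250m      # e.g. if measured p99 CPU was ~200m
    memory: 400Mi  # e.g. if measured p99 memory was ~350Mi
  limits:
    memory: 512Mi  # exceeding the memory limit OOM-kills the container
```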
So that's part one. Part two is really taking into account the quality of service, or SLA, expected from the cluster, which can have a big impact as well. Oftentimes we see teams take a uniform strategy to over-provisioning, whether it's a dev cluster, a staging cluster, a prod environment, or a critical environment, when in reality you can apply that context and oftentimes
B
be much more comfortable running at lower compute and risking CPU throttling in a dev environment, because the impact of that may be relatively low, given your particular circumstances. When taking that into account, we've seen
B
teams have major, major wins just by going through this exercise statically, taking that profile into account, and then maybe programmatically, once a day or even once a week, making that adjustment and doing a bin-packing exercise where they see where their workloads best fit, given the instance types available from their cloud provider. And then the other part of this is whether your environment is a good fit for autoscaling.
B
We do consider it medium, if not hard, just because it can actually create risk if you're not careful: when you do have a bursty workload, you may not have extra headroom to support it. So anytime you're going through this, we recommend not fixating on the median case. The median case is helpful for getting a high-level understanding, but once you start moving into optimizations, really think about something closer to peak
B
usage, and what the impact of a right-sizing exercise would be on peak utilization. We have seen, time and time again, that avoiding these patterns can reduce spend by 80-plus percent when done right, and we think you can do it without creating any performance or reliability concerns. Oftentimes it's just a useful cleanup exercise as well.
B
But again, when you're pursuing these, we want you to try to focus on the biggest bang for your buck, given how valuable your time is, and to start with that allocation piece, so you know where the biggest opportunities for spend reduction are first. We love talking about this stuff; reach out to us anytime at team@kubecost.com.
A
B
Awesome. So the question here is: does VMware or Oracle have a similar cost-optimization tool? I'm actually less familiar with the offerings of either, but I do know that VMware has the CloudHealth product, which does provide cost-optimization solutions.
B
The workload, it sounds like, is about a thousand customers coming at one time; so how many pods and how much cost are required to support that? Nico, do you want to take that one, or do you want me to? Sure.
B
Yeah, let us know if there's more context you can provide. I do think the exercise Nico went through for measuring efficiency and doing pod right-sizing can be super valuable, and I would say that, combined with something like HPA, can be great, if that stateful set, or stateful application, is architected in a way to support it.
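For reference, a minimal HorizontalPodAutoscaler of the kind mentioned here might look like this (target name, replica bounds, and threshold are hypothetical):

```yaml
# Hypothetical HPA: scale the "web" deployment between 2 and 10 replicas,
# targeting 70% average CPU utilization relative to requests.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```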
C
Yeah, I would say just in general, comparing your usage and your requests is a good exercise. Doing that, you will see either that your usage sits well below your requests for long periods of time, or perhaps that it's actually too high. We see that sometimes, in which case you're risking eviction and things like that. So just use those two metrics from the Grafana dashboard shown earlier in the presentation. And then there are other ways, like running statistical analyses for, say, p99 or p85, depending on, you know,
C
is this something that you don't mind getting killed, or is this high availability, where you never want it to be killed? That gives you heuristics for what sort of overhead you want to maintain, but to some extent it's up to you.
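If you want to automate that usage-versus-requests comparison, one possible sketch is a Prometheus recording rule. This assumes cAdvisor and kube-state-metrics are installed and exposing these metrics; the rule name is hypothetical:

```yaml
# Hypothetical Prometheus recording rule: per-pod CPU usage as a
# fraction of requested CPU. Values well below 1 for long stretches
# suggest over-provisioned requests; values near or above 1 suggest
# the requests are set too low.
groups:
  - name: efficiency
    rules:
      - record: pod:cpu_request_utilization:ratio
        expr: |
          sum by (namespace, pod) (
            rate(container_cpu_usage_seconds_total{container!=""}[5m])
          )
          /
          sum by (namespace, pod) (
            kube_pod_container_resource_requests{resource="cpu"}
          )
```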
B
Yeah, that's great. So the next question looks like a two-parter. In the case of applications that are running at, say, 60 to 80 percent idle, is it better to use a serverless solution,
B
like a Kubernetes-native Kubeless, and do we recommend that? Another example provided here is Fission. I'm less familiar with Fission, but I think this gets at the broader picture, where cost is one part of the decision. It's similar to asking: should you migrate cloud providers for cost?
B
We think it's a very relevant input to the equation, but oftentimes performance, availability, and functionality are very core parts of it as well. I will say that we do see serverless as a useful tool for managing costs. What we've seen is that we regularly work with teams that have medium- or very high-complexity applications, and most of the time it's hard to move all of their workloads to serverless.
C
All right, it looks like we've got a question about the presentation link. I would check out the chat; I believe there's an answer there about that. Next: what is a generally allowable percentage of total capacity that has been observed empirically? I assume bin packing is not optimal all the time.
C
I think this varies slightly from team to team. In some parts of our application we use a notion of a profile, which is to say it depends on the priority of what it is you are running. We would say probably as high as somewhere between 75 and 90 percent, if you don't mind getting evicted.
C
That's if this is a dev thing, where 30 seconds of downtime here and there is okay but you really want to squish down the cost. For high availability, I would say definitely a more generous overhead; my gut says something like 60 percent utilization with 40 percent overhead. Webb might have a different answer here, but it sort of depends on your situation.
B
Yeah, I think it's surprisingly common to see teams land at 35 to 40 percent overhead, and I think it's a function of just what Nico mentioned: quality of service. But there are, I think, two other things that come into play here. One is variability of resource requirements; in the example Nico showed, things were super stable.
B
So if you have just a bunch of long-running batch jobs, you may be able to get to 90-plus percent utilization, because resource utilization is really stable. The second part is if you also have high predictability of resource utilization looking forward. I have definitely seen scenarios where teams are in the 90s, but I do think it is not the norm at this point. And we've just got two more minutes here.
B
We've got two questions here at the end. One is recommended books or resources. There is a book called Cloud FinOps that I think is really, really good. One of its main authors is one of the creators of the FinOps Foundation,
B
J.R. Storment. It's definitely one I recommend; it paints a holistic picture of managing spend in cloud environments. And then the last question is: is it a good approach to invest in autoscaling with ML techniques, especially in VPA cases? We absolutely think it can yield benefits, but we definitely recommend starting with simpler solutions, just for introspection purposes and for understanding why things are behaving the way they are.
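One simple way to follow that advice with VPA specifically is to run it in recommendation-only mode first, so it surfaces suggested requests without changing anything (the object and deployment names below are hypothetical):

```yaml
# Hypothetical VPA in recommendation-only mode: it records suggested
# CPU/memory requests for the "web" deployment (visible via
# `kubectl describe vpa web-vpa`) but never evicts or mutates pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Off"
```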
B
We especially recommend that if it's a production or critical environment. But when you're really looking at fine-tuning this down to the last dollar, we have seen scenarios where ML can be very useful in doing that. Oftentimes, though, when we first start working with teams, there are bigger wins from just investing the time up front to do right-sizing
B
and to do all of the exercises that we went through here today.
C
Yeah, I think it's fair to say we could be even stronger on that point and say something like: we would not recommend it if you haven't gone down the path of understanding what's going on. Otherwise you risk creating a second layer of misunderstanding, of a similar nature to what we're trying to help teams solve in the first place.
B
Yeah, definitely. And again, that can impact not just cost but reliability and uptime, as well as general performance, with CPU throttling, for example. All right, excellent. We're out of time, but we want to thank everybody again for all the awesome questions and for joining us today. I really appreciate Libby and the team at CNCF for making it happen.
A
Of course. Thank you both so much for a great presentation, and we look forward to seeing everyone again soon. Check back on the website later today and we will have all of this good stuff loaded and ready for your enjoyment. Talk to y'all later.