From YouTube: AMA with SREs (Public Livestream)
A
The first question is a carry-over from the week before, or really six weeks before, from Mek, who I do not see on the call, so I will verbalize it: we're getting more questions from customers and prospects about using Kubernetes to deploy GitLab. For context, these are questions from a prospect in this Google Doc, plus follow-up questions: how far are we from testing Helm chart deployment at scale on GitLab.com? I see some SREs from the Delivery team here, so I will hand that over to, maybe, Jarek, if you're here.
B
Oh sure. So we are using the GitLab Helm chart for GitLab.com, but we're only using it for a couple of services right now. We started using it for the registry service, and that worked out pretty well, because registry is a standalone, stateless service; it didn't have any other dependencies. The next service we used it for was Mailroom, which is also a pretty standalone service, though it does connect to Redis. But currently we have a single production Kubernetes cluster.

B
We also have a staging cluster, and we have those two services running in both. We are planning to move some of the Sidekiq queues next; we're already in the planning phase of this, and that's probably going to happen early next year. Were there any other details you'd like to know about the Kubernetes migration?
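(For illustration, a minimal sketch of the per-service rollout pattern described above: deploy one standalone service at a time via Helm, staging first, then the single production cluster. The chart paths, release names, and values-file layout are assumptions for the example, not GitLab's actual configuration.)

    # Sketch of a piecemeal, per-service Helm rollout (paths and names are
    # illustrative assumptions, not GitLab's actual chart layout).
    import subprocess

    SERVICES = ["registry", "mailroom"]  # standalone, mostly stateless services first

    def deploy(service: str, env: str) -> None:
        """Install or upgrade one service's Helm release in the given environment."""
        subprocess.run(
            [
                "helm", "upgrade", "--install",
                f"{service}-{env}",                    # release name
                f"charts/{service}",                   # per-service chart path (assumed)
                "--namespace", env,
                "--values", f"values/{env}/{service}.yaml",
                "--atomic",                            # roll back automatically on failure
            ],
            check=True,
        )

    for svc in SERVICES:
        deploy(svc, "staging")     # validate in the staging cluster first
        deploy(svc, "production")  # then the single production cluster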
A
To follow on: we do have a slate of services, or components of our stack, that we are considering. So it's going to be piecemeal; we'll keep moving things over, and eventually we'll have a considerable portion of the application infrastructure on Kubernetes. I will never say never, but it's likely to never be everything, especially taking into account some of our workloads, like the Git file nodes.
A
Perhaps not the same Michael; I see a couple of Michaels, so I don't know who this was. All right: "A new support agent started Monday", well, six weeks ago now, welcome! "The first six weeks went well, but I have yet to explore much of the currently developed Kubernetes features. Has container-attached storage been explored, or is it being used currently?" I don't know the answer to that question.
D
I was recently at a conference and got to talk to a lot of our customers in the field. A lot of them were very, very happy. Some of them were not as happy about a recent outage, and I just wanted to kind of track down the issue, and here's some of the state there. It looks like Dave posted the incident issue, which is 8528 in the infrastructure tracker, but I was wondering: could you, just on this call, give a little bit of a summary of that retro?
A
Sure. We had a synchronous retro; I believe the meeting was recorded, and I can try to dig up the link to that afterwards as well. Without summarizing the incident itself (I think that's done quite well in the issue, so I'd encourage folks to go there and take a look), the major takeaway from this was a slate of 13 corrective actions, a number of which are currently in progress and three of which have been closed.
D
I can share. I had a lengthy conversation; this was an individual developer whose own projects were on GitLab.com, and they basically came by the booth to ask for advice on what they could do: "Is there something I can do to avoid this downtime? When the service is down and I can't access it, I can't get work done. Basically, what can I do about that?" They were asking about potentially self-hosting, so my advice was that I highly recommend against self-hosting.

D
I basically said: I'm sorry. You know, I didn't know all the details, but I know we have been staffing up greatly, and I know our team, all of you, have been tremendously improving things incrementally as we go. So my basic advice was: I'm sorry for this outage; things are getting better, and I have personally seen that over time, so my personal trust is in GitLab.com, and I'd highly recommend not self-hosting. But then that was kind of the ask on their part, and honestly I'm not sure.

D
If there's anything, you know, maybe not storing large files, or some other nuanced thing that relates to some types of outages; is there some action a user could perform to be more resilient against outages? I don't know what that would be, but maybe that's a fair question for the team, I think.
A
This issue wasn't caused by any customer behavior. This was an error that we imposed on ourselves: the unintended consequences of a configuration change. As for users themselves: we're looking to protect users from inadvertently impacting themselves, and every other user on .com, because we don't have a lot of isolation in place. So we're continuing to work with development to make certain that application limits are treated as a priority. Oftentimes it's API rate limiting, or parallelism in builds that runs off; a combination of not having one and having too much of the other wreaks havoc.

A
You know: fifty thousand concurrent CI jobs all running, all trying to do exports of project repositories, things like that. So I suppose, to answer that question directly, we'd say: be hyper-vigilant about what actions you're taking against the API. We currently give users a lot of power there without a lot of, we'll say... well, we give them unchecked power, perhaps in ways that we shouldn't. So that's one area. Do any other SREs have ideas? Lots of things, I'm sure, we've talked about. Sure.
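(As a rough illustration of the application limits being discussed, here is a minimal token-bucket rate limiter; the bucket size and refill rate are made-up numbers, not GitLab's actual limits.)

    # Illustrative token-bucket rate limit of the kind described above; the
    # numbers are assumptions for the example, not GitLab's real limits.
    import time

    class TokenBucket:
        def __init__(self, rate_per_sec: float, burst: int):
            self.rate = rate_per_sec       # tokens added per second
            self.capacity = burst          # maximum burst size
            self.tokens = float(burst)
            self.last = time.monotonic()

        def allow(self) -> bool:
            """Return True if the request may proceed, False if it should be throttled."""
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    # e.g. cap project-export requests at 10 per minute with a burst of 5:
    export_limit = TokenBucket(rate_per_sec=10 / 60, burst=5)
    if not export_limit.allow():
        print("429 Too Many Requests")  # reject early instead of wreaking havoc downstream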
E
We don't want to worry about it either, and that's why we had so many corrective actions come out of this particular event: so that we can all, you know, sleep easy knowing that we're not going to make that same mistake, that we're not going to, you know, have a similar outage with a similar profile. There will probably be others in the future, just due to the fact that GitLab.com is not a static piece of software.

E
But even when that experience is bad, getting that feedback from people who use GitLab.com, and knowing what was painful, even if it's just "this is slow" or "I'm not getting the response I feel I should", that's all good stuff, so that, you know, the product gets better, and that includes the product of the service.
F
Hi, this is Matt Smiley; I'm another one of the SREs. I would add to what Anthony and Cameron have just said that, as a developer, you can obviously (just to state the obvious) still work with your local clone of any Git repository. If the inability to get productive work done was based on lack of access to CI pipelines, for example, we don't really have a good answer to that one: when you can't access the service, you can't push your commits.
F
The .com service that we operate has a great deal of internal redundancy that is really cumbersome to operate in a self-hosted environment. There are absolutely reasons to run large GitLab installations in a self-hosted manner, but that's almost certainly not what this developer is talking about, and we can talk more about that offline if that's of any interest to you.

F
For example, we're actively working on understanding how this particular failure cascaded to a broader scope than we expected; that's one of the corrective actions. And in other aspects of resiliency, we're currently actively working on addressing several single points of failure in the GitLab architecture: things that are going to be feasible to address in a .com-scale deployment, but really aren't practical in self-managed.
D
Yeah, this is all tremendously helpful, and that's a great summary of my advice to this particular person, based on my conversations with our large customers that self-host. Essentially, yeah: this is a massive undertaking. You know, as a small individual developer, you really do not want to take on all of this effort and all of this risk; the value of having this team run it is great.

D
I think that everyone is always, you know, frustrated when they can't do something that they want to do, and I think that's a very reasonable response. So I was super happy because, you know, I said to them: well, we're very transparent; we post the retro online, and the issue will be online. And they said: oh yeah, I've already been in that, I've seen all of that, and that's a reason I love GitLab.

D
So that was great feedback, to the company as a whole and to this team in particular, that we make those issues public. It was nice for me, as a booth worker, to be able to just say: yes, that's there; and they already knew about it. But yeah, that was my advice as well.
F
I totally agree. And plus, as other folks have mentioned, GitLab is a really fast-moving target; staying on top of upgrades is, you know, a meaningful investment of time if you're doing self-managed. So yeah, if it's a single person or a small team, I mean, if they're interested in running it themselves, that's perfectly fine; there's nothing wrong with that. But you do get a lot of benefit from running on a managed service. Cool. Thank you for bringing this up.
G
Happy to. I was just wondering, sort of: what, if anything, can other departments do to make your jobs easier? And, if you wanted to interpret it in the negative kind of way: what frustrates you when you interact with other departments?
A
I will pass and let other people answer. I think I'm saying that because I think I have better points of communication to voice those frustrations, as well as to continue to improve on a lot of the communication channels that Dave and I have developed for the team. So I'd like to hear from other SREs in the group.
F
This is Matt again. I love it; people are super supportive, and it's awesome. Anytime I have questions, really my main problem is finding out where to ask that question to get, you know, in front of the right people. The longer I'm here, the easier it is to find those places. But yeah, people are super supportive and very helpful; I love it.
H
To clarify some of my statements: joining the Gitaly team would help us greatly, because right now one of our biggest operational pains is the fact that Gitaly is singly homed on 45 different servers, which means we don't have a single point of failure: we have 45 single points of failure. We need to get Gitaly HA built out as quickly as possible and continue to iterate on Gitaly availability, so that we can continue serving users and do maintenance at the same time.
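(A quick back-of-the-envelope sketch of why 45 singly homed shards hurt: if each storage server is independently up with probability p, the chance that every shard is available is p raised to the 45th power. The numbers below are assumptions for illustration.)

    # With N singly homed Gitaly shards, any one server going down takes out the
    # projects on that shard, so P(all shards up) decays as p**N.
    # (p and N below are illustrative assumptions.)
    p = 0.999   # assumed availability of one storage server ("three nines")
    N = 45      # number of singly homed shards

    print(f"P(all {N} shards up) = {p ** N:.4f}")  # ~0.956, noticeably worse than p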
H
Right now we have basically no way to do maintenance on the Git storage servers without taking down users for extended periods of time, and that's difficult to stomach as an SRE. As far as making Rails better: basically, anything we can do to improve the performance of the Rails stack helps, like caching improvements and memory-efficiency improvements, better squeezing performance out of the existing hardware we have and making it more efficient.
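(As a toy example of the caching improvements mentioned, memoizing a hot, expensive lookup keeps repeated requests off the slow path; the function below is hypothetical, not an actual GitLab code path.)

    # Hypothetical example of a caching win: memoize a hot, expensive lookup so
    # repeated requests stop paying the slow-path cost.
    from functools import lru_cache
    import time

    @lru_cache(maxsize=10_000)
    def project_visibility(project_id: int) -> str:
        """Stand-in for an expensive database or RPC call on a hot code path."""
        time.sleep(0.05)  # simulate the slow lookup
        return "public" if project_id % 2 == 0 else "private"

    start = time.perf_counter()
    for _ in range(100):
        project_visibility(42)          # only the first call pays the 50 ms cost
    print(f"100 lookups in {time.perf_counter() - start:.3f}s")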
B
I put in there rate limiting, and application limits in general. I know that there was a recent case where someone was spamming us with spam issues at a rate of, I think, 600 a minute or so, and, you know, these sorts of things can cause us a lot of pain. So I would say: just helping us to get, you know, application limits implemented, and also just having a little bit more awareness of those types of prominent problems, as well as the infradev issues.

B
I also was going to write this, and it's more for, you know, back-end development: just being aware of all of the different errors that we see on GitLab.com. Please let us know, in the production channel or in the SRE lounge. I also keep tabs on other channels as well, just to kind of monitor what's going on and what people are complaining about, if there are any complaints. But definitely don't feel shy; definitely let us know.
C
I want to add on to that: it's okay to tell us something that you think we might already know, because in the midst of an incident, more data points can really be very helpful. So, you know, like Jarek said, there'll be...
I
This is Dave; I'll actually speak up too. I think the other thing that would help us, to steer a little bit off of Lyle's comment, is this: just dropping something in Slack in the incident management channel is not going to start the production team, the infrastructure team, looking into an incident. We actually have ways to work with PagerDuty, and we're working on improving those too, but PagerDuty is our escalation path. Slack is our asynchronous communication channel.
I
Slack will not generate an immediate response, and using threads in Slack is important, particularly in incident management. So, a little bit of Slack etiquette: it actually does help us if there is a thread going on, because things scroll by very quickly there. I don't think we're doing anything wrong there; I would just ask that we be mindful of how that communication happens; it's useful. And then, backing up a bit, I actually noted something.
I
We have a merge request in flight for the "it's slow" scenario. If it's slow and things are going on, we're looking at getting a little bit more content into that merge request of advice in the handbook: if you've got an "it's slow" report, coming either from you noticing something or from a customer, talk about using the performance bar and grabbing HAR files from the particular browser. Information from Chrome's developer tools will also help us in looking into performance problems.
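(Since a HAR file is just JSON, a quick pass to surface the slowest requests from a capture might look like the sketch below; the file name is a placeholder.)

    # Sketch: pull the slowest requests out of a HAR capture. "slow-page.har" is
    # a hypothetical file exported from the browser's developer tools.
    import json

    with open("slow-page.har") as f:
        har = json.load(f)

    entries = har["log"]["entries"]
    slowest = sorted(entries, key=lambda e: e["time"], reverse=True)[:10]
    for e in slowest:
        print(f'{e["time"]:8.0f} ms  {e["request"]["method"]}  {e["request"]["url"]}')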
A
Yes, more information is better, and we'll do a good job of making that frictionless for you. In Slack right now, I know there's been some confusion: do I page with the /pd command? Some of those commands are gone; PagerDuty upgraded their API to v2, and we tried to integrate some things, and it hasn't worked so well. There's a start-incident command that does work; it's in the handbook somewhere. It has some failures here and there, but it still opens issues.

A
An issue will actually get the attention of an SRE, that is, an issue with an incident label on it; and, as Dave said, the pager escalation is the right way to go. There's an epic I linked in the doc, epic 100, which aims to set up a set of integration points with all of our escalation tools (Slack, Zoom, PagerDuty, GitLab), so that we'll have a single point for kickoff. So take a look at epic 100.
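(A minimal sketch of the pattern being described, a command that opens a labeled incident issue through GitLab's standard create-issue endpoint; the instance URL, project ID, and token below are placeholders.)

    # Sketch of the "open an incident issue" step behind a chat command. The
    # instance URL, project ID, and token are placeholders; the API call is
    # GitLab's standard create-issue endpoint (POST /projects/:id/issues).
    import requests

    GITLAB_API = "https://gitlab.example.com/api/v4"   # placeholder instance URL
    PROJECT_ID = 1234                                   # placeholder tracker project
    TOKEN = "glpat-..."                                 # placeholder access token

    def open_incident(summary: str) -> str:
        resp = requests.post(
            f"{GITLAB_API}/projects/{PROJECT_ID}/issues",
            headers={"PRIVATE-TOKEN": TOKEN},
            data={"title": f"Incident: {summary}", "labels": "incident"},
        )
        resp.raise_for_status()
        return resp.json()["web_url"]   # link to share back in the Slack channel

    # e.g. open_incident("GitLab.com is slow for git pushes")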
A
Lately, Slack has something called Workflow Builder, which is just a JSON definition of a GUI, so you can type a slash command and a form pops up; we're all familiar with that from PTO Ninja and Google Calendar and things like that, where you can use buttons and dropdowns and menus and the like. That will hopefully be pre-populating the incident issues and hitting the correct escalation policies, etc. So we are working on that, and on fleshing out fields for more information for you to include in there.