From YouTube: 2021 04 20 Jenkins Infra Meeting
A: Hi everybody, welcome to this new Jenkins infrastructure meeting. We have a few incidents to report, so we definitely have a lot to talk about today. First, let me share the notes for today; as last week, we are going to use HackMD.
A: So here are the notes. If you cannot edit the notes, feel free to request access. So let's start.
A: The first thing I want to talk about is the incident that happened yesterday with get.jenkins.io. As a reminder, get.jenkins.io is a mirror engine that redirects every request to download an artifact from Jenkins to the closest mirror to your location, which means that a request originating from Europe is served from a mirror in Europe, and so on.
A: What happened is that around 5 p.m. UTC, for some reason, the network storage, the Azure File Storage mounted into the mirrorbits service, stopped responding. We got an error saying the quota was exceeded, and because we could not communicate with the network file storage, mirrorbits had no idea which file could be distributed to which mirror.
A
So
that's
basically
what
happened
so
the
fallback
so
the
current
the
fallback.
The
way
it
was
configured
is,
if
that
to
get
the
changes
that
I
was
not
was
not
working.
It
fall
back
to
a
service
that
was
using
the
same
network
file
storage.
So
basically,
we
just
resent
way
too
much.
Requests
to
the
to
the
file
storage
was
which
was
problematic
at
the
time,
so
it
took
us
around
two
hours
to
three
hours
to
understand
the
issue.
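To make the fallback lesson concrete: as a purely illustrative sketch (the option names and URL below are assumptions to check against the mirrorbits documentation, not our actual configuration), the point is that the fallback entry must not share the failing backend:

```yaml
# mirrorbits.conf (illustrative sketch, not the real config)
# Fallback mirrors are used when normal mirror selection fails.
# The fallback host should serve files from its own local copy,
# NOT from the same Azure File Storage mount; otherwise the
# fallback only multiplies load on the failing backend.
Fallbacks:
    - URL: https://fallback.example.org/   # standalone VM, full local copy
      CountryCode: fr
      ContinentCode: eu
```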
A: We had a rough idea of what to do and where to look, and several people were involved in that outage. The first step was to redirect the traffic to a machine that we named pkg.jenkins.io, which has every file: that machine has the same content as what is located on the Azure File Storage. So the idea was just to redirect the traffic to a different machine, so we could put that service back on track until we understood what happened. The service was restored yesterday evening, Europe time, and everything was fine.
A: The temporary solution was fine, but that machine is not able to handle the load we see during peak hours, so it was definitely just a temporary fix until we understood what was happening. On my side, what I did yesterday was to open a session using PowerShell with the Azure account, to list every open file handle on that Azure File Storage. There is a hard limit of 2,000 open handles, and what I identified is that in one session we had almost filled that limit: around 1,998 open connections. I'm not sure exactly when, Damien, because we investigated together this afternoon, but around 4 p.m. UTC, I think.
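A minimal Python sketch of that quota check (the 2,000 limit comes from this incident; the `list_handles` call mirrors what the `azure-storage-file-share` SDK exposes, but treat the exact client and signature as an assumption to verify):

```python
# Sketch: count open handles on the file share and warn before the
# hard limit (2,000 in our case) is reached.
AZURE_FILES_HANDLE_LIMIT = 2000  # hard limit we hit during the outage

def count_open_handles(directory_client):
    """Count handles via any client exposing list_handles(recursive=True),
    e.g. a ShareDirectoryClient (assumption: verify the SDK signature)."""
    return sum(1 for _ in directory_client.list_handles(recursive=True))

def near_limit(open_handles, limit=AZURE_FILES_HANDLE_LIMIT, warn_ratio=0.9):
    """True once usage crosses the warning threshold."""
    return open_handles >= limit * warn_ratio
```

With yesterday's numbers, `near_limit(1998)` would have fired well before the storage stopped answering.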
A: Once the Azure File Storage was not working correctly, we saw a lot of side effects: CPU usage went crazy inside the nodes, memory increased, and obviously, because the service was not able to answer requests, the number of pending requests increased as well. We can clearly identify huge peaks that happened during that time.
C: Did we cause the Azure File Storage issue by making too many requests, or was it an Azure issue that caused the requests to pile up, with some race condition in whatever application keeping the files open? We cannot conclude one way or the other; we are missing the data to be able to conclude.
A
No,
so
so,
if,
while
we
are
still
on
that
specific
issue,
I
think
what
was
really
nice
to
notice,
based
on
our
learning
back
in
november.
A
First,
we
put
in
place
a
status
page
in
november,
so
we
could
communicate
about
this
issue
and
people
were
able
to
quickly
open
the
incident
on
the
jkc
status
hit
repository,
so
we
were
able
to
communicate
about
the
incidents
when
it
starts
when
closed.
That
was
the
first
thing.
A
First
thing:
sorry,
some
people
ask
why
the
monitoring
and
the
status
page
did
not
show
that
get
the
jenkins
that
I
o
was
down
the
root
cause
of
that
is
because
the
way
the
container
is
working,
the
service
starts
and
read
in
the
directory.
A
So
read
the
network
of
file
storage
content
into
a
directory,
so
the
service
was
working
but
could
not
read
the
data
in
a
specific
directory,
so
r
il
check,
because
we
only
monitor
the
route
so
get
the
changes
that
are
you
or
each
check
told
us
that
the
service
was
up
and
running,
but
when
we
tried
to
access
a
specific
file,
obviously
that
was
not
working.
So
that
was
the
thing.
So
we
have
to
improve
on
monitoring,
to
monitor
a
specific
file,
let's
say
slash
time,
slash
into
or
whatever
file
we
monitor.
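A minimal sketch of such a "deep" health check (the URL is a placeholder, not the real monitored path; the fetch function is injectable so the logic can be tested without the network):

```python
# Sketch: a deep health check that fetches one small, known file
# instead of only the site root, so a dead storage mount is detected
# even while the root URL keeps answering 200.
from urllib.request import urlopen

def deep_check(url, fetch=None, timeout=10):
    """Return True when the probed file can actually be served.
    `fetch` is injectable for testing; by default it does a real GET."""
    if fetch is None:
        def fetch(u):
            with urlopen(u, timeout=timeout) as resp:
                return resp.status, resp.read(1)  # one byte is enough
    try:
        status, body = fetch(url)
    except OSError:
        return False
    return status == 200 and len(body) > 0
```

Probing one known file like this would have caught yesterday's failure even while the root URL kept returning 200.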
A: Ideally, that file needs to be small, so the check itself doesn't put pressure on the network, but we have to improve the monitoring. Something that we saw in November, and we saw the same pattern this time, was that some requests were passing and others not. We were able to download some specific files, but others were returning 503 errors. In November we had the same issue: we could access every file except those under the plugins directory. I didn't understand why; at that time the problem resolved by itself, and we saw the same issue yesterday.
A: Some requests were passing, others not, and get.jenkins.io was up and running. So that's one of the things, and we're still not sure why. What we'll do is improve the monitoring to probe a specific file, but this will only help us detect the issue; in this case we didn't get any monitoring notification.
A
The
second
thing
that
we
monitor
that
we
checked
was
since
now
a
while.
We
monitored
that
we
can
download
the
latest
jenkins
version
from
package
of
jenkins
leo.
So
we
have
a
monitoring
check
for
that
and
how
it
works.
We
query
repo
the
jk.org
to
see.
What's
the
latest
version
for
the
weekly
and
deltas,
we
relieved
the
retrieve
that
version,
and
then
we
query
get
the
jenkins
that
I
use.
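In rough Python, with both endpoints abstracted away (the callables and the failure threshold are illustrative, not the actual check):

```python
# Sketch of the release-download check: ask the release repository for
# the latest version, then verify the artifact is actually downloadable
# through the mirror front-end.
def release_available(latest_version, probe):
    """latest_version: () -> str (e.g. latest weekly or LTS version).
    probe: (version) -> int, the HTTP status of a download attempt.
    True when the newest release can really be fetched."""
    return probe(latest_version()) == 200

def should_alert(history, required_failures=3):
    """Only page after several consecutive failures, so a single
    flapping sample does not wake anyone up."""
    recent = history[-required_failures:]
    return len(recent) == required_failures and not any(recent)
```

An intermittent pattern like yesterday's (pass, fail, pass) never satisfies `should_alert`, which is exactly why no alert fired.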
A: If that check fails for 30 minutes, it triggers an alert and we get paged. What happened here is that the service was not failing consistently for those two hours: some requests were passing and others not. When we look at the Datadog dashboard, we clearly see that half of the requests were working and the others not, and that's why we didn't get notified by the monitoring. We looked at multiple things here, but we could not really conclude.
A: Obviously, in my case that was easy, because the command that I ran back in November was still in my PowerShell history, so I just re-executed the same commands. But I should put that documentation in a runbook, so that next time, if it happens again, someone else can do the same.
A: I just listed here the kinds of access that are made to that network storage, to give you an idea. While you can mount the same Azure File Storage into multiple containers, and you can write and read from multiple containers, you still have a limit on the number of files you can have open at the same time. To give you an idea: we have a monitoring check that tests whether we can access some specific locations on the container; we have Apache serving content; we have mirrorbits, which scans every container on a regular basis; and we have the Datadog monitoring. So it's really difficult to have a clear understanding of where those file accesses came from, but we are still investigating. The next topic that I want to mention, the next outage... well, it was not really an outage.
A
So
any
question
before
we
move
on
no
another
issue
that
happened
last
week,
so
we
wanted
to
improve
the
way
we
deliver
jenkins
at
our
website
by
directly
totally
rely
on
hem
charts,
and
we
face
an
interesting
challenge
here.
A
We
put
branch
protection
on
the
git
repository
that
contains
jenkins.
That
are
your
website,
so
we
put
branch
protection,
so
we
always
use
pull
requests
to
introduce
change
and
because
the
new
workflow
implies
committing
to
the
branch
and
we
could
not
identify
a
way
to
say
we
want
to
keep
the
brand
protection
but
only
allow
a
specific
bot
to
modify
that
specific
file.
C: So I have a proposal here, because there were at least six different ways of implementing that workflow, based on changing some bits of the automation. This means that we don't have a consensus right now, so that could be nice ground to start writing an IEP, which is the same kind of thing as a JEP.
A: I would be really happy to work on that document because, four years ago, I worked on the same thing for the current implementation and the current way to deploy the jenkins.io website, and every other website like the javadoc and plugins sites. In four years a lot of things have evolved, so I would be really glad to re-evaluate the assumptions that were made four years ago and to propose something different.
C: It was mentioned that maybe the IEP process could be merged into the JEP process. Here, I don't really care: the goal is that we get started on writing the proposal there, and if we see that we can do that exercise as a community team for one or two important topics, then we can raise the discussion of whether we should move to a JEP. The goal is to learn to walk before running; that's the point of the proposal of staying on the IEP repository, which hasn't been updated in a long time.
A: That sounds like a good idea, and I would even go one step further. Since we are working again on the documentation Git repository, to put a lot of documentation there, I'm just wondering if we could put those documents there as well, so we would regroup the documents, the meeting notes, outages, maintenance, and documentation in one place.
A: OK, so right now we have a Git repository named "documentation" that was created a long time ago, and the idea was to have public documentation where we document everything related to the infrastructure.
A
We
agreed
that
we
would
push
in
a
different
git
repository,
which
is
the
documentation,
and
so
since
we
we,
we
collect
the
the
note
for
the
meetings
and
the
upgrade
plan
and
every
other
things,
including
run
books.
I
was
just
suggesting
to
move
the
ip
documents
in
the
same
deck
in
the
same
git
repository,
so
we
just
have
a
bigger
repository
with
more
content.
B: Documentation sounds good.

A: Those were the two most important topics that I wanted to talk about today. Damien, you were mentioning that you wanted to bring up the throttling issues we are seeing (I'm never sure how to pronounce that in French), so I guess now is the right time.
C: It's "throttled", with a t. We are throttled, which means that some of our requests are queued before being sent, to avoid peaks of workload on the API control plane. These requests come from different sources: mostly our helmfile process, which takes care of the GitOps operations on the Kubernetes cluster, but also all the Jenkins instances that are spawning pods, because when a pipeline is running inside a pod template, Jenkins runs websocket commands from its Kubernetes client against the cluster.
C: There is some monitoring currently integrated in Azure, so we could maybe start with that point. I don't know how to extract that information continuously from Jenkins, though there are ways to have a Groovy script that will print the instantaneous usage of the current open connections from Jenkins.
A: What would also be nice is to identify all the potential limits that apply when you use AKS, because that's a common issue in Azure: whatever service you use, you have limits, like the number of machines and CPUs you can deploy in one region, or the number of files you can open on the Azure File Storage at the same time. So maybe we are hitting a limit that we need to identify.
C
And
also
that
could
be
worth
it
to
check
with
jenkins
and
jenkins
six
communities
and
maybe
jenkins
user
or
any
variation,
because
I'm
sure
we
are
not
alone,
I
mean
we
don't
make
so
much
request.
So
is
it
because
our
azure
iks
is
meeting
some
issues?
Is
it
because
of
the
kubernetes
version?
Do
we
have
other
user
with
the
same
issues?
Another
kubernetes
kind
cluster,
because
I
mean
it's
not
uncommon
and
I
we
don't
do
anything
that
is
exotic.
We
run
pipelines
on
pods
and
the
pod.
Template
might
have
two
or
sometimes
three
containers.
D: But I don't think we're running that on that particular cluster. I'm guessing that it's probably related to the fact that we run a full deploy of the whole Helm chart, maybe too often.
C: Yeah, so right now the first step will be, as Garrett said, checking the existing monitoring on Azure, seeing the breakdown between the different clients and which one is emitting bursts of requests, and then from there we can go further.
C: We already had the knowledge from the previous incident, so we partly failed and partly succeeded, because Tim was able to provide a fallback solution. So I've identified at least two procedures that must be written at all costs. First, how to fail over: what Tim and Mark did yesterday, how to put a temporary fallback in place so users can still download files for one full day, slower and without the mirror capability, but they can still download. Second, how to identify and fix the Azure File Storage issue: we were able to follow Microsoft's online procedure, but we were missing some points about the PowerShell script that Olivier mentioned earlier. This should be a runbook, and by runbook I mean a no-brainer: just the main dots that we can still connect ourselves during an incident.
C
The
line
between
the
dots,
but
we
need
there
was
some
missing
element
that
could
have
helped
and
maybe
could
have
helped
the
fixing,
without
bothering
olivier
so
now
that
we
all
have
the
knowledge
and
understanding
of
what
happened,
because
since
it
happened
already
in
november,
that's
the
second
time.
That
means
we
will
have
another
issue
with
azure
file
storage.
B
Is
it
worth
one
more
action
item
to
investigate
or
suggest
or
discuss
ways
to
detect
that
style
of
failure?
The
the
failure
mode
was
was
rather
strange
in
terms
of
flapping
and-
and
you
know
it
was
on
again
off
again
some
some
here
some
there.
I'm
not
sure
that
failure
detection
is
is
ultimately
possible
for
that
kind
of
failure,
but
is
it?
Is
it
worth.
A
So
yeah,
to
be
honest,
identifying
this
issue
that
happened
yesterday
is
definitely
challenging
because,
as
you
says,
as
you
say
that
was
flapping.
When
we
look
at
the
monitoring,
sometimes
they
check,
where
passing
sometimes
that
and
even
worse,
we
have
400
gigabytes
of
files.
Some
files
were
accessible,
others,
not,
and
in
november
those
files
were
under
the
director
plugins.
This
time
it
was
not
necessarily
the
same.
C
What
do
you
think
olivier?
Maybe
putting
I
don't
know
if
it's
technically
possible,
but
the
amount
of
file
handle
open
on
the
azure
file
storage
were
a
pretty
good
indicator
that
something
went
wrong
because
it's
the
quota
we
we
reached
and
once
the
quota
was
reached,
then
everything
got
south.
So
do
you
think
it's
possible
to
have
a
routine
that
hourly
that
measure
at
least
the
amount
of
open
500
on
that
specific
volume?
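As a sketch, with the actual handle-counting call left injectable (since which Azure API to use was still an open question in the meeting), such a routine could look like:

```python
# Sketch: hourly sampling of the open-handle count on the volume,
# flagging quota pressure. `count_handles` wraps whatever Azure API
# (or scripted PowerShell equivalent) ends up being used.
import time

HANDLE_QUOTA = 2000  # the hard limit discussed during the incident

def sample(count_handles, quota=HANDLE_QUOTA, warn_ratio=0.8):
    """Take one measurement; return (count, alert_needed)."""
    count = count_handles()
    return count, count >= quota * warn_ratio

def watch(count_handles, samples, interval_s=3600, sleep=time.sleep):
    """Take `samples` measurements, one per `interval_s` seconds.
    `sleep` is injectable so tests don't actually wait an hour."""
    results = []
    for i in range(samples):
        results.append(sample(count_handles))
        if i < samples - 1:
            sleep(interval_s)
    return results
```

Publishing that count as a custom metric would then let the existing Datadog monitors alert on it.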
A
So
first
thing:
first,
everything
is
possible
depending
on
how
much
time
we
want
to
invest
in
there
practically
I'm
not
sure.
If
it's
worthwhile,
because
python
the
checks
are
published
to
datadog,
they
are
written
in
python.
I
I
don't
have
any
documentation,
so
we
could
probably
use
the
python
sdk
for
that,
but
those
are
definitely
not
information.
A
I
don't
think
those
information
are
available,
as
is
in
data
and
so.
B: Would you be okay, Olivier (I think there's some interest there), if I ask Datadog whether they already have a built-in monitor somewhere that would check Azure File Storage usage? I would expect this to be a common thing they've already implemented, and all we would need to do is find out what they did.
A
So
we
definitely
have
storage
information
storage
file
counts.
I
have
to
do
that
check.
I
mean
we
have
information
like
the
month,
egress,
ingress
and
stuff
like
that,
but
we
have
the
latency,
but
we
don't
have
the
information
that
yeah.
I
have
to
look
at
okay
thanks,
so
we
are
running
out
of
time,
so
I
propose
to
finish
the
meeting
here,
but
before
we
do,
I
just
want
to
to
highlight
the
fight
that,
because
of
the
new
workflow,
we
are
going
to
have
one
document
permitting.
A
So
I
put
the
idea
of
this
document
the
link
to
the
next
week
meeting.
So,
if
you
have
anything,
you
want
to
put
to
the
agenda,
feel
free
to
add
that
information
there
and
so
yeah.
We
used
the
next
token
next
week,
thanks
for
your
time
have
a
great
day
bye.