From YouTube: Consul discussion gitlab-org/gitlab#271575
Description
Skarbek and Jason Plum investigate an event in which Consul misbehaved on some queries, and discuss options for continuing the investigation and possible solutions to the issue.
Reference: https://gitlab.com/gitlab-org/gitlab/-/issues/271575
A: You have a series of error messages. You can get an actual failure to connect, right: you connect and get "no route to host", or a connect timeout. Then you get the "no response", which is where literally nothing came back across the pipe. And then there's effectively a third, which is "no answer was given".
B: So it looks like, in your latest message, no route to host, and that could be a few things, right? That would require us to dig through logs to find a correlation with a scaling event or a pod rotation event. The other one that it seems you want to concentrate on is...
A: Right. Based on what I saw (noting that I watched and read the notes from the call this morning, yay double speed), what I'm seeing is that, on occasion, Consul has a message that the client disconnected.
A: So, the message I have from 10 a.m. Eastern this morning points out that there are the two types, and the one that we really care about is the "no response from nameservers list". There was an instance where we were getting more than one actual error, but we were swallowing it. We've since fixed that, so now we get a "no response from DNS servers" when we try to look up the DNS server that we're supposed to be asking. So that problem is out of the way; we can identify that now.
A: That means that we connected, and then there was no response in time, and it cut the connection, because it was like: somebody should have responded by now.
A: There are very rare "no route to host" errors; those appear, based on their sparsity, to be either pod scaling or node scaling. Almost always, what I'm seeing is that the "no route to host" errors come from trying to reach out and hit the actual DNS server, like kube-dns, and it goes poof, it goes away. I need to verify this, but that appears to be what I'm seeing: I'm not seeing it get a "no route to host" connecting to a service IP; I'm seeing it get a pod IP.
A: Something a little funny for their kube-dns autoscaler? I don't think that's us, because we would have a consistent IP address, and that is indeed where all of the messages that we're going to talk about here in a second come from: one locked service address.
A: Okay, so almost all the time it's like, poof, something went away. And, as a general rule of thumb, a service, so long as the service is present, will have a service IP. No route to host? You'll get a route to that host. What you won't get is a connection: the connection will time out trying to connect. You'll still get a connect error, but it won't be "no route to host".
A: So, on my hunch that the "no response" means there was actually a timeout somewhere, I went looking in the logs. I took the same base selector and time window, looked for anything with "timeout", and then realized: oh look, there's a very nice, clean message written to standard out, "not responding within TCP timeout". So I'm going...
A: ...through, you know, the last day with this message, and I just kept cutting it down to the smallest window, to show replication of the events. What we end up seeing is that every time we get the message "no response from nameserver list", you can see the underlying net-dns trying to connect: the timeout expires, it tries again, tries again. And I'm like, bingo, yeah. Now at least we have a point of correlation that we can try to use.
A: Right, so that's the finicky part here, because the way we're doing it, our GitLab database load balancing resolver only pulls the first answer. Now, in theory, that should be the service IP address.
A
There
are
ways
with
cube
dns
to
make
a
request
and
get
all
of
the
ip
addresses
of
all
the
services
currently
backing
the
pod,
and
then
you
can
walk
down
it.
We've
seen
this
in
some
experience
in
dealing
with
nginx
yeah
right,
you
get
the
end
points
back
instead
of
a
particular
individual,
ip
or
the
service.
I'd
be
doing
the
proxy
result.
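For illustration, a minimal sketch of that endpoints-style lookup, assuming a hypothetical headless Service: with clusterIP set to None, cluster DNS answers with every ready backing pod's IP instead of a single virtual service IP, giving a list you can walk down. The name and selector below are placeholders, not the actual chart's.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: consul-dns-headless   # hypothetical name
spec:
  clusterIP: None             # headless: a DNS A query returns all ready pod IPs
  selector:
    app: consul               # placeholder selector
  ports:
    - name: dns-tcp
      port: 8600              # Consul's standard DNS port
      protocol: TCP
    - name: dns-udp
      port: 8600
      protocol: UDP
```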
A: Now, one of the things that's more expected from their documentation and their chart, from what I can see, is that they expect you to shim or stub the Consul domain into the actual kube-dns or CoreDNS that's implemented in your cluster. You basically convince it to send anything that's .consul to the Consul service IP address. Now, there's an upside in that regard, in that even if the TTL is set to one second, if you get 10 requests in one second, Consul's not trying to answer that request for all ten. Let's just say, as an example, we scale plus two nodes, and both of those nodes will happily fit 20 webservice pods. They won't at our scale of nodes, but as an example: there's a sudden scaling event, the pods will come up, and they will probably come up before the local DaemonSet Consul has actually joined the Consul cluster, right? Neither of them are members yet.
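The stub-domain setup described at the start of this turn follows the pattern in Consul's Kubernetes DNS documentation: a kube-dns ConfigMap entry that forwards everything under the consul domain to the Consul DNS service. The address below is a placeholder for that service's ClusterIP.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  # 10.0.0.10 stands in for the Consul DNS service's ClusterIP
  stubDomains: |
    {"consul": ["10.0.0.10"]}
```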
A: The 10 pods come up, their dependencies run pretty quick, and then they all try to fire at once, and they're going to be within a second or two of each other, generally. Then you've got the problem of, boom, who do I get? Do the requests to the service IP actually end up being spread across the other eight members of the cluster that are active and members, or is one of them just getting hammered all of a sudden, right?
B: If you don't have it already, here's a link to this morning's agenda; I'm just going to reuse it. I just added your question as the very first item, so that we can revisit it.
B: Alright, so, using the query that you had posted in the issue: clusters C and D haven't suffered anything in the last five hours. Cluster B has a sporadic set of events that happened at 14:00 UTC, again at 14:30, and then at 16:30, so, like, roughly 10 minutes ago. I'm tempted to make this slightly easier and just target one of those time windows.
B: Yep, and I did confirm that the IP address showing in the log search that you posted is the correct cluster IP address utilized by the Consul DNS service.
B: Its periodSeconds is set to ten, so it runs every ten seconds, and its timeout is one second. So if this command fails to run within a period of one second, we'll fail the readiness probe. We only failed the probe once, and, assuming our timestamps are all in a kosher state, it looks like we could have a pod in a failing state receive a request, the request is not met, and then the next readiness probe is sent to the pod and fails.
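A sketch of the probe shape under discussion, assuming the exec-curl style used by older consul-helm charts (the exact command in the deployed chart version may differ):

```yaml
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -ec
      - |
        # passes only if the local agent can name a cluster leader
        curl http://127.0.0.1:8500/v1/status/leader 2>/dev/null | grep -E '".+"'
  periodSeconds: 10    # runs every ten seconds
  timeoutSeconds: 1    # the one-second window discussed above
```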
B: Well, the timeout for the curl request is one second, so, like, I don't know why we're failing our readiness probe, but that's a very short window of opportunity.
A: Because if a message refutation comes in and then, blip, it goes off: is that what results in whatever is causing the one-second curl timeout to not pass? Or are we seeing a CPU spike while it's processing some sort of event or other network traffic? There's got to be something correlative that we can bind to why this is happening. I know that we see these.
B: Refuting, so, still excluding info logs (I don't know what's in the info logs, but excluding them) and counting the word "refuting": 5,600 times... wait, is that the right line number? Yeah: 5,625 times, "refuting a suspect message from" some machine. And this varies; around that time it was specifically a VM, but I also see this for GKE nodes that are participating as well.
A: I just popped open the Consul chart. There's a laughable comment in here for the readiness probe, because we do this curl exec, and then we turn around and, like, hit localhost on the specific port or whatever else.
A: There's a note that says: when our HTTP status endpoints support the proper status codes, we should switch to that; this is temporary.
B: So, we're running a really old version of the Helm chart. I don't know what version you're looking at, but I feel like that comment has probably existed for quite a while. If it means anything, we're on version 0.20.0, I believe.
B: I did capture that somewhere, not in this issue, but I did capture that somewhere already. So we know that nothing happened to the node itself; it's not a node scaling event.
B: We know the pod has not been rescheduled, so maybe the next thing to look at is metrics for that node, to see if we're suffering, maybe, high CPU or memory usage on the node this is operating on, and maybe also for this pod; maybe this pod was suffering from some unfortunate event. So, where in our logs is the node that we're running on... this is the target node that we want to look at.
B: Let me double-check that, but I'm pretty sure that's where we are.
A: In that particular case, we should be able to replicate that exact pattern on a regular basis, and watch for pods that are having problems responding within the one second.
B: Potentially. That's primarily what I wonder about, the disk utilization. But also note that right around 16:30 our metrics kind of disappeared for a second on this node; some of the metrics disappeared, for some reason.
A: Hopefully, or not: guarantees on the QoS for Consul's memory and CPU, as opposed to best-effort.
A: Right, and I don't know that we can adjust it to a smaller interval, but unquestionably, an average per-core load of 120 percent...
B: This node is handling gitlab-shell.
B: On this particular node pool? No. Like, we'll have fluentd and elastic, or, like, pubsub, but as far as what we want to run on this node pool, it's just going to be gitlab-shell.
B: We want shell... 233e34f8-dvd5s... and what's the key combo to show me pods?
B: Yeah, so, as you see, there's only gitlab-shell. You know, here's our pod that we're investigating. So: only gitlab-shell, a bunch of stuff that Kubernetes or GKE provides us, followed by our logging infrastructure and the node-exporter that we manage. 45 restarts for that node-exporter; that's great.
A: Yeah, I definitely, definitely want to do some investigation into what's generating these massive spikes in average per-core load. Like, it's good that we're using it, right, don't get me wrong. But when you can just kind of trace the average line and you're below 75, yet you're regularly spiking over a hundred, yep, there's a problem here.
B: Okay, so look at that chart for me; expand your view a little bit.
A: Massive spike: there's 100 percent, you shoot just over 100, you spike at 16:35:30, and then, you said...
A: So the likelihood is that we may be seeing it not actually doing a full swing, right? We could be going from 30 to 200 and back, but over that 30-second sample, it, you know, didn't crest all the way up there.
B: Seems weird. You might find this interesting; I'm gonna paste a link for this. This is our workload on the node.
B: So, here, you know, "no response from nameservers list": this happened at 14:01, so, similar behavior.
B: Network activity was on the drop at that point, in fact, but, you know, still reasonable. Let me go back to this.
A: No, okay. Can you...
A: ...check the environment to see any hosts that are exposed through env, specifically if anything happens to show up as an entry for Consul.

A: Yeah, you could just kubectl exec env.
A: Honestly, my expectation is we won't need more than a couple hundred meg, like 100, like 200 megabytes of memory; more seems probably not needed. That's an easy one to find out: just grab all the Consul pods and see what their lifetime memory is, boom. And I would say we probably don't need to guarantee more than, say, 300 or 400 milli, like 300m, on CPU.
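Sketched as chart values, using the figures floated here (placeholders pending the lifetime-memory check). Setting requests equal to limits is what moves the pods from BestEffort to Guaranteed QoS:

```yaml
resources:
  requests:
    cpu: 300m        # figure floated above; verify against observed usage
    memory: 200Mi
  limits:
    cpu: 300m        # requests == limits across containers => Guaranteed QoS
    memory: 200Mi
```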
A: And then you start having long entries that, like, say: I tried that one, then I tried that one, and then I tried that one. That would be somewhat feasible if, one, we could ensure that not all traffic is being directed to whatever sits at the top of the list, and two, if our GitLab database load balancing resolver actually returned more than one entry, because it doesn't.
B: This is a bad idea, but: instead of having our pods reach off of the node in order to reach Consul, is there a hostPort configuration that we could leverage? Then we could somehow configure all of our pods to talk to localhost on some port that we know Consul should be running on, since we run Consul on all of our nodes anyways, currently. That would eliminate a network hop.
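A sketch of that hostPort idea; 8600 is Consul's standard DNS port, and the port names mirror the dns-tcp/dns-udp convention mentioned later in the call, but treat this as illustrative rather than the chart's actual spec:

```yaml
# Container ports on the Consul client DaemonSet (illustrative)
ports:
  - name: dns-tcp
    containerPort: 8600
    hostPort: 8600    # bound on each node's own IP, reachable from local pods
    protocol: TCP
  - name: dns-udp
    containerPort: 8600
    hostPort: 8600
    protocol: UDP
```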
B: However, if we set the priority class to a certain value, and we set the resources as we should (because currently we don't), theoretically Consul should bounce back. And since we poll Consul, every 60 seconds I believe it is, we'll put pressure from that one node onto the primary database for the one minute in which all of those pods failed the request. So hopefully, if something like that happened, it shouldn't last all that long, assuming Consul was able to spawn back up correctly.
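A minimal sketch of the priority-class half of that; the name and value are hypothetical, and the DaemonSet's pod spec would reference it via priorityClassName:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: consul-client-critical   # hypothetical name
value: 1000000                   # higher than workload pods, so Consul is evicted last
globalDefault: false
description: Keep Consul client agents running under node pressure.
```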
A: So that's the trade-off, in that the target ports are known, and we know the ports according to the DaemonSet. They expose them named dns-something: basically one is TCP, one is UDP, the same thing. So if I look at the DaemonSet...
A: So our Consul is out of date, chart-wise, but that means an application update, which might mean a compatibility update versus what's in Omnibus. Are we using the Omnibus for our Consul cluster?
A: The change to make the service become a NodePort instead is not that hard, but there's some dancing to do on that one, because you really don't want to try and expose port 53 on everybody's node unless you have to, and there are a lot of touchbacks on that one. I still kind of want to try it.
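For illustration, the NodePort variant would look roughly like this. Part of the dancing is that Kubernetes only allocates nodePort values in the 30000-32767 range by default, so you cannot simply claim port 53 on the node this way:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: consul-dns              # illustrative name
spec:
  type: NodePort
  selector:
    app: consul                 # placeholder selector
  ports:
    - name: dns-udp
      port: 53
      targetPort: dns-udp
      nodePort: 30053           # hypothetical; default allowed range is 30000-32767
      protocol: UDP
```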
A: Instead of having... figuring out, I should say: I want to try to figure out whether, instead of having us ask where Consul is and then ask Consul, it makes sense to route the request through the system DNS, through kube-dns or CoreDNS, in a stub domain. Because, if anything else, at least then, when you get 10 requests all of a sudden, it's not: hey, here's ten requests, Consul, I hope you return, right? Like, yeah.
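The CoreDNS equivalent of that stub domain, again following the pattern in Consul's documentation; the forward address is a placeholder for the Consul DNS ClusterIP, and the short cache is what absorbs a burst of near-simultaneous requests:

```yaml
# Extra stanza for the CoreDNS Corefile (existing stanzas elided);
# 10.0.0.10 stands in for the Consul DNS service's ClusterIP.
Corefile: |
  consul:53 {
      errors
      cache 30
      forward . 10.0.0.10
  }
```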
B: Maybe refine what I'm writing! I'm writing down action items.
A: So I'd have to go digging into the load balancer, the GitLab database load balancer code base, again, to see what parameters we're passing to the search, and whether the timeout configurability is there. I know that the gem's default is fine; I have to see if our default is that, or whether there's any timeout configurability beyond what's there. We can pass it a different timeout; we just have to make it possible.
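For context, a sketch of GitLab's database load balancing service-discovery settings as documented (the nameserver and record values are placeholders); whether a resolver timeout can be threaded through here was exactly the open question:

```yaml
# database.yml (sketch)
load_balancing:
  discover:
    nameserver: localhost                # placeholder: where to send DNS queries
    port: 8600                           # Consul's DNS port
    record: db-replica.service.consul    # placeholder record to look up
    record_type: A
    interval: 60                         # matches the 60-second polling mentioned earlier
```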
B: Right, okay. Is there anything else that we want to touch on? Otherwise, I think we're ready to end the call.