From YouTube: Node.js Deep Debugging - Gireesh Punathil, IBM India
Description
While there are established debugging techniques for well-known problem types, we have come across rare, complex, yet interesting production issues stemming from the pervasive build and deployment configurations and the heterogeneous workload types that node.js is subjected to. In this session, we will illustrate case studies of a few unique issues that we debugged, the custom diagnostic tools that were used, and the lessons learned. Attendees will learn to address deeper-level production problems and to self-diagnose anomalies.
Hi, welcome, and thank you for joining this talk on Node.js deep debugging. Before we talk about what this session is all about, its stated purpose and expected outcome, let me provide a brief intro of myself. I'm Gireesh. I work for IBM; I'm an architect with a team called IBM Runtimes that is responsible for developing and supporting language runtimes such as Java and Node.js. I'm also a member of the Node.js Technical Steering Committee and the Diagnostics Working Group.
That essentially means I have spent a good amount of time debugging Node.js internals, and I felt it prudent to share some of these experiences, some of the paths I traveled, with people who are using Node.js and might come across the same or similar situations as the issues that I debugged. Instead of reinventing the wheel, it would be pretty easy for them to reuse some of these experiences and lead the issues at hand towards quicker resolutions.
This is how I want to explain it: I'll start with the execution environment where the issue was reported and how the issue manifested externally, then the whole path of the proceedings towards problem determination, what the actual issue was and how it got resolved, and then the tools, the methodologies, and the best practices that I followed, as well as the learnings and the insights that are reusable. So that's the whole idea. All right, so here is our first topic.
I stress the words "supposed to be" because that's the whole essence of the issue at hand, and we will see why. The issue manifested in the form of a high and steady RSS size in the Node.js process, and it was reported by many users with a similar symptom: they were all seeing very low heap usage but very high RSS.
A
This
is
not
a
surprise,
because
we
know
that
node.js
is
not
just
using
memory
from
the
javascript
heap.
It
has
also
bearings
on
the
native
backend.
Many
objects
have
memory
that
is
used
in
the
native
heap
as
well
so
low
javascript
heap,
but
high
native
memory
and
high
rss
on
the
face
of
it
is
not
a
problem,
but
because
many
users
reported
the
same
issue
with
a
similar
symptom.
We
thought
it's
problem
to
investigate
and
started
looking
at
it.
One of the common symptoms was the fact that they all had a history of huge memory allocation, in the form of various objects getting created back to back, and then later all the objects getting garbage collected, but the RSS not coming down proportionately. That was a common symptom reported by most of these customers or end users.
Now, coming to the problem determination: as we know, the JavaScript heap is not a true reflection, so we make use of operating system tools to see what the actual memory consumption of the process is. The Linux tools such as top, pmap, etc. help here; we look at the consumption, and it's correct.
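
The same comparison can also be made from inside the process; here is a minimal sketch of mine using process.memoryUsage(), which reports the JavaScript heap next to the RSS and is how the "low heap, very high RSS" symptom typically shows up:

    // Print JavaScript heap usage alongside the RSS, both in MB.
    // process.memoryUsage() is the in-process counterpart of what
    // top and pmap report from the outside.
    const mb = (n) => (n / 1024 / 1024).toFixed(1) + ' MB';

    setInterval(() => {
      const { rss, heapUsed, heapTotal, external } = process.memoryUsage();
      console.log(`heapUsed=${mb(heapUsed)} heapTotal=${mb(heapTotal)} ` +
                  `external=${mb(external)} rss=${mb(rss)}`);
    }, 1000);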
The behavior was not very consistent, even with the same program run on different systems, say, systems differing in specification or differing in terms of their load conditions. On the face of it that was a challenge, but it gave us a lead in terms of how to proceed. So one of the first steps was to disambiguate, to get an answer to why the same program behaves differently on systems with differing specs, differing amounts of physical memory, or different load characteristics.
Then we saw that the RSS, once it hiked up, never came down; whereas if the system was tight in terms of physical memory with respect to the running process, and there was a lot of load on the system, then we saw the RSS immediately coming down, which is in accordance with the expectation for the Node process. Reading through various documentation, manuals, and specifications around virtual memory and resident set memory, and with the help of a small C program that truly simulated the Node.js scenario, we could actually understand the root cause of the issue.
In my small C test case, what I was doing was allocate small chunks of memory in rapid succession but in a large amount, and then immediately free all that memory: say, for example, 2 KB chunks, about 1000 such chunks, allocated from the free pool and then freed immediately, then sleep for some time to see whether the RSS that shot up would come down or not.
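
The actual test was a C program, but a rough Node.js analogue of the same experiment (my construction, not the speaker's code, and it assumes the process is started with --expose-gc) looks like this:

    // Rough Node.js analogue of the C experiment: allocate many small
    // chunks back to back, drop every reference, force a GC, then sleep
    // and watch whether the RSS that shot up ever comes down.
    // Run with: node --expose-gc <this file>
    const mb = (n) => (n / 1024 / 1024).toFixed(1) + ' MB';

    let chunks = [];
    for (let i = 0; i < 100000; i++) {
      chunks.push(Buffer.alloc(2 * 1024)); // 2 KB chunks, in rapid succession
    }
    console.log('after alloc, rss =', mb(process.memoryUsage().rss));

    chunks = null;  // release every reference
    global.gc();    // collect them (requires --expose-gc)

    setInterval(() => {
      console.log('after free, rss =', mb(process.memoryUsage().rss));
    }, 5000);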
The result was really a surprise, at least to myself and to my fellow customers or end users. I don't know if it's a known art, but it was really surprising: the resident set size does not truly reflect the resident memory of the process. The process had long since freed up the memory, but still, unfortunately, it was all accounted against the same process.
Unfortunately, the challenge with respect to an implementation or a solution for this problem is the fact that we don't have a consistent, well-defined API that would capture the actual accounted memory of a process. The actual accounted memory of the process is something called the working set size in Linux.
Unfortunately, we don't have an API to get that value out, and secondly, this is not something we can implement cross-platform: Node.js is supported on a number of other platforms where we may not be able to consistently get this value. For these reasons, the RFE is still not completed.
The next one is the exit race. This is an issue that was reported in Node.js version 11. So what is an exit race? Of course, the name was given at a later point in time, once we had completely diagnosed the issue and identified the resolution. The issue manifested as random crashes by all sorts of programs, with no specific pattern whatsoever that could be identified.
The PD approach was to run the program many times under the problematic version, collect as many dumps as possible, associate the failing contexts from the dumps, and see if we could classify the failing contexts into a discrete set of patterns, rather than an indefinite set of random crashes.
This approach helped in two aspects. Number one, it helped us understand that there were two or three or four discrete patterns, not too many. Number two, it revealed that in all the cases the main thread was in its exit path. That essentially means that in all the programs, the process was about to exit, or the main thread had initiated the cleanup activities. So this was really a strong lead.
The other challenge that we faced was that when we did a git bisect, it pointed to a commit, but by looking at the changes belonging to that commit, it was evident that the commit had no relation to the issue; it was in absolutely no way connected.
So the order is absolutely important, because there are interdependencies that can be defined between the resources and the code paths that access these resources. The issue pops up when there are multiple threads; it could be worker threads, or it could be the internal helper threads that Node uses to perform asynchronous operations.
We were seeing random crashes, so the resolution was to follow, at teardown time, the order in which the resources were created, as well as to make sure that the helper threads which can potentially access these resources are kept in a very consistent state at the time we are doing the cleanup. That means if one or more threads can access one of the resources, we make sure each such thread is brought to a state where it is no longer possible for the thread to access that resource.
That was the resolution. One of the problem determination techniques that we learned and applied in this scenario is called timing injection. Timing injection essentially involves adding sleep delays between activities where we suspect the presence of multiple threads that can cause issues: basically, widening the timing window in an artificial manner and thereby reliably simulating the race.
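
The races in this case were in Node's native teardown code, but the technique itself can be shown with a contrived JavaScript sketch of my own; the injected sleep widens the read-modify-write window so the lost update happens on every run instead of once in a blue moon:

    // Contrived illustration of timing injection: the injected sleep
    // widens the window between a read and its dependent write, so two
    // concurrent tasks reliably interleave and the lost update becomes
    // deterministic instead of intermittent.
    let balance = 100;
    const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

    async function withdraw(amount) {
      const seen = balance;    // read
      await sleep(10);         // injected delay: artificially widen the window
      balance = seen - amount; // write based on a now-stale read
    }

    (async () => {
      await Promise.all([withdraw(30), withdraw(30)]);
      console.log(balance);    // prints 70, not 40: one update was lost
    })();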
Long story short, to clear up our confusion around which process owns the console from which the processes are spawned: between the parent and the child, all the native stream handles of the child process are closed by default, and then three pipes are created by the parent. One end of each pipe is held by the parent, while the other end is given to the child, and the child uses it as if it were one of the standard streams.
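
In userland terms, that wiring is what the default stdio: 'pipe' configuration of child_process gives you; a minimal sketch (assuming a POSIX system for the cat command):

    // The default wiring described above: the parent holds one end of
    // each of the three pipes, and the child sees the other ends as its
    // stdin, stdout, and stderr.
    const { spawn } = require('node:child_process');

    const child = spawn('cat', [], { stdio: ['pipe', 'pipe', 'pipe'] });

    child.stdout.on('data', (d) => process.stdout.write('from child: ' + d));
    child.stdin.write('hello over the pipe\n');
    child.stdin.end(); // cat exits when its stdin closes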
So whenever the child writes into the stream, the data goes to the parent, and whenever the parent writes into the input stream of the child, the data flows into the child process. That's why the child is able to use the standard streams effectively, as if they were the standard input, output, and error streams, while the parent is able to fully control those streams. That mechanism works pretty well.
Now, coming back to the problem context: what was happening was that two child processes were engaged in communication by re-piping the standard streams. For example, one child process's output stream was piped into the other child process's input stream, so that whenever the first child process wrote, the data should flow into the second child process. That was the understanding; that was the design. But unfortunately, the data was getting lost in an arbitrary manner.
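
The failing pattern looked roughly like this, where producer.js and consumer.js are hypothetical stand-ins for the user's programs:

    // Two children with piped stdio, re-piped to each other: the intent
    // is that everything the first child writes flows into the second.
    const { spawn } = require('node:child_process');

    const producer = spawn('node', ['producer.js']); // writes to its stdout
    const consumer = spawn('node', ['consumer.js']); // reads from its stdin

    producer.stdout.pipe(consumer.stdin); // the re-piping described above

    consumer.stdout.on('data', (d) => process.stdout.write(d));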
One of the hypotheses that we made around this was: the child endpoint is already a pipe, we know that, and that essentially means there is already a channel open between one of the child processes and the parent process. Now, when the child end of the pipe is re-piped, that essentially means the file descriptor, the underlying handle of the pipe, is copied and redirected to the endpoint of the second child. What happens to the channel that is still open between the first child process and its parent process?
Arguably, that channel is still open, and the data could flow through that channel as well. So now it depends on which pair of processes is currently active, which two processes have CPU cycles to spare, which process is currently writing, and which process is ready to read in terms of its code flow and control flow.
We were able to see exactly what we suspected: part of the data was going into one process, and the rest of the data, without any loss, was flowing into the second process. The solution was pretty straightforward: when the pipes are re-piped, we need to make sure that the existing pipe is deactivated.
The lesson was to build a theory that potentially explains the problem at hand, then write test cases and debug around that hypothesis to prove or disprove it, and then, if it turns out not to be the case, make a new theory, write a test case, and validate around that. Without that, if you just start by writing test cases and traveling in the reverse direction, I don't think it's going to be of much use to the problem determination exercise.
Okay: shutting down Node.js in flight. Well, this looks pretty straightforward; it doesn't even look like an issue. What is shutting down in flight? Terminating the JavaScript runtime and the platform that was started by the Node.js process. Isn't it as simple as calling process.exit, or killing the process with a signal such as Ctrl+C?
So this means an embedder has a custom launcher, or an entry point function, and the entry point is responsible for initializing a number of components of the embedding application, one of which happens to be Node.js. Now, in this setup, the lifecycle of the embedding process could be much larger, much wider, than the lifecycle of Node.js. To give a simple example: in response to certain external commands, the embedder might want to shut down the running instance of Node while continuing to work with the rest of the components of the embedder.
So the problem is that the Node.js event loop is designed in such a manner that, as long as there are active handles in the event loop, the loop is not going to exit. The only way to get out of the event loop is to send an exit signal, but that's going to terminate not only the event loop but also the wider embedding process.
That's where we need a mechanism to shut down the Node.js instance specifically, in a custom manner, where the event loop is quiesced and silently and gracefully comes out of all its processing so that the embedder can continue.
The way the proposed stop API works is like this: it inserts an async handle into the event loop at the beginning, and when the embedder wants to stop the Node.js instance, it fires an async event, with the handle that we talked about earlier as the async handle. This triggers an async event in the event loop, and subsequently the handler gets called, invoked in the main thread, of course, stopping all other processing.
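
There is no userland equivalent of that embedder-facing API, but the underlying principle, that active handles keep the loop alive and releasing them lets the loop drain and exit gracefully, can be sketched in plain JavaScript:

    // Userland sketch of the principle: an active handle keeps the event
    // loop alive; "stopping" means releasing the handles so the loop
    // runs to completion on its own, instead of being killed abruptly
    // with process.exit().
    const keepAlive = setInterval(() => { /* stand-in for real work */ }, 500);

    function stopInstance() {
      clearInterval(keepAlive); // release the last active handle
      // ...a real shutdown would also close servers, sockets, workers...
      // with no handles left, the event loop simply exits gracefully
    }

    // Simulate the embedder's external "stop" command arriving later.
    setTimeout(stopInstance, 3000);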
That handler eventually shuts down the Node.js instance and returns to the caller, which in this case is the embedder. My key takeaway from this particular activity is that when new architectural possibilities are attempted, we learn a lot; in this case, specifically around how the event loop works, how the overall architecture of the event loop is designed, and how the threads coordinate around the event loop, etcetera.
And the last one in this series is truncated stdio data on process exit. This was reported by many users with varying problem descriptions. Here the process is again a child process, and the stdout data is typically large, in kilobytes. That means if you have a child process that writes kilobytes of data and exits immediately after the write, the parent does not receive the data in its entirety, so some amount of data gets truncated.
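
A minimal reproduction sketch of the symptom, of my own construction, looks roughly like this; whether bytes actually go missing depends on the platform and the Node.js version, which matches the inconsistency described next:

    // Child writes a large payload via console.log and exits right
    // away; the parent counts what actually arrived. On affected
    // versions and platforms, the received count comes up short.
    const { spawn } = require('node:child_process');

    const child = spawn('node', ['-e', `
      console.log('x'.repeat(1024 * 1024)); // ~1 MB through the pipe
      process.exit(0);                      // exit right after the write
    `]);

    let received = 0;
    child.stdout.on('data', (d) => { received += d.length; });
    child.stdout.on('end', () => console.log('bytes received:', received));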
That's the issue. Now, as we know from one of the previous topics, the parent and the child communicate through a pipe, the pipes are subject to buffering and chunking, and there is a maxBuffer parameter that defines how much data is allowed through the channel in one shot. A trivial suspect is the value of this maxBuffer. We increased it but did not see any result, any positive result, coming out of that.
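
For context, maxBuffer is the option on the exec-style child_process APIs that caps how much stdout and stderr the parent will accumulate; raising it looks like this (the command being run is a hypothetical stand-in):

    // Raising maxBuffer: it caps how much stdout/stderr
    // child_process.exec will buffer before erroring out, so it was a
    // natural first suspect.
    const { exec } = require('node:child_process');

    exec('node produce-lots-of-output.js', { maxBuffer: 10 * 1024 * 1024 },
      (err, stdout, stderr) => {
        if (err) return console.error('exec failed:', err.message);
        console.log('captured', stdout.length, 'bytes');
      });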
We tried testing on different platforms, which exhibit different pipe behavior, but that also did not give any clue as such.
Now, changing the data volume had a visible effect and, more importantly, each time we were getting different amounts of data through. So clearly there was some race condition, but between which entities? The sequence that writes the data to the parent is one sequence, and the sequence that wants to exit the process is the second sequence.
The data truncation was varying in a variety of ways, but why wouldn't the process exit wait for the data write to complete? That's where the crux of the problem is. The console.log API, which was used to write the huge chunk of data, is asynchronous by nature in Node.js. This means that if you present a large amount of data to it, only a chunk, that bare minimum volume which the pipe can hold, is written in the first place, and the rest of the data is scheduled for further writing once the pipe has drained completely and is ready to be written to again.

What if an exit is pending on the process: is this remaining data written? No, the data write activity is completely abandoned, and the pending exit is processed first. And this is actually the issue.
So what is the solution? The solution would be not to use the console.log API, which is asynchronous in nature; the known workaround is to use the process.stdout.write API as opposed to the console.log API.
The process.stdout.write API is a blocking API, so it holds the process in a blocking state until the whole data is written; that is essentially the known workaround. But fundamentally, console.log being asynchronous, and data that is still pending at the exit point getting truncated, does not have a comprehensive solution as of now, and we do have a number of issues opened on that. I don't think we will have a proper design-level fix for that in the near future, but we do have a stable workaround.
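
As a sketch of that workaround, one way to make the "write, then exit" ordering explicit is to defer the exit into the write callback; a fully synchronous write to the stdout file descriptor (a common fallback, not something the talk mentions) achieves the same end:

    // Workaround sketch: don't let an exit race a pending stdout write.
    const fs = require('node:fs');
    const payload = 'x'.repeat(1024 * 1024);

    // Option 1 (the talk's suggestion): process.stdout.write, exiting
    // only once the callback confirms the chunk has been handed off.
    process.stdout.write(payload, () => process.exit(0));

    // Option 2 (a common fallback, not from the talk): a synchronous
    // write to file descriptor 1, so nothing can be pending at exit.
    // fs.writeSync(1, payload);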
So that's pretty much it. I hope you enjoyed this talk. I want to conclude by saying that these are some examples of issues where we spent a considerable amount of time debugging the Node.js internals and gained vital insights into how things work, plus made use of a variety of tools, techniques, and methodologies.
I would like to strongly encourage folks who are interested in getting further details on this, or in how to use some of these tools, or even in getting started with contributing to Node.js and debugging it: you are welcome to write to me or to any of my fellow project members. We are more than happy to assist you. There are enough issues in the backlog at the moment waiting to be hand-picked, so you wouldn't be disappointed at all, I'm sure of that. Once again, thank you very much for listening to this talk.