Description
Real-Life Node.js Troubleshooting - Damian Schenkelman, Auth0
When building a large enough set of services using Node.js, there will be a point when you find that your application is suffering from performance or memory issues. When this happens, you have to roll up your sleeves, get your tools, and start digging. This talk explains how you can use tools such as ab, flame graphs, heap snapshots, and Chrome's memory inspector to find the cause of these issues. We will go over two real-life issues, a CPU bottleneck and a memory leak, that we found while building our services at Auth0, and also explain how we fixed them.
And I'm here to talk about a couple of things that we found over these four years that are kind of the usual suspects: things that you find are making your application or your services crash, and how we find them and how we get them fixed. So this is kind of a repeatable process.
There are two important things that we want to talk about. One of them is memory leaks, and the other one is CPU bottlenecks, or performance-related issues. So let's start with the first one: memory leaks.
We first have to define what it is; this first part is going to be pretty fast. The main cause of a memory leak, as a colleague of mine likes to say, is unwanted references: we are keeping something alive that we aren't going to be using in the future. We don't need it.
So if we represent the memory model like this, we can see that the garbage collector has kind of pointers to what it calls roots, and those roots have arrows, or references, to our objects, and then we have kind of a dependency graph.
So when we do something like this and we say, OK, b.d = null, what we're saying is: get rid of that arrow. That's it. At this point, when D is no longer referenced, the garbage collector comes in and says: hey, let me take that out of there, you don't need it anymore, and you're good. So that's how your program keeps running over time.
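As a minimal sketch of that mental model (the names b and d mirror the slide; when V8 actually collects is up to the runtime):

```js
// Build a small reference graph: a root-reachable object b pointing at d.
const b = { d: { payload: Buffer.alloc(10 * 1024 * 1024) } }; // ~10 MB

// "Get rid of the arrow": drop the only reference to that inner object.
b.d = null;

// Nothing reachable from a GC root points at it anymore, so the garbage
// collector is free to reclaim it on a future pass.
```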
You end up with something like this, which is the sawtooth pattern: you start allocating memory, you allocate a bit more, and every once in a while you start having things that you don't need, so you get a large drop and then another one. And at this point, if you keep accumulating a lot of things you don't need, either your browser or, in this case, your Node application will crash.
So this is what we were saying. This is an actual chart from one of our monitoring services. Basically, what you can see is that memory went up and it kept going up; there were no garbage collections, and the process just died.
So this is a really bad situation to be in; if you think about it, this is not the best place to be. So you have to figure out what's going on, and you have to go: OK, find them and bust them, right? How do we find it? How do we fix it? One important thing that we learned is that, if you're ever in this situation, the first thing you need to do is take control.
You don't want to just start researching and leave everything as is. That's because every time the application crashes, responses to requests are not being generated, so some requests are failing. You don't want that. One trick is to increase the heap size of your process: in general it's like 1.2 to 1.4 gigabytes, if I remember correctly, and you can raise it with V8's --max-old-space-size flag. If you make that higher, your application crashes less often. The other thing that you want to do is drain connections: if your memory reaches like the 80% or 85% limit, you should probably stop accepting new requests and just process the current ones. What you're basically doing is garbage collecting manually by restarting the process. Not ideal, but you have to buy time until you can find the real reason.
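A minimal sketch of that drain idea, assuming an http server and a supervisor that restarts the process; the threshold and interval are illustrative:

```js
const http = require('http');

const server = http.createServer((req, res) => res.end('ok'));
server.listen(3000);

// Assume we started with --max-old-space-size=1536, i.e. a ~1.5 GB heap.
const HEAP_LIMIT = 1536 * 1024 * 1024;

setInterval(() => {
  if (process.memoryUsage().heapUsed > HEAP_LIMIT * 0.85) {
    // Stop accepting new connections, finish the in-flight requests,
    // then exit so the supervisor (pm2, systemd, ...) restarts us.
    server.close(() => process.exit(1));
  }
}, 5000).unref();
```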
Once you have done this, the first thing you should do is get a heap snapshot. What is a heap snapshot? It's kind of a picture of everything in your memory, in your heap. You can use a profiling module for this; there are other tools. It basically allows you to send a signal to the process, and it will take a heap snapshot, right?
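On recent Node versions you can get the same signal-driven behavior with the built-in v8 module instead of an external one (the choice of SIGUSR2 here is an assumption; pick any free signal):

```js
const v8 = require('v8');

// From a shell: kill -USR2 <pid>  ->  dumps a snapshot next to the process.
process.on('SIGUSR2', () => {
  const file = v8.writeHeapSnapshot(); // returns the generated filename
  console.log('heap snapshot written to ' + file);
});
```

You can then load that file in the Memory tab of Chrome's DevTools and inspect it.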
So let's dig a bit deeper into this: what does this mean, what can I do with the heap snapshot? So, let's see. Can you see that? Yeah? What we're seeing here are all the objects in our application, in our service: the type of them; how far they are from the root, so the distance; how many of them there are; the shallow size, which is basically how much they occupy in memory; and the retained size, which is the size of everything else that they are pointing to and keeping alive. So these are fairly different.
My first recommendation, if you're looking for a memory leak, is to check the strings, in case for any reason you are creating a lot of them. Strings are good because they are very contextual: based on the content of a string, you can figure out where in your program that string is being created. Many of these are, of course, strings from node_modules code; that's always kept alive in memory. And then we started seeing some of these.
So this is, again, an actual heap snapshot of a memory leak we found, and this is how we send logs to Kinesis, to our stream service. So the next thing you do is you come here, you pop this thing up, and you see the retainers: who is keeping that object in memory, right? So we have a body, and that's being pointed to by an HTTP request. This is an anonymous function, so name your functions: if you name it, you will have a better name here. And eventually we get here, right?
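A tiny sketch of the "name your functions" advice (request here is a stand-in for whatever HTTP client is in play):

```js
// Stand-in for the real HTTP client.
function request(options, cb) { cb(null, {}, '{}'); }
const options = { url: 'https://kinesis.example.com' };

// Hard to trace: shows up as "(anonymous)" in snapshots and stack traces.
request(options, (err, res, body) => { /* ... */ });

// Easier to trace: "onKinesisResponse" appears by name in the retainer tree.
request(options, function onKinesisResponse(err, res, body) { /* ... */ });
```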
So we see a forever agent has sockets keeping a key for Kinesis, and then there's a TLS socket. So we see that this is being kept alive, eventually, by this forever agent. So what does this mean? It means we have something like this as a mental picture: a sockets object that's pointing to another object that has a key, and that key has an array of TLS sockets. That's how the forever agent works.
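Roughly, that mental picture as data (a simplified sketch; the real agent keeps more bookkeeping than this):

```js
// Placeholders standing in for live TLS sockets.
const tlsSocket1 = {};
const tlsSocket2 = {};

// One entry per origin: the key encodes the host and port, and the value is
// the array of sockets kept alive for that origin. The leak showed up as
// this array growing on every request instead of sockets being reused.
const agent = {
  sockets: {
    'kinesis.us-east-1.amazonaws.com:443': [tlsSocket1, tlsSocket2],
  },
};
```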
It keeps a socket, or more than one, alive for each of your origins, right? But if you're doing keep-alive and you're sending logs, you shouldn't be creating a new socket on every connection; that's what keep-alive is for, to avoid that. So that was kind of the first sign that something was off. I'm explaining all of this in a very sequential manner, but it was a bit more chaotic than that. You can look at the PRs for this stuff and say, well...
We didn't really know exactly what was happening, right? But we had a couple of different approaches. One thing that we found is that the AWS SDK actually wasn't getting rid of a couple of event listeners, so they were always alive and kept holding references to the strings. And the other thing we found is that the forever agent was actually creating a new socket, on a new connection, every time we logged. So the more our application was used and the more logs we generated, the more memory we consumed. That was the "find it" part.
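The listener half of the problem looks roughly like this; a hedged sketch of the pattern, not the actual AWS SDK code:

```js
const EventEmitter = require('events');
const bus = new EventEmitter(); // some long-lived emitter

function handleLogBatch(batch) {
  // Leak: a new listener per call, never removed. Each closure, and every
  // string it captures, stays reachable from `bus` forever.
  bus.on('flushed', () => console.log('flushed', batch.length));
}

function handleLogBatchFixed(batch) {
  // Fix: remove the listener when done, or use once().
  bus.once('flushed', () => console.log('flushed', batch.length));
}
```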
It took a lot of time, but once we found it, we said: OK, let's go back, use the normal agent with keep-alive instead of the forever agent, and that's it.
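A minimal sketch of that fix using Node's standard agent (the host and socket limit are placeholders):

```js
const https = require('https');

// The built-in agent with keepAlive: true reuses one pool of sockets per
// origin instead of piling up new connections.
const agent = new https.Agent({ keepAlive: true, maxSockets: 10 });

https.get({ host: 'example.com', path: '/', agent }, (res) => {
  res.resume(); // drain the response so the socket goes back to the pool
});
```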
But this hopefully gives you an idea of how you can find a memory leak, figure out what it is, and fix it. Once you fix it, you'll go back to the sawtooth, right? So this is kind of the normal graph: these are not restarts that we forced; this is real memory being garbage collected. OK, so the other thing I want to talk about is CPU bottlenecks.
The thing is, you start saying: OK, we want to create performance tests for this, to avoid regressions and stuff like that. And if you are doing something like that, you would expect a chart like this one, which has a minimum; in this case it's 400, but that's because of latency: it's like 200 milliseconds to the server and 200 back. And then you start getting the errors, right? That's probably the small bar there: OK, we have some bad requests, so you don't have to do a lot of processing.
We found this tool called flame graphs, and what flame graphs do is give you a representation of how much time each function is taking in your program: the wider the bar of a function, the more time it's taking. So in this case, B is taking 80% of the program, C and D are taking the same, and what you see in the height is the call stack, so it's how you actually got there, right?
Let's do a demo. OK, so let's see. This is kind of a beautiful simplification of the problem that we had. So we are using this store, which has a user and the hash for that user, right? That's the password hash, and we know there's not a problem with the store, so we can just keep it in memory. And then we have an authorize endpoint, which basically fetches something from the store and does a comparison of the password and the hash, and that's it, right? So, where's the problem?
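A hedged reconstruction of that ~30-line file (the framework and names are assumptions; the important detail is the synchronous bcrypt call):

```js
const express = require('express');
const bcrypt = require('bcrypt');

// In-memory store: user -> password hash. The store is not the problem,
// so keeping it in memory is fine for the demo.
const store = {
  alice: bcrypt.hashSync('correct horse battery staple', 10),
};

const app = express();
app.use(express.json());

app.post('/authorize', (req, res) => {
  const hash = store[req.body.user];
  if (!hash) return res.sendStatus(401);

  // The suspect: compareSync is CPU-bound and runs on the event loop thread.
  const ok = bcrypt.compareSync(req.body.password, hash);
  res.sendStatus(ok ? 200 : 401);
});

app.listen(3000);
```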
A
This
is
actually
a
like
30
line
file.
We
have
like
tens
of
thousands
or
even
hundreds
of
thousands
of
lines
if
you
consider
dependencies,
so
it's
a
lot
simpler
than
it
looks,
but
let's,
let's
not
even
take
a
guess
right.
So
what
we
can
do
is
we
can?
Can
you
see
that
yeah?
So
we
can
come
here
and
say:
okay,
I'm,
going
to
run
this
program
as
like
in
benchmark
mode
using
a
tool
called
zero
X
and
what
0x
allows
you
to
do
is
to
get
flame
graphs
from
your
node
code.
This will actually ask for my password, because it requires permissions for some kernel-level stuff, so root. And then I can generate load. We can use ab or any other tool for this, and I'm doing a hundred requests in total with a concurrency of ten (ab -n 100 -c 10), so like 100 people logging in, right? And you really have to be patient, because, oh, we have a problem: this is a CPU issue. So it finishes, and the times that you can see here are kind of similar to what we saw before.
OK, so that part is done. OK, so let's see, what's this here? Although you probably can't read it, it's bcrypt compare; that's like 34% of the code. The same thing here, bcrypt compare; it's actually on different bars because the call stacks are different, depending on when the event finished, parsing the body and stuff like that, but again it's always going into the same function. And the same thing here: bcrypt compare. So evidently we have a problem, and it's related to bcrypt compare.
If you take a look at our implementation, we are calling the sync version, compareSync. So then we say: OK, async things are better; we'll just change that to run asynchronously, right? But what does asynchronous mean in this case? Well, this is a CPU-bound operation, and being a CPU-bound operation, it would block the event loop. So what we're doing (and this is related to a talk I saw earlier today) is queuing this on the libuv thread pool. So let's run this again and...
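Sketched against the handler above, the async version hands the work to bcrypt's native code, which runs it on libuv's thread pool and calls back on the event loop:

```js
app.post('/authorize', (req, res) => {
  const hash = store[req.body.user];
  if (!hash) return res.sendStatus(401);

  // Runs off the main thread; the event loop stays free to accept requests.
  bcrypt.compare(req.body.password, hash, (err, ok) => {
    if (err) return res.sendStatus(500);
    res.sendStatus(ok ? 200 : 401);
  });
});
```

Keep in mind the libuv thread pool defaults to four threads, so the CPU work doesn't disappear; it just stops blocking the loop. That's why the next step is scaling.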
So we need to scale, and scaling is not just about handling throughput; it's also about how much money you spend handling that throughput, because if you buy a Cray supercomputer, well, it's not that good. So you could think about using a faster hash function. The problem is that that's not safe: the reason we use a slow hash function is that if someone gets hold of the hashes, they have a hard time cracking them. So that's not an option for us; we're a security company, we don't want to go that way.
And with vertical scaling you don't have a lot of elasticity. So then you go kind of the horizontal scaling way, which can be combined with vertical scaling. But if you create multiple auth services, you run into another problem, which is that the service not only allows you to log in; it can also allow you to change your email, and changing your email is an I/O-bound operation, because you just go to the database, unlike login, which is again CPU-bound.
We fixed this by creating a service called baas; it's open source, you have the link there. The idea is that you have the same interface as you do with bcrypt, but it actually works like this: you set up the baas service behind a load balancer, it communicates with the client using protocol buffers or Avro, depending on the day and your configuration, and it does two things: it either compares a password to a hash, or it hashes a password. That's it.
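A hedged sketch of what "same interface as bcrypt" means; this client shape is illustrative, not the actual node-baas API:

```js
// Hypothetical client: bcrypt's signatures, but the CPU-heavy work happens
// on a remote baas cluster. The real thing would keep a socket open and
// speak protocol buffers or Avro; here the transport is stubbed out.
function createBaasClient({ host, port }) {
  return {
    hash(password, cb) { /* send { op: 'hash', password } to host:port */ cb(null, '<hash>'); },
    compare(password, hash, cb) { /* send { op: 'compare', ... } */ cb(null, true); },
  };
}

const baas = createBaasClient({ host: 'baas.internal', port: 9485 });
baas.compare('secret', '<hash>', (err, ok) => console.log(err, ok));
```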
The good thing is that it's very easy to figure out when this is actually going to be a bottleneck, and to autoscale, because you can measure the work very effectively: you know that it takes between 70 and 100 milliseconds to run a bcrypt comparison, so you can say: OK, I can handle like ten of these per second per core (1000 ms / 100 ms = 10). That's it; that's when you scale. What's important, regardless of the numbers, is to always do the cost comparison: did I achieve the desired throughput, and am I spending the lowest amount of money that I can, right? So those are the two key things. And always fail gracefully when you introduce a new dependency. So you see, she's keeping her hands up even though she didn't stick the landing; that's good. If you introduce a new dependency, you should be able to run things in a different way.
So if the baas cluster starts to fail, what we do is turn back to running the bcrypt comparison locally, as a fallback, and that gives our operations team time to figure out what's going on and get the cluster back up. It's not ideal in terms of cost, it's not ideal in terms of performance, but it is ideal in the sense that it's the best we can do for our customers right now.
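A minimal sketch of that fallback, assuming the hypothetical client above; the policy is the point, not the exact code:

```js
const bcrypt = require('bcrypt');

function compareWithFallback(password, hash, cb) {
  baas.compare(password, hash, (err, ok) => {
    if (!err) return cb(null, ok);
    // Cluster unreachable or unhealthy: run the comparison locally so logins
    // keep working while ops brings the cluster back. Slower and costlier,
    // but available.
    bcrypt.compare(password, hash, cb);
  });
}
```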
One last thing: cat picture. I was missing one of these, so I'm adding it.