From YouTube: Node.js Deep Debugging - Gireesh Punathil, IBM India
Description
While there are established debugging techniques for well-known problem types, we have come across rare, complex, yet interesting production issues stemming from the pervasive build and deployment configurations and the heterogeneous workload types that node.js is subjected to. In this session, we will illustrate case studies of a few unique issues that we debugged, the custom diagnostic tools that were used, and the lessons learned. Attendees will learn to address deeper-level production problems and to self-diagnose anomalies.
Hi, welcome, and thank you for joining this talk on Node.js deep debugging. Before we talk about what this session is all about, its stated purpose and expected outcome, let me provide a brief intro of myself. I'm Gireesh. I work for IBM; I'm an architect with a team called IBM Runtimes that is responsible for developing and supporting language runtimes such as Java and Node.js. I'm also a member of the Node.js Technical Steering Committee and the Diagnostics Working Group.
That essentially means I have spent a good amount of time debugging Node.js internals, and I felt it prudent to share some of these experiences, some of the paths I traveled, with people who are using Node.js and might come across the same or similar situations as the issues that I debugged. Instead of reinventing the wheel, it would be pretty easy for them to reuse some of these experiences and lead the issues at hand towards quicker resolutions.
This is how I want to explain it: I'll start with the execution environment where the issue was reported and how the issue manifested externally, then the whole path of the proceedings towards problem determination, what the actual issue was and how it got resolved, and then the tools, the methodologies, and the best practices that I followed, as well as the learnings and the insights that are reusable. So that's the whole idea. All right, so here is our first topic.
I stress the words "supposed to be" because that's the whole essence of the issue at hand, and we will see why. The issue manifested in the form of a high and steady RSS size in the Node.js process, and it was reported by many users with a similar symptom: they were all seeing very low heap usage but very high RSS.
A
This
is
not
a
surprise,
because
we
know
that
node.js
is
not
just
using
memory
from
the
javascript
heap.
It
has
also
bearings
on
the
native
backend.
Many
objects
have
memory
that
is
used
in
the
native
heap
as
well
so
low
javascript
heap,
but
high
native
memory
and
high
rss
on
the
face
of
it
is
not
a
problem,
but
because
many
users
reported
the
same
issue
with
a
similar
symptom.
We
thought
it's
problem
to
investigate
and
started
looking
at
it.
One of the common symptoms was the fact that they all had a history of huge memory allocation, in the form of various objects getting created back to back, and then later all the objects getting garbage collected, but the RSS not coming down proportionately. That was a common symptom reported by most of these customers or end users.
Now, coming to the problem determination: as we know, the JavaScript heap is not a true reflection, so we make use of operating system tools to see what the actual memory consumption of the process is. The Linux tools such as top, pmap, etc. help here; we look at the consumption, and it's correct.
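
The same comparison can also be made from inside the process; here is a minimal sketch of mine using process.memoryUsage(), which reports the JavaScript heap next to the RSS and is how the "low heap, very high RSS" symptom typically shows up:

    // Print JavaScript heap usage alongside the RSS, both in MB.
    // process.memoryUsage() is the in-process counterpart of what
    // top and pmap report from the outside.
    const mb = (n) => (n / 1024 / 1024).toFixed(1) + ' MB';

    setInterval(() => {
      const { rss, heapUsed, heapTotal, external } = process.memoryUsage();
      console.log(`heapUsed=${mb(heapUsed)} heapTotal=${mb(heapTotal)} ` +
                  `external=${mb(external)} rss=${mb(rss)}`);
    }, 1000);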
The behavior was not very consistent, even with the same program run on different systems, say, systems differing in specification or differing in terms of their load conditions. On the face of it that was a challenge, but it gave us a lead in terms of how to proceed. So one of the first steps was to disambiguate, to get an answer to why the same program behaves differently on systems with differing specs, differing amounts of physical memory, or different load characteristics.
Then we saw that the RSS, once it hiked up, never came down; whereas if the system was tight in terms of physical memory with respect to the running process, and there was a lot of load on the system, then we saw the RSS immediately coming down, which is in accordance with the expectation for the Node process. Reading through various documentation, manuals, and specifications around virtual memory and resident set memory, and with the help of a small C program that truly simulated the Node.js scenario, we could actually understand the root cause of the issue.
In my small C test case, what I was doing was allocate small chunks of memory in rapid succession but in a large amount, and then immediately free all that memory: say, for example, 2 KB chunks, about 1000 such chunks, allocated from the free pool and then freed immediately, then sleep for some time to see whether the RSS that shot up would come down or not.
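
The actual test was a C program, but a rough Node.js analogue of the same experiment (my construction, not the speaker's code, and it assumes the process is started with --expose-gc) looks like this:

    // Rough Node.js analogue of the C experiment: allocate many small
    // chunks back to back, drop every reference, force a GC, then sleep
    // and watch whether the RSS that shot up ever comes down.
    // Run with: node --expose-gc <this file>
    const mb = (n) => (n / 1024 / 1024).toFixed(1) + ' MB';

    let chunks = [];
    for (let i = 0; i < 100000; i++) {
      chunks.push(Buffer.alloc(2 * 1024)); // 2 KB chunks, in rapid succession
    }
    console.log('after alloc, rss =', mb(process.memoryUsage().rss));

    chunks = null;  // release every reference
    global.gc();    // collect them (requires --expose-gc)

    setInterval(() => {
      console.log('after free, rss =', mb(process.memoryUsage().rss));
    }, 5000);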
The result was really a surprise, at least to myself and to my fellow customers or end users. I don't know if it's a known art, but it was really surprising: the resident set size does not truly reflect the resident memory of the process. The process had long since freed up the memory, but still, unfortunately, it was all accounted against the same process.
Unfortunately, the challenge with respect to an implementation or a solution for this problem is the fact that we don't have a consistent, well-defined API that would capture the actual accounted memory of a process. The actual accounted memory of the process is something called the working set size in Linux.
Unfortunately, we don't have an API to get that value out, and secondly, this is not something we can implement cross-platform: Node.js is supported on a number of other platforms where we may not be able to consistently get this value. For these reasons, the RFE is still not completed.
The next one is the exit race. This is an issue that was reported in Node.js version 11. So what is an exit race? Of course, the name was given at a later point in time, once we had completely diagnosed the issue and identified the resolution. The issue manifested as random crashes by all sorts of programs, with no specific pattern whatsoever that could be identified.
The PD approach was to run the program many times under the problematic version, collect as many dumps as possible, associate the failing contexts from the dumps, and see if we could classify the failing contexts into a discrete set of patterns, rather than an indefinite set of random crashes.
This approach helped in two aspects. Number one, it helped us understand that there were two or three or four discrete patterns, not too many. Number two, it revealed that in all the cases the main thread was in its exit path. That essentially means that in all the programs, the process was about to exit, or the main thread had initiated the cleanup activities. So this was really a strong lead.
The other challenge that we faced was that when we did a git bisect, it pointed to a commit, but by looking at the changes belonging to that commit, it was evident that the commit had no relation to the issue; it was in absolutely no way connected.
So the order is absolutely important, because there are interdependencies that can be defined between the resources and the code paths that access these resources. The issue pops up when there are multiple threads; it could be worker threads, or it could be the internal helper threads that Node uses to perform asynchronous operations.
We were seeing random crashes, so the resolution was to follow, at teardown time, the order in which the resources were created, as well as to make sure that the helper threads which can potentially access these resources are kept in a very consistent state at the time we are doing the cleanup. That means if one or more threads can access one of the resources, we make sure each such thread is brought to a state where it is no longer possible for the thread to access that resource.
That was the resolution. One of the problem determination techniques that we learned and applied in this scenario is called timing injection. Timing injection essentially involves adding sleep delays between activities where we suspect the presence of multiple threads that can cause issues: basically, widening the timing window in an artificial manner and thereby reliably simulating the race.
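
The races in this case were in Node's native teardown code, but the technique itself can be shown with a contrived JavaScript sketch of my own; the injected sleep widens the read-modify-write window so the lost update happens on every run instead of once in a blue moon:

    // Contrived illustration of timing injection: the injected sleep
    // widens the window between a read and its dependent write, so two
    // concurrent tasks reliably interleave and the lost update becomes
    // deterministic instead of intermittent.
    let balance = 100;
    const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

    async function withdraw(amount) {
      const seen = balance;    // read
      await sleep(10);         // injected delay: artificially widen the window
      balance = seen - amount; // write based on a now-stale read
    }

    (async () => {
      await Promise.all([withdraw(30), withdraw(30)]);
      console.log(balance);    // prints 70, not 40: one update was lost
    })();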
Long story short, to clear up our confusion around which process owns the console from which the processes are spawned: between the parent and the child, all the native stream handles of the child process are closed by default, and then three pipes are created by the parent. One end of each pipe is held by the parent, while the other end is given to the child, and the child uses it as if it were one of the standard streams.
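
In userland terms, that wiring is what the default stdio: 'pipe' configuration of child_process gives you; a minimal sketch (assuming a POSIX system for the cat command):

    // The default wiring described above: the parent holds one end of
    // each of the three pipes, and the child sees the other ends as its
    // stdin, stdout, and stderr.
    const { spawn } = require('node:child_process');

    const child = spawn('cat', [], { stdio: ['pipe', 'pipe', 'pipe'] });

    child.stdout.on('data', (d) => process.stdout.write('from child: ' + d));
    child.stdin.write('hello over the pipe\n');
    child.stdin.end(); // cat exits when its stdin closes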
So whenever the child writes into the stream, the data goes to the parent, and whenever the parent writes into the input stream of the child, the data flows into the child process. That's why the child is able to use the standard streams effectively, as if they were the standard input, output, and error streams, while the parent is able to fully control those streams. That mechanism works pretty well.
Now, coming back to the problem context: what was happening was that two child processes were engaged in communication by re-piping the standard streams. For example, one child process's output stream was piped into the other child process's input stream, so that whenever the first child process wrote, the data should flow into the second child process. That was the understanding; that was the design. But unfortunately, the data was getting lost in an arbitrary manner.
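
The failing pattern looked roughly like this, where producer.js and consumer.js are hypothetical stand-ins for the user's programs:

    // Two children with piped stdio, re-piped to each other: the intent
    // is that everything the first child writes flows into the second.
    const { spawn } = require('node:child_process');

    const producer = spawn('node', ['producer.js']); // writes to its stdout
    const consumer = spawn('node', ['consumer.js']); // reads from its stdin

    producer.stdout.pipe(consumer.stdin); // the re-piping described above

    consumer.stdout.on('data', (d) => process.stdout.write(d));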
One of the hypotheses that we made around this was: the child endpoint is already a pipe, we know that, and that essentially means there is already a channel open between one of the child processes and the parent process. Now, when the child end of the pipe is re-piped, that essentially means the file descriptor, the underlying handle of the pipe, is copied and redirected to the endpoint of the second child. What happens to the channel that is still open between the first child process and its parent process?
Arguably, that channel is still open, and the data could flow through that channel as well. So now it depends on which pair of processes is currently active, which two processes have CPU cycles to spare, which process is currently writing, and which process is ready to read in terms of its code flow and control flow.
We were able to see exactly what we suspected: part of the data was going into one process, and the rest of the data, without any loss, was flowing into the second process. The solution was pretty straightforward: when the pipes are re-piped, we need to make sure that the existing pipe is deactivated.
The lesson was to build a theory that potentially explains the problem at hand, then write test cases and debug around that hypothesis to prove or disprove it, and then, if it turns out not to be the case, make a new theory, write a test case, and validate around that. Without that, if you just start by writing test cases and traveling in the reverse direction, I don't think it's going to be of much use to the problem determination exercise.
Okay: shutting down Node.js in flight. Well, this looks pretty straightforward; it doesn't even look like an issue. What is shutting down in flight? Terminating the JavaScript runtime and the platform that was started by the Node.js process. Isn't it as simple as calling process.exit, or killing the process with a signal such as Ctrl+C?
So this means an embedder has a custom launcher, or an entry point function, and the entry point is responsible for initializing a number of components of the embedding application, one of which happens to be Node.js. Now, in this setup, the lifecycle of the embedding process could be much larger, much wider, than the lifecycle of Node.js. To give a simple example: in response to certain external commands, the embedder might want to shut down the running instance of Node while continuing to work with the rest of the components of the embedder.
So the problem is that the Node.js event loop is designed in such a manner that, as long as there are active handles in the event loop, the loop is not going to exit. The only way to get out of the event loop is to send an exit signal, but that's going to terminate not only the event loop but also the wider embedding process.
That's where we need a mechanism to shut down the Node.js instance specifically, in a custom manner, where the event loop is quiesced and silently and gracefully comes out of all its processing so that the embedder can continue.
The way the proposed stop API works is like this: it inserts an async handle into the event loop at the beginning, and when the embedder wants to stop the Node.js instance, it fires an async event, with the handle that we talked about earlier as the async handle. This triggers an async event in the event loop, and subsequently the handler gets called, invoked in the main thread, of course, stopping all other processing.
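
There is no userland equivalent of that embedder-facing API, but the underlying principle, that active handles keep the loop alive and releasing them lets the loop drain and exit gracefully, can be sketched in plain JavaScript:

    // Userland sketch of the principle: an active handle keeps the event
    // loop alive; "stopping" means releasing the handles so the loop
    // runs to completion on its own, instead of being killed abruptly
    // with process.exit().
    const keepAlive = setInterval(() => { /* stand-in for real work */ }, 500);

    function stopInstance() {
      clearInterval(keepAlive); // release the last active handle
      // ...a real shutdown would also close servers, sockets, workers...
      // with no handles left, the event loop simply exits gracefully
    }

    // Simulate the embedder's external "stop" command arriving later.
    setTimeout(stopInstance, 3000);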
That handler eventually shuts down the Node.js instance and returns to the caller, which in this case is the embedder. My key takeaway from this particular activity is that when new architectural possibilities are attempted, we learn a lot; in this case, specifically around how the event loop works, how the overall architecture of the event loop is designed, and how the threads coordinate around the event loop, etcetera.
And the last one in this series is truncated stdio data on process exit. This was reported by many users with varying problem descriptions. Here the process is again a child process, and the stdout data is typically large, in kilobytes. That means if you have a child process that writes kilobytes of data and exits immediately after the write, the parent does not receive the data in its entirety, so some amount of data gets truncated.
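
A minimal reproduction sketch of the symptom, of my own construction, looks roughly like this; whether bytes actually go missing depends on the platform and the Node.js version, which matches the inconsistency described next:

    // Child writes a large payload via console.log and exits right
    // away; the parent counts what actually arrived. On affected
    // versions and platforms, the received count comes up short.
    const { spawn } = require('node:child_process');

    const child = spawn('node', ['-e', `
      console.log('x'.repeat(1024 * 1024)); // ~1 MB through the pipe
      process.exit(0);                      // exit right after the write
    `]);

    let received = 0;
    child.stdout.on('data', (d) => { received += d.length; });
    child.stdout.on('end', () => console.log('bytes received:', received));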
That's the issue. Now, as we know from one of the previous topics, the parent and the child communicate through a pipe, the pipes are subject to buffering and chunking, and there is a maxBuffer parameter that defines how much data is allowed through the channel in one shot. A trivial suspect is the value of this maxBuffer. We increased it but did not see any result, any positive result, coming out of that.
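
For context, maxBuffer is the option on the exec-style child_process APIs that caps how much stdout and stderr the parent will accumulate; raising it looks like this (the command being run is a hypothetical stand-in):

    // Raising maxBuffer: it caps how much stdout/stderr
    // child_process.exec will buffer before erroring out, so it was a
    // natural first suspect.
    const { exec } = require('node:child_process');

    exec('node produce-lots-of-output.js', { maxBuffer: 10 * 1024 * 1024 },
      (err, stdout, stderr) => {
        if (err) return console.error('exec failed:', err.message);
        console.log('captured', stdout.length, 'bytes');
      });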
We tried testing on different platforms, which exhibit different pipe behavior, but that also did not give any clue as such.
Now, changing the data volume had a visible effect and, more importantly, each time we were getting different amounts of data through. So clearly there was some race condition, but between which entities? The sequence that writes the data to the parent is one sequence, and the sequence that wants to exit the process is the second sequence.
The data truncation was varying in a variety of ways, but why wouldn't the process exit wait for the data write to complete? That's where the crux of the problem is. The console.log API, which was used to write the huge chunk of data, is asynchronous by nature in Node.js. This means that if you present a large amount of data to it, only a chunk, that bare minimum volume which the pipe can hold, is written in the first place, and the rest of the data is scheduled for further writing once the pipe has drained completely and is ready to be written to again.

What if an exit is pending on the process: is this remaining data written? No, the data write activity is completely abandoned, and the pending exit is processed first. And this is actually the issue.
So what is the solution? The solution would be not to use the console.log API, which is asynchronous in nature; the known workaround is to use the process.stdout.write API as opposed to the console.log API.
The process.stdout.write API is a blocking API, so it holds the process in a blocking state until the whole data is written; that is essentially the known workaround. But fundamentally, console.log being asynchronous, and data that is still pending at the exit point getting truncated, does not have a comprehensive solution as of now, and we do have a number of issues opened on that. I don't think we will have a proper design-level fix for that in the near future, but we do have a stable workaround.
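
As a sketch of that workaround, one way to make the "write, then exit" ordering explicit is to defer the exit into the write callback; a fully synchronous write to the stdout file descriptor (a common fallback, not something the talk mentions) achieves the same end:

    // Workaround sketch: don't let an exit race a pending stdout write.
    const fs = require('node:fs');
    const payload = 'x'.repeat(1024 * 1024);

    // Option 1 (the talk's suggestion): process.stdout.write, exiting
    // only once the callback confirms the chunk has been handed off.
    process.stdout.write(payload, () => process.exit(0));

    // Option 2 (a common fallback, not from the talk): a synchronous
    // write to file descriptor 1, so nothing can be pending at exit.
    // fs.writeSync(1, payload);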
So that's pretty much it. I hope you enjoyed this talk. I want to conclude by saying that these are some examples of issues where we spent a considerable amount of time debugging the Node.js internals and gained vital insights into how things work, plus made use of a variety of tools, techniques, and methodologies.
I would like to strongly encourage folks who are interested in getting further details on this, or in how to use some of these tools, or even in getting started with contributing to Node.js and debugging it: you are welcome to write to me or to any of my fellow project members. We are more than happy to assist you. There are enough issues in the backlog at the moment waiting to be hand-picked, so you wouldn't be disappointed at all, I'm sure of that. Once again, thank you very much for listening to this talk.