From YouTube: GMT 2018-07-26 Containerization WG
C
Yeah, thank you for allowing me to talk about this on such short notice. I pretty much decided to talk about this yesterday, mostly because I needed to talk with the people working on the containerization side, so this seemed like a pretty good time slot. I didn't really get too much time to prepare this.
C
So this time we were a little bit lucky in the log analysis, and I found a very good correlation with the problem. Let me explain the problem first. The issue we observed here is that it seems like there is some logical possibility that the Nvidia GPU isolator kind of leaked a GPU card that was previously allocated.
C
That is why we see log lines like "allocate failed: requested 1, but only 0 available". The "1" there is the number of GPU devices the agent thinks the isolator has, so the agent asks it to allocate, but the isolator itself cannot allocate anymore because there are only zero available. And this prevents any container with a GPU card from being created and properly provisioned. We have even seen cases where it requested 2, but only 1 was available.
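To make that mismatch concrete, here is a minimal sketch in plain Python with invented names (this is not the actual Mesos Nvidia GPU isolator code) of the bookkeeping being described: if a cleanup never runs, the agent's view of free GPUs and the isolator's own free set diverge, which is when a "requested 1 but only 0 available" style error shows up.

```python
# Hypothetical sketch of the bookkeeping described above; names are invented
# for illustration and do not come from the Mesos source tree.

class GpuAllocator:
    def __init__(self, gpus):
        self.available = set(gpus)   # devices the isolator can still hand out
        self.allocated = {}          # container_id -> set of devices

    def allocate(self, container_id, count):
        if count > len(self.available):
            # The agent still believes a device is free, but the isolator
            # never got it back, so the allocation fails.
            raise RuntimeError(
                f"Requested {count} but only {len(self.available)} available")
        devices = {self.available.pop() for _ in range(count)}
        self.allocated[container_id] = devices
        return devices

    def deallocate(self, container_id):
        # If destroy short-circuits before cleanup, this is never called and
        # the devices held by the dead container are leaked.
        self.available |= self.allocated.pop(container_id, set())
```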
C
Even that kind of thing has happened. Anyway, in the last episode, which was roughly Wednesday, I think it was Tuesday, we analyzed the logs in our cluster and we found a very good correlation. Only, this is not my log; this is the log attached to the original bug report, and if you scroll down to the bottom of the comments, the bottom of that ticket, I saw this line.
C
"Termination of executor ... of framework ... failed: Failed to kill all processes in the container: timed out after 1 minute." And if you look at the next comment, the person who replied there is the previous author, I believe, and they actually acknowledged that this condition might be problematic, because we are short-circuiting all the other isolator cleanups in this case. So it seems like there is a pretty strong logical explanation here, in that the containerizer, the launcher, was not able to terminate the other processes.
B
Makes sense, because if you fail on launcher destroy... So the way we destroy a container: we call launcher destroy, and then we call isolator cleanup, and then we call provisioner destroy. So if launcher destroy returns a failure, that short-circuits, and then we will return a failure for the destroy, and you can no longer clean up the resources in the isolators. So I think right now we need to revisit the way we resolve this issue.
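A minimal sketch of the destroy sequence just described, with invented names and plain synchronous Python rather than the real future-based code: a launcher failure returns early, so the isolator cleanups (including the GPU one) and the provisioner destroy are never reached.

```python
# Simplified, hypothetical control flow for container destroy; it mirrors the
# sequence described in the discussion, not the actual Mesos implementation.

def destroy(container_id, launcher, isolators, provisioner):
    # Step 1: ask the launcher to kill every process in the container.
    if not launcher.destroy(container_id):
        # A failure here looks like "Failed to kill all processes in the
        # container: timed out after 1 minute", and returning at this point
        # short-circuits everything below.
        raise RuntimeError("launcher destroy failed")

    # Step 2: isolator cleanup, only reached when the launcher succeeded.
    for isolator in reversed(isolators):   # cleanup runs in reverse order
        isolator.cleanup(container_id)     # GPU devices are released here

    # Step 3: tear down the provisioned root filesystem.
    provisioner.destroy(container_id)
```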
B
So if I remember correctly, we have an order of isolators to do the prepare, and then we use the reverse order for the isolators to do the cleanup. Which means, I forget whether or not we put the host volume support in as a separate isolator and moved it out from the Linux filesystem isolator. So we need to identify whether the host volume is cleaned up first, or whether we do the GPU isolator cleanup first. So it might be possible that a failure in another isolator blocks the GPU isolator from cleaning up.
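To illustrate the ordering question, here is a small hypothetical sketch: isolators prepare in a fixed order and clean up in the reverse order, so which of the host-volume and GPU isolators cleans up first depends entirely on that list. The list below is an assumed placeholder, not the verified order in any Mesos release.

```python
# Illustrative only: this order is an assumption for the sketch, not taken
# from the Mesos source; checking the real order is exactly the open question.

ISOLATORS = [
    "filesystem/linux",    # built-in isolators come first...
    "volume/host_path",
    "gpu/nvidia",
    "custom/module",       # ...custom modules are appended after the built-ins
]

def prepare(container_id):
    for name in ISOLATORS:
        print(f"prepare {name} for {container_id}")

def cleanup(container_id):
    # Reverse order: with this assumed list, gpu/nvidia cleans up before the
    # host-volume isolator, and a failure earlier in this loop could keep the
    # GPU isolator from ever running its cleanup.
    for name in reversed(ISOLATORS):
        print(f"cleanup {name} for {container_id}")
```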
C
Those mounts are actually NFS mounts, prepared and created out of band, but occasionally the code, the Puppet code which sets up these NFS mounts, fails out of band, without the Mesos agent being aware of it. Then these NFS mounts are not visible, not present on the agent host, but our users, who create these containers using those mounts, didn't know that they were broken.
A
Gilbert, do you know whether those mounts, like all of the host mount isolators and such, are ordered after the GPU isolator? I forgot the exact order, but I think so.
B
In 1.6, I think in 1.6 we have a fixed order for the isolators, and all the custom modules, they go after all those built-in isolators. So if you go to the containerizer's create method, you can find the right order for those isolators. There was some refactoring in 1.6, so we need to go back to the right version of Mesos, but they are in some specific order, and the GPU isolator is among them.
B
We need to make that mount, as I said, a shared mount, and we also need to create a separate mount namespace. So basically the GPU isolator depends on those two things from the Linux filesystem isolator, but I forget which position it is located in. Since there is some dependency, we need to double check on the particular version, okay.
B
So I think for this issue, I don't know whether you guys see the error messages from the other isolators' cleanup, but the thing I would like to make sure of is: if it is stuck at launcher destroy, then you guys should not see those messages, and no isolator cleanup will be called at all. So we need to figure out why we fail to kill the process.
B
So if it is stuck at any host volume mount, then we will still see those messages, and then we need to understand whether it is related to the NFS mount, and we can figure that out. So basically, for this issue, we need to spend more time to do the triaging and then collect more agent logs. Do you guys have any consistent way to reproduce it?
No, it is not really consistent.
C
The way it happens, the spread of how it happens: it is roughly every month. I don't think it is evenly distributed; it seems like occasionally this happens on one of the machines, and then quickly, within a couple of hours, we'll see maybe half of the cluster. I'd say within a couple of hours we see about 50 percent, half of the cluster's machines, run into a similar problem.
C
If those containers are not killable, even by a SIGKILL, it is possible that the destroy cannot finish within a minute. But then we still consider the task is, I think, in a terminal state, and we tell that. Pretty much the issue here is that the containerizer tells the agent a failure response.
B
I think, yeah, I think initially I expected we could find the root cause; even if it is from another component, we should find the root cause, or maybe a bug on our side. And instead of, like, going ahead and adding some retry logic, I think we could consider that for sure, but we should first understand what the problem is, alright.
C
Yeah, I don't disagree with that, but what I want to discuss here is more general, just to confirm some design principles here. Say, do we kind of expect that we will be able to reliably terminate any process in a container, so that we know cleanup should always be called? That seems like a little bit of a dangerous design principle, because for a lot of things it is difficult to assume they can always be reliable. I mean, the timeout, the one-minute timeout thing: it feels like there is no room to configure it, that is point one. And also, if you cannot do so for any particular reason, when serving all of these heterogeneous workloads, I think there should be some sort of error being emitted that is carefully charted, so people can act on those.
C
The error, the log line we were reading above, you mean we try to monitor that? Yeah, we can monitor that log line for sure. We are probably going to do some temporary remediation: if we see that log line, we just do a hard restart, just restart the Mesos agent from systemd. We are considering that approach in house, but I don't think it is a scalable solution, yeah.
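For reference, the temporary remediation being described could look roughly like the following watcher; the log path, systemd unit name, and message text are assumptions for illustration, not values taken from the deployment discussed here.

```python
# Rough sketch of the in-house remediation idea: tail the agent log and restart
# the systemd-managed agent when the kill-timeout message shows up. The path,
# pattern, and unit name are assumed; adjust for the actual installation.

import subprocess

PATTERN = "Failed to kill all processes in the container"
LOG_PATH = "/var/log/mesos/mesos-agent.log"     # assumed log location
AGENT_UNIT = "mesos-agent.service"              # assumed systemd unit name

def watch():
    with subprocess.Popen(["tail", "-F", LOG_PATH],
                          stdout=subprocess.PIPE, text=True) as tail:
        for line in tail.stdout:
            if PATTERN in line:
                subprocess.run(["systemctl", "restart", AGENT_UNIT], check=False)

if __name__ == "__main__":
    watch()
```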
B
I think, I think that is totally up to the operator; that is an option for the operator. Basically, I have never seen this error message before, and I suspect it might be related to some internal kernel problem. And basically, I do not see where the timeout is configured; I think this might be a default timeout somewhere in the system, yeah.
C
So actually, if you ask me, the fact that the agent cannot kill this process within a minute is not that big a deal to us, if the agent kind of retries with some interval and eventually manages to kill it. At least that is okay in our stack. What is really bothering us right now is that we give up trying after the first failed attempt, and things get leaked.
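As a rough illustration of the behaviour being asked for, a hedged sketch with a hypothetical kill_all_processes helper (this is not a proposed patch): keep retrying the kill on an interval instead of giving up, and skipping cleanup, after a single one-minute timeout.

```python
# Hypothetical retry wrapper; kill_all_processes() is an assumed callback that
# returns True once every process in the container is gone.

import time

def destroy_with_retries(container_id, kill_all_processes,
                         attempts=5, interval_secs=60):
    for attempt in range(1, attempts + 1):
        if kill_all_processes(container_id, timeout_secs=interval_secs):
            return True                      # safe to run isolator cleanups now
        print(f"attempt {attempt}: kill timed out, retrying in {interval_secs}s")
        time.sleep(interval_secs)
    return False                             # still surface the failure eventually
```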
B
I understand your motivation, like, we want to resolve this issue, but it feels like adding a retry is just a workaround; we still don't understand why. I think we could consider that option, but we need to figure out what the root cause is, okay?
Well, yeah, I think right now we have been lucky.
C
In the sense that we can access people's code. There will be operators who would not even have access to the code of the people running things in the container; it is going to be even harder for them to figure out what was wrong. Unless we have high confidence that there is only one way of leading to such issues, which I kind of doubt, with the different combinations of Linux kernel versions and systems.
C
Actually, this is the first time I started to focus on the incorrect termination of the previous container; we were not looking at this previously. Our hypothesis was about an internal bug of the isolator, the GPU isolator itself, so we were focusing our limited attention there, and we could not really find any log that backed those assumptions. So that is why. And we had to recover the cluster quickly, because our customers want to use the GPU cards. This is the first time we are starting to focus on this.
B
Yeah, because personally, personally I am hesitant to go ahead without knowing the root cause and then go ahead and add the retry logic; I think it is like some random choice, and it might happen that the retry fails in the same way if we don't understand what the root cause is. So would you guys maybe keep watching, and once you get an alert, try to collect the running process IDs and see what kind of thing it is?
C
Sorry, let me process that. We were never focusing on the failed destroys, and our users specifically rarely care about that; nobody was monitoring that. So if we do that, we would probably need to deploy some infrastructure, say run some utility scripts, basically one that watches for the container destroys. Alright, I see.
C
The mounts are mostly, so previously we were loading both the executables and the data, for, say, a lot of workloads similar to machine learning training. It may not be deep learning training exactly, but kind of pretty similar code, like offline data processing pipelines; the code actually loads both the executable as well as the data from the NFS mount and then runs, executes them directly.
C
This past week, at least, we can say it is likely this thing is related to either the containerizer or the internals of other isolators, which is kind of internal to the containerizer either way, yeah. Previously I was working with a colleague on this, but he is not that familiar with the internals of the containerizer. Eh, okay.