Description
Discussion around https://gitlab.com/gitlab-org/gitlab-runner/issues/4119
B: Sure, so basically, the issue we had here in the agenda: when we execute commands inside pods, we wait for the logs of the commands and for the exit codes of the commands — including, say, a long-running command that takes a minute to run, even a simple sleep. If in those sixty seconds, while the command is running, there are some connectivity issues in the Kubernetes cluster, or in the networking stack in general, the connection between the runner and...
B: So basically, this is the place in remotecommand where the connection is handled. This is the place where stdin and stdout are basically copied to and from the pod, and there's a special error stream, which is part of the Kubernetes remote-command protocol, and this error stream can decode a message.
B: This is actually the root of the problem, and it's kind of tricky, because there might be a case where the connection legitimately doesn't have anything to send back to us, but we still get EOF. We don't get an error — we get EOF. The connection basically closes, for example from our end, and this...
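For context on the EOF ambiguity described here: in Go, `io.EOF` is the same value whether the remote side finished cleanly or the transport silently went away, so the reader alone cannot tell the two apart. A minimal illustration (not runner code):

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// drain reads a stream to completion. The crucial point: io.EOF is returned
// both when the writer finished cleanly and when the transport simply went
// away without an error frame, so EOF alone cannot prove the command ended.
func drain(r io.Reader) error {
	buf := make([]byte, 4096)
	for {
		_, err := r.Read(buf)
		if err == io.EOF {
			return nil // clean end? dropped connection? indistinguishable here
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	fmt.Println(drain(strings.NewReader("output of the remote command")))
}
```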
B: This is the last place in our code before we actually execute the remote command, in the code I showed you before. Okay, so, yeah — the problem is here: remotecommand basically creates an HTTP/2 connection, and on this HTTP/2 connection it creates multiple streams for standard input, standard output and the special error stream, which is part of the protocol. And basically, if the HTTP/2 connection dies, we get EOF, and the thing I explained happens.
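The exec path under discussion is client-go's `remotecommand` package. Roughly, the call shape looks like the following sketch — simplified, with placeholder names, not the actual runner code:

```go
package executor

import (
	"net/http"
	"os"
	"strings"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	restclient "k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// execScript runs a script in a pod's container over the remote-command
// protocol: a single multiplexed connection carrying stdin, stdout, stderr
// and the protocol's dedicated error stream.
func execScript(c kubernetes.Interface, cfg *restclient.Config, ns, pod, container, script string) error {
	req := c.CoreV1().RESTClient().Post().
		Resource("pods").Namespace(ns).Name(pod).
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: container,
			Command:   []string{"sh"},
			Stdin:     true,
			Stdout:    true,
			Stderr:    true,
		}, scheme.ParameterCodec)

	executor, err := remotecommand.NewSPDYExecutor(cfg, http.MethodPost, req.URL())
	if err != nil {
		return err
	}

	// Stream copies the three streams until the remote side ends them. Per
	// the discussion above, a dropped connection can surface here as a bare
	// EOF rather than as a command failure.
	return executor.Stream(remotecommand.StreamOptions{
		Stdin:  strings.NewReader(script),
		Stdout: os.Stdout,
		Stderr: os.Stderr,
	})
}
```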
B: Yeah, I put that in point two, yeah, yeah. That's one. So this point consists of two possible solutions. We need to modify remotecommand in this case. If we want to try this, it can give us two possible outcomes. We could either be able to reattach the streams — let's say... I don't know, I haven't tried it that deeply.
B: Maybe the HTTP/2 connection is going to reconnect itself — we're basically using Go's HTTP/2 stack, so we create streams on top of the HTTP/2 connection. So if the connection dies, it could reconnect. If it reconnects, we could recreate the streams on top of it and just resume listening to the standard output of the process. We...
B: We could check if we have already gotten a status response. If we don't have a status response yet, that probably means the connection died unexpectedly. That's one possible way. Another possible way is: don't rely on io.EOF. We could implement the buffered reading ourselves, maybe, or we could just handle the case where we get no error and no message, and then we could just retry. It's not going to be bulletproof, don't get me wrong.
B: And yeah, that's the case where we can actually reconnect. If we can't reconnect, we can fall back to checking whether we have this status response, and if we don't, we can just return an error saying the connection died unexpectedly. So we're at least going to solve this problem where the job is marked as successful but actually isn't.
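A rough sketch of that fallback. The `gotStatus` flag is hypothetical: the stock client library does not expose whether a status response arrived, and obtaining that flag is exactly the modification being discussed:

```go
package executor

import "errors"

// streamResult is a hypothetical wrapper around the stream call: err is what
// the call returned; gotStatus records whether the protocol's error stream
// delivered a status response before the connection ended.
type streamResult struct {
	err       error
	gotStatus bool
}

// interpret trusts a nil error only when a status response was actually seen;
// otherwise the connection is assumed to have died unexpectedly, so the job
// is never silently marked successful.
func interpret(res streamResult) error {
	if res.err != nil {
		return res.err // genuine failure: propagate as before
	}
	if !res.gotStatus {
		return errors.New("connection died unexpectedly: no status response received")
	}
	return nil // the command really finished successfully
}
```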
B: So let's say we cannot attach these streams again, and we get to this point, to this case. We could check whether we have already received some status response — and if at this point we haven't gotten a status response, this probably shouldn't happen.
A: The reason I don't like mucking around with the stream is because it's still the network, right? It can still go wrong in multiple directions. If the stream disconnects, does the process still execute? So let's say we pass in the bash script that needs to be executed and the stream disconnects midway — is the bash script still running? In this case...
A: So yeah, that's good. So, what we have, for example: the shell executor for bash is tied directly back to standard output, right? But for PowerShell and batch we save the file and then execute the file. So we could do something similar instead of streaming to standard output: we send the files — there's a way to send files to Kubernetes — we execute the files, and then we can read from the stream. I'm not sure if that makes sense, or if that's a big change. I'm just thinking out loud.
C: First thing is that I wouldn't think right now about this file redirection and problems with reading streams, because reading streams is one of the basic things of Kubernetes' exec. I don't think we are the only user in the world that relies on this. If Kubernetes for some reason starts having problems with attaching to the pods and reading the streams of any command executed this way, it will need to be fixed quickly. Right now we are fighting with some really, really edge-case scenario, and in most cases this just works.
C: So we need to find a way to — basically, the biggest problem is not that something wrong is happening. The biggest problem is that we don't detect that it is happening, and we say everything went okay. Of course, there was at least one client in the issue who claims he sees this every time, on every build; in his case the result would just be that all of his jobs fail. But I think this is a second thing that we should look at.
C: Second thing is: okay, we will try to reproduce this, we'll try to reattach — see how this works on the Kubernetes side. Because copying the client library stack and changing it so that we can try to reattach is easy. The question is: does the Kubernetes server support this, and how do we find what to attach to? I've never seen how to attach to a running exec command. You can attach multiple times to the main command started from the container, but the exec is always an independent, one-time entity. Like, you start an exec and you can see it.
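For reference, exec and attach are two different pod subresources in the Kubernetes API; a simplified sketch of the difference (placeholder names):

```go
package executor

import (
	"net/url"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
)

// execURL targets the "exec" subresource: every call starts a brand-new
// process in the container, an independent one-time entity with nothing to
// re-attach to afterwards.
func execURL(c kubernetes.Interface, ns, pod, container string) *url.URL {
	return c.CoreV1().RESTClient().Post().
		Resource("pods").Namespace(ns).Name(pod).
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: container,
			Command:   []string{"sh"},
			Stdin:     true, Stdout: true, Stderr: true,
		}, scheme.ParameterCodec).URL()
}

// attachURL targets the "attach" subresource: it connects to the container's
// existing main process, and the client may disconnect and attach again as
// many times as it needs.
func attachURL(c kubernetes.Interface, ns, pod, container string) *url.URL {
	return c.CoreV1().RESTClient().Post().
		Resource("pods").Namespace(ns).Name(pod).
		SubResource("attach").
		VersionedParams(&corev1.PodAttachOptions{
			Container: container,
			Stdin:     true, Stdout: true, Stderr: true,
		}, scheme.ParameterCodec).URL()
}
```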
C: Attaching to the container — I don't remember now which executor it was, but at some point we had two different implementations, for Docker and Kubernetes. One of them was starting the container with the shell detection script as the main command and then executing all of the runner scripts through exec, and the second executor was attaching to the running container and executing the scripts there. At some point we changed one of the executors to make them work the same, and we chose to use the exec one, so that...
A: That was my change, actually. So, the current status right now: the Docker executor basically makes PID 1 of the container the script to be executed, right — so that's the shell detection or the build script, whatever it is. So we can get the exit code from the container status, and that's how the Docker executor works. But for the Kubernetes executor we start the pod and run the Kubernetes exec, passing the scripts over standard input. And the reason we had it like that...
A: ...once the job is finished, there's a configurable timeout where it will wait — for example, 30 minutes — until it stops the pod, right? So if the job is finished and you're connected to the interactive web terminal, you can still look around, change files and things like that. That's why we wanted to change the Docker executor to be the same as Kubernetes, to use...
A: You can still use it for, like, 30 minutes — it depends on the configuration. So that's what would break if we changed it: if the script executes while we stay attached, we can't control the timing — we can't really stop the pod from stopping, or keep the pod alive. Does that make sense? I'm not sure I'm explaining the problem properly, yeah.
C: Yeah, I get the problem. I'm thinking how we could change the execution of the script to get the output but not stop the container in case it fails. Because the biggest problem with exec is that we don't really have a good way to reattach to it and to detect problems. With attach to the container, we can get back to the same executed command multiple times, and this would just make things much easier — if we could attach to the container and not do the exec. Yeah, and I get the problem of...
C: We upload the script, it's in the container, we start it — how will we get the trace, how will we get the final result? This is where the failure happens: not on starting the execution but on watching the execution. We still need to read the output, we still need to look for the exit code, and we still need to make it not fail — or falsely succeed — in case of network problems disconnecting us from the output.
C: I don't think we need to stream this; we need to have it scripted, yeah. Like, let's consider this: we update the execution script — we need to think whether this should be only for Kubernetes or for all the executors. We update the execution script so that the final exit code of the script is saved as a file in a known location. The location would include, for example, the project URL, the job ID and the sequence ID of the script execution by the runner, so we have a unique file for this specific script.
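A minimal sketch of this first proposal, with an illustrative path layout — the real location and naming scheme would need to be designed:

```go
package main

import "fmt"

// wrapScript prefixes the generated script with a shell trap so that,
// whichever way the script ends, its exit code is persisted to a file whose
// name is unique per project / job / script sequence.
func wrapScript(script string, projectID, jobID, seq int64) string {
	exitFile := fmt.Sprintf("/builds/.runner/%d-%d-%d.exit", projectID, jobID, seq)
	return fmt.Sprintf(
		"mkdir -p /builds/.runner\ntrap 'echo $? > %s' EXIT\n%s\n",
		exitFile, script,
	)
}

func main() {
	fmt.Print(wrapScript("echo hello", 42, 1001, 3))
}
```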
C: For this specific script that was executed: now the command exits and we get a non-zero exit code. This means that for some reason the command exited with an error; in this case we don't need to do anything more — we just fail the job and propagate the exit code as we do now. If we got exit code zero, then it may be that the job finished okay, or maybe it's this strange case of a false exit code zero.
C: We can retry it with a script as simple as `cat` and the name of the file, for example. We would have to have really bad luck to hit such a networking problem again within such a short window. Then, having got this output, we just parse it, and we can set this as the final exit code of the script. If it's still a zero, then...
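And the read-back side of that idea, sketched; `runInPod` stands in for whatever hypothetical helper executes a command inside the build container:

```go
package executor

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// readExitCode retries reading the persisted exit-code file with a short
// backoff, so a single transient network failure does not lose the result.
// runInPod is a hypothetical helper that runs a command inside the build
// container and returns its output.
func readExitCode(runInPod func(args ...string) (string, error), exitFile string) (int, error) {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		out, err := runInPod("cat", exitFile)
		if err == nil {
			return strconv.Atoi(strings.TrimSpace(out))
		}
		lastErr = err
		time.Sleep(time.Second << attempt)
	}
	return 0, fmt.Errorf("could not read exit code file: %w", lastErr)
}
```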
C: However, from the issue it seems that it's not, because we get the positive output of the job, and people say it's still running and, again, it fails in the background. So I would assume that currently it still executes the job — but this needs to be checked. So then we need a way to detect whether the job is finished, and check the exit-code file after it is finished — so, some sort of run file that is created when the job is started and removed...
C: You have 30 minutes for your web terminal to use, and when this is finished you get the normal red failure with "job failed with exit code" blah blah blah. So we have this information doubled. It seems that it would be easier to do something like that, but this requires us to totally rewrite how the Kubernetes executor works.
B: Okay, by the way, if we go back to the solution Tomasz provided: we can read the output of the command we execute in exec. What if we just had — this may sound stupid — something by which we can check whether the process exited successfully? Then we would know whether we should check the status code inside the file.
B: And by the way, in the real long term we're probably going to remove this, because I guess it's kind of a problem for people when their jobs stop in the middle of their work. So at some point we might want to rework the Kubernetes executor the other way, as you suggested.
C: Okay, that was the first one. The second one that we discussed was that we switch from exec to attach: we stop sending the exit code and killing the container. Instead, we'll look for some specific marker in the output, so we can decide what the status of the script execution is. Yes — and then this should also support the web terminal in our case.
A: That's Tomasz's point, because he said we will check the file status. The job starts, fine. Every time the Kubernetes exec returns with status zero — right, to make sure it actually terminated — instead of trusting that, every time we get a zero we check it. We go: okay, it's zero — let's check the job output. Is there something missing in the job log? We will actually check the file status, because if we know the log has missing data, we should check the file for the status.
A: Yeah, I just — even if we had "job success" and "job failed" markers, how would it work? We would check for those logs, and okay, let's say there's a "job success". Now, am I going to check whether the user connected to the web terminal or not? If the user is connected to the web terminal, do nothing; but if the user did not connect to the web terminal, just kill the pod, since it's an infinite script. I...
C: I don't know how that works, but I think we can just reuse what we have now. Right now, if the script fails, it fails in the context of the pod that is still running, so we still need to kill this pod with its infinite `sh` process, right? And this is exactly the same case — we just don't rely on the exit code; we rely on some specific lines, some specific words in the output, that make the runner decide: okay, this job succeeded.
C: This doesn't change the stages flow. We still get the information about the script failing or succeeding, and at the level of common/build, where the stages are executed, we don't even look at the exit code — we look at the error that was returned by the script execution, or the lack of it. This is what the executor is responsible for, so in the case of Kubernetes we can just push the logs through some hook that will read them, check whether the marker we want is there, and push them further.
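A sketch of what such a hook could look like; the marker format here is hypothetical:

```go
package executor

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// exitMarker is a hypothetical marker format; the real choice of marker is
// exactly what would need to be designed.
const exitMarker = "#RUNNER-SCRIPT-EXIT:"

// scanLogs copies log lines to the job trace while watching for the exit
// marker. It returns the script's exit code, or an error if the stream ended
// (for example because of a network failure) before any marker was seen — in
// which case the caller can simply re-attach and call it again.
func scanLogs(logs io.Reader, trace io.Writer) (int, error) {
	scanner := bufio.NewScanner(logs)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, exitMarker) {
			return strconv.Atoi(strings.TrimSpace(strings.TrimPrefix(line, exitMarker)))
		}
		fmt.Fprintln(trace, line)
	}
	if err := scanner.Err(); err != nil {
		return 0, err
	}
	return 0, errors.New("log stream ended without an exit marker; retry the attach")
}
```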
C
Why
why
you
say
we
need
to
have
them
all
connected
right?
We
all
we
can
attach
multiple
times
each
time.
Well,
we
from
common,
build.
We
execute
the
screen.
This
is
the
executor
to
somehow
schedule
the
script
on
the
target
environment
in
case
of
kubernetes.
In
this
moment,
we
attach
to
the
run
container,
send.
C: ...the script through the — yeah, and then we attach to the logs and we start reading from them. We read from them until something breaks — again, the networking stuff — and then we can retry it; or we get the marker that says that the job succeeded, and then we return from this command in the exact same way we do it now, returning without an error. Or we get a log...
C: ...that says that we have some error in the script. Internally it would exit with some exit code, but we just make it not send the exit code, and then we know that we need to return — we get back to common/build, and common/build behaves exactly like it does now: it got an error or not, and depending on that it executes the next stage or not.
C: When this is finished, we go to the cleanup method of the executor, and at this moment the executor already knows whether someone connected to the web terminal or not — we don't need to check it, it's checked already. If it's connected, it just waits for the terminal to exit, or for the timeout, to kill the pod; finally, if it's not connected, it kills the pod. So the only thing that changes is detecting the exit of the script and detecting whether it was a failure.
C: We need to update the script, at least for Kubernetes, so that it will not exit; instead we'll have there a trap that will catch the exit code and write something to the output. And after we attach — after we send the script to the standard input — we don't stay there attached, watching the standard output and standard error of the process; instead we switch to the logs.
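And a sketch of the writer side of that trap idea, matching the hypothetical marker above:

```go
package executor

import "fmt"

// Same hypothetical marker format as in the reader sketch above.
const exitMarker = "#RUNNER-SCRIPT-EXIT:"

// wrapForAttach prefixes the script with a shell trap: whichever way the
// script ends, the trap prints its exit code as a marker line the runner can
// pick out of the pod logs, instead of the code being lost with the
// connection or terminating the pod's main process.
func wrapForAttach(script string) string {
	return fmt.Sprintf("trap 'echo \"%s $?\"' EXIT\n%s\n", exitMarker, script)
}
```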
C: A requirement that we have had for Runner from the beginning is that you need to have a shell, and in the case of Docker you need to make the container start with the shell. So if we have any problems with the entrypoint on the Kubernetes executor, it needs to be fixed, but this doesn't change the requirement: we give you the entrypoint, so you can do anything you want, but finally you need to give us the shell, so we can start executing something in that shell.
C: Looking at this, it looks just like a nasty, hard-to-prepare workaround; it doesn't look like a proper solution, and the second version looks way more proper. I'm still not a big fan of using the output and some specific markers to detect failures, but to my mind it's still better than saving the exit code to a file and then reading the file to get the exit code — as if we didn't have an exit code at all.