GitLab CI/CD, 27 Oct 2020

From YouTube: Discussion about job statuses and exit codes

Description

This discussion is related to https://gitlab.com/gitlab-org/gitlab/-/issues/16733

A

Okay, so we've had a new proposal that implements everything on the Rails side, but this proposal kind of conflicts with what happens on the runner side, because the runner will report the bad metrics.

B

So, basically, before we dive deeper into this conversation, I would like to check in and stress the importance of keeping the CI/CD product reliable, and I would also like to make a point about the difference between making things easy and making things simple.

B

So I feel like the new proposal is definitely making things easier, because we do not need to implement anything on the runner side, but I'm a little bit worried about the behavior of the CI/CD system, because presumably the most important aspect of a CI/CD system is to mark failed builds as failed builds and successful builds as successful builds, right? And by introducing code that changes the status based on something, we make this line between a successful and a failed build a little bit thinner than it is right now.

B

So I just wanted to say that I believe we should make things simple, and we should choose the proposal that is simpler and more predictable over a solution that is easier to build. That's just something I wanted to share, and I believe it is important here.

C

I would say that I agree with you. I understand that, in terms of implementation from the engineering side, it's easier to go with the new proposal, but if we are indeed realizing that this is not something that will be simple for our users to use, then I don't know if it makes sense to go with that. But let's dive into the...

B

Yeah, I would just like to add that simplicity is a broader concept than just making things simple for a user. It's about how the system works. It's also about all the inner workings, how easy it is to maintain and extend, and how predictable the feature is going to be in the future as it grows and entropy kicks in. So simplicity is not only about the user. It's a concept that is much broader.

D

Sorry, should I add a few words?

B

I think that everyone is waiting for.

D

I don't really know, to be honest.

D

There are problems with both of the approaches, and I cannot even confidently say myself whether one is preferable to the other. I just hear the pros and cons of both of them.

D

Basically, looking at the exit status code feels more natural from one angle, but it's also less predictable, because you don't have full control. I'm concerned that we cannot really, in all cases, get the exit status code as it's being returned by the processes.

D

So from one angle it's kind of flaky, because you are building on very specific behavior. Today this behavior is: it's either zero, which is success, or anything else, which is not success. And in some cases, with some executors, some shells, or some commands, they may not pass the exit status code from the child process up to the parent in the way that you expect.

D

The status code that you're using may be something different from what you're expecting. On the other hand, with the regex you are looking at the contents of the text, with the feeling that if you see this line, you assume that this is your status. So from this angle, you are ignoring the exit status code to supersede it with something different that maybe gives you something more accurate for the status.

D

So from my perspective, I read the comment from Steve about the metrics. I don't really think that this, my proposal, structurally challenges the metrics, but on the other hand, I'm not convinced which of these proposals is better, simpler, and less complex for the user.

D

My assumption would be that using the exit status code may be simpler, but it's going to be harder if you have a much more complex case. Another aspect I'm looking at is: do exit status codes work reliably in all our executors? Do we get the exit status code in SSH, Kubernetes, or Docker? Is it something that is overwritten by something else?

D

Maybe it's something that is being overwritten by the usage of PowerShell. Can someone do a trap in the execution of Bash that's going to override this exit status code? I don't know about that. So that's...

B

I feel like that's super interesting, and I totally agree that exit codes are not the silver bullet here and that they might still be error-prone. My key point about the difference is that I believe a runner should actually know the status. I know that we violate this a little bit.

B

We have allow_failure, but when a build has failed and it's allowed to fail, we indeed mark it as successful with warnings, but the build is still failed. When you look at the build log, you will see that the build has failed and...

D

It's still a failed build.

B

In the case of exit codes, like in the case of everything else, we introduce this thin line between statuses, where something might be failed or not. We cannot really determine whether a build is failed or not. It's even worse for the trace regex, because it happens on the Rails side.

B

We presumably do not want to build this on the runner side, which could be some kind of middle ground. The key point is that when a runner sends a status currently, it can be either success or failed. It should stay this way.

B

We should not send a failed status from a runner and somehow change it to success, pure success without warnings, so that it just becomes a successful job on the Rails side because we have some strange business rules that are going to change a failed build to a successful build. I mean, it's very error-prone.

D

In my perception, in any case, the runner should not really care about that. The runner should just send the exit status code. If we went in that direction, then from the perspective of the runner, whether this is success or failed would be the same as it is now; the exit status code is rather an additional annotation. I don't think that the runner needs to understand this business logic at all. Runners should be as dumb as possible and send as much data as possible.

D

So if we talk about the exit status code, my perception is that we would not implement exit status code handling on the runner. It just doesn't make sense from my perspective. We would rather send the exit code alongside the status, where the status would be success or failed.

D

We also send the failure reason, and maybe we would send the exit code, and it would be up to Rails to decide how it wants to handle that. It's like the pending states you added recently. Maybe we have dual information: what we receive as raw data from the runner, and how it got evaluated as part of the business logic in Rails.
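
[Editor's sketch] The idea above, a runner that reports only raw facts (status, failure reason, exit code) and leaves interpretation to Rails, could look roughly like the following. This is a minimal illustration, not GitLab's actual API: the `RunnerReport` class, its field names, and the example values are all assumptions made for this sketch.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class RunnerReport:
    """Hypothetical raw report from a runner; it carries no business logic."""

    status: str                    # only "success" or "failed"
    failure_reason: Optional[str]  # illustrative value such as "script_failure"
    exit_code: Optional[int]       # final exit code of the script, if known

    def __post_init__(self):
        # The runner never invents other statuses such as "skipped".
        if self.status not in ("success", "failed"):
            raise ValueError(f"unexpected status: {self.status!r}")

    def to_json(self) -> str:
        # Serialize the raw facts; the receiving side decides what they mean.
        return json.dumps(asdict(self), sort_keys=True)


report = RunnerReport(status="failed", failure_reason="script_failure", exit_code=137)
payload = report.to_json()
```

The point of the sketch is only that the exit code travels next to the status as an annotation and never replaces it.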

D

It's a completely different story, and I'm thinking that, whatever we do, the runner should not have any understanding of that behavior. The runner should be dumb.

B

I'm on board with that, because I would like to stress the importance of what we are talking about here. We are talking about how we determine whether a build is successful or failed, and that's the most critical and most important aspect of every CI/CD system. So if we choose to make a change, it needs to be a change that moves us towards more reliable statuses rather than statuses that are much more unpredictable.

B

So I guess that we can find the solution here, but this solution should presumably be simpler, and it's difficult to find something that's simpler than what we have currently. There's this quote that I love; it should be printed above the entry to our company, or at least to the Verify department: simplicity is a prerequisite for reliability. And in the case of such a critical aspect as evaluating whether a job should be failed or successful, this is something we should take into account.

B

The system should become more reliable instead of less reliable. How we do that, I'm not sure. I see problems with both proposals, so perhaps we need another one, but we should keep this in mind, and I cannot stress the importance of this enough.

E

I think for me, having different statuses from the runner and from GitLab is something I really dislike, even just from a debuggability perspective. If we look at our metrics right now and they say there are 20,000 failed jobs, for example, then as a system administrator for the runner,

E

that means that probably something is wrong, right? So they go check what's wrong, but then, oh no, it's just the user script failing. But then on GitLab we mark it as successful. So there's a split-brain issue in that sense. And likewise, if I want to go look at what happened on this job, at the runner level we're marking it as failed, but at the GitLab level we're marking it as successful. So even from a debugging and administration perspective,

E

that seems like a bad idea. Kamil mentioned that some processes might not report the correct status code. Yes, that's a good point, but that's also the real-world scenario. For example, if I have a Makefile that's creating multiple processes, one process might not report the correct exit code, or it might trap it in some way, and that's still going to be marked as successful for the user, even if they run it locally, not just in CI. All we care about is the final exit code.

E

I think it's up to the user to handle that kind of child processing and trapping. It shouldn't be up to us, because at the end of the day they know their script best. So as long as we have a contract with the user that we will respect their final exit code,

E

it's up to them to report the final exit code properly. Right now, technically, they can already do this kind of thing themselves by running a trap and looking at the exit code: if it's one, two, three, or four, they just return exit code zero. They override the exit code. We already provide this feature in a way; we just want to improve the UX. We already tell the user that they have control over the exit code, and I think we should try to keep that methodology as well.
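
[Editor's sketch] The "users can already do this themselves" point can be shown with a small wrapper that runs a command and turns selected exit codes into zero. The speaker describes doing this with a shell trap; the sketch below uses a Python wrapper instead, and `run_with_override` and its parameters are hypothetical names chosen for illustration.

```python
import subprocess
import sys


def run_with_override(cmd, treat_as_success=()):
    """Run cmd and return 0 if its exit code is listed in treat_as_success."""
    code = subprocess.run(cmd).returncode
    # Only the final exit code of the wrapper is visible to the CI system.
    return 0 if code in treat_as_success else code


# A child process that always exits with code 3 (stands in for a user script).
child = [sys.executable, "-c", "import sys; sys.exit(3)"]

plain = run_with_override(child)                              # exit code passed through
overridden = run_with_override(child, treat_as_success={3})   # exit code rewritten to 0
```

A job script wrapped this way looks fully successful to the runner, which is exactly the information loss the rest of the discussion tries to avoid.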

D

I kind of understand your perception, Steve, but I'm wondering, because it assumes that you may want to say that exit codes 0 and 1 are both considered success.

D

I don't think that we should ever allow that. That's my assumption. If we go with the exit code and we say that, it's clearly a break of the contract. It's like saying that 137, which I think is out of memory, is success. It doesn't make sense. If we want these things to be consistent, I would rather say that the exit code should be something that modifies the current status rather than fully overwriting the status.

D

So I think, as a convention of our configuration, which is another aspect, we would never move away from saying: if you receive zero, it's success. We don't allow you to override that. But now we have the case of failed jobs that you may want to dynamically mark as allowed to fail, which is still a failed job. We just change the state, and this can be tied to a specific status code, so it doesn't really break the contract,

D

from your perspective, of whether something is success or failed, because it still fails. It's rather dynamically annotated with some additional data. What other statuses may we want to support? Maybe the failure reason, right? Maybe we want to allow the user to configure the failure reason. I don't really know how to approach that, and I'm not sure if this is like that.

B

That's a brilliant solution, basically: we do not actually violate contracts. We do not change the status. We just allow augmenting it with additional data. So perhaps the first iteration, an MVC here, would be to allow a user to define an exit status that would make a job conditionally successful with warnings. So if it's failed and there is this exit status, then we make it successful with warnings on the Rails side. Of course, the runner should actually send this information, but it's still a failed job, right?
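
[Editor's sketch] The MVC described here keeps the failed status and only toggles an "allowed to fail" flag when the exit code is on a user-defined list. A minimal sketch of that rule follows, assuming a hypothetical `evaluate` function on the Rails side (written in Python here only for illustration); none of these names come from GitLab's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluation:
    status: str          # always what the runner reported: "success" or "failed"
    allow_failure: bool  # True means "failed, but shown as passed with warnings"


def evaluate(runner_status, exit_code, allowed_exit_codes=frozenset()):
    """Decorate a runner-reported status without ever changing it."""
    if runner_status == "success":
        # A successful job stays successful; the allowed list is irrelevant.
        return Evaluation("success", False)
    # A failed job stays failed; only the single flag is toggled.
    allowed = exit_code is not None and exit_code in allowed_exit_codes
    return Evaluation("failed", allowed)


warned = evaluate("failed", 2, allowed_exit_codes={2})
hard_failure = evaluate("failed", 137, allowed_exit_codes={2})
```

In this sketch an out-of-memory style exit code such as 137 stays a hard failure unless the user lists it explicitly, and no input can turn a failed job into a plain success.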

D

It's still a failed job. It still failed from the runner's perspective and from the Rails perspective; we only toggle a single flag, and maybe this is the MVC that we support.

B

Then we could extend it to more decorators, and perhaps we would allow overriding the failure reason, but a failed job should stay failed and a successful job should stay successful. And I completely agree that, with the current state of Unix exit codes, zero usually means everything is fine, and everything else is much more volatile.

B

You don't know where an exit 1 actually comes from. It might come from different places, and it might not come from the place that you expect it to come from.

E

I think that's a nice solution, to decorate the failure as we're doing with allow_failure, but we have to remember about Windows as well. Windows, unfortunately, doesn't follow that contract. For example, when you install some Windows tools, they might return an exit code of 255, which means, hey,

E

this is a successful install, but it requires updates, for example, and things like that. I'm guessing this is where it came from. And yes, decorating the failure is really good, but some users want skipped and canceled as well. I agree with Kamil's comment saying that canceled shouldn't be done, because canceled is a user action, not a process action, and I 100% agree with that.

E

But regarding skipped, maybe we can have failure or some other failure mechanism if they want to have the skipped mechanism. But we have to keep in consideration that not every OS we support follows that convention of a zero exit code as the successful one.

B

Unfortunately. So perhaps we could augment the build log, not only with a failure reason but with a runner message or status message, and we would not show this as a warning but more like information: depending on what you define in .gitlab-ci.yml, the runner is going to send some information that you then show in the build log or on the build page, depending on the exit status or other criteria. We would not really augment the build status. The failure with a reason,

B

the success with warnings, that's still something we can do, but then we would perhaps streamline how we pass additional metadata based on some criteria. How.

D

About saying that the failure reason is basically a status reason, and saying that we always have the failure reason, which is predefined, and if we transition to skipped, we would still show the failure reason, I mean the status reason, which could be: the status was transitioned because the runner requested the job to be marked as skipped.

D

I don't know, something like that. I'm seeing two general improvements here, which are kind of orthogonal. Logging and visualizing the exit code can be handy for some people, so that you quickly see that information on the build page, because you could probably look at this

B

Message and say.

D

Oh, this appears to be a memory error, or this appears to be some other kind of error. But then maybe, instead of our notion of a failure reason, we just need a status reason.

B

Yeah, I think that a status reason might be good additional information, but I would argue that we should not change the status itself. We can provide more data. We can visualize that in the UI. But again, how we evaluate the build to be failed or successful should stay simple, and if we can make it even simpler and even more reliable, that's good, but I would really advise against changing it in a way that makes it more brittle and error-prone.

E

Also, looking at some of the comments, most of them want to control the flow of the pipeline rather than failure or success. They want to say: skip this job because something happened, and things like that. So maybe using exit codes is not the right idea either, because if we look at all the comments, most of them are along the lines of: hey,

E

I want to control the job to run at a specific time, and things like that, and I want to show in the UI that this job ran, because I don't want it to be marked as successful when it didn't do anything, right?

E

So maybe, looking at these comments again, I need to understand again what the users really want out of this. Instead

A

Of just hey, I want the success.

E

code to be successful, because I know it's the exit code that is wrong, but for me it's a successful job, if that makes sense.

D

So from my perspective, I think there are three aspects that may be configurable. First, the failure reason. If you can augment the failure reason, it gives you an automated retry function: you can define the retry semantics so that if you see this failure reason, it is or is not going to be retried.

D

And it's pretty handy, because we could add more failure reasons for different purposes, such as the dependencies, which would allow people to be more granular in presenting the information and also in retrying. Second, passed with warnings, which is something that is fixed today. But let's say you have quarantined tests: the quarantined tests failed, and you may want to consider this job as failed,

D

but you want to show that as success with warnings, really. And the third case, where I'm trying to find the workflow that would be suitable, is marking a job as skipped. But I'm trying to understand why something should be marked as skipped when it was running, and I didn't yet find the workflow for that.

B

So perhaps we should start with the first two, and that would be good progress already, and then we collect more feedback and decide whether we should do something with skipped or not.

D

Because these first two don't augment the status. They may augment allow_failure or the failure reason, so we are less susceptible to changing states. But then we have the concern about the predictable status, because we would be changing something that was running to be skipped based on some other criteria, which is a different behavior, I guess.

B

The two changes that you described, I think they make a lot of sense and can be very useful. I'm.

D

I'm looking at Dov. Dov, what is your perception of what the user is asking for?

C

I think what the user is asking for, from what I was reading, is this: they have different scripts, and they gave the Terraform example, and they don't want the job to be failed just because they did not receive zero. Basically, that's what the user wants. I think they actually want to control the success or failure of the pipeline, and whether it is possible to proceed based on...

B

Success with warnings would solve this particular case, right?

C

I think so, yeah, from what I was reading here. Yes, success with warnings, as long as it will not fail the entire pipeline.

C

I guess some of them are building wrappers that change the status code so that we will consider them as successful. So that was my read from going through all of this thread.

C

So I think that success with warnings can probably solve, I would say, most cases. We can start with that, and then we can collect additional feedback. So I have a product.

B

Proposal: can we just take the allow_failure keyword and change it from being only true/false to also accepting an array of statuses? If it's not only true/false, we send it to the runner, or we just evaluate the warning when we receive the exit status from...

A

Implementation.

B

These are technical details, but the user-facing change would be making the allow_failure keyword accept not only true or false, but also an array of statuses.
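
[Editor's sketch] The user-facing change proposed here is that allow_failure accepts either a boolean or an array of exit statuses. A rough sketch of how such a value might be normalized is shown below. The function name `parse_allow_failure`, the returned tuple shape, and the rule rejecting 0 (taken from the earlier remark that zero is always success) are assumptions for illustration; the syntax itself was left open in the meeting.

```python
def parse_allow_failure(value):
    """Normalize a hypothetical allow_failure value.

    Returns (always_allowed, exit_codes):
      True / False        -> (True / False, empty set)
      [2, 3] (list of int) -> (False, {2, 3})
    """
    if isinstance(value, bool):
        return value, frozenset()
    if isinstance(value, list) and value:
        for code in value:
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(code, bool) or not isinstance(code, int):
                raise ValueError(f"exit status must be an integer: {code!r}")
            if code == 0:
                # Zero is always success, so it cannot be an allowed failure.
                raise ValueError("exit status 0 cannot be an allowed failure")
        return False, frozenset(value)
    raise ValueError(f"unsupported allow_failure value: {value!r}")


always, codes = parse_allow_failure([2, 3])
```

Keeping the boolean form working unchanged means existing configurations would not need to be touched under this assumption.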

D

I'm thinking about a syntax that would be future-proof. Would it be future-proof if we would like to control more behaviors? Where would we add that? Would we change allow_failure then, or would it be allow_failure specific? Because

D

we also talked about the failure reason, and there is an interesting case: how do you want to control, let's say, setting the failure reason and allow_failure at the same time? Do we consider

D

that kind of syntax later? Or how do we deal with that, if we even deal with that?

B

I guess that we might discuss this asynchronously, because I guess that people would like to think a little bit more about that. These are implementation details. What matters is that we probably do not intend to change statuses. Yeah, I'm sorry, so yeah.

A

Okay, so yeah, I think we should talk more about this one asynchronously, because we are at time now.

C

Okay, so do we have what we need in order to carry this forward, continue the discussion, and maybe start planning an MVC with allow_failure? Marius, do you think you have enough data for that for now?

A

Oh yeah, I guess we could start something, but I'll need a few more details about how to do it. Can you.

B

Marius, can you post a summary of this call in the issue?

D

I'm wondering, because "make it possible to control build status" is a much wider topic. Maybe it would be reasonable to create an issue with the very specific, minimal changes that we discussed, and move this discussion and the solution validation with the customer there,

D

to see if this is something that answers their needs. Because if we close this issue... this issue of Grzegorz's describes way more behaviors that we may want to support, whereas we discussed a very small slice of that issue, really.

C

Okay, we should not close this issue. This issue description is very specific, and any solution that we will provide will be partial, so we should not close this for sure until we validate that we solve the majority of the cases.

B

I created this issue three years ago, and I feel a little bit ashamed now, because now I think we shouldn't do that.

D

So my proposal, Marius, would be that you open a very small, minimal issue about the very specific changes that you would be adding, and then maybe we could reach out to the customer and validate that they actually answer their request.

C

Yeah, just open the issue and I'll write the comment. You can just open the issue with the summary of the meeting that we had, and I'll do the administrative task. I'm taking this thread.

D

Then we could really figure out the syntax and the ingredients, but keep the issue open for another year.

A

Okay, yeah, thanks everybody. Thank you.
