From YouTube: Community Engineering Hangouts. Mar 24, 2021
Description
Agenda:
- Predictive Test Selection by Soumya Unnikrishnan
- Mocking 3rd-Party Services for Testing Infrastructure by Alex Kolesnyk
A
Okay guys, so welcome to the Community Engineering Hangouts. Today we have two really interesting topics connected to testing: automated test selection models and strategies, and how to deal with some infrastructure problems. Okay, so go ahead with your presentation.
B
Hi everyone, my name is Soumya and I'm part of the quality engineering team. What we do here is work on building solutions and tools for improving quality processes for the product teams here at Magento. I'm here to present a short talk on the work that we have done so far on the predictive test selection research.
B
So we've based this research on papers published by Facebook and Google on how they are taming their continuous testing process, especially when they're seeing a high feature churn rate and an ever-increasing test pool size.
B
We are also seeing that the test infrastructure costs to run these ever-increasing test processes keep increasing as well. So we've been researching ways to implement a change-based test approach, which means running the tests that are relevant to the code change instead of exercising all tests on every change.
B
So if we can estimate this, we can rule out tests that are extremely unlikely to fail on a given code change. What we did here was use standard machine learning techniques to train a predictive model with a large data set containing test results on historical code changes, collected by our MTS platform, our continuous integration platform.
B
So this model then selects tests based on the probability score of a test failing on a new code change. I'll talk a little bit about the proof of concept that we worked on. What we did was collect three months of functional CE builds, which have these MFTF test logs, from the Jenkins job archive database. We specifically looked at certain job types, like MTS API and existing PRs.
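The talk doesn't show code for the selection step itself; as a minimal sketch, it could look like the following, where the scikit-learn-style predict_proba call and the 0.01 threshold are assumptions for illustration, not details from the talk.

```python
# Hedged sketch: keep only tests whose predicted failure probability on
# this code change clears a threshold; everything else is skipped.
def select_tests(model, X_candidates, test_names, threshold=0.01):
    # predict_proba returns [P(not fail), P(fail)] per (change, test) row.
    p_fail = model.predict_proba(X_candidates)[:, 1]
    return [name for name, p in zip(test_names, p_fail) if p >= threshold]
```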
B
The reason why we only considered these job types for the PoC is that we could easily find the information for extracting pull request numbers from these logs, and from the pull requests we could get the change list information from GitHub. Other supporting data sets that we used were the module dependencies of MFTF tests, as well as a module-to-domain mapping.
B
So this essentially formed the raw data that we worked with for this PoC.
B
So from the data we collected, we saw that very few of our tests actually fail, but those that do are generally closer to the code they test. From the data that we collected, we built a data set containing code change information, test information, and the test outcomes on those changes, to train the model.
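To make the shape of that data concrete, here is a hypothetical one-row example of such a training record; the column names are illustrative, not the actual MTS schema.

```python
import pandas as pd

# One illustrative training record: a (code change, test) pair plus the
# observed outcome used as the label.
records = pd.DataFrame([{
    "pr_number": 12345,                 # which pull request the build ran on
    "changed_files": 7,                 # file cardinality of the change
    "file_extensions": ".php,.xml",     # extensions touched by the change
    "dependent_modules": 3,             # modules depending on the touched files
    "test_name": "StorefrontAddProductToCartTest",
    "test_failure_rate": 0.02,          # historical failure rate of this test
    "intersected_modules": 1,           # modules shared by change and test
    "failed": 0,                        # label: did this test fail on this change?
}])
print(records.head())
```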
B
So we built a data set that contained features we extracted from the collected data. We categorized these features as change-level features, test-level features, and cross features. Change-level features were features related to the code change itself, which included the change history of files, the file cardinality, the number of dependent modules for a specific file, file extensions, and the number of authors. Test-level features were the historical test failure rates that we saw from the runs we've been doing over these past three months.
B
Cross features are features which are engineered from the change-level and the test-level features. So we looked at features like the number of intersected modules between the code change and the test, the number of intersected domains, and the number of common tokens in the file paths.
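A minimal sketch of those three cross features, computed as set intersections; how modules and domains are extracted from a change or a test isn't shown in the talk, so the inputs here are assumed to be precomputed lists.

```python
# Hedged sketch of the cross features named above.
def cross_features(change_modules, test_modules,
                   change_domains, test_domains,
                   change_paths, test_paths):
    def path_tokens(paths):
        # Split each path like "app/code/Magento/Checkout/..." into tokens.
        return {tok for p in paths for tok in p.split("/")}
    return {
        "intersected_modules": len(set(change_modules) & set(test_modules)),
        "intersected_domains": len(set(change_domains) & set(test_domains)),
        "common_path_tokens": len(path_tokens(change_paths)
                                  & path_tokens(test_paths)),
    }
```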
B
I'll talk a little bit about the model we trained to predict the test failures. We used a standard machine learning algorithm, a gradient boosting decision tree classifier, with very standard machine learning techniques. We did a 70/30 split of the data set, such that the most recent records fall into the testing data set and the remainder, the earlier records, fall into the training data set. This way we wanted to ensure that the model evaluation closely represents how the model is going to be used in production.
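The talk doesn't name a library; assuming scikit-learn, the chronological 70/30 split and the classifier could be sketched like this, with synthetic data standing in for the real feature matrix.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 1000
# Synthetic placeholder features; columns stand in for the kinds of
# features described in the talk (change-level, test-level, cross).
X = rng.random((n, 4))
y = (rng.random(n) < 0.05).astype(int)      # ~5% of tests fail

# Chronological split: rows are assumed sorted oldest-to-newest, so the
# earliest 70% trains the model and the most recent 30% evaluates it.
cut = int(n * 0.7)
clf = GradientBoostingClassifier().fit(X[:cut], y[:cut])
p_fail = clf.predict_proba(X[cut:])[:, 1]    # failure probabilities
```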
B
So, in order to avoid any false positive predictions: we encounter test flakiness very often in our builds, so we wanted to try not to include such flaky tests in the training data, to prevent false predictions. Here the term flaky means that the test had at least one re-run in a build. In our CI system we have a concept of re-running a test, a maximum of three times, until it passes; it's a de-flaking strategy. So we wanted to exclude such tests from our training data set.
B
So what we did was set a threshold and remove such flaky tests from the training data.
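A sketch of that filter, assuming a per-build was_rerun flag in the hypothetical schema above; the actual cutoff used isn't stated in the talk.

```python
import pandas as pd

FLAKY_THRESHOLD = 0.1   # illustrative: drop tests re-run in >10% of builds

def drop_flaky(train: pd.DataFrame) -> pd.DataFrame:
    # A test is flaky in a build if it needed at least one re-run;
    # compute the fraction of builds where that happened, per test.
    rerun_rate = train.groupby("test_name")["was_rerun"].mean()
    flaky = rerun_rate[rerun_rate > FLAKY_THRESHOLD].index
    return train[~train["test_name"].isin(flaky)]
```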
B
So in machine learning, generally, when you're training a model, you also do some hyperparameter tuning of the model. What that essentially does is select the best configuration and the best features that the model was trained on, and use those for predictions.
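Continuing the earlier sketch, the tuning step could be a small grid search; the search space and the use of recall as the scoring metric are assumptions for illustration.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

# Hedged sketch: search a few common gradient boosting hyperparameters,
# scoring on recall since missed failures are the costly mistake here.
search = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={"n_estimators": [100, 300],
                "max_depth": [3, 5],
                "learning_rate": [0.05, 0.1]},
    scoring="recall",
    cv=3,
)
search.fit(X[:cut], y[:cut])    # X, y, cut as in the previous sketch
print(search.best_params_)
```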
B
What we found was that the best performing model, the model that was giving us the best evaluation metrics, considered features like test failure rates, file extensions, change history, the number of intersected modules, and the number of dependent modules as the strongest features in the training data set.
B
So, coming to the calibration of the model: we used standard machine learning metrics, like the recall score, to evaluate how the model did on the test data. So in our case, we used three months of data for this PoC; the first two months of data were used for training and the third month of data was used for testing.
B
So the test data is basically the data extracted from one month of pull requests submitted right after the time period of the data with which the model was trained. We looked at two metrics that we were interested in, which were test recall and change recall. Test recall indicates the percentage of test failures correctly predicted in the test data, and change recall indicates the percentage of build failures correctly predicted.
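These two metrics could be computed as below; the selected and failed columns follow the hypothetical schema from the earlier sketch, with selected marking tests the model chose to run.

```python
import pandas as pd

def test_recall(df: pd.DataFrame) -> float:
    # Of all tests that actually failed, what fraction did we select?
    failures = df[df["failed"] == 1]
    return failures["selected"].mean()

def change_recall(df: pd.DataFrame) -> float:
    # Of all failing changes, on what fraction did we select at least
    # one of the failing tests, so the build failure is still caught?
    caught = df[df["failed"] == 1].groupby("pr_number")["selected"].max()
    return caught.mean()
```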
B
So, from the promising results that we're seeing on the PoC, we are currently working on a high-level design of how we could integrate this into our MTS, our CI platform.
B
So a model meeting our criteria automatically replaces the one operating in production. The summary of it is that we save new training data as we receive it; when we have enough data, we train the model and we test its recall against the production machine learning model; and if we see that the accuracy of our model is degrading over time, we do some more feature engineering and improve the scores of that model.
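A toy version of that promotion gate, reusing the test recall idea from above; the function and variable names are illustrative, since the actual pipeline isn't shown in the talk.

```python
def recall_on(model, X, y, threshold=0.01):
    # "Selected" means the predicted failure probability clears the
    # serving threshold; recall is measured on the actual failures.
    selected = model.predict_proba(X)[:, 1] >= threshold
    return selected[y == 1].mean()

def maybe_promote(candidate, production, X_new, y_new):
    # Replace the production model only if the freshly trained
    # candidate's recall on new data is at least as good.
    if recall_on(candidate, X_new, y_new) >= recall_on(production, X_new, y_new):
        return candidate
    return production
```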
B
So some of the next steps that we are working on are doing more testing of this PoC, and we are also currently talking to some of the architects on the MTS side, the platform side, as well as Magento architects, to finalize a high-level design of how this strategy would be implemented with our existing CI infrastructure.
B
So the whole idea is that we would be reducing the number of tests that we are running on our pull requests and on user-requested builds, cutting it down by, say, even one-fourth if we are seeing good metrics, and then have builds that are scheduled on a specific cadence, every four hours, that will run all tests. So we still want to exercise all tests, but the frequency of it would be on a cadence rather than on every build.
B
So there are several strategies that are currently under discussion, and we'd love to keep this group updated when we have a formalized approach on how we will go about this project. We are also looking at, you know, flaky tests, which are one of the biggest factors slowing down our PR processing.
B
We are looking at ways to quarantine these flaky tests and not have them run as a part of our delivery process.
C
Now, do you see my screen? Yep. Good. So my presentation is not as cool and fancy as Soumya's, but I think it is very important, just like predictive test selection, for how we build our tests and infrastructure.
C
If we use, let's say, a PayPal payment integration, or an integration with YouTube videos, something like that, we usually create an account which points to a PayPal sandbox or a YouTube sandbox, where we can play and test that our application works correctly. And you could say: then what is the problem? Those sandboxes were created for exactly this specific purpose, so you can test that your application works correctly.
C
Well, the integration redirects us to a PayPal UI page, a PayPal web page, sorry. Sometimes our tests have been written to use a specific selector to click a button or do something, and then suddenly they just change this selector to something else. So the test will fail, and that causes a lot of maintenance issues. Basically, this is a problem for us, and there are many more problems we can face with those sandboxes when we have this connection to the outside world.
C
So what do we want to do? It's currently at the proof-of-concept stage; it hasn't been developed yet, but we look forward to seeing this implemented and working well for us. But what do we want to do? We want to mock those third-party services.
C
We decided to build a couple more Docker containers for our infrastructure, which will serve this. The main idea is that all requests which go outside of Magento will go through a proxy server. If the proxy server knows that a request should be mocked, it will go to the third-party service mock, and this service mock will give us the necessary response, the response we're waiting for. And it can be anything; it can be any format you can imagine, even HTML.
C
We mock it, and everything else, which is not mocked, or shouldn't be mocked, or which we haven't mocked yet but will do later, will go to the world wide web and will get data from the real sandboxes.
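The proxy's implementation isn't shown in the talk; as a minimal sketch of the routing rule just described, with placeholder host names and mock URL:

```python
# Hedged sketch: decide whether a request is served by a mock or let
# out to the real service. Hosts and the mock URL are placeholders.
MOCKED_HOSTS = {
    "google.com": "http://service-mock.docker:8080",
}

def route(host: str, path: str) -> str:
    """Return the upstream URL this outbound request is forwarded to."""
    if host in MOCKED_HOSTS:
        # Known integration: send it to the service mock container.
        return MOCKED_HOSTS[host] + path
    # Not mocked (yet): pass it through to the real sandbox.
    return f"http://{host}{path}"
```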
C
Do you see PhpStorm? Yeah, I can see it. Okay, so, as I mentioned, we will have a proxy server, which will filter those requests and redirect them to some services, and the service mocks themselves. We decided to use basic Docker container functionality, where you can specify this; in my example, I will use the selenium chrome debug container.
C
In this container we will try to open google.com and see how we can mock the google.com page. So this is the main configuration here, and basically you just need to execute a simple docker-compose up command and it will bring everything up and do the magic for you.
C
The idea is that when we start selenium chrome debug, you can specify environment variables, Docker environment variables, one called HTTP_PROXY, to say where this proxy is located. So, as you can see here in my configuration, I've got proxy-server.docker on a specific port, and there is my proxy Docker container; that's what I have here, I'll show you later. This will serve as the proxy server, and then I have my service mock, and I'm going to mock google.com.
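The compose file itself isn't reproduced in the transcript; a minimal docker-compose sketch of the setup being described might look like this, where the service names, images, and port are assumptions.

```yaml
# Hypothetical docker-compose.yml for the demo setup: a proxied browser
# container, the proxy, and the service mock (names/port illustrative).
version: "3"
services:
  chrome-proxied:
    image: selenium/standalone-chrome-debug
    environment:
      # Route this container's outbound HTTP(S) traffic via the proxy.
      HTTP_PROXY: "http://proxy-server.docker:3128"
      HTTPS_PROXY: "http://proxy-server.docker:3128"
  proxy-server:
    image: our-proxy-image        # placeholder for the talk's proxy container
  service-mock:
    image: our-service-mock       # placeholder for the Node.js mock app
```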
C
This is a Node.js application which creates a server that listens for these requests, and also, when we start up this service, we register all the service mocks.
C
I won't go through all the parts of the implementation. I will show you the main part, which is the most interesting for everyone: how we can define the mock responses we would like to see. So, as you can see here, you can specify the link
you would like to intercept, and you can say what you would like to respond with here. You can basically build whatever you want here; you can even put in your own logic, which will build the response based on the POST request you send to some third-party service. You can do whatever you want, but I've got a pretty simple example here.
C
I will respond that I'm a Google, and if I go to google.com/test, it will respond with my test Google. And, as you can see here, you can read GET parameters, as well as POST and anything else; I just don't have an example here for that. Let's run this and bring them up.
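The mock service in the demo is a Node.js app; as a rough Python (Flask) sketch of the same register-a-link, return-a-response idea, mirroring the demo's strings:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def root():
    # Canned response served when the proxy routes google.com here.
    return "I'm a Google"

@app.route("/test")
def test():
    # GET parameters from the incoming request can drive the response.
    return request.args.get("who", "my test Google")

if __name__ == "__main__":
    app.run(port=8080)
```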
C
You will see errors soon. Do you see my Docker dashboard right now? Yeah. Good. So, as you can see here, this is my main selenium standalone chrome debug container, which I use to execute functional tests and do other things, and those three are my mocking-services containers, including this separate selenium chrome debug.
C
So, what I want to show you: I'll connect to this one, which is my main container, and I'll go to Google Chrome.
C
Yeah, so this is my main one, which does not have any proxy set up here. And if we go to another one, which is on a different port, and we try to open the Chrome browser, and I know what's going to happen right now, it will show an error. And, right now, I don't know why this is taking so long. Yes, so: no internet connection.
C
There is my presentation. Another beauty of this approach, compared to other third-party tools: we investigated one, maybe you're familiar with it, called Mountebank. But the problem with that is that you need to go and configure Magento itself to work with Mountebank. So if you have an integration with google.com, you need to find the place where google.com is hardcoded or configured in Magento, change it to the link which Mountebank provides you, and then it will work.
C
In our case, you have to do zero configuration for your Magento application. It will work even with those URLs and configurations you've already put into Magento.
D
Hey, is there a way to log all requests that the proxy receives?