From YouTube: Example: Data Cleanup
Description
Latest release version of this tutorial: https://intel.github.io/dffml/examples/data_cleanup/index.html
Current development version of this example: https://intel.github.io/dffml/master/examples/data_cleanup/index.html
During my tenure at Google Summer of Code I actually worked on two projects. One of the projects was related to accuracy scorers, and here I have created documentation for it.
So this was the accuracy scorer that I worked on. As you can see, all of these accuracy scorers actually come from scikit-learn's metrics methods, and all of these scorers are part of the set that I have integrated.
So we have scorers for regression, scorers for classification, and scorers for clustering, and we also have supervised scorers, which are the models' default scores. Every scikit-learn model comes with a default score, so if you would like to use the default score, you can use the scikit-learn model's score() method.
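To make that "default score" idea concrete, here is a minimal scikit-learn sketch on made-up toy data: every estimator exposes a score() method, which is R² for regressors and accuracy for classifiers.

```python
# Minimal sketch of the default scorer: every scikit-learn estimator has a
# .score() method (R^2 for regressors, accuracy for classifiers).
# The data below is made up purely for illustration.
from sklearn.linear_model import LinearRegression

X = [[1.0], [2.0], [3.0], [4.0]]
y = [2.0, 4.0, 6.0, 8.0]

model = LinearRegression().fit(X, y)
print(model.score(X, y))  # R^2 on the training data, here 1.0
```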
As we know, most data scientists and data engineers spend something like 80 percent of their time on data cleanup, so this project was targeted at how we can preprocess and clean up data in the dataflow itself. I have written documentation for it here, and you can always visit the documentation with this link. I will be running some of the code from this documentation and showing you how we can actually preprocess data using the dataflow.
So, I think there are, yes, close to 21,000 entries in this data set, and we will be using it. We will be doing this in two steps.
First, we will train on this data set as it is and check what accuracy we are getting. Then we will apply some of our own preprocessing on top of it and see what accuracy we get out of that.
So here is the training part, the normal model training part. We are using a model from scikit-learn, we have given the features and the feature to predict, and we will be storing the model in a temporary directory. Our source is in the format of a CSV file, as you can see here, and we have also given the source file name.
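As a rough stand-in for that training step, the sketch below trains a plain scikit-learn linear regression from a CSV with pandas. The file name and column names are assumptions for illustration; the tutorial itself drives this through DFFML's model config rather than direct scikit-learn calls.

```python
# Hedged stand-in for the training step: a plain scikit-learn linear
# regression trained from a CSV. "kc_house_data.csv" and the column names
# are assumptions; the tutorial configures this through a DFFML model.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("kc_house_data.csv")
features = ["bedrooms", "bathrooms", "sqft_living"]  # assumed feature names

model = LinearRegression()
model.fit(df[features], df["price"])  # "price" is the feature we predict
```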
And before you start on any of the steps mentioned here, first make sure that you have installed all of the plugins we require for going through this documentation; make sure you have installed all of the packages first.
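As a minimal sketch of that setup step, the snippet below installs the core library and the scikit-learn model plugin from within Python. The exact plugin list for this tutorial may be longer, so check the linked documentation.

```python
# Minimal setup sketch: install the core library and the scikit-learn model
# plugin. The tutorial may require additional plugins; see the documentation.
import subprocess
import sys

subprocess.check_call(
    [sys.executable, "-m", "pip", "install", "dffml", "dffml-model-scikit"]
)
```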
So let's check the accuracy file. This is the accuracy run we are doing: we are taking the same scikit-learn model, and the scorer we are using is the exvscore method, which is the explained variance score. That scorer comes from here in the documentation, under the regression scorers, and that is what we will be using.
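The scorer named here is backed by scikit-learn's explained variance metric. Below is a small illustration of that metric on made-up numbers, separate from DFFML's wrapper around it.

```python
# Illustration of the explained variance score on made-up values; the
# tutorial uses DFFML's scorer wrapper around this same sklearn metric.
from sklearn.metrics import explained_variance_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(explained_variance_score(y_true, y_pred))  # 1.0 would be a perfect fit
```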
So this is the create command we have here. In the create command we are actually trying to create a dataflow, and we define how the dataflow should run, that is, which operation's output should become which operation's input; all of that we define here. We have also provided a config here, which points to the file with the KC house data set.
We have given it to one of the methods, which converts records to a list. Using this method, with the source file, we convert all of the records in the source into the form of a list, because all of the cleanup operations that we have here actually work on a matrix of data and not on a single row of data.
So what we need here is some operation to actually convert all of those records into the form of a list of lists, and that is what this does. It needs a config, so in that config we are providing the file, and we are also providing what type of file it is, namely a CSV file.
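Conceptually, that operation behaves like the small sketch below, which reads a CSV and returns the selected feature columns as a list of lists. This is an illustration of the idea only, not DFFML's actual implementation, and the file and column names are assumptions.

```python
# Conceptual sketch of a convert-records-to-list style operation: read a CSV
# and return the chosen feature columns as a matrix (list of lists).
import csv

def records_to_matrix(path, feature_names):
    with open(path, newline="") as f:
        return [
            [float(row[name]) for name in feature_names]
            for row in csv.DictReader(f)
        ]

matrix = records_to_matrix("kc_house_data.csv", ["bedrooms", "bathrooms"])
```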
Then we are providing the inputs here; these are the inputs for the preprocessing part. Now, here is the flow we have defined. The first operation we are performing on top of the data is the standard scaler. The standard scaler method is an operation that rescales the data so that it has zero mean and unit variance. Then we have also applied another linear operation, which is called principal component analysis. So what will principal component analysis do?
It will reduce the data to a smaller shape. Let's suppose we have a data set of 500 rows and 10 features; using principal component analysis we can reduce that data set to something like 500 rows and, let's say, three features. Here, though, we are not doing that: we are keeping the rows and columns of the data set the same and just trying to extract what the important features in the data set are.
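Here is a minimal scikit-learn sketch of those two preprocessing operations on random stand-in data: standard scaling to zero mean and unit variance, then PCA reducing 10 features down to 3 (whereas the tutorial's dataflow keeps the original number of components).

```python
# Minimal sketch of the two operations described above, on random stand-in
# data: standard scaling, then PCA reducing 10 features down to 3.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(500, 10)  # stand-in for a 500-row, 10-feature data set

X_scaled = StandardScaler().fit_transform(X)      # zero mean, unit variance
X_reduced = PCA(n_components=3).fit_transform(X_scaled)
print(X_reduced.shape)  # (500, 3)
```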
So this is the merge command. Using the merge command, we take a source and a destination, and whatever records the source has, we move to the destination source. In this merge command our source is actually a dataflow source, and using the dataflow source we will run the dataflow we created earlier.
We run it on top of the data set we have, and on running the dataflow we will then have the preprocessed data set. That preprocessed data set we will store in a CSV file, the preprocessed CSV file, and we will store all of it in this file. These are the features that we want to be preprocessed; these are the features we have provided, and this is the dataflow we generated earlier.
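The net effect of that merge step looks roughly like the sketch below: run the preprocessing over the raw CSV and write the result to a new preprocessed CSV. File and column names are assumptions, and the number of components is kept unchanged, matching what the dataflow does here.

```python
# Rough sketch of the merge step's net effect: preprocess the raw CSV and
# write the result to a new file. Paths and column names are assumptions.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = ["bedrooms", "bathrooms", "sqft_living"]  # assumed feature names
raw = pd.read_csv("kc_house_data.csv")

scaled = StandardScaler().fit_transform(raw[features])
reduced = PCA(n_components=len(features)).fit_transform(scaled)

out = raw.copy()
out[features] = reduced
out.to_csv("preprocessed.csv", index=False)
```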
And these are the model features, and this is the file from which we will be predicting. So let's check what accuracy we are getting.
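To mirror the before/after comparison the video runs, here is a hedged sketch that trains and scores the same model on the raw and the preprocessed CSVs and prints both explained variance scores; paths and columns are again assumptions.

```python
# Hedged sketch of the before/after comparison: train and score the same
# model on the raw and preprocessed CSVs. Paths and columns are assumptions.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score

def evaluate(path, features, target="price"):
    df = pd.read_csv(path)
    model = LinearRegression().fit(df[features], df[target])
    return explained_variance_score(df[target], model.predict(df[features]))

features = ["bedrooms", "bathrooms", "sqft_living"]
print("raw:         ", evaluate("kc_house_data.csv", features))
print("preprocessed:", evaluate("preprocessed.csv", features))
```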
So this was the second project I worked on during my tenure at Google Summer of Code, and I'm very thankful to John Andersen and Saksham Arora.