.NET Foundation ML.NET, 26 Aug 2019

From YouTube: Data Preparation in ML.NET: Select, Drop, Take, Shuffle, and Filter

Description

Data preparation is important for getting your data into a workable state for a machine learning algorithm. This video shows how to select columns, drop columns, take rows, shuffle rows, and filter rows in ML.NET to start preparing your data.

Code - https://github.com/jwood803/MLNetExamples/blob/master/MLNetExamples/DataPrepRowsColumns/Program.cs

Contact:
Twitter: https://twitter.com/JWood/
Blog: https://jonwood.co/

Gear used (affiliate links):
Mic - https://amzn.to/2YEXtxI
Mouse - https://amzn.to/2ZtASoQ

A

So, often in machine learning you will need to perform some sort of data preparation steps on your data to get it ready for the machine learning algorithms, and there are quite a few things that you can do to prepare data. So I have these as a series of videos, and in this video I'll start by showing how you can select, shuffle, and filter columns in ML.NET. All right, so we're here in Visual Studio. I have a console project loaded and, as you can see, I have some setup already done here.

A

I have the NuGet package already installed, I've already created my context, and I've already loaded in the data. We're using the housing dataset again here, and I already have my input schema. The first thing I want to show is how you can select columns. These are the columns that you want to take from your original dataset. SelectColumns is going to be a transform, so it's context.Transforms.SelectColumns, and we give it the column names as strings.

A

One thing to note here: look at our original dataset. The headers are spelled differently; they have underscores and all that. But in our input schema we removed the underscores and even used some different casing. When you call SelectColumns, since we loaded the data in with this schema class, we need to give it the column names as they are represented in this file instead of the ones in the original data.
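
The point about header names versus schema names can be illustrated with a minimal sketch. The `HousingData` class, its property names, the column indices, and the sample row are all assumptions made for illustration; the actual class used in the video is in the linked repository. The sketch writes a tiny CSV whose header uses underscores, loads it through the schema class, and prints the column names that ML.NET exposes.

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input schema: property names and indices are assumptions.
public class HousingData
{
    [LoadColumn(0)] public float Longitude { get; set; }
    [LoadColumn(1)] public float Latitude { get; set; }
    [LoadColumn(2)] public float HousingMedianAge { get; set; }
    [LoadColumn(3)] public float TotalRooms { get; set; }
    [LoadColumn(4)] public float TotalBedrooms { get; set; }
    [LoadColumn(5)] public float Population { get; set; }
}

public static class SchemaDemo
{
    public static string[] LoadedColumnNames()
    {
        // Write a one-row CSV whose header uses underscores (made-up sample values).
        var path = Path.Combine(Path.GetTempPath(), "housing-sample.csv");
        File.WriteAllLines(path, new[]
        {
            "longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population",
            "-122.23,37.88,41,880,129,322",
        });

        var context = new MLContext();
        var data = context.Data.LoadFromTextFile<HousingData>(
            path, hasHeader: true, separatorChar: ',');

        // The column names come from the schema class, not from the file header.
        return data.Schema.Select(column => column.Name).ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", LoadedColumnNames()));
    }
}
```

Names such as `HousingMedianAge`, rather than `housing_median_age`, are what later calls like SelectColumns would expect under this assumed schema.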

A

So, with SelectColumns, you see I'll select the housing median age and the total bedrooms. I'll keep that for now, and then, since it's a transform, we need to fit and transform that with our data.

A

So we call Fit on the select-columns transform with our data, and we call Transform from there on the same data. Now, real quick, I'll just paste in a helper method.
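
The select, fit, and transform steps described above might look like the following sketch. To stay self-contained it uses a hypothetical `HousingData` class and a few made-up in-memory rows instead of the housing file from the video.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

// Hypothetical row type with made-up values, used only for illustration.
public class HousingData
{
    public float Latitude { get; set; }
    public float Longitude { get; set; }
    public float HousingMedianAge { get; set; }
    public float TotalBedrooms { get; set; }
}

public static class SelectColumnsDemo
{
    public static string[] SelectedColumnNames()
    {
        var context = new MLContext();
        var data = context.Data.LoadFromEnumerable(new[]
        {
            new HousingData { Latitude = 37.88f, Longitude = -122.23f, HousingMedianAge = 41f, TotalBedrooms = 129f },
            new HousingData { Latitude = 37.86f, Longitude = -122.22f, HousingMedianAge = 21f, TotalBedrooms = 1106f },
        });

        // SelectColumns is an estimator: name the columns to keep, then fit and transform.
        var selectColumns = context.Transforms.SelectColumns("HousingMedianAge", "TotalBedrooms");
        var transformed = selectColumns.Fit(data).Transform(data);

        return transformed.Schema.Select(column => column.Name).ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", SelectedColumnNames()));
    }
}
```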

A

So we can see our data pretty easily. What it does is take in the IDataView from ML.NET and call the Preview function on it, and we tell it to give us only the first five rows. From the preview I get all the rows, and then I just print out the keys and the values for each row, so it's going to be the column name and the value of the column.
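
A helper along the lines just described could be sketched as follows. The name `DisplayColumns`, the separator line, and the sample rows are assumptions; the video's own helper is in the linked repository.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

// Hypothetical row type used only to have something to preview.
public class HousingData
{
    public float HousingMedianAge { get; set; }
    public float TotalBedrooms { get; set; }
}

public static class PreviewDemo
{
    // Prints "column name: value" for at most the first five rows of a data view.
    public static void DisplayColumns(IDataView data)
    {
        var preview = data.Preview(maxRows: 5);

        foreach (var row in preview.RowView)
        {
            foreach (var column in row.Values)
            {
                Console.WriteLine($"{column.Key}: {column.Value}");
            }
            Console.WriteLine("-----");
        }
    }

    public static void Main()
    {
        var context = new MLContext();

        // Six made-up rows, so the five-row limit is visible in the output.
        var rows = Enumerable.Range(1, 6)
            .Select(i => new HousingData { HousingMedianAge = 10f * i, TotalBedrooms = 100f * i })
            .ToArray();

        DisplayColumns(context.Data.LoadFromEnumerable(rows));
    }
}
```

Preview is intended for debugging and inspection, which is why it suits a small display helper like this one.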

A

With this transform, we can now call DisplayColumns on it, and I'll add a Console.ReadLine so we don't need to set any breakpoints or anything. So let's run this and see what it looks like. All right, so we just get the housing median age and the total bedrooms. Those are the two columns that we told it to select.

A

I'll comment that out, and next is dropping columns. Instead of selecting the columns that you want to use in your algorithms, DropColumns will just drop the columns that you don't need. Similar to SelectColumns, this is going to be a transform, so it's context.Transforms.DropColumns, and again we just give it strings of the columns that we want to drop. So I'll drop the latitude and the longitude columns.

A

And then I'll call Fit and Transform on them. I'm continuing to use the original dataset with this as well, so we'll get the full dataset with it, and then I'll display those.
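
The drop step follows the same fit-and-transform pattern. A minimal sketch, again with a hypothetical `HousingData` class and made-up in-memory rows standing in for the housing file:

```csharp
using System;
using System.Linq;
using Microsoft.ML;

// Hypothetical row type with made-up values, used only for illustration.
public class HousingData
{
    public float Latitude { get; set; }
    public float Longitude { get; set; }
    public float HousingMedianAge { get; set; }
    public float TotalRooms { get; set; }
}

public static class DropColumnsDemo
{
    public static string[] RemainingColumnNames()
    {
        var context = new MLContext();
        var data = context.Data.LoadFromEnumerable(new[]
        {
            new HousingData { Latitude = 37.88f, Longitude = -122.23f, HousingMedianAge = 41f, TotalRooms = 880f },
            new HousingData { Latitude = 37.86f, Longitude = -122.22f, HousingMedianAge = 21f, TotalRooms = 7099f },
        });

        // DropColumns names the columns to remove; every other column is kept.
        var dropColumns = context.Transforms.DropColumns("Latitude", "Longitude");
        var transformed = dropColumns.Fit(data).Transform(data);

        return transformed.Schema.Select(column => column.Name).ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", RemainingColumnNames()));
    }
}
```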

A

If we go back to our original dataset, the latitude and longitude are the first two items in the dataset. So if I run that, we see it starts at housing median age, and we don't see the latitude and longitude anywhere in our dataset. The next thing we can do is shuffle our rows.

A

If you want to do a random shuffle, for sampling our data or something like that, we can call the ShuffleRows method. That is not a transform, so it'll be context.Data.ShuffleRows. We give it our data, and we have the option to give it a seed parameter. Giving it a seed tells it to shuffle the same way every time.

A

You get more reproducible outcomes that way. So I give it a seed, and that's all I have to do for ShuffleRows. Then we can call DisplayColumns on it, and in fact I'm going to call it first with the original data, so we can show that it actually shuffled the data.

A

Let me do another Console.WriteLine to differentiate between those two DisplayColumns calls [inaudible]. So here's our first set, before our shuffle, and we can see that the total rooms here are different from the total rooms here. So we can tell that our data has been shuffled.
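
A minimal sketch of the shuffle step, printing the row order before and after. The seed value, the `HousingData` class, and the in-memory rows are assumptions; the seed used in the video is not stated in the transcript.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

// Hypothetical row type with made-up values, used only for illustration.
public class HousingData
{
    public float TotalRooms { get; set; }
}

public static class ShuffleRowsDemo
{
    public static float[] ShuffledTotalRooms(int seed)
    {
        var context = new MLContext();
        var rows = Enumerable.Range(1, 10)
            .Select(i => new HousingData { TotalRooms = 100f * i })
            .ToArray();
        var data = context.Data.LoadFromEnumerable(rows);

        // ShuffleRows lives on context.Data, not context.Transforms, so there is no Fit step.
        var shuffled = context.Data.ShuffleRows(data, seed: seed);

        return context.Data
            .CreateEnumerable<HousingData>(shuffled, reuseRowObject: false)
            .Select(row => row.TotalRooms)
            .ToArray();
    }

    public static void Main()
    {
        var original = Enumerable.Range(1, 10).Select(i => 100f * i);
        Console.WriteLine("Original: " + string.Join(", ", original));

        // The seed value 42 is arbitrary.
        Console.WriteLine("Shuffled: " + string.Join(", ", ShuffledTotalRooms(42)));
    }
}
```

Shuffling changes only the order of the rows, so the shuffled data should contain the same values as the original.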

A

The other thing we can do is take just the number of rows that we want.

A

Just like ShuffleRows, TakeRows is on the Data property of the context, and we just tell it the number of rows we want to take. I'll say two, and that way, since our DisplayColumns method shows the top five, we should only get two rows when we call this. Here we go.
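
Taking the first two rows, as described above, might be sketched like this. The `HousingData` class and the in-memory rows are assumptions for illustration.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

// Hypothetical row type with made-up values, used only for illustration.
public class HousingData
{
    public float TotalRooms { get; set; }
}

public static class TakeRowsDemo
{
    public static float[] FirstRows(long count)
    {
        var context = new MLContext();
        var rows = Enumerable.Range(1, 10)
            .Select(i => new HousingData { TotalRooms = 100f * i })
            .ToArray();
        var data = context.Data.LoadFromEnumerable(rows);

        // TakeRows keeps only the first `count` rows of the data view.
        var taken = context.Data.TakeRows(data, count);

        return context.Data
            .CreateEnumerable<HousingData>(taken, reuseRowObject: false)
            .Select(row => row.TotalRooms)
            .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", FirstRows(2)));
    }
}
```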

A

We got the first two rows from my dataset. The last thing we can do is filter our rows. This is also going to be on the Data property of the context, and we can filter rows by column. But I just want you to note that there is also a FilterRowsByMissingValues. We don't have any missing values in this dataset, but that's something to make note of, that it is there

A

if you need it. Say I want to filter on the population column, and I'm going to tell it the lower bound is zero and the upper bound is a thousand. You can probably tell this one only works with numerical data. Now we can display those. Looking at population here, we've got nothing over a thousand. If we look at our original dataset, we can see the population in the first couple of rows is actually over a thousand, so it filtered out those rows. All right.
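
The range filter on the population column could be sketched as follows. The `HousingData` class and the population values are made up for illustration; the bounds of zero and one thousand come from the transcript.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

// Hypothetical row type with made-up values, used only for illustration.
public class HousingData
{
    public float Population { get; set; }
}

public static class FilterRowsDemo
{
    public static float[] PopulationsInRange(double lowerBound, double upperBound)
    {
        var context = new MLContext();
        var data = context.Data.LoadFromEnumerable(new[]
        {
            new HousingData { Population = 322f },
            new HousingData { Population = 2401f },
            new HousingData { Population = 496f },
            new HousingData { Population = 1206f },
            new HousingData { Population = 558f },
        });

        // Keeps rows whose Population falls inside the range. Per the ML.NET
        // documentation, the lower bound is inclusive and the upper bound exclusive.
        var filtered = context.Data.FilterRowsByColumn(
            data, "Population", lowerBound: lowerBound, upperBound: upperBound);

        return context.Data
            .CreateEnumerable<HousingData>(filtered, reuseRowObject: false)
            .Select(row => row.Population)
            .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", PopulationsInRange(0, 1000)));
    }
}
```

For data with gaps, `context.Data.FilterRowsByMissingValues(data, "Population")` would instead drop the rows where the named column has a missing value.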

A

So that's all I wanted to show on data prep, in terms of taking and filtering rows, that you would need to get started on data preparation within ML.NET. I hope you enjoyed the video. If you want to see more, feel free to subscribe, and we'll see you next time. Thanks.