.Net Foundation ML.NET, 9 Mar 2019

From YouTube: Clustering in ML.NET

Description

Video to show how to perform the clustering machine learning algorithm with K-Means in ML.NET.

Text version of this sample - https://jonwood.co/blog/2019/3/2/clustering-in-mlnet

Data used - https://www.kaggle.com/dongeorge/seed-from-uci

Sample code - https://github.com/jwood803/MLNetExamples/tree/master/MLNetExamples/SeedClustering

Contact:
Twitter: https://twitter.com/JWood/
Blog: https://jonwood.co/

A

Hey, so in my other ML.NET videos, I used datasets that had a label in them so that the algorithms could use it to make a predictive model. However, there are some datasets that don't have those labels, but we can still use them for some machine learning models. One popular type is called clustering, which is where the algorithm attempts to cluster, or categorize, the data into sections that have similar patterns, and we can create a clustering model in ML.NET.

A

The data I'll be using in this video is the wheat seed data, which you can find on Kaggle. This data has different attributes of different wheat seeds, such as the area, perimeter, length, and width of each seed. This data will determine if each seed is in a certain variety of wheat, such as Kama, Rosa, or Canadian.

A

Alright, so I'm back in Visual Studio with the usual .NET Core console project loaded up. I already have ML.NET downloaded, and my data is in the solution here. One thing I tend to forget sometimes when I have data in the solution like this is to mark the file to be copied to the output. Otherwise, the program won't be able to find it. Now, if you've seen my other videos that use ML.NET's new API, then you know what happens next.

A

I'll start by defining the training data location and then create the MLContext instance. Next, to read in the data, I'll use the same context.Data.TextReader class that we've used a few times before, and then tell it how to read in the file: the separator is going to be a comma, and it's going to have a header.

A

For the columns, I'll use the same column names, such as A for the area, P for the perimeter, and so on. All these columns are floats, so I'll use DataKind.R4 and then put what location they are within the file.
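
To make the loading step concrete, here is a minimal sketch in plain Python (not ML.NET code) of what reading a comma-separated file with a header into named float columns amounts to. The column names A and P come from the video; the helper name `load_rows`, the inline sample text, and its numeric values are assumptions made up for illustration.

```python
import csv
import io

# Hypothetical two-row sample: the header names follow the video,
# the numeric values are illustrative only.
SAMPLE = "A,P\n15.26,14.84\n14.88,14.57\n"

def load_rows(text, columns):
    """Parse comma-separated text with a header row into dicts of floats.

    `columns` maps a column name to its zero-based position in the file.
    """
    reader = csv.reader(io.StringIO(text), delimiter=",")
    next(reader)  # skip the header row
    return [
        {name: float(row[index]) for name, index in columns.items()}
        for row in reader
    ]

rows = load_rows(SAMPLE, {"A": 0, "P": 1})
```

Each entry in `columns` plays the same role as a column definition in the video: a name, a float type, and a position in the file.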

A

Next, I'll use the text loader to read from the data location, and then I'll use the context.Clustering.TrainTestSplit method to split my dataset into training and testing data. I'll use a test fraction of 20%.
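
As a rough illustration of what a 20% test fraction means, the sketch below shuffles the rows and holds out one fifth of them. It is plain Python rather than ML.NET, and the function name, the fixed seed, and the rounding rule are assumptions, not how the library necessarily does it.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out `test_fraction` of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Ten stand-in rows: eight end up in training, two in testing.
train, test = train_test_split(range(10), test_fraction=0.2)
```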

A

So now, let's create the pipeline. All I really need to do to the data here is concatenate all of the features into one column called Features, and then I'll append a k-means trainer to it. Then I'll tell it the features column is named Features, and I can give it a cluster count.
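
For readers curious about what a k-means trainer does with that single features column and a cluster count, here is a minimal sketch of the basic algorithm (Lloyd's iteration) in plain Python. It is not the ML.NET implementation: the function names, the toy points, and the choice to start from the first k points are assumptions for illustration, and production trainers typically pick starting centers more carefully.

```python
import math

def nearest(point, centroids):
    """Return the index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, k, iterations=20):
    """Basic Lloyd's iteration: assign points to the nearest centroid, then re-average."""
    centroids = list(points[:k])  # simplification: start from the first k points
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

# Each tuple stands in for one row of the concatenated features column.
points = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0), (1.2, 0.8), (5.2, 4.8), (9.2, 0.8)]
centroids = kmeans(points, k=3)
```

With these toy points the three centroids settle in the middle of the three obvious groups. On real data the result depends on the starting centers and on the cluster count chosen.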

A

Now, since this dataset already has the labels, we can go through it and notice that there are only three unique labels, so we can specify the cluster count to be 3 here. Bear in mind, you would not always have this in your dataset, so you would have to experiment with other cluster counts to get an optimal clustering. Now that we have the pipeline, I can call the Fit method on it, passing in the training data.

A

To perform the same transformations on the testing data, I can call model.Transform and pass in the test data to transform it.

A

So now that we have our model created, let's evaluate it. To do that, I can call the context.Clustering.Evaluate method and give it the transformed testing data, then specify that the score column is named Score and the features column is named Features. The metric I'll be looking at is the average minimum score.

A

You can think of this score for clustering as the average distance from all examples to the center point of their cluster, so the lower the number here, the better the clustering is.
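
The "average distance to the center of the cluster" idea can be sketched in a few lines of plain Python. This is an illustration of the concept, not ML.NET's metric code: it assumes ordinary Euclidean distance, and the video does not say whether the library reports plain or squared distances. The function name and toy points are made up for the example.

```python
import math

def average_min_distance(points, centroids):
    """Mean distance from each point to its closest centroid."""
    total = sum(min(math.dist(p, c) for c in centroids) for p in points)
    return total / len(points)

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Two well-placed centers: every point is exactly 1.0 away from its center.
tight = average_min_distance(points, [(0.0, 1.0), (10.0, 1.0)])

# One shared center in the middle: every point is much farther away.
loose = average_min_distance(points, [(5.0, 1.0)])
```

Here `tight` is lower than `loose`, matching the rule of thumb that a lower score means a better clustering.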

A

Keep in mind that if you ever get an average minimum score of 0, that indicates each example is in its own cluster. All right, now let's have it make an actual prediction. To do this, like in the other examples, I'll create the prediction engine, which I do on the model. This method is generic, so I'll put the SeedData class as the input and the SeedPrediction class as the output. Now, I do need to create these classes, so I'll create the SeedData class, and I'll just paste this in.

A

So you don't have to see me type all this.

A

So with the prediction engine created, I can now call the Predict method on it and give it a SeedData class instance as the input. I have some random values for the fields on this class, so I'll paste that in. To see what cluster this is predicted to be, I can look at the selected cluster ID.
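
Conceptually, predicting a cluster for a new example means finding the closest cluster center. The plain-Python sketch below shows that idea with hypothetical centroids and a hypothetical input; it is not the ML.NET prediction engine, and the zero-based cluster index used here is an assumption that may differ from how the library numbers its clusters.

```python
import math

def predict(sample, centroids):
    """Return the index of the closest centroid and the distance to every centroid."""
    distances = [math.dist(sample, c) for c in centroids]
    cluster_id = distances.index(min(distances))
    return cluster_id, distances

# Hypothetical centers, as a trained model might hold them.
centroids = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0)]
cluster_id, distances = predict((4.5, 5.5), centroids)
```

The returned pair mirrors the two things a clustering prediction usually exposes: the chosen cluster and the distances to all clusters.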

A

So I'll run this and see what we get. All right, we get the average minimum score as well as the predicted cluster our new data will belong in. And there we go: we now have a clustering model built in ML.NET. I hope you enjoyed the video. Thanks for watching, and I'll see you next time.