.NET Foundation ML.NET, 30 Sep 2019

From YouTube: Natural Language Processing in ML.NET - Removing Stop Words in Text Data

Description

Using ML.NET to remove stop words in text data.

Code - https://github.com/jwood803/MLNetExamples/blob/master/MLNetExamples/StopWords/Program.cs

ML.NET NLP Playlist - https://www.youtube.com/playlist?list=PLl_upHIj19ZzYBP8I7l9MDQY3r6HbVxWw

ML.NET Playlist - https://www.youtube.com/playlist?list=PLl_upHIj19Zy3o09oICOutbNfXj332czx

Contact:
Twitter: https://twitter.com/JWood/
Blog: https://jonwood.co/

Gear used (affiliate links):
Mic - https://amzn.to/2YEXtxI
Mouse - https://amzn.to/2ZtASoQ

A

Hey everyone. In the last ML.NET video, we went over tokenization for natural language processing. To continue with that kind of pre-processing for text data, I'll go over how to remove stop words from your text data. Stop words are words that we, as people, use to add extra context to what we're saying and to help others understand what we mean when we're talking. But for machines, stop words just add extra noise.

A

They don't really mean anything in terms of the overall text, so we remove that noise and leave only the words that we actually want the machine to train on for our natural language processing. All right, so I'm in Visual Studio, and I have a .NET Core console project loaded. I already have a few things set up here: I have ML.NET installed, using version 1.3.1, and I have a couple of classes here, starting with the TextData class.

A

TextData just has our input text, and the TextTokens class has the output of our tokens as a string array. So the first thing, as usual, is to create a new MLContext, and this is kind of similar to the tokenization.

A

Stop word removal is similar in that we don't need to provide any data to train on, so we can create empty data. I'll create an empty list of the TextData class.

A

And then we'll create an IDataView of that empty data by using context.Data.LoadFromEnumerable. Now, before we remove stop words, we do need to tokenize the data first. Tokenization is usually the very first step in natural language processing pre-processing.
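The setup described so far can be sketched roughly as follows. This is a minimal sketch, not the code from the video: it assumes the Microsoft.ML NuGet package is referenced, the property names Text and Tokens are inferred from the narration, and the SetupSketch class and CreateEmptyData method are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;

// Input row: a single string column, named after the property ("Text").
public class TextData
{
    public string Text { get; set; }
}

// Output row: the tokens produced by the transforms, as a string array.
public class TextTokens
{
    public string[] Tokens { get; set; }
}

// Hypothetical wrapper class for this sketch.
public static class SetupSketch
{
    public static IDataView CreateEmptyData(MLContext context)
    {
        // These transforms do not learn anything from data, so an empty
        // list is enough to give the pipeline a schema to fit against.
        var emptyData = new List<TextData>();
        return context.Data.LoadFromEnumerable(emptyData);
    }

    public static void Main()
    {
        var context = new MLContext();
        IDataView data = CreateEmptyData(context);

        // List the columns ML.NET inferred from the TextData class.
        foreach (var column in data.Schema)
        {
            Console.WriteLine($"{column.Name}: {column.Type}");
        }
    }
}
```

The empty list carries no rows, but it still gives ML.NET the column names and types it needs when the pipeline is fitted later.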

A

So we can do the same thing that we did in the previous video: use context.Transforms.Text.TokenizeIntoWords, with Tokens as our output column and Text as our input column. We set the separators like we did before, as a new character array with a space, a period, and a comma.
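As a rough illustration of the tokenization step on its own, the sketch below fits just the tokenizer and runs it on one sentence. It assumes the Microsoft.ML NuGet package; the TokenizeSketch class, the Tokenize method, and the sample sentence are hypothetical and are not taken from the video's code.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;

public class TextData
{
    public string Text { get; set; }
}

public class TextTokens
{
    public string[] Tokens { get; set; }
}

// Hypothetical wrapper class for this sketch.
public static class TokenizeSketch
{
    public static string[] Tokenize(string text)
    {
        var context = new MLContext();
        var emptyData = context.Data.LoadFromEnumerable(new List<TextData>());

        // Split the "Text" column into the "Tokens" column on
        // spaces, periods, and commas.
        var tokenizer = context.Transforms.Text.TokenizeIntoWords(
            "Tokens", "Text", separators: new[] { ' ', '.', ',' });

        var model = tokenizer.Fit(emptyData);
        var engine = context.Model.CreatePredictionEngine<TextData, TextTokens>(model);
        return engine.Predict(new TextData { Text = text }).Tokens;
    }

    public static void Main()
    {
        // Illustrative input; any sentence works here.
        var tokens = Tokenize("Stop words are removed after this step, not before.");
        Console.WriteLine(string.Join(" | ", tokens));
    }
}
```

Because the comma and period are separators, they do not stay attached to the neighboring words, which matters later when tokens are compared against a stop word list.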

A

After we get the tokens, we can append on to this pipeline. This is where we use the Transforms again, and this time, on the Text property, we can call RemoveDefaultStopWords. Here we can give it the same output column of Tokens and the same input column of Tokens, since Tokens is the output column of the previous transform.

A

We can also give it a third parameter here, which is going to be the language.

A

ML.NET provides different languages, each with its own dictionary of stop words, so you don't really need to provide your own stop words. There are quite a few languages that ML.NET supports here, not just English; it also supports Arabic, French, Spanish, and Japanese, but we'll use English here. Then, with that pipeline created, we create the model by fitting the pipeline on our empty data.

A

Then we can create a prediction engine with TextData as the input and TextTokens as the output, and we pass in that model.
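Put together, the default stop word pipeline described above might look something like this. It is a sketch under assumptions: the Microsoft.ML NuGet package is referenced, the column names follow the narration, and the DefaultStopWordsSketch class and RemoveDefault method are hypothetical names. The linked Program.cs is the authoritative version and may differ in detail.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

public class TextData
{
    public string Text { get; set; }
}

public class TextTokens
{
    public string[] Tokens { get; set; }
}

// Hypothetical wrapper class for this sketch.
public static class DefaultStopWordsSketch
{
    public static string[] RemoveDefault(string text)
    {
        var context = new MLContext();
        var emptyData = context.Data.LoadFromEnumerable(new List<TextData>());

        // Tokenize first, then filter the tokens in place: "Tokens" is both
        // the input and the output column of the stop word transform.
        var pipeline = context.Transforms.Text
            .TokenizeIntoWords("Tokens", "Text", separators: new[] { ' ', '.', ',' })
            .Append(context.Transforms.Text.RemoveDefaultStopWords(
                "Tokens", "Tokens",
                language: StopWordsRemovingEstimator.Language.English));

        // Fitting on empty data is enough; nothing is learned from rows.
        var model = pipeline.Fit(emptyData);
        var engine = context.Model.CreatePredictionEngine<TextData, TextTokens>(model);
        return engine.Predict(new TextData { Text = text }).Tokens;
    }

    public static void Main()
    {
        var tokens = RemoveDefault("This is a test sentence, and it is a good one.");
        Console.WriteLine(string.Join(" ", tokens));
    }
}
```

Switching the language is a matter of passing a different member of the StopWordsRemovingEstimator.Language enum, for example French or Spanish.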

A

Then we can perform a kind of prediction, where it runs those transforms on some input data. So here's engine.Predict, and we create a new TextData instance and give it some input text to perform the pre-processing on. Let's see, we can do: "This is a test sentence, and it is a good one." We'll just use that as an example. Like in the previous video, I have this PrintTokens method, which is a helper method to print out the tokens to the console.

A

What it does is take in the TextTokens and use a StringBuilder to go through each of the tokens and append it to the StringBuilder. Then, at the end, I just write out everything in the StringBuilder to the console.
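A helper along the lines described might look like the sketch below. The one-token-per-line layout and the separate FormatTokens method are assumptions made for illustration; the narration only says that the tokens are appended to a StringBuilder and written out at the end. This block needs no ML.NET reference.

```csharp
using System;
using System.Text;

public class TextTokens
{
    public string[] Tokens { get; set; }
}

// Hypothetical wrapper class for this sketch.
public static class TokenPrinter
{
    // Builds the text to print: one token per line (an assumed layout).
    public static string FormatTokens(TextTokens tokens)
    {
        var sb = new StringBuilder();

        foreach (var token in tokens.Tokens)
        {
            sb.AppendLine(token);
        }

        return sb.ToString();
    }

    // Writes everything collected in the StringBuilder to the console.
    public static void PrintTokens(TextTokens tokens)
    {
        Console.WriteLine(FormatTokens(tokens));
    }

    public static void Main()
    {
        PrintTokens(new TextTokens { Tokens = new[] { "test", "sentence", "good" } });
    }
}
```

Keeping the formatting separate from the console write is a small design choice that makes the helper easy to check without capturing console output.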

A

So here are the text tokens for that. Let's run it and see what we get. And I forgot to add a Console.ReadLine, so I'll add that real quick. Let's run this again, and our console will stay up this time. All right, so we see we only get three words back here: "test", "sentence", and "good". Everything else ML.NET considered a stop word, so it got rid of it. So that's how you can use the built-in default stop words that ML.NET provides.

A

But what if you do want to provide your own stop words? Well, you can do that with a similar kind of pipeline. I'll do this real quick: I'll create another pipeline, and I'll start off by tokenizing into words, the same as we did before, and then append another transform.

A

Instead of RemoveDefaultStopWords, I'll just say RemoveStopWords, and I'll use the same input and output columns of Tokens. The third parameter is a string array of stop words that we can provide, so I'll give it a small list. I'll just say "a" and "it", and then we'll see what happens with that. We'll just fit that into a model and create a prediction engine.

A

Then I'll copy the previous Predict call that we had before, just with the new engine. I'll keep the same text as it was, print out these tokens, and run this to see what we get. All right, so that came back. We see we get the original results with the default stop words, but in the second set we get the result where we just removed our custom stop words, which was just two words.
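The custom stop word variant can be sketched in the same way. As before, this assumes the Microsoft.ML NuGet package; the CustomStopWordsSketch class and RemoveCustom method are hypothetical names, and only the two stop words "a" and "it" come from the narration.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;

public class TextData
{
    public string Text { get; set; }
}

public class TextTokens
{
    public string[] Tokens { get; set; }
}

// Hypothetical wrapper class for this sketch.
public static class CustomStopWordsSketch
{
    public static string[] RemoveCustom(string text, params string[] stopWords)
    {
        var context = new MLContext();
        var emptyData = context.Data.LoadFromEnumerable(new List<TextData>());

        // Same shape as the default pipeline, but the stop word list is
        // supplied by the caller instead of coming from a built-in dictionary.
        var pipeline = context.Transforms.Text
            .TokenizeIntoWords("Tokens", "Text", separators: new[] { ' ', '.', ',' })
            .Append(context.Transforms.Text.RemoveStopWords(
                "Tokens", "Tokens", stopWords));

        var model = pipeline.Fit(emptyData);
        var engine = context.Model.CreatePredictionEngine<TextData, TextTokens>(model);
        return engine.Predict(new TextData { Text = text }).Tokens;
    }

    public static void Main()
    {
        var tokens = RemoveCustom(
            "This is a test sentence, and it is a good one.", "a", "it");
        Console.WriteLine(string.Join(" ", tokens));
    }
}
```

With only two words in the list, most of the sentence survives, which is the contrast with the default dictionary that the video shows.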

A

So we get a lot more data back here. And that's not only how to use the default built-in stop word dictionary that comes with ML.NET, but also how you can remove your own stop words, in case you just want to use a small subset of stop words. Thanks for watching, and we'll see you all next time.