Description
Using ML.NET to tokenize text data to get it ready for machine learning algorithms.
Hey everyone, so in this video I'll go over some natural language processing in ML.NET. Specifically, I'll show how to tokenize text data so you can pre-process it and get it ready to be used with a machine learning algorithm. But first, real quick:
What exactly does it mean to tokenize text data? When you do that, you essentially split it into each word, or even into each letter, of the input data.
And actually, I forgot to bump up this text here. Sorry about that. There you go, this should be better.
So, we create the MLContext, and we create the empty data, which is just an empty list of our TextData class. With that, we create the IDataView from the empty data set with the LoadFromEnumerable method.
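The setup described so far might look something like this. This is a minimal sketch assuming the Microsoft.ML NuGet package; `TextData` is the simple single-column input class assumed here, the exact names in the video may differ:

```csharp
using System.Collections.Generic;
using Microsoft.ML;

// Input schema: a single text column. (Illustrative class name.)
public class TextData
{
    public string Text { get; set; }
}

class Program
{
    static void Main()
    {
        var mlContext = new MLContext();

        // An empty list is enough: the tokenizer only needs the schema,
        // not any training data, to build its transform.
        var emptyData = new List<TextData>();
        IDataView data = mlContext.Data.LoadFromEnumerable(emptyData);
    }
}
```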
And similar to creating a machine learning pipeline and doing prediction on it, we'll create a prediction engine: we give it the input and output schemas and pass in the model. To tokenize our string, we call Engine.Predict, passing in our TextData with its Text property set to some text that we want to tokenize.
So let's see what we're going to use. We'll give it a text of, you know: "ML.NET is great for machine learning and even deep learning."
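Putting the pieces together, the word-tokenization step might be sketched like this, again assuming the Microsoft.ML package and illustrative `TextData`/`TextTokens` class names:

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;

public class TextData
{
    public string Text { get; set; }
}

// Output schema: the tokens produced by the transform.
public class TextTokens
{
    public string[] Tokens { get; set; }
}

class Program
{
    static void Main()
    {
        var mlContext = new MLContext();
        var emptyData = mlContext.Data.LoadFromEnumerable(new List<TextData>());

        // Split the Text column into word tokens, stored in a Tokens column.
        var pipeline = mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text");
        var model = pipeline.Fit(emptyData);

        var engine = mlContext.Model.CreatePredictionEngine<TextData, TextTokens>(model);
        var result = engine.Predict(new TextData
        {
            Text = "ML.NET is great for machine learning and even deep learning."
        });

        Console.WriteLine(string.Join(Environment.NewLine, result.Tokens));
    }
}
```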
So each word got split out on its own. You'll notice we still have some punctuation here, though, and one way to fix that is with the separators parameter: we can pass in a period, a comma, and any other punctuation that we want it to recognize.
So let's run this again and see what the difference is. There you go: now we have no comma here, no punctuation. So you can use that separators parameter to handle all the punctuation within your text data. Alright!
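The separators tweak is just a different call to the same transform. A sketch of the changed pipeline line, under the same assumptions as before (Microsoft.ML package, illustrative `TextData`/`TextTokens` classes):

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;

public class TextData
{
    public string Text { get; set; }
}

public class TextTokens
{
    public string[] Tokens { get; set; }
}

class Program
{
    static void Main()
    {
        var mlContext = new MLContext();
        var emptyData = mlContext.Data.LoadFromEnumerable(new List<TextData>());

        // Passing extra separator characters tells the tokenizer to split on
        // punctuation as well, so it doesn't stay attached to the words.
        var pipeline = mlContext.Transforms.Text.TokenizeIntoWords(
            "Tokens", "Text", separators: new[] { ' ', '.', ',' });

        var engine = mlContext.Model.CreatePredictionEngine<TextData, TextTokens>(
            pipeline.Fit(emptyData));

        var result = engine.Predict(new TextData
        {
            Text = "ML.NET is great for machine learning and even deep learning."
        });

        Console.WriteLine(string.Join(" | ", result.Tokens));
    }
}
```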
So that's how you tokenize into words. What if you want to tokenize into the different characters, or letters, within your input data? In kind of a similar way, we create a little bit of a pipeline using a similar transform: Transforms.Text.TokenizeIntoCharactersAsKeys instead of TokenizeIntoWords. We'll pass in Tokens as our output column and Text as our input column, and for useMarkerCharacters we'll just set that to false. Now, along with that, I'll append another transform.
That's the Transforms.Conversion.MapKeyToValue transform. Because our input column is Tokens and our output column is also Tokens, we can just pass that one name in, since it's the same for both. We do that because the transform is going to tokenize as keys, so we need to map those keys back to values.
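The character-level pipeline described above might be sketched like this, with the same assumptions as the earlier examples (Microsoft.ML package, illustrative `TextData`/`TextTokens` classes):

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;

public class TextData
{
    public string Text { get; set; }
}

public class TextTokens
{
    public string[] Tokens { get; set; }
}

class Program
{
    static void Main()
    {
        var mlContext = new MLContext();
        var emptyData = mlContext.Data.LoadFromEnumerable(new List<TextData>());

        // Tokenize into characters; the output is stored as key types,
        // so MapKeyToValue converts the keys back into readable characters.
        var pipeline = mlContext.Transforms.Text
                .TokenizeIntoCharactersAsKeys("Tokens", "Text", useMarkerCharacters: false)
            .Append(mlContext.Transforms.Conversion.MapKeyToValue("Tokens"));

        var engine = mlContext.Model.CreatePredictionEngine<TextData, TextTokens>(
            pipeline.Fit(emptyData));

        var result = engine.Predict(new TextData
        {
            Text = "ML.NET is great for machine learning and even deep learning."
        });

        Console.WriteLine(string.Join(Environment.NewLine, result.Tokens));
    }
}
```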
Then, after that, we do the Fit on that transform pipeline, still on that empty data that we created before. We create an engine, call Predict on it, and, you know what, we'll give it the same input here.
All right, so here's our first one, where we tokenized each word, and here's our second one: each character has been tokenized on its own. And these kind of question-mark things here indicate that there's a space.
Alright, that's about it. I just wanted to show you how you can tokenize within ML.NET, to kind of prepare for some natural language processing.
So thanks for watching everyone, I hope you learned from this video, and if you enjoyed it, please like and subscribe so you can get more content.