From YouTube: Natural Language Processing in ML.NET: Producing N-Grams
Description
How to produce n-grams from text data in ML.NET.
Code - https://github.com/jwood803/MLNetExamples/blob/master/MLNetExamples/NGrams/Program.cs
N-Gram article - https://blog.xrds.acm.org/2017/10/introduction-n-grams-need/
ML.NET NLP Playlist - https://www.youtube.com/playlist?list=PLl_upHIj19ZzYBP8I7l9MDQY3r6HbVxWw
ML.NET Playlist - https://www.youtube.com/watch?v=8gVhJKszzzI&list=PLl_upHIj19Zy3o09oICOutbNfXj332czx
Contact:
Twitter: https://twitter.com/JWood/
Blog: https://jonwood.co/
Gear used (affiliate links):
Mic - https://amzn.to/2YEXtxI
Mouse - https://amzn.to/2ZtASoQ
Hey everyone, so in this video I want to show you how you can produce n-grams in ML.NET. But before I go straight to the code, I want to very briefly go over what n-grams are and how they are useful. The more technical definition is that n-grams are sequences of n words together, and n can be however many you want; usually you'll see these as two-grams or three-grams. Let's go through just a couple of examples.
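As a quick illustration (this snippet is mine, not from the video's code), here is what bigrams look like for a short token list:

    using System;
    using System.Linq;

    // For the tokens ["the", "quick", "brown", "fox"], the bigrams (n = 2) are
    // "the quick", "quick brown", and "brown fox"; the trigrams (n = 3) would
    // be "the quick brown" and "quick brown fox".
    string[] tokens = { "the", "quick", "brown", "fox" };
    var bigrams = tokens.Zip(tokens.Skip(1), (a, b) => $"{a} {b}");
    Console.WriteLine(string.Join(", ", bigrams)); // the quick, quick brown, brown fox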
The input column that we're going to use is going to be the tokens column that we created in the two transforms above, and we can tell it what n-gram lengths we want to use. So if we want to get bigrams, trigrams, or anything else above that, we can set it here; I'll just limit it to two. And to actually limit the n-gram length, we need to set the useAllLengths parameter to false.
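Here is a minimal sketch of what the pipeline described so far might look like. The column names ("Text", "Tokens", "Ngrams") and the use of TokenizeIntoWords and MapValueToKey as the two upstream transforms are my assumptions based on the narration, not a copy of the video's code:

    using Microsoft.ML;
    using Microsoft.ML.Transforms.Text;

    var mlContext = new MLContext();

    // Tokenize the input text, map the tokens to keys (ProduceNgrams expects
    // key-typed input), then produce the n-grams.
    var pipeline = mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text")
        .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
        .Append(mlContext.Transforms.Text.ProduceNgrams("Ngrams", "Tokens",
            ngramLength: 2,       // limit to bigrams
            useAllLengths: false, // otherwise all lengths up to ngramLength are produced
            weighting: NgramExtractingEstimator.WeightingCriteria.Tf)); // discussed below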
Otherwise, it's going to get the other lengths of n-grams as well. We can also give it a weighting; here that's the weighting criterion, and we have a couple of different choices. Let's briefly go over these.
First, we have Tf, which is term frequency; that gets the frequency, the number of times the term appears within the corpus. The corpus, in terms of natural language processing, is pretty much our input text data, so it's going to be what we have up here.
Then we have Idf, which is inverse document frequency, which tells how rare the term is within the corpus. And then you have TfIdf, which is the product of the term frequency and the inverse document frequency.
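For reference, the standard formulations (my summary; the video doesn't spell these out) are: tf(t) is the raw count of term t; idf(t) = log(N / n_t), where N is the number of documents and n_t is the number of documents containing t; and tf-idf(t) = tf(t) × idf(t). So a term that appears often, but only in a few documents, gets a high tf-idf score.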
For this example, I'll just keep it as term frequency.
Here we go. Now that we have our n-gram pipeline, we can fit it on our data here.
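Fitting and transforming would look something like this (again a sketch, assuming a data IDataView loaded earlier):

    // Fit the pipeline and run the data through it.
    var transformer = pipeline.Fit(data);
    var transformedData = transformer.Transform(data);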
So now we have our data transformed into n-grams here, but we still have it as an IDataView. There are some more steps we can do to actually produce the n-grams themselves.
The first thing we need to do is get the n-gram slot names, and we get that with the GetSlotNames method from the ML.Data namespace; this takes in a reference to a VBuffer of ReadOnlyMemory of characters. So let's create that.
I'll call it slotNames and just set it to the default value of this type. Let's see, we do have an error here: we need to upgrade our language version to use that, and Visual Studio can do that for us. Then we just pass the ref of slotNames, and from there I can get the n-grams column using the GetColumn method.
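Concretely, that part might look like this (a sketch; I'm assuming the n-gram output column is named "Ngrams"):

    using System;
    using Microsoft.ML.Data;

    // GetSlotNames fills a VBuffer with the n-gram names; the "default"
    // literal on a declared type needs C# 7.1 or later.
    VBuffer<ReadOnlyMemory<char>> slotNames = default;
    transformedData.Schema["Ngrams"].GetSlotNames(ref slotNames);

    // Each row is a vector of n-gram weights, aligned with the slot names.
    var ngramsColumn = transformedData.GetColumn<VBuffer<float>>("Ngrams");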
Okay, so now we have this reference to the n-grams column, which is what we get from this transform up here. Let's do a Console.WriteLine: I'm going to do a foreach loop over each row in our n-grams column, and we're going to do another foreach within there, for each item in the row's items, and then we just do our Console.WriteLine.
Here we can use the slot names item at that item's key index. I'll just create an empty line after each row, and then I'll do a Console.ReadLine so the console doesn't disappear when I run this.
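The printing loop might look like this (sketch, same assumed column and variable names as above; System.Linq is needed for ToArray):

    // Print each n-gram's name and its weighting value, one block per row.
    var slots = slotNames.DenseValues().ToArray();
    foreach (var row in ngramsColumn)
    {
        foreach (var item in row.Items())
        {
            Console.WriteLine($"{slots[item.Key]} - {item.Value}");
        }
        Console.WriteLine(); // empty line between rows
    }
    Console.ReadLine(); // keep the console window open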
Now let's actually run this and see what we get. Here we go: we got our n-grams from the first input, and then the first and the second input together here. You can see we get two items here, so we do get our bigrams.