From YouTube: DFFML Overview and Status Update: 2020-03-19
Description
As of 2022-09-23 not much has changed fundamentals wise so this is still a solid overview.
- https://intel.github.io/dffml/main/examples/icecream_sales.html
- https://intel.github.io/dffml/main/examples/shouldi.html
- https://intel.github.io/dffml/main/examples/notebooks/index.html
All right, so what is DFFML? The goal is that we've got this thing that makes machine learning easy to use. It helps us make datasets, it helps us make machine learning models, and it makes it easy to go use those models once you've trained them.

So the basic flow of machine learning is: we've got a problem, and we know that, with enough examples, a human could say "that's what I bet the answer would be." Well, we're going to make it so that the computer does that, and that's machine learning. So we just need to come up with all those examples, and then we need to choose the right algorithm, or model.
The model is what's going to give us high accuracy on this problem, given those examples. Once we have the right examples and the right algorithm, which together produce a model with high accuracy, we know that we've basically solved the problem. That's the gist of things.

And then we want to go use that thing, and DFFML should make that easy too. So we've got three pieces here. We've got the dataset generation side of things, which is powered by directed graph execution. A directed graph is basically like a flow chart, just like this one, and these directed graphs are sequences of things that happen: mostly data-scraping or data-transformation steps.
For example, we'd have some operations that would go and grab the weather for a city given the month, and we could correlate the temperature to the ice cream sales. Then, if we get a new city, we have another operation in this directed graph execution thing which generates the feature data: one of our features is now the temperature, which we're generating based off the city name. So we're taking our existing dataset and combining it with this new generator operation, which is going to generate that temperature value, and we have a new dataset that has city, temperature, month, and ice cream sales. That would be generating the whole dataset. Then, if I get a new city coming in and somebody says "well, what's going to be my ice cream sales in this city?", I can give them a prediction on that single record. That record would basically be one row in that Excel file or that database, saying what the city is. So we go and scrape the weather data, and now we can predict how many ice creams we're going to sell.
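That enrichment step can be sketched in plain Python. The function names and weather values below are made up for illustration; in DFFML this step would be an operation running inside a dataflow rather than a direct function call:

```python
def lookup_temperature(city, month):
    # Stand-in for an operation that scrapes a weather service.
    fake_weather = {("Portland", 7): 27, ("Phoenix", 7): 41}
    return fake_weather[(city, month)]

def enrich(record):
    """Combine an existing record with the generated temperature feature."""
    enriched = dict(record)
    enriched["temperature"] = lookup_temperature(record["city"], record["month"])
    return enriched

# Existing dataset: city, month, and the ice cream sales we observed.
dataset = [
    {"city": "Portland", "month": 7, "ice_cream_sales": 300},
    {"city": "Phoenix", "month": 7, "ice_cream_sales": 900},
]

# Generate the full training dataset: city, temperature, month, sales.
training_data = [enrich(r) for r in dataset]

# A new city arrives with no sales figure yet: enrich that single record
# the same way, then hand it to the trained model for a prediction.
new_record = enrich({"city": "Phoenix", "month": 7})
```

The point is that the same generator runs both when building the whole dataset and when a single new record comes in for prediction.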
After we've generated all that data, all the examples, we're going to feed them to these machine learning models. So the first thing we do is model training, and model training is basically: we give the model all the examples, and we give it the thing that we want it to predict, and then we rerun it through this training process, which is different for each model, until it comes out with an accuracy that we like. That's the accuracy assessment.
Then the other side of this is, once somebody comes in and we have this model with high accuracy and we want to use it: give me a new city, predict me the ice cream sales. That's prediction, using the trained model. And then, of course, through all of this, we have to store and load this data somewhere. All of this lives somewhere: it might be a CSV file, it might be an Excel file, it might be an actual database.
It might be a JSON file; it could be whatever you want, because we've got these abstractions and you can store it wherever you'd like. So the core features here, like we just talked about, are machine learning, dataset generation, and dataset storage. For machine learning, we can do that from the command line, we can do it from Python, or we can do it over HTTP, with an HTTP server that exposes all of this stuff.
Like, you know, a web service. And so for the dataset generation — wait a minute, my slides didn't advance.
There we go. So for the dataset generation, that's where we're using that directed graph execution from the last slide, and this is a concurrent execution environment with managed locking. What that means is we're running all of these little functions that scrape, say, the temperature, like I was talking about, and we're running all of those at the same time.

Well, concurrent is slightly different than parallel, but you can basically think of them as running at the same time. For certain things which you can't operate on at the same time, though, the execution environment will make sure that your functions don't run at the same time. So if two things take a city name, and only one can operate on a city name at a time, then only one will run at a time.
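A minimal sketch of that managed-locking idea, using `asyncio` directly rather than DFFML's orchestrator (the operation names and the per-city lock table here are invented for illustration; DFFML manages this locking for you):

```python
import asyncio
from collections import defaultdict

# One lock per city name: two operations that act on the same city
# are serialized, everything else runs concurrently.
city_locks = defaultdict(asyncio.Lock)
log = []

async def scrape_weather(city):
    async with city_locks[city]:
        log.append(f"weather:{city}")
        await asyncio.sleep(0.01)  # stand-in for network I/O

async def scrape_sales(city):
    async with city_locks[city]:
        log.append(f"sales:{city}")
        await asyncio.sleep(0.01)  # stand-in for network I/O

async def main():
    # All four tasks start "at the same time"; the two Portland tasks
    # take turns because they share a lock, and likewise for Phoenix.
    await asyncio.gather(
        scrape_weather("Portland"),
        scrape_sales("Portland"),
        scrape_weather("Phoenix"),
        scrape_sales("Phoenix"),
    )

asyncio.run(main())
```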
All right, so we've got this high-level Python API, and this is basically what it looks like when you're actually training the models, assessing the accuracy, and making predictions. You basically say model equals whatever model you want, and here's the features, which is the input data that we want to train on. These are the examples.
Each one of these is an example record, and we want to have the model look at the years, expertise, and trust, and we want it to predict the salary. So it's going to be looking at this data and saying: okay, given this data, try to predict the salary number. So this trains the model, and then we say: okay, let's see what kind of accuracy this model is at.
This is a linear model, so, as you can see, it's 10, 20, 30, 40, 50, 60. We provide it with some more examples that are consistent with this linearity, for the linear regression model, and of course we're going to get an accuracy of 100 percent there. Obviously, if we provided different data, we would get a different accuracy, but for the sake of the example, that's what we're doing. Then we make a prediction, and in this case we're not going to pass in the true salary. We have just years, expertise, and trust, and we're saying: well, predict me what the salary might be. As you can see, once again it's all linear, so we chose input examples that are going to give us the next value in this sequence. That's basically what it looks like using this stuff from Python. And here — oops, it looks like we're in the middle of this GIF; I don't know why it doesn't reset the GIF on changing the page.
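The train, assess accuracy, predict flow just described can be sketched with a hand-rolled one-feature linear regression, using the perfectly linear 10, 20, 30... salary data from the example. This is a conceptual sketch, not DFFML's actual `Model` API (see the DFFML quickstart for the real classes):

```python
def train(examples):
    """Fit salary = m * years + b by least squares on the example records."""
    xs = [r["Years"] for r in examples]
    ys = [r["Salary"] for r in examples]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    b = mean_y - m * mean_x
    return m, b

def accuracy(model, examples):
    """Fraction of held-out examples the model predicts exactly."""
    m, b = model
    hits = sum(1 for r in examples if round(m * r["Years"] + b) == r["Salary"])
    return hits / len(examples)

def predict(model, record):
    m, b = model
    return m * record["Years"] + b

# Training data: salaries 10, 20, 30, 40; test data continues the line.
training = [{"Years": y, "Salary": 10 * (y + 1)} for y in range(4)]
test_set = [{"Years": 4, "Salary": 50}, {"Years": 5, "Salary": 60}]

model = train(training)
print(accuracy(model, test_set))       # 1.0 for this perfectly linear data
print(predict(model, {"Years": 6}))    # next in the sequence: 70.0
```

Because the data is exactly linear, the accuracy comes out at 100 percent and the prediction is just the next value in the sequence, as in the talk.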
So here we're going to do the same thing that we just did in Python, but from the command line. The first thing we do is obviously install the package, and then we create these training, test, and predict CSV files, which are comma-separated values, kind of like your Excel file type thing. There are the files; you can see I'm listing the directory. We now train the model. We do the same thing: we say, okay, we want the scikit linear regression model.
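The CSV files for this walkthrough might look like the following. The column names and values are modelled on the DFFML quickstart and may differ from the exact files shown in the video; the commented `dffml train` invocation at the end is likewise an approximation, since flags vary between releases:

```shell
# Create the training and test files the walkthrough refers to.
cat > training.csv <<'EOF'
Years,Expertise,Trust,Salary
0,1,0.1,10
1,3,0.2,20
2,5,0.3,30
3,7,0.4,40
EOF

cat > test.csv <<'EOF'
Years,Expertise,Trust,Salary
4,9,0.5,50
5,11,0.6,60
EOF

ls training.csv test.csv

# With dffml installed, training would then look roughly like this
# (check `dffml train -h` for the flags in your version):
#   dffml train -model scikitlr \
#     -model-features Years:int:1 Expertise:int:1 Trust:float:1 \
#     -model-predict Salary:float:1 \
#     -sources f=csv -source-filename training.csv -log debug
```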
Here's the features that I want you to care about as the input data: years, expertise, and trust. I want you to predict the salary, and the source of your data is going to come from this training.csv file. And I turn on some debug logging to show you that, yes, something actually happened: the input number of records was four.
You can see that things happened there. Now we assess the accuracy, changing the source to the test.csv file: 100 percent. And then finally we ask it for a prediction, and when we do a prediction here, we're going to see the JSON output, where we're looking at, okay, there's the salary and there's the confidence in that prediction.
So yeah, this is just sort of the updates since October. We've got some unsupervised models, which sort of throw things into different groups, and then we've got some natural language processing models that got added. We switched from TensorFlow 1 to TensorFlow 2, we've got this easy-to-use Python API, and we also have a better tutorial on creating your own models.
There's this tool called shouldi. It's this meta static analysis tool, which is what we're going to do with dataflows, and they're basically these graphs, these flow charts. You can just think of it like a call graph, when you're calling one function. So imagine these ones in the darker purple are functions operating on that top package, and there's an input.
You basically get the input data, which is the package, and we're passing it to these various functions, which are the other shades of purple here. Every time you see an arrow, it's acting as one of the inputs to that function. So safety check takes two inputs: one of the inputs is the package name, and one of the inputs is the version, which is going to come from this pypi latest package version function. So basically, any time you see an arrow going in...
That's where the output of one function has become the input of another function, or we got it from outside the network: we were provided it, instead of it being the output of a function.
So the nice thing about these things is that we can take operations that were written as part of this shouldi package — come on. Oh, it really doesn't want to change the slides for me. All right, well, there's a slide right here that it does not want to show in between these two; I wonder if I could get out of this and it'll show it. There we go. So basically we can take this — I shouldn't have gone back a slide; it doesn't want to update. Oh well.
Basically, you saw that last slide, how we had mainly just this one dataflow, without the long arrow pointing to the new one. We can extend these dataflows: we can basically say, okay, here's all these functions and here's how they're linked together, and now here's some more functions; throw them all into the same dataflow. And we talked about concurrency, which means they're basically all running at the same time.
As long as you've got the inputs to your function ready. So that's sort of the end of the status update here, and it sort of gives you an idea of what we're doing. And then we can cover more of, you know, where I need your...