From YouTube: Weekly Sync: 2022-04-15
Description
Meeting Minutes: https://docs.google.com/document/d/1vKYEPtqKiwsFwhVKPmPub5ebMqN9HteBcbdFAuTXalM/edit#heading=h.1f4g9rj08f5e
B
I'm going to post this link in here. We have some issues with — basically, we renamed the master branch to main, and...
B
You can add image files to a repo and it'll be okay for a while, but as you start to get hundreds to thousands of commits, it becomes a problem, so we had to remove them.
B
So this is the old version of the docs that has the image in it, and this is a good one to look at. I believe you're talking about the contributions section here. The contributions section is normalized, with six being whoever did the most work. Not many people have gotten PRs merged this year, because things have been rather disorganized this year compared to previous years.
B
It's very hard for us to judge. If you're proposing a 175-hour project or a 350-hour project, we need to understand whether that's actually realistic, because that's a different amount of time for everyone, right?
B
So yeah, that's the purpose of getting stuff in. So I'm going to put this on here, because we're trying to — no.
A
No, no, but the thing that was confusing me: right now, my proposal would be submitted by, I guess, Sunday, so there would be a period between right now and, I guess, June.
B
So it ends up being kind of an issue. So let's look at the timeline together real quick, and then we can decide.
B
Yeah, and that's fine. It's something that's very helpful but not critical. But obviously, for people who have contributions, it's going to be a lot easier for us to—
B
—you know, grade their proposals. Because if you don't have contributions, we just won't know what's realistic and what's not, right?
B
Okay, okay, so: proposal rubric.
B
Okay, all right, let's jump over to — so what are you thinking about? We can spend this time, since there's nobody else here today, talking about what your thoughts are, at least until anybody else arrives, and we can just workshop together. So — I'm sorry, how do you pronounce your name again?
A
Just call me Tintin, it's very easy. It's my nickname.
B
That's great, yeah, that is very easy. Okay, all right! So what are you thinking, then? What's your proposal thought process?
A
Yeah, at this point I'm going into the time series project. I'm a fresher — I just joined college — so I don't really have much experience in computer science before this, but since—
A
—I started about four months ago, I've been really into artificial intelligence, machine learning and such. So the time series thing kind of resonated with the type of stuff I was already into. As I was researching the time series project, I was attracted to the concept of the decomposition of a time series graph: there's a trend, a seasonality, and something else — the residuals.
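The decomposition being described can be sketched in a few lines. This is a rough, illustrative version using a centered moving average — not any particular DFFML or statsmodels implementation:

```python
# Hypothetical sketch: split a series into trend + seasonality + residual.
def decompose(series, period):
    n = len(series)
    half = period // 2
    # Trend: centered moving average over roughly one period.
    trend = []
    for i in range(n):
        window = series[max(0, i - half):min(n, i + half + 1)]
        trend.append(sum(window) / len(window))
    detrended = [x - t for x, t in zip(series, trend)]
    # Seasonality: mean of the detrended values at each phase of the period.
    seasonal_means = [
        sum(detrended[i::period]) / len(detrended[i::period])
        for i in range(period)
    ]
    seasonal = [seasonal_means[i % period] for i in range(n)]
    # Residual: whatever trend and seasonality don't explain.
    residual = [x - t - s for x, t, s in zip(series, trend, seasonal)]
    return trend, seasonal, residual
```

By construction the three components sum back to the original series, which is the defining property of an additive decomposition.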
A
So I want to work on that and try to implement that as a project.
A
Other than that — another part — I was recently looking into the iris dataset that you guys have.
B
Cool, so you familiarized yourself with it. Have you tried writing any code to do that?
B
So — do you want to pop that up and share your screen, and we can workshop it together? I think that could be some good coding practice, and it could get you a PR probably pretty quickly. We may be able to get one within the scope of this meeting.
B
I can walk through writing one, and then it can be on the recording, and you can follow the recording and go from there, if your environment is not conducive to doing it right now. Do you want to?
B
Okay, great, all right, so let's open up a terminal here and let's—
A
I was looking at the iris dataset, and one thing that struck me as a little odd was that we were writing individual code for specific datasets. For the iris dataset we were writing one, and if we were to introduce another dataset, we'd have to write another. So why can't we just make that into one module and then just apply it to new datasets?
B
That's a great observation — I love that. That is the general pattern that we follow, and actually this is probably something worth writing down here. So: general pattern.
B
So the general pattern is sort of this. We're going from a high level — we're going from ideation—
B
—the idea is: we want to take your idea and shoot it to production as fast as possible, and the steps along the way — that's part of DFFML's core mission, right? To figure out how we take anybody who wants to do something related to machine learning from the idea to, boom, into production in an existing application, as fast as possible. And we'll—
B
—follow these same principles as we do the development on the library itself. So the first steps: first, we have the idea, and in this case — let me just make sure that we have an example here — so, example: datasets and dataset sources.
B
Okay, so, idea: we want to access various datasets transparently. We want to access various datasets where the model is decoupled.
B
This is basically the core premise of DFFML: decoupled from dataset access. So, first attempt: access the dataset directly, just to confirm that you can do it — you can read a file, connect to a database, etc.
B
Okay, then — I'll just... All right, so yeah, first we confirm that we can do it.
B
All right, and then we'd go in — and this would be for any dataset you might want to add, right? Okay, then, for the second step, we need to abstract behind an interface, which — oh, no, the second would be: write another one. So you write one, you write another one, and then—
B
Yeah, exactly right — and then I'll just say this is the core of DFFML; we're just going over it.
B
Okay, yeah, and this is a generally applicable process. So we do one, we do the next one — another one — then come up with an interface that works for both. In this case we came up with sources, and that's the class, right? And then fourth: expose the first and second via the interface created in the third. Okay, so—
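The four-step pattern above can be sketched with a toy example. Everything here is illustrative — the class and function names are hypothetical, not DFFML's actual API:

```python
import abc
import csv
import io
import json

# Steps 1 and 2: two direct, concrete ways of reading records.
def read_csv_records(text):
    # Read rows from CSV text directly.
    return list(csv.DictReader(io.StringIO(text)))

def read_json_records(text):
    # Read rows from a JSON list directly.
    return json.loads(text)

# Step 3: an interface that works for both.
class Source(abc.ABC):
    @abc.abstractmethod
    def records(self):
        """Yield each record as a dict."""

# Step 4: expose the first and second implementations via that interface.
class CSVSource(Source):
    def __init__(self, text):
        self.text = text

    def records(self):
        yield from read_csv_records(self.text)

class JSONSource(Source):
    def __init__(self, text):
        self.text = text

    def records(self):
        yield from read_json_records(self.text)
```

The payoff is that anything written against `Source` works with either backend without change.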
B
—those datasets, in terms of how you access them — and expose them, and iteratively modify the interface you come up with; modify the interface you hypothesize.
B
Yes — and this is why. This example covers how we came up with the dataset source abstraction — sources — and how this pattern can be extended to anything.
B
So, your question — let's go back to your question: it struck you as odd that we were writing code for each dataset. Why not—
A
Would
make
that.
B
Yeah, okay, so you can think of this as building layers onto the software stack — you're building layers of abstraction. What we did first is we built this layer of abstraction around a generic dataset itself — maybe a CSV file or a MySQL database. So the first thing we did was access the CSV file directly, access the MySQL database directly, and then we came up with this sources abstraction, or Source class, right?
B
That allows us — and let me go ahead and open it up — we then wrap the implementations that we came up with and expose them through those; we implement the class, basically. And then we move on to the next problem, right? So this would be — this is the general pattern, and then: pattern.
B
Copy-paste, okay. So, high level, we want to — and I'll just copy-paste from here — write something generic to allow for accessing any dataset.
B
Okay, so, idea: I want to be able to access any dataset using the same interface.
A
All
right
that
was
the
solar
energy
one.
Second,
let
me
just
stop
yeah.
B
So we are going through this right now, and then we can jump to any agenda items that you have. How's that sound?
B
Sahil, let me let you know what we're doing right now, and then we'll get to your agenda items, just so you know where we're at. You said your nickname is Tintin, all right. So Tintin is working on a proposal for the time series project. He's interested in the decomposition of time series graphs, including trends and seasonality, and he wants to add some more datasets. He was looking at the iris dataset and said—
B
—you know, it's odd that we're writing code for each dataset; why not write something generic that allows us to access any dataset? So we're taking this opportunity to understand how we add a layer to the software stack — a layer of abstraction — how we came about the sources abstraction in general, and then how we can extend that. We're going to extend that general pattern: we're just going to wrap an existing dataset real quick.
B
So let's see — all we're going to do is wrap the existing dataset, and then we'll just talk about what we would do going forward. So let's get your agenda items before we go further.
B
Some people don't have the option to use Google Docs — I know, the Great Firewall of China and such. Basically, if I see it, I'll provide feedback. People who email my work email have gotten responses, because that's what I've been monitoring lately, and so—
A
So when mine gets completed, do I email it?
B
Well — anything that we see; you know, we can't catch everything.
B
Things that might be helpful: pre-work related to your proposal, which you could do as PRs, and things that you need to explore more, probably because your proposal may not align exactly with the code base. Any other comments?
A
I saw those, but — I guess I'm asking for your personal opinion, what you thought.
C
So
as
as
someone
reviewing
this,
I
would
say
that
if
it
is
to
the
point
and
not
very
verbose
instead,
it
just
delivers
it
as
in
a
small
form
factor
not
just
for
short,
but
in
a
reasonable
amount
of
detail
that
what
that
would
do
writing
long
paragraphs
is
just
you
know.
We
have
to
dig
through
it
and
read
through
it.
That
is
something
something
that
I
would
say
is
not
not
possible
for
everyone
in
the
day
to
day,
because,
like
I
am
or
working
somewhere,
john
is
also.
A
Also, one thing — I mentioned this to John as well — I joined the GSoC program really late, only about a few weeks ago. I didn't really know much about it before, so I don't have many contributions to my name. So, well, what do I do about that?
C
You know, you just need to go through the guides and information out there.
C
We are kind of a beginner-friendly project; we very much welcome beginners. Also, if you have the time, you can watch the GSoC talk that we have linked on the ideas page — we have the video up there.
B
Great, thank you, yeah. I appreciate that.
B
Let's go ahead here. Okay, so, typical feedback — what did we talk about? We talked about areas you need to understand more, reading more of the existing docs to understand how things currently work in places, and issues or potential PRs where that might be helpful. What else did we say?
A
Also, another thing — I know you have that scale for the contributions, but, like — you might remember this — the pull request I sent about the file structuring, and you pointed out that it wasn't actually needed. (Which one — the relative import one?) Yes, that one! Yes, yes. Now, I get that that was a rookie mistake sort of thing, but I sort of incorporated that as a closed PR in my proposal. Would that be okay?
C
Yes, that kind of explains it, and it is fine, but I wouldn't say that counts. So, what would count as a contribution and what would not? In my view — and maybe John can add to this — what would count as a contribution is something that has been merged, or that has been blocked due to something else. That goes as a contribution. Anything else that is not merged, or that is not contributing towards the collective goal of the project, we cannot count as a contribution, to be fair to everyone. Okay.
C
Other than that, there were some questions on Twitter that I would want you to address. Yeah.
C
To answer at my best — someone asked why we don't use external libraries.
C
That is pretty much it, I guess. So, for every type of data there would be some atomic type: for a CSV, a row is the atomic type; for SQL, there is a record, which is called a tuple, or something like that; and in DFFML we have a record, which is a dictionary, and maybe it's—
B
Yeah, that's very closely related to what we were just talking about, right before Sahil joined — which is, come on—
B
Google
docs,
which
is
that
we
have
this
source
abstraction
right,
and
so
the
record
is
the
abstraction
around
you
know,
whatever
one
of
those
rows
are
so
like
a
row
in
a
csv
file
or
a
row
in
a
database,
or
you
know
an
entry
in
a
json
list
right
so
that
that
is
a
record
so
that-
and
so
we
talked
about
you
know,
how
did
we
get
to
the
source
abstraction?
Well,
we
applied
the
same
pattern
to
get
to
the
record
abstraction
for
each.
You
know
piece
of
data
within
a
source.
B
Right,
so
does
that
make
sense
a
little
bit:
okay,
okay,
yeah!
You
can
just
think
of
it,
because
we
have
this
generic
thing
right
where
we
want
to
be
able
to
access.
We
want
our
models
to
be
separate
from
our
sources
so
that
we
can
use
any
model
with
any
source
right.
Then
we
also
need
an
abstraction,
for
you
know
each
entry
within
that
source
right
or
within
any
source-
and
that's
just
a
record-
is
what
we're
calling
that
right,
yeah
so
and
then
okay,
so
anything
anything
else.
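The record abstraction described above can be sketched as a thin wrapper over one entry of any source. The field names here are illustrative, not DFFML's actual `Record` class:

```python
# Hypothetical sketch: one entry of any source, normalized to one shape.
class Record:
    def __init__(self, key, features):
        self.key = key                    # unique identifier within the source
        self._features = dict(features)   # feature name -> value

    def feature(self, name):
        # Look up a single feature value by name.
        return self._features[name]

    def features(self):
        # Return a copy of all features.
        return dict(self._features)
```

A row in a CSV file, a row in a database table, and an entry in a JSON list all map to this same shape once wrapped, which is what lets any model consume any source.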
B
So we talked about — okay, so, mentor login. I'll get on that one. Let's see — mentor login. Okay, Sahil, could you send me an email to my work email? Just because the other day I got off a call with somebody and then did the exact opposite of what we had just talked about, for like 15 minutes. If we don't write things down, I'm liable to do the wrong thing.
B
Yeah — actually, let's finish this. It's john.s anderson at intel.com. Thank you, yeah — just put it in the subject, because, like you said, it's proprietary, and we're on the recording; I don't want to switch over to my calendar or email right now. Yeah, okay, great, thank you! Okay. So I think we captured some stuff here, and then let's say: okay. We also want to grab Hashim's name.
B
So Hashim is also mentoring this year, and I think we may have some other mentors involved, but I haven't reached out to as many people. I usually reach out to many people — previous students and such — but I haven't had a chance to this year. I know some of our previous students were actually trying to start their own org, but I haven't synced with them in a while on how that went for them. So right now—
B
—it's looking like three of us, and potentially we can rope in some more folks, but I'm not sure right now. So, let's see — and Hashim's last name is in his email, great.
B
Yeah, no, that's good, yeah! There's no music this time — the noise canceling must be working, whatever Google's doing. All right. So, let's see — docs, contributing — oh, that was funny, I wonder. Okay, can you guys see?
B
It's still sort of in progress, I would say; that's not fixed right now. So let me just — transform, and transform, we need to crop — okay, never mind, I'm just going to leave that one. Okay, so, GSoC '22 — what did we do here? We're adding some clarification around proposals.
B
It's not a GitHub policy. The thing is — this is more of a thing in the United States, I think, than in other places — there's this general awareness happening of different word usages, so we're trying to align with that, which is the reason to change the branch name in the process.
B
We.
So
when
you,
when
we
change
the
branch
name
yeah
it
it,
it
was
an
opportunity
for
us
to
go
through
and
clean
up
these
files
as
well.
Now
we
didn't
totally
get
all
of
that,
but
yeah
that
the
file
cleanup
proved
to
be
a
lot
trickier
than
than
originally
expected.
So
we'll
probably
yeah
we'll
have
to
we'll
have
to
that.
That's
gonna
need
to
be
dealt
with
remorse.
A
I
was
wondering
like
in
some
cases
why
why
don't
you
just
fork
it
into
a
new
branch
called
maine
and
just
treat
main
as
the
master
branch.
B
Oh
yeah
for
oh
fork,
I
see
which
media
fork
so
so
there
is
so
okay,
so
we'll
we'll
only
spend
a
couple
more
seconds
on
this
and
then
I
want
to
go
write
the
data
set,
but
we
are
trying
to
get
the
project
transition
to
an
external
government
governance
structure.
That's
non-intel,
long-term,
to
enable
people
to
become
maintainers
of
the
project.
Because,
right
now
the
permission
settings
are
locked
down.
B
—by Intel itself. So I'm in the process of talking to legal about that, and they've okayed it, conditionally on governance documentation from the buildtree org, so we'll see how that evolves. It stalled out — some of our previous mentors were running that, Yash and Saksham and some others, but I haven't heard from them in a while. I think they got busy, same as all of us — your day job can be a lot of work, right?
B
It's a full-time job for a reason, right? So yeah, that may be a next-year thing, but for this year—
B
—it'll be under Intel. Okay, so — and I'll put this: how are we providing feedback? That would be this link here. And I'm actually going to nix that proposal review link, so that people go read the full thing if they see the meeting minutes. Okay, so — anything else?
B
Okay, all right! So let's jump into this. We're going to go apply this pattern to specific datasets. We have the iris dataset, and I'll drop a link to that so that we can follow along here.
B
A good question — that's a very good question. So, I work for Intel, and this project was originally developed at Intel, and we worked it through — every company has different processes around how they deal with open source; Intel is pretty good about it — so we were able to open source the project, and then we can do GSoC and things, which is cool.
B
And the Python Software Foundation — within GSoC, there are top-level orgs and sub-orgs, and we're a sub-org of the Python Software Foundation, which means that the Python Software Foundation manages talking to Google and all the complexities involved in actually signing up as an org with Google, and then—
B
We
have
a
simplified
process
which
we
can
follow
for
our
suborgs
right
and
we're
on
a
much
smaller
scale
right
than
the
python
software
foundation,
because
they're
a
conglomeration
of
other
things,
and
so
so
yeah,
that's
the
general
gist
of
it
and
good
good
friend
of
mine
and
co-worker
terry.
She
runs
the
python
software
foundation,
gsoc
involvement,
and
so
she
runs.
B
—a project called CVE Bin Tool, which is a really cool project, and I recommend you check it out if you haven't. It's not machine-learning focused — it's security focused — really cool, really important stuff. That's another good project. So yeah, she got us involved as well. Okay, does that answer your question?
B
We're
under
that
umbrella,
it's
sort
of
like
we're
associated,
but
we're
we're
not
sort
of
in
right.
B
Yeah, it's an association sort of thing. Okay. So where are we going here? We're going — source — god, we've got a lot.
B
Oops, okay — so here's the iris source. Basically, what we've done here — and what we're seeing — is that there's actually another layer in the stack in between, and that layer is this dataset source wrapper. We'll cover this briefly, but basically we implemented the sources construct to access different sources, and then we saw, okay, we need this abstraction around accessing specific datasets. And actually — okay, so unfortunately we're doing this sort of backwards here, because we already have the abstractions.
B
So
what
happened
is
so
if
we
look
at
this
adding
layers
to
the
stack
here,
what
happened
is
we
skipped
step
two
and
we
went
right
to
coming
up
with
the
interface,
and
so
we
implemented
a
single
source
and
the
the
other
source
was
actually
implemented
as
an
example.
B
So
for
step,
one,
we
implemented
this
iris
training
right,
and
so
this
is
the
meat
of
it
right
here
this
we
basically
we
download
the
data
set
and
then
the
data
set
itself
is
really
just
a
csv
file.
So
then
we
just
you
know,
yield
the
appropriate
source
and
you'll
see
heavy
heavy
heavy
usage
of
context,
management
and
async
in
dffml.
B
Yeah, and the reason for that is: context managers give us reliable cleanup of any resources used, and asynchronous methods give us network access in a concurrent way, because a lot of things are I/O-bound. Most things are I/O-bound; machine learning models and such are CPU-bound, and we can schedule those out to threads. So you can think of this as attempt one. We wrote this thing that's like, okay—
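The two reasons given — reliable cleanup and concurrent I/O — can be shown in a minimal sketch. This is illustrative only, not DFFML's actual source API:

```python
import asyncio
import contextlib

@contextlib.asynccontextmanager
async def open_source(name):
    # "Acquire" a resource (stand-in for opening a file or DB connection).
    resource = {"name": name, "open": True}
    try:
        yield resource
    finally:
        # Cleanup always runs, even if the body raises.
        resource["open"] = False

async def fetch(name):
    async with open_source(name) as src:
        # Stand-in for I/O-bound work; many of these can overlap.
        await asyncio.sleep(0)
        return src["name"]

async def main():
    # Concurrent access to several sources at once.
    return await asyncio.gather(fetch("iris"), fetch("mnist"))
```

CPU-bound work (like model training) would instead be handed off to a thread or process pool, e.g. via `asyncio`'s `run_in_executor`.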
B
Well,
how
would
we
implement?
You
know
a
simple,
abstract
or
a
simple
source?
Well,
we
would
download
the
download
the
data
set
like
because
this
sources,
abstraction
is
around
generic
sources
in
general,
and
then
the
data
set
source
abstraction,
which
is
this
next
layer
which
we're
adding,
and
maybe
I
should.
We
should
probably
say
that
so
pattern
applied
to
specific
data
sets,
develop
the
data
source,
abstraction,
okay.
B
This is a helper function here, and then for our second attempt we basically just wrote an example — this is the example data, my training data — and this is going to look a lot better rendered on the website. So let's go to dataset source.
B
Okay,
so
in
this
example
here
we
are
in
this
example.
Here
we
say
all
right
like
this
is
the
abstraction
itself
that
we
ended
up,
adding
as
this
layer
right
and
here's
the
training
data
right.
So
instead
of
the
iris
data
set,
that's
a
csv
file,
it's
this
we're
downloading
it
from
some
server
and
when
we
download
it.
This
is
what
we
do
right.
B
We
basically
just
use
this
cache
download
function,
another
helper
function-
all
it
does
is
download
if
it
doesn't
already
exist
and
then
it
yields
the
csv
source
right
and
then
here's.
A
I
just
you
know
what
I'll
save
my
questions
for
later.
B
There's
not
a
lot
of
network
access
in
this
library
in
in
the
base
library
itself,
but
when
we
do
do
it,
we
have
this
helper
function,
cache
download
and
the
reason
why
we
always
go
or
you
don't
have
to
go
through
cache
download
it's
just
typically,
if
you
download
something
you
don't
want
to
re-download
it
right.
No.
A
No,
I
I
get
the
cash
download,
I'm
asking
why
cash
like
cash
means,
like
you
know,
it's
temporary
memory,
is
that
what
this
is.
B
Oh,
so
cache
doesn't
necessarily
mean
that
something
is
temporary
memory.
Cache
a
cache
is
really
just
a
place
that
you
store
something
so
that,
like
so
that
you
don't
have
to
go,
get
it
all
the
time
from
its
main
source
right.
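The point above can be shown in three lines: a cache is just a faster place to keep something you'd otherwise re-fetch from its main source. A toy sketch:

```python
# Tiny illustration: only go to the (slow) main source on a cache miss.
_cache = {}

def get(key, fetch_from_main_source):
    if key not in _cache:
        _cache[key] = fetch_from_main_source(key)
    return _cache[key]
```

The second lookup of the same key never touches the main source, which is the whole idea — whether the cache lives on a CPU, in RAM, or on disk.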
B
When you hear about the L1 or L2 or L3 cache on a CPU, that is just a place where we can store stuff that's usually in main memory, so we don't have to go out to main memory, because that takes lots of cycles. So "cache" is just a word used for a—
B
—quick-access location. So basically, we use this cache download function because if somebody wants to load this training dataset — if you open it twice — we don't want to re-download it; it's already on disk. The—
B
Is
that
we
have
for
security
reasons,
these
the
synchronous
download
function
which
this
cache
download
is
backed
by
and
then
cache
downloaded
itself?
Have
a
protocol
allow
list
which
basically,
the
the
purpose
of
this
is
to
stop
people
from
accidentally
passing
it
http
links
which
are
insecure
right.
So
we
do
hash
validation
as
well,
which
is
another
thing.
B
So,
basically,
if
we're,
if
we
download
something
to
the
cache,
we
we
it's
it's
it's
kind
of
nice,
because
we
can
just
turn
around
and
like
the
way
that
we
can
ensure
that
we
have
the
right
contents
in
the
cache
is
the
same
way
that
we
ensure
we
downloaded
the
right
stuff
right
and
that's
by
pinning
it
with
a
with
a
a
cryptographic
hash.
B
Exactly
and
so
basically,
if
the
hash
doesn't
match,
we
either
re-download
or
throw
an
error,
and
if
somebody
tries
to
you
know,
send
you
know,
give
us
a
link.
That's
in
you
know
an
insecure
link,
we're
going
to
throw
an
error
right,
and
so
here
we're
explicitly
allowing
the
use
of
tls
encrypted
http
access
right
by
by
adding
this
to
the
allow
list
so
and
this
kind
of
gets
in.
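The behavior described — download only on a cache miss, refuse insecure protocols, and pin the contents with a hash — can be sketched as follows. This is a hedged, illustrative version, not DFFML's actual `cached_download` helper:

```python
import hashlib
import pathlib
import urllib.request

def cached_download(url, target, expected_sha384):
    # Hypothetical sketch of a cached, hash-pinned download helper.
    target = pathlib.Path(target)
    if not target.exists():
        # Protocol allow list: refuse insecure links before any network I/O.
        if not url.startswith("https://"):
            raise ValueError(f"insecure URL not allowed: {url}")
        urllib.request.urlretrieve(url, target)
    # Hash validation applies to cached and freshly downloaded files alike.
    digest = hashlib.sha384(target.read_bytes()).hexdigest()
    if digest != expected_sha384:
        raise ValueError(f"SHA-384 mismatch for {target}")
    return target
```

Note that the same SHA-384 check guards both cases: a tampered download and a corrupted cache entry fail identically.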
B
So
basically,
what
we've
done
here
is
we,
let's
see
so
write
another
one
and
then
I'll
just
put
this
example
right,
and
so
that's
the
example
and
then
you
know
come
up
with
the
interface
that
works
for
both
and
here's
the
same
link.
Basically
because
you
know
this
is
that
that's
the
point
of
this
is
that
we,
the
second
one,
was
ended
up
being
the
the
example
for
the
abstraction
itself.
B
So,
okay,
so
then
your
question
is
why
don't
we
have
something
that
that
you
know
we'll
go
back
to
your
question,
which
was
you
know,
can't
we
just
have
something
that
works
for
everything,
and
you
know
the
answer.
Is
we've
now
built
this?
We've
we've
built
this.
We
did
our
first
attempt.
We
did
our
second
attempt.
B
We
wrote
our
abstraction
layer
right
so
now
we
can
start
basically
saying
well
what
are
all
the
things
that
we
might
want
to
access
right
and
if
the
the
pattern
we're
going
to
follow
the
same
pattern,
basically
for
the
third,
this
third
level
right.
So
the
first
level
was
access
data
access,
a
data
source
right
and
then
the
second
step
was
access.
Specific
data
sets
right,
which
might
be
backed
by
a
specific
data
source
right,
which
is
like
a
csv
file
or
whatever.
Now
the
third
step
is
well,
we
need
to
figure
out.
B
How
do
we
access
an
arbitrary
data
set
right?
So
we
need
to
probably
do
this
for
a
lot
of
different
data
sets
right,
because
we
only
did
two
if
we're
going
to
make
something
that
accesses
an
arbitrary
data
set.
We
need
to
have
a
really
good
understanding
of
all
the
types
of
ways
that
that
might
be
accessed
and
the
only
way
to
get
that
is
to
implement,
implement,
implement
right.
So
part
of
this
project
is
to
go
through
and
implement
right.
So
so
we
had
a.
B
We
had
a
proposed.
A
If
I
understand
what
you
were
saying
correctly,
you
mean
like
different
places
like
different
websites
like
kaggle
or
another
data
source
would
have.
B
Yes — and actually, we'll do one right here, which is: why don't we do MNIST? That's an easy one; we already have an example for it, and we don't have a dataset source for it. So we can basically go ahead — we're going to get kicked off here, so we'll see if we can do it in like seven minutes. So this is the deal: here are all our SHA-384 sums, here's our link to the data.
B
Let's
go
through
and
implement
mnist
here
and
I'm
gonna
pop
shell
again.
Okay.
So
where
are
we
okay
so
get
check
out?
Dataset.
B
You
know
I
need,
I
think
I
might.
I
think
my
calendar
might
have
gotten
cleared,
but
if
we
get
kicked,
let
me
actually
either
either
it
got
cleared
or
I'm
missing
the
meeting
right
now.
I
think
it's
fine,
I
think
we're
fine.
I,
the
past
few
weeks,
I've
accidentally
run
over
another
meeting.
Doing
this.
B
That sounds good — so let's plan on that, then. I think that's doable. Documents, python, okay! So let's take a look at what we did here. I'm going to remove this file; I don't think we need it — I think that just got generated as a one-off. All right, so basically I just copied the base into the new file, which is the MNIST one, because the base has this example here. Actually — actually, I'm not going to do that.
B
I'm
gonna
copy
the
iris,
because
this
is
one
that
has
a
this,
has
the
whole
url
and
everything
okay.
So
let's
do
mnist
training,
yeah
and
then
the
question
will
be
you
know
how
do
we
then
extend
this
to
be
arbitrary
right,
and
so
you
might
end
up
like
eventually
we'll
end
up
with
something
once
we
implement
a
few
of
these,
that's
really
obvious
that
that
you
know
we
can
basically
just
say
you
know,
source.
B
No worries. Okay, so — all right, so we've got our—
B
So, okay — we're grabbing — actually, we're grabbing from multiple sources, that's right. Okay, sourcing, which is a data flow? Oh, okay, that's right — it was a tricky one. Okay, so we're basically going to create — what we see in the usage here is — oh, this would be fun — okay, we're actually going to do this example as a dataset source.
B
It
looks
like
so,
let's
see
so
we're
definitely
going
to
get
kicked
off,
and
this
is
going
to
take
a
little
bit
longer,
but
basically
we're
going
to
re-implement
this
example
here.
So
what
do
you
mean?
Kicked
off?
B
Google
has
limited
the
meetings
to
an
hour,
so
we'll
have
to
start
a
new
meeting
link.
I
don't
think
this
will
take.
You
know
more
than
like
15
minutes
here,
but
it's
definitely
gonna
take
more
than
three
minutes
so
so
we'll
we'll.
I
think
we
can
run
into
that,
and
then
I
can
confirm
that
I'm
not
in
another
meeting
right
now.
So,
let's
see
so.
B
Basically,
if
you
look
at
this
tutorial
here,
what
we're
doing
is
we're
showing
how
to
how
to
train
a,
I
think,
a
tensorflow
model
yeah,
let's
see
so
yeah,
okay,
so
yeah
we're
going
to
show
how
to
train
the
dnn,
classifier
and
so
effectively.
What
we're
going
to
do
here
is
we're
going
to
end
up
doing
all
of
these
pre-steps
and
we
can
actually
just
actually.
We
can
just
take
this
whole
thing
and
we
can
just
do
all
of
that.
So
our
first
step
is
download
all
the
data
right.
B
Actually, no, paste, read, docs, examples, mnist, all right. So our first step is: we want to download all the data, right? So we're gonna go through and we're gonna cache-download the train idx3 and train labels, because these are the things that we care about, because we're implementing the train dataset right now.
B
Right, so this dataset itself, download features, is made up of four files: one for the training features, one for the training labels (the things that we're going to predict), and then two more files for the test features and labels, but we're not concerned with those right now, because we're just doing the training.
B
So, let's see, okay. Let's just do a find-replace of iris to mnist, and that way we should be pretty good here. Okay, and then we can grab these SHA values.
B
Okay, and that was the images, and then this is the labels. Okay, and then these are also HTTP links, so we'll make sure that we're adding them to the allowlist, right? And then we can say, you know, training original features and labels. So we'll download these files to the cache: we're going to download the features, and then we're going to download the labels, and we're going to download them to this cache directory here, which is in, you know, home, cache.
B
dffml, datasets, mnist. And yeah, we're going to validate the SHAs based on the values that we had given in our tutorial already, and we could calculate these: if you look at the cache download function, there's an example, it just runs it through sha384sum and calculates the SHA. So then, looks like we also depend on model tensorflow and operations image here. Which kind of goes to your question about: why?
B
Don't we use pandas? Every time we have a new, distinct set of dependencies, we should have a new plug-in, and that's because, so basically, think: a distinct set of dependencies would be like, I have models implemented using TensorFlow right now. I'm sure you're aware, when you download some of these machine learning libraries, they take a long time to download, right? And they.
B
So for your specific use case, right, once you've trained your model, you would download whatever things are applicable to your model, or your dataset collection if you're doing, like, pre-processing on the fly, and that way your image is smaller, right. And so to do that, though, we need to make sure that we create a new plug-in every time we have a new, distinct set of dependencies, and that allows people to only install what they need.
B
So in this case, this is going to require dffml-model-tensorflow and dffml-operations-image. So we're just going to do a quick check here that we have those: we're gonna run the version command and see. Okay, it's gonna look at this file, all right. Okay, so we have a bunch of stuff in here.
B
So okay, so let's just actually keep implementing for a second. So we download the data; we're actually going to include this little install command here, or actually that's going to be covered already, or, well, we'll include it for now. Basically, you know, you won't be able to run this.
B
If you don't have this stuff installed. And TensorFlow is not one of them, but you will need the image operations, right, because, to use this training dataset, because we have to normalize the data to do anything here. Let's see, should we do that? I think maybe we should not do that, yeah. I think we should leave the normalization out of this, so we're actually gonna leave the normalization out of this and basically say, hey, you know.
B
If you use it, then you normalize; we're just going to give you the raw data right now. And just to recap there, what.
B
Yeah, exactly, we're just going to provide the raw data. We don't need to do any normalization built in, but the cool thing is that you could, right? Like what we're seeing here is that we could, you know, have mnist.training.normalized which, whenever you import that, kind of like you're saying, you know, why can't we just have one for everything? Like this one, you could have one specifically that already does that normalization built in, right? Oh.
A
Yes, speaking on this: while I was writing my proposal, one of the aspects of the project was writing various operations, right, normalization and NaN removal and stuff. So I was wondering, like, I want to implement a few of them myself, but you guys would already have frameworks for several of these operations, right? And so, like, how do I tell which ones need to be which?
B
So I would say: git grep is your friend, mainly because the docs are not entirely up to date, and then also the plugins page. We currently have a bit of an issue with the way the docs are splitting out, but this is sort of your main list of what exists where, this plugins page, because this searches across, like we said: each time you get a new, distinct set of dependencies, you implement a new plugin.
B
So all of this, which is split out right now for some reason, we don't know why, we haven't gotten to that in a while, that is just the main package, right? So if you were to, like, get that, that's basically all of the, anything you see inside, anything you see. Okay, that's a bunch of .pyc files.
B
Everything that's implemented. Then you would go to the plugins link, and now you can see across all the different plugins, right? So everything, like, you know, if you were to look in, like, this is the top-level repo, right, and so the API reference is dffml, and then there's also, like, model and operations, right, and within those we have, this is dffml operation binsec, dffml operation image, which is the one that was referenced in this tutorial that we were just looking at, wherever that was, oh yeah, here, right.
B
So this slash operation image is this package on PyPI, and so the plugins page allows you to search across those, and the goal is eventually we'll support third-party plugins. So basically, you know, these are all the plugins that we as a community have developed, but we also want to make it very easy to just say: hey, you know, here's my random one, like we were talking about, you know, hey.
B
So if you go here and you look at this, dev create, there's a helper script to create a new package, and so basically this will generate you a new model or operation or whatever, right? And then you can push, that's just a skeleton of files that gives you a blank Python package that you can then push to your own personal GitHub, and then other people can use your models or operations or whatever by basically just doing a pip install and pointing at your repo link, and all of a sudden they'll have access to all your stuff, whatever you implemented within dffml. And that way, if you implement a random new source, then somebody can take the existing models and use it with your new source.
B
Without writing any code, right? All it is, is a command-line flag. So that's the goal. We aren't quite there yet; right now it's all just within the repo, but yeah. So okay, so where were we here? So we're gonna do, oops, okay, so basically, and then, looking at our, this was, you know, so there's a Python API, an HTTP API, and the command-line interface, and they all sort of mirror each other.
B
The HTTP API is not the greatest right now, but anything that you can do in one, you can do in the other, right? So this line here basically says, says no, paste, so this says: create a source object, which is actually two sources combined.
B
This will expose the separate features and labels files we downloaded.
B
As a single source, right, because that's what we want to do. So what we end up with is, so this one is going to pre-process to normalize. So one of these, you see, is pre-processing, and one of them is just straight: this is the labels, and this is the images, and the images are being pre-processed.
B
And so this is the images here, and we don't want to pre-process them. You can see they're being pre-processed using this data flow, and so we're basically just going to cut that out, because we don't want to do any pre-processing. And we can see, within that pre-processing, it was actually accessing this idx3 source, right? So we can instantiate an idx3 source for the train images.
B
So, let's see, later, features path and labels path, right. So we downloaded the features, we downloaded the labels, and now we're going to instantiate an idx3 source for the features, because that's the format that they're in, and an idx1 source for the labels.
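For context on the file format under discussion: the MNIST files are in the IDX container format (idx3 for the image arrays, idx1 for the labels). This is a minimal reader sketch of that format, not DFFML's actual idx source implementation.

```python
import struct

def read_idx(data: bytes):
    """Parse an IDX payload: two zero bytes, a dtype byte, a dimension
    count, then one big-endian uint32 size per dimension, then raw data.
    The MNIST idx3 files hold 28x28 images; idx1 files hold labels."""
    _zero, _dtype, ndim = struct.unpack_from(">HBB", data, 0)
    dims = struct.unpack_from(">" + "I" * ndim, data, 4)
    return dims, data[4 + 4 * ndim:]
```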
B
And we need to go, I think this works out of the box like that, but we're gonna go double-check that, because, so yeah, so we need to pass probably file name. Oh yeah, so it looks like idx3. So this says what class to instantiate, and then this says property file name equals this file. So that means we need to say: file name equals this file.
B
Eventually we're hoping to have a converter that actually just spits out Python code given command-line arguments, and vice versa. Feature equals image, and this says basically: hey, you're loading all the data from this file in this format; like, when I yield a record for each entry, what should I call the data that I loaded from the file? And we should call the data "image". So file name on this one is images path, and then feature is "label".
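The combining step described here, one record from each file, merged under the configured feature names, can be sketched like this. The function name is hypothetical; it just illustrates the pairing behavior, not the library's actual sources API.

```python
def combined_records(images, labels):
    """Pull one record from each underlying reader and merge them under
    the feature names chosen above: "image" from the idx3 file and
    "label" from the idx1 file."""
    for image, label in zip(images, labels):
        yield {"image": image, "label": label}
```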
B
So, basically, you know, create this sources object, which is really, you know, two sources in a trench coat, and each time you read a record from one, read a record from the other one, and call the data that you read from the idx3 source "image" and call the data that you read from the idx1 source "label". And this should be it, so, and then we'll yield it, yeah, okay, cool. So then from here we basically just say, you know, base, so csvx.
B
I think there's an idx dataset source cache download file, all right, I think we implemented it. So yeah, download the files and spit them out. Okay, so now we can run the tests. Okay.
B
Contributing, and then testing, and then here's how to run a specific test, and here we'll just do this shorthand. So basically, oops, oh, and I figured out this pretty cool thing recently, which I really like. If you do python and then pdb, then if there's an exception that you didn't handle already, like when you're developing some stuff, it'll drop you right to a Python debugger shell, which is pretty sweet.
B
Tests. Oh, we don't want discover, so we're going to run just this test, and what it's doing is it's running these console examples. When you put reStructuredText in a markdown file, we have some extra stuff that we've added on to say: when you mark it as a test, it's going to run that. And so what we're missing here is, from the top level, every plugin that we talked about needs to be registered, and so we're gonna go register this, reinstall the package, and re-run mnist, and then.
B
So what happened here is: we downloaded the files, right? We instantiated our sources, and we have some code, and then, you know, we combined them by yielding this sources object. Well, this would be.
B
This would be, so, basically, this is the implementation of this. Effectively, there's another layer of abstraction between the dataset source, and it's this context manager wrap source, and it has, on line 108, see over here, line 108, this is the error that we're getting: function did not yield source.
B
So this is just, basically, there was input validation being done to ensure that we're yielding a valid source from the dataset source, right? Because we have this dataset source decorator that we can use here, which is this @dataset_source, and this function is calling another function, which is in this wrapper.py at line 108, which is saying: whatever is yielded, let's make sure that it's a valid source, and then it throws an exception if it's not a subclass of Source. Well, Sources itself is not a subclass of Source.
B
So we hit a new use case here, and all we had to do was go modify the wrapper to say: you know, don't just check for Source; Sources is also valid. Does that make sense?
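The validation bug and its fix can be sketched in miniature. These classes are stand-ins for the library's types, not its real definitions; the point is only the shape of the isinstance check that was loosened.

```python
class Source:
    """Stand-in for the library's Source base class."""

class Sources(list):
    """Stand-in for the Sources collection: it holds Source objects but
    is deliberately not a Source subclass, which is what tripped the
    original check."""

def validate_yielded(obj):
    # The original validation only accepted Source; the fix described
    # here is to also allow a Sources collection to be yielded.
    if not isinstance(obj, (Source, Sources)):
        raise TypeError(f"function did not yield Source(s): {obj!r}")
    return obj
```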
A
Is not being, yeah.
B
In the idx3 file read, this is being thrown in, like, the call to the open function in Python.
B
Gigabytes free, so I'm highly suspicious of that, because this file is small. So we're going to add some logging, and this can be an opportunity to learn how to debug when you hit weird issues.
B
Okay, so then we're probably going to see a lot of log messages here, so we're going to go ahead and, you know, give ourselves some debug information: so, open file, reading size, blank, and this will tell us how many iterations of the loop we're gonna go through, and then we're gonna say, you know, okay, inner array size, and we'll just dump out this inner array size.
B
So then the thing that we can do is: if we set this LOGGING equals debug as an environment variable whenever we run a test case, it should print both debug logs, and then enable all of those, oops, and enable us to see what is going on here.
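The pattern being described, a LOGGING environment variable selecting the log level for one run, can be sketched with the stdlib logging module. The variable name follows the transcript; the mapping function here is illustrative, not the project's actual implementation.

```python
import logging
import os

def level_from_env(env) -> int:
    """Map a LOGGING environment variable (e.g. LOGGING=debug) to a
    logging level, defaulting to WARNING when unset or unrecognized."""
    name = env.get("LOGGING", "warning").upper()
    level = getattr(logging, name, logging.WARNING)
    return level if isinstance(level, int) else logging.WARNING

# Configure the root logger from the real environment.
logging.basicConfig(level=level_from_env(os.environ))
```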
B
So, let's see, okay. So it may not, because we're actually running into a sub-, oh no, there it is, okay. So here we see the download. So just to recap here: we did this.
B
We ran the same command we were running before, but we exported this environment variable, and this is how you do a temporary environment variable: you just prefix whatever command with it. And then when it ran, you know, this subcommand here, it picked up on the fact that, hey, you know, I need to be enabling the debug logging. So here's the nice download progress code that we have, and, oh, so here you see it instantiate the source. So this is the source.
B
That's why, because they're CSV files. So I can tell you right now what's wrong. So basically, these are idx3 format files, which are gzip compressed, and our file source abstraction will automatically decompress files for us, but it works based on the file extension. So if you get the file extension wrong, then it's not gonna properly decompress it. So we basically.
A
Okay, yeah, so, sorry, if I understand this correctly: you tried to read the wrong type of file format.
B
How many gigabytes that is, like, that's a lot of gigabytes, probably terabytes. So if we fix our file extensions, we will end up with sanity. So let's see, so, okay, and here it is: .gz. So if we fix these, and we name these with .gz, right, you saw where I just did that, here and here, then we should end up with, you know, something that's not completely nuts.
B
Let's see what happens. It works! So this is the correct output. The correct output is a giant, giant, you know, those are, what are they, yeah. I'm trying to Ctrl-C this, but, you know, it's got the better of me here. But those are, I think, what, 28 by 28 images, I think, right? Does anybody remember? Yeah, so these are the flattened-array versions of the 28 by 28s.
B
So it's dumping the whole thing to standard out, but it works. So we were able to successfully, okay, I'm gonna kill it. So we were able to successfully dump that out, and let's see, so if we look at it here, because, you know, just to save ourselves some trouble here, we can dump it to a file. So now we can actually just, so, this is, we're writing.
B
We wrote our test case as the documentation within the docstring of the function where we did the implementation, right? So everything is just, like, boom, all together: docs, tests, implementation. And so we can just dump this to a file, so, temp test.json, because this outputs in JSON by default, and it's going to take a little bit, but it'll dump it out for us here. And we can go and remove those debug messages now, so we can say git checkout source idx3, because I don't think we really need those; those are sort of.
B
I don't think we want to keep those around, okay. So then, at the end of the day here, what we ended up with is a couple fixes, or one fix. So, git add. So this is what we usually do: you know, we ended up with our implementation, so, git status.
B
So we have three files that we changed. setup.py changed because we added the entry point, right, and then that points to this, this is the Python path, this points to this new file. And then we changed this source wrapper, because we needed to change the input validation.
B
So what we would do is, we would say git add source wrapper, and then git commit, and then we'll just say, you know, "source:". So whenever things are within that main plug-in, or the main library itself, we don't prefix with anything; we don't say dffml.
B
Allow Sources to be yielded.
B
Type allowlist for context, and then I'll say "for context managed wrapper sources", and then I'll say "dataset source", right, because this is what ends up calling into that. And I'll add you both as co-authors, since we did this together.
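Crediting co-authors on a commit uses GitHub's Co-authored-by trailer convention: trailer lines at the end of the commit message body. A small sketch of building such a message (names and the helper itself are illustrative):

```python
def commit_message(title: str, coauthors) -> str:
    """Build a commit message ending in Co-authored-by trailers, the
    GitHub convention for crediting everyone who worked on a change."""
    trailers = "\n".join(
        f"Co-authored-by: {name} <{email}>" for name, email in coauthors
    )
    return f"{title}\n\n{trailers}\n"
```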
B
So I have, I have Sahil's email.
B
All right, so can you spell your full name for me? I mean, I'd eventually contribute.
B
And then we give, I always give my quick look over here, just to make sure everything's correct. Oh, and we've got to format with Black. I think everything's correctly formatted, let's see, yeah, that looks like correct formatting to me. All right, we'll take a chance here and see; the CI will kick us out if we did something wrong.
B
Okay, and then we'll say:
B
So, "source: dataset: mnist: training: Add mnist training source".
B
That's good, I'm happy to hear that. I think this was a pretty good recording today, I think this is a pretty good one. I think we got a lot of information on paper, so to speak. So then we'll push it up. So, you know, for me: let's see, git remote -v, so, git push, so, so mine.
B
I push, you know, to my fork: git push, then my remote, and then dataset source mnist, oops, oops, I missed a letter in my own name. And then gh pr create, and I've been having fun with this.
B
With this GitHub CLI: boom, pull request, all right. So now we can wait for the CI to fail, because we have lots of failing CI, but yeah, I'm planning on spending some time fixing this stuff. I think I told Sahil, but I should have some more scope. Basically, this project has been something that I do outside of my day.
B
Job, and I know Sahil does this outside of his day job as well, and so, you know, we spend as much time as we can working on it, but, you know, sometimes things fall by the wayside. So currently the CI jobs are failing, and I know you've been working on that, so thank you for working on that and fixing the ones you have. All right, so, any final things for this meeting today?
B
Great, and then you and I can jump on that call. Okay, all right! Let me go here, and I will add this as the mnist example for how to add a dataset source, and that should give you a pretty good idea. And for pre-work for the proposal, I would recommend that you maybe go through and, you know, add something similar to what we just did, and you could follow the.
A
One thing, I had one question: yes, so I actually asked Sahil this earlier; he said he wasn't exactly sure about it. So, one of the projects, this same project, the dataset-adding thing: I wanted to ask if I can work on it before GSoC, even though.
B
You can work on whatever you want before GSoC, right. Keep in mind, though, everything's open source, right? Other people are gonna see your stuff. The beautiful part about open source is nobody can say that they did your work when you did it, because it's right out there, and we can tell that you did it first. So if anybody tries anything like that, we'll just say: well, then we're not even going to consider your proposal.
B
We had somebody try that one year, I think, and it was like, really? Like, we can tell who did what; we've been talking to you all. So yeah, I would say the more the merrier, and, that is, you know, the more you get.
B
If you decide to go knock some of this stuff out, then that's just going to be great pre-work, right? And then you're going to modify your proposal accordingly, right, to say, you know, maybe do something else, right? Because you want to get the most out of the time, right, and you want to.
B
This is, all of this is, like, you know, stuff that you're hoping other people will use, right? So, cool, all right, great. And then I think we have just enough time to sync before my nine o'clock here, so, and then, so, let's see, writing.
A
So I, just to come back to the first question that I had: so yeah, the whole centralized generic function for this data source thing, instead of, like, copy-pasting the same code.
B
Yeah, so basically, so what we found here, no, that was a very good thing, I'm glad you said that. So what did we find here? Let's pull up the, now we have two sources that we can easily look at side by side: dataset and iris, okay. So let's look at iris, and let's look at our new one.
B
Right, okay, so let's go to the bottom, right. So we started with this, we copy-pasted it, modified it a bit, and got to here, right. So now we can look at it and see: well, okay, well, how far are we from our hard-coded two examples here to the point where we would have something generic that works for everything, right? And I think, looking at this, we can see: well, what's our general pattern, right? It's.
B
Source that you need, right. And then, you know, in this one we happened to do some pre-processing; on this one we didn't need to do any pre-processing, right. So the generic sense here is, well, you know: what is the file you want to download, and then what is the internal source you want to use, right? Now, the goal of this project, so, so this is just, you know, regular old Python code, which is great, right, now.
B
This project is also heavily based around this idea of the data flows, right, and so the data flows.
B
Are this generically configurable thing that you can use to basically, you can sort of do anything on the fly with them, right, and they allow you to sort of, like, mix configuration and code, but at the same time keep them separate and organized. And so, looking at this, in general my answer is: you use a data flow, because the goal of this project, in many ways, as well as trying to be a good place for machine learning, is to explore this concept of data flows and how you could, you know, extend anything.
B
How would I define a, okay, so: how would I define a source as a data flow? And then the data flow allows us, like, basically arbitrary configurability, right? So you could have an operation that instantiates a source, you can have an operation that runs cache download and an operation that instantiates a source, and you could effectively wire those up. And so the spoiler alert here is that I'm working on, I have a pull request open.
B
I believe that allows you to define any class as a data flow, and so once that pull request gets merged, so basically you can say: this method does this within the data flow. So a data flow is a set of functions that get run, like, on demand, right, and on demand in this case is whenever a method is called, right. We talked about update and records and record. So in this case you would run, you know, a specific operation for a.
A
For instance, like, you guys have this, you were working on the web interface, right? Right, drag and drop.
B
Yes, well, so you wouldn't, so basically, so here's an example of creating a data flow, and there's a command-line interface for creating it, but it basically dumps out to, there's a serialized format which can be represented as really anything, so in this case we're dumping it out to YAML. And this is the graph, like, you can visualize it, and that visualization is similar to that web UI project, where you could put together this.
B
For you. So basically what you would do is you would say, so, for example, this is multiply, right? So this is a data flow, these are inputs here, right? So you could define your data flow, and then you'd say: when I call the records method, which lists all records, I want you to send the input, well, records doesn't take.
B
You can run on any method call, right? So you're basically defining a class dynamically, right? You can basically think of it as, like, code generation, almost, right. And so you can get arbitrarily complex in terms of, like, mapping dynamic inputs or static inputs, right? And so in the case of these static dataset sources, you would say, you know, we would create a data flow which just statically maps these, like: these are the inputs that we provide statically, the URL and the hash, right, and we would.
B
We would say: on records, call cache download, and then, you know, run and yield each record in the CSV, or yield the CSV source, like, it's all arbitrary, right? It can be whatever you want, right. And so effectively what it does is it allows you this mechanism to make everything configurable, right. So we would take the code that we have here.
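The "define a class as a data flow" idea, method names wired to operations, static inputs (like the URL and hash) merged with whatever the caller passes, can be sketched as a toy. This is an illustration of the concept under discussion, not the library's actual DataFlow machinery.

```python
class DataFlowObject:
    """Toy sketch: each method name maps to an operation (a plain
    function here), and calling the method runs that operation with the
    statically configured inputs merged in."""

    def __init__(self, operations, static_inputs):
        self.operations = operations        # method name -> function
        self.static_inputs = static_inputs  # e.g. {"url": ..., "sha": ...}

    def __getattr__(self, name):
        # Only called for attributes not found normally, i.e. the
        # dynamically defined "methods".
        try:
            op = self.operations[name]
        except KeyError:
            raise AttributeError(name)
        return lambda **dynamic: op(**{**self.static_inputs, **dynamic})
```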
B
So if we wanted to create an arbitrary source that works for any dataset, we would add cache download as an operation, and then we would add something that figures out what the file extension is and instantiates the correct source based on the file extension, right? So we would have something that, every time it sees a URL, it downloads.
B
Okay, so I think we're over on time. I need to drop here and double-check that I can go into nine o'clock, and I think I can. I think that meeting got, I think I had no eight o'clock, and I think my nine o'clock got canceled, but let me double-check, and then let me hop on the meeting with you, Sahil. Okay.