From YouTube: Software Portal: Sources: 2021-07-30
A: All right, this is the first step in implementing the crawler for the SAP InnerSource portal. SAP has developed a basic HTML front end for an InnerSource portal, and it essentially looks like this. What we're going to do here is implement what they call a crawler. This front end is static, and it pulls its data from this repos.json file.
A: Now we're going to implement the thing that puts the data in the repos.json file. They've got a little diagram here of how that works, and ours is going to be a slightly modified version of these two, using data flows and operations, and also sources.
A: We need to put information in this repos.json file. Our input data is essentially this tree structure: it's a directory, and within that directory we have subdirectories, which are the org names of GitHub orgs. Within those directories we have repos.yaml files, and in those repos.yaml files we have the name of each repo that we care about tracking, along with the owners of that repo.
A: We could have arbitrary other information in there as well. So let's go there; this is what it looks like. For example, we're tracking intel, and here's the repos.yaml under intel; then we're tracking tpm2-software, and here's some repos.yaml there. Let's grab one of these so you can see what they look like right now. For example, here's dffml, and I'm the owner, and then there's cve-bin-tool, and Terri is the owner.
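The repos.yaml files look roughly like this (the owner email addresses shown here are placeholders, not the real ones):

```yaml
name: dffml
owners:
- owner@example.com
---
name: cve-bin-tool
owners:
- owner@example.com
```

Multiple repos in one file are separated as YAML documents with `---`, which comes up again when the files are loaded.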
A: Obviously, I've just picked one person here to simplify things, but you get the picture. So that's our input data, and we're going to be pulling repos from there; we want to take each one of these repos and generate something from it.
A: We want to generate an entry in the SAP repos.json for each repo, which looks like this: basically, it's an id and some info from the GitHub repo. What they've got here by default, if you read their crawler document, is just what you get if you hit the GitHub API's repo search, and so this is the basic information that they're asking for and will display in the portal. And then they have this metadata section.
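For reference, an entry in the portal's repos.json is shaped roughly like this (the field values are illustrative, and the authoritative schema is the one in the portal's crawler documentation):

```json
[
  {
    "id": 123456789,
    "name": "dffml",
    "full_name": "intel/dffml",
    "html_url": "https://github.com/intel/dffml",
    "description": "An example description",
    "_InnerSourceMetadata": {
      "logo": "https://example.com/logo.png",
      "participation": [0, 3, 1]
    }
  }
]
```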
A: They have this InnerSource metadata section here, which is where they add extra information, and so we're going to add some extra information there. We're also going to have the standard information here populated. So we've got our input, which is these repos.yaml files, and we've got our output, which is this repos.json file. Each repo maps to the DFFML concept of a Record, which is anything that's uniquely identifiable.
A: With GitHub repos you've got the owner and the repo name, and that's essentially your unique key: you can uniquely identify any repo using those two things, or, for example, the URL here, right? So github.com and then sol/earth, because they're doing a demo with planets here.
A: So the first step is really just: read in the repos that we care about, and then write them out. What we do first is implement the source to read the repos in our input format, which is the directory of YAML files, and we also implement the source to write to the repos.json file. You'll see we actually implement the repos.json reader/writer first, and then the one for the directory structure with the YAML files, and then, in the following video, we'll go through and run the data flow.
A: We're going to create a data flow using operations, and those operations will collect the pieces of data that we want to have in the output repos.json, which we want to display on our InnerSource portal. We'll write little operations that each collect maybe one piece of data; for example, we might have an operation that grabs the description.
A: We might have an operation that calculates the participation, or one that grabs the logo. For example, one of the operations we're going to write will generate the Gravatar URL based on the owner, so we can display it. If you're not familiar with Gravatar, go check it out: basically, you add your picture, and then anywhere on the internet your email address can be used to look up your profile picture. Okay, so we'll implement those operations, and we'll show how to tie them together using a data flow.
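The Gravatar derivation that operation will do can be sketched like this (the function name is mine; Gravatar's lookup is the MD5 hex digest of the trimmed, lowercased email address):

```python
import hashlib


def gravatar_url(email: str) -> str:
    """Return the Gravatar avatar URL for an email address.

    Gravatar hashes the trimmed, lowercased email with MD5 and
    appends the hex digest to the avatar base URL.
    """
    digest = hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()
    return f"https://www.gravatar.com/avatar/{digest}"
```

Because the hash is computed locally, no web request is needed to produce the URL; the browser fetches the image when the portal renders it.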
A: We can leverage data that's generated by different operations, so that not every operation has to do everything. For example, say we have an operation that makes a GitHub API request that gets this data; like I said, this is the data you get just from doing a regular GitHub search, right? That operation might return a repository object that has, at minimum, this data.
A: Then, instead of every other operation that generates some metrics or some data for this InnerSource metadata, the additional stuff, having to make a request to the GitHub API itself, we can leverage the object that we returned from the first operation. Now we can write more operations that calculate things or, for example, generate things based off the owner's email address, like the Gravatar URL, and we don't have to make more web requests.
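The idea can be sketched with a toy resolver (this is not DFFML's actual API; DFFML declares operations with input/output definitions and its orchestrator wires them, but the principle is the same): each operation names the data it needs, and later operations consume what earlier ones produced instead of re-fetching it.

```python
# A toy data flow: each operation is (input names, output name, function).
# A real orchestrator resolves this graph; this just shows why an
# operation never needs to re-fetch data another operation produced.
def run_flow(operations, seed):
    data = dict(seed)
    progress = True
    while progress:
        progress = False
        for inputs, output, fn in operations:
            if output not in data and all(i in data for i in inputs):
                data[output] = fn(*(data[i] for i in inputs))
                progress = True
    return data


operations = [
    # Derives a normalized email from data produced elsewhere.
    (("owner_email",), "gravatar_id", lambda e: e.strip().lower()),
    # Builds the repo URL from the org and repo name inputs.
    (("org", "name"), "url", lambda o, n: f"https://github.com/{o}/{n}"),
]

result = run_flow(
    operations,
    {"org": "intel", "name": "dffml", "owner_email": " Owner@Example.com"},
)
```

After the run, `result` holds `url` and `gravatar_id` alongside the seed inputs, and any further operation can consume them.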
A: We can just pass around the same objects, and this means that the authors of these operations don't necessarily have to know how to use the GitHub API. They may just know that, hey, there's this object floating around that somebody else has already used the API to get. That way we can create a large directed graph of how the data flows: you write operations, they produce data, and you can leverage the data produced within other operations.
A: All of that without having to go re-grab the data. All right, so let's get to it. We're going to write these two sources, starting with the repos.json source, which is the output source. We're just going to implement reading and writing, because that's very easy for a JSON source, and then I'll show you how we implement the one to read the repos.yaml files organized under a specific directory tree. All right, so let's get to it!
A: I'm going to play you an asciicast of how I did this. I know this is 4K, so you might have to zoom in. Okay, this is going to be too fast, isn't it? All right.
A: All right, so what I did here is I started from the JSON source itself. The JSON source is an existing source that stores things in a JSON object. Now, the SAP format is actually an array.
A: So we're not actually going to subclass from the JSON source; we're going to subclass from the same things that the JSON source subclasses from, which are the file source and the memory source. The memory source essentially provides us with a dictionary, self.mem, that backs all the objects we've loaded into memory, and the file source provides us with the open and close handling.
A: We make the unique key the html_url, which is the same as the repo URL, and we create a Record object for each one of those, where the feature data is that example data JSON object up there. Then, when we dump it out, we just dump the dictionary that we have in memory to a list, and when we load it back in, we load it into the dictionary keyed off the URL. We're also going to register this thing as an entry point.
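The dump and load steps amount to converting between the in-memory dictionary keyed by URL and the JSON array the portal expects (a sketch, not DFFML's actual source code; the function and field names are illustrative):

```python
import json


def dump_records(mem: dict) -> str:
    # self.mem maps html_url -> record data; the portal wants an array,
    # so we drop the keys and serialize just the values.
    return json.dumps(list(mem.values()))


def load_records(text: str) -> dict:
    # On the way back in, re-key the array off each entry's html_url.
    return {entry["html_url"]: entry for entry in json.loads(text)}
```

Because each entry carries its own html_url, the round trip loses nothing: dumping to a list and re-keying on load reproduces the original dictionary.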
A: We're not going to cover entry points quite yet; registering one is essentially going to allow us a shorthand version of this call right here. Right now we're going to write the long version, where we specify the full path to the class we want to use to list the sources, so you can see the sources module, the SAP portal repos.json module, and then the class name there.
A
And
then
we
provide
the
arguments
which
are
the
same
as
the
file
sources.
Config
arguments
which
is
essentially
you
know,
that's
the
file
name
to
load
from
and
and
that's
just
going
to
be,
our
json
file
that
we
give
so
and
here
you
can
see.
I
got
confused
because
I
had
these
relative
imports
that
I
didn't
delete.
A: So now we've deleted them, and we can go run the source here, to run that command-line example, and we see that we dump the data. Let's see... yeah, I was going to create an issue, and then I didn't. All right, so here's the dump: you run this list command.
A: Running the list command instantiates this class, using this filename as the filename to read the JSON from, and lists all the records in it, which we populated via that self.mem object. So now we've copied the previous file to a new file, to use as a template, and we're going to go through and implement this for the orgs.
A: This is, like I said, the directory structure I showed earlier: it's got orgs, each subdirectory is an org name, and here are the contents of those files. Each has the name and the owners, which is a list. So we go through, and we're going to do a recursive listing of all the YAML files.
A
Our
config
object
is
just
gonna
take
the
directory,
which
is
the
top
level
orgs
directory,
and
so
then
every
yaml
file
we
find
the
org
name
is
going
to
be.
The
the
org
name
is
going
to
be.
The
the
org
name
is
going
to
be
the
subdirectory
name
and
then
the
you
know
we
load
all
the
yaml
documents
and
you
can
see.
A
Lingamble
documents
are
separated
by
the
three
three
dashes
there,
so
we
load
all
the
yaml
documents
in
each
repos.yaml
file
and
we'll
you
know
key
the
urls,
which
I
think
I
forgot
to
do
so
I'll.
Do
that
right
here
at
the
end,
we'll
key
we'll
we'll
key
the
records
based
on
you
know:
org
name,
slash,
repo
name
right,
and
so
here
what
we
do
is
we
can
just
implement
the
the
memory
source
we're
not
going
to
use
the
file
source
because
obviously
we're
reading
from
a
directory,
not
a
file.
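That loading step can be sketched as follows (this is my simplified version: it walks the orgs directory, splits each repos.yaml on the `---` document separators, and keys each entry as org/repo; a real implementation would parse the documents with PyYAML's safe_load_all rather than this naive field splitting):

```python
from pathlib import Path


def load_orgs(orgs_dir: str) -> dict:
    """Walk orgs_dir and key every repo entry as '<org>/<repo>'."""
    records = {}
    for repos_yaml in Path(orgs_dir).rglob("repos.yaml"):
        org = repos_yaml.parent.name  # the subdirectory name is the org
        # Naive multi-document split; real code should use yaml.safe_load_all()
        for doc in repos_yaml.read_text().split("---"):
            fields = dict(
                line.split(":", 1)
                for line in doc.splitlines()
                if ":" in line and not line.startswith("-")
            )
            name = fields.get("name", "").strip()
            if name:
                records[f"{org}/{name}"] = {"org": org, "name": name}
    return records
```

The org/repo key is exactly the unique identifier a Record needs, which is what makes this directory tree usable as a source.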
A: That load_fd method actually gets called in the __aenter__ method, which is the context entry to this source object, and you can read about that on the double context entry page. Everything follows this double context entry pattern, and for the memory source here, the inner context is handled for you, so you only have to do the top-level context.
A: I'm going to open this up in a second here, because now it's going too fast. All right, so let me bring this up... sources... okay, so, on entry: when this class is instantiated, per the double context entry documentation, we do this double context entry pattern, and the first context entry is what you're seeing here in this __aenter__ method.
A
And
so
now
we
can
use
this
as
an
opportunity
to
populate
the
self.mem
that
we
were
populating
before
and
so
in
in
the
previous
example
and
and
all
the
other
sources,
if
you
go
go
through
them,
if
they're
based
on
memory
source,
so
you
can
yeah
you
can
you
can
you
can
populate
that
self.mem
here
and
you
can
see
what
we're
doing
is
just
you
know,
recursively
listing
all
the
yaml
files
grabbing
them
and
then
setting
the
name,
and
so
we
have
to.
A: So this is the Record's key, which is the unique identifier that I was talking about. As long as we have a unique identifier, we can use the source construct. So now we've got the key, which is the github.com org/repo name, and we'll run this example here, and it will show up like this. So... where's my... there we go; rerun this example command. Oops, oh, and I'm not in that...
A
All
right
so
now
we
see
that
the
key
is
is
the
correct
set
of
values
here,
which
is
that
you
know
the
full
github
url
all
right
so
and
now
what
this
is
gonna
do
is
right.
So
what
we
can
do
now
is
is,
is
we'll
go
and
we'll
make
our
data
flows
and
we'll
make
it
so
that
you
know
just
like
the
list
command.
All
you
know
all
the
command
syntaxes
is
very
similar
here.
A: So we're going to do a dataflow run command, and we're going to use this orgs source as our input source; then we're going to run the data flow on each one of these, and produce a record that's going to be output to the other source that we wrote, which is that repos.json source. That's what we're going to cover in the next video.