From YouTube: Getting Started with Augur, November 14, 2022
Description
This is a workshop-type meeting where Sean Goggins explains how to get started with the Augur software.
A: And I will also turn on the live transcript. All right, everyone, thank you for coming to this Augur hackathon, getting to know Augur. We have an agenda that we're going to follow, which is right here: we're going to talk about everything about Augur and how to contribute, look at the file structure and what the files contain, and go through installation and the process of adding a new worker.
A: One thing I'll do real quick before we get started: the one thing I am less familiar with than my team right now is how to add a new worker, because we just changed that process. I'm sending a Discord message to my team to see if anyone's able to join us.
A: I just have to add the participants and copy the invite link.
A: The first thing that I'm going to show you is just the front end. There are a couple of things that... I can't remember.
A: All right, well, it'll come to me momentarily. So, Augur, effectively: if you look at the groups, under Groups you'll see a list like this in the front end. Under Repos you'll see a list like this, which is each of the repositories in the various groups, and I can order them by group name. And under Insights, there aren't any, I don't believe, at this particular Augur location, and I'm trying to remember one that... oh, I know what it is.
A: So I'll use the eBay instance. Under Insights, Augur has some workers that look for anomalies. The way that works is: it trains on the last thousand days of commit, issue, and pull request activity and looks for changes from the moving average, using a random forest algorithm, to essentially identify whether or not any particular repository has seen a statistically significant increase or decrease in activity in the last 14 days. I'll explain the machine learning workers when we go through the parts of Augur.
A: The purpose of Augur is to give someone who's looking at a collection of open source projects some kind of idea of how much activity, and what kinds of activity, exist in each repository. So, similar to GrimoireLab and other tools, we're trying to present a sense of the overall health using Augur metrics. And when you're looking at a list of the repos...
A: It's somewhat intuitive that, for example, some repos will have more activity than others. Is this in the direction of what we're looking for under "everything about Augur and how to contribute"?
A: If I widen this a little bit, you can see the lines of code added by the top ten authors, and if I click a year, you can see how much by month; let me go back there. This shows average lines of code per commit. It shows organization information, which has to be provided by whoever the organization using Augur is, and then it gives you reviews and pull requests by the week, and the number accepted by week.
A: Pull requests declined; issues opened; issues closed; new issues; code changes, which are either lines or commits (these are lines of code added per week); and then, if there were library files used, they would be listed over here.
A: There's another tab called Risk Metrics, which shows the number of forks by week, the number of committers by week, whether or not there's a CNCF best-practices badge, the types of licenses that are declared, and where there's no assertion of the license but a license of some kind is declared, plus the coverage; it also indicates the percent of OSI-approved licenses among those that exist. So this is Augur's original overview of information, provided by week.
B: I have a question, a developer question. Would it be possible for you to show us how Augur works with an example? Does it work instantly when you plug in a repository, where it gathers all these insights? Say, for example, you take a random repository and it puts up all the details. I suspect it doesn't work instantly, but if yes, could you show us an example?
A: Sure. It doesn't do it instantly. Most tools, in fact all tools that are in the CHAOSS project, don't provide data instantly. There is a data collection period, so it takes a certain amount of time. Right now, the current version of Augur is under the augur-new branch, and I'll just make some notes here on the agenda.
A: I can type it: the current branch we are working on is augur-new, and I expect to release it any day now. I've been working through some final glitches in augur-new; really very minor glitches.
A: Now we can collect all of the data for 10,000 repos in less than a week, and that's because of some new technologies that we've come to employ. So when it comes to adding new information to Augur, there are two approaches. The one I think is the place to start is immediately following the installation of Augur, at the command line, and I'll show you that briefly. First, I'll just show you the new Augur interface.
A: Augur has a new interface right now, and this new interface does a couple of things that I'll explain briefly. One of the things it does is: if I create a user, I can go to that user's profile and add a new repo or organization that I want to collect data for. Many of the ones that would be obvious, like CHAOSS, are already collected. Does anyone have a GitHub organization that they'd like to have new data collected for?
B: And we get that too? Okay, I think I'll drop Drupal in the chat.
A: So what happens here is I'll add Drupal, and that will take a minute; we're working on it. You can see by the X not being finished here that it's still working. One of the changes that John, one of our maintainers, is creating in this interface is that it will show you a waiting sign, and then it'll say "successfully added repo or org," because right now the state isn't apparent. We're also going to make it a little bit more clear what your user repos are.
A: Let me resume recording. So when you do that step where you add a repo group, it ends up creating an entry for all of those repos. I'm just sharing a portion of my screen right now. This is the repo database, and you can see... I don't know, can you see this okay? Because, unfortunately, I can't make it bigger.
A: You can see that it added all of the repositories under Drupal. One of the indications that it's not collected yet is: when you add the repos, it'll queue them for collection, and repos that have just been added will list the git URL in the repo database, but the repo status will be "New" and it won't yet have processed the repo name or the repo path, which are internal to Augur.
A: So next, maybe to explain how Augur works... that's basically showing you that when a repo is added, the repo is created in the repo database and is queued for collection.
A: At that point... which is useful, to say the least. And here are some repos that have had the data collected for them already.
A: But your question was kind of how all the pieces work together, right? Like, what's the technical flow of Augur? Am I getting that correctly?
A: PRs, et cetera. So that becomes relevant a little bit later, but these are some of the other things that the tasks also collect. For a task that collects... let me use one example: pull request information.
A: It also includes the... what do you call it? The head.
A: And the base, so we know what fork was used to create that pull request and where it was merged. This pull request base also includes a status.
A: It's either open or closed; "merged" means the same as closed, so merged and closed both mean closed. If the status is closed, it means it was closed without being merged, and if it was merged, of course, it means it was merged, and then, of course, when it's merged, it's closed. It also includes things like assignees.
A: ...and files and commits. So in the case of pull requests, all of the things are in there: the pull request isn't just one table, it's all the things about a pull request. The same holds for issues; the same exact metaphor. An important piece of all of this is the...
A: I have a big chunk... oh, okay, all right, so, great. But I will still try to get to that part of the question while you're here early, so you don't have to spend two hours with us.
A: One thing to know, and you may already have figured this out in a sense about Augur or anything else, is that knowing who did something is kind of important and useful. Isaac, specifically, has created some logic here, because, if you recall, we're doing a clone count for the commits. One of the distinct and interesting things about commits is that the platform identifier for the person is not included in the git log, so those identifiers by default are emails. Everything else is a GitHub identifier (I'm just using GitHub as an example), which is both an ID, which is numeric, and a username, as an aside.
E: Oh yeah, for contributors, in terms of how we're identifying them: we have a UUID that basically takes into account the user ID and the platform ID together, and that's just one value that's universal for all contributors across all platforms.
E: It's like the canonical ID. It depends on the platform, because platforms have different things, but for GitHub it's the platform and the user ID. Obviously, the platform ID will have to be there for all platforms, because it distinguishes the platform, but the individual user ID will be a bit different because, you know...
A: And we're using the ID because people can now change their username. So if we used the username, it's possible a username will go away; the ID won't go away. If you change your username from Fred to Tony, then we would lose all of Fred's references, but Tony and Fred will always have the same ID.
A: So it's the platform, in this case GitHub, plus the ID that GitHub assigns to every user. And then Isaac's processes, together with the commit counting tool, which is called Facade (derived from work that Brian Warner did ten years ago and significantly evolved, with Brian's blessing and permission, in Augur), will resolve all that information to the same contributor.
A: Why this is important is because, if you collect UUIDs for a collection of a thousand repos, those UUIDs will be exactly the same on any other Augur instance where I contribute with my goggins.com email or any of my other emails. For example, in my case, I've probably contributed to GitHub using 12 or more emails, and my UUID will be exactly the same on all instances of Augur, which ultimately would make it easier to integrate all of the data from all the instances of Augur that you might have.
A: ...if you chose to do that. So, as for what the platform API stores: if I'm looking through the platform API, the contributor is automatically stored using this same UUID.
A: And then the same would hold for issues; contributors are gathered following this same example. Any time, for example, a pull request encounters a user that isn't already in the contributors table (and contributors are stored in the contributors table), it will go and actually retrieve the information for that user that isn't already in the database.
A: Okay. Probably the most important other thing to share, conceptually at least, that's important to understand, is messages.
A: We see messages in many places, but the two main places are on issues and pull requests. The way that we've organized messages is that there's... and you can see at the bottom, I hope... I have an issue here.
A: I'm pretty sure this is one-to-one, where each individual pull request message has a single message in the messages table. The reason that we have this bridge entity, in relational terms, is so that we can distinguish the origin of the messages. Now, in hindsight, we could have done this differently, but, as I said, I designed this four years ago and probably made it overly relational. Two important things: one is that you can distinguish pull request messages from issue messages, or other messages that we gather, and there's only one...
E: Yeah, it's basically like a concurrent thread that Augur can run; not really a thread, but you can think of it that way.
C: Is there some kind of ordering?
E: Yeah, tasks are organized into phases, basically. So there are large groups of tasks that are differentiated because, for one reason or another, they absolutely cannot, or are not supposed to, run at the same time as other groups of tasks.
E
So,
like
an
example
would
be
like.
We
have
a
preliminary
phase
where
currently,
the
only
thing
that
we
do
is
we
check
all
of
our
sources.
In
like
say
we
have
like
10
GitHub
repos.
We
check
to
make
sure
those
haven't
moved,
URL
or
anything
before
we
run
the
rest
of
the
data
collection.
E
Another
example
would
be
the
machine
learning
workers
like
they're,
pretty
resource
intensive,
so
we
don't
want
them
running
at
the
same
time
as
anything
else
so
they're
in
their
own
phase
and
within
the
phase
you
have
various
ways
of
organizing
individual
tats
that
are
given
to
you
by
celery,
like
you,
can
put
all
a
bunch
of
tasks
in
a
group
and
they'll
run
all
at
the
same
time
put
a
bunch
of
paths
in
a
chain
and
they'll
run
sequentially
and
there's
more
stuff.
E: Are you asking if, in the repo collect phase, things run concurrently? Yeah, they do. If things aren't dependent on each other and they're in the same phase, there's no reason why they wouldn't be run at the same time. Obviously, there's stuff like the task for pull request files, which can't run until pull requests has run, because it's dependent on the pull requests existing; so that is a direct relationship.
E: That's specified within the phase. But something like, I don't know, Facade can run at the same time as we collect pull requests, because there's no reason why they can't.
A: So, one of the things I mentioned earlier is that where it used to take over a month to collect data for 10,000 repos, now it takes a week or so. One of the reasons is that the augur-new branch, which will soon be the main branch, does a massive amount of parallelism compared to the prior version of Augur.
C: Just one last question over here: this parallelism that we're talking about, is it done by the use of multiple servers, or are we using multi-threading on the same server to achieve that parallelism?
E: It can be done with either. Right now, I've just been testing it on the same server machine, but it's possible to do both. Celery supports both, because whenever you schedule a task, or a phase, or a group of tasks, it just queues that all up on something like Redis (there are other backend queuing services, but we use Redis). And so, if you were to have your Celery instance on a different server, you could totally do that.
A: Oh, okay. There was something else I was thinking that might warrant explanation right now, just...
B: One second... it's fine, yeah. Manuela has a question. Oh...
B: Okay, the question is: they're very interested in Augur, and so, how can they contribute? I think that's the second part, yes.
A: The thing here is there are two places to contribute. One is the piece where we actually go through and install it, which is obviously kind of a prerequisite. But the first place that I would point someone who maybe wants to make a contribution is in the augur directory: so, under the root of wherever you clone Augur, in the augur-new branch.
A: Take repos: the things under the API, if they're a standard metric... for example, one of the standard metrics. These all return JSON objects that provide specific information; in the case of repos, it's the ID, a name, and sometimes, if it's got git coverage... actually, we're not going to include those extras.
A: Okay, so for code changes, I go under metrics, where the metric is installed.
A: There's a metric called code changes, with endpoints, and a standard metric like that will give you the name of the repo.
A: So that's one API doc endpoint, and that's how they're... so, if I go in here, under the API metrics again, there are our standard metrics, routes, and non-standard metrics. If I go into one of the standard metrics, this is actually a very easy pattern to follow. So if the metric that you want to develop is a CHAOSS metric, you can take a look at any of the files under this metrics directory, and they share a common structure at the very top.
A: There are these twelve lines of code with the SPDX identifier, a description of what's in the file, the libraries, and it instantiates a database connection. Each individual metric can be developed just using SQL. So if there's data for a CHAOSS metric or metric model that you want to build... first of all, some of the SQL may at first appear somewhat complex, but keep in mind that there are hundreds of different queries already developed, and you can use those as a pattern to follow for developing metrics.
A: If there is no repo group ID provided, you need to provide a repo ID; that's the most common use. The parameters are defined here, and in a standard metric these are always the same. If the application or end user does not provide a begin and end date, simply the beginning of time according to computers, January 1st, 1970, is provided, and the end date is essentially the very second that you are making the request.
A: The SQL variable is set to None, just in case it's previously been set, and then the code checks to see if a repo ID has been passed. If it has, it uses the SQLAlchemy function sqlalchemy.sql.text; you can see up above that we've aliased SQLAlchemy to "s," so that all we have to do down here is call s.sql.text.
A: You return: triple quote, put in your SQL, close triple quote. If a repo group ID is provided, it's a separate query. And then the results in a standard metric are always returned by pandas reading the SQL, connecting to the database with these parameters to get the results, and then, oops, the method...
A: Excuse me. The method returns the results, and those results get processed by Augur into this JSON output as an API endpoint. So that is one easy way, without getting real deep into Augur, that you can get started helping, because it applies some very easily templated things, and then you can build the metric.
B: So I think maybe I'll let you, Manuela, ask a question, because that's our question.
A: While you're preparing your question: the other things that we'll probably talk about are the tasks for data collection, as well as installation and deployment. But if you want to get started, you know, we can spend an entire session like this going through getting you started.
D: Oh my God. So, I wanted to ask about these sprint events, you know, in order to be able to have better clarity on how Augur works and how to contribute to it.
B: Okay, Sean, so Ahmad is the person I talked about from PyData Ghana that wanted to do a sprint with Augur, yeah.
B: I did send a Slack chat; well, yeah, I'm not sure you've seen it yet.
B: Yeah, so, Ahmad, are you asking about what ways they can contribute, or what was your question?
D: Yeah, so it's kind of everything together: what ways we can contribute. I'm sure most people would want to contribute through code, or some people would want to contribute through code, and I just wanted to have a better idea. I joined earlier, but I got distracted with work, and so I didn't get some of the details. So I just wanted to see if, looking at the file structure, we could get kind of what each of those files is doing, and also...
B: I want to add some more context. PyData Ghana is like a community, and Ahmad wants to run a sprint with Augur. So he wants to kind of understand how PyData community members can contribute, and what's available that they can contribute to, since they have a lot of people interested in Python. So I think from the file structure as well, we can get through how different members from PyData can contribute via the sprint.
A: Okay, my thinking is... I'm going to turn to Isaac here before I start jumping off, because Isaac is deeper into things than I am. Isaac, what I'm thinking of for a group like PyData, who have a lot of Python skills, is that we might arrange or coordinate sprints around some of the tasks that we have yet to move over from the previous version of Augur, like the value and dependency workers especially. But I don't know if that's... is that all right? Yes?
E: It would be great just to have the experience of having new people look over the tasks in the phase system that I designed, basically making sure that they can make sense of it and write good documentation for it. And basically, we should have a worker template for the tasks that we have now, like we have in the old version, but yeah.
A: Okay, that's a good idea. So...
A: What I'd like to suggest that we do next is: I'll provide a brief overview of the file structure in augur-new, and then I'll start talking about tasks and kind of set up a question for Isaac to help us walk through that part.
A: The .github directory is essentially any GitHub Actions-type things that we've organized; you don't need to look at that. The augur directory includes the API; the application, which is kind of the core Augur piece, which you can ignore; the tasks, which I just mentioned; and then utility functions, which, I assume, are simply things that are shared across different parts of Augur.
A
Isaac
may
have
stepped
away.
Oh
sorry,
what
the
the
util
directory
is
just
a
bunch
of
utilities
that
may
be
shared
across
different
parts
of
auger.
Yes,
exactly,
okay.
A
The
other
directories
are
Docker
somewhat
self-explanatory.
That
this
is
where
our
Docker
stuff
exists.
Front
end
is
the
directory
for
our
what
I,
what
I'm,
calling
our
old
front
end,
which
is
basically
this.
A
That's
all
view
JS
not
to
worry
about,
for
the
most
part.
Scripts
are
primarily
things
that
we
use
to
install
and
configure
auger.
We
have
control
scripts,
Docker
scripts
and
install
scripts,
so
those
are
mostly
shell
scripts
that
are
used
to
get
augers
set
up
and
tests.
Our
unit
tests
that
primarily
Andrew
has
written
that
effectively
test
the
different
parts
of
auger
and
eventually
we'll
reintegrate
that
into
a
GitHub
workflow
of
some
kind.
A
It
used
to
be
Travis
CI,
but
Travis
CI
kind
of
blew
up
its
whole
model
for
everyone,
and
so
now
we're
putting
the
tests
there,
so
the
meat
of
where
contributions
would
be
most
most
welcome
and
helpful.
Obviously,
in
any
place
where
you
find
something
you
want
to
tidy
up
or
whatever
always
welcome,
but
under
the
auger
directory.
It's
this
tasks
directory,
which
is
where
the
data
collection
work
takes
place,
and
this
data
collection
work
is
divided
into
four
main
categories.
A: ...exists there. So this is a place where contributions would be useful; very useful, extremely useful. To give you an idea, the tasks directory originates from a different directory in our current main branch, and it's essentially the way that we have been able to parallelize a lot of the work to enable much faster collection.
E: I was gonna pull up just one of the simple tasks and show how it's organized, yeah.
A: If you're able to; I know you're probably running on a hardcore version of Linux, yeah.
E: Yeah, it should work, one sec, because I'm using VS Code and that's an Electron app, but yeah, one sec... here, I'm ready.
A
So
while
we,
while
we
wait
for
Isaac
to
come
back
the
different
tasks
for
data
analysis,
this
is
where
the
machine
learning
workers
principally
live
the
clustering
worker
clusters
repositories
based
on
the
patterns
of
communication
that
are
identified
as
present.
It
also
does
topic
modeling
for
each
of
the
repositories.
So
we
can
see
what
kind
of
topics
are
discussed.
A: Discourse analysis identifies eleven different categories of discourse, which can then be used to discern a time-sequence analysis of how conversations go around pull requests and issues on individual repositories. And the message insights worker uses a software-engineering-tuned sentiment analysis and a novelty detection algorithm to identify the nature of speech, in terms of inclusive or not inclusive. There are also a couple of repositories that we're adding to this worker to look at inclusiveness specifically, and also ableist language, and the pull request...
A: The git worker: this is entirely Facade, so it runs the old Facade tasks as we modify it, and there are some user utilities that we have in here.
The
move,
detection
worker
is
the
first
one
that
runs
and
it
determines
if
a
repository,
that's
currently
in
our
set
for
collection,
has
moved
so
it's
more
frequent
than
you
might
think
than
that
a
repository
will
change
organizations
or
change
its
name,
and
when
that
happens,
for
a
period
of
my
anecdotal
observation
up
to
about
a
year
and
a
half
GitHub
will
continue
to
resolve
all
of
the
old
links
to
this
new
location.
But
we
just
go
about
proactively
moving
it
events.
Look
at
the
event
stream
on
GitHub
GitHub.
A: GitHub does have a limit of 400 pages of 100 instances each, so you always get the last 40,000 events, but the longer you're collecting, the less likely you are to have gaps. Facade GitHub is really focused on that contributor resolution piece; it's in the GitHub directory because it uses the GitHub APIs to resolve contributors. Issues and pull requests are fairly straightforward: they gather all of the issue- and pull-request-related data.
A: As I mentioned, all the messages are in the same table. So, once all the issues and all the pull requests are created, we'll start collecting messages for each of them, and, generally speaking, we literally get every single message, and its metadata, that's issued against a pull request or an issue on a platform. Releases specifically looks at... if you were to go to GitHub, for example, it gets this metadata; you can see there's a releases section down in the lower right. Anytime there's a release...
A: ...you get all of the data about the release, and all of the releases on a repository get collected. Release data can be especially useful when you're trying to look at activity in a time period that reflects the interests and needs of a repository. So if I look at, for example, my cycle of pull requests, issues, and commits, and I just look at them by month, those months have less meaning in terms of the cycles of a project than if you were to look at them...
A: ...in the context of time between releases; time between releases tends to be a good indicator of the cycles. Now, that said, not all GitHub repositories use releases. I would say somewhere between half and three quarters do, but a quarter to a half don't, so obviously that data is only useful when they actually are issuing releases.
A: Repo info is especially important for Augur, because what repo info shows you is all of the data about a repository that is platform metadata. So if I go to...
A
If
I
look
at
the
number
of
forks,
the
number
of
Watchers
number
of
committers
all
very
interesting
where
it
gets
super
interesting,
let
me
go
down
where
there's
some
actually,
some
data
issues
count
because
we're
collecting
all
the
issues
and
we
get
the
issues
count
metadata.
We
can
know
if
we
have
them
all
same
with
pull
requests.
If
I
get
the
pull
request.
Count
I
should
have
1459
pull
requests
for
this
repository
and
462
issues
for
this
Repository.
A
That's
that's
important,
because
now
I
now
I
can
tell
with
some
with
a
great
deal
of
confidence
that,
when
you're
doing
analysis
of
pull
requests,
issues
and
commits
that
you
do
in
fact
have
all
of
the
data
most
other
tools.
Do
not
do
this
verification
or
give
you
visibility
into
this
verification,
and
so,
for
example,
with
GH
torrent
or
get
archive.
A
There's
no
metadata,
and
we
know
there
is
data
missing
with
with
other
tools.
There
is
no
validation
that
we
have
the
correct
count,
and
so
one
of
the
things
that
I
think
auger
does
well.
That's
super
important
is
validate
against
the
platform
metadata
to
ensure
that
we
have
everything
any
questions
or
should
I
keep
talking.
E: I should be able to share my screen.
A: It looks like it's working. Okay, you're sharing your screen now, I believe. Are you sharing your screen in Zoom? Yeah? Okay, so everyone can see this? Yeah, okay, awesome. So I think what we want to know, well, one of the things we want to know, is: let's take the value worker, for example. I guess you could explain it first, but what I'm thinking of is: what are the steps if one wanted to convert the value worker into a task?
E
Well,
I
actually
did
this
with
the
release
worker,
which
I've
converted
into
the
release
task
right
here.
Your
collect
releases,
okay
and
I
got
I,
went
to
the
old
like
logic
and
I
put
it
in
well.
First
of
all,
I
put
the
whole
thing
in
the
and
then
GitHub
getting
the
GitHub
file
under
cast
okay
and
then
I
made
a
folder
for
what
task
it
is.
So
it's
the
releases
stuff
like
that
goes
in
its
own
folder.
So
your.
A: First... so, converting... I'm just going to put the steps in the notes: steps to convert old workers in main to tasks in augur-new. So step one is you copy the old worker into a task file under tasks.
E: I wouldn't copy the entire file. What I would do is just get the directory structure organized before I would start writing. I would just, like... you create a folder...
A: You know what, I guess you can just walk us through it. So what does that mean, "create a folder"? So I'm imagining I've got a workers directory with the value worker in it. Would you just first create a value directory in one of the tasks directories?
E: Yeah, pretty much. I mean, it depends... what does the value worker do?
E: And does it interface with the GitHub API? It does not? Yeah, then it should probably go in either git or its own folder, if it's not related to the git log or GitHub, because it's mainly organized by data source.
E: And then, in the git folder, I'd create the value worker folder, and then in that folder what you want is a tasks.py and a core.py.
A: A tasks.py and a core.py. And is there any template, or what would the template for those be? Maybe you could walk us through what...
E: It would look a lot like the releases one. Well, I chose the releases model just because it's really simple. First, you would just import all the database stuff that you need to run it: you need the database session that we have. Well, first you need to import the other file, because that's a part of it, and then you basically just need to import the database stuff and the other stuff that you're writing over here.
A
So
is
this
the
so
core
I
see
so
the
task.pi
Imports
core.pi,
but
what
goes
in
core
dot
pi.
E
Core.py is the actual functionality of the worker; tasks.py is just the logic that starts it. You can have easy error handling for the whole model from there, and then all the actual manipulation of data and insertion into the database happens in core.py.
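A minimal Python sketch of this tasks.py/core.py split. Every name below (releases_model, collect_releases) is illustrative, not Augur's actual API; the real task would also use the project's database session and logging.

```python
# Sketch of the split described above: core.py holds the worker's
# functionality, tasks.py is the thin entry point that starts it and
# wraps coarse error handling around the whole model run.
# All names here are hypothetical.

# --- core.py: the actual functionality of the worker ---
def releases_model(session, repo_git):
    """Fetch release data, shape it, and hand it to the database layer."""
    # Placeholder: in a real worker this queries the data source and
    # inserts via the ORM; here we just return the rows it would insert.
    return [{"repo_git": repo_git, "release": "v1.0.0"}]

# --- tasks.py: the logic that starts the model ---
def collect_releases(session, repo_git):
    try:
        return releases_model(session, repo_git)
    except Exception as exc:
        # In a real task this failure would be logged before re-raising.
        raise RuntimeError(f"release collection failed for {repo_git}") from exc
```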
A
Okay,
so
when
so,
when
you
did
this
I
assume
you
had
to
change
by
Ruth
I
assume
that
you
had
to
change
lots
of
things
or
some
things
about
what
was
in
the
value
worker.
E
I
didn't
have
to
change
as
much
as
I
thought,
I
pretty
much
just
had
to
like
manipulate
it,
so
that
it
would
interface
nicely
with
our
database
orm
and
insert
it
correctly.
But
it
wasn't
that
hard
to
do
like
you're.
E
In
my
opinion,
at
least
just
like
any
of
the
the
list
of
dictionaries
of
data
that
you
need
to
insert
at
the
table
that
you're
inserting
to
get
into
and
do
UDP
on
that
table
for
support
for
on
conflict
to
update
and
if,
if
you're
not
doing
it
on
conflict,
you
update
insert,
then
it
is
most
likely
better
to
do
it
in
an
actual
like
SQL
text
and
then
just
executing
that
SQL
text.
E
But
in
most
places
you
you
would
you
want
to
do
the
on
conflict?
You
update.
E
That's
the
here:
well,
the
insert
data
method
is
for
the
on
conflict
to
update.
There
is
an
option
to
not
to
change
it
to
an
on
conflict.
Do
nothing
like
that's
what
the
insert
data
method
is
for.
E
If
you
want
to
insert
it
manually,
you
still
can,
with
the
you,
can
do
like
a
like
an
s,
dot,
sql.txt
option.
You
can
just
write
SQL
here
like.
E
Or
you
can
just
execute
that
but
yeah,
but
in
most
cases
you
want
to
do
the
on
conflict.
You
update
where
the
ion
clock
will
do
nothing,
which
is
why
it's
just
called
insert
data
and
it's
a
general
method
right.
There
are
cases
in
which
you
would
need
to
do
like
a
specific
SQL
query,
and
you
can
still
do
that.
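The two insert styles being discussed might look like this in SQLAlchemy (which the s.sql.text call suggests Augur uses): a generic on-conflict-do-update upsert built with the PostgreSQL dialect, and a hand-written SQL string for anything the generic helper can't express. The table and column names here are invented for illustration, not Augur's real schema.

```python
# Hypothetical sketch: an ON CONFLICT DO UPDATE upsert vs. raw SQL text.
import sqlalchemy as s
from sqlalchemy.dialects.postgresql import insert

metadata = s.MetaData()
# Made-up table standing in for whatever the worker inserts into.
releases = s.Table(
    "releases", metadata,
    s.Column("release_id", s.String, primary_key=True),
    s.Column("name", s.String),
)

def upsert_stmt(rows):
    """Build INSERT ... ON CONFLICT (release_id) DO UPDATE for `rows`."""
    stmt = insert(releases).values(rows)
    return stmt.on_conflict_do_update(
        index_elements=["release_id"],            # the natural key
        set_={"name": stmt.excluded["name"]},     # columns refreshed on conflict
    )

# Fallback for queries the generic method can't express: raw SQL text,
# executed directly against the connection.
raw_query = s.sql.text(
    "INSERT INTO releases (release_id, name) VALUES (:release_id, :name)"
)
```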
A
So, probably, Ahmed, in what time frame is your group thinking of doing your Augur sprint?
D
Yeah, sorry. I think sometime over the weekend. So maybe we could just galvanize those who will be interested to register for the event, and then let them have an idea of what the project is about so that they can prepare for it. And then we have, like, four-to-five-hour sprints where you can get them involved and contributing in different capacities.
A
Okay,
what
I'm,
what
I'm
thinking
is,
so
you
kind
of
I'm
fine,
so
the
way
that
I
think
it
could
be
done
is
if
we
the
way
that
I
think
this
might
work
is
if
we
would
do
two
things
to
support
that
effort.
A
One
is,
depending
on
the
time
of
day,
possibly
I
could
be
available
for
basic
questions.
Obviously,
there's
a
getting
auger
installed,
part
that
needs
to
take
place.
What
operating
system
are
most
of
the
folks
working
in.
D
Most of them are working with the Windows operating system. For me, I'm working with a Linux machine, and we also have a few people working with Mac laptops, but the majority will be Windows.
D
Yeah
I
think
some
of
the
intermediate
users
will
be
more
comfortable
with
it's.
The
Linux
command
line,
yeah.
A
Okay, so with 20 participants, for new data collection, Isaac, one thing I can think of is if we created a branch that sort of templated out the basic steps, and created an issue explaining what needed to be done for three workers.
A
And
then
so,
if
we
did
that,
then
so,
essentially
we
could
create
a
branch
where
the
the
workers
weren't
working
yet,
but
we
had
some.
You
know
we
described
in
the
issues.
What
needed
to
be
done
for
each
of
the
three
that
I
have
in
mind
and
then
with
20
people,
I
think
if
we
had
a
general
template
they
they
could.
So
everything
that
we
get
right
now
is
from
from
GitHub
I,
don't
know.
E
Yeah, Andrew would definitely appreciate it if people wanted to help with the endpoints.
A
Okay,
perfect,
so
I
met
I.
Think
if
you're
doing
it
this
weekend,
I
think,
maybe
if
you
could
give
us
a
few
days
here
to
put
together
those
templates
and
issues
or
how
much,
how
much
in
advance
do
folks
need.
D
Yeah, we can do it this weekend, or, to get more people, we can have it next weekend. But I also have a question: is there a way we could help people who, for some reason, are maybe not able to finish their tasks that day, and maybe need some extra support to complete their tasks after the event?
A
I
think
one
one
way,
of
course,
is
always
issues
another
another
good
way
is
I
will
schedule
another
one
of
these
sessions
like
if
you
did
it
in,
if,
if
it's,
if
you're
flexible-
and
you
did
it
like
not
this
coming
weekend
but
the
weekend
after
that-
would
give
us
more
time
to
be
prepared.
A
Like
the
weekend
of
the
my
knowledge
of
calendars
is
limited,
hang
on.
Let
me
find
let
me
find
my
calendar.
A
Okay, yeah, so all right, that would work. I can be around to do some support on that day. But I think probably the last thing that we need to think about is that Augur has historically been difficult for people with Windows computers to install. I mean, you're in a Python community; you understand, Ahmed, that everything works differently on Windows. And I don't know how experienced your community is with dealing with all of those idiosyncrasies.
D
Okay, so what we can do is we can just streamline this test cohort to people who actually have a Unix machine, either Mac or Linux. That kind of sets a bar so that we don't waste too much time trying to help people.
E
If it wouldn't be too much harder, the Docker container is functional, although I don't know if it's production-ready. It doesn't have to be production-ready.
E
That's
fair:
it's
definitely
like
ready
to
like
Tinker,
with
at
least
like
last
time.
I
ran
it.
It
ran,
although
I
have
not
run
it
in
a
while,
since
I've
been
trying
to
do
a
bunch
of
facade
stuff.
A
Yeah,
so
so
maybe
the
thing
that
we
can
get
ready
for
you
this
week,
Achmed
is
Docker
for
Windows,
okay
and
the
task,
the
task,
examples
and
stubs,
and
maybe
something
similar
for
API
endpoints
and
perhaps
you
and
I
could
have
a
conversation.
A
Maybe
on
this
day
next
week,
where
we
we
go
through
some
of
that
in
a
bit
more
detail.
Okay,.
A
So
I
will
I,
will
just
put
a
I'll
just
put
the
same
session
on
the
calendar
on
the
chaos
calendar
for
next
week,
and
we
we
can
catch
up.
Then,
if
that,
if
that
works
for
you.
D
Yeah, I think this time next week also works.
D
Is
is
it
it's
it's
for
450
here
right
now,
so
I
don't
know
what
time.
A
I'll just go ahead and make this particular meeting occur again next week, at the exact same time.
A
And
I
I
have
a
dentist
appointment
at
11,
20,
so
I
think
we're
gonna
fall
short
of
getting
through
our
entire
agenda.
Most
significantly,
the
auger
install
but
I
will
I'll
make
a
separate
recording
of
that
later
today,
so
that
so
that
I
can
refer
you
to
that,
and
this
recording
are
you
in
the
chaos
Slack?
A
Yeah, I mean, I'll probably just share it in the general channel, so that others have access to it. There's also an Augur channel on the CHAOSS Slack, and I can share it in that conversation as well. I just wanted you to know I would also be sharing it in the Augur channel on the CHAOSS Slack, so that others are aware of what we talked about here.
A
Well, thank you, Ahmed, nice to see you again. Hopefully you got some stuff out of this. I think, Ruth, we've certainly learned a bit, and we'll talk with you all again soon. All right.
C
Bye-bye. Can you just add that Slack bot to the meeting as well, so that we can get a reminder on Slack for the meeting next week?
C
I guess Ruth was just mentioning to me that there are some things, especially, to be done. Okay.
A
I'll
I'll
ask
Elizabeth
what
do
I
have
to
do
with
the
slack
bot.
A
All
right,
I'll
check
I'll,
take
care
of
that.
Yes,
right
now,.
C
Right
away
from
this
portion,
like
of
developing
the
worker
and
all
we'll
go
through
it
the
next
week
only
right,
yeah,.
A
I
mean
I'll
record
installing
auger.