From YouTube: IPFS Weekly Call 2019-03-18 🙌🏽📞
Description
Newsletter: https://tinyletter.com/ipfsnewsletter
A: All right, everyone, howdy hello and welcome to the IPFS weekly call, where we get to learn about the great things that are happening in our community and the stuff that we're building today. Let's see, so before we begin: if everyone can just fill out the IPFS Weekly Call attendance list, so if you're attending the call, just put your handle in under the attendees section. We don't have any announcements for today, so we'll begin the main presentation. But before we begin I want to thank Ollie for taking notes. Thank you, Ollie. Today's main presenter is Michael Rogers, and he's going to talk about GitHub ecosystem metrics, the project he's been working on for the past several weeks. So Michael, please take it away.
B: Cool. I'm sharing the broadcast now, let me know when you can see it. Cool, okay. Awesome, awesome! Okay, let's go to the content. So this is going to be a little bit informal anyway.
B: So the system that we built uses GH Archive. Basically, every hour they put out a file, and it looks like this. It's an hour of public data: metadata for every single action that anyone takes across all of GitHub's public resources. So any push, people watching a repo, people commenting on things, anything like that. There are only a couple of things that it doesn't really pick up.
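To make that concrete, here's a minimal sketch of pulling down and walking one hour of GH Archive data. The data.gharchive.org URL scheme is GH Archive's documented one; the particular hour and the fields printed are just illustrative:

```python
# Read one hour of GH Archive events: a gzipped file of newline-delimited JSON.
import gzip
import json
import urllib.request

url = "https://data.gharchive.org/2019-03-18-15.json.gz"  # one hour of public events

with urllib.request.urlopen(url) as resp:
    with gzip.open(resp, "rt", encoding="utf-8") as lines:
        for line in lines:
            event = json.loads(line)
            # Every public action shows up here: pushes, watches, comments, ...
            print(event["type"], event["repo"]["name"])
```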
B: If you, you know, emoji-thumbs-up something, it doesn't pick that up. But it's a very large amount of data: if you go back maybe two or three years, you end up with a couple of terabytes' worth of data. But because of rate limiting and a bunch of other stuff, this is really the only effective way to look at entire ecosystems across GitHub.
B: And then if we want to look at a repo set across all of GitHub Archive, we might use BigQuery or something like that. With BigQuery we could actually use a snapshot that they update, I think monthly, of every single file in every master branch across GitHub. That would allow us to say, like, what does the Docker ecosystem look like? What are all the repos that have a Dockerfile in them? Things like that. So that can be really useful information.
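That snapshot is Google's public `bigquery-public-data.github_repos` dataset, so the Dockerfile question looks roughly like the sketch below. This is not the project's actual query, just the shape of it; BigQuery bills by bytes scanned, which is where the dollar figures that follow come from:

```python
# Find every repo in the GitHub snapshot whose default branch has a Dockerfile.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT DISTINCT repo_name
    FROM `bigquery-public-data.github_repos.files`
    WHERE path = 'Dockerfile' OR path LIKE '%/Dockerfile'
"""

for row in client.query(query):  # iterating waits for and streams the results
    print(row.repo_name)
```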
B: BigQuery is quite expensive, though, so we really need to limit those queries. That query for the Dockerfiles is probably like $15 to run once. So we identify repos in one step, and then we basically want to get a filtered set of data, and I'm going to explain how that system works a bit here. We use Lambda, and what we do is ask Lambda for a month of data.
B: That fans out 24 requests per day to another Lambda function called, no sorry, it's not 'filter', it's called 'pluck'. And what pluck is going to do is first check: do we have that GH Archive file for that hour in S3? If not, we go and get it, and then once we put it in S3, we use this thing called S3 Select.
B: S3 Select actually allows you to do SQL queries on either CSV or JSON data, and that data can even be lines of JSON in a gzip file. So it actually works perfectly for GH Archive files, except that every couple of months some people do a gigantic push to GitHub with a lot of file updates, and even the metadata about that push is larger than one meg.
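A sketch of what such an S3 Select call looks like in Python, assuming the archive hour has been mirrored into a bucket; the bucket, key, and selected attributes are illustrative. The one-meg problem above matches S3 Select's documented 1 MB per-record limit, which is what forces the fallback described next:

```python
# Run a SQL projection over one gzipped hour of JSON-lines events in S3.
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="gharchive-mirror",        # hypothetical mirror bucket
    Key="2019-03-18-15.json.gz",
    ExpressionType="SQL",
    Expression="SELECT s.type, s.repo.name, s.actor.login FROM S3Object s",
    InputSerialization={"CompressionType": "GZIP", "JSON": {"Type": "LINES"}},
    OutputSerialization={"JSON": {}},
)

for msg in resp["Payload"]:           # an event stream of Records/Stats messages
    if "Records" in msg:
        print(msg["Records"]["Payload"].decode("utf-8"), end="")
```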
B: So there's a fallback, pluck-fallback, that just does the exact same operation. And with both of these, what they're going to do is return just the attributes that we want out of the objects, back to filter day. Okay, so coming back to this initial piece here: when we call filter month, we also give it what we call a filter, and that's basically an encoded sieve object, and we store it in S3.
B: We store it in S3 so we don't have to pass it to the Lambda functions, because these objects can actually get too big to be passed around between the Lambda functions a lot of the time. That sieve object has not only the pluck values in it, but also any repositories that we want to filter on. So what filter day will do, after it gets this plucked set back, is filter out all of the information, keeping just the repositories that it cares about, and then filter day stores that in S3.
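The store-the-sieve-in-S3 trick is worth spelling out: synchronous Lambda invocations cap the request payload at 6 MB, so a filter naming tens of thousands of repos can't ride along in the event itself. A sketch of the pattern, with every name here (bucket, function, fields) illustrative rather than taken from the real system:

```python
# Park the sieve in S3 and hand the Lambda a small key instead of a big payload.
import json
import uuid
import boto3

s3 = boto3.client("s3")
lam = boto3.client("lambda")

sieve = {
    "pluck": ["type", "repo.name", "actor.login"],  # attributes to keep
    "repos": ["ipfs/go-ipfs", "ipfs/js-ipfs"],      # repositories to filter on
}

key = f"sieves/{uuid.uuid4()}.json"
s3.put_object(Bucket="metrics-scratch", Key=key, Body=json.dumps(sieve))

lam.invoke(
    FunctionName="filter-month",                    # hypothetical entry point
    InvocationType="Event",                         # async, fire-and-forget
    Payload=json.dumps({"month": "2019-02", "sieveKey": key}),
)
```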
B: Then another Lambda function called scan-cat concatenates all those files together and stores the new value in S3. So when we go and get a year's worth of data, we basically just do a month at a time. A month is going to generate between 700 and 1400 Lambda functions, depending on caching, and those are all going to run in parallel. If it has to hit pluck-fallback it'll take a little bit longer, because the whole set is going to go as slow as the slowest request.
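The fan-out shape is simple: one pluck per archive hour, all in flight at once, with the batch gated on the slowest hour. A rough sketch with a stub standing in for the pluck Lambda; the counts line up with the roughly 700 invocations per month mentioned above:

```python
# Fan out one "pluck" per hour of the month and run them all in parallel.
import calendar
from concurrent.futures import ThreadPoolExecutor

def pluck_hour(year, month, day, hour):
    # Stand-in for invoking the pluck Lambda for one GH Archive hour.
    return f"{year}-{month:02d}-{day:02d}-{hour}"

def filter_month(year, month):
    days = calendar.monthrange(year, month)[1]
    hours = [(year, month, d, h) for d in range(1, days + 1) for h in range(24)]
    with ThreadPoolExecutor(max_workers=len(hours)) as pool:
        # The whole batch finishes when the slowest hour does.
        return list(pool.map(lambda a: pluck_hour(*a), hours))

results = filter_month(2019, 2)  # 28 days * 24 hours = 672 pluck calls
```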
B: But, you know, in less than about 10 seconds we'll get back an entire month's worth of data, and then we do a month at a time. I would love to pull a year at a time. We got the Lambda rate limit increased, but then, as soon as the rate limit was increased for Lambda, we noticed that there's also an S3 rate limit. So then we were blowing out the S3 rate limit once we started to actually use our new Lambda limit, so we're trying to figure that out.
B: Hopefully we'll eventually just be able to do a year at a time, and then we could pull a year in about ten seconds. But as it stands, you know, we can get three years' worth of data in less than ten minutes, so it's pretty fast. The nice thing about this system is that because we're taking a sieve object here, we're not limited in how many repos we can query for. We can literally filter for a hundred thousand repositories.
B: It would take a little bit longer, because each of those functions is going to have to decode this giant sieve object out of S3, but we can do really, really huge sets of data. That's one of the reasons why we couldn't use some of the off-the-shelf stuff like BigQuery. BigQuery also has all this activity data in it, but it's cost prohibitive.
B: There's also a limit on the size of a single query that you can do. So, given that it's a single query, we can't jam, you know, a hundred thousand repos in there. We'd have to chunk through them, and then it would be insanely expensive. So yeah, this is basically how it works: it allows us to pull in a very reduced data set of just the values that we care about, for just the repositories that we care about.
B: So what ends up happening, now that we have this system, is that as long as we can turn an ecosystem into a set of repositories, we can then filter out datasets for those, and then it's just a matter of actually processing the metrics. That's the stage that we're at now: trying to figure out what useful data we can get from this. So you can do things like getting the unique people that are engaged in different kinds of activity, or in all activity.
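For instance, here's a sketch of that unique-people metric over the filtered output, assuming the JSON-lines layout from the sketches above and GH Archive's field names; the input file name is hypothetical:

```python
# Count distinct actors overall and per event type in a filtered event file.
import json
from collections import defaultdict

def engagement(path):
    all_actors = set()
    by_type = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            login = event["actor"]["login"]
            all_actors.add(login)
            by_type[event["type"]].add(login)
    return len(all_actors), {t: len(s) for t, s in by_type.items()}

total, per_type = engagement("ipfs-2019-02.json")   # hypothetical filtered output
print(total, per_type.get("IssueCommentEvent", 0))
```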
B: Who are all the people that engaged in any way? If they're even in, you know, issue comments and stuff like that, you can assume that they're, like, users of the system. You can also just look at the overall level of activity: is it coming down, or is it still, you know, rising in terms of overall activity? And then, most importantly, you can take all of that data and start to look at, well, what is the growth rate of that ecosystem?
B: How much are activity and unique users growing, over whatever time slices you want to look at?
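A sketch of that growth-over-time-slices idea, bucketing the same filtered events by month; taking raw event count as "activity" is an assumption here, not the project's definition:

```python
# Month-over-month growth rate of activity in a filtered event file.
import json
from collections import Counter

def monthly_growth(path):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            counts[event["created_at"][:7]] += 1   # "2019-02" from the ISO timestamp
    months = sorted(counts)
    return {
        m2: (counts[m2] - counts[m1]) / counts[m1]
        for m1, m2 in zip(months, months[1:])
    }
```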
B: So yeah, that's what we're doing with the system now. The main reason that I wanted to show people the system and how it works is that it's a little bit different from traditional metrics systems, where you log the metrics in some database and then you can do queries on it.
B: We actually have to, you know, go and filter this giant data set from GitHub and then get interesting metrics out of it from there. But with it we can look at not just our own ecosystems, so IPFS and libp2p and all these other systems, but also ecosystems that we're interested in maybe getting involved in.
B: So, with this work we can say, okay, what is the growth of Dockerfiles compared to package.json files, and how many people are engaged in those? One of the interesting things here is that when you look at different ecosystems, the number of packages in a package manager may be growing, or just the number of Dockerfiles may be growing in different repositories, but there may be really big differences in the number of users of those ecosystems.
B: When I did this analysis four or five years ago, using much worse tools, just separating back-end JavaScript from front-end JavaScript, there was a noticeable difference in the number of people engaged in those front-end projects. So you could sort of correlate that to there being more people using those, that there are more users of those projects than of the back-end projects. So this gives us a really interesting sort of competitive analysis between different open source ecosystems as well. So yeah, I think at this point I can open it up for questions.
B: So, if you go into your GitHub repo, there's a dependents tab, and you can see all the repositories that depend on your repository. There are a lot more repos there than in the libraries.io data, probably because libraries.io is doing a really complicated operation, where they're trying to look at the repo data that's in the package.json and then correlate it back through the dependency graph. But not everybody has that metadata up to date, whereas GitHub, you know, has all of the repos themselves.
B: So they know if a package.json depends on that package name somewhere; they can even pull in, you know, repositories that are depending on it that aren't published in any way. So that's really useful. But there's a problem with that data from GitHub.
B: The problem is that there's no API, so I've been poking them to try to get them to give me that data in some usable way without me literally writing a scraper for the website. And yeah, so I've looked at sort of the data for the systems in our ecosystem. One interesting thing is that so far, the projects that are in a similar space to us have very similar sorts of curves in their growth. So it looks like, you know, the space that we're in, the decentralization space, is growing.
B: It's growing at a particular rate, and we're all growing in it really well. So that was one interesting kind of takeaway, but I'm sort of waiting to get more data and to refine our repo identification before I make a lot of other judgements about it. Also, I really want to look at more mature ecosystems and see where their growth hit particular spikes and how the different curves went, and see if we can find correlations between them. Then we can look at our own growth.
B: And we can say, okay, what phase of adoption are we in? Which phase of maturity are we in? We don't have great stuff on that yet; we literally just got the system sort of working, and every time that it works, we make some improvement to make it a little bit faster, and then that blows out a new rate limit that we didn't know about somewhere.
B: And then we hunt that down and add a new layer of caching, and that blows out some new rate limit, because things are faster. So if you look at the repository, there's just a lot of churn in the code and the methods that we're using, to get to something relatively stable. Now, essentially, at the filter month phase there's effectively a rate limiter inside of it that tries to estimate the potential number of Lambda functions.
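A sketch of what a guard like that might look like: estimate how many invocations a request implies, then gate them with a semaphore so a burst stays inside a concurrency budget. The budget value and the double-for-cache-misses heuristic (which happens to match the 700-to-1400-per-month spread mentioned earlier) are illustrative, not the real system's logic:

```python
# Estimate the fan-out for a month and throttle invocations to a fixed budget.
import calendar
import threading

MAX_IN_FLIGHT = 500                        # assumed safe concurrency budget
gate = threading.Semaphore(MAX_IN_FLIGHT)

def estimate_invocations(year, month, cached):
    hours = calendar.monthrange(year, month)[1] * 24
    # On a cold cache each hour may also hit pluck-fallback, roughly doubling calls.
    return hours if cached else hours * 2

def invoke_throttled(fn, *args):
    with gate:                             # blocks once MAX_IN_FLIGHT are running
        return fn(*args)
```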
A
This
question
is
very
similar
phone
Jonny
Quest,
so
many
insight
so
far.
B: Yeah, I think I kind of covered that, or covered as much as there is so far. My main insight is that Lambda is, like, super powerful and can do this kind of stuff really fast. If you were running this locally on your own machine, it would take like a month; with Lambda you can do it a lot quicker. However, there are really, really nasty, fairly undocumented rate limits around everything. And also, this is not a common use case in the Lambda world: bursting from zero to thousands of functions and back down to zero is not that common, so we're hitting a lot of, you know, weird things with that.
A: Next question, from Ollie: have you got any exciting headline stats out of it so far, any surprises? And I guess that would be "not yet", because you're still in the early phases of analyzing the data, right?
B: Oh yeah. So right now I'm trying to just provide the data to the projects, and then I'll let the projects decide what kind of top-line numbers they want to pull out of that. I will just generally caution against the top-line numbers: all of these are estimates, right? Not everyone, not all of our users, engages in a public repository, you know.
B: We also just had a sort of quick metrics sprint where we were just trying to get a bunch of data out of a few different projects, so that gave us some basic info on the packages in the orgs for libp2p, IPFS and IPLD. We were able to see, like, okay, what is the growth rate right now in just the activity on our own projects, stuff like that. But yeah, as for the whole across-the-organization metrics and KPI situation:
B: Right now it's just "log data": figure out how we get data. Then, once we have a lot of data, we can start to talk about the best ways to use it and analyze it. So yeah, a lot of what this presentation is meant to do is to open up the possibility of questions that you can ask about this data, because now we can capture this entire ecosystem's data. So please feed these kinds of questions back into the process.
A: The next comment is from Jim Pitts. He said: I'd love to see some cohort analysis, to see how many people are contributing and then churning out.
B: So we would use BigQuery for that. BigQuery can look at the contents of all of the Dockerfiles on GitHub, and then from that we would get a set of repos. So if we wanted to have some kind of matching inside of the Dockerfile, then yeah, that would tell us: if a Dockerfile has this thing in it, tell me about it, tell me the repo name, tell me the file, whatever.
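That content-matching step would be the snapshot's files table joined to its contents table. A sketch against the real public tables, with the match string purely illustrative:

```python
# Find repos whose Dockerfile contains a given string, via the public snapshot.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT DISTINCT f.repo_name
    FROM `bigquery-public-data.github_repos.files` AS f
    JOIN `bigquery-public-data.github_repos.contents` AS c
      ON f.id = c.id
    WHERE (f.path = 'Dockerfile' OR f.path LIKE '%/Dockerfile')
      AND c.content LIKE '%ipfs%'
"""

repos = [row.repo_name for row in client.query(query)]  # feed these into the sieve
```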
B: And then we could do a filter analysis across all of those repos in the archive. But the system itself doesn't have the contents of all the repositories in it. BigQuery is quite good at that, and replacing it would be pretty massive and not a huge win for us, because the types of queries that we need to do on the actual file contents right now are pretty minimal.
B: Like, you know, we usually just need to see that something matches. So we can, you know, pay fifteen or twenty dollars or whatever per query, get that data, have the set of repositories, and then have this much more cost-effective system to look at all the meta-analysis about those repositories.
C: So what I was thinking there was, rather than just a repo that has a match on some content in a file: are there kind of clusters of repositories that all, say, use these five packages, and what is then the crossover with other groups of packages that would often be used together, to kind of highlight collections of similar or shared bits of code? So, like, IPFS libraries always being required along with these kinds of things, probably libp2p-related stuff, for example.
C: But also, can we highlight similar kinds of groups of dependencies that people are using with IPFS, to get an idea of where there might be almost like an Amazon recommendation for stuff that is closely related to a given set of packages? I'm really mostly interested in IPFS packages there, particularly.
B: Then you would just look at their package.json and start to basically log all of the packages and see what the top ones are that were required alongside, and then you could use this system to do an analysis of those repositories and find the people that are the most active as well. So you can start to, like, include the IPFS...
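The first half of that, the co-required packages, is easy to sketch: walk a pile of package.json files, record what appears alongside a target package, and rank it. The directory layout and the target name here are illustrative:

```python
# Rank the packages most often required alongside a target package.
import json
from collections import Counter
from pathlib import Path

def co_occurring(package_json_dir, target="ipfs"):
    counts = Counter()
    for path in Path(package_json_dir).glob("*/package.json"):
        pkg = json.loads(path.read_text(encoding="utf-8"))
        deps = set(pkg.get("dependencies", {})) | set(pkg.get("devDependencies", {}))
        if target in deps:
            counts.update(deps - {target})   # everything required alongside it
    return counts

print(co_occurring("repos").most_common(10))  # top 10 packages used with ipfs
```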
A: We'll see everyone next week at next Monday's IPFS weekly call. Have a great week, take care, bye.