OpenSSF Securing Critical Projects WG, 7 Apr 2022

From YouTube: Securing Critical Projects WG Bi Weekly (April 7, 2022)

A

Hey Eric, off-the-wall question.

B

Are you related to Carolyn Tice? I am not.

C

Okay, good. Yeah, I get that a lot about a number of other people. It's usually the question about the old football coach Mike, but I'm not related to him either.

A

No, thanks. It's just an unusual name. Thanks.

D

Welcome, folks. Oh, I see Jeff Mendoza's on the call.

E

Thanks. Yeah, Amir is out today, so I'll be shepherding this meeting. As usual, we'll start at about five after. If you have any topics, go ahead and drop them into the meeting agenda that's linked to the calendar invite. If you don't have any of that, just ask in chat or ask here.

E

Okay also, as always, it's optional to add your name as an attendee.

D

So I want to re-add to the agenda, and I'm going to put Jacques on the spot, and I'm sorry about that, but I know this is the hot topic. I'd like to continue our discussion about how we determine, update, and replace the critical projects list, combining the data with human information. I know you care about this, and I care about it.

D

You've actually done the research. So, I'm still doing the research. You're still doing the research, okay. So maybe I can beg you to give a 60-second update of what you've learned so far.

A

Briefly: holy, people talk a lot about a lot of things. There are several literatures and they're enormous, so I've barely skimmed the surface. Got it.

D

Okay, I just tried to capture what you just said. I may have done it correctly, but at least I made the attempt. How's that?

A

There's nothing libelous.

D

I was reaching for the higher bar of "accurately captured", but I guess I'll take "not libelous". Let's just start. Oh, are we recording, and shouldn't we? Yeah, we are recording, never mind.

D

I'm going to post the link again. Oh, I don't have edit rights.

A

His name's been added by, I.

D

Think, David. Yes, indeed. Excellent. Okay, and Darcy's also been added. Okay, GitHub, npm. Okay, welcome.

D

Jeff, your show.

E

Yeah, thanks everyone for joining, and thanks for adding to the agenda. As usual, we'll start the meeting with new attendee introductions.

E

If you're new here, haven't joined in a while, or haven't introduced yourself in a previous meeting, feel free to jump in and say hello and anything you'd like about yourself while you're here. All right.

G

Hi, I'm Marina. I'm a PhD student at NYU. I think I know some of you from other working groups and such, but I'm new to this particular one. I've been participating in the Securing Software Repositories working group, the new working group, and one of the things we've discussed there that's relevant to me as an academic is the possibility of creating a repository of data about software repositories and the projects on them. So I thought I'd come here and exchange some ideas.

G

I want to make sure that we're all working together, and to see if this is something that we'd like to collaborate on. So, hi.

D

I'm sorry, you're looking for a repository of data about X? Help me understand.

G

Oh, sorry, I guess I didn't explain that properly. I've been working with maintainers of software repositories, so think of, like, PyPI, npm, RubyGems (I see Jacques here). There we were trying to figure out if there's interest, both from the repository maintainer side and from the OpenSSF and Linux Foundation side, in having a central place to store some data about these repositories.

G

That means the packages on them and the usage patterns, for the purposes of doing both research projects and things like what I understand this working group is doing in identifying critical things, common issues, anything like that that could be useful.

E

So when you ask whether there's interest in having a central place, are you interested in making that a work stream of this working group?

G

That's part of the question: where this would best live. Maybe in this working group, maybe in that other working group. I just want to talk to everybody before we make those kinds of decisions.

A

Should I add it to the agenda?

H

Are you prepared?

E

Yeah, so are you prepared to make kind of a longer proposal?

G

Yeah, I have a document I'll share as well, which has more detail than I could say.

E

Prepared today? Do you want to add it to the agenda today, or do you want to, like, plan? I can.

G

Actually, agenda today. Yeah, I'll add a link to that document.

E

People have time to.

G

Read it now, so it's okay! If folks want to push that off as well. I don't.

E

Yeah, unfortunately, need something.

G

In five minutes, okay.

E

The agenda is pretty short, so I think we have time to get to it.

D

Yeah, and I can point to some things when we get to that point. So, excellent.

E

All right, any other new attendees?

I

I'm a repeat offender. I've been here before, and I've been around OpenSSF for quite a few months now, but I don't regularly attend this call. I'm part of the open technology group at IBM, an open source and stereo specialist, and more recently I started playing with the criticality score tool. I actually submitted a couple of pull requests and I've had some chats on the Slack channel with Caleb. So I was available, I saw the call on the agenda, and I figured I would join and see what's up.

E

Awesome welcome.

J

I think I'm also new. I go by Jono. I've met some of you before in different places. I'm at CERT/CC at Carnegie Mellon, and I'm coming at all of this from a vulnerability management and prioritization perspective.

J

I have led our work on the Stakeholder-Specific Vulnerability Categorization. Criticality seems related to some of the things we would expect to use to prioritize vulns generally, so I'm interested in it from that perspective. I'm also on things like the CVSS SIG and the EPSS SIG at FIRST, for other vuln categorization work, so I'm happy to talk about what's going on there if it's relevant and people need that context. Thanks.

E

Welcome, Jonathan, glad to have you. Anybody else?

B

I think I'm a first timer at this meeting. I could be wrong. I've been lurking in the OpenSSF repo for a little bit.

B

My name is Darcy Clarke. I'm the EM for the npm CLI team as well as the GitHub CLI team, so a unique scope there. I've been working on npm for about three years and can hopefully be a gateway for change and innovation in the ecosystem there. I've also been working closely with the OpenJS Foundation and helped champion the package vulnerability collab space that we kicked off last year, which tangentially, I think, aligns with what's happening in the OpenSSF space. So I want to consolidate efforts.

B

Ideally, I'd also tie some of the work that's being done here back into the npm open source project. So yeah, happy to be here. I would love to contribute in any way possible, get my team involved, and extend an olive branch when there's one that can be extended.

E

Oh, thanks for joining. Welcome.

E

Any other new members?

E

All right, I guess that's about it, so we'll move on to the agenda. Something that we had talked about in the previous meeting that we wanted to do today was discuss the charter. If you click the link, it'll bring you to issue 52, and we're very thankful to CRob for giving us a really good example in the second link on that issue: the charter for the Security Tooling working group.

E

Just scrolling through it, to me it looks like it's not very customized per working group, so it seems like this is essentially something that we would pretty much take as is.

E

Anyway, on to the discussion. Has anybody looked at it? Does anybody have any comments? Has anybody been part of this process in another working group? The floor is open. Oh hey, Jacques, you have your hand up.

A

Yeah, I read through it as part of setting up the Securing Software Repos group. It's fairly boilerplatey. There were three notes that came out of it for me. One is that it sets up some guardrails for if things go wrong. It has a formal process which basically would show up once in a million years if things are going off the rails, so I think that's its real purpose or value.

A

The second thing is that we will need to define a list of maintainers. The process for changing policies is built around these folks who are identified as maintainers of the working group.

A

So at some point you have to bootstrap that list. The third is just a very small one that I have a bug open for on the template repo where it comes from: there's a fragment of text in part 2f which doesn't seem to fit. It talks about sending voting members to the TAC, and that's not a thing that you can do; the TAC is selected.

A

There's no delegating upwards. So those would be my notes.

D

If I may jump in, I think the key is some sort of statement of what this group officially is: what is the scope of this group? I think that's really what they're looking for. As for the boilerplatey thing, there's a TSC that doesn't exist; I think that just needs to be removed.

D

I think the TAC would like to have somebody from the working group showing up, and that could be a TAC member or somebody else. I think the concern right now is that the TAC doesn't necessarily always hear what the working groups are doing, and vice versa. So I think there's a goal for some communication.

D

So it kind of stays in sync.

A

I think the TAC also wants to nail down the formalities. That seems to be a theme for this TAC: getting all the ducks in a row.

D

Yeah, and I think that's reasonable, because this is the first year we actually have funding, as opposed to the previous system. I will tell you that, although things aren't set in stone, I've been interacting a little bit with the folks who are trying to nail that down. I think one of the bigger changes that I would like to see, and I think it's going to happen, is what the TAC would like from the working groups.

D

Kick off projects, but talk to the TAC. Basically, preliminarily you're in, but tell the TAC in case there's an issue, because otherwise it's quite possible for the TAC to never hear about a new project, and that does not help coordination, because right.

E

Now it's entirely.

D

Possible for working groups to do the same thing, and not necessarily for folks to know, unless I happen to be on both calls.

E

That's true. I mean, all of our projects should be very clearly listed on our README. I have tried to.

D

Do that. I've tried to do that every time. I have put in pull requests when you create a new project, but that doesn't mean I've always been successful at it. So absolutely, that is something else they want: every time you kick off a new project, say, "Hey, TAC." "That's okay, probably it's fine, carry on." And then add it to the README page, so that, yeah.

E

So that everybody who shows.

D

Up can find out what it is.

E

Could there be a group in GitHub that's just cc'd on every README update for new projects? Would that be easier than notifying the TAC every time?

D

No, because they want to be notified regardless. If it's a new project, notify them. As far.

E

As we could notify the TAC through GitHub.

D

Sure, make a GitHub issue on the TAC list. That would be fine, I would imagine. I guess I.

J

Know how the TAC is gonna work.

D

It would not be a bad idea, but I guess technically the TAC hasn't. I think right now what I would focus in on is: what do you think this working group is supposed to do? The scope. So kill all the stuff about the TSC, because we don't have one. I think the key is just: what does this group think it's doing? We have some text in the README.

D

I think we just need to.

D

Over and make some tweaks.

E

Should we duplicate the mission and scope between the charter and the README? I.

A

Would do that as a story, so as to avoid drift between them. In the Securing Software Repos group, the way it works is that we have the abbreviated version, the introductory sort of wording, and then we point back to the README, where we have the list of objectives.

D

Yeah, keeping it DRY does make sense. I mean, from the charter you link to it. I guess the only problem would be that you change the README without... but yeah, I think the point, though, is that right now it's not always clear what the working groups think they're doing.

D

I think I think that's the key issue right now is just trying to make sure that uh there's some swim lanes. So we know who does what sort of there's always going to be blurry lines.

E

Okay, so I propose that we take the charter as is, except for updating number one to be, as Jacques recommended, super terse, and pointing to our README, where we intend to have a more descriptive mission and scope. And on the issue that you raised, Jacques, about the boilerplate that looks wrong, I think we'll just wait until we get that resolved upstream and then pull it down.

A

For reference, I'm just going to drop in the link to our PR where we're working on the charter. Okay.

E

I'll put that onto the um issue that would have been a better idea.

E

Oh, is this not.

E

Oh great, I copied the Zoom link into the issue. Anyway, I'll fix that. Okay, so as far as getting this merged: if we make those changes, we should just leave the PR open until the next meeting, and then we'll have a last call for comments.

E

Okay got a thumbs up no yeah.

D

Yeah, can you change the whole technical steering part? We don't have one. So we.

A

Don't need a... I see that mostly as.

A

As part of the mechanism of guardrails. That's why I think the only thing that's really essential is to define your maintainers. It's been a week or two since I read it, so it's a little vague and fuzzy in my memory, but it mostly bootstraps off who the maintainers are.

D

Yeah, I mean, you could just delete the whole thing. We don't have a TSC, so there's no point in having a charter that says we have a TSC. We don't have one.

A

I would take that question up with the TAC, because this is the charter they're handing out as the standard. So yeah, I didn't want to deviate too much from the template. I imagine it's been thrashed out in great detail by Linux Foundation legal.

D

No, I think they copied what CNCF did and pasted it.

D

Well, I didn't realize that so yeah so yeah.

A

Yeah, I think that's.

D

I think it's a perfectly valid comment back up to the TAC. I thought that had been fixed, but I was wrong. So maybe the reply back to the TAC is: nobody has a TSC, so why is this still in here?

I

Yeah, and there was discussion exactly on that point, David. I agree with you, but I don't think there's been any formal resolution, and I suspect the answer is going to be, "Oh, we have a task force working on the governance documents." That will be the answer, which means it's not going to happen for quite a while. All.

D

Right, maybe the real issue to start with is at least to make clear what the working group thinks its charter scope is. What's the mission? What's the scope? If it's just a copy and paste from the README, or even a link to the README, I think that would be the most important part right now.

A

Well, this puts me in an uncomfortable position in terms of the Securing Software Repos group, in the sense that we've been aiming to adopt the charter at our next meeting and then, the following week, bring it to the TAC for a blessing, so that we'd be a grown-up big boy group. That's a terrible analogy. Just grown up. And insofar as I saw the TSC mechanisms, I figured, okay, that looks pretty much like something that springs to life.

A

When people can't agree, you have the maintainers, the maintainers bootstrap the TSC, the TSC can take a vote on contentious questions, and if they can't decide on something with that, then they can bounce it up to the TAC. Yeah.

D

I I I have no objection to a tsc if the group wants to work that way. Just are you going to work that way.

A

Yeah, really the big thing that makes the whole engine turn over is maintainers who are defined as maintainers, because otherwise the boundaries of a working group are very porous. At the moment, Shopify is not even a member of the OpenSSF, and yet here I am.

D

Okay, um I'm sorry.

E

How are other groups listing their maintainers.

D

I think individual projects within a working group have a lead for the most part, and the bigger ones have their own group of folks. I don't think anybody has a TSC in that sense.

D

As I said, I think what happened is they just copy-pasted CNCF: "Hey, we need some stuff, we'll copy-paste." I thought this had been removed, but I'm wrong.

F

Sure. I suspect this is one of the things that the process docs working group that Jory pulled together will get around to. It's meeting tomorrow at 11am Eastern and has had a couple of working meetings. It's formalizing the charter for the working groups, the working group structure.

F

I think it's been very ad hoc, because that's very much how OpenSSF has started and grown, and we didn't want to come in and say, "Here's the one true path." But I think they're working on the one true path.

F

So I think, for now, simpler is better with the charter. If you have a defined membership, I think that's better than not. A defined lead for the working group is better than not. I think simpler for now is probably better, and I would expect the charters will probably get templated and standardized across the working groups over time. Is that helpful?

E

Yeah, it makes me feel like we should um just wait.

I

I was going to suggest you do exactly that. I know it's been confusing, because the TAC on one hand said, "Hey, all the working groups need to get a charter," and then, if you had asked, "Okay, what does it take?", they would say, "Oh wait, we don't really know yet." At least that's what they should have said, because that's the reality, and it's becoming apparent here now. So I agree, just wait a bit.

I

And by the way, I'm involved in the group that Brian was just talking about. We had a meeting yesterday and we're going to meet again tomorrow, and I can tell you there's very good discussion going on, but we're not going to have an answer tomorrow. So it'll take a bit of time.

E

Okay, so I'll update the issue. I mean, we have the notes here, but I'll note on the issue that we're just kind of waiting. It sounds good. The takeaways, I think, are that we need to agree on who the maintainers or the leaders are.

E

I mean, we have the co-chairs. Then there's whether there is any process to be a member, and then the goal, which is, I think, a separate discussion. We'll essentially be referencing the goals that we have on the README, and those can be updated again with a group discussion.

E

Okay, all right.

E

Other notes on the charter.

E

All right, I guess we'll move on. I'll preface the next bullet with a little bit of history for those that aren't in the group.

E

Basically, at the end of last year, we went through a process within the group to identify some 100-ish critical projects. Not the top 100, just 100 critical projects. It was pretty ad hoc, but we did take in a lot of data.

E

Since the beginning of the year, we've been discussing how we're going to come up with the next version of that list.

E

Are we going to be doing a similar process with more and better data and more people (not better people, but more people)? Or are we going to be integrating more input that involves reaching outside the people that show up to this meeting? We've had a lot of good ideas thrown around for processes, or ways that we can incorporate that. I don't know that we have any decisions or anything concrete.

E

Nobody has raised their hand to say we're going to go ahead and move forward with this procedure, but Jacques had a lot of good input, and David's asking Jacques for an update. I don't know.

A

Also summarizing history: there are two things that I covered in my presentation back at, honestly, the end of February or early March. It's out there if you look for "ranking software projects" in these minutes. The two things that I covered were a voting system and a direct elicitation of probabilities, and my recommendation at the time, you could say, was to focus on elicitation over voting mechanisms, for a number of reasons.

A

So I've continued to dive into that area of how you elicit and aggregate expert opinions about probabilities.
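
As a rough illustration of what aggregating elicited probabilities can look like, here is a minimal sketch of a linear opinion pool, which is simply a weighted average of the experts' estimates. No aggregation method is named in the meeting, so the choice of method, the function name, the weights, and the numbers are all assumptions for illustration.

```python
from statistics import fmean


def linear_opinion_pool(probabilities, weights=None):
    """Combine several experts' probabilities for one event by (weighted) averaging.

    Hypothetical helper for illustration; not a method chosen by the working group.
    """
    if not probabilities:
        raise ValueError("need at least one estimate")
    if any(not 0.0 <= p <= 1.0 for p in probabilities):
        raise ValueError("probabilities must lie in [0, 1]")
    if weights is None:
        # Equal weights: the pooled estimate is the plain mean.
        return fmean(probabilities)
    if len(weights) != len(probabilities) or sum(weights) <= 0:
        raise ValueError("weights must match the estimates and sum to a positive value")
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, probabilities)) / total


# Three made-up expert estimates that some project has a serious incident.
estimates = [0.10, 0.30, 0.20]
pooled_equal = linear_opinion_pool(estimates)
# Made-up weights, e.g. reflecting how well each expert says they know the project.
pooled_weighted = linear_opinion_pool(estimates, weights=[3, 1, 0])
print(pooled_equal, pooled_weighted)
```

A simple average is only one option; the literatures mentioned here contain many alternatives, and this sketch does not claim to represent them.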

A

As I was saying earlier in the pre-chat, there are several literatures, which naturally don't correspond with each other, and I've been skimming a few of them to get my head around what's out there and what makes sense.

A

I've also got in the back of my head to take some time and smash out a prototype, just to get something that does anything. It doesn't matter if it's the formally proved approach that somebody wrote in a paper; it's more a prototype of the user interface, of what it would look like. But I did the most important step of any open source project, which is to come up with a name that collides with a bunch of other open source projects.

A

What I'm thinking of is SEER, standing for Security Expert Elicitation of Risk.

K

I commend you for even approaching the idol of open source project naming, much less selecting and proposing a candidate.

A

Well, that's how you know you're serious: you came up with a clever name. And here's the important thing: you'll notice that, as a Rubyist, I didn't come up with a pun, which is the usual sort of style.

D

I thought that was a requirement, but we'll let you get away with it this time.

A

Yeah, I'm sure that people will come up with you know, square and prognosticator and other names to go with it. So give it time.

D

That's all right. The cosign folks have trigonometric puns in their comments, so.

D

All right, okay, so you said two things. One was voting systems versus elicitation of probabilities, and you think the elicitation of probabilities makes more sense, given the literature that you found. What was the second part, though?

A

The second part was that, at some point, I want to find the time to build a prototype of what it would look like for the expert.

A

The point is that they're meant to be shown: here's a project, and here's some information about the project that you have asked to be shown. We can show dozens of data points between Scorecard metrics, CHAOSS metrics, and whatever else we come up with, over and above links to the project's own home pages, source repositories, etc.

A

There are dozens of data points that we could be throwing at people, but one of the things that shows up in the research is that if you give people too much information, their estimates actually start to get worse again.

A

So my thinking is that they should be allowed to configure which things they look at. Anyway, the idea is that they'd have an interface. It would show them some data points. They would then say, "I think it's this likely to go bad, and if it goes bad, it's likely to be this bad."

D

So how does this all scale, because I mean presumably we're going to be looking at thousands of components out of millions of projects.

A

Yeah, that's an open question. There are a lot of problems still to be solved. This is one of the downsides of the elicitation approach: there are different problems from the voting approach, not necessarily better or worse. One of the big problems is deciding whether a probability has been realized.

A

There are scoring systems for events, where something has or has not happened. So if, for example, we define the probability that's being elicited as the probability of a CVE of severity X within the next five years, that's relatively concrete and measurable.

A

Much harder to measure is the impact. What is your estimate of dollars of impact? Anything that comes up as the validation will itself be an estimate.

A

But it's still information we want, because we are trying to retire the maximum amount of risk per dollar spent, which means we need to look at both axes: the frequency with which something happens, and the impact when it does.
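
The "both axes" idea can be sketched as expected loss, that is, probability times impact, divided by what it would cost to address. This is only an illustration: the project names, the numbers, the `fix_cost_usd` field, and the assumption that an intervention retires all of the expected loss are hypothetical and were not discussed in the meeting.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    project: str         # hypothetical project name
    probability: float   # elicited chance of a bad event in the time window
    impact_usd: float    # elicited dollar impact if the event happens
    fix_cost_usd: float  # assumed cost of the intervention (not from the meeting)

    @property
    def expected_loss(self) -> float:
        # Frequency axis times impact axis.
        return self.probability * self.impact_usd

    @property
    def risk_per_dollar(self) -> float:
        # Simplifying assumption: the intervention retires all of the expected loss.
        return self.expected_loss / self.fix_cost_usd


def rank(estimates):
    """Order projects by risk retired per dollar spent, highest first."""
    return sorted(estimates, key=lambda e: e.risk_per_dollar, reverse=True)


portfolio = [
    Estimate("popular-lib", probability=0.05, impact_usd=1_000_000, fix_cost_usd=50_000),
    Estimate("obscure-lib", probability=0.20, impact_usd=2_000_000, fix_cost_usd=100_000),
    Estimate("tiny-tool", probability=0.50, impact_usd=10_000, fix_cost_usd=20_000),
]
ranked = [e.project for e in rank(portfolio)]
print(ranked)
```

In this made-up portfolio the obscure library ranks first even though the small tool has the highest probability, because ranking on one axis alone would miss the size of the impact.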

D

Yeah, I mean, commonality is one way you could start to estimate impact. Obviously it's not really the same measure.

A

Well, it's not everything. This is why we want expert elicitation, right? If there was a direct line between something like downloads from a package repository and worldwide impact, if that was correlated 0.95, that would be lovely. But we know, for example, from the classic example of something that gets downloaded once or a dozen times into a CI pipeline and then shows up in a million IoT devices, that unfortunately that relationship is not one-to-one.

A

This is where you need expert opinions to be elicited, because experts will be able to encode and combine information that is not available directly in the metrics.

A

There is a possibility, in the longer run, that those expert elicitations can form the feedstock for a machine learning model that can suss out relationships that we haven't noticed between the data points that are exposed and the predictions, and that will hopefully give us a bootstrap for the tens of thousands of projects that might not get an opinion given about them by experts.

A

But there's no way around the fact that no matter which method we use to try and rank. There are just a ton of projects and we need a lot of people to give a lot of opinions.

A

That's actually, incidentally, why I was thinking about the impact side: could you gamify it? There are ways of scoring predictions based on outcomes. Once an outcome is realized, you can back-calculate the score, how good somebody was at predicting something. But that would still require us to say the impact was X.
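
One standard way to score probability forecasts after the fact is the Brier score: the mean squared difference between each forecast and the 0/1 outcome, where lower is better. No particular scoring rule is named in the meeting, so treating the Brier score as the rule is an assumption, and the experts and outcomes below are invented.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and realized 0/1 outcomes."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and the same length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Made-up example: 1 means the event (say, a CVE of a given severity) happened.
outcomes = [1, 0, 0, 1]
confident_expert = [0.9, 0.1, 0.2, 0.8]  # leans the right way each time
hedging_expert = [0.5, 0.5, 0.5, 0.5]    # always answers 50/50

confident_score = brier_score(confident_expert, outcomes)
hedging_score = brier_score(hedging_expert, outcomes)
print(confident_score, hedging_score)
```

This only covers yes-or-no events. Scoring a dollar impact estimate would still need an agreed figure for what the impact turned out to be, which is the difficulty raised here.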

J

Yeah, there was an attempt, I don't know if you saw this in the literature, five or six years ago, at making a market, like a futures market, for security advice. They tried to just pay people for getting the right answer and, if I recall correctly, it failed to elicit expert opinions about what they thought was going to happen, even though they were literally proposing to pay people for their information.

A

Well, that's depressing, because that was one of the things that I didn't talk about at the presentation, except in the questions: would a prediction market or a futures market work? The difficulty with prediction markets is what it takes for them to effectively elicit opinions and derive a probability.

A

You need enough people betting enough money. It has to have sufficiently deep liquidity; otherwise it's very lumpy and it just falls back to being individual experts. I have some experience with this, because I've previously received options in a company that was very thinly traded, and it was always an exciting day when things bounced around, or when it took several days to sell your shares, and so on. The same problem, I think, would arise with expert elicitation.

A

If you have a thousand experts trading in such a market, but you have a hundred thousand projects for them to trade on, then most things are just not going to get traded. This is the same problem we have with voting. This is the same problem we have with expert elicitation.

A

It's that we're trying to find what you could call a libnebraska: something that's obscure but, upon closer inspection, turns out to have relatively high potential frequency and relatively high potential magnitude of impact.

A

That's that's the hard problem. Yeah.

D

I do think that there's hope for identifying libnebraska with some data support. You're absolutely right: if you talk to just random people looking at their direct dependencies, you won't notice it. But stuff like the analysis done by Harvard, and some of these other things, I think at least do a decent job of helping people identify them.

D

You no longer have to get people to figure out what they might want to think about and analyze. Here, look at this; now you can use your human judgment, now that you don't have to try to guess the world at random.

A

Yeah and that's that's the thing like you want. One of the things I've also been considering, including is, is whether I want people to self identify their level of um knowledge of a project on a scale like they could say. I've never heard of this in my entire life, I'm just going on the data you gave me up to I'm one of the creators or maintainers of this project, um but I I have two questions. You know sort of things with that is one.

A

Does it add that much value compared to the cost of another thing that has to be elicited every every single data point that gets elicited adds that much more to the overhead of of getting elicitations, because this is going to be tremendously boring and head head thinking, exhausting work uh where you're sort of staring something or or it's going to be. People are just going to click through quickly. Just pick things at random, um there's no way around it uh with with elicitation literature. It mostly focuses on high stakes.

A

That means a small number of predictions. About the largest I've seen is 100 variables elicited over several days from around two dozen experts, whereas we're talking about elicitations on tens of thousands of projects, ideally from thousands of experts, and then, hopefully, building a large enough data set that we can start to come up with guesstimates for the things that have not been estimated, for which there's no elicitation available.

D

That's a great point, and I can't think of any counterexamples to what you've said, Jacques. Does anyone know of such literature? Granted, I haven't done as much research on this as you have.

D

Can anybody suggest somebody we could contact who might know of a different way to turn over that rock in the academic world?

J

I don't have an easy answer, but if the purpose is sort of labeled-data elicitation, which is analogous to other machine learning problems,

J

Then you see, for example, what I understand the Google image folks did when they didn't have a 10-million-size or 100-million-size, or whatever, large set of labeled images.

J

So they paid a thousand people full time for a year to sit in dark rooms and label images, and then they had the data that they wanted, and that worked because it didn't change. So the only solution I've seen to this is to throw 10 billion dollars at it.

D

I think this is very different, because with the imagery stuff you're counting on humans doing something that they can do in a fraction of a second. They're using their fast thinking, not slow thinking.

J

Yeah, so what you're saying is that it's even worse and would cost even more money.

A

Very possibly. That's why I was interested in whether it was gamifiable, whether you could use one of these scorings.

J

Of course, CAPTCHAs are the other thing, right? So, like, can you turn it into a CAPTCHA?

G

I think another challenge here is that it sounds like you need expert opinions, not just anyone's opinion, right? So you have to be a little bit careful with who's getting the request for data, not just...

A

You want a minimum level. That's sort of where I was leaning with the project-familiarity setting.

A

The question is whether you would, out of the box, give more weight to someone who had higher awareness. In theory such an expert should give narrower bands, because they have high confidence in their prediction. On the other hand, research shows that experts walking in off the street are hilariously overconfident, tend to be calibrated very poorly, and will give you very narrow bands no matter what you do, even though the data, if they refer to it, shows them that the bands should be much wider.

A

I sort of talked about this in the presentation. There's this process called calibration, where you give a series of questions to which the answer will be presented afterwards. The experts go through the elicitation process for this series of questions, and the feedback is meant to show them that they are being hilariously overconfident, to get them to widen their confidence bands and to pay attention to whether they're being too pessimistic or too optimistic. And that does show a measurable improvement in performance.

A

There's another school of thought that says you shouldn't do that. Instead, you ask calibration questions to which they do not know the answer, and you do not present the answer to them afterwards; you calculate their calibration scores and use those to weight the expert opinions.
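As a rough illustration of the weighting idea described here, the following is a minimal sketch, not the method from any particular study. It assumes each expert gives 90% intervals on seed questions whose true answers are known to the organizers. The names `Interval`, `hit_rate`, `calibration_weight`, and `pooled_estimate`, and the linear weighting rule, are hypothetical simplifications chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Interval:
    """An expert's stated range for one seed question (hypothetical type)."""
    low: float
    high: float


def hit_rate(intervals: Sequence[Interval], truths: Sequence[float]) -> float:
    """Fraction of seed questions whose true answer fell inside the interval."""
    if not truths or len(intervals) != len(truths):
        raise ValueError("need one interval per seed question")
    hits = sum(1 for iv, t in zip(intervals, truths) if iv.low <= t <= iv.high)
    return hits / len(truths)


def calibration_weight(rate: float, target: float = 0.9) -> float:
    """Assumed rule: weight is 1.0 when the hit rate equals the target and
    falls linearly as the hit rate moves away from it, floored at 0."""
    return max(0.0, 1.0 - abs(rate - target) / target)


def pooled_estimate(estimates: List[float], weights: List[float]) -> float:
    """Weighted average of the experts' point estimates."""
    total = sum(weights)
    if total == 0:
        raise ValueError("all experts received zero weight")
    return sum(e * w for e, w in zip(estimates, weights)) / total
```

Because the weight shrinks gradually instead of dropping to zero, a poorly calibrated expert still contributes a little signal, which matches the preference voiced in this discussion for keeping weak opinions over having gaps.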

A

The thing about this, though, is that in the studies I've seen, a depressing amount of the time what winds up happening is that, out of a group of experts, one or two experts are the only ones listened to, and everyone else gets discarded, which seems very wasteful to me. If we did that, then we'd be throwing out potentially thousands of elicitation results, and we'd just have vast gaps in the data. I would rather have bad...

A

Well, let me qualify that. I would rather have an opinion that could then form the anchor for further investigation than have no opinion. No opinion gives me no signal as to whether I should care whether this is lib-nebraska. If somebody comes in off the street and gives an amazingly high score that shoots it to the top, that attracts attention, which draws more experts, and more experts can then give it a fuller and more detailed estimate from multiple elicitations.

A

I don't know if that answered your question, or if I rambled.

D

I think the relative paucity of data is a significant issue. Maybe we're going about this the wrong way. Maybe it would be better to have some sort of automated algorithm do an estimate and then use humans to propose,

D

You know, adjustments to those results.

H

Every single piece of open source software out there just beacons back and says, "Hey, I've been running at this place," and we just get a count of how often certain bits of software are running.

A

And then, you know, I think that's critical, but that has the same problem of getting 100,000 projects to adopt it, right? I don't know if you're joking, I'm sorry.

H

Yeah that I'm not joking.

D

Because it's not.

H

Practical. Like, nobody's gonna embed beacons in all of open source software, right?

A

If they did, there would be a nuclear explosion at the orange site.

D

Yeah, I mean, Debian does have popcon, although I don't think that's enabled by a lot of folks. I believe that's opt-in data, the popularity contest.

D

And, of course, that only covers system packages of a particular distro. I have used this as a data source for other things, so it actually helps. At least you do have some real data on use, but you don't have that for an ecosystem, only for a system.

A

Well, it's an example of a data point you might surface, or allow people to configure to be surfaced, when they are given a project record. You know, you're showing the project, and here are the data points you've told me that you think are important.

H

All right, so you can start with things that are known quantities. Like, go to every single major repository vendor and say, "Give us the top list of all the projects by download count," right?

H

So, like you know, debian gradle, maven palm, like you know, maven central, you know all those artifact servers that can be a good starting point, um which is not necessarily always public data, but we, you know you ask kindly enough and say you know this is the purpose and you can get that as a starting point. Instead of data.

A

Yes, which is actually a really wonderful segue, considering we have 12 minutes left, to Marina's question about a data warehouse for repository systems.

E

Yeah, I agree, let's move on. But before we do, any ideas on what the next step is here? What do we need more discussion about, and, you know, potentially doing something like this?

A

I'm still doing reading where I can squeeze it in, just to see if there's some breakthrough hidden in somebody's book. More to the point, it's been infuriatingly difficult to get somebody to just give me a formula. There are a lot of partial formulas floating around, and I am not a data scientist to start with, and a lot of these things are based in Bayesian analysis or even more exotic stuff.

A

So that's the thing. But the next step for me, insofar as I can find time, and I know we're all busy, is basically going to be to prototype the user interface for experts and such things. We could even elicit opinions without combining them yet, but it would definitely be worth having something in front of people. There's no better design than working software.

D

Well, I have to admit it depends on how complex it is. I've had a little more success with "here's your prototype, it's a piece of paper." We could write this code, but you answer on paper first; let's try this three times and see if it's actually doable, because there's no point in writing the code if you couldn't get somebody to do it even after you wrote it.

A

Well, the holy writ at Shopify is that Rails makes everything easy.

E

Sounds great. Yeah, let's go ahead and move on to Marina. The floor is yours.

G

All right, hello. So I think I mentioned this a little bit. The idea here is just a proposal for having a data warehouse for software repository data. I think the overall idea is fairly straightforward. The reason I'm here is that I think one of the key benefits of this relates to some of the stuff that you were talking about just now.

G

That is, identifying data about how projects are used, the projects that are stored on these different repositories. I think there are also some other benefits. Specifically, how I came to this project had a lot to do with being able to test the performance of security solutions on real data, to make sure that they actually would work at the scale of these different repositories. And then, of course, the challenging part here is figuring out the engineering of the system.

G

You know, where stuff is going to be stored, who wants to run this whole thing, and what data is available today that can just be uploaded to the system versus what data we would have to work with repository maintainers to obtain. Some of the larger software repositories, specifically RubyGems and PyPI, already have a fair amount of this data available publicly, but it's all in different formats in different places, so this would help centralize that. And then, for other folks,

G

This would help come up with what data would be useful to share and figure out if we can get that from whatever logs they have. So that's the overview. I don't know if there are any particular questions or pieces people want me to go into more detail on here.

G

I think david had a comment.

D

I'd love to have a single DB to query. I tried to add notes for this meeting, because I've actually looked into this myself. So there is an existing system, called libraries.io, and Tidelift is, I would say, well, maintaining it, I guess, in a technical sense. It lives at Tidelift. I'm actually not trying to cast shade on Tidelift.

D

I've contributed patches to libraries.io in the past. Back when I was at IDA, I used it. Harvard used it in their analysis to do dependency analysis. So clearly there's some value to it, because there's something there. But this is my opinion:

D

Maybe it's changed recently, but at least when I've interacted with it, and when Harvard interacted with it, it was not really well maintained. And that actually makes sense, because I think Tidelift originally was thinking that this was going to be critical to their business, and I don't think it really is.

D

Again, if somebody here were from Tidelift, we could get the story straight from the horse's mouth, so I'm not going to try to speak for them. But how's this: I suspect they thought this was going to be central, and it's turned out not to be. If that's not true, then I'd love to hear what the story is. So I think it would be valuable to have something like this.

D

There have been some brief discussions about this between both Harvard, who has been using libraries.io, and LF Research, who would like to do some supply chain research and would like a database to analyze and say, oh, here's what you get.

D

So I think that this does have legs. What might be helpful would be talking with the Tidelift folks, because I don't know what they've found or what they feel about libraries.io. At least historically it was open source, and as far as I know it still is. But unless they've had a recent change of heart, I don't think that it's likely to go anywhere, so they might be willing to,

D

You know, transition some of that stuff, or we could use their code. But I think having conversations so that it's all in one place would be really helpful.

G

Yeah, it's a great idea. I do think that the scope, at least of the full proposal, is maybe a little bit broader, but I think the stuff that they've done is a great starting point: just getting aggregate data about these different packages and these different things.

D

Getting, I'm sorry, what did you say? Getting what data?

G

Aggregate data, just like download counts. I think that's what they focus on mostly, right, download counts?

D

It's the dependency network: A depends on B, which depends on C, which depends on D. That's been their focus, getting that kind of data. You're right, you can get that data, for example, from RubyGems and so on, but then you have to process everybody's different formats. So loading it all in one place, first of all beating on those poor repositories only once, because I hear all the repos have other things to do, and then having a common format, is really helpful.

A

I had a question about that, actually. Maybe this is diving into the weeds, but are you thinking that you would have a processing pipeline on the left side that goes out to whatever is available and then turns it into shape? Or did you have in mind that you'd have an API that repositories would then report to?

G

Yeah, I think there's kind of a short-term and a long-term answer, right? The short-term answer is just collect the data, in whatever form it is, and put it somewhere. In the longer term, I think it would be nice

G

To do the processing on the Linux Foundation side. The goal here is not to give more work to repository maintainers; I think they have plenty of other things to do. But if they can just give whatever format of data they have, we can make an automated processing step, because I think most of the data is somewhat similar. It's in some kind of SQL-like or JSON-like format that we can then put into a standard format.
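The automated processing step described here could look something like the sketch below. Everything in it is an assumption made for illustration: the `DownloadRecord` shape, the JSON field names `name` and `downloads`, and the CSV column names `package` and `count` are hypothetical and are not the actual export formats of RubyGems, PyPI, or any other repository.

```python
import csv
import io
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DownloadRecord:
    """Hypothetical common format for one package's download count."""
    ecosystem: str
    package: str
    downloads: int


def from_json_lines(ecosystem: str, text: str) -> List[DownloadRecord]:
    """Parse one JSON object per line (assumed fields: name, downloads)."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines in the export
        row = json.loads(line)
        records.append(DownloadRecord(ecosystem, row["name"], int(row["downloads"])))
    return records


def from_csv(ecosystem: str, text: str) -> List[DownloadRecord]:
    """Parse a CSV dump, e.g. a table export (assumed columns: package, count)."""
    reader = csv.DictReader(io.StringIO(text))
    return [DownloadRecord(ecosystem, row["package"], int(row["count"])) for row in reader]
```

Keeping one small parser per source and a single shared record type means repository maintainers can hand over whatever they already have, while the conversion work stays on the receiving side.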

G

Download counts are.

H

I'm sorry, Jonathan, are you saying download counts are usually in a database format, or... what part of the data is usually in...

G

Yeah, this is download counts and, basically, more. I think the most complete data set currently available is from RubyGems: they have the complete download logs with anonymized IPs. Basically, the requests that they get, they then make public in an anonymized way.

G

PyPI has something similar, just with no IP address information. So it's not just download counts but download frequencies, where the downloads are coming from, repeat people, that kind of stuff, which might be interesting for various types of analysis.

H

The other one to look at is GitHub, which has their dependency graph data. That's if you go to, like, dependency insights on any repository that's using it. It doesn't work with Gradle, but it works with, like, Maven and stuff like that. I don't know if that information is bulk available, but you can potentially talk to people at GitHub to get it.

G

Yeah, it'd be another great source for this.

D

Yeah, well, I can say specifically that for the analysis Harvard and I did, it was very much "I want to know what is most depended on in various infrastructures."

D

So I need not just the dependencies for a package but the dependencies' dependencies, for all packages, and you can get that with varying levels of effort, and sometimes it's a whole lot of effort.
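To make the "dependencies of dependencies" point concrete, here is a minimal sketch of counting how many packages depend on each package, directly or indirectly. It assumes the dependency network has already been loaded into a plain dictionary mapping each package to its direct dependencies. The function names are hypothetical, and this is not the method used in the Harvard analysis.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set


def transitive_dependencies(graph: Dict[str, Iterable[str]], package: str) -> Set[str]:
    """Everything reachable from `package` by following dependency edges."""
    seen: Set[str] = set()
    queue = deque(graph.get(package, ()))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue  # already visited; also protects against cycles
        seen.add(dep)
        queue.extend(graph.get(dep, ()))
    return seen


def dependent_counts(graph: Dict[str, Iterable[str]]) -> Dict[str, int]:
    """For each package, count how many other packages depend on it,
    directly or through a chain of dependencies."""
    counts: Dict[str, int] = defaultdict(int)
    for package in graph:
        for dep in transitive_dependencies(graph, package):
            if dep != package:  # a cycle should not make a package its own dependent
                counts[dep] += 1
    return dict(counts)
```

This brute-force version walks the graph once per package, which is fine for a small example but would need a smarter approach at the scale of a full package ecosystem.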

D

Okay, Jeff, you're the lead, so you can always speak.

E

Yeah, so, facilitator question: what would you be looking to get out of the OpenSSF or this working group? Resources, discussion, just a home?

G

Yeah, I think a combination of things. I think resources would definitely be helpful. I think it might make sense for the project to officially live in the software repositories working group, just because the maintainers of these repositories are there, and so that way you can use that communication. But I'd just like some kind of collaboration, maybe resources, and also just, you know, letting you know this data might be available,

G

If it's going to be useful here.

E

Great. Well, I think we're out of time for this meeting, but it sounds like you wanted to do like a round and see maybe where it has the best home. I think it lines up with the goal of this group well. So I would say, before we do like an official,

E

You know, vote or consensus, we'll just put it on the agenda for the full two weeks and then do, like, not a vote, but a discussion in the next meeting, or whichever meeting you'd like, that's not right at the end.

E

Does that sound good to everyone?

H

One of the other places: I think error code has the data. It may have some data too, because they do their state of open source every year.

D

Yeah, Marina, can you shoot me a quick email? Maybe I can introduce you to some other folks. But I can at least confirm that Harvard used this kind of data, and we used this kind of data, so there are users of this kind of data, which I think was at least your initial question.

G

Yeah, I think I also have a couple of use cases for the data, but I want to make sure that it's broadly useful before presenting the idea. So yeah.

D

Okay, and maybe I can connect you with some LF Research and Harvard folks to continue that conversation.

G

All right, thanks. Looking forward to continuing this in a couple of weeks too.

E

Thanks, Marina. Thanks, everyone else, for joining. Great to see you all, have a good two weeks, and see you next time.