From YouTube: 08 - Deep Learning Reproducibility - Jessica Forde
Description
Deep Learning for Science School 2019 - Lawrence Berkeley National Lab
Agenda and talk slides are available at: https://dl4sci-school.lbl.gov/agenda
So first up we have Jessica Forde. Jessica is currently at Facebook AI Research, and she's also a member of Project Jupyter. Her research focuses on the intersection between machine learning and open science, which is very relevant to what we're talking about in this school on Jupyter. She has built open-source tools for reproducibility, a very big open challenge in any kind of applied machine learning or data analytics. She has worked as a machine learning researcher and data scientist in a variety of applications, including healthcare, energy, and human capital.
I'm not necessarily committed to going through these slides in this exact format. This is a small enough group that I'm totally open to you asking me questions, so please raise your hand; we can go backwards and forwards. This is a talk I've given in the past, and I want it to serve your questions and needs.

We can keep this as casual as we want it to be. More broadly, I think reproducibility is a really important question in the sciences in general. There's a lot of really interesting work that has been done in the other sciences and in open source on how we do reproducibility, and machine learning has been thinking about this as well; my work spans both.

This is some of the work that I've been doing with colleagues at other institutions such as Harvard, at Facebook AI Research, and in Project Jupyter, and this is a fairly general discussion of the area. So please, honestly, raise your hand, stop me, and ask questions. It's not too technical a talk, so it's totally up for discussion. I didn't set my machine to not freeze, so I apologize in advance.
One of the challenges we face in reproducibility in research starts with a simple question: how do we know what the state of the art is? State of the art is particularly important in machine learning research, because it is generally how we establish new baselines and new methods of particular novelty. It sounds like a pretty straightforward question, but it's a little more complicated, as you will see, in terms of how I think about state of the art.

There are tools for establishing the state of the art on various benchmark problems, often associated with particular data sets, and I generally think this is a pretty good resource. Here is one example: many people are familiar with ImageNet as a data set, but it was also a challenge problem, a really important challenge that established a lot of new developments in image classification. Who has worked with ImageNet data? Okay, a few people; definitely look into it.

I think it's a really important resource for the fundamental problems in computer vision classification. This is another example, from colleagues of mine at Stanford: DAWNBench is a benchmarking library that they work on. CodaLab is a tool for reproducible research and for competitions; a lot of NLP work is being done in CodaLab, and it's maintained by people at Microsoft. So it sounds pretty straightforward, and this is typically what people do.
They put up a competition and a standard data set, and the task should seem relatively straightforward. But if you actually look into the machine learning literature, there are a lot of really interesting papers that have come out examining this, in RL, on GANs, and elsewhere. Are GANs Created Equal? is one that focused on GANs; Attention Is All You Need is an NLP paper; On the State of the Art of Evaluation in Neural Language Models, which compares methods via an independent reimplementation, is another language paper; and the Deep Bayesian Bandits Showdown is another paper, on bandits.
Additionally, there are problems in terms of generalization in machine learning. When we think about generalization, we often think about how the model performs after the fact, on additional data that might be collected later or drawn from a different data set. This is a paper from people down the road at UC Berkeley looking at CIFAR-10 and ImageNet data; they take models pre-trained on ImageNet or CIFAR-10.

These are standard data sets in the machine learning image classification literature. The authors collect new data, trying to keep the collection methodology as faithful as possible to the originals, and they find that the performance isn't the same: the accuracy ends up degrading, and we don't get the performance we would expect on these classes when the data set is collected at a different time. This suggests that learning isn't necessarily as generalizable as we hope.

This is a paper in the reinforcement learning literature about how many seeds it takes to establish a result; this is one where they changed the domain and added additional noise. Generalization doesn't necessarily work when you change the actual input to your reinforcement learning model: the models aren't robust to these kinds of interference.

This is another paper in the reinforcement learning literature, in which the dynamics of the stimuli are changed. In this case, if you change how the dynamics of the game of Amidar work, such that the simulator changes in some way, the agent doesn't necessarily retain the ability to play Amidar as well. So these questions around generalization have real-world impact.

We will be hearing from Emily in an hour or so, and she will discuss this in more detail, but this is some research from colleagues of mine at Facebook, and these are images of athletes. As you can see, you can kind of guess which sports they play: the people on the left are soccer players, the people in the middle are hockey players, I think the people on the right are weightlifters, and you have some boxers and again some soccer players. But the predicted classes are wrong, right?
This is a really well-cited study of commercially available implementations of facial recognition, the Gender Shades paper; I strongly recommend it. The study looks at how commercially available image recognition software performs on people relative to their gender and skin color, and finds that women of color are often misclassified at higher rates than the rest of the faces in a dataset. And this is a paper with colleagues of mine on hospital data.

Should we be using these black-box deployments, given their challenges, in situations in which the decisions really do matter? Well, this is the interesting case. At least in the fairness literature, people have argued that part of this problem is even just how it is measured: when people think only about accuracy, it belies the distinctions in the distribution. In these examples you might say the performance is pretty good, and we aren't necessarily inspecting the errors, but the errors are not uniformly distributed. They are associated with certain cases, and that has certain implications. So even if there is better-than-human performance in a certain domain, it doesn't necessarily imply the performance is as good as a human in general; superhuman performance overall might not imply superhuman performance in particular cases. Does that make sense? For example, there's recent work on various games, and reinforcement learning is a really interesting domain here.
What I'm trying to argue is that part of the problem is how we're framing evaluation. When we say that for a particular classification task this is the accuracy of people, and we are exceeding the performance of people, that doesn't necessarily mean the result will hold consistently.

So even if you achieve superhuman performance on a given data set, there is evidence suggesting that this will not hold consistently for images of that type collected in general. For example, you might get really good performance on digit classification, but that performance does not necessarily hold when you introduce new images of handwritten digits that were not collected at the same time. Slight distribution shifts do not necessarily preserve superhuman performance.
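As a minimal, hypothetical sketch of that effect (my own illustration, not the experiment from the paper, and assuming scikit-learn and SciPy are installed), here is how even a tiny artificial shift can erode the test accuracy of a digit classifier:

```python
# A small sketch of distribution shift: train a digit classifier, then evaluate it
# on test images shifted by a single pixel, a crude stand-in for "digits
# collected at a different time".
import numpy as np
from scipy.ndimage import shift
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("accuracy on the original test split:", clf.score(X_test, y_test))

# Shift every 8x8 test image one pixel to the right and re-evaluate.
X_shifted = np.stack([shift(img.reshape(8, 8), (0, 1)).ravel() for img in X_test])
print("accuracy after a one-pixel shift:   ", clf.score(X_shifted, y_test))
```

In a quick run of a sketch like this, the second number typically drops noticeably, even though a human would read the shifted digits just as easily.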
B
No
I'm,
saying
that
some
researchers
have
argued
that
this
is
that
the
that
this
might
that
we
might
not
even
be
making
the
right
decision
in
deploying
back
box
models
for
high-stakes
decisions.
That's
what
this
author
is
arguing:
okay,
any
more
questions
law
in
here;
okay,
so
some
responses
from
the
scientific
community
and
machine
learning.
This
is
who
has
seen
this
talk
from
nerves.
B
It's
a
very
good
talk,
so
this
is
the
test
of
time.
This
is
a
text
of
a
test
of
time.
Talk
from
neuropathy
2017,
a
really
great
talk
by
Ali
Rahimi
for
a
paper
that
him
and
Ben
Rexroat
he's
been
America's.
Also,
the
author
of
this
talk
really
great
talk
strongly
recommend
it,
which
reminded
the
community
of
the
importance
of
experimental
rigor
and
in
rigor
in
terms
of
theory,
rigor
and
evaluation.
In response to this, Yann LeCun typically argues that theory follows invention, and so it might not necessarily be a concern that we don't have the theory, because in many scientific domains the theoretical understanding of the phenomena we observe is documented after the fact. This is an example, Boyle's air pump, and it's the kind of thing that Yann cites.

But I think, more fundamentally, at least for machine learning research, the question for machine learning researchers and for people implementing machine learning models is ultimately: are we scientists, or are we engineers? Are we fundamentally interested in building something and getting it to work, or in building something in order to create greater understanding? In terms of that question, this has been debated for a while.

This is a quote from the machine learning literature from 1988, and I think it really suggests that people have been thinking about this problem for a while. I think that, fundamentally, we really need to understand, when we create a machine learning model, how it operates and why the behavior of these models is what it is. Interestingly enough, and again, I think we should all watch that talk because it's really useful, Boyle also happened to be an alchemist.

So when we talk about reproducibility, one of the things people often cite is Boyle's air pump, and Boyle also happened to be an alchemist. But when trying to square this problem, I think that regardless of whether we focus on the engineering side or the science side, ultimately, when we are doing science with computation, and you all probably know this better than I do, we are computational scientists.

We are scientists who ultimately use computation to try to understand the problem we're trying to solve, and the computation serves the broader need of trying to gain knowledge. Ultimately, then, as all of you are, we are all scientists who write software. So let me dig into this idea a little more deeply than the machine learning community usually does.
In that case, if we're thinking about reproducibility in the sciences from a computational science perspective, then reproducibility is ultimately a software problem. There are various definitions of reproducibility; people talk about reproducibility versus replicability versus repeatability. I am generally not one of those people who wants to be very specific about the definitions; I want to focus more on what the problems are, in the fields in general, that we need to deal with.

So again, these definitions of reproducibility can vary. In general, the problem I'm really interested in is going from the system specs, your data, algorithm, hyperparameters, and analysis, to consumers of the results of one or more models. You have this sort of pipeline, and you want to have reasonable confidence that the pipeline works consistently and that the description of the pipeline can be repeated. As I mentioned earlier, reaching consistent conclusions is particularly important in this question of reproducibility as I've framed it in terms of software.

If we're going to do that, then we need to think about the consistency and precision of the setup needed to get to those conclusions, because fundamentally that is our problem, and these sorts of system specifications are the ones I'm going to be talking about.

If we think about what a reproducible pipeline is, and about the kind of software that you've written for your problem and your analysis, you might have something along these lines: the data that you've collected that describes the scientific question you want to answer, your dependencies, your hyperparameters, the scripts to run your jobs, your analysis code, and the documentation to explain what you did.

Oftentimes this documentation comes in the paper, or in a README or something like that. But if anyone has actually dealt with these sorts of problems in real life, and I love this xkcd comic, it's a lot more complicated when you actually have to start working with other people's code. So who's actually had the problem where they tried to run someone else's code and got tied up just getting things started?
B
Okay,
great
so
this
is
this
is
this
is
like
a
fundamental
problem
that
I
think
is
particularly
important
and
and
we're
going
to
discuss
more
on
so
so
I
think
dependencies.
There
actually
is
an
under
sundar
emphasised
problem
in
the
reproducibility
literature
and
is
the
one
I'm
going
to
focus
on
so
nurbs
2017
had
released
code
links
in
their
schedule,
who's
familiar
with
Europe's,
okay
Europe's
is
like
a
mainstream
machine
learning
conference.
B
It's
where
people
publish
fundamental
work
in
machine
learning
and
at
their
conference
they
included
links
to
their
papers
and
included
links
to
where
the
authors
said
they
would
have
put
code
in
their
camera
ready,
so
I
just
crawled.
All
of
that
and
so
2017
everyone
shared
their
paper.
We've
got
a
cat
who
got
a
presentation
and.
B
Little
more
care
of
their
code,
but
most
of
those
people
shared
their
code
on
github
and
within
the
sciences,
at
least
this.
This
number
is
about
the
same
so
within
this
is
a
study
from
Victoria,
Stodden
and
her
colleagues,
and
that
in
of
180
papers
in
the
journal,
science,
36.1%
provided
data
and
code
when
contacted
this
is
a
study
in
which
they
were
contacting
authors
to
request
code
and
about
36
of
those
respondents
ended
up
providing
code.
B
Similarly,
in
the
machine,
learning
literature
only
about
seven
source
actions,
and
so
again
going
back
to
this,
at
least
within
the
machine
learning
literature
people
who
are
producing
these
algorithms,
that
you're
using
for
your
research
or
a
baby
you're
building
on
pond
to
do
new
novel
algorithms
of
the
people
who
do
share
code.
The
overwhelming
majority
of
them
are
sharing
this
code
on
github
who's
using
github
right
now
like
publish
their
research
using
other
people's
code,
okay,
yeah,
so
I
mean
I,
mean
you
hear
all
evidence
of
this
as
well.
B
Right
github
is
now
become
a
standard
for
open
source
yeah.
Oh,
we
were
about
to
get
there
about
the
other,
we're
about
to
get
there.
Okay,
so
I
would
simplify
another
nine
percent
of
these
papers
by
like
snooping
around
through
the
actual
links
they
provided
like
some
people
are
providing
like
a
link
to
the
website
or
instead
of
the
actual
code.
So
we
keep
going
down
the
rabbit
hole.
They
end
up,
getting
to
actually
get
hep
repository.
B
So
now
that
we're
in
the
universe
of
people
who
have
shared
their
code
on
github
among
these
knobs
papers,
an
overwhelming
majority
of
them
we're
publishing,
get
recovery
flows
in
Python,
so
over
half,
and
if
you
look
even
then
like
most
of
these
libraries
are
like
open
languages
right
I
mean
MATLAB
is
like
a
is
a
number
two,
but
even
then
that's
like
much
lower
than
that.
Jupiter
notebooks.
B
It's
considered
a
language
I,
don't
consider
Jupiter
notebooks
a
language,
but
you
know
duds
but
yeah,
but
so,
if
you
hope,
presumably
some
of
those
github
ones
are
also
including
Python
and
then
from
there.
You
have
a
number
of
languages
that
are
also
again
like
and
there's
an
additional
44
repos
that
have
at
least
some
Python
in
the
repository,
and
so
this
sounds
pretty
good,
as
I
mentioned,
because
at
least
now
we're
on
in
a
universe
in
which
everybody
sharing
the
code
of
the
same
place.
Who has used Docker? Okay, great, awesome; you're better than some of the machine learning audiences I've given this talk to, where it's not everybody by any means. So then, who's familiar with Jupyter's repo2docker? Okay. Repo2docker, and I'm not taking it personally, is an open-source project that I'm involved with, which handles this. Who has actually written their own Dockerfile? Who has gone through the process of having a Dockerfile written? Okay.

That's three. Repo2docker is a tool that takes some standard configuration files, like the ones we might discuss here, and helps with the creation of a Docker image based on those configuration files. The configuration files establish the environment, and the environment is what is necessary to run the code; that is how repo2docker works.

More repos may have been added since, but at least at the time I published these results, few had a setup.py, which is what pip uses to install a Python package. Even then, if you look, this is actually pretty rare: for NeurIPS 2017, most people were not publishing the dependencies of their libraries in a machine-readable format.
Where a setup.py was present, I think they were trying to use it so that the repository could be installed as a Python library. And again, to the point from earlier, doing this is really easy. Most of you actually know this, so you already know the take-home point of this talk, which is: run conda env export or pip freeze. Keep doing that; it makes me happy.
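As a minimal sketch of the same idea without leaving Python (the file name and pinning format here are my own choices, not something prescribed in the talk), you can write out the installed packages and their versions programmatically:

```python
# Write a pinned list of installed packages, similar in spirit to `pip freeze`,
# using only the standard library (Python 3.8+). repo2docker-style tools can
# then rebuild the environment from a file like this.
from importlib import metadata

with open("requirements.txt", "w") as f:
    for dist in sorted(metadata.distributions(),
                       key=lambda d: d.metadata["Name"].lower()):
        f.write(f"{dist.metadata['Name']}=={dist.version}\n")
```

The point is simply that the exact versions end up in a machine-readable file that lives with the code, rather than in prose in the paper.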
Basically, if you have those sorts of files, a requirements file or a frozen environment, then you can build a Docker image automatically with repo2docker; you're good to go, so try it out. One of the things you saw in the video we discussed earlier, with repo2docker opening up a Jupyter notebook, is that avoiding dependency hell allows us to build and serve a repository anywhere, because you can then build a Docker image.

I'm not necessarily a systems expert, but roughly speaking, Docker gives you a lightweight virtual image of what you're working with. It allows you to have your dependencies and the whole environment set up and handled in a more lightweight way, which gives you much more control of your environment, and you can have a lot of different kinds of environments going at the same time; it makes things a lot easier. Repo2docker has actually also been used in competitions.

This is a musical genre classification competition, and repo2docker was used for it. It's a machine learning competition in which many people are using GPUs, so this is an example of people using repo2docker with GPUs and for submitting competition images for independent evaluation.

You'll see this here: I'm filling out this form, and it takes a GitHub link. If you look at the log, it says it found a ready-built image of this GitHub repository, which is basically the repo2docker part, and then it launches a Jupyter server from that built image. So it should be going.

This is a NeurIPS paper on machine learning for jets in high-energy physics, and you can rerun someone else's analysis: this is their notebook with their analysis, and with JupyterLab you can actually put everything together. You have the paper here, you can have the analysis, you can look at everything, and you can run it on someone else's compute, and you never have to install anything; you just start running the notebook.

So now we've talked about dependencies, and how dealing with dependency hell, and getting it out of the way, allows us to run the code. I'd like to think next about what we mean by good analysis code. In various venues, code submission is becoming more important. Who here is submitting to venues or journals that require a code availability statement or code submission right now?

At least in the machine learning literature, sharing code is associated with higher citations: you can see here that the average number of citations is generally statistically higher. But this does not control for the institution, the code quality, and so on, so take it for what it's worth.

Here is an example that was published around 2004, one of the early implementations of its kind at the time; it's code that the author published with their paper. So at least at some point this is something people tended to do; it's not totally uncommon, and it actually was useful for the open-source project.

It's still available online in case you ever want to do any Bayesian nonparametric clustering; the algorithms are all there. But despite the fact that machine learning researchers love thinking about the state of the art, for many use cases, and I'm assuming for many of your use cases, state of the art isn't necessarily the primary thing you're considering when selecting an algorithm for your problem.
These models are all pre-trained models out of libraries like TensorFlow or PyTorch. They are standard implementations that you haven't had to carefully engineer yourself, and that people are using. Who is using pre-implemented algorithms for their research right now? Who is not rolling their own models?
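As a minimal sketch of what that usually looks like in practice (my own illustrative example, not code from the talk, assuming the torchvision package is installed), here is how one might grab a standard pre-trained image classifier and adapt it to a new task, in the spirit of the transfer-learning example discussed next:

```python
# Load a standard pre-trained model and swap its final layer for a new task.
# The model choice and the number of classes are arbitrary placeholders.
import torch
import torchvision.models as models

num_classes = 5  # e.g. the number of categories in your own data set
model = models.resnet18(pretrained=True)  # newer torchvision uses a `weights=` argument
model.fc = torch.nn.Linear(model.fc.in_features,  # replace the ImageNet head
                           num_classes)

# From here you would fine-tune on your own data with a normal training loop.
```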
Yeah, I feel like most people are doing that; a lot of people are doing it for baselines these days. In terms of how people are using pre-trained models in the sciences, this is a paper on skin cancer detection: they used a pre-trained image classification model to start the training of their image classification task for skin cancer, ended up publishing it in a medical journal, and established really great baselines for automatic image classification for skin cancer. This has led the community to take greater interest in careful documentation of these pre-trained models. So this is a paper on documenting pre-trained models, so that people who use them can understand how the model was trained, what data it was trained on, and what the model itself is, such that practitioners have a better understanding of the implications of model choice.
This is a discussion between two authors on a Bayesian paper that was doing some sampling for its computation. You can see the author realize that there's a bug in their scientific code; this actually resulted in a paper on the testing of MCMC, and ultimately the authors had to retract the paper and then reissue it with their new result.

Similar things have happened in machine learning. Here is an RL paper where the authors note that the paper was inspired by a bug that was found in some colleagues' code (not co-authors, sorry, colleagues): they ended up looking at the implementation, finding something was amiss with the analysis, and ended up building on that.
So let's think about best practices for research code. None of this is necessarily novel from a software engineering perspective, but I think it's something that we need to think about as scientists. When you are publishing code, and I hope you all publish code associated with your papers, here is a little reminder, at least, to think about how we do testing, how we can include testing and code review, doing good design, and thinking about variable names. So, not foo and bar: come up with meaningful variable names. Even for parameter names, instead of calling something lambda, maybe we name it for what it really means, something like the learning rate. Have variable names that are really expressive, and be willing to refactor our code when necessary.
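A tiny, hypothetical before-and-after of that naming point (my own example, not from the slides):

```python
# Before: terse, ambiguous names.
def upd(w, g, lmbda):
    return w - lmbda * g

# After: the same one-line gradient step, but the names carry the meaning.
def gradient_step(weights, gradient, learning_rate):
    return weights - learning_rate * gradient
```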
To give a deeper example of this kind of practice: in the sciences, correcting a published result is a lot more involved, a lot more challenging, and oftentimes there's a lot more trepidation involved. Whereas in this case, someone is finding a bug in an underlying library, which is important to know, because ultimately we are judging our conclusions based on the software; and at the same time, this is considered an improvement for the general community. It's relatively straightforward: somebody finds the bug, somebody fixes it, everybody benefits. But that's not really the case in the sciences.

If I find a problem with your analysis, or if I think there is something that can be improved, going through the whole process of a retraction carries a lot of trepidation. So one of the things I want to leave you all thinking about is: how do we bring the sciences closer to this sort of model, where incremental improvements and collaboration are considered a positive? I don't necessarily have an answer, but I want to leave this with you.

It's something to be aware of, both in the sense that your underlying software does have bugs, and the libraries you work with have bugs, and also in terms of thinking about how we foster that kind of environment in our research.

So, yeah, this is basically one thing that I don't necessarily have an answer for, but I want to leave you all thinking about it a little bit. We've talked about dependencies, and we've talked about writing good software, and hopefully these pieces give you a lot to think about in terms of implementing your own models and doing your analysis. But maybe that's not enough to ensure the reproducibility of our own research.

These days, research is very computationally expensive and very energy intensive, and so a lot of these results may not be reproducible for the average user or the average researcher. Facebook recently published a reimplementation of AlphaGo and AlphaZero, but they are one of the few institutions that have the computational resources to ensure that their implementation works. This is ELF OpenGo, the reimplementation that they have; it's available to the public. But again, making this work takes these large-scale GPU farms working together, passing gradients around and coordinating so that they can do distributed training and achieve superhuman performance at Go. And Go is not necessarily the important thing here; what you really want to know is the underlying ideas of the algorithm that allow it to work. There's a similar problem elsewhere. Who here does neuroscience?
Okay, good. So at least I don't have to assume that someone here knows this library better than me; I don't use this library myself, but this is a study of the results of an open-source library for brain volume measurements, and it looks at the effect of the system setup on this library's performance on neuroimaging data. It sounds relatively straightforward; you would think the results would be relatively consistent, but they're not.

To me, that's very scary, absolutely frightening, because you might end up drawing dramatically different conclusions based on how your setup is designed. And these are the types of decisions and details that many researchers don't even necessarily consider. We talk about saying, okay, you need numpy, but how many people actually specify the version number of the numpy they're using? How many people specify the underlying type of BLAS they're using?
Which BLAS implementation are they using, and which version of it? What operating system version are they using? Not everybody includes those details.
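A minimal sketch of recording exactly those details alongside your results (my own example; the output file name is arbitrary):

```python
# Capture the interpreter, OS, and numerical-library details that this talk
# argues should travel with every result.
import json
import platform
import sys

import numpy as np

environment = {
    "python": sys.version,
    "platform": platform.platform(),
    "numpy": np.__version__,
}

with open("environment.json", "w") as f:
    json.dump(environment, f, indent=2)

# numpy can also report which BLAS/LAPACK it was built against:
np.show_config()
```

None of this is a substitute for a container image, but it makes the setup you actually ran visible.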
So, with regards to machine learning: you can unit test your implementation, but given a lot of these details, you can't necessarily unit test the results. It really depends.
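As a small sketch of the kind of implementation-level test that is possible (the function under test and the tolerances are placeholders of my own, not from the talk):

```python
# Implementation tests can pin down deterministic pieces of the pipeline, even
# when the end-to-end result is not bitwise reproducible across machines.
import numpy as np


def softmax(x):
    """Numerically stable softmax, the kind of small unit worth testing."""
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / exps.sum()


def test_softmax_sums_to_one():
    rng = np.random.default_rng(0)
    probs = softmax(rng.normal(size=10))
    # Tolerances acknowledge that floating point differs slightly across BLAS builds.
    np.testing.assert_allclose(probs.sum(), 1.0, rtol=1e-6)
    assert (probs >= 0).all()
```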
Ultimately, yes, if you have a unit test, it might work for that case, but you still need to be very, very particular about the environmental setup of what you're running. And even then, when you're doing something on a distributed system, it's not certain that the distributed system will give you consistent results across machines, because one of the things that mattered in the study they discuss is the workstation.

So if you're running this on a distributed system and you're not necessarily hitting the same machines, you might not get the same result. Yes, although one of the things that we don't have a full handle on, given that these implementations are non-deterministic, is what our fault tolerance really is. Yes, but even that is problematic; we can get into that, and we'll discuss it.
Even with all of this testing, down to the actual type of workstation and the actual kind of setup, even when you try to be as meticulous as possible, that doesn't necessarily mean the work is reproducible for all time. This is a paper in fluid dynamics from Lorena Barba; there's a lot of really great work on open science from her, and I would definitely read all of her stuff. She also does a lot of work on alternative scholarship models, and she's really involved in the SciPy conference.

Anyway, she finds that even with really careful protocols within her own research group, she is unable to reproduce her results from previous years, simply because of changes in hardware, in CUDA, and in the underlying software.

She uses GPUs for fluid dynamics, and finds that because of the changes in hardware that occur over something like a five-year period, and changes in CUDA, you can't get the same results, and you have to work very, very hard to get consistent results across these sorts of changes in software and hardware over time. So even when we think about reproducibility, we may need to think about it with regards to some narrow band of time, given the fact that changes in hardware and software are going to happen.
There's a related issue people run into with containers. I was talking with some of the Singularity folks, and they find that, because of how Docker is set up, when you're working with a GPU you essentially have to get out of Docker in order to hit the GPU, so you're not necessarily going to get the guarantees you want; multi-GPU setups with Docker may not work the way you expect.
And when we think about these sorts of concerns, it might not even make sense to make all of our research fully transparent. Many of you have probably worked with sensitive data, data that you can't share; there are certain concerns with sharing, either legally or ethically, and you've had to get an IRB for some things.

Here, publishing everything was probably not the right choice in terms of transparent research: these researchers ended up publishing all of their data and results for OkCupid, collecting people's information off of OkCupid, and the public did not respond well to having people's personal information aggregated and made public online. So when we think about data sharing, we have to think about ethical considerations as well.

Now, when we think about what science and the scientific method are: ultimately, the scientific method has to do with the testing of hypotheses. You form a hypothesis, you say this is what I expect to happen, you identify all of the experimental factors that you want to control for, and then you test in relation to those. Statistical testing is a really old practice; this is probably the earliest statistical testing research I have found.
B
Which
is
on
the
likelihood
of
the
birth
baby,
males
and
baby
female,
and
so
they,
the
the
scientist,
was
looking
at
the
Cystic
alike.
We
heard
they
may
be
born
for
employer
girls
and
find
that,
through
statistical
testing
and
birth
records
data
in
Britain,
they
were
able
to
establish
that
they
were
equally
likely
and
under
which
they
this
they.
They
argued
for
divine
providence,
yeah,
yeah,
divine,
okay,
not
private,
probably
Nantes,
Providence,
divine
providence,
and
so
a
statistical
testing
is
something
that
many
of
you
probably
use
for
in
your
own
research.
B
But
it's
something
that
has
been
underutilized
in
the
machine
learning
literature.
This
is
some
of
the
work.
That's
been
done
for
machine
learning,
statistical
testing
of
multiple
classifiers
from
2006
I.
Think
it's
one
of
the
later
works
that
was
done
and
follow-up
work
on
this
has
not
necessarily
followed.
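As a minimal sketch of the kind of test that line of work describes (the numbers here are invented placeholders, and the specific choice of the Wilcoxon signed-rank test is mine, though it is one of the tests that literature recommends for comparing two classifiers over multiple data sets):

```python
# Compare two classifiers' accuracies, paired by data set, with a
# non-parametric signed-rank test rather than eyeballing the means.
from scipy.stats import wilcoxon

# Accuracy of model A and model B on the same ten benchmark data sets (made up).
acc_a = [0.81, 0.74, 0.90, 0.66, 0.77, 0.83, 0.69, 0.88, 0.72, 0.79]
acc_b = [0.79, 0.75, 0.87, 0.65, 0.74, 0.80, 0.70, 0.85, 0.70, 0.78]

stat, p_value = wilcoxon(acc_a, acc_b)
print(f"Wilcoxon statistic={stat:.2f}, p={p_value:.3f}")
# The p-value says how surprising the paired difference would be by chance,
# which is exactly the question a raw leaderboard comparison skips.
```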
This is a paper on the reproducibility of RL baselines, and they find that, despite trying various kinds of statistical testing, statistical testing for reinforcement learning, at least, is not established right now, either in terms of best practices or in terms of the theory to handle these sorts of problems. So additional work needs to be done, especially in RL, to create better statistical testing for the comparison of models. And given that I might be arguing that we should be doing statistical testing of our machine learning models, many of you are probably familiar with the problems around p-hacking and around manufacturing statistical significance in other sciences.

The Open Science Framework is another database that allows not only the coordination and maintenance of your own scientific studies, but also the pre-registration of studies, so that you have all of this in place prior to doing your analysis. You have a very specific plan of what you want to do, and then you just go off and do it, in a way where everybody has agreed to the analysis beforehand; whether or not it comes out statistically significant, you at least know, after the fact, that you came up with a distinct plan of what you wanted to do and then went and did it, rather than doing it the other way around, which often leads to concerns around p-hacking.

In the scientific literature, at least in studies outside of machine learning, they have found that pre-registration leads to the publication of negative results. A lack of negative results can be very problematic and challenging for a lot of fields, because then we don't know what doesn't work. And they have found that registered reports, which is basically the model that allows for the publication of peer-reviewed, pre-registered studies, make publication independent of significance. You register a study, it is peer reviewed, and then, based on the quality of your analysis plan and the quality of your methodology, it is accepted for publication; they say, yes, you can publish this, come back with your results.

But machine learning research might be a little bit different when we think about pre-registration. In the machine learning literature you don't necessarily have to collect novel data; for many of your problems, you already have the data on hand as well. So then, how would we do pre-registration in our own fields? Maybe many of you have not done registered reports; is there an opportunity for us, as scientists, to adopt registered reports in our own communities?

One alternative way of thinking about registered reports is to do registration at the time of review. In this scenario, we might say: I'm going to describe my model, describe my analysis, describe my plan, and then the actual running of the code will be done by an independent body, an independent agent, and those results are the ones that are published. I'm just saying this could be done, not that it hasn't been done anywhere, but it isn't established.
At least in the machine learning community, a number of research groups and sponsors have cloud companies. For example, some of the main sponsors of machine learning research are Amazon, Google, and IBM, which is a huge funder of machine learning research, and Microsoft as well; they all have cloud services. And even universities have their own compute.

We can imagine, for example, that the submission of the Docker image is the registration, and then someone else takes that and runs it, and the results produced by someone else are the results that are published; rather than only the results you got from running the whole experiment on your own.

Maybe these examples that I have given of registered reports and of statistical testing are too computationally simple. So let's talk about how we would do statistical testing and controlled experiments in another domain. I'm not necessarily the expert with regards to high-energy physics, but I think, at least in other domains of experimental science, this is part of the reason why Rahimi and Recht identify physics as a possible area for us to consider, for the testing of black-box behaviors and models, and for thinking about how we do testing in general. I think there are a few high-energy physicists here who know this problem better than I do, so I apologize for describing high-energy physics not necessarily well. But an example of what you might think about in terms of high-energy physics is that there are multiple institutions that are able to validate and confirm each other's work, and so we can imagine something similar for registration.
This is the Large Hadron Collider, and people can share resources in terms of compute, for registration and for validation. And even in this example, the physics community has a better understanding of the things they want to test for and the things they want to control. In the machine learning community, we don't necessarily have these clear definitions.

We think we just want to establish state of the art, but we aren't necessarily careful about trying to control for all these other experimental factors, and we don't necessarily know how to test for them. So this is something for which I think we need to develop better methods and tools. As I mentioned, high-energy physicists are interested in the discovery of new particles, and when they think about these sorts of questions, they think about the thing they want to test and the nuisance parameters.

Maybe this means you might have to run your analysis a lot more times, with a lot more variability in terms of your system setup, in order to have these sorts of findings confirmed. One of the questions I'm interested in leaving you all to think about is how you do statistical testing and how we develop our statistical tests; I think that's a really important question for all of us to consider as we move towards more deep learning models.

If we were to borrow this practice: many of you are probably familiar with how you do hypothesis testing in your fields, but at least in machine learning, we don't necessarily think about how to formulate hypotheses.

One of the things we're interested in, at least in machine learning, is thinking about the expected behavior with or without some sort of experimental change in our model setup, recording outcomes with regard to various baseline models while considering this sort of variability, and then reasoning about some sort of statistical improvement and making some assumptions about what that might look like. This is some work that I recently presented with Michaela, who's in the room.

The poster is kind of fun; I really like this poster. As I mentioned, being able to establish these sorts of systems-level distinctions, and make them independent of the data and independent of our other experimental selections, is really important, and I think this is something that the machine learning community, at least, is beginning to adopt. I assume that many of your other fields think about these sorts of questions in a deep way.

But I know that this is something machine learning is only beginning to wrap its arms around, and I, at least, am interested in talking to all of you about how you are working on those problems, with regards to bringing machine learning into your fields and thinking about experimental testing and experimental controls given black-box models, because this is something the machine learning community hasn't fully dealt with, and we need to come up with good practices for these sorts of questions.
One of the problems we think about in terms of statistical controls in machine learning is compute. Obviously compute is a challenge, but being able to control for it when establishing a result is something we don't necessarily know how to do. Controlling for the random seed, and actually keeping random seeds consistent between machines, is still hard. So to be able to say that your result is independent of your seed selection, you will probably have to run it a lot more times, and on more machines, in order to establish that your result is consistent regardless of which machine and which random seeds you ended up picking.
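A minimal sketch of what "run it a lot more times and aggregate" can look like (the number of seeds and the summary statistic are my own choices, not a prescription from the talk):

```python
# Report a metric as a mean with a simple bootstrap confidence interval over
# several seeds, instead of quoting the single best run.
import numpy as np


def evaluate(seed: int) -> float:
    """Placeholder for your real train-and-evaluate pipeline."""
    rng = np.random.default_rng(seed)
    return 0.75 + 0.05 * rng.standard_normal()  # pretend accuracy


scores = np.array([evaluate(s) for s in range(10)])

rng = np.random.default_rng(0)
boot_means = [rng.choice(scores, size=len(scores), replace=True).mean()
              for _ in range(10_000)]
low, high = np.percentile(boot_means, [2.5, 97.5])

print(f"accuracy: {scores.mean():.3f}  (95% bootstrap CI: {low:.3f}-{high:.3f})")
```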
But I think all of these concerns about the verifiability and the consistency of results belie a more fundamental point, which is: is verification, and checking assertions, really what we want? Or are we just trying to do all of this so that we can catch people out?

Oftentimes what we're actually trying to do is use other people's code, because we have a problem of our own that we're trying to solve; we're not just going off trying to hunt other people down and shame them. So fundamentally, I think the purpose of having the code is to illustrate the problems and the findings you're working on.

And when it comes to what we are showing: are the implementation details in the paper the main thing that we're looking for? I argue they're not. The most important ideas in the paper are not which operating system you used or which version of numpy; all of that is important, but it's not the main point.

The main point is the underlying ideas that you're trying to explain, or the underlying findings that you're trying to communicate. This is a paper on reproducibility in computational science that is often cited, and it makes a very radical claim that I think is actually still radical today: the scholarship itself, in computational science, is not the paper. The paper is just a description of the idea. And this is from around 1995, so it's an old idea, and even still it hasn't been fully realized: the actual scholarship is the software that produced the result. If you're doing computational science, the actual deliverable of your finding is your code. Your code got you that result; that's your scholarship. These implementation details will be in your code; they don't necessarily need to be in your paper. And hopefully, then, we can build software that helps people do their own research, that helps them understand problems by enabling them to use that code, as opposed to focusing on the details of how to get the code to even run.

So maybe this means we need to be communicating with each other outside of the PDF. How many of you are interacting with each other on GitHub?
Open-source code is becoming a fundamental part of how we're doing science today, and so we can start moving science towards a model in which open source becomes a greater part of how scholarship is shared. One of the things that is important, in terms of thinking about what we're doing, is that we need transparency, but we ultimately also still need results to be independently verified. An example of what we might be thinking about, then, is simple implementations.

This is AlphaZero, and this is an example implementation that uses tic-tac-toe. It has the model of AlphaZero, which allows general-purpose training of models for two-player games, but it also includes tic-tac-toe, and tic-tac-toe is actually much more useful to the average scientist, because it can run on your own machine, it doesn't cost as much money, and it has all of the implementation details of the algorithm. So we can run the whole thing on our own, and we can even plug in a different game.
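As a purely hypothetical sketch (my own, not the repository's actual API) of why a simple implementation makes "plug in a different game" possible: the trainer only needs a small, game-agnostic interface, and swapping tic-tac-toe for another game means implementing a few methods like these.

```python
# A minimal two-player game interface a general self-play trainer could target.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TicTacToe:
    board: List[int] = field(default_factory=lambda: [0] * 9)  # 0 empty, +1/-1 players
    player: int = 1

    def legal_moves(self) -> List[int]:
        return [i for i, v in enumerate(self.board) if v == 0]

    def play(self, move: int) -> None:
        self.board[move] = self.player
        self.player = -self.player

    def winner(self) -> Optional[int]:
        lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
                 (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
        for a, b, c in lines:
            if self.board[a] != 0 and self.board[a] == self.board[b] == self.board[c]:
                return self.board[a]
        return 0 if not self.legal_moves() else None  # 0 = draw, None = still playing
```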
Distill is a venue that allows for explanatory science in the machine learning literature, and this is an example on differentiable image parameterizations. All of these are Colaboratory notebooks that communicate the ideas presented in the paper, and any of these examples can be independently run and interacted with in the cloud: Colaboratory allows people to run a GPU on Google's cloud, so they can run the notebook but also ask their own questions.

This is an example for a paper we discussed earlier, Attention Is All You Need, which I mentioned alongside work establishing that implementation improvements in NLP don't necessarily give consistent results. It's available online as a notebook: you can just go through it, see an independent implementation, and work through it on your own. And to go back to what I mentioned earlier about notebooks: a lot of you have used notebooks in the past.
Many of us use notebooks not necessarily to present science, but just to interact with a piece of data. In this example, someone is using IPython; they're trying to load the data and plot it. Often, when we're trying to think about a problem, we interact with it computationally: we just want to see what happens and begin to ask our own questions independently.

What I'd like you all to think about is that, through open implementations, really good implementations, we can begin to immediately build off of other people's code. This is an example of a research repository on Binder. In this example, they're doing some machine learning for explainability and interpretability, and they are presenting their analysis; on the left, this is their notebook. But because I have the code, and because I am able to run everything independently, and, more importantly, because they're using this example of creating simulated data with four classes, I'm now able to say: what if you did this experiment with five classes? How does it behave then? I can create my own simulated data and run that novel experiment, asking my own independent questions of their model and their method, without having to do any additional implementation work, because it's all there.
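A minimal sketch of that kind of "what if" follow-up question (my own illustration, assuming scikit-learn; the original repository's data-generation code may look quite different):

```python
# Re-run the same kind of experiment with five classes instead of four by
# generating new simulated data and fitting the same off-the-shelf model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=6,
                           n_classes=5, random_state=0)

model = RandomForestClassifier(random_state=0)
print("5-class accuracy:", cross_val_score(model, X, y, cv=5).mean())
```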
If this were another repo that was not as well maintained, and I didn't have this sort of setup ready to go, it would have taken me a lot more work just to get to the point of asking, well, what if you did the experiment this way? I'm sure many of you have gotten reviews asking: why did you do the experiment this way, why didn't you do it that way? How many have gotten that response?

More importantly, I think we need to think about what we really mean by the term "reproduce". That's part of the reason why I'm not particularly interested in the definition of reproducibility, because, ultimately, what I'm really interested in is the question of how we get research to beget better research. That's ultimately what the purpose of science is: to create knowledge such that other people are empowered to ask new questions based on that knowledge. Oftentimes we would call this an extension of previous work.
So fundamentally, that means I'm interested in the question of how we build extensible science: how do we build extensible computational pipelines, such that other people can immediately build off of what we do, and ensure that the community of researchers around a particular question grows?

This is the same data set that I discussed earlier, the NeurIPS 2017 data set; we were talking about citations and about research practices with regards to dependency setup. One of the things I wanted to point out here is that when we look at people who include dependency files, files that can actually identify what software was used in a machine-readable format, we find that the people who publish these details have more engagement on GitHub.

When you fork a repo, that probably means you're going to use it for your own purposes. You could be using it just to make a pull request, which is great, but you could also be using it to repurpose the repo for your own problem, and notably, more people are repurposing these repositories for their own problems, using these machine learning algorithms. That means people are applying these methods to their own problems.
Along the same lines, last year I presented, with some colleagues, an implementation of Binder that included GPUs. We took a bunch of repos of machine learning algorithms that were available on GitHub and ensured that they could be put on Binder and run with a GPU, so you could just log on and get a GPU to yourself to rerun the analysis. So it can be done.

If you have any more questions about how to work with repo2docker and GPUs, please stop and find me. Having a place where repositories can be run independently is, I think, particularly important and worth thinking about, and I would love to talk about this with you more.
I think one of the things I'm really interested in, and one of the things I was pointing at with this slide, is that one of the challenges of machine learning's focus on state of the art is that there is a less developed community for empirical research that tries to understand the behavior of machine learning models.

This is something that I am working on, and it's a really important question. I think drawing attention to it in the community, and trying to develop it as a field, is a really important problem, and it's one of the things I'm interested in working on, but it still hasn't been quite established. There is interesting work, though.

There are workshops that try to address this: there's an ICCV workshop that is interested in a pre-registration model, which is one example of something that might help, and there's a recent workshop I went to at ICML on understanding phenomena in deep learning. But it's still a young field, and bringing the community along, so that people know how to review these sorts of papers and how to reward this kind of work, is still an ongoing process. Does that answer your question?

To me, it goes back to the venues. Ultimately, the venues define what the unit of scholarship is, the venues define what the method of evaluation is, and the venues define what a good paper is and how to reward good work. So to me, fundamentally, we need to go to our communities and argue for research that values software. There is interesting work here; there are journals like the Journal of Open Source Software.

That is something we need to do, but fundamentally, we could also try to talk to our research communities and our departments and have those sorts of conversations about funding. To me, ultimately: if you tell me to submit a Docker image that spits out the PDF, rather than just handing you a PDF, the venue still gets its PDF either way.

If we can move towards that, it would be better, but it is a challenge. I've been to conferences that think about scientific scholarship, and I think the journals are still trying to figure this out, and they're also facing a lot of challenges. In my mind, if we're thinking about what people would pay for in terms of scholarship, I think people would pay for hosting and compute for someone else's research.

How do we bring that in? How do we honor that sort of alternate form of scientific communication and formalize it? A lot of you are spending a lot of time talking about science on Twitter, promoting your work and communicating it there, and if there's a way we can do that better, such that we can connect it to code, that would be really great. But it's hard.