Description
More about this lecture: https://sites.google.com/lbl.gov/dl4sci/koustuv-sinha
Deep Learning for Science School: https://dl4sci-school.lbl.gov/agenda
B
Good morning, everyone, and welcome again to the Deep Learning for Science School. This is the fourth week of our webinar series this year. I'm very pleased to have Koustuv Sinha today to give us a lecture on reproducibility in machine learning. Koustuv is a PhD candidate at McGill University. He is working with Joelle Pineau and William Hamilton, and he is currently a research intern at Facebook AI. Koustuv's primary interest is in advancing the logical generalization capabilities of neural models in discrete domains such as language and graphs. He is also involved in organizing the annual Machine Learning Reproducibility Challenge, and he is serving as a reproducibility co-chair at NeurIPS 2019 and 2020. Koustuv, I'll give you the floor.
A
Thank you so much for the nice introduction, Mustafa. Welcome to the talk. In this talk I will go over four different agenda items. First, I will talk about why we need reproducibility in science, especially in machine learning. Then I will go through case studies that show the need for reproducibility, covering findings that came up in the literature recently. Then I will talk about what the machine learning community is doing about it, the steps that we as a community are taking around reproducibility. And finally, I will talk in depth about how you can perform reproducible research.

A
A brief background, beyond what Mustafa already gave: I'm a PhD student at McGill University and Mila, I'm an intern at Facebook AI, and I've been running reproducibility challenges for the last three years.
A
Now, reproducibility, replicability, robustness: these are different terms which are used in different contexts, and there has been a recent debate on their exact definitions. Just for the purposes of this talk, I'm going to set up a working definition. It may not be universally accepted, because there are certain things that overlap. By reproducible, I mean that if you use the same code and the same data given by the authors and you get the same results, then that is reproducible. Robustness is what we are looking at when you use different code but the same data. Replicability is using the same code but on a different set of data. And if both the code and the data are different but you still arrive at the same claims as the original paper, then that is the highest level of reproducibility, which is generalization. Now let's take these definitions into account.
A
What is the crisis that we are facing? Nature published a nice report in 2016 by M. Baker on the reproducibility crisis in science. When people were asked, and this was a set of respondents from different domains of science, not only machine learning, 52 percent of the respondents replied that yes, we are facing a significant crisis, and 38 percent responded that we are facing a slight crisis. If we look at the domains involved, chemistry and biology come up as the top two domains facing a lot of reproducibility problems. Looking at this, we might feel safe and think that computer science probably doesn't have that much of a reproducibility crisis. But that is not the case, although in computer science we can actually do a bit more about it.
A
So let's talk about certain findings on reproducibility in machine learning. Reproducibility should be easy, right, at least for us computer scientists: we have the same code, we have the same data, we might also have the same amount of computation, so we should get the same result. It seems that is not the case. Last year at NeurIPS 2019, Edward Raff published a paper, "A Step Toward Quantifying Independently Reproducible Machine Learning Research." In this paper he tried to reproduce 255 papers published from 1984 to 2017, and he found that 63.5 percent of the papers were reproducible and the rest were not. He was further able to reproduce 85 percent of the results after getting assistance from the original authors, compared to only 4 percent when the authors didn't respond; he could only reproduce 4 percent of the papers without any input from the authors. He then identified some significant factors which affect reproducibility.
A
That
includes
pseudo
code,
so
the
papers
that
he
reproduced,
like
basically
had
all
code
available,
but
that
is
not
always
the
case
in
case
of
machine
learning
research
these
days
because
of
several
companies
having
like
their
proprietary
use
of
code.
So
he
mentioned
that
pseudo
code
hyper
parameters
readabilities,
especially
in
equations
and
tables,
and
especially
the
amount
of
compute
that
is
needed.
These
are
kind
of
like
very
significant
factors
affecting
the
reproducibility.
A
Now
there
are
certain
limitations
of
this
work,
so
this
work
was
primarily
done
by
one
author
and
over
the
span
of
five
years,
so
that
might
introduce
certain
biases
in
the
study.
But
still
this
is
a
very
significant
study
which
shows
that
there
is
the
issue
in
reproducibility
in
machine
learning
research.
A
Now, if we dive a bit deeper into several subfields of machine learning, let us talk about computer vision. This is one field where people tend to think less about reproducibility, because the usual notion is that whatever we are doing should be reproducible: it's the same dataset, the same training pipeline, and people in vision essentially use model architectures such as ResNet or DenseNet, deep architectures that give you pretty good performance on a large set of standard datasets. But there have been several works showing that reproducibility is still a big issue. Xavier Bouthillier published the paper "Unreproducible Research is Reproducible" last year. In this paper he shows that if you vary the seeds of different models, different models show different variation in their errors. On the x-axis is the model error, and on the y-axis you see two representative datasets, including the MNIST dataset.
A
We see that even on the small dataset the variation with the seed really matters, and that is a big issue on the smaller datasets; on bigger datasets the variation is still there, but not as severe. Certain models, for example ResNet-101, are extremely prone to different variations of the seed. So the conclusion of the paper was that a single initialization seed is brittle, because most of the results reported right now are reported with only one seed. A better evaluation for a given model would be to evaluate on at least n seeds, but there is also no consensus right now on what n should be.
A
Let's talk about another subfield: generative adversarial networks, which have been a super popular area of research in machine learning over the last four or five years. Since their inception we have seen a lot of papers published and a lot of research on devising different GANs, to the point where GANs have become so good that they can generate realistic images and even realistic videos these days. But even GANs suffer from this problem. In 2017 a paper came out called "Are GANs Created Equal?" In that paper the authors ran a hyperparameter search with 100 different hyperparameter samples per model, and they showed that there is a large variation in the results. That shows how brittle these models are: if a hyperparameter search gives us this big a range, then we need to report it, and if we are not reporting it, then we are not truly investigating what the model performance is. No model was shown to be significantly more stable than the others.
A
Now, this raises the question of a limited computation budget. If you are, let's say, a university student, you do not have the budget to actually run 100 hyperparameter searches, so you will just run quite a few samples and get a certain result, but your result could turn out different if you ran on a large set of hyperparameters. That's why the best score achieved by your model should always be discussed alongside how many seeds and samples you ran, and we should always try to report faithfully the distribution of your scores rather than a single score.
A
Now, let's talk about another field which is very prone to reproducibility issues: reinforcement learning, where the idea is that an agent learns to interact with an environment by trial and error, receiving sparse feedback, and with experience it improves in real time. Reinforcement learning has been used in a lot of real-life problems, from robotics to even financial trading. There are several open-source implementations of the standard algorithms, but if you take those implementations, you might end up finding that none of them gives you the same results on the same task. That is really concerning, especially in the case of reinforcement learning, and people have shown that results vary a lot with different hyperparameter choices.
A
These are super, super sensitive. And if we look a little closer, let's say we take the same algorithm, the same code, the same hyperparameters, and we just vary the seeds. Even then, the expected reward achieved by these models differs. And if the expected reward differs, then what should one do? Basically, if I propose a model, you can come up with a different seed showing that the model doesn't work. So how many trials should we do? In these experiments, the number of trials people report is also not standardized: some people report five trials, taking five seeds, whereas some people report two to three trials. So this also needs to be standardized.
A
Another
thing
is
the
baseline,
so
this
happens
not
only
in
resourceful
learning
but
in
general
machine
learning,
as
well.
People
tend
to
under
report
their
baselines
so
that,
basically
you
want
to
show
that
okay,
my
model
is
superior
than
the
baseline,
so
people
do
not
take
care
of
them
and
people
just
report
baselines
by
copying
from
another
another
paper
which,
when,
if
you
like,
evaluate
on
different
baselines,
then
you
might
see
that
the
baseline
might
be
beating
your
model.
So
there
is
like
a
strong
positive
bias.
That's
happening
so
basically
coming
from
fair
comparisons.
A
So, first, let's talk about the open science movement. The open science movement is not specific to machine learning; it's a much more general movement which says that open science is transparent and accessible knowledge that is shared and developed through collaborative networks. That means that if you care about reproducible research, you should also care about open science, because that's when science is well disseminated among people. There exists a journal for reproducibility, and I want to talk about that journal first.
A
Since the issue of reproducibility is so important, the ReScience journal was set up. This journal is fairly general: it doesn't focus on machine learning only, but on any and all types of computational study, from computational neuroscience to computational medicine and so on, and people can submit reproducibility reports of published papers to this journal. The journal also has an extensive review process, so your reproduced work will be reviewed by a set of editors from the machine learning community.

A
The annual reproducibility challenge reports are also published in this journal. This journal is quite cool: it has open reviewing on GitHub, which means it doesn't have single-blind or double-blind reviewing, unfortunately, but you can still submit any work that you wish to replicate or reproduce, do a thorough analysis on it, and there is a large team of editors doing rolling review over the year. This journal essentially exists to give people a nice incentive to work on reproducibility.
A
Now, let's talk about checklists; this is more specific to machine learning research. My supervisor, Joelle Pineau, introduced the machine learning reproducibility checklist in 2018. This checklist is a set of items that you should check while you are submitting your paper. The checklist doesn't need to be exhaustive; these are just generic guidelines. The next version of this checklist was actually deployed during the review process of NeurIPS 2019, where the reviewers had access to the responses to this checklist, and from now on reviewers at NeurIPS, as well as at ICML, also have access to the answers to this reproducibility checklist.
A
Okay, I have a question which asks whether there is a link to the slides; the link is already shared in Slack. So, about this checklist: it is now part of the NeurIPS, ICML and ICLR submission guidelines. If you look into the checklist, you will find different sections on reporting models and algorithms, theoretical claims, how to report your datasets, how to report your code bases, what to include in your figures and tables, and so on.
A
This checklist is essentially a guideline for you to follow, and we did a lot of analysis on the responses to this checklist: what people reported during the initial submission versus the camera-ready submission at NeurIPS 2019. We collected those responses and did an analysis of the items we are primarily interested in. For example, the link to code was often not available during the initial submission, while it was much more available by camera-ready.
A
This is due to the code submission policies that were enforced at NeurIPS 2019, which I will talk about next. But one interesting, or I should say surprising, finding from this reproducibility checklist is that only 36 percent of papers judged error bars to be applicable to their results, while 87 percent saw clear value in defining the metrics and statistics used. If 87 percent see the value of defining the metrics, then roughly 87 percent should also find reporting error bars applicable. This is where we need to improve as a community, because we need much more stringent statistical validation of our models.
A
There is also some effect of the code submission policies. NeurIPS 2019 asked that code should be submitted, but it was not a strict enforcement; it was more like: you can submit your code, and we strongly suggest you submit it at least by the camera-ready deadline. Here you see that most of academia submitted their code in the initial submission phase, whereas industry submissions largely lacked code in the initial phase, but industry caught up by the camera-ready submission. On the right you see the graph of how many initial submissions turned from "no" to "yes"; essentially, a lot of papers added code by the camera-ready deadline. So this is all due to the different code submission policies.
A
Now, does the checklist affect the acceptance rate? This is a very interesting question, and as of now, no, we do not have statistical significance that the checklist affects it. But we found that reviewers who found the checklist useful gave higher scores. We asked how many reviewers found the checklist useful; 34 percent of the reviewers responded that they found it useful, and within that group we found a tendency to give higher scores to the papers that faithfully reported against the checklist.
A
This is quite interesting, and we are hopeful that in the following years the checklist will be given more and more importance, both by people submitting their papers and by reviewers. You can read more about our checklist and the statistical analysis in the reproducibility program report that we published on arXiv; I will share the links in Slack or in the Q&A later on. Another checklist came up last year, and this is very exciting.
A
This
is
a
checklist
from
papers
with
code
and
it
was
introduced
by
robert
stoznik,
and
this
is
called
as
the
ml
code
complete
list
checklist.
So
this
checklist
gives
you
like
a
nice
set
of
instructions
that
you
should
add
in
your
readme,
while
you're
open
sourcing
your
code.
A
So
they
did
a
like
a
nice
study
with
this
set
of
five
criterias
using
new
ribs
2019
repositories,
and
they
found
that
repositories
which
has
all
of
these
five
criterias
met.
Had
a
median
of
196.5
github
starts,
so
that
is
really
really
significant
number,
and
that
shows
that
if
you
do
follow
these
checklist,
your
research
will
be
more
widely
applicable
to
a
lot
of
people
and
a
lot
of
people
will
use
it
in
their
own
work,
so
yeah.
So
next
I
come
to
the
code
submission
policies
that
were
used
in
the
machine
learning
field.
A
Recently, at ICML 2019 and NeurIPS 2019, the community rolled out an explicit code submission policy. There are many concerns with a code submission policy, regarding dataset confidentiality, proprietary software and so on. When the policy was announced, it was written that code is expected only for accepted papers and only by the camera-ready deadline. But there was still a lot of pushback, because, for example, in the case of dataset confidentiality, a lot of industry researchers say that they cannot release their dataset, and that comes up specifically in medical imaging.
A
But
if
that
is
the
case,
then
one
workaround
is
to
like
provide
complementary
empirical
results
on
open
source
benchmarks,
and
that
would
probably
add
to
more
value
of
your
work.
Then
proprietary
software
is
like
a
common
like
common
claim
for
industry
researchers
as
well,
but
in
that
case
we
suggest
that
if
you
are
in
industry,
you
can
also
like
provide
some
minimal
code
base
which,
which
might
not
have
the
same
training,
but
that
has
the
similar,
like
expected,
results
on
a
small
benchmark.
A
So
that
would
help
a
lot
like
that
would
help
the
community
a
lot,
because
if
you
remember
like
there's,
a
lot
of
papers
came
out
like
bird
and
gpt3,
but
still
the
community
ended
up
replicating
them
using
their
own
code
within
weeks
and
months.
So
it's
like
having
a
proprietary
code
out
like
not
the
proprietary
code,
but
rather
a
simple
version
of
your
code
out.
A
That would be very helpful for the community. This graph shows how many papers are being released with their code. At NeurIPS 2017, when we started analyzing this, about 37 percent of papers shared code, whereas right now that number has reached more than 75 percent, which is very encouraging. This is where we want to go: all conferences should have close to one hundred percent of papers submitted with code.
A
Finally, within the open science movement, I want to talk about some steps we took in terms of reproducibility challenges. We introduced the ML Reproducibility Challenge at ICLR 2018. This challenge is quite unique: given a set of papers that have been accepted, or even papers that have been submitted to a conference, you take those papers, try to reproduce parts or all of the paper, and then submit a report on how well or how badly your reproducibility effort went.

A
But again, reproducibility is not a binary issue. You cannot just ask, "is this paper reproducible?"; that question is very difficult to answer, because a paper consists of a lot of different moving parts. That's why these kinds of challenges are important: people can dive deep into the different claims of the paper and try to work out, if certain things are not reproducible, why they are not reproducible, and that adds to the information of the original paper. The motivation of the challenge is not at all to be adversarial.
A
Now, how is the challenge structured? Essentially, we start with the process of claiming a paper. We want to encourage people to work on the reproducibility of many different papers, so we want to broaden how many papers are being considered. We added a maximum-claims limit per paper, because in our initial editions we found that people tend to replicate papers which are easy to do, which would lead to a lot of students working on the same paper.
A
I should also clarify who actually works on these reproducibility challenges. We see a lot of students and a lot of early-career researchers working on it, because this is a great way for students to quickly dive deep into the state-of-the-art machine learning literature, but we also see a lot of contributions from industry as well.
A
We have also divided it into three different tracks. There is the baseline track, where you work on the baselines used in the paper, because most of the time the baselines are not studied at all; you try to replicate the baselines and do ablation studies on them. Another track is to do ablation studies on the code given by the authors: you take the same code, but you run ablations on different model components and do hyperparameter searches, and that's how you end up learning more about the paper and adding to our understanding of it. And then, finally, the hardest is the replication study track, where you do not use the same code base as provided by the authors.
A
You create the code from scratch, and that turned out to be super challenging, but we were very glad to see that a lot of students tried this track, and it led to a lot of interesting discussions and interesting outcomes. The students, or whoever is working on it, can then submit their work. For NeurIPS 2019 we ran the review process on OpenReview, and OpenReview helped us a lot to set this up: they created an entirely separate portal for us, tied to the NeurIPS 2019 accepted papers, so people who are reading through the accepted papers can easily link to the corresponding reproducibility challenge reports. I encourage you to go to our OpenReview site to see the challenge reports; everything is public, including the reviews. We used the same pool of reviewers as NeurIPS, and luckily we had a lot of reviewers.
A
We had more than 63 universities and 10 institutions participating in the challenge, and we want to see these numbers grow. We had five machine learning courses throughout the world which registered specifically, making this challenge a mandatory part of their final project. If you are an instructor, this is a great opportunity to use the challenge as a final project, because of the timeline: we try to launch the challenge in the fall, so that by the end of fall your students can submit their reproducibility reports.

A
We were very glad about the students, and we worked closely with them to get their reports published, for whoever was selected in the top ten, and we want to continue this trend.
A
There is also related work in the community. Jesse Dodge has been working on EMNLP reproducibility; EMNLP is a conference in natural language processing, and he ran a similar reproducibility challenge with students at the University of Washington in winter 2020. It is very good to see that other venues are opening up to reproducibility challenges. There were also a number of workshops prior to this challenge, at ICML 2017, 2018 and 2019, organized by Rosemary Nan Ke, and all of these workshops had the same objective: getting people to submit reproducibility reports and then disseminating those reports.
B
Let's see, I think we have a couple of questions. Do you want to read the first one, Steve, since you also had a very similar question?
C
Sure, yeah. Koustuv, however you want to do it, we can read the Q&A items to you or you can view them. But let me read this first one, because I also wanted to generalize it a little bit: which machine learning architecture comes with the best reproducibility? And my generalization is more about whether you could expand on the different subfields of machine learning: which ones are the most and least reproducible, and are these related more to the cultures of the communities or to the stability of the methods? Are there interesting things that you've seen there?
A
Yes, that's a great question, and there is no definitive answer. The way I see it, the severity of the reproducibility problem actually varies with the subfield. If you're looking at reinforcement learning, the problem is more severe there because of the variation in training and in the environments, whereas in computer vision or natural language processing there are still issues, but they're not as deep; the variance across models still differs, though.

A
If you ask what kind of models would be more reproducible, that's a question we would really like an answer to. Right now the notion is that if you are working on, say, a large-scale model like BERT or GPT-3, these models could have better reproducibility because they are trained on a large dataset. But that is not quantified anywhere, and it is very hard to quantify, because you would have to run GPT-3 on your platform for n runs, which is a hugely costly operation. So there is no definitive answer to which types of architectures are more reproducible. All I can say is that whenever we report our own model's performance, we should give a good picture of its variance, so that readers get a good notion of how the different models behave.
B
Okay, maybe I can ask another question of my own. You know, at the end of the day, the real scholarly contribution of our research is really the code, right? It's not the...
A
Yes, thanks for the great question. There are a lot of challenges, which I briefly covered under the code submission policy. The major challenge is that a lot of people from industry are not able to submit the same code for proprietary reasons, and also, say in medical imaging, there are a lot of dataset restrictions. Right now, for example, a lot of people are working on COVID imaging; you cannot publish those datasets, because that is very sensitive patient information.

A
In these cases reproducibility takes a hit, and this occurs across all venues. Even in Nature, when computational medical scientists publish their work, some of these works are not accompanied by code at all, and that raises a lot of confusion.
A
One way to mitigate this is: if you have a proprietary dataset, that is fine, but you can also report similar results on open-source benchmarks. For example, in the medical community you can also report results on openly accessible chest X-ray datasets such as CheXpert or PadChest, so that people can at least verify your claims on those datasets. Now, in terms of reviewability, right now we do not require reviewers to run the code as it is given, but we want to move towards that. It is very difficult for reviewers as well to set up the same code, the same dependencies and so on. Some venues, especially in computational medicine or computational neuroscience, have tried to advocate for Jupyter notebooks to be submitted alongside your code, which have the nice property of replicability, so that reviewers can just run the notebooks. But that is more feasible if your work doesn't involve heavy machine learning training; if your training can be done on a CPU, you can do that, whereas for large machine learning models it becomes more challenging.
B
Thank
you,
I
think
yeah.
My
question
essentially
like
was
for
a
summary
of
what
you
already
described
yeah.
I
think
we
have
many
many
other
questions,
but
most
of
them,
as
I
see
them,
are,
will
probably
be
answered
in
the
second
part
of
your
talk.
So
maybe
I'll,
let
you
finish
your
talk
and
then
we
can
come
back
to
some
of
these
questions.
A
Okay, awesome. So far I was basically looking into the problems in the community, but now I want to talk about how you can perform reproducible research. This part follows the talk and also a blog post; I have already released the blog post, and I will share it with you after the talk so that you can go through the different suggested practices.
A
Let's start with a simple example. Say you have an awesome research idea involving transfer learning on the MNIST dataset, where transfer learning means learning from one task and generalizing to the next. You are very excited about setting up this project, you have looked into the prior research, and you are just starting to code, so you start with the basics. In this talk I'm assuming you're working with PyTorch, but you don't have to; you can work with TensorFlow as well.
A
So let's say you set up data loaders, a training loop and a test loop, and now you are running the experiments. But soon you figure out that you have too many arguments in your training: you have so many different model arguments, and they keep increasing as you progress in your research. What can you do about it? If you miss certain arguments, you have to add them as defaults in your argparse setup. There are a lot of nice tools available; one tool I strongly recommend is PyTorch Lightning, which helps in maintaining the different configurations that you run. It basically keeps a copy of the configuration in CSV files, so that you can refer back to what you last ran. But still, maintaining this long list of arguments is tricky, in my opinion. So, for easier management of configs:
A
You can use config files instead of argument parsers. Config files can be either JSON or YAML files; I personally prefer YAML, because there you can add comments. You can have an arbitrarily large number of configuration files and easily run large-scale experiments with them, because running your code becomes as simple as mentioning the name of the config file.

A
Now I want to plug certain libraries. There is a nice library known as Hydra, which you should check out, and Hydra also uses another library known as OmegaConf. Together these two libraries give you a lot of power over config files, because you can use inheritance in your config files, which is very useful. Say you have a base config file with your basic arguments; then, for each experiment, you just inherit that config file and only modify the particular values that you want to inspect in the current experiment. That gives you really nice leverage for maintaining config files. And importantly, you should also release these config files with your open-source code, so that people can replicate using them.
A
Okay, so you have set up your code and you are doing inference. Ideally, you first evaluate on your validation set to improve your model, and finally, when you are about to write your paper, you evaluate on your held-out test set. But for this you also need to save your model, and you need to save your best checkpoints.
A
So the next practice is effective checkpointing. Ideally, you should save as much as your resources permit, but resources are scarce, especially on HPC systems where you are given only a small storage quota to work with. In these cases it's best to save the last epoch as well as the best-performing epoch. You should have some validation metric by which you determine which training checkpoint is the best-performing one, and you save your model and your config files.

A
Some people even save a copy of their code, which enables greater reproducibility if you want to go backwards in time: if your model fails, you can load from a previous checkpoint. The example I'm showing is how PyTorch Lightning does it. PyTorch Lightning is a nice experiment-management library which writes checkpoints based on the epoch number, and you can also configure it to write checkpoints based on the validation performance, so you know exactly which checkpoint you should use.
A
The next practice is very important for reproducibility: you should log everything while training and evaluating, which includes validation metrics and training metrics. Ideally, you can also save your logs to the file system in a log file; you can use Python's logging module, which you can redirect to write to a file.

A
Beyond that, you can use the logging services that are available. One of the earliest, and still the most widely used in machine learning, is TensorBoard, which gives you a nice local visualization once you log your values. Logging is very simple: you just log metrics, and TensorBoard will show you the different metrics and plots.
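A small sketch of this with PyTorch's built-in TensorBoard writer; the metric names and the "runs/exp1" directory are hypothetical:

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/exp1")

for step in range(100):
    train_loss = 1.0 / (step + 1)  # stand-in for your real training loss
    writer.add_scalar("train/loss", train_loss, global_step=step)

# Record the hyperparameters alongside a final metric for this run.
writer.add_hparams({"lr": 1e-3, "seed": 0}, {"val/accuracy": 0.93})
writer.close()
# Then inspect the run with:  tensorboard --logdir runs
```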
A
Now, there has been some criticism of TensorBoard, because you cannot directly interact with the plots, so other entrants have come into the field, like Weights & Biases. This is a really cool platform where you can check interactive plots, go into interactive views of different hyperparameters, see which hyperparameters affect your training and learning, and get a large overview of different plots.

A
There are other logging systems available too, like Comet ML, Visdom and MLflow. A lot of people still prefer TensorBoard because you can deploy it locally. The issue with Weights & Biases and Comet ML right now is that you cannot deploy them locally, which might be a problem for industry people, because these systems tend to log a lot of information, including your system resources, and that might be proprietary information that you don't want exposed. TensorBoard also launched an online version recently, as a way to quickly share your results with your collaborators; it uploads your TensorBoard runs to Google's servers.
A
So you can make use of these different logging platforms. Now, as I said, practices one and three are about good experiment-management practice, and I highly recommend using PyTorch Lightning. It has a large, growing community, with really good practices for fast training, evaluation and validation built in; it exposes a lot of different loggers and a lot of different ways to save and evaluate your models.

A
Although, if you are a PhD student, I would recommend setting these things up from scratch once, because that gives you greater control over and understanding of what is happening and where; once you understand that, you can easily switch to these tools. After all, these are not just libraries, they are frameworks, so you essentially have to learn the framework, and if something changes in the framework later on, you also have to update your experiments for it.
A
Before running the experiments, I recommend you draw n seeds. It's ideal to use five seeds, but depending on your computation budget you can also use three. You set these seeds aside, keep them stored, and never touch them, and this is what you report your experiment results on. You should not optimize the seed; you should report on whatever seeds you have drawn, and you should also report those seeds in your paper as well as in your code.

A
Ideally, you should average over the different seeds, to help readers understand the model variance. I have included a simple snippet showing how to set your seed properly if you are using PyTorch; similar snippets are available for TensorFlow. You also need to keep an eye out for GPU reproducibility, because that is not guaranteed even by the PyTorch team, due to CUDA reproducibility issues; you just need to take care of that, and PyTorch recommends specific APIs to call when setting your seed. This is probably the most important part of making your work reproducible.
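A sketch along the lines of the seed-setting snippet shown on the slide, covering the Python, NumPy and PyTorch generators plus the cuDNN settings mentioned above:

```python
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy RNG
    torch.manual_seed(seed)           # CPU RNG
    torch.cuda.manual_seed_all(seed)  # all GPU RNGs
    # CUDA/cuDNN determinism: slower, and still not a full guarantee,
    # but it removes one common source of run-to-run variation.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(1234)  # one of the seeds you set aside up front
```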
A
The next practice is the most obvious: versioning your code. You should always start by setting up a git repository, but the key thing is that you should commit early and commit often, and you should write descriptive commit messages. In machine learning, one thing you can do is add the raw results directly in the commit message: you can say, I ran this experiment, and it gave me this result.

A
As an example, I show a commit message from the Hugging Face repository; they have a really nice way of committing the different functionalities they add to the repository. That is an open-source repository, so they have to maintain these standards; you do not necessarily have to do all of these things, but at least for your own sake you can add as much descriptive information as possible.

A
GitHub is your friend. You should also tag versions of your project at major decision points: say you want to incorporate a new model architecture or a new training setup, then prior to that you should tag the version along with the experiment results, and you should keep a separate branch for small proofs of concept. Essentially, use git and GitHub to their full potential.
A
Okay. The next thing, which a lot of people tend to overlook, is minding your data, or the datasets that you're using. Say that, in the transfer learning setup, you decided it would be a good idea to mix certain classes of Fashion-MNIST and MNIST digits together; Fashion-MNIST is another dataset where, instead of handwritten digits, you have small images of different types of clothing, bags, and so on.

A
Now you went a little too deep down this rabbit hole and ended up creating many different data splits, and finally you are not sure which data split you are using; you may even have overwritten one of your data splits, so you are not getting the same performance. So you should also keep track of your data, and this is practice number six. The easiest way to keep track of your data is to add it to a git version system.
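Not from the talk itself, but a small complementary sketch of data tracking: record a hash of the data file and the split indices next to your experiment outputs, so you can always tell which data split a result came from. The file names here are hypothetical:

```python
import hashlib
import json
import numpy as np

def file_md5(path: str) -> str:
    """Hash a data file in 1 MB chunks so large files don't blow up memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

rng = np.random.default_rng(0)          # fixed seed for the split itself
indices = rng.permutation(60000)
np.save("data/split_indices.npy", indices)

manifest = {"data_md5": file_md5("data/fashion_mnist_mix.npz"), "split_seed": 0}
with open("data/split_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```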
A
These are some easy things you can try, but you should also back up your data periodically, using Google Drive or AWS S3 buckets. There is also a recent entrant called Data Version Control, or DVC, which is essentially git for datasets: you add your data to it and it gives you a nice way to track your data as well.
A
Data tracking is very important. If you are releasing your own data, you should also consider adding a datasheet. Datasheets are very important, and I encourage you to read the paper "Datasheets for Datasets," where the authors propose adding something like a README for the dataset, containing the motivation, the composition of the dataset, the collection process, the pre-processing used, the use cases of the dataset, and how the dataset will be distributed and maintained. These are very important points to cover if you are releasing your own dataset.
A
So now you have done your experiments and you have really shiny plots. You show them to your supervisor, and your supervisor says, I don't like this plot, give me another plot. So you run back and replot, but then, after a certain number of weeks, you cannot find your plotting code again.

A
This matters because it helps a lot with paper production, both when submitting your paper and when doing the camera-ready. The best thing to do is to maintain notebooks: keep a set of Jupyter notebooks in a separate folder in your repository, with separate notebooks for data analysis, result analysis, plot generation and table generation. Why? Because when you get reviewer comments and want to update certain plots but not the others, you can just rerun the relevant plotting cells, and that's it. You should add these notebooks to GitHub, and GitHub also renders notebooks inline, so you can share your intermediate results with your peers and collaborators.
A
You can also supercharge your existing Jupyter notebooks by using the jupyter-contrib extensions, which give you a lot of powerful tools, from collapsing cells and headers to tables of contents. You may also want to share your results using Colab, Google Colaboratory, which gives you GPU and TPU runtimes, and you can also use Binder, another service which gives you Jupyter notebooks backed by a virtual machine.

A
When you need to update the results in your paper, you can rerun the cells, and you can use something like the papermill API so that you can run your notebooks with a different set of parameters from the original notebook. You don't have to worry about memorizing all of these tools.
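For reference, a minimal sketch of parameterized notebook execution with papermill; the notebook names and the "lr" parameter are hypothetical:

```python
import papermill as pm

pm.execute_notebook(
    "notebooks/plot_results.ipynb",          # the notebook you maintain
    "notebooks/plot_results_lr3e-4.ipynb",   # a new copy produced with the injected parameters
    parameters={"lr": 3e-4, "run_dir": "runs/exp1"},
)
```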
A
I have written all of this up in my blog post, and I will share it at the end of this talk. Another point: if you are maintaining tables, pandas has a nice to-LaTeX API, which I use a lot; it gives me nice LaTeX tables for my experiments without having to copy the results into the paper by hand.
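A sketch of going from a results DataFrame straight to a LaTeX table; the numbers and column names are made up for illustration:

```python
import pandas as pd

results = pd.DataFrame(
    {"model": ["baseline", "ours"], "accuracy": [91.2, 93.4], "std": [0.8, 0.5]}
)
# Paste the printed output directly into the paper instead of retyping numbers.
print(results.to_latex(index=False, float_format="%.1f"))
```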
A
The next practice is reporting the results. You should always try to report results with proper error bars. As I said, there is a problem with seeds: you should not run a grid search over your set of seeds. As I mentioned, you set aside a set of seeds and you run your model on them again and again, and that gives you a proper variance in the results.

A
Even when reporting a table, you should mention the confidence intervals and the variance in your paper. I defined different criteria earlier: one is multiple seeds; a higher bar of reproducibility is multiple datasets. You can also report your results with their variance across multiple datasets, which is a much higher bar of reproducibility, towards generalization. Even if your model has larger variance across different datasets, it is still encouraged to report it, because then people will have a better understanding of how your model behaves.
A
Okay, that was a lot of practices; we have just three more left, so bear with me. Practice number nine is managing dependencies, and this is very important. Irreproducibility often stems from software deprecation. To replicate a published work, the first thing to do is to match the same development environment, containing the same libraries that the program expects. Thus it is crucial to document the libraries, and their versions, that you use in your experiments. After your experiments are stable, you should leverage pip or conda to collect the requirements in a file, and you should add it to your repository.
A
When you use Python, this is a nice way to keep track of the different libraries, but there could be other factors at play as well, so it would be even better to use Docker or Singularity containers. With these, you can upload your Dockerfile or your Singularity file to a service, and then your setup is easily reproducible with the exact same environment.

A
Now, let's say you did not work with Docker from the start of your training, because training inside Docker is a bit trickier: you have to use nvidia-docker and run on systems which support it, and a lot of HPC systems do not support Docker out of the box; they support Singularity instead. In that case you can use something like repo2docker, which converts your existing repository into a Docker image, and then you can adapt it and use it.
A
Then comes the next practice: open-sourcing your research. After your paper is released, you should consider open-sourcing your work. This adds visibility to your paper and it encourages reproducible research; it is basically the hallmark of reproducibility. If you have good, well-documented code alongside your paper, everyone will love it and build on top of it, and that will also give you more citations.

A
There is a great service for this, Papers with Code, where you can list your code alongside the paper, so that people have more visibility of your released code.
A
Now
before
you
release
code,
there
are
some
pre-release
checklists
that
I
want
to
like
mention.
So
basically,
I
talked
a
lot
about
like
maintaining
different
commits
for
your
research,
but
then
those
commits
are
are
especially
for
you
to
read
and
for
you
to
understand
where
you
did
certain
changes
now.
The
way
we
do
research
is
very
messy.
We
tend
to
like
just
fix
small
things.
We
say
fix
this
bug
or
no.
A
This
is
not
working
and
stuff
like
that,
but
these
commit
messages
when
it
becomes
public,
it
might
become
like
a
a
bad
thing
for
the
people
to
read.
So
it's
ideal
to
squash
your
commits
in
the
public
branch
to
a
single
commit
before
you
make
your
repository
public,
so
that
helps
you
to
remove
the
unwanted
commits,
and
it
also
helps
you
to
like
remove
any
sensitive
information
that
you
might
have
in
your
commits.
So
you
should
also
make
sure
that
your
code
doesn't
contain
any
api
keys.
A
So
if
you're
using
like
like
weights
and
biases
and
comet
ml
a
lot
of
times,
we
forget
that
our
api
keys
are
still
in
the
code,
so
you
have
to
like
remove
them
before
releasing
your
code,
so
you
also
need
to
keep
an
eye
out
for
hard-coded
file
locations
so
that
people
can
run
your
code
in
a
separate
environment.
A
You
should
also
format
your
code
properly
to
improve
readability,
so
you
can
use
something
like
black.
So
it's
a
python
formatter
which
formats
your
code
in
a
nice
readable
way.
So
you
can
just
like
black
all
your
code
at
once,
like
you
can
just
run
black
then
star,
which
includes
all
your
codes
and
then
it
would
format
them
in
a
really
nice
way
and
the
final
part
very
important
is
to
document
your
code,
so
you
should
add
like
documentation
as
much
as
you
can.
I
know
it's
it's.
A
I know it gets a bit difficult to do all of these things at once, but document as much as you can, in your libraries and in your function calls. It would be especially great if you add the tensor dimensions of the inputs and outputs of your functions; that helps the machine learning community understand which tensors are going into and coming out of each function.
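A sketch of the kind of shape-annotated documentation suggested here; the encoder itself is a made-up example:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size: int = 10000, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        """Encode a batch of token ids.

        Args:
            tokens: long tensor of shape (batch, seq_len).
        Returns:
            float tensor of shape (batch, seq_len, hidden_dim).
        """
        embedded = self.embed(tokens)   # (batch, seq_len, hidden_dim)
        output, _ = self.gru(embedded)  # (batch, seq_len, hidden_dim)
        return output
```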
So practice 11 is effective communication via the README.

A
Once you release your repository, you should add the information from the ML Code Completeness Checklist to your README, so that your repository gets a lot of stars and a lot of visibility. And it's not only about publicity: people can easily replicate the results if you have already shown in the README how to do it. It's also really good to have a contributing guide, so that if people want to contribute to your project, they know what to work on and how.
A
These days it's also really valuable to release a blog post around your paper, a more informal document where you talk about the different things you worked on. And, just for visibility, people also post their paper and code on Twitter, because a lot of academic researchers discuss things on Twitter these days; that's also a good channel for effective communication. So, I'm at my last practice, and it has been a lot.
A
These are very important practices, and the last one matters just as much: test and validate your setup on a different machine. As I said, you have to take care of hard-coded paths, dependencies and everything else, but the best way to ensure that everything works is to use Google Cloud, AWS or a similar service to spin up a small environment, and then just test the inference of your model, or train one epoch of your model.
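A sketch of the kind of one-epoch or inference smoke test you might run on a freshly spawned cloud machine; the tiny model and input shapes here are stand-ins for your real repository code:

```python
import torch
import torch.nn as nn

def smoke_test() -> None:
    # Stand-in for loading your real config and model; in practice you would
    # import them from your own repository.
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
    dummy_batch = torch.randn(4, 1, 28, 28)  # tiny fake MNIST-like batch
    with torch.no_grad():
        out = model(dummy_batch)
    assert out.shape == (4, 10)               # inference runs end to end
    print("smoke test passed:", tuple(out.shape))

if __name__ == "__main__":
    smoke_test()
```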
A
That gives you a good sense of whether your model runs as intended, and whether there are issues with hard-coded paths or dependencies you haven't mentioned. So these are the 12 practices I wanted to talk about, and the key takeaways of this talk are that reproducibility in machine learning is extremely important for the advancement of the field, and that the ML community is coming up with innovative ways to encourage it.
A
I will also share the blog post; it's already online on my website at cs.mcgill.ca, under my McGill username (with a tilde in front), under "Practices for Reproducibility." I will share the link in Slack and in the question-and-answer session. It is essentially a companion blog post to this entire talk; in it you will find all the tools that are necessary, or that you may want to use, for these best practices.
A
Finally, to conclude, I want to mention a nice phrase used by my supervisor, Joelle Pineau, at NeurIPS 2018: science is not a competitive sport. We tend to look at it that way; we try to beat other models by posting better models, to beat certain baselines, certain leaderboards and so on. But science is not only about beating certain baselines or models; it's about deeply analyzing what is happening and how we are advancing the field.
A
So we should all care about our work, and we should give enough time and care to our research. I understand that, due to different incentives, like the incentive to publish faster or to get your results out before you get scooped, these things tend to get overlooked. But if you care about your work, you should devote more time to it; the more time you devote to these issues, the more long-standing your work will be, and the more useful people down the line will find your work.
A
So thanks so much for listening to this talk. Thanks to Joelle Pineau, Shagun Sodhani, Jessica Forde, Matthew Muckley and Michela Paganini from Facebook AI Research for helping me put this talk together and giving suggestions, thanks to the hosts of this seminar for giving me the opportunity to talk, and thanks to all the people who have been involved in the reproducibility challenge, in ReScience and in building the checklist, and especially the OpenReview folks, who have been very helpful in setting up the platform.
B
Thank you, Koustuv. This was really a great talk, a tour de force of all the things related to reproducibility in practice; I certainly learned a lot. I think we have many questions; several of them have been answered in the second part of your talk, but let me go through a few of them. There are different levels. The first one is: is the reproducibility challenge still open to work on?
A
Yes, that's a great question. We will relaunch the reproducibility challenge for NeurIPS 2020. Right now, as in tomorrow, NeurIPS is going to announce the reviews for NeurIPS 2020, but the final paper acceptances will be released somewhere around the end of September or early October. That is exactly when we will launch our reproducibility challenge, and it will run towards the end of December or early January, so that you have enough time to work on it.

A
And if you are at a university, I would encourage you to contact your supervisors or professors beforehand, so that they can make participation a regular part of your course. We also list the participating courses on our website, so that everyone has good visibility of which courses and which institutes are taking part.
B
Another question is: which of these practices are, in your opinion, the hardest to adopt?
A
I would say the two hardest are managing dependencies and being disciplined about seeds. For dependencies: let's say you train on PyTorch 0.9 and then people upgrade to PyTorch 1.5; certain things might not work, so you have to mention explicitly which versions of the libraries you are using, and it would be ideal to share Singularity or Docker containers, but that's a lot of work for people to do. That's why services like repo2docker, Binder or Colab come up.

A
In the case of maintaining seeds, the problem is a bit different. If you are trying to show that your contribution is significant, and you set certain seeds early on and see that your model is not performing well, you will have the urge to change those seeds, and that is where the difficulty comes in. You should restrain yourself and say:
A
"Okay, I'm not going to change the seeds," because that is the fair assessment of reproducibility: you keep your seeds aside, you report whatever you get, and you focus on model improvements on those seeds that you have set aside. So I would probably call these two practices the most difficult, but I'm sure they are not that difficult if you are careful about it.
B
There is, I think, a question that is more of an opinion question. It says: publishing trained models means posting binary data, and GitHub-style repos are not well suited for that purpose. Is there a repository for massive binary data archives, and who will pay for such massive publicly available storage?
A
Right, that's a great question. To answer that: the PyTorch team has released PyTorch Hub, where you can upload your trained models. Otherwise, if storage is an issue, you can store the checkpoints on AWS S3; with S3 you can use a long-term archival format, which keeps the cost down, although it's still a bit costly.

A
If you are coming from industry or a lab, you should get your lab to fund it, and AWS S3 long-term storage is cheap enough. Otherwise, what I do personally is just upload the model checkpoints to my Google Drive, but essentially I pay for the Drive. That's one issue students have to face when they want to publish these model binaries. So you should work it out with your supervisor for a good option; a lot of labs have a common enterprise Google Drive which you can use to share these model binaries.
B
A
Yes. As a student, what you can do is take advantage of HPC resources. If you are a student in Canada, you have access to Compute Canada, which is a huge resource for all students studying in Canada, giving you access to a large array of GPUs.

A
In the US there isn't a Compute Canada-like setup, but there are a lot of HPC programs at different institutions, where you can apply with a project proposal and make use of them. But yes, for sure, I agree that if you are a student in a lab which does not have these GPU resources at hand, reproducibility gets a lot more difficult. In that case, the focus should probably not be on problems that require a lot of scale, but rather on problems that are more analysis-driven or that require deeper thinking about the different architectures.
A
To give you an example: it's very costly to train BERT- or Transformer-type language models, but it's easy to run inference on them, and that led to a large subfield using probing tasks, where probing tasks are simple linear models that you train on top of your language model to see whether the language model is learning syntactic or semantic cues. These kinds of research areas keep coming up, and there are a lot of exciting research directions you can work on where you do not need to scale up that much. For example, in reinforcement learning, most of the fundamental work can be demonstrated in smaller environments, which can easily be run on your own lab setup. So that's my suggestion.
B
Thank you. Another question is: does regularization have a positive effect on reproducibility?
A
Yes, regularization has a really strong effect on reproducibility, and a lot of papers tend not to focus on it. For example, you might propose a very fancy model, but if you apply regularization to your baseline, you may find that the baseline is in fact better than your proposed model, so we should be looking into that as well.

A
Recently there was a nice paper where people showed that you can train an MLP to replicate a CNN architecture by using L1 regularization, and these are really important findings. You have to keep this in mind while proposing new work, and that's why it is very important to first evaluate the baseline: if you evaluate the baselines thoroughly, using all these regularization techniques and ablations, then you will know exactly where your contribution fits in.
B
Maybe another question on how to get more of this. The question is: do you think there will be open-source courseware that pedagogically teaches researchers how to structure projects to encapsulate all these reproducibility techniques, since this would be useful for people outside of ML too?
A
Yes, I actually looked into this, and I found one Coursera course available on reproducibility, but I felt that course was a bit limited in content, so it would be great if the community came up with a proper reproducibility course. As far as I know, my supervisor plans to run such a course at McGill at some point, so I will ask her again once she has the necessary setup. To mention another course, there was one that looked into effective machine learning training pipelines; that was a more undergraduate-level course, I think at MIT.
B
Thank you again, Koustuv. I think there are several other questions, but maybe they can be taken on Slack; if the attendees want to follow up with their questions on Slack, that would be great.

A
Sure, sure, yeah.

B
Thank you for this great talk and for the many resources; I'm looking forward to reading your blog post, and I think some of the attendees already said that they're reading it. Thanks to everyone for joining the lecture today. Next week will be a break, and then after that we will have a lecture on uncertainty quantification in deep learning, so I hope that you join us then. Until then, please be safe.