From YouTube: DevoWorm ML: Week 4 (Input Data lecture)
Description
Fourth DevoWormML meeting, September 25. Attendees: Richard Gordon, Bradly Alicea, Jesse Parent, Aidan Rocke, Vinay Varma, and Abraham Kohrman
Sure, I think we can at least get started now. So welcome again to DevoWormML, and thanks again to Vinay for last week's presentation. It was very enjoyable, and I learned a lot about what he was doing. After his presentation, I thought I wanted to make sure that people had the basics down, so I don't want to give a super technical lecture.
So this topic is input data, and basically the idea here is that the data you put into a model, or algorithm, or whatever, is going to be reflected in what comes out of it. I gave it a little Bayesian notation; it's cute, but it kind of means that there is a conditional relationship there: if you put bad data in, or something that isn't very good, you're not going to get a very good result.
It does hold true, however, that if you put very good data in, you can still get a bad result, but if you want to talk about that some more in future weeks, we can do so. Our focus here today is this processing of the data, to have a good, clean set of input data that we can then make inferences on. And so this figure is actually something from a cybernetics article I read; I can't remember which article it was, but I think it's a nice graphic.
So I wanted to introduce people who haven't visited it already to the DevoZoo. I mentioned, I think in the first or second meeting, that DevoWorm has data that we've collected from different places. This is developmental data; it's largely embryogenesis, although there are other examples in this repository.
Here is a link to the repository. We've basically assembled a bunch of datasets from different publications or from different sources. They're labeled here, and you click through, find a dataset in a given location, download it, and use it for different things. You might use it for training a model; you might use it for analysis, whatever. And we've had a little bit of interest in it.
I've used it for Google Summer of Code, and I've used it for people who are interested in contributing to the OpenWorm Foundation; I've given them some datasets right from here. So when we talk about input data, we're talking about various datasets that come from a place like this. You take, say, image data, or sometimes tabular data, put them into a model, and then, for example, categorize the data.
EMBL, for example, is a good place to find molecular data. If you're interested in gene expression or sequence data, EMBL usually has data associated with a publication, so you can download it in tabular form. You have to do a little bit of research on what the variables mean sometimes, but you can usually get a nice dataset that suits your needs. ImageJ has public datasets: ImageJ is an open-source image-processing platform run by the NIH, and they actually have public datasets used to calibrate the program for its different image-processing algorithms. There is also Kaggle (kaggle.com), a machine learning company, which hosts a number of open datasets. I gave this example here of avian vocalizations; this is someone who collected these data just by sitting out with a microphone.
There are also model-organism public datasets, so there are a lot of datasets for C. elegans, or for zebrafish, or for Drosophila, with data in tabular form taken from cell tracking or from other types of experimental setups. So there are different types of open data, and I wanted to bring this up in case people were wondering how you get open data and use it for these models. But they all share one problem, and that is that there's a criterion, or a set of criteria, that we have to evaluate the data on.
So that's something you have to think about, and it takes a bit of research on the researcher's part to figure out how they want to represent their data, or whether it's something they can easily deal with. And then: is it enough data? So in some of the instances we'll talk about today, is there enough data to do what you want, to make sure that the algorithm is trained properly, and so forth?
So there are four terms, all starting with the letter V, that are important attributes of data. The first of these is volume, which means how much data is available. When you download one of these datasets, how much data is there for you to use, to train a model or whatever? Maybe you're interested in a lot of samples, or maybe you're just interested in quantity in terms of bytes.
There's a lot of very high-throughput data out there, so there are a lot of bytes, but in terms of useful samples it might be a little tough to get useful samples out of that, and those are trade-offs you have to think about. Another V word is velocity. You might say, what does velocity have to do with data? It has to do with how much change over time the dataset captures.
A
So,
if
you're
interested
in
like
movement
of
like
a
cheetah
and
you
have
images
of
a
cheetah
moving,
but
maybe
you
only
have
six
images
that
are
sort
of
sequential
you
know
of
a
cheetah
running
across
the
plane.
How
can
that
really
capture
the
Cheetahs
movement,
or
do
you
need
much
more
data
for
that?
Similarly, if you're looking at gene expression, and you want to know about fluctuations in gene expression over time, and you have three time points, is that enough data to really capture what you want? You have to keep that in mind when you get public datasets and work with them: are you getting the right number of samples per unit time?
So maybe there's a better dataset for you to use. And then two more V words: variety, which is how much natural variation is captured by your data, and we'll talk a little bit later about how you might be able to improve upon this; and then veracity, which is how reliable the data are before and after transformation. Veracity, of course, implies truth: can you ground-truth the data against the biology?
If you have some data, and it's just something that you've collected from some repository, does it actually match the measurements, or other such datasets, that are out there? So those are a lot of V words to remember, but I think it's a good framework for approaching the problem of trying to take data and then apply it to some machine learning model. So how do we know if our dataset is usable?
This paper that I have here in this slide lays out a sort of taxonomy. There's usability: how easy is it to figure out what's in the dataset? Then the context of the data, the availability, the reliability (is the data reliable?), and the presentation quality. Those are all factors, and again, these are things you really want to evaluate your dataset, your input, on.
So we want to have something called training data to put into the machine learning model, which is a mathematical model or some algorithm, and then eventually we want to make a prediction. We saw this in the Digital Bacillaria work, where we had some model, we took training data, which were microscopy images, we applied it to the model, and then we made predictions about the shape of cells and their locations.
So the training data is, of course, very important here, because if I want to train the model to make good predictions, the training data has to have easily identifiable features, or, in the case of the Digital Bacillaria data, individual cells that the machine can parse. So all the stuff I've mentioned just previous to this is going to be very important for training, and for getting a proper model that can make proper predictions.
People sometimes think of machine learning as magic, but it's really not magic: you're taking data, you're taking a good model, and again, which model you choose is something you have to do a lot of research on, but you just have to match these up, and then you can get a good prediction.
So, when I talk about training data, we could use open data, but sometimes, for different tasks, people have specially benchmarked and assembled training datasets. This is the MNIST database, a famous benchmark dataset for generic machine learning applications. You can see it's a series of digits, from zero to nine, and it's just people writing these numbers by hand. But notice the variation.
This one almost looks like a trident, but it's still a four, and so forth. A lot of the point in showing this is that the input data should have a lot of variation in it. Each of these columns is a sample, which is something like an individual case: okay, this is one case, and this is another case, in a biological context.
You might think of it as individual organisms, or something like that. So you're showing the variation in this training set: you're lining it up and giving the model all the data, over and over, but with different variations on it, so that the algorithm can pick out that this is a four, but this is also a four, and this is also a four. It would be hard, if you just gave the machine one column, for the machine then to say, oh yeah.
This is a four, even though it's been training on this other four. The same thing with sevens: sometimes people put a cross through their sevens and sometimes they don't, and again the machine wouldn't necessarily know that sevens exist in both forms unless you show them to it through this training set. And so this is a benchmark set, meaning that it's something people can use in publications and say, we trained it on the MNIST database; people know what that means, it's always the same, and it should always give a very similar result.
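As a concrete sketch of that benchmark workflow (this example is not from the slides): MNIST itself requires a download, so this minimal sketch uses scikit-learn's bundled 8x8 digits dataset, assuming scikit-learn is installed, which plays the same role at a smaller scale.

```python
# Minimal benchmark workflow: many handwriting variants per class go in,
# and performance is scored on held-out samples. The small bundled digits
# dataset stands in for MNIST here.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()  # 1797 handwritten digits, classes 0-9
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Fit on the training variants, then evaluate on unseen samples
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

Because the split is fixed by `random_state`, repeated runs give the same "very similar result" that a benchmark is supposed to provide.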
And so the thing to remember about this is that biology, in particular developmental biology, does not really have these types of datasets. I mean, we have some that are famous, but there isn't a benchmark dataset where people know how it's going to perform when they put it into a machine learning model.
But there are also other types of benchmark datasets for training. If you're not interested in handwritten letters and numbers, there are celebrity faces: there's a database where they just give the model a bunch of faces of celebrities and train it on those. In this case you have a label, which is the name of a celebrity paired with a face, and then the algorithm, knowing all of that, can distinguish and identify faces.
There's the Stanford Cars dataset, which is, again, different models of cars, and again it's all about the specific shape and features of the car: if you present enough variation to the model, the model can distinguish between different models of cars. It's pretty simple. And the same holds true for the Iris dataset, which is interesting because it's actually a biological dataset, created by R. A. Fisher about a hundred years ago. It's basically measurements of irises, the flower iris,
across many different versions of the flower. So you assemble this dataset of irises with different shapes and different variants, and you can identify different individuals. Even if you have something that looks odd, that doesn't really even look like an iris so much, it has distinguishing features that allow the algorithm to pick up those features and then correctly identify it as an iris.
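Fisher's measurements ship with scikit-learn, so the classification task just described can be sketched in a few lines (an illustration added here, assuming scikit-learn; it was not shown in the talk):

```python
# Classify iris species from Fisher's four flower measurements,
# using cross-validation to score across different variants.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()  # 150 flowers, 4 measurements each, 3 species
clf = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
mean_accuracy = scores.mean()
```

Even the odd-looking individuals are usually classified correctly, because the four measurements carry the distinguishing features.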
That's something I wasn't going to talk about in this lecture, but since you bring it up, I will mention it. The Iris dataset is good for identifying irises, and you have a lot of variation there, but when you start getting away from the canonical or typical iris, the average iris, then the performance starts to degrade.
So what people have done in the more recent past is come up with something called adversarial training. This is the idea that you train the machine on irises, but you also train it on things that aren't irises but might look similar, or might be related things that aren't irises. As you pointed out, you can have variation in the dataset, but then you might present it with something that maybe just looks like an iris.
It's really ambiguous to the machine whether or not it is one. So you build an adversarial training set that you can use: say, tulips, or different varieties of flower, where you have different shapes that are maybe similar to the iris but aren't irises. You can put a label on those things, and then eventually, if the machine walks onto something that maybe looks like an iris but isn't, it can deal with those boundary cases a lot better.
So yes, people have used adversarial training for that, and it is important to give the model other instances. And again, what you're going to end up looking at is the success rate and the error rate of the model. The success rate would be: how often does it correctly identify an iris? And the error rate is: how often did it identify something that's not an iris as an iris? That's a topic that is very intricate, but I'm just giving you a high-level view.
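Those two rates can be computed directly from predictions. A tiny sketch with made-up labels (the values are hypothetical, added only for illustration):

```python
import numpy as np

# Hypothetical binary labels: 1 = "iris", 0 = "not an iris"
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 0, 1, 0])

# Success rate: fraction of true irises correctly identified
success_rate = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

# Error rate: fraction of non-irises wrongly called irises
error_rate = np.sum((y_pred == 1) & (y_true == 0)) / np.sum(y_true == 0)
```

In the machine learning literature these correspond to the true-positive rate and the false-positive rate.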
Yeah, and I think they do call them that in machine learning; I just wanted to give a more colloquial example for people. But that's the idea.
We want to be able to identify things that are cars, or irises, or celebrity faces, and we want to reject things that are not. And the same holds for developmental biology: if we want to identify, say, a cell boundary, we want to find the cell boundary. We don't necessarily want to lock onto things that merely look like a boundary.
We want to weight, or search for, features that are likely to be a cell boundary. So now we get into something called pre-trained models. We have training data, which is input data, but we also have pre-trained models that we can use. I mentioned this during the talk on Digital Bacillaria; that's what we were using as a pre-trained model. The definition of this is an architecture that is validated by a benchmark dataset,
like the ones I just showed you, to solve a similar problem. Pre-trained models are common in deep learning and in natural language processing, which is a method usually used for linguistics research or text identification, but that's not something we're going to talk about today. Pre-trained models are reliant on that benchmark dataset, and the architecture, say a neural network architecture, is pre-weighted: the weights have already been found, and then you apply the model to something that's similar.
The question, of course, is what counts as a similar problem. A similar problem might be: if you train a model on identifying cars, it might also work for irises, because you're looking at things with features. Or you could create a pre-trained model just for some domain, like handwriting, and it may be very good in that domain. But then the question is, what does similar mean?
Can you transfer it to other problems? That's sort of the downside of pre-trained models: you don't know. And in some cases the model may be generalizable to a host of problems. So there are some pre-trained models, such as Mask R-CNN and various deep nets, that exist in the deep learning literature.
Those models are pretty well generalizable; people have used them on a wide range of problems and gotten pretty good results. But it's worth noting that a given pre-trained model may not be good for your problem domain. Especially in biology, where we have a lot of variation, and a lot of things that are moving, or have unclear boundaries, or whatever, that's going to be an issue to think about very hard before you assume a pre-trained model will solve everything.
And in deep learning, again, they have this wide range of pre-trained models: ResNet, Inception, VGG, trained on ImageNet; there's a huge list, and the acronyms aren't important. What's important is to recognize that a lot of these models exist, but not all of them are made for specific areas of interest or fields of research. And so again, here I have two examples of publications that mention pre-trained models. One is an arXiv preprint, "An Analysis of Deep Neural Network Models for Practical Applications".
They talk a lot about pre-trained models. And then there's "Opportunities and Obstacles for Deep Learning in Biology and Medicine", which actually gives examples of pre-trained models for biology and medicine, though they don't really propose any specific pre-trained models; they just go over the idea. What I'm saying here is that there's a lot of opportunity to perhaps come up with pre-trained models for specific types of data.
Pre-trained models also allow for clearly defined features and classes to be built from data of a specific type. You could have scenes, faces, curves, or even tomatoes; there's a lot of variation, but each of those types of things is very different in the way the variation is distributed. And here's an article, a blog post, highlighting ten pre-trained model types and the datasets that come with them; it's mostly non-biological stuff. And then, overall, pre-trained models allow for faster training.
But that's not always the case, so you might want to approach these models with caution, and here's a Medium article discussing why. Again, the reason is that whereas a pre-trained model might be very good for identifying cars, it might not be good for identifying cells. Still, pre-trained models are an option. So the next thing I want to talk about, beyond pre-trained models, is augmentation. So we have our input data.
Of course, we can't capture every instance of variation, and we know that the machine doesn't know how to automatically make mental transformations; it's not like the human brain in that way. It just knows what it sees. And so one of the things people do is augment their dataset so that they can give the model a wider range of examples of what these things may look like in the real world.
So, for example, you might have pictures of a dog at one angle, straight on, but also pictures where the dog is at an angle for whatever reason, or there's some skew in the way the dog sits; then the model can still identify the dog properly. Dogs are kind of a toy example, but you can imagine other things, like cars: if you took a picture of a car and the picture was at an angle, or the car was going up a hill,
you might not be able to identify it as a car. But if you give the model examples where the car is rotated in different ways, it can pick up the variation in how those attributes appear. Another example is this picture of a giraffe, where we have just photographs of a giraffe taken in nature. The thing is, it's not just important to have the information about the giraffe.
It's also important to have information about the shape, and maybe the shape of the background. So what we can do is create masks of each base image, and we can train the algorithm on the masks as well. This way the algorithm has shape information for both the object and the background, so it can identify the background elements and the foreground elements and then make a correct prediction.
These are just the citations for these images, but using data augmentation in this way allows you to make up for some of the variation you don't have in the input dataset. So if there's not very much variation in your input dataset, you might consider data augmentation to remedy that. I think Vinay actually used some data augmentation strategies in his project this summer, so it's important for biological work.
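The simplest augmentations, flips and rotations like the ones just described, can be sketched with plain numpy (an illustration added here, not code from the project):

```python
import numpy as np

def augment(image):
    """Return simple augmented variants of one image: flips and 90-degree
    rotations. Real pipelines also add small rotations, crops, noise, etc."""
    return [
        image,
        np.fliplr(image),      # mirror left-right
        np.flipud(image),      # mirror top-bottom
        np.rot90(image, k=1),  # rotate 90 degrees
        np.rot90(image, k=2),  # rotate 180 degrees
    ]

img = np.arange(16).reshape(4, 4)  # stand-in for a small grayscale image
variants = augment(img)
```

Each variant keeps the same label as the original, so one labeled sample becomes several training examples.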
So in a biological dataset, you might have animals with different striping patterns, or you might have cells of different shapes that are in the same class or type of cell. These are classes that you can label before you train the model, but they are also natural categories that the model will sort of lock onto. But your data aren't always evenly distributed in terms of these categories.
You might be looking at fibroblasts, which are a type of cell, with the different shapes that they take, and of course some shapes are going to be over-represented in your input dataset and some are going to be under-represented. What that means is that, to give the machine a little bit of help in identifying each class, you can artificially adjust the number of samples in a given class.
Up-sampling means that you have a number of classes, and of some classes you have very few examples, while of others you have a lot. So you would up-sample the ones you have very few examples of, by doing things like data augmentation, or maybe just going and finding more samples from that class, to balance out the input dataset so that you have a similar number of samples from each class. And down-sampling is the reverse of that.
Maybe you have a lot of samples of a certain class because it's a very common cell type, but you don't really need that many samples to train the model on what it looks like. So you might just get rid of a number of those samples and even out your input dataset so that it represents each class evenly. So this is a schematic here where the original dataset has maybe six instances of one class and sixty of another.
Well, you want to do something so that you can make this equal in terms of representation in the sample. You might combine these data in different ways so that you get an even number of samples per class. And there's a practical reason for this, of course, and that is that the machine maybe won't identify some of these smaller classes.
If you just leave it unbalanced like that, the algorithm is going to pick up on what it sees, and if it mostly sees things in the class where you have a lot of examples, it may just identify the ones in the rare red category as the same thing, because it doesn't know any better: it hasn't seen enough instances of the red class. So you want to even out the number of samples in each class that you give the machine to train on. And again, this is another figure from Google.
They explain it in terms of weighting each class by the number of samples, so you can approach this sort of systematically: you might calculate how many classes you have and how many samples of each class, and then figure out how much you need to up-weight or down-weight each one so that you have a uniform dataset. And again, these are the links for the images I took, and they also contain some more information about these kinds of strategies.
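Up-sampling, down-sampling, and class weighting can all be sketched with numpy, using the 60-versus-6 schematic just discussed (the numbers and features here are toy values for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unbalanced labels mirroring the 60-vs-6 schematic
y = np.array([0] * 60 + [1] * 6)
X = rng.normal(size=(66, 2))  # hypothetical features

# Up-sampling: resample the rare class with replacement until balanced
rare = np.flatnonzero(y == 1)
extra = rng.choice(rare, size=60 - rare.size, replace=True)
X_up = np.vstack([X, X[extra]])
y_up = np.concatenate([y, y[extra]])

# Down-sampling: keep only as many common-class samples as rare ones
keep = rng.choice(np.flatnonzero(y == 0), size=rare.size, replace=False)
X_down = np.vstack([X[keep], X[rare]])
y_down = np.concatenate([y[keep], y[rare]])

# Alternative: keep the data, but weight each class as
# n_samples / (n_classes * class_count), as in the Google guide
counts = np.bincount(y)
class_weights = y.size / (counts.size * counts)
```

After up-sampling both classes have 60 samples; after down-sampling both have 6; the weights make the rare class count ten times as much per sample.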
Finally, I'd like to conclude by talking about synthetic and pseudo data. So we've talked about augmentation and balancing your dataset; now I'd like to talk about synthetic and pseudo data. This is something people use to provide training data for their model that they may not otherwise be able to get. The working definition is: a model of the data that generates something not found in the original measurement, or something that is not directly measurable.
So we have examples such as interpolation between samples, or dynamical modeling of a hypothetical regulatory process. Suppose we had two different types of images, and we trained the model on them; of course it can identify the two different types of images pretty well. But then we give it an input image that is in between the two, like maybe a hybrid organism. Will it be able to identify that sample as its own thing, or is it going to throw it into one category or another?
You would need to create some sort of synthetic data to give it information about that intermediate case. Again, even with regulatory processes, you're not necessarily going to get all the data you need. So if you want to look at some molecular pathway where you know the components, but you don't know all the data for each of those components, you might create a dataset that simulates that missing piece. And there are at least three approaches to this.
Some of this you may have encountered in statistics class: resampling of the data, using methods such as jackknifing and bootstrapping; small-sample inference; and then data-dependent priors, which is specifically a Bayesian method. Those are things you can look up on your own, but those are basically the three approaches that are in the literature.
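Bootstrapping, the first of those approaches, is easy to sketch: resample a small dataset with replacement many times and look at how the statistic of interest varies (a generic illustration with simulated values, not data from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(10.0, 2.0, size=30)  # a small "real" measurement set

# Bootstrap: resample with replacement many times and recompute the
# statistic (here, the mean) to estimate its variability.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
```

The percentile interval gives a sense of how much the mean could move around, even though only 30 real samples exist.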
So how might you make a pseudo dataset? You create labels by some protocol, so you would label these fake variables or fake instances. You might then use a distribution, like a Gaussian distribution or something, to create a series of plausible values. So you make an estimate, and you say: my process is Gaussian, or normally distributed, it's random, so we'll generate these values, and this is what they might look like in the real world, and that's how I'm going to model it. And then you might sample these fake data in ways that test hypotheses or allow for variation.
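That generation step is only a couple of lines; everything here (the mean, the spread, the Gaussian assumption itself) is an assumption you are making about the unmeasured process:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assume the unmeasured quantity is roughly Gaussian (an assumption,
# not a measurement) and generate plausible values around an estimate.
est_mean, est_sd = 5.0, 1.2
pseudo_values = rng.normal(est_mean, est_sd, size=1000)

# Label the fake instances so they stay distinguishable from real data
pseudo_labels = np.full(pseudo_values.shape, "synthetic")
```

Keeping the "synthetic" label attached matters later, when you evaluate whether conclusions depend on the fake data.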
So you might say: well, we know it generates these values, but some values are more likely than others, so we can make some other assumptions. And from that you have a nice synthetic dataset that you can then use to train the model on. And then, of course, you can also use labeling in the same way. There's something called pseudo-labeling, which is used a lot in semi-supervised learning. Again, this is where you create labels from unlabeled data: you guess what the labels should be.
But then you can also use the real information to filter out things that are obviously wrong.
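A minimal pseudo-labeling sketch, assuming scikit-learn (the 1-D data here is hypothetical): train on the few labeled points, guess labels for the unlabeled pool, keep only confident guesses, and retrain.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Small labeled set plus a large unlabeled pool (toy 1-D data)
X_lab = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y_lab = np.array([0, 0, 0, 1, 1, 1])
X_unlab = rng.uniform(-3, 3, size=(200, 1))

model = LogisticRegression().fit(X_lab, y_lab)

# Guess labels for the pool, then filter out low-confidence guesses,
# which are the ones most likely to be "obviously wrong"
proba = model.predict_proba(X_unlab)
confident = proba.max(axis=1) > 0.9
X_pseudo = X_unlab[confident]
y_pseudo = model.predict(X_unlab)[confident]

# Retrain on the labeled data plus the confident pseudo-labels
model2 = LogisticRegression().fit(
    np.vstack([X_lab, X_pseudo]), np.concatenate([y_lab, y_pseudo]))
```

The confidence threshold is the filter: ambiguous samples near the decision boundary never enter the training set.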
So consider the following two systems: here's an animation of a C. elegans embryo in early embryogenesis, and this is a Drosophila embryo. Of course, they're very different in terms of the processes going on here, but if we just threw these into a machine learning algorithm, we may or may not be able to really describe them or make predictions. So this is a nice example, because things are moving around.
They don't have very clear boundaries a lot of the time, and there are no labels, at least not in these images. So how do you deal with those using the strategies I just showed? It's just something to think about. I mostly came up with some questions for application to developmental biology, and I think we've covered a lot of these, but just keep in mind that we don't have good training sets for developmental biology. So what would a good training set look like?
What properties should it have? I'll just leave you with that thought, because I think that's probably the most important thing to remember about this: on the one hand, we have very good methods for machine learning that are emerging, but they're not really applied to developmental biology so much. That's why I kind of created this group, so we can think through these issues.
I think, as in the data augmentation example, what people are using these for are, of course, distinct classes. So people are thinking of their data as having discrete states, and that's an assumption that may or may not be true, but you have these distinct things that you want to identify, and the idea is that you don't want to miss things.
They may be rare, but the idea is that the machine doesn't know whether something is rare or not; it just knows whether it sees it, and it needs to know how to classify it. So if something is rare and it comes up in the data, the machine is of course going to misidentify it, because it doesn't really know what the correct label should be. So in a case like that, you might want to have more instances of outliers.
You might want to give the model a lot of instances of outliers so it can identify those. But the idea is that these things exist in discrete states, and that's kind of the problem too: you have a lot of things, maybe, in biology that are intermediate states, like when things are changing, and that's a problem.
I mean, this of course applies to mathematical modeling as well, right? If you come up with a model and you say, well, this is the way we think it works, sometimes it's just not a very good model. So I guess it's hard to say. Basically, with regulatory stuff, people have been doing work with high-throughput data, which is just gobs of data, and they've thrown it at a model and gotten outcomes.
Yeah, I mean, I don't know; there are no established methods, so no one's really published a paper saying this is how you would verify this model. I guess one way would just be to try different models, maybe try something where you know that something is doing some sort of regulation. Okay, yeah, so, for example:
there was a paper on the French flag approach as a model of developmental biology, and they misquoted some of the work. The way we're resolving this is that we're writing a joint paper with them, at the invitation of the editor of the journal, discussing the French flag model versus differentiation waves. Now, this paper is of course not going to be based on any kind of detailed machine learning, but it's an example of trying to discuss two entirely different models for the same thing.
One comment: if we're talking about predicting temporal events, we might need deep sequential models. That's another technique that wasn't covered in this talk. And we need machine learning models that can also do causal inference, exactly. So there are a lot of options for doing things like that. We talked about just putting data into the model; now we're talking about what you can do with the models themselves, because there's a back end to it.
Anyway, we can discuss that later and set it up. So again, if anyone wants to give a talk about something, or even share something that they've seen in the research community, please bring it to the meeting; we would welcome it. Let me know in advance and we'll put it on the agenda, and we can talk about it. So thanks for showing up, and I'm glad everyone enjoyed the discussion. If you have any questions, over.