DevoWorm Lab Meetings, 5 Oct 2020

From YouTube: DevoWorm (2020, Meeting 35): biological problems in software, Hacktoberfest, Multivariate Analysis

Description

Discussion on working with and encoding biological problems in software, Hacktoberfest updates, and a presentation on PCA, tSNE, and UMAP.

B

What's that? How are you? Oh, okay. I haven't done a lot directly for DevoWorm stuff, but have you submitted the abstract yet for the DevoWorm stuff?

A

Yeah, I did, to Neuromatch, correct. I just submitted it, I think this weekend.

B

Okay. We can talk elsewhere about the other one, for the other group. But yeah, I haven't done a lot with DevoLearn yet, but I will be doing more with it.

A

Well, that's okay. I think it's fine. I think the Neuromatch stuff is keeping you pretty busy, and then there's the rip stuff we're doing with the other group, so it's kind of taking all the time right now. But I think something good will come out of that, hopefully. That work is a little bit, I don't know if it's far afield, but it's definitely something where they probably won't have another similar talk at Neuromatch. And then there's the stuff for DevoWorm that I'm proposing, so we're putting that in.

D

Hi, Dick. Hello, how are you? All right. Okay, I'll do some quick rearranging. Okay, there you go. One fell off, and I have someone who wanted to meet right in the middle of your meeting.

D

Okay, all right. Oh, Susan can't make it today. Okay, but I hope you will schedule with her a repeat of her talk.

A

Oh, she just gave a talk last week. I think it is recorded. I don't know, I think I sent you a link. I can send you a link again.

D

Please do so. I get hundreds of emails per day, and some get buried.

A

Yeah, that was a pretty good talk. It was on her optical imaging work. It got fairly good feedback, although we don't have any other experts in the group.

D

Last week I had a bad back. I seem to be recovered, though.

A

Good. Yeah, it gave her some practice in giving the talk, and I think it was pretty interesting to other people. Okay, good. Well, welcome to the meeting! Sorry I have to keep shifting the time, but I had a prior engagement that was at one time and then shifted to another time. I have to make that meeting, so I have to move this one. I hope that if you're not at the meeting, you're watching this later.

A

So, that being said, we're going to talk about Hacktoberfest, which is the event going on on GitHub where, if you make a certain number of commits to a repository that's participating, you get a free t-shirt from Google, and that's enough incentive for some people to make commits to things. There's a whole marketing campaign around it, and we'll talk about that in a bit.

A

Also, I have a presentation on multivariate analysis that I'll present. I don't know if anyone else is going to show up, but we'll do that at the end, so that if people show up they won't miss anything.

A

And then, you emailed me this week that you had a problem, and that problem was that you have a lot of ideas, but you don't know how to implement them in software, or something like that. I know how, I'm just extremely slow.

A

So, yeah, I mean, I guess. Oh, there's Krishna. Okay.

D

The problem is that learning a language without anybody around to help is very difficult, because all computer languages have folk knowledge which is not transmitted with the language itself.

A

Yeah, that's a problem. I mean, there are people who go online and chat about different problems. Usually, if you're in a class, you might have a homework and you solve the problem.

D

I remember I taught Pascal, and then I attended a class maybe 20 years later, and I skipped after the first day. I didn't understand a word this person said.

D

So there's a lot of assumed knowledge. I mean, I used to speak Fortran IV, but that's ridiculous. It's now quite a dead language, and the subsequent Fortran, the one that they have now, is pretty opaque too.

A

Oh, the newer version of it.

D

Yeah, there's a new version of Fortran, but for a Fortran IV programmer it's a lot to learn.

D

So anyway, I've tried Mathematica, and so far my code looks like Fortran IV.

D

And the problem is, when it doesn't run, it's sometimes impossible to figure out what went wrong.

D

You're running a program, the output looks fine, I make one minor change, and there's no output.

C

What did I do wrong?

C

So anyway, that's my level right now. Okay, it could be fossil free, but.

A

Well, I think it's very hard, because languages move so fast. Just trying to stay up with the state of the art is always a challenge.

D

Oh, I know. I once had a book that had 800 computer languages outlined in it, and Pascal was not yet included.

D

I'm sure if you made a similar book now it would be twice as big or more.

A

Yeah, definitely. I've done a lot of stuff in MATLAB, and I've done a lot of stuff in Python, which is a newer language. That's what I'm familiar with, and of course now Python seems to be the open source standard. When we go to Google Summer of Code, for example, the standard is usually Python. It's not that hard of a language to learn, but it's a bit hard to really get into it, like, how do you do X? You have to put together the pieces: you figure out, okay, this requires a loop, or this requires some module that you have to load in, and you have to get it running. So it's a challenging thing, but fortunately there are a lot of tools out there that are available for people to make their work easier.

A

So I think Jesse was part of Neuromatch this summer. Neuromatch is a summer school, and they're running it virtually and trying to get people to run simulations, I guess. So, Jesse, do you want to relay your experience on that?

B

I kind of missed the setup for that. I heard Neuromatch, but what was it?

A

Well, just how you were able to handle a lot of the notebooks, and how did you handle programming in Neuromatch?

B

Oh, I mean, they had, let me turn this off. I like what they did, and the spirit was very good. The execution of it was a little bit weird sometimes.

B

We did basically everything in Google Colabs, and they had very good videos. I liked the tutorials, but there was a bit of ping-pong back and forth when you're trying to learn or write code within the Colabs.

B

That was probably the strangest part for me, but overall I like what they were trying to do. I'm a bit delayed with everything I have going on right now, but I'm still working at my own slow pace, going over the material again. I really liked that they had a very good blend of topic overview and specific tutorial: here's the code, get to work with it in a Google Colab, and even work with other people. When it's done right, it's a very good way to engage the code and get some direct connection.

B

It was hard because there was just so much content, and it was very difficult to get to the level of depth in the amount of time. We had one day on reinforcement learning, for example. It was extremely oversaturated in terms of material, but it's a reference to go back to and look at again, and I think it's definitely setting the bar for that kind of learning environment in the future.

B

I'm not sure what you wanted me to comment on, but that's what came to mind first.

A

Yeah, I think that's good. So let's back up. You mentioned Google Colab, so they actually have notebooks now that you can execute code in. I think we've talked about the Jupyter notebooks in the group previously, where you have these little cells, which are these little windows, and you put a piece of code in and you can run it in real time.

A

So you install this editor on your machine, for Colab or for Jupyter, and it gives you this composition notebook. In each part of the notebook there's a cell, or a window, where you can enter some code and run it in real time, so you can test it to see if it works. We've used the notebooks in Google Summer of Code for a lot of things. They're very convenient and, like Jesse said, they use them in lesson plans for scientific simulation.
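
As a minimal sketch of the cell-by-cell workflow described here, the snippet below mimics two notebook cells in plain Python. The `# %%` comments are a cell-marker convention understood by some editors; the variable names and numbers are hypothetical and only there for illustration.

```python
# %% Cell 1: load a module and define some data (made-up numbers)
import statistics

cell_counts = [1, 2, 4, 8, 16]

# %% Cell 2: run one small computation and look at the result right away
mean_count = statistics.mean(cell_counts)
print(mean_count)
```

In a notebook, each cell can be re-run on its own, so a change to the second cell can be tested without re-running the first.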

A

So in this case they were looking at neuroscience simulations or other things, and it allows you to do everything you would do in something like Python, but it's very interactive. So Colab is one example, Jupyter is another example, and those actually run not just with Python but with other languages as well.

A

In our other group, for Google Summer of Code, we've actually done programming in other languages. Python is a standard, but it's not the only language. I think we had one project in something called Kotlin, which is a newer language that runs on the Java virtual machine, and another one in Julia, which is a scientific simulation language that is up and coming. So you can see that there's this proliferation of languages, even now, alongside Python and Python-related stuff.

A

So it's always challenging to get things moving and keep on top of things, but I think the notebooks are a good way to approach the problem. I think having one or two standard languages in the research group is another good way to do it.

A

In terms of Dick's problem, we still don't really address how to make this easier for a single researcher to go in and say, I want to use Python for this problem, how do I put it together? Although, as Jesse mentioned, in Neuromatch they have these tutorials, so they work very hard to create tutorials on different exercises in Python. You might have an exercise on how to make a simulation of a cell, or just a cell body, and that could be done as a tutorial. People could get a notebook, and the notebook would show them the steps, annotated, for how to do it. They could run through it, executing the cells as they go, and it would be an easy way to really understand how the code works.

A

I didn't bring any examples with me, and I didn't know how to approach this conversation, but I'll try to get some resources on that and send them to the group.

A

So, hey, how are you, Krishna?

A

Oh, yes, the annual meeting. This last week we had the annual OpenWorm conference. Every year they have a get-together of everyone on the board and all the senior contributors, and they present on different things. I presented on some of the stuff we've been doing in this group.

A

It was very interesting, because you only get to see the board members about once a year, and there are a lot of connections there with venture capital and other areas of business. I think Stephen Larson is the person who's in charge of coordinating a lot of the OpenWorm stuff, and he's trying to get a board together that is really going to push the foundation forward. So we talked about the different projects.

A

We talked about some other business, so it was a pretty good meeting.

A

They were really interested in what's going on in DevoWorm, so I presented a number of the things that are going on here, and they were pretty impressed. Overall, it was a pretty good meeting, and you get to see people. You only meet them in Slack every once in a while, and then you see them on video conference, because OpenWorm used to have general meetings where everyone would show up, but that doesn't happen anymore.

A

So, I wanted to get back to the conversation about learning code and all that. Krishna, you've been contributing to the DevoLearn data science tutorials.

A

Can you tell us a little bit about that?

E

Yeah, generally I contribute in the field of data science. I put up tutorials and all, and two or three days back I made another folder, you can say, for important links. I've added a few of the pandas framework exercises, and I think I pushed an issue.

E

Did you see it, Bradley? The issues, yeah. I created an issue regarding creating a label for Hacktoberfest.

A

Oh yeah, I've been looking over those issues. I haven't really addressed all of them yet, but I'll get into that.

E

So that we can get Hacktoberfest contributions. Also, I shared our organization in my college, and I also put up a WhatsApp status so that we get more contributors.

E

And I have to send you the paper also. I forgot to send it. I was writing about reinforcement learning, SARSA and Expected SARSA. It's taking more time than I expected. Maybe the whole work-from-home thing has affected the efficiency of the work.

A

Yeah, it's fine. Oh yeah, let's see, Dick says Oktoberfest is, yeah. So Oktoberfest is a German holiday, like a month long, where they have beer and everything, and Hacktoberfest is a variant of that. We'll talk about that in a minute. So, Krishna, you had some contributions of data science demos. In the DevoLearn organization we've been putting together some demos for data science, so there we have some notebooks as well that show the demos.

E

It's not all about Python. I have also pushed SQL and Bash, command-line Linux and SQL, because these are the parts that people often ignore, and SQL is basic: you can say there can't be data science without databases. And now I'll be pushing R code. I was working on some genomic data analysis in R. Once I draft all of that, I'll be pushing a six- or seven-lesson structure for beginners on how to do gene analysis in the R language, because most of the time people only work in Python, and things like SQL, Bash, and R are often ignored.

E

So I wanted it to be a cocktail of all the things.

A

Not just Python focused, oh yeah. That's what we were just talking about. Dick learned Pascal at one time and did a lot of stuff in it, and then eventually the language evolved away from where he was, and that just happens.

A

I think it's happening with Python to some extent, where it's changing a lot, so you always have to keep on top of it. But I think it's good that we have tutorials for that, so that's another place to go. It's really hard to do. What was that?

B

I have a question after you're done with that.

A

Oh yeah, you can go ahead.

B

So I was looking at it again. I see the data science demos, and there's also the education repo. I'm wondering, is there a specific divide? What's something that should be in education versus data science demos, which is specifically for data-science-related things?

B

I saw there's a resources section, and I was going to put a resource in it just now. I worked on something in the past, the Learn Neuroimaging thing that I did, the whole website and the GitHub repo, which is a really great set of resources. But would that be a data science demo resource thing, or a general education thing, a resource to put in an education section?

A

It might be an education thing. Let me share my screen and show the repos, since I've been mentioning this a couple of times now. So here's DevoLearn, and this is data science demos, and this is education. In education, I think we have nothing yet. Oh, we have the journal of open source science paper, which is still going; we're trying to get it done.

A

I've had some things come up in the interim, but this is where we would put this. So this is scientific education content. This journal of open source science paper is on the DevoLearn platform, a paper that explains it and gives you further references for what's going on in that. So I think what you're talking about would probably fit into this repo.

A

Yeah, it would, because the data science demos, I think, are largely going to be quick tutorials on different topics. If I want to know how to manipulate some data, or maybe run some Python code, this is a place, okay, Dick, a place for you to post those.

A

I mean, that's what I was envisioning. These repositories always have some sort of drift, where they get too far afield and you have to reorganize them, but I think that's a good way to do it. If it's data science, if it's how do I manipulate data or analyze data, something specific to that, that's where it would go.

A

If it's more of a tutorial on something like neuroimaging, which is not really in the scope of what we are doing with data science, then it would go in education, because it's useful but a little bit outside of the main area.

B

Yeah, I don't know, there are specific things within the other resource for data stuff. But okay, maybe I'll make a resource.

E

I created the resource section mostly for, for example, important GitHub repos of other people, or if, for example, freeCodeCamp has a great video, you can share that video in the resource section. That section is basically for things that we don't own.

A

Right. So again, if we go to the data science demos, we'll see that we have networks, resources, and tutorials. So the resources are here.

A

This is for things that might be background reading, and we want to be careful with attribution there. And then this is networks. This is stuff that I think Mayukh committed from some of his work on networks from this summer. There's more to put up there, but I haven't put it up yet.

A

Tutorials: these are the straightforward code tutorials. So this one here is command line basics, ipynb. This ipynb is a notebook, and they don't render very well in GitHub. But if you download it and have an editor, you'll see that it's basically an editor with some windows, and you can put code in and run it.

A

This is a different format for a tutorial. This is Krishna's tutorial on SQL, which is Structured Query Language, and this is actually a nice way to teach code, because you have this description of what it is, and I think it's pretty accessible. Then you say, what can it do? You have a bunch of things that it can do.

A

Then you say, create a table. That's an example of something you can do, and then you have an example of doing this.

A

It's a little bit vague in terms of where you bring the data from, but it's a good way to show what it can do, just as a specific set of examples: inserting new values, updating values, deleting records. And putting memes in is kind of a way to soften the blow of learning a pretty dry subject. Then you have references down here. So I think that's a good start for demos. I think there's a lot more to do with demos.
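
The create, insert, update, and delete steps mentioned in this walkthrough can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the table name, columns, and values are hypothetical and are not taken from Krishna's tutorial.

```python
import sqlite3

# In-memory database, so the example leaves no files behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a table (hypothetical schema).
cur.execute(
    "CREATE TABLE embryos (id INTEGER PRIMARY KEY, species TEXT, cell_count INTEGER)"
)

# Insert new values.
cur.executemany(
    "INSERT INTO embryos (species, cell_count) VALUES (?, ?)",
    [("C. elegans", 4), ("C. elegans", 8), ("D. melanogaster", 16)],
)

# Update a value.
cur.execute("UPDATE embryos SET cell_count = 12 WHERE id = 2")

# Delete a record.
cur.execute("DELETE FROM embryos WHERE id = 3")
conn.commit()

# Read back what is left.
rows = cur.execute(
    "SELECT id, species, cell_count FROM embryos ORDER BY id"
).fetchall()
print(rows)
conn.close()
```

Using an in-memory database sidesteps the question of where the data comes from, which is the part the tutorial leaves vague.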

A

Dick says, "My problem was that 40 years ago I left all programming to my students." That's kind of what we've been doing with the machine learning stuff here in DevoWorm. I've been getting students who are really good at machine learning and recruiting them to work on some of this stuff for DevoLearn. I don't know if I really understand all the machine learning, so I've probably run into the same problem there.

A

It's moving so fast, though, that it's really hard to keep on top of it. But that's why we're trying to do a lot of this stuff with tutorials and everything, so that we can make it at least somewhat accessible to people. Even if you know code but you don't know the language that you need, having a tutorial there is nice. So I think that's good. I think we will maybe follow up on this discussion.

A

Maybe we'll talk about some specific problems that Dick wants to solve, and maybe we can post them in a way that recruits people to work on them, or something like that. I don't know if that's something you'd be comfortable with, Dick.

D

Yeah, sure. I sent four problems. For the first two, I guess, some people are responding to them.

A

All right, yeah, you sent me some problems by email, I think. In the context of this?

D

Modeling the motion of diatoms might be the most appropriate.

D

I can state the problem very simply. If you watch a diatom, it moves in a jerky fashion, and if you photograph it with a high-speed camera, it still moves in a jerky fashion, suggesting that the motion itself is fractal.

D

Now, what we know about fractals in biology is that at some point you get down to the molecular or atomic level, and it can't be fractal anymore. So the problem here is to take a bottom-up approach and try to simulate the motion of molecules that we think might have something to do with the motion of the diatom, and see if that motion is jerky, and then ask what parameters would allow us to match the model to observations.

D

So one hypothesis is that the diatom actually moves like a rocket ship, except that it moves like a rocket ship that was proposed, oh god, about 70 years ago: a rocket ship that drops atomic bombs behind it and uses the explosions of the atomic bombs to accelerate.

D

The idea is that what the diatom may be doing is setting off a whole series of tiny molecular explosions, and maybe if we modeled that we could see if that works.

A

Yeah, that would be good, I mean, just to get a problem where we can easily ask: how would you attach code to this problem? And that seems like a standard homework problem you might encounter in a class.

D

I generally default to what are called lattice gas models, which basically involve discretization of space and of events. There may be other approaches, continuum approaches, et cetera. But in any case, that's the basic problem: can we make a model that imitates the actual observations of the jerky motion of diatoms?
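
To make "how would you attach code to this problem" concrete, here is a toy sketch that discretizes space and events in the spirit of the approach Dick describes. It is not his model: the one-dimensional lattice, the event probability `p_event`, the step size `kick`, and the "resting fraction" measure are all assumptions made up for illustration.

```python
import random


def simulate_diatom(n_steps, p_event, kick=1, seed=0):
    """Return the lattice position of a model diatom after each time step.

    At every discrete time step a propulsion event (a hypothetical
    'molecular explosion') fires with probability p_event and moves the
    diatom forward by `kick` lattice sites; otherwise it stays put.
    """
    rng = random.Random(seed)
    position = 0
    trajectory = []
    for _ in range(n_steps):
        if rng.random() < p_event:
            position += kick
        trajectory.append(position)
    return trajectory


def resting_fraction(trajectory, frame_interval):
    """Fraction of camera frames showing no displacement.

    The trajectory is sampled every `frame_interval` time steps,
    mimicking a camera with a lower frame rate.
    """
    frames = trajectory[::frame_interval]
    steps = [b - a for a, b in zip(frames, frames[1:])]
    return sum(1 for s in steps if s == 0) / len(steps)


trajectory = simulate_diatom(n_steps=10_000, p_event=0.05)
for interval in (1, 10, 100):
    print(interval, round(resting_fraction(trajectory, interval), 3))
```

In this toy, the motion looks jerky at fine sampling and smooth at coarse sampling, so it is not fractal; comparing a measure like this against high-speed camera data is one way the parameters could be confronted with observations.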

A

Yeah, so it's an approximation problem, but you have to pose it in a certain way, like a model, and you have to put code to it.

D

Yeah, and then let's see if there are parameters that would match the observations, at least.

D

And there are things you can do, like try to make predictions: how would the diatom move against a force, things like that. So you could also try to do simulations of experiments that are plausible in the lab. There's a classical experiment, for example, going back to, I think, maybe the 1960s, done by Margaret Harper.

D

What she did is run a diatom into a very fine glass fiber, and it bent the glass, so she could measure the force that the diatom could exert. A more refined question now would be: can we predict the force versus time if the motion is jerky?

D

Things like that are ways of, in other words, trying to bring this model to a test against reality, or against alternatives.

D

One alternative is that the diatom trail, the gooey stuff that they leave behind, gets stretched and snaps, and as it snaps it's like holding on to a rubber band and letting it go: things move suddenly, very fast.

D

So this is another model, or a model might combine both and see which is more important.
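
The stretch-and-snap alternative can also be sketched as a toy stick-slip loop. Everything here is an assumption for illustration: the function name `stick_slip`, the parameters `stretch_rate` and `snap_threshold`, and the rule that the stored stretch turns into a forward jump are hypothetical, not taken from any published diatom model.

```python
def stick_slip(n_steps, stretch_rate=0.25, snap_threshold=1.0):
    """Positions of a model diatom held back by its trail until the trail snaps."""
    position = 0.0
    tension = 0.0
    positions = []
    for _ in range(n_steps):
        tension += stretch_rate        # the trail stretches a little each step
        if tension >= snap_threshold:  # the trail snaps, like a released rubber band
            position += tension        # stored stretch becomes a sudden jump
            tension = 0.0
        positions.append(position)
    return positions


positions = stick_slip(12)
print(positions)
```

With these defaults the diatom sits still for three steps and jumps on every fourth, which gives a regular jerkiness that could be compared with the random jerkiness of an event-driven model.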

A

Great. And then there's also the question of how you evaluate across models. That would be, I think, something beyond the immediate problem.

D

Well, unless you combine the models and then just try to evaluate different parameters.

D

Now, we don't know how universal this is. If you look at Bacillaria, when the cells move they look like they're moving smoothly, but nobody's looked closely enough to see if the movement is actually jerky.

D

But maybe we can get Thomas to do that.

A

Yeah, he generated a lot of good time series for the paper.

D

Yeah, but is he prepared to go to, how should I say, most cameras will run maybe 10 frames a second, whereas we ran the camera at 890 frames per second. And, as I said, even at 890 frames per second the motion is still jerky.

D

And 890 is modest. You can easily get cameras that will do 10,000 frames a second, and some will go up to a million frames a second.

D

So I don't know what Thomas has available.

D

Maybe we can inspire him to get hold of a high-speed camera.

D

And he's doing some very small colonies, just a few cells, and whether or not that motion is smooth or jerky could be determined.

D

So it's kind of fun, yeah.

A

Well, yeah, let me put that together, and we'll talk about that more next week.

E

Sure. I'm not from biology, so my question can be quite naive, but do these collide?

D

Yes, of course.

E

So is there any change in their structure, or something like that?

D

No, because the structure is that the cell is inside a silica shell, and silica at that scale is pretty rigid. It's like glass.

D

Yeah, so you don't expect any significant change of shape when they collide. There are experiments where they collide with the beam of.

D

The observations of jerky motion have only been made on single diatoms which are isolated from other diatoms.

D

No questions, okay. I have no question mark, I should say.

A

Yeah, that sounds good.

A

So thanks for that. We'll talk more about that in the coming weeks. It's a good conversation to keep going and see what we can do. Now I wanted to talk a little bit about Hacktoberfest. I know the main people involved in that aren't here today, but yeah.

A

So there's this thing called Hacktoberfest. Let's see, I have the image right here. Oh, let me share my screen, I didn't do that yet. So there's this event called Hacktoberfest. Like I said, it runs through the month of October and GitHub hosts it. It's a play on Oktoberfest, of course, which is a German holiday where they celebrate the entire month. So this is Hacktoberfest.

A

This is for DevoLearn, or for all of DevoWorm, but we're focusing on DevoLearn for this event. The idea is that, throughout the month of October, people can go to the GitHub repository and make commits to the repo. It could be anything that they want to do: add in some text, add in a tutorial, make changes to some code. Then, if they get five or more commits, they get something free from GitHub.

A

It's usually a t-shirt, and it has the Hacktoberfest name and GitHub on it, and I guess it motivates people to do some commits. So this is DevoLearn, and this is where the action is occurring for Hacktoberfest. It started on the first, and we've already had, I think, about 12 people contribute, mainly to DevoLearn, here in this repo.

A

I think this was driven by Mayukh and Ujjwal, who I think talked to people they knew at their school, and they started to make commits to the repo. So we have a bunch of commits. I don't know if you can see it under Actions. No, pull requests. We've had probably about 12, or maybe 10, pull requests in the last couple of days. So it started on the first and it's going to continue through the month, and we've already had a pretty good amount of interest in it.

A

So one of the things you can do if you want to participate in Hacktoberfest is go to the issues in any of the DevoLearn repositories. In this case, DevoLearn has a good number of issues to look at, and you look at these issues that have been generated. I think Mayukh is mostly generating these in the DevoLearn part of DevoLearn. I'm not sure if Ujjwal has done the same for C. elegans DevoLearn. There's one issue open here.

A

This is add contact info. So this is something that I don't know. If it's it's a question, but basically you know people will go through these issues. They'll pick an issue and then they'll address it though, and what what you do is you uh you make a fork of this repository and then, which is basically just your own version. You would go to fork and it would create it.

A

The over your own version of that thing on your github account and then you would make the changes and then you would submit them back to the main repository and so that's called a pull request.

A

So you're you're, taking the a copy of what what is on there right now, you're pulling it down to your own account you're, making the changes that you want and then you're sending it back for approval to the main account.

A

And then, if it's good and a lot of these are pretty good these contributions, then we merge them, and so here here's where the poll requests come in people will make them they'll appear here and then we can review them.

A

Mayukh and I have been involved in this; we're reviewing these changes, basically just to make sure that they're not malicious or junk, which can happen. A lot of times there will be errors that contributors make, or things they need to add, that you can suggest before their pull request is accepted. That happened, I think, twice in this batch of pull requests.

A

So it's a nice process for managing version control. We have files that we are constantly making changes to, and GitHub is designed so that you can make those changes without overwriting things while still tracking all the changes. It's very effective.

A

That's why we do a lot of things on GitHub. So that's basically how a pull request works. If you want to contribute to this, you're welcome to do so.

A

This is just something that GitHub puts on as a way to encourage participation in open source, so hopefully we get a lot more of these. There was also, I think, an issue we had to solve about verifying the repo for participation in Hacktoberfest, but I think that's been solved now.

A

I think if people contribute something like five pull requests during the month, they get a t-shirt. But I'm also going to try to recognize people, and this is our list here: I'm making a list of contributors who have made pull requests.

A

So we have a bunch of user IDs here. These are all people who issued pull requests, and this is the status of each pull request; they've all been accepted.

A

A couple of people have made two pull requests already, and they're just little changes, I think. I'm also making a list of further interactions, so Krishna's on here. He made a pull request.

A

I think it was before our Hacktoberfest, but I put him on the list, and I put melvin m on the list because she contributed something to the tutorials. So I'm just making a list of people who are contributing to devolearn and some of the other repositories in DevoLearn, and following up to see what they're doing.

A

Mal vm has joined our Slack channel and Krishna's, of course, joining the lab meetings. So I just want to see how this is pulling people into the group and where they're going: whether they're sticking with it or just casually making a pull request.

A

So that's a good census tool, and I think it'll drive interest in the group as a whole. Hopefully some of these people will join Slack, or maybe we can have some discussion on GitHub about things, if that's where they're living and checking in. Maybe we can get more people involved in some of these problems that we're posing in the meetings.

A

But it's a good thing to know, and I think it's a good thing to get people involved in this. So that's all I have to say about Hacktoberfest.

A

This is the intro page in our DevoWorm GitHub, so if you want to see the basic description of what it is, here it is. Again, it's simply this: you check the issues, you try to solve the issues, and you get connected to the community.

A

If you don't really have facility with GitHub, and some people don't, especially if they're coming from the biological side, because it is more of a computer-science-centric tool, then you can contact people or join the Slack channel, and that's another way into the community.

A

I don't want to exclude people who don't have much to say in terms of GitHub commits, because that is, again, very code-centric.

A

So if we have people who want to contribute in other ways, that wouldn't really be part of the Hacktoberfest system, but we'd still be interested in hearing what you have to say. So that's a good overview, I think, of that.

A

Are there any questions about Hacktoberfest or any of the contributions?

B

I had a general question. I just fully signed up for Hacktoberfest, right.

A

B

And they do offer these things that are like events, and I wasn't sure if that's something that was in the works for DevoLearn or other, like, orthogonal.

C

B

I'm not sure what an event entails, or if it's just like a presentation exactly, but.

A

Oh, I'm not sure either. I think some of the organizations might have events surrounding Hacktoberfest, like get-togethers. It's just a way to encourage people to participate, so they might have meetings or open events where people can come and interact at a little bit higher level than GitHub commits. I'm not really sure; you'd have to show me an example, though.

B

Yeah, it looks basically similar to an Eventbrite page.

B

You can kind of make a under the.

B

I don't know if that was something that other people in DevoLearn were trying to do or not.

B

It's not something.

A

To talk about now.

A

Well, maybe like later in the month, we could certainly try something if we wanted to.

B

A

Yeah, I mean, some organizations do it every year and they're prepared for it. We'll see; if there's some interest in it, I might send out a message to people who've contributed and say: would you like to come in?

A

We might have a meeting where we discuss their contributions and what they might follow up on. It might be more involved projects, things that aren't just isolated commits, so that they can get a little bit more exposure to DevoLearn or DevoWorm. That would be good, but we'll talk about that later.

A

So the last thing I wanted to talk about today is this presentation that I wanted to give. If you have to leave at the top of the hour, that's fine, but I just wanted to go through this. This is something I've been trying to wrap my head around for a while now: how to understand multi-dimensional analysis in developmental biology. So what I have here are three different methods.

A

One, which I think everyone's heard of, is PCA, or principal component analysis. This is where you're taking a bunch of data and trying to find the axes of variance.

A

But then there are two other methods that have been more recently applied, mainly to molecular data: t-SNE and UMAP. These are probably things you haven't heard of so much, but if you read any sort of molecular biology paper, in development or in other areas of biology, you will encounter these methods, so I wanted to demystify some of them. Again, I'll make these slides available afterwards and it'll be recorded, so if you can't make the whole talk, you can go back and check it out.

A

So principal component analysis is an exploratory data analysis technique, and people have used it for many years, many decades. It's a tool that's been developed around just regular data. It could be any type of data; it could be cell measurements.

A

It could be any type of thing; it could be social science data. What principal component analysis basically does is take the matrix of all your data points and find a series of directional vectors. The vectors that come out of this are the ones that best describe the data while being orthogonal to all other vectors.

A

So what does that mean? That's kind of a weird way to put it, but basically what it does is it takes a bunch of data, maybe in a bivariate relationship.

A

Then it centers the points, computes a covariance matrix, and calculates eigenvectors and eigenvalues from that. So it does this transformation into eigenspace.

A

Then there's this step where you're picking the eigenvectors with the highest eigenvalues and projecting the data points onto these eigenvectors. So you're taking the data, extracting information about their core variance, then taking that model of the variance and mapping the data points back onto it, and you get a map of those data and how they're distributed. This is from a Hacker Noon tutorial, but there are books on this. I just wanted to give a couple of blog posts that maybe are more accessible.
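The steps just described (center, covariance matrix, eigenvectors and eigenvalues, projection) can be sketched for the two-variable case in plain Python. This is a minimal illustration, not the tutorial's code: the function name `pca_2d` and the sample measurements are made up for the example, and the closed-form 2x2 eigen-decomposition stands in for the general routine a numerical library would provide.

```python
import math

def pca_2d(points):
    """PCA on 2-D points: center, covariance, eigen-decomposition, projection."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n

    # Step 1: center the points on the mean.
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Step 2: sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Step 3: eigenvalues of a symmetric 2x2 matrix in closed form,
    # sorted from largest to smallest.
    half_trace = (sxx + syy) / 2
    radius = math.hypot((sxx - syy) / 2, sxy)
    eigenvalues = [half_trace + radius, half_trace - radius]

    # Direction of the first principal axis; the second is perpendicular.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(theta), math.sin(theta))
    pc2 = (-math.sin(theta), math.cos(theta))

    # Step 4: project each centered point onto the two eigenvectors.
    scores = [(x * pc1[0] + y * pc1[1], x * pc2[0] + y * pc2[1])
              for x, y in centered]
    return eigenvalues, (pc1, pc2), scores


if __name__ == "__main__":
    # Hypothetical measurements (e.g. length and width of five cells).
    data = [(2.0, 1.9), (3.1, 3.0), (4.2, 3.8), (5.0, 5.2), (6.1, 5.9)]
    eigenvalues, axes, scores = pca_2d(data)
    explained = eigenvalues[0] / sum(eigenvalues)
    print(f"PC1 direction: {axes[0]}, variance explained: {explained:.3f}")
```

The ratio of the first eigenvalue to the sum of eigenvalues is the share of variance captured by the first principal component, which is what a two-axis PCA plot is summarizing.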

A

So this is an example of what people usually do. They'll take their data, analyze it using PCA, get maybe the top two dimensions, or top two principal components, and plot them in a bivariate graph. It'll show the scatter of the data, but also these groups, and the clusters are basically organized along some axis of variance. They're not really clusters like you'd get from a clustering algorithm.

A

They don't exactly match up in that way, but they give you some indication of what the variance looks like.

A

So this is an example of taking a picture of a flower, or morphological data from a flower, and creating a PCA biplot, which shows the top two axes of variance. You can define different features in the data by looking at the different vectors: this vector is sepal width, this is sepal length, this is petal width, and this is petal length. You can define these clusters, and then you can define them by species, and the PCA gives you an idea.

A

It's looking at the shared variance and the variance between groups. These species basically fall in separate clusters: the setosa species is very different from the versicolor species, and virginica overlaps with the versicolor species. If you were to just kind of eyeball it in nature, you may or may not see that distinction, but this puts some numbers on it.

A

So again, this is how PCA works. This is a Nature Methods paper on the method. Actually, this one is just kind of a tutorial on PCA and how it works.

A

So this is: PCA geometrically projects data onto a lower-dimensional space. You take all the dimensionality of your data set and map it to a series of dimensions defined by the vectors that I mentioned before. They are made as orthogonal as possible, so that you're getting some sort of independent set of variables that you can compare to one another.

A

So that's the first step. This is, of course, where you can help to identify clusters in the data. If you compare it with a hierarchical linkage analysis, or clustering analysis, which basically forms a tree with nested sets in it, you can see that the PCA matches up with the cluster analysis. But it's not an absolute one-to-one map. The PCA will reveal features that the cluster analysis doesn't, and vice versa.

A

So it's not exactly like a clustering analysis, and there are some limitations on PCA. One of them is that it may miss non-linear patterns in the data. Another is that it misses non-orthogonal patterns in the data. So if the data is well organized and highly structured, it will pull out these clusters. But if it's not, if there are a lot of interactions in the data or there is unexplained variance, it's going to be harder to find a good, interpretable set of images. And that's interesting, that we talk about an interpretable set of images, because we don't think of it that way; we think of it as a hardcore quantitative method. That brings us, of course, to t-SNE, which is t-distributed stochastic neighbor embedding. That's a lot of words for something that is really just a dimensionality reduction technique. Like I said, you're interpreting these sets of points, and you'll see with t-SNE that that's actually more true, even though this is a more advanced method than PCA.

A

Geoff Hinton and company actually published a paper on this in 2008, "Visualizing Data using t-SNE". This is a machine learning sort of approach, and so in that sense it's more advanced than PCA. But it does resemble simpler dimensionality reduction techniques: there's a technique called multi-dimensional scaling, which it's closely related to, and that's actually a very simple method. This was the state of the art maybe 10 years ago. So for this t-SNE algorithm, there are two steps.

A

The first is to construct a probability distribution for high-dimensional objects. As with PCA, you start from the high dimensionality of your data, but you construct a probability distribution for it. In this probability distribution, objects that are similar in some way have a higher probability of occurring, and dissimilar objects have a lower probability.

A

So you construct this probability distribution, and then you construct a separate probability distribution for objects on a low-dimensional map. You have the real data, and you're constructing a probability distribution on that.

A

But then you also have this low-dimensional map that you want to map these data to, and you're constructing a probability distribution on that as well. The probability distribution for the low-dimensional map is based on minimizing what they call the KL divergence for each point. KL divergence is the Kullback-Leibler divergence.

A

That's the longhand, and it basically computes distances or divergence between two separate methods or two separate trajectories. So you're trying to approximate the high-dimensional data and map it to a lower-dimensional map, but that lower-dimensional map has its own distribution. You're trying to match the two distributions: you minimize the divergence between a point in the high-dimensional probability distribution and in the low-dimensional probability distribution, so that the map is as close as it possibly can be.
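To make the quantity concrete: for two discrete probability distributions P and Q over the same items, the Kullback-Leibler divergence is the sum of p * log(p / q). The sketch below is a generic definition in plain Python, not code from any t-SNE implementation; the function name and the example numbers are made up.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions over the same items.

    Terms with p == 0 contribute nothing; q must be positive wherever p is.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    # Hypothetical neighbor probabilities for one point: "high" plays the
    # role of the high-dimensional distribution, the others are two
    # candidate low-dimensional maps.
    high = [0.70, 0.20, 0.10]
    good_map = [0.65, 0.25, 0.10]
    poor_map = [0.10, 0.20, 0.70]
    print(kl_divergence(high, good_map), kl_divergence(high, poor_map))
```

The divergence is zero only when the two distributions are identical, and it is not symmetric: swapping P and Q generally gives a different value, which is one reason it is called a divergence rather than a distance.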

A

So this produces clusters, like PCA does, and it's a similar method in terms of mapping things back down to a lower set of distributions. It does produce clusters, but the cluster validity is a little tricky. It requires a lot of visualization and interpretation. KL minimization is done using something called gradient descent, which is an optimization method that is very common in machine learning, but that still doesn't solve a lot of your problems of validating the clusters.

A

So this is the MNIST data set using t-SNE. MNIST is a data set used in machine learning that has handwritten numbers from zero to nine, in different handwriting: different people doing different swirls and things with their handwriting. So the fours look very different across the different samples, and the idea is to be able to identify all of these as a four in this row of fours. The machine should be invariant to all the variation in the fours and should always identify a four correctly.

A

Now, if you plug these data into t-SNE, you actually find that you can take data that you predict as being a three and put it into a cluster that's defined as the cluster for three. So it does a pretty good job on this MNIST data set; it classifies those instances correctly. All these light blue instances are threes that are put in the category of three in this map, and the threes cluster together. Now you can see there are a couple of reds here, which are ones, I think.

A

Oh no, eights! So you see the eights are being misclassified as fives and threes a little bit, so it's a little messy, but the idea is that the proper classification should be in one of these clusters. You can see it looks very much like some sort of postmodern art thing, almost like a Jackson Pollock, and this is the problem with interpreting this: it creates a very pretty image, but it's hard to interpret. The basic rule of t-SNE is that points within clusters are meant to be similar in some way, even if they're misclassified. But take these two clusters, the one at the top here and the one over here: it's hard to say whether these are less similar than this cluster.

A

So basically, if you're looking within a cluster, it's easy to say that the points are similar, and that's an artifact of the KL divergence approach that they use. Between clusters it's harder to make that distinction, and that's an artifact of the search algorithm and the way they implement it. There's another study that uses t-SNE where they took a bunch of data sets of embryo and brain tissue.

A

There are a lot of things from development here, and these different molecular sequencing techniques, next-gen sequencing techniques. That's not really as important as the differences between the data sets.

A

So this is a meta-analysis of all these data sets, and what they're going to do is map them to a t-SNE model and see if they can identify different cell types. First of all, they tried all these different methods on the data and got some performance statistics on this one data set, Tasic et al. So I think Tasic et al. is what we're going to focus on here.

A

This is adult mouse cortex using Smart-seq2. You're looking for a bunch of classes here, and you have this thing, perplexity, and random initialization. Perplexity is the main parameter that you're looking for in t-SNE. I think it might be similar to convolution; I'm not really sure what it is, but that's the key parameter, and it's a way of algorithmically sorting the data out and trying to find a good mapping between high- and low-dimensional space. As you can see, t-SNE produces a lot of clusters here. Whether the clusters mean anything is a different issue, and so one of the things we can do is look at out-of-sample mapping. In this example, we align two data sets.

A

We have a reference t-SNE atlas and a data set of interneurons, and we're able to map those interneurons onto this reference atlas fairly successfully. t-SNE actually has a number of parameters. Perplexity is the measure of global structure. Remember, I told you that with these individual clusters you have a very good measure of similarity within clusters, but between clusters we really don't know what's going on. Well, if we set perplexity at a different level, we can refine this global structure. It's like a top-down way of saying we want to look at how many clusters we have, or how similar the clusters will be, but it's not really a fine level of control; it's a general measure of global structure. It turns out t-SNE is not very sensitive in this respect, and there's another method, UMAP, which is much more sensitive, and we'll talk about that next. Perplexity is the main free parameter in t-SNE, so this is the one parameter you can play with to improve your results, but there are others, such as learning rate and number of iterations, which you can also play with.
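For reference, the original t-SNE formulation defines perplexity as 2 raised to the Shannon entropy (in bits) of each point's conditional neighbor distribution, which is often read as an "effective number of neighbors". The sketch below illustrates that definition in plain Python. The function names and the squared distances are made up for the example, and real implementations search for the Gaussian bandwidth (sigma) that hits the user's target perplexity rather than taking sigma as an input.

```python
import math

def conditional_probs(sq_dists, sigma):
    """Gaussian similarities from one point to its neighbors, normalized to sum to 1."""
    weights = [math.exp(-d / (2 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """2 ** (Shannon entropy in bits) of a discrete distribution."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy


if __name__ == "__main__":
    # Hypothetical squared distances from one cell to four other cells.
    sq_dists = [1.0, 4.0, 9.0, 16.0]
    for sigma in (0.5, 1.0, 5.0):
        probs = conditional_probs(sq_dists, sigma)
        print(f"sigma={sigma}: perplexity={perplexity(probs):.2f}")
```

A narrow bandwidth concentrates the probability on the closest neighbor (perplexity near 1), while a wide bandwidth spreads it evenly (perplexity near the number of neighbors), which is why the setting changes how much surrounding structure each point takes into account.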

A

So that was about 10 years ago. In the last couple of years we've come up with UMAP: uniform manifold approximation and projection for dimensionality reduction. That's a long set of words, but basically it's an improvement on t-SNE. It works in a similar way, but it's better, apparently, so people are using it more and more now and you'll see it in papers. I'll show you at the end that there's an atlas that uses it and involves C. elegans developmental data. This paper, McInnes, Healy, and Melville, lays out the method.

A

So this is sort of the seed paper for the entire enterprise, and then this is a blog post on what it is and why exactly it is better than t-SNE, which explains the pros and cons of the two methods.

A

So this pretty picture here is what you get with UMAP. If you compare UMAP and t-SNE, we know that t-SNE does not scale well, especially for things like single-cell sequencing analysis, so snRNA-seq or other methods. It really doesn't do a very good job of scaling from small to large data sets, and your average next-gen data set is very large, so we want a method that's robust to that size of data. We also know that t-SNE does not preserve global structure; it only preserves structure within the clusters. So if we look within clusters, t-SNE does very well; between clusters, it doesn't really tell you a lot.

A

t-SNE performs non-parametric mappings between higher and lower dimensions and does not rely on features. PCA has something called loadings, which are features that you can use, but t-SNE doesn't have that. Then there's sparsity, which means fragmented clusters. They call them manifolds, but the manifolds just have unconnected clusters, and that's what they call sparsity. So if you have a bunch of clusters that are very far apart, that's also bad, because it's sparse: it tells you that they're clusters, but it doesn't tell you how they're related. So to avoid sparsity, high-dimensional data are first handled by an autoencoder or PCA. So UMAP is actually bootstrapped by some sort of pre-processing technique.

A

So, as t-SNE only preserves local structure, it resembles the cost function for multi-dimensional scaling, which is a much simpler method.

A

There's this lack of global distance preservation in t-SNE. To solve this, UMAP actually uses a stochastic gradient descent model. So they use gradient descent, but in a way that's very different from what t-SNE is doing, which is using it as a way to optimize KL divergence and other things.

B

A

But stochastic gradient descent, I guess, is probably better, and they go into why it's better in the article. UMAP also uses an exponential probability density function for high-dimensional data.

A

So if we use a normal distribution, we may miss a lot of the outliers, or a lot of the things that are out of distribution. If you use an exponential probability distribution in the high-dimensional case, you pick up a lot of that variation that you would miss by assuming that everything fits into normally distributed categories.

A

This is, as we know, high dimensional data.

A

There's a lot of potential for interactions. The distance metric that's used in UMAP varies across the manifold, or space, and the nearest-neighbor graphs result from fuzzy simplicial sets, which is a form of topological data analysis. We've talked about that in this group before, and it's an interesting wrinkle here, because we're bringing in a new, different, and exciting method that might solve some of the problems that we've had with t-SNE. UMAP also uses a number of nearest neighbors instead of a perplexity measure.

A

So perplexity is similar to an information-theoretic measure, but UMAP uses the number of nearest neighbors as a proxy, or as a high-level parameter, and the parameter d min demonstrates a set of uniformly connected points. Using this nearest-neighbors approach instead of something like perplexity leads to tightly packed clusters.

A

So you get this denser local information, and then they use binary cross-entropy as a cost function instead of KL divergence. I think there are some problems with KL divergence in terms of its rigor and completeness for what you're looking at here, so binary cross-entropy is a better choice.
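One way to see the difference between the two cost functions is to compare what a single pair of points contributes under each. The sketch below is a simplified per-pair illustration in plain Python, not UMAP's actual implementation; the function names are made up, `p` stands for the pair's similarity in the high-dimensional data, and `q` for its similarity on the low-dimensional map (both assumed to lie strictly between 0 and 1).

```python
import math

def kl_term(p, q):
    """Contribution of one pair to a KL-divergence cost: p * log(p / q)."""
    return p * math.log(p / q)

def cross_entropy_term(p, q):
    """Contribution of one pair to a binary cross-entropy cost.

    The first part penalizes similar points placed far apart; the second
    part penalizes dissimilar points placed close together.
    """
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


if __name__ == "__main__":
    # A pair that is dissimilar in the data (p small) but ends up
    # close together on the map (q large).
    p, q = 0.01, 0.9
    print("KL term:           ", kl_term(p, q))
    print("cross-entropy term:", cross_entropy_term(p, q))
```

For a dissimilar pair drawn close together, the KL term is nearly zero, so that kind of mistake goes almost unpenalized, while the cross-entropy term is large because of its second part. That extra term is commonly given as the reason the cross-entropy cost keeps more of the between-cluster structure.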

A

Finally, there's this tool that is an interactive atlas of single-cell sequencing data for C. elegans neurons in development. I think it is just for development. So this is "A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution", and this is a Science paper that came out last year. This is the visual of this VisCello tool.

A

You can see it on GitHub, and there's also a Shiny app that allows you to generate these plots directly. If you go to the Shiny app, you input your parameters and it produces these clusters, which look sort of like the t-SNE clusters, but they're much denser and there's more global structure. So you have a bunch of cell types and they're color-coded, so you can see that they cluster within their groups, and basically this tells you something about how different they are from one another. The ciliated amphid neuron, for example, fits into a couple of clusters, and that tells you something in relation to some other cell types. This is based on all of the single-cell data that they have for the cell: they have a bunch of sequencing data for the cell, it defines this cell state, and this is a summary of that in a low-dimensional space. So we have this two-dimensional space where we have these dots for the ciliated amphid neurons, comparing them to other types of neuron, ciliated non-amphid neurons, for example. These gray dots are just undefined identities, because we don't have all the cell types in this plot. So it does allow you to compare with many other types of cells, and, for example, developmental cells born at 250 minutes seem to contribute to all future cell types and tissues.

A

So if we think about this in terms of the lineage tree, the developmental cells that are born at 250 minutes are responsible for all this variation.

A

You get all sorts of neurons and glia that come out of this.

A

So that's just to set this up. You get these axes like you do in PCA. For this analysis, V1 and V2, these two axes, represent the dimensions of a two-dimensional projection of the source data. So this is just a two-dimensional map of all this variation, and distances on the manifold are defined by single-cell transcriptional profiling of all cells. So the distance between clusters, for example, is a product of this mass profiling of single cells, and basically a summary of that distance.

A

And so the question, then, with any of these methods is: can you accurately capture that? Some people say that with UMAP there's a pretty good chance that you can capture it, but there are probably new methods that can be developed that would make this even easier.

A

So I apologize for going over. Oh, I see Dick had to leave, but that's okay. I've been trying to figure out how to present that for a while, and I just wanted to give you a primer. Like I said, this is something that you'll probably encounter if you read developmental biology papers with molecular data in them. Don't be scared off by the figures and the techniques. This, I think, is a nice primer for that.

A

But if we have other questions about it, we can follow up on them. So, any questions?

B

No questions, but I appreciate you putting that overview together.

E

B

E

Thanks for sharing.

A

Yeah, no problem, I'll share the slides later. So thanks for attending, and happy Hacktoberfest. We'll talk offline on Slack, and next week we'll have another meeting. Maybe we'll follow up on the programming issues, like how to solve a problem, and hopefully we'll have more Hacktoberfest activity. All right, so christian, I look forward to seeing your paper.

E

Yeah I'll I'll just send you. uh I have.

C

A

Send you one more thing, also I'll send you.

E

A

Okay, sounds good. Okay, have a good week, everyone. Take care.

E