DevoWorm Lab Meetings, 24 Jul 2023

From YouTube: DevoWorm (2023, Meeting #28): DevoLearn/DevoGraph, Genomics of Differentiation II, Recent Papers

Description

GSoC updates (DevoLearn and DevoGraph). MedSAM, Persistent Homology, and TOGL. Genomics of Differentiation: protein alignments for C. elegans, Global vs. Mosaic Methylation and Control of Cell Differentiation. Differentiation codes and associated analysis. Attendees: Sushmanth Reddy Mereddy, Himanshu Chougule, Bradly Alicea, Susan Crawford-Young, Richard Gordon, Jyothi Swaroop, and Lukas.

B

Himanshu, hello, everyone. Well, we had quite a discussion about this paper on the genomics of differentiation. I think it's great. I know Lukas was working on it quite a bit this weekend, and we're going to go over a lot of that. I did a deep dive into the literature on C. elegans differentiation; it's quite interesting, actually. So we'll keep working on that.

B

I don't know if Lukas is going to be here today, but he sent me a version of the paper. And then, of course, we have our GSoC students, and I also got the document from Sushmanth on MicroSAM. I have that in my tabs, or he can share his screen as well, just to go over that. And Himanshu's here.

B

So let's get started. Who wants to go first with an update?

C

I hope you're fine. I started writing about the microscopy Segment Anything Model, so MicroSAM. In that paper I have just given a gentle introduction to using the Segment Anything Model for cell image segmentation.

B

Could you go with your update while he's setting up?

D

Okay, so this weekend I was basically working on DevoGraph, trying to set up the persistent homology, the topological data analysis part, and to get the same results as before, because there was some issue in...

A

Just one second.

A

The repository, let's share it over here.

D

Thank you. Okay, so in the requirements.txt: basically, this is out of date, different from what is actually working right now. Since the project was about a year ago, it didn't specify the versions that could be used, so I had to figure out what the exact version for each was. I'll send an update to xiaang and tell him about this. The kit they were using has been updated as well: it has gone to version 1.0, and PyTorch has also gone to version 2.0, so the interchangeable parts of it are...

D

But then I got it to work, and stage two is now working on my PC as well, and I got the same results as before.
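The version problem described above can be sketched as a small check that compares pinned versions in a requirements file against what is actually installed. This is only an illustration under stated assumptions: the package names and version numbers are hypothetical placeholders, not the actual DevoGraph requirements, and `parse_pins` and `find_mismatches` are names invented for this sketch.

```python
import re

def parse_pins(requirements_text):
    """Return {package: version} for lines of the form 'name==version'."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        match = re.fullmatch(r"([A-Za-z0-9_.\-]+)==([A-Za-z0-9_.\-+!]+)", line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins

def find_mismatches(pins, installed):
    """List (package, pinned, found) where the installed version differs or is missing."""
    mismatches = []
    for name, wanted in sorted(pins.items()):
        found = installed.get(name)
        if found != wanted:
            mismatches.append((name, wanted, found))
    return mismatches

# Hypothetical example data (not the real project's requirements)
EXAMPLE_REQUIREMENTS = """
# pinned a year ago
examplegraphlib==0.9.1
exampletensorlib==1.13.1
numpy
"""
EXAMPLE_INSTALLED = {"examplegraphlib": "1.0.0", "exampletensorlib": "1.13.1"}

for name, wanted, found in find_mismatches(parse_pins(EXAMPLE_REQUIREMENTS), EXAMPLE_INSTALLED):
    print(f"{name}: pinned {wanted}, installed {found}")
```

In a real environment the `installed` mapping could be filled in with `importlib.metadata.version(name)` from the standard library. Unpinned lines such as `numpy` are skipped here, which is exactly the case where the working version has to be found by hand.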

D

So my second stage, as of now, is to work through some papers, like this topological graph neural networks paper. I've been looking at the GitHub repository for a while, trying to figure out how exactly they used it and how we can try to get similar results for our task, that is, cell tracking. They have done it on classification tasks, like node and edge classification.

D

So, basically, what they have used is something called the torch persistent homology repository, which is different from what most people have been using, that is, the tricer tda1. Right now I'm just figuring out how to get into GNN and TDA, and once I get a small example ready for our data set, I'll move on to the cell tracking part. I'll try to solve the problems that Jiang had

D

also mentioned in the previous meetings, like, for the cell data there is some kind of rotational invariance or differentiability issue, and also the other.

D

The other one was that this paper only considers static graphs and our graphs are dynamic, so we'll have to figure that out; that is one of the challenges as well. So right now I'm doing a literature review on that, and once I get an example, I'll move on to the next step. Okay.

D

My work this week would be to get a working example ready for TOGL, but on static graphs.

B

What would that look like?

D

Like the previous paper: basically, we'll have the input data, which will be processed, and in the temporal graph section we'll also have a static graph, from which we'll extract topological features. Then, along with that, the GNN; with the help of that, we can get the output.

D

Which is what these guys have done as well.
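The step of extracting topological features from a static graph can be illustrated with a minimal sketch of 0-dimensional persistent homology under a vertex filtration, computed with union-find. This is not the TOGL implementation and not the torch persistent homology code mentioned above: TOGL learns its filtration values, whereas the values below are fixed, hypothetical numbers, and `zero_dim_persistence` is a name made up for this example.

```python
import math

def zero_dim_persistence(vertex_values, edges):
    """Return sorted (birth, death) pairs for connected components of a graph
    under a sublevel-set vertex filtration. An edge enters the filtration at
    the larger of its two endpoint values."""
    n = len(vertex_values)
    parent = list(range(n))
    birth = list(vertex_values)  # birth[root] = smallest value in that component

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    def edge_time(edge):
        return max(vertex_values[edge[0]], vertex_values[edge[1]])

    pairs = []
    for u, v in sorted(edges, key=edge_time):
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            continue  # edge closes a cycle, no component dies
        if birth[root_u] > birth[root_v]:
            root_u, root_v = root_v, root_u  # root_u is now the older component
        # Elder rule: the younger component dies when the edge appears.
        pairs.append((birth[root_v], edge_time((u, v))))
        parent[root_v] = root_u

    # Components that never merge persist forever.
    roots = {find(v) for v in range(n)}
    pairs.extend((birth[r], math.inf) for r in roots)
    return sorted(pairs)

# Hypothetical path graph 0-1-2-3 with made-up filtration values
EXAMPLE_VALUES = [0.0, 2.0, 1.0, 3.0]
EXAMPLE_EDGES = [(0, 1), (1, 2), (2, 3)]
print(zero_dim_persistence(EXAMPLE_VALUES, EXAMPLE_EDGES))
```

The resulting (birth, death) pairs are the kind of per-graph topological summary that could be passed to a GNN alongside node features. Handling dynamic graphs, the open challenge raised in the discussion, is not addressed by this sketch.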

B

Okay, and this paper... what is this paper?

D

This is... okay, so yeah, I meant to send this paper as well, and I also found a different paper that is from this year. I'll read up on that, and if it's useful I'll send that as well.

B

All right that sounds great.

C

They were never integrated. Stage one was running on a different Python version, and stage two was... and whatever CSV we are expecting from stage one, that...

C

I mean, they have never been integrated. You can run stage one separately and you can run stage two separately if you want, but combined, they have never been integrated, and that is the hard part, because stage two needs cell lineage analysis. That is the main reason: with DevoLearn, we are not able to create the CSV file with cell centroids and cell lineage analysis.

C

So that's what the issue is.

D

And versions, and all of that. Also, we started to update the DevoLearn repository as well, to change the functions to use the SciPy library functions for image processing and everything, and the dependencies were updated too. So I had to use a different DevoLearn version to get my setup ready for stage two, so right now...

D

uh So as of now I'm using uh water rules for Prof traveler, and with that my stage, one is working.

C

Yeah, there are so many active changes and deprecated libraries. So right now we need to update to accessories. We will go through all of that.

B

Right, so what's the plan for this coming week?

D

This week I'll be doing the example for the graph on TOGL.

A

D

And then I'll move on to the next one for dynamic graphs.

B

All right, that sounds great. Thanks for the update; glad to see your computer is back up and running. Yep, I've had trouble with that, but

A

B

this is good. So, Sushmanth, are you ready?

A

C

C

I think there is a problem with my laptop and it's not sharing.

B

All right, so this is the MicroSAM document. It's basically an overview of some of his proposed work on microscopy image segmentation with the Segment Anything Model, and we're trying to build a pipeline of papers here. So this is the first one, I guess, where he's proposing that we have this version of SAM called MicroSAM that can work on microscopy images specifically. It's just using the SAM methods from Meta and applying them to this specific problem.

B

Okay, he's still working on getting up and running here. Any questions?

B

There he is okay,.

C

B

C

Yeah, until these last few days I had a problem with GPUs for training, but with compute units on Google Colab I started training the model, and at the same time I was writing the whole paper for microscopy images of cells. I actually shared the doc about it. I wrote an abstract on what it is doing. After that I am giving a general introduction with the corresponding images. I am comparing our model with U-Net segmentation. That's the idea I have in mind. I will be writing the paper about cell segmentation using zero-shot segmentation, and at the same time I'm trying another model called DevoNet.

C

Can you see my screen.

B

Okay, yeah, but I just see the Gypsy window now.

B

Could you share the tab?

C

C

A

C

Now I have bought the upgraded version of Colab and I'm training the model. There are some small bugs I need to work out. Actually, the bug is just a shape error in the loss function. It is small; maybe by tomorrow I'll sort it out and the model will work fine. But the main problem would be writing this paper and explaining every parameter needed for it. The code is almost ready; we just need to train the model and fine-tune it with the correct hyperparameters.

C

This whole past week I couldn't work much because I had a fever, but this week I will try to finish training the model and to complete the paper about what is happening with our model, how it is implemented, and what its accuracy is compared to other models. I will be making changes to this doc throughout.

C

Please take a look at the end of the week and let me know what changes I need to make. After this paper, I'll start working on the table net, which I proposed in my GSoC proposal. I will train that and write a separate paper on it as well. And for DevoLearn, Jyothi actually attended today's meeting for DevoLearn only. He started reading the docs which you shared, and he will start writing papers about DevoLearn, maybe from next week onwards.

C

He will also give updates about the paper, etc.

A

C

Yeah, I hope the vocabulary is good in the paper. I tried to just write it basically.

B

Yeah, it's always hard to know, because there's so much jargon in that field: whether you're using it correctly, or whether this is meant for a technical audience or a general audience.

C

Yeah, I want to publish it as a conference paper or something like that. I don't really know how these papers work, but I want to show my work in a paper or some kind of journal paper, because we are building tools which someone can use for cell segmentation. Okay.

A

C

Yeah, it could be like MedSAM, which I showed last week, which they have published on arXiv. I think, like that, since we are working on microscopy cell images, we can show that we are working on this.

A

C

I will try to make drafts of all the papers and share them with you. After reading them, you decide whether it is up to the mark for a journal paper or a conference, or just for publishing on arXiv. And Bradly, I think there is something with the deadline. It was not updated on my Visa, but actually...

B

Yeah, I'll take care of it at INCF. I'll probably have to tell them closer to the time, because I think they do this later in the...

C

Oh yeah, actually some people.

C

B

Is this at INCF or another org?

C

Am here for me.

C

They extended the deadline, but mine was showing August 28.

A

B

I'll re-email them and see what they say. Let me make sure they do that.

C

And this past week I couldn't work because of these issues. I will work this week. That's the reason I was mostly doing documentation work for whatever I did. I couldn't train the model, and another issue was with the GPUs on which to train, so yesterday I got the full Colab GPUs. Maybe in the next couple of weeks the model will get trained and be up and running, and the papers will also be drafted.

A

C

I am sending all this to Mayukh also. He is aware of the project and what I am doing right now.

B

The papers will be, I think, more iterative. It's fine for right now, because there are going to be a lot of changes made. We could even have multiple versions of the same thing, like the stuff you're working on for MedSAM. We can start with one, and then we might also make other documents. That's the way

B

this kind of works: you have the work, and you want to make sure that it gets out there, but you want to do it in a way that's interesting or relevant to the people you want to target. So if we want to, say, talk to a machine learning audience, we're going to have to use a lot of the jargon of the field. They want technical details. Now, a general audience doesn't necessarily care about that.

B

They just want to use it for whatever reason, and they're not super experienced in machine learning. You want to have that kind of a paper for them as well, but

A

B

that's just something we'll work out. We have to work out the technical details and get those into the paper. And then, of course, the technical details sometimes get buried in papers too much, I think.

B

If you read a lot of these scientific papers, they have all these supplemental graphs and data and figures. I mean, people read it, but some of these papers have something like 240 supplemental documents now, and who reads all that? Sometimes technical experts are interested in parts of it, but no one's really interested in every piece. So it's one of those things where we'll be creating things for specific purposes.

C

Okay, I will just make a basic draft which can be modified in the future. I think arXiv is not a publisher, so we can upload our base version there and make changes over there, if that would be all right.

B

Yeah, one thing we should have, of course, is a persistent URL when you put up code or something. Well, GitHub, I guess, is a persistent URL, but a lot of times people will make a release. Probably near the end of GSoC we'll make a release for the code. We already have releases for DevoLearn, but we'll make another one, and maybe we can even archive it.

B

But I'm talking about the data sets and things, so those can be archived separately. There are data archives where you can make a deposit and it gives you a DOI, and you can update the files in it, so people can download it.

C

Okay, I will keep an MIT license on my GitHub repo and I'll push it to about 100, and the preprint I will upload to arXiv as open, One, Foundation.

B

C

That would be nice. After completing these drafts I will share them, but right now I will keep it in my repo.

C

B

C

B

There should be a copy of the code with a license in that same repo, so people have that information, but then we'll also have an archive of it, and the archives are usually just the releases. A group will make a release when they get to a certain point. There are different strategies for doing that, but generally they'll say, we're going to make a release now, and we're going to call this 1.7.

B

Maybe it's a stable version, and then you put it up in an archive. There's a little tag in GitHub repos that says "latest release," and you can just click on that and get the stable version. The repo itself will have any changes that are current, so the last push to that repo will be in the repo files, and that may not work stably for whatever reason, but the last release should be stable.

B

So people will do releases just to make sure everything's stable, and then people can download that, and people can still keep working on the open-source files. They can keep working on the code, and that's not interrupted. Okay.

C

Thanks uh next week, yeah.

C

C

C

Like you talked about in the other meetings, the documentation of the code and all this stuff, what is happening under the hood. Actually, we had a meeting, and I explained everything to him, whatever happened. We will make a draft of whatever is happening, which process we are following, etc.

B

Okay, that sounds good. Thank you for the update. Let's see, Himanshu posted something in the chat here: JOSS, the Journal of Open Source Software. That's a good venue for papers. It depends; they have a certain standard for applicability and use and things. I think I submitted there once and they didn't want the paper at that point.

B

But it's basically a model of journal where you prepare the code, push it to their repository, put the paper in their template, and then they publish it directly from a GitHub push. It's kind of an interesting model for publication, but it's usually papers on open-source software that describe the work, and they have certain standards for what they find interesting.

B

So it's kind of like a regular journal. And then, yeah, arXiv is a preprint server, so that's obviously a place to go with the different paper versions. You can put papers there, but not necessarily code, although they have improved their support for code links and things like that.

B

There is Papers with Code and things like that, which are arXiv papers that also have the code, and I don't know how that's set up. But I know that the person in the other group, Ankit Grover, is familiar with it, because he got a paper on arXiv and Papers with Code and all that, so I might talk to him about that.

B

So that's all in the planning stages. And then Jyothi says: can you suggest any journal publications for cell biology?

B

Oh, there are a lot of publications, but one of the things in our group is that we're at the intersection of cell biology and computational science, computational biology, so the journals there are going to be very different from what they publish in the cell biology journals. BioSystems is an interesting journal that kind of straddles all of that; it's even more theoretical. But they're different.

B

We can come up with a list of journals for a certain paper, but the point I'm trying to make here is that a lot of your cell biology journals are not going to publish a machine learning paper. It's just not their audience. Their audience is more like, "I did this experiment" or "I made this observation in the lab."

B

It's kind of an odd thing, but there are other journals too where they do have this mix. This is a thing we'll have to talk about with some finesse. So thanks for all those updates. I guess we're going to extend the projects to October 30th. I'll try to get in touch with the people at INCF for that.

B

So, thanks again, and now I want to turn attention to the genomics of cell differentiation paper. There was a lot of activity on that this week. I think we're at version 10 now, right? And Lukas did a BLAST search, so I'd like to hear from Lukas, actually. If you don't mind, talk a little bit about that.

E

Good day. So, I actually had a couple of meetings with Mr. Gordon, and what we came up with was an idea which... If I could share figure 8.13 from Embryogenesis Explained. I'm just going to share the screen for a second.

E

Okay, so this is figure 8.13 from Embryogenesis Explained by Natalie K. Gordon and Richard Gordon. These are a bunch of proteins, like fcr, Wnt, MAP (microtubule-associated protein), and, based on what Mr. Gordon said, they seem to be involved in differentiation in C. elegans, I guess. So let me stop sharing.

E

So what we came up with was that we might be able to run a BLAST search for, let's say, one of these proteins. So what I did, for example, let's say for PKC, protein kinase C:

E

I went on NCBI, the National Center for Biotechnology Information, and I searched the protein database for "PKC C. elegans", and it shows a bunch of results, most of them sequences for proteins of that family. So it gives you the protein sequence, and then what I did was run BLAST against it. For those of you who don't know,

E

BLAST is basically software that helps with DNA alignment, or sequence alignment in general. So I ran BLAST and specifically narrowed the results down to the taxonomy of C. elegans, so that we could find results there, and then I started counting. Based on what Richard said, keeping a count for each protein is going to be a specific enhancement to the paper. How it works is, it

E

showed about 100 hits for each protein, and then I started counting the copies. These copies are the n copies; technically they are the ones that have drifted apart, and they correspond to the n edges of the differentiation tree. There is a limitation to this procedure, though: when you do this, it's not going to show all of the hits, and when you find significant similarity, the problem is that some of these... I could share.

E

If I could share my screen again, just a second.

F

Lukas, what stringency did you use?

E

I'm not sure; I have to look it up again, but it's the normal settings. I don't think it specifies stringency. The default settings. I have to look up what the percentage is. But let me show you what's going on. So, for example, I ran the BLAST search and these are all of the results that it found, but the limitation here is that some of them are like...

E

For example, if you look here, it's technically the same thing that it's showing us. And the problem is that I think the NCBI database is, how could I put it, not that developed. So, for example, let me go to version 10: for DAAM1 (D-A-A-M-1) and one other protein I could not find any results, and for APC, I think since the amino acid sequence was too short, it didn't find any copy.

E

So for some of the proteins I could not even find the sequence. I would say that's a limitation of this procedure that we are doing. But that's technically what I did. Mr. Gordon also sent some interesting ideas, and I could read them. One, he said, look for regulatory elements, not genes, and I started looking for those; the BLAST

E

search is the same thing, but I have to look for those specific elements. Two, he said, reduce the stringency; I have to look that up. And three, he said, see if there is data on expression. I'm going to put this into the chat.
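The counting procedure described above can be sketched in code, assuming the BLAST results are saved in the standard 12-column tabular format (`-outfmt 6`). The query and subject identifiers and all numbers below are hypothetical, `count_hits` is a name invented for this sketch, and the thresholds are only stand-ins for whatever "stringency" ends up being used. Note that a count near 100 per protein may reflect a cap on the number of target sequences returned rather than the true number of copies, which fits the limitation raised in the discussion.

```python
from collections import defaultdict

# Column order of BLAST tabular output (-outfmt 6)
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def count_hits(tabular_text, min_identity=0.0, max_evalue=10.0):
    """Count distinct subject sequences per query that pass both thresholds.
    Several alignments to the same subject are counted once."""
    subjects = defaultdict(set)
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = dict(zip(COLUMNS, line.split("\t")))
        identity = float(fields["pident"])
        evalue = float(fields["evalue"])
        if identity >= min_identity and evalue <= max_evalue:
            subjects[fields["qseqid"]].add(fields["sseqid"])
    return {query: len(ids) for query, ids in subjects.items()}

# Hypothetical rows; identifiers and values are made up for illustration
EXAMPLE_ROWS = [
    ("query_pkc", "subj_A", "100.0", "300", "0", "0", "1", "300", "1", "300", "0.0", "600"),
    ("query_pkc", "subj_A", "98.0", "120", "2", "0", "1", "120", "5", "124", "1e-60", "240"),
    ("query_pkc", "subj_B", "41.5", "280", "150", "6", "10", "289", "12", "295", "2e-50", "180"),
    ("query_pkc", "subj_C", "27.0", "90", "60", "4", "30", "119", "40", "131", "0.5", "35"),
    ("query_apc", "subj_D", "35.0", "200", "120", "5", "1", "200", "3", "205", "1e-10", "60"),
]
EXAMPLE_TEXT = "\n".join("\t".join(row) for row in EXAMPLE_ROWS)

print(count_hits(EXAMPLE_TEXT))                     # all hits
print(count_hits(EXAMPLE_TEXT, min_identity=40.0))  # stricter identity threshold
```

Lowering `min_identity` or raising `max_evalue` corresponds to reducing the stringency. Counting distinct subject identifiers removes repeated alignments to the same entry, but it does not merge separate database entries that describe the same protein, so duplicate records would still need manual review.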

F

Now, Lukas, these ideas come from discussions with Natalie this morning.

E

Yeah. I'm not sure if I could find the data for three, for the different stages of development. However, we could look at this protein. I left a note that I think it was...

F

Can you answer that? For C. elegans, has anyone done the gene expression for individual cell types?

B

I don't think so, not for individual cell types. I think people have done studies on specific cells, but I'm actually going to talk about this in a bit, what I think the state of the art is.

F

Yeah I don't but that's addressing here.

B

Yeah, these are all good, especially the first one, I think, looking for regulatory elements. This is where a lot of the action is, in the regulatory elements, apparently.

F

It could be that, if there is a differentiation code, whatever codes it might be regulatory rather than these expressed genes. On the other hand, Lukas got hits with the default stringency, so you got hits up to 40,

F

which is impressive, because I'm not sure we have to have drift just because there's been a duplication.

B

F

Okay, four is something maybe you could answer. Are there other nematodes for which the cell lineage has been worked out, besides C. elegans?

B

Yeah, there are a couple, I think C. briggsae and some other ones. They're different, like different cell numbers, and I don't know what the available data is.

F

Okay, so here's the question: if you have one that has a smaller number of cell types, is its lineage tree a subset of the C. elegans one?

B

Oh yeah. I don't know about smaller; I know there are some that are larger, but...

F

Then we can reverse the question. Yeah.

B

Well, yeah, we could compare across nematodes.

F

So I think that, okay, if we can get another: is one tree similar to the other one, in that any part of it is similar?

B

The problem there is that I think the nomenclature is different in the different species, so it'd be hard to know if that's... yeah, I suppose.

F

Somebody's just lonely, a tree Yeah.

F

B

Yeah, I mean the way that they handle the nomenclature, because remember, these trees were originally drawn by hand. So people

F

Were like just putting.

A

F

On them, yeah you're, saying getting corresponding cells between two species might be difficult.

B

Yeah, but I don't know; it depends on the name that you pick, I guess.

F

Now, is there any software for saying how similar two trees are to each other?
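One common way to quantify how similar two trees are is a Robinson-Foulds style distance, which counts the groupings of leaves (clades) that appear in one tree but not the other. The sketch below is a minimal rooted version under a strong assumption: both trees must use the same leaf names, which is exactly the correspondence problem raised in this discussion. The first example tree uses the early C. elegans founder cells; the second is a hypothetical rearrangement for illustration, not a real species, and the function names are invented here.

```python
def clades(tree):
    """Return (leaves of this subtree, set of leaf sets for all internal nodes).
    A tree is either a leaf name (str) or a tuple of subtrees."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves = leaves | child_leaves
        found |= child_found
    found.add(leaves)  # the clade defined by this internal node
    return leaves, found

def rooted_rf_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must share the same leaf labels")
    return len(clades_a ^ clades_b)

# Early C. elegans lineage: P0 -> (AB, P1), AB -> (ABa, ABp), P1 -> (EMS, P2)
ELEGANS_EARLY = (("ABa", "ABp"), ("EMS", "P2"))
# Hypothetical rearrangement of the same four cells, for illustration only
REARRANGED = (("ABa", "EMS"), ("ABp", "P2"))

print(rooted_rf_distance(ELEGANS_EARLY, ELEGANS_EARLY))
print(rooted_rf_distance(ELEGANS_EARLY, REARRANGED))
```

A distance of zero means the two trees group the leaves identically. For lineage trees from different species with different cell numbers or naming schemes, a leaf-matching step would be needed first, and this sketch does not attempt that.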

B

I think so; I don't know, we could look for it. One word about the drift part: in BLAST, the results that you get are matches across the sequence. So when it gives you a percent match, it's really how much is predicted from the input. So I don't know if that's going to be different. I mean, obviously there are going to be differences in the samples

B

if you go from C. elegans to another species. There are obviously going to be differences between, say, C. elegans and another species, but there's also this issue of prediction, strength of prediction. I don't remember what parameter they use, but usually your match is pretty close with protein sequences. It's better than DNA, and the reason for that is that often, when they put together a genome or they have DNA sequences,

B

you get missing bases, so they have to infer the bases using an algorithm or something, or there are a lot of repeats, and that can be a problem with BLAST. So using the protein sequence is generally better and generally more stable. I don't know what those changes,

B

those differences, will reflect: whether they're actually the drift, or whether they'll be just prediction differences. But that's really good work. I'm glad that you were able to get into that and do all that work.

F

B

F

I think we're on to something here.

B

F

We may not have... I think the idea of comparing things this way is good.

B

So yeah, there are a lot of reasons why a lot of the comparative work is really hard to do. You have to figure out a method for doing it, because if you're going across, say from C. elegans to Drosophila, it's such a different system in terms of not just the nomenclature but the way that development proceeds, although it's the same type of development, so you have these parallels.

B

But one of the things about BLAST, and a lot of genomic stuff in general, is that organisms tend to share a lot of basic DNA that involves housekeeping genes and things like that, things that make things like cells. Those tend not to be different across species, so you could compare, for example, nematodes and humans, and there are a lot of pathways that are similar or the same.

B

So that's why we can do that with genomes and with proteins. Of course, the differences are in some of the other functional things that are specialized for that species. If you look at sequence homology between, say, bananas and humans, it's like 50 percent, and you would think, well, why is that? And the answer is, well, you have a lot of metabolic genes.

B

You have a lot of genes that involve cell signaling and things that basically don't need to be reinvented by evolution.

D

B

That's all great. Thank you, Lukas, for the great presentation of the work. As next steps, I'll take a look at the draft. I've taken a look at the earlier drafts, but I didn't see the latest draft. I don't know if I actually have the right current draft; I didn't see anything in my inbox this morning, Lukas. So if

E

I could send it to you after the meeting. And the six proteins that I sent there: for five of them I actually couldn't find the protein sequence. If we could find the C. elegans protein sequence for those, I could try BLASTing that as well, but I couldn't even find the protein sequence for five of those six.

E

For one of them, the sequence was too short, so I couldn't find a match.

B

Yeah, that's a common problem, uh often because it's really what it's doing is it's taking a sequence and it's looking at all the other sequences and it's trying to infer like a batch and again it has some like you know, it'll make some uh account for some of the sampling error so like when they put together a sequence they're really, some of it is like actual sequence, and some of it is inference of of what should be there. And so, when you get a sequence in in the database, it's usually pretty clean.

B

But you do have this issue of like getting like an alignment, and so this is where you know: uh you'll you'll have a certain degree of accuracy, but you also sometimes get matches that don't make any sense, and it's just because they're very similar, but you know they may be. You know false positive.

B

So there are a lot of caveats, but it's a very useful tool. And no matches sometimes means that no one has put it in the database yet, because we've only sampled so much biology. You'd think C. elegans is well studied, but not always.

E

And I have a question for Mr. Gordon. Why should we cross-compare with yeast and less complex nematodes? If you could explain the reason and how you came up with the idea.

F

I think that's where Natalie lost me a little bit. That's Natalie's question.

F

I mean one of the reasons.

F

Yeah, I followed her to step four, but I think I was too sleepy.

B

Sometimes people will use cross-species comparisons in phylogenetics to root a tree. What that.

F

I think one of the reasons she brought up yeast is that it has only a few phenotypes.

B

Oh okay, so you know what the phenotype is, or you know what it corresponds to. Yeah, there is the issue of annotation, knowing what it does. I think BLAST will give you an annotation, but the annotations are generally very simple. It's not very detailed information.

B

You can look at yeast: if it's conserved between yeast and C. elegans, see what the function is in yeast. Like I said, we share a lot of DNA across species, so you'd maybe have a corresponding function in C. elegans.

B

So if it doesn't evolve away from the sequence, if it's conserved and just keeps the same sequence, then it probably does have the same function, although not always.

F

I'm interested in where these molecules are in the cell. This is speculation, and it's in that figure. Okay, yeah.

F

It was Figure 8.13 in our book.

F

Oh, it's Figure 8.13.

B

In the 2016 book.

F

Lucas, perhaps you could extract that figure and send it to Bradly.

E

Yeah I could do that yeah.

B

Yeah, so the comparative stuff is usually to root the tree, or to find function. There are a lot of ways, yeah.

F

Find papers on the lineage trees of other species of Caenorhabditis.

B

Oh yeah, I could do that. I kind of know what they are, but I have to go put together some papers.

F

Okay, yeah. And then we need some precise values for the number of different cells and the number of identical cells.

B

All right, yeah, I'll see what they have in the literature. And then, Lucas, the parameter values: sometimes, if you play around with them, you can get different results. I don't know how you played around with the numbers.

E

Are you referring to the E-value?

B

Well, the E-value is like a significance value. Usually you don't worry about the E-value, because it's always pretty small. What it's doing is giving us a statistical significance; it's not exactly that, but it's basically the same. But I mean any parameters that you put in, like a criterion for a percent match, whether you put it in or you're just using it.
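
A minimal sketch of how an E-value relates to an alignment score, assuming the standard bit-score approximation E = m * n * 2^(-S'), where m is the query length, n is the database length and S' is the bit score. The function names and the 1e-5 cutoff are hypothetical illustrations, not settings used in the meeting, and BLAST itself uses effective (edge-corrected) lengths, so its reported values will differ somewhat.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate number of chance hits scoring at least `bit_score`.

    Uses E = m * n * 2**(-S'), with raw lengths rather than BLAST's
    effective lengths (a simplifying assumption).
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def passes_threshold(e_value: float, threshold: float = 1e-5) -> bool:
    """True if a hit is at least as significant as the chosen cutoff.

    The 1e-5 default is an illustrative choice, not a recommended value.
    """
    return e_value <= threshold
```

Smaller E-values mean fewer matches of that quality are expected by chance, which is why a very small value is usually not a concern; each extra bit of score halves the E-value.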

E

I used the default settings, but I have to check that. But are you also referring to percentage identity, like when it matches?

B

uh I think that's also something that it generates so.

B

Those are good things to have when we publish an analysis. For example, you want to have those numbers there in the results.

B

If there are any input parameters, and I don't know what the input parameters are in the window you're using, and you do searches with different input parameters, then you should make note of that.

E

No, in that specific case I did not change the input parameters at all.

A

Okay, yeah.

B

Just as a note: when you're doing these analyses, if you do them under a certain set of conditions, make sure that everyone knows what they are, because when you go to publish them, it's like with the machine learning stuff, we have to have technical detail. It makes a difference sometimes, although I don't know if changing input parameters will make much of a difference.

B

You might have a very specific question with respect to input parameters, but generally just make sure that you know what you started out with and what the results are. We'll probably make a table or something to show what we're getting.
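
One way to keep the starting conditions next to the results is a small table with one row per search. This is only a sketch: the column names and the function name are hypothetical, and the actual table for the paper may be organized differently.

```python
import csv
import io

# Hypothetical column layout: search settings first, then the reported result.
FIELDS = [
    "query", "database", "program", "evalue_threshold",
    "best_hit", "percent_identity", "evalue",
]


def results_table(runs):
    """Return CSV text with one row per search, settings alongside results."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for run in runs:
        writer.writerow(run)  # missing keys are written as empty cells
    return buf.getvalue()
```

Recording the threshold and database even when they are the defaults means a later search with different settings can be compared row by row.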

B

Yeah, that's great. And I'll look into the literature on the cells, cell IDs, and the different types of lineage trees across species. I know that in Caenorhabditis, which is the genus of interest to us, there are a number of nematodes that have been studied. They're not model organisms, but I think they understand the lineage tree. But again, the nomenclature is sometimes different.

B

So it may be something that is easily comparable or not. I have to find out.

F

Yeah, that's that's, gonna, be fun. Yeah.

F

Take any two real trees: can you match them and say, oh, these are at least the same species, or this is an older species than the other one?

F

Okay, yeah. There's an implication here that if we could match the differentiation trees, then we might actually be able to date when different species occurred.

B

Yeah, okay, yeah.

F

So I think they'll all over again. Yeah.

B

So I wanted to go over this deep dive that I did on differentiation. Last week I talked about the stuff with methylation, and this is a different type of data than what Lucas was looking at. Lucas is looking at the output proteins. What happens, of course, is we have transcription.

B

We have a promoter region and we have a gene. A promoter triggers things on the gene, or expresses certain parts of the gene, and then you get a protein made from that. But what controls transcription? What controls transcription are these epigenetic things that control the openness or the closedness of the promoter. This is where methylation comes into play.

B

What I talked about last week was the standard model in stem cell research, which is mostly in mammalian cells. What's interesting is that in mammalian cells there's this assumption of global regulation of methylation state. What that means is that all across the genome, in a certain organism, if you have a cell that's a stem cell, you have a change in methylation state.

B

These methyl marks will change their state to drive the cell towards a certain differentiated state. We talked about bistability and all that, but that's kind of an aside to the main idea, which is that in general, in a cell, every gene will be regulated in the same way. In other words, they're going in the same direction: every promoter will be regulated or primed towards this differentiated state.

B

That's what we have in mammals, and it's very interesting that that's maybe not the case in C. elegans, although maybe it is. We don't know this for sure, because apparently the literature is a little bit scattered. But let me go through what we have in terms of genomes for C. elegans. The C. elegans genome was sequenced a couple of years before the human genome, before there was a draft human genome.

B

It was put out and published in 1998, so there's this genome sequence of the nematode C. elegans. This is from the C. elegans Sequencing Consortium, and they put out a 97-megabase genomic sequence which, in the original version, revealed over 19,000 genes.

B

The number of human genes predicted by the human genome sequencing project was something like 20,000 to 30,000, and they keep revising the numbers as they refine the sequence: they put a draft sequence out and then they refine it. So this is not that far off from what we have in humans. People thought a long time ago that there were a lot more genes in humans than in other organisms, especially what they call the lower organisms, but that's actually not true. It seems like C. elegans and humans.

B

Have, within an order of magnitude, a similar number of genes. Now, the size of the genome is different. If you look at the C. elegans genome, and I don't know if I have a copy of it here, it's a couple of chromosomes and, I think, a sex chromosome. So there are a couple of autosomes and a sex chromosome, and in the C. elegans genome the chromosomes themselves don't have centromeres. That has implications for this global regulation.

B

Oh, this is where we get a lot of our information about gene expression data. So this is the genome sequence.

B

And then this is where we have a bunch of genes that we can sequence in DNA, and then we can infer proteins from this, or we can actually get the protein sequences. But this is actually from ENCODE. There was a part of the ENCODE project called modENCODE, and that was the part of the project where they did a lot of.

B

They collected a lot of data on C. elegans and Drosophila, and so they make these comparisons between C. elegans and Drosophila, and humans and mice. So there's this broad comparability aspect: you can do experiments in C. elegans, say for aging or for other types of things.

B

We have the genetic pathways, and we kind of know what they look like. They're very similar in humans because, like I said, we have a very high degree of similarity for such different organisms, and then we can make inferences. They have these things called homologs and paralogs, which are how you get genes that are similar in different species; they have different names, but they're basically doing the same thing.

B

This is an example of what we have with the modENCODE data set. These are gene expression data. So this is ChIP-seq data, which is where they.

B

This is next-gen sequencing, where they put a sample on a chip and sequence it. For each oligonucleotide they get a sequence and an amount of that sequence that's expressed, and then they compare it against the genome and try to find these little stretches of DNA and how intensely they've been expressed, or how intensely they're present in the sample. So we can say a lot of things about gene expression using ChIP-seq data and other types of data. This has a lot of ChIP-seq data in it.

B

This was something that ENCODE did, and the reason they did this is that they wanted to infer function from the genomes of these different species. The methodology was that if there is a transcription factor in association with a promoter region, that's expression, or that's function. There was a controversy about how they defined function, but basically this data exists.

B

This is BLAST. Of course, this is the Wikipedia stub for BLAST. I don't know how much I need to go through this for people, but basically this is the origins of BLAST: we're trying to find a way to compare DNA sequences and make this comparison between similar DNA sequences. This has been around a long time, and you can do this through the GUI, like how Lucas did it. You can also set up BLAST on a cluster or even on your own machine.

B

I don't know if you could really do a good job on your laptop or desktop environment, but you can set it up so that you have the database in a FASTA file. You plug it into the program, it runs, usually as a command-line thing, and then it will give you your matches. But of course that's going to take a lot of memory, so using the GUI is probably good enough for a lot of this sort of thing.
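
A minimal sketch of the local, command-line setup described above, assuming the NCBI BLAST+ programs `makeblastdb` and `blastp` are installed and on the PATH. The helper names, file names and the 1e-5 cutoff are hypothetical; the functions only build the command lines and do not run them.

```python
def makeblastdb_command(fasta_path, db_name, dbtype="prot"):
    """Command that turns a FASTA file into a searchable BLAST database."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


def blastp_command(query_path, db_name, out_path, evalue=1e-5):
    """Command for a protein-vs-protein search with tabular output."""
    return [
        "blastp",
        "-query", query_path,
        "-db", db_name,
        "-evalue", str(evalue),   # E-value cutoff; 1e-5 is only an example
        "-outfmt", "6",           # tab-separated hits, one per line
        "-out", out_path,
    ]

# To execute, pass each list to subprocess.run(cmd, check=True).
```

Building the argument lists separately from running them makes it easy to write the exact settings into the methods section or a results table.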

B

But if you're doing a genome-wide assay or a survey, installing it on your machine is good. And this just explains how this process works. It's really comparing two sequences and finding a match. It's inferring matches from these pairwise comparisons and calculating a score, which is the degree of match, and then it's generating the E-value that we talked about, where it's evaluating, I guess, the significance of this.

B

That is, the match or the result. It gives you a similarity score, which is how similar the two sequences are. For reasons of sampling and other reasons, these aren't always going to be a hundred percent, so we want to take note of when they're not 100, although that's not usually a problem. This is, of course, in WormBase. So this is WormBase-specific. This is NCBI.
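
The pairwise comparison and scoring described above can be sketched with a plain Smith-Waterman local alignment. This is a stand-in for illustration: BLAST itself uses a faster seed-and-extend heuristic and substitution matrices, and the match, mismatch and gap values here are arbitrary assumptions.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns with identical residues ('-' is a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)
```

The score grows with the length and quality of the best shared stretch, while percent identity reports how far a given alignment falls short of 100.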

B

If you have trouble, Lucas, finding some of the protein sequences on NCBI, you might try WormBase. This is wormbase.org, and this is Tools, BLAST/BLAT. So this is a BLAST/BLAT search specifically for C. elegans. This is based on that C. elegans genome, so you can actually choose the version of the genome that you want.

B

The latest is WS288, and you can do the sequence search. There are actually different BioProjects: the VC2010 genome was done in 2019, and it was just a revision of that 1998 genome. And so this is a tool specifically for that. You have the E-value threshold; this is the E-value that we talked about. You can just threshold it; I think there's a default value, but this is the significance number. And then this is the database.

B

It's usually BLASTP, but you can also compare nucleotides versus proteins. And then, finally, this is the C. elegans genome assembly. This is on NCBI, so we can get the whole genome if needed, but I don't think we need the whole genome; this is just to let you know what the state of that is. There are a bunch of comments in the discussion here. So we had, okay, yeah, we're talking about the figure. Then Dick has two citations of the cell state splitter.

B

Okay, yeah yeah.

B

I remember those. This is a lukai, and then, let's see, yeah, okay. So then I had this other thing that I did on the deep dive, where I was thinking about this problem of these different methylation patterns across the genome. Basically, you have this problem where in mammals you have this global control: you have these methylation marks on the promoters and they all sort of go in the same direction. In C. elegans.

B

However, that's not necessarily the case, and this is the same in Drosophila. C. elegans has what we call a mosaic form of development, which means that instead of having cells that respond to differentiation cues in the environment, the cells are deterministic in terms of what they're going to be. You can take a lineage tree, and any one cell from a developmental state, which is usually a stem cell state, will differentiate into a certain type of cell.

B

So if you take a cell out of that lineage tree, you remove a bunch of adult cells, and there's nothing that will make up for it, as you see in mammalian development.

B

This is something they call mosaic development, but apparently there's also a mosaic form of methylation. Originally they didn't think that C. elegans had methylation; they thought that it was restricted to mammalian cells. But what they found is that in mammalian cells you have this global regulation, and in C. elegans you have this mosaic regulation. I have a couple of papers here; I'm not going to go over them too much. This is mosaic methylation and clonal tissue.

B

This talks about some of this: if you have a tissue type, you can have this mosaic regulation of methylation, so you can have cells within the tissue that maybe can jump to different states. This methylation isn't always stable over a tissue. So you have this global regulation of the genome in mammals, but even in mammals you have this sort of variation across cells in a tissue. And that brings us to C. elegans, where they have.

B

Apparently they have these clusters of methylation marks, and they tend to be in the promoters of genes, and this will allow for this sort of differentiation in different cells. But it's mosaically regulated: there are certain places in the genome where the methylation marks are in one state, and another part of the genome where they're in another state. And if you think about this mosaic development mode.

B

That makes sense, because not all cells are going to end up in the same state at the same time. Sometimes cells will differentiate early into neurons, for example, and sometimes they'll differentiate later into muscle or into something else. This paper is on induced neurons from germ cells in C. elegans, and it talks about actually inducing this process and some of the things that they do with transcription factors, and they're.

B

Actually using direct reprogramming here. Actually, I think this is just a review where they talk about our current knowledge about this. They're using this kind of approach with C. elegans, and they're showing that there are these differences between mammalian cells and C. elegans.

B

There are also what they call HOT regions, which are regions in the C. elegans genome that are CpG-rich, and the CpG again is the cytosine-to-guanine transition, and that's what they look for with these methylation marks. So you have these sequences that are CG, CG, CG. Sometimes they're in what they call a microsatellite or a satellite, and sometimes they're just in the genome.

B

Now, in the promoter regions you tend to get these satellites, where you get these long repeats of CG, and that's where you get this sort of methylation activity that affects differentiation, because it changes how the gene is accessed, and it changes what's regulated. So this talks about these HOT regions. They talk about them in C. elegans and humans, and this work kind of sets up.

B

This difference between C. elegans and humans in that respect. They find that there are these regions where you get clusters of these CG repeats, you get this higher potential for regulation that's maybe based on cell differentiation, and you get differences between humans and C. elegans in terms of the stability across the genome. So there's a lot of work in this paper. Okay, so that's the mosaic methylation work.

B

This is the genome, and then there are some other papers I got. This one, on CpG islands and the regulation of transcription, is basically driving home this message that there are these methylation marks, which are epigenetic, that regulate the promoters, which then regulate gene expression. But we can actually identify the state, or the potential state, of methylation and this change from the sequence, because the DNA sequence should have a lot of these CG repeats.

B

So the idea would be that where you have a lot of CG repeats, there is a potential for differentiation and regulating cell state. That's what I found in my deep dive. I was really interested in that because this is all over the literature, and I knew about mammalian cells but I wasn't familiar with C. elegans.
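
The idea of reading methylation potential off the sequence can be sketched by scanning for CG-rich windows. The sketch below uses the common observed/expected CpG ratio, (CpG count x length) / (C count x G count), with the often-quoted cutoffs of a 200 bp window, GC fraction above 0.5 and ratio above 0.6. Those cutoffs were developed for vertebrate genomes, so applying them to C. elegans is an assumption, and the function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    s = seq.upper()
    n = len(s)
    c, g = s.count("C"), s.count("G")
    cpg = s.count("CG")  # CG dinucleotides cannot overlap each other
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def cpg_rich_windows(seq, window=200, step=50, min_gc=0.5, min_obs_exp=0.6):
    """List (start, end) of windows exceeding both cutoffs (strictly)."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        gc, ratio = cpg_stats(seq[start:start + window])
        if gc > min_gc and ratio > min_obs_exp:
            hits.append((start, start + window))
    return hits
```

Running this over promoter sequences would give candidate regions only; whether a CG-rich stretch is actually methylated still has to come from experimental data.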

B

There are some other papers I didn't put in there on people doing work on specific cells in, like, the vulva, where they actually looked at the genome of different cells in the vulva. And to answer your earlier question about the data we have for specific cells: people will do stuff like that, where they'll sample a couple of cells in an organ and they'll look at the genome, or they'll.

B

Look at, maybe not an entire genome sequence, but specific genes, and then they'll actually look at the function between the cells. But I don't know of any study, I don't know if we have the entire genome for each cell. We just have these data sets where people ask a question and they generate a data set, and that's what exists.

B

So that's it. And then back to this paper with the vulval cells: they were able to show that these different methylation states actually govern the differences between vulval cells. There were two cells in one state and two cells in another state. They were all in the vulva, but they had different functions.

B

So this is again maybe a different analysis from the proteomics, but that's something we can put together in the paper.

B

I think the missing part of this, of course, is the differentiation code, and we did some work on the differentiation code in our 2016 paper. It has a title that's not really what I'm looking for here, but we did do some differentiation trees for C. elegans and for Ciona intestinalis, which is a sea squirt. We generated those in this paper and we evaluated the lineage trees with respect to differentiation, and there were some other things in here.

B

But the thing that I wanted to point out here was that we did work out something about the differentiation code in these types of organisms. So this was a mosaic organism where we had reorganized the lineage tree, and then we did this cast analysis, which is kind of like a BLAST.

B

Basically, the differentiation code is this binary code, where you have these binary divisions in the lineage tree and you attach binary numbers to them, and the binary numbers get larger as you go down the tree. Then you can take a certain level of that tree, and you can take another tree, and you can compare the sequences of numbers.

B

So if you have binary numbers, they act computationally in the same way as a DNA sequence or a protein sequence, and you can actually align those and get a score. That's what we were doing here: we're generating a code for the different nodes of the differentiation tree, which is a re-sorting of the lineage tree. We were comparing, level by level, different trees or different formulations of the tree, and then we were getting a score for the matches between those two trees.
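
A minimal sketch of the idea, not the procedure from the 2016 paper: each division appends a 0 or a 1 to the parent's code, and two trees are compared level by level by the fraction of codes they share. The tree encoding (a leaf is a string, a division is a pair), the function names and the shared-fraction score are all assumptions made for illustration.

```python
def level_codes(tree, code="", depth=0, levels=None):
    """Collect binary codes per depth; codes lengthen by one bit per division."""
    if levels is None:
        levels = {}
    if depth > 0:
        levels.setdefault(depth, []).append(code)
    if isinstance(tree, tuple):  # a division: (first daughter, second daughter)
        first, second = tree
        level_codes(first, code + "0", depth + 1, levels)
        level_codes(second, code + "1", depth + 1, levels)
    return levels


def level_match_scores(tree_a, tree_b):
    """Per-level score: shared codes divided by all codes seen at that level."""
    a, b = level_codes(tree_a), level_codes(tree_b)
    scores = {}
    for depth in sorted(set(a) | set(b)):
        sa, sb = set(a.get(depth, [])), set(b.get(depth, []))
        scores[depth] = len(sa & sb) / len(sa | sb)
    return scores
```

For example, a balanced four-cell tree `(("ABa", "ABp"), ("EMS", "P2"))` compared with a tree in which only the first daughter has divided agrees fully at the first level and half at the second.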

B

So we could actually get a sequence and its alignment. That was the way we approached it in that paper. I don't know if that's, go ahead.

E

So that might help us with what Dick was looking for, differentiation trees. Would it work? I could not find the software, though. Did you use a specific software? We.

B

Didn't, yeah, we didn't write software for it. That's not going to be the same as a BLAST; we didn't write software or a formal package for it. We just kind of did it with some code and some sorting. We could write up some code or some software for what we're doing here, but I'm just saying we don't have the software; it doesn't really exist.

B

It's just kind of software operations; it's not really something you can release to people. But I think that might be a good method going forward, though I don't know if it's the best method. Okay, yeah.

B

Yeah, no problem. Okay, now I'd like to say a few words about methylation and cell differentiation.

B

So the first thing I want to talk about, as I mentioned, is these CpG islands. CpG islands are these very small motifs of C to G, so it's like C and G.

B

It's a two-base motif, a very short motif, and you can have these paired together in different parts of the genome. Those are all over the place, but they're clustered in the promoter regions. As I mentioned, there are these HOT regions that you observe, where they're clustered in functionally relevant places like a promoter region, and they open and close the promoter so that the transcriptional machinery can get access to the DNA in the gene.

B

You also have these longer sections, and this is where we're getting into our.

B

Patterns here, or our richness of these CpGs, and that is where you get something like this, which is what we call a satellite or a microsatellite. This is a term from genetics, and the reason they talk about it this way is because the way they discovered it was by running an electrophoretic gradient, where you have bulk DNA.

B

And then you have these bands, or satellites, of CpG content. This is an electrophoretic gel, so it's running in this direction, and things segregate out along that gradient according to their molecular weight, so you can sometimes pull out different types of proteins or different types of DNA sequence. This is the way they used to do this before a lot of the modern sequencing technologies came about.

B

Well, that's an aside from the point of this, which is to say that we have a promoter region and a coding region, and that promoter region may have these CpG islands.

B

Where you get a methylation state, you have methylation marks, usually H-something; that's the nomenclature they use. So in the literature you'll see a methylation mark, and it's H3-something or H6-something, and.

B

The mark that's on this site. You get enough of these in the promoter, and you have a lot of places where this is the case, and so these sites are at the site of regulation. You can have them open or closed, or they can be in a bistable state, which we talked about last week.

B

If it's open, that means that you have transcription of a gene, or transcription of some allele or something from a coding region of a gene, and it makes a product: it makes an RNA, an mRNA. If it's closed, of course, then it's closed down. If it's bistable, you can have something making different types of products. So this is something where, if we have a gene, for example, that's involved.

B

In making muscle, like MyoD, you can have this switch on that gene and the promoter, and it can turn it on and off, or turn it on in different ways, so that it's making different products and making more of a certain mRNA than it would otherwise.

B

So this bistability allows it to be poised to turn on and off during the process of development, and this is what we mean when we say bistable: that it can be in either state. And the thing I mentioned in the meeting is that in mammals you have global regulation. So in mammals.

B

You have global regulation of this.

B

That means that everywhere across the genome, in a cell, these marks have the same state. So there's a switch from being a stem cell, to being a precursor cell, to being maybe a neural cell.

B

And this makes sense, because you want to have this coordinated across the genome: you have these different intermediate states and this complex signaling that happens in cells, and they have a great plasticity as to what they are.

B

So if you put a stem cell on a bunch of muscle cells, it'll become a muscle cell, or if you put it in a culture of neural cells, it can become a neural cell, just through signaling. You can also reprogram the cells artificially and get a similar result. But in C. elegans and in Drosophila you have this mosaic.

B

You have mosaic regulation, which means that it's local, I guess; you get local regulation. Local regulation is where it's on a gene-by-gene basis. This is important for this type of development, because here we have a lot of cases where we have these deterministic lineage trees.

B

So we have these lineage trees that might have two daughter cells, and you might go from this level of development, and all these cells coming down like this will maybe contribute to the epidermis, or different parts of the different tissues. So fate is restricted by the level of the developmental cell. If I were to take out this developmental cell, for example, I take out this entire part of the lineage tree, and I would basically deprive the organism of maybe one half of its body.

B

So you'd end up with an adult looking like this, instead of the worm that we're used to. This is in C. elegans. The other cells definitely can't fill in the gap: you can't proliferate more cells here to make the back end, whereas in a human or a mouse, a mammalian system, you could do that. So this is the difference, and the methylation marks are just ways to regulate that process.

B

To keep these cells deterministic instead of, I don't know what you would call it, maybe regulative, where you can regulate cells to a new fate as you need to. So this is all kind of the background for this, and I hope you learned something. All right. Finally, I'd like to talk about the differentiation code, as we talked about in the meeting. Briefly, in our 2016 paper we defined the differentiation code as the outcome of a reorganized lineage tree.

B

What we did was take a lineage tree. A lineage tree has a mother cell and the daughter cells; we have these binary divisions. I'm just going to do a four-cell tree to give you the idea. Now, these are divided, usually, in a basic anterior-posterior orientation. So in C. elegans you have the anterior cell and the posterior cell, and then in this four-cell example you have two anterior cells and two posterior cells, and they're organized.

D

By like nomenclature.

B

And one of these posterior cells is going to go on to form the germline, and another posterior cell is going to go on to form specialized cells in the intestines and the muscle and some other things. And then the anterior cells are going on to form most of the epidermis, as well as some muscle, a lot of cuticle and things like that, and neural cells, of course. So this is how it's structured. This is a lineage tree.

B

What we do in the differentiation tree is we actually organize this by size. So these cells, instead of being anterior and posterior, are going to be organized by size, so this is larger and smaller. This is the largest. This is the larger of the two here. This is the larger of the two here, and.

A

The reason we do.

B

That is because of the way that they originally built this model, and this was on regulative embryos. In the tissues there are these expansion waves and contraction waves, so the larger are the expansion waves and the smaller are the contraction waves. And so when you're dealing with tissues in a regulative embryo, you're going down the tree like this, and the group of cells is either contracting in size or expanding in size. So this is, say, a contraction from here.

B

This is an expansion from here. The shape and the form of the thing is either an expansion or a contraction. So the expansion is usually on this side, and the contraction is usually on this side. In the mosaic embryo, we had to take some liberty with that, to say that the larger cells are on one side and the smaller cells are on the other.

B

Well, the consequence of this is you end up with a different topology. It's shifted from the lineage tree, so that you don't really care about the anterior-posterior orientation, you care about this size orientation. And so we're just using single-cell size as the criterion for that, at least before we get tissues. We don't have tissues at this stage, just cells, and so we're tracking the developmental cells versus the terminally differentiated cells. So this is the way we did this in 2016. Now.
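
[Editor's sketch] The reorganization described here can be pictured roughly as follows. This is only an illustration under assumptions, not the procedure from the 2016 paper: the `Cell` class, the `label` convention (0 for the anterior daughter, 1 for the posterior daughter), and every size value are made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cell:
    name: str
    size: float      # hypothetical relative size, not a measurement
    label: int = 0   # assumed convention: 0 = anterior daughter, 1 = posterior daughter
    daughters: List["Cell"] = field(default_factory=list)  # stored as [anterior, posterior]


def cells_at_level(root: Cell, depth: int, by_size: bool = False) -> List[Cell]:
    """Return the cells `depth` divisions below `root`, left to right.

    by_size=False keeps the lineage (anterior, posterior) order.
    by_size=True puts the larger daughter of every division first.
    """
    level = [root]
    for _ in range(depth):
        next_level = []
        for cell in level:
            kids = list(cell.daughters)
            if by_size:
                # Stable sort: equal-sized sisters keep anterior-first order.
                kids.sort(key=lambda k: k.size, reverse=True)
            next_level.extend(kids)
        level = next_level
    return level


def level_code(root: Cell, depth: int, by_size: bool = False) -> str:
    """One binary digit per cell, read left to right."""
    return "".join(str(c.label) for c in cells_at_level(root, depth, by_size))


# Four-cell example with illustrative sizes.
ab = Cell("AB", 2.0, 0, [Cell("ABa", 0.9, 0), Cell("ABp", 1.1, 1)])
p1 = Cell("P1", 1.0, 1, [Cell("EMS", 0.6, 0), Cell("P2", 0.4, 1)])
p0 = Cell("P0", 3.0, 0, [ab, p1])

lineage_code = level_code(p0, 2)             # anterior-posterior order
diff_code = level_code(p0, 2, by_size=True)  # larger-smaller order
print(lineage_code, diff_code)
```

With these made-up sizes only the AB daughters swap places, so the two four-cell codes differ in two positions.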

B

That means that you have an interesting problem here, which is that you can create a binary code from this reorganization, but you can also create a binary code from the, um, I hope I'm doing this right, but I think it matters in this case.

B

You also have one for the lineage tree, so you have one for the differentiation tree, one for the lineage tree. That means that, for this four cell example, here we have two codes.

B

We have one that's the original code, and this is just like a reference alignment, and then we might have one where we have something like this.

B

So this is the differentiation tree. This is the lineage tree. Okay, all right. So then what we can do is take those two trees and see how distant they are at each level. So we can actually look at this level here, too, so we can say zero one and zero one. Let's say that there was no change, so the anterior cell was actually larger than the posterior cell.

B

So if that's the case, then we have the same concurrence between the two-cell lineage tree and the two-cell differentiation tree, and so there's a distance of zero there. We use what they call Hamming distance, from computer science, to characterize this.

A

So the Hamming.

B

Distance is zero here, which is interesting, because there is no difference between them. In this case, however, there's a big difference, and that big difference is, basically, I think everything has changed here. So there's a distance of four, which means it's maximally distant here. I don't think this was the empirical result, but I can't remember. I'm.

A

Just giving you an example.

B

In any case, this gives us like a basis for comparing trees.
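
[Editor's sketch] The Hamming distance used in this comparison is simply the number of positions at which two equal-length codes differ. The code strings below are assumed for illustration, since the ones drawn on screen are not captured in the transcript: "01" against "01" stands for the unchanged two-cell level, and "0101" against "1010" for a four-cell level where every position has changed.

```python
def hamming_distance(code_a: str, code_b: str) -> int:
    """Count the positions at which two equal-length codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must come from the same level (equal length)")
    return sum(a != b for a, b in zip(code_a, code_b))


print(hamming_distance("01", "01"))      # 0: the two trees agree at this level
print(hamming_distance("0101", "1010"))  # 4: maximally distant for four digits
```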

A

B

It doesn't have to be the lineage tree and the differentiation tree. It could be two different differentiation trees. It could even be two lineage trees, although the lineage trees don't vary in this way. So it's really useful either for comparing it with a lineage tree or for comparing different samples, different individuals, different species. And so then we actually have this code that we can compare and align, and this code increases in length. It's a binary state, so as you get to the 8-cell, the 16-cell, the 32-cell stage, the number of binary digits increases.

B

So it's like you might have a three bit or a four bit or a five bit number and you have a longer and longer set of sequences to compare, but you can always do it at the same level, it should work.
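
[Editor's sketch] Comparing "at the same level" can be written as a per-level distance profile. This assumes one binary digit per cell at each level, so the code doubles in length with every round of division. The example codes are invented, and the normalized fraction is an addition for illustration rather than something from the discussion.

```python
from typing import List, Tuple


def distance_profile(codes_a: List[str], codes_b: List[str]) -> List[Tuple[int, int, float]]:
    """For each level, return (level, Hamming distance, distance / code length)."""
    profile = []
    for level, (a, b) in enumerate(zip(codes_a, codes_b), start=1):
        if len(a) != len(b):
            raise ValueError(f"level {level}: codes differ in length")
        distance = sum(x != y for x, y in zip(a, b))
        profile.append((level, distance, distance / len(a)))
    return profile


# Hypothetical codes for the 2-, 4- and 8-cell levels of two trees.
tree_a = ["01", "0101", "01010101"]
tree_b = ["01", "1001", "01011010"]
for level, distance, fraction in distance_profile(tree_a, tree_b):
    print(level, distance, fraction)
```

Dividing by the code length keeps the later, longer levels from dominating the comparison just because they have more digits.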

B

But now we have this problem where these are just the cells and their states. So what we were looking at in the paper in 2016 was characterizing the cell size and its reorganization of the lineage tree, so its order from left to right, which is fine, except that now we only have the information for cell size. That's our sole criterion. We also have the information for lineage, but that's sort of implied in the structure.

B

But what we need in this case, when we're looking at genomes and protein sequences and so forth, is a way to map those changes onto this tree structure, and, when we realign the trees, to have those states also realigned. So the realignment isn't the problem. It's characterizing the state differences, or the things that define each cell. And so this is where we're kind of at an impasse right now, I think: we don't know how to make that mapping.

B

So each cell has, I don't know, this content. It's like an n-tuple. Traditionally we've used spatial location, x, y, z, and t. This is our five-tuple for how we usually model these trees. So we might have three dimensions of spatial position, one dimension with temporal information, and then this variable that measures maybe some other factor. It could be some summary of molecular data. It could be something else. But that's maybe not enough.

B

Maybe we need to have multiple entries in here, where we have a huge list of attributes that are at the molecular level, and to be able to summarize those into this parameter. But, you know, maybe we just need to pick one parameter and build a tree, each tree having one parameter, and then get a distance, so you're getting a Hamming distance that would be suitable. I guess I'm trying to figure out how to model this here in my head.
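
[Editor's sketch] One possible reading of "one parameter per tree" is sketched below. Everything here is an assumption made for illustration: the `CellState` field names, the use of the fifth entry as a single attribute (for instance a numeric summary or a 1.0/0.0 presence flag), and the rule that the sister with the higher attribute value is written first.

```python
from typing import Callable, List, NamedTuple, Tuple


class CellState(NamedTuple):
    x: float      # spatial position
    y: float
    z: float
    t: float      # time
    value: float  # fifth entry: one chosen attribute (hypothetical)


SisterPair = Tuple[CellState, CellState]  # stored as (anterior, posterior)


def attribute_code(pairs: List[SisterPair], key: Callable[[CellState], float]) -> str:
    """Write each sister pair as '01' if the anterior cell ranks first, else '10'."""
    digits = []
    for anterior, posterior in pairs:
        # Ties keep the lineage (anterior-first) order.
        digits.append("01" if key(anterior) >= key(posterior) else "10")
    return "".join(digits)


def hamming_distance(code_a: str, code_b: str) -> int:
    if len(code_a) != len(code_b):
        raise ValueError("codes must have equal length")
    return sum(a != b for a, b in zip(code_a, code_b))


# Two sister pairs; value 1.0 / 0.0 stands in for presence / absence of some protein.
pairs = [
    (CellState(0.0, 0.0, 0.0, 1.0, 0.0), CellState(1.0, 0.0, 0.0, 1.0, 1.0)),
    (CellState(2.0, 0.0, 0.0, 1.0, 1.0), CellState(3.0, 0.0, 0.0, 1.0, 0.0)),
]
reference = "01" * len(pairs)                         # lineage order as the reference
by_value = attribute_code(pairs, key=lambda s: s.value)
print(by_value, hamming_distance(reference, by_value))
```

Swapping the `key` function (for example `lambda s: s.z`) builds a different tree from the same cells, which is the "one tree per parameter" idea.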

B

um Something like this versus something like this.

B

So this would have you know a distance and.

A

B

This would be representative of some molecular attribute. We could even reorder it, instead of left to right or largest to smallest.

B

It would be like presence or absence of a certain protein. So it would be, you know, a protein where we don't know what the name of it is: is it there or not, or what's the state, or whatever. Now, this complicates things, because we don't have single-cell data, as we mentioned in the meeting, so this might be a problem, but I hope it's not. We can arrange this in different ways and get a result that maybe is informative to people. So do we have anything else?

B

I want to talk about today or.

F

What if you plot the number of hits versus stringency, and it peaks at a certain stringency? Does that mean anything?

B

A

B

Well, it just means, I guess, that there would be fewer hits with a higher stringency, I would imagine, but.

F

B

B

If something is more common, like if it can identify things that are more common across different samples in the database, which, you know, is not everything in biology, but it's what's in their database. Yeah, this would be a curve that rises and then plateaus. It should vary based on how common that is in the database.

F

Okay, so if it's a plateauing function, then if we find the stringency at which it plateaus, you can see if that's similar or different between different molecules. Yeah, yeah, okay, so there might be a classification, probably.
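
[Editor's sketch] The plateau idea raised here could be explored along these lines. This is a toy under assumptions: stringency is taken to be a minimum percent-identity cutoff, the identity values are invented rather than real search output, and the "plateau" rule (relaxing the cutoff one more step adds fewer than `min_gain` hits) is just one simple choice.

```python
from typing import List, Optional, Sequence


def hits_by_stringency(identities: Sequence[float], thresholds: Sequence[float]) -> List[int]:
    """Count hits whose percent identity meets each cutoff."""
    return [sum(identity >= t for identity in identities) for t in thresholds]


def plateau_threshold(thresholds: Sequence[float], hits: Sequence[int],
                      min_gain: int = 1) -> Optional[float]:
    """Thresholds must run from strict to loose.

    Returns the first cutoff that already has hits and after which relaxing
    one more step adds fewer than `min_gain` hits, or None if there is none.
    """
    for k in range(len(thresholds) - 1):
        if hits[k] > 0 and hits[k + 1] - hits[k] < min_gain:
            return thresholds[k]
    return None


# Invented percent identities for the hits of one query.
identities = [98, 97, 95, 90, 88, 60, 40]
thresholds = [100, 95, 90, 85, 80, 70]  # strict to loose
hits = hits_by_stringency(identities, thresholds)
print(hits, plateau_threshold(thresholds, hits))
```

Running the same two functions for several molecules would give one plateau value per molecule, which could then be compared across molecules as suggested.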

B

To handle, yeah. And then, of course, not all matches are going to be relevant. Sometimes you'll get matches to something that's totally different, that has a different function, and you don't think it's relevant to what you were after. Because, you know, sequences are combinations of characters. With protein sequences this doesn't happen as much, but DNA sequences are combinations of just four characters, so if you have a short sequence, you can get a lot of noise.

B

You can get a lot of things that aren't relevant yeah.
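
[Editor's sketch] A rough way to see why short DNA queries are noisy: if letters were uniform and independent (a strong simplification that real genomes and proteomes do not satisfy), the expected number of exact chance matches of a length-k query is about the number of positions times alphabet_size^(-k). The function name and the database length below are illustrative assumptions.

```python
def expected_chance_matches(query_length: int, alphabet_size: int, database_length: int) -> float:
    """Expected exact matches of a random query in a random sequence.

    Assumes uniform, independent letters; real sequences are not random,
    so this is an order-of-magnitude guide only.
    """
    positions = max(database_length - query_length + 1, 0)
    return positions * alphabet_size ** (-query_length)


database_length = 1_000_000  # hypothetical database size in letters
dna = expected_chance_matches(8, 4, database_length)       # 4-letter alphabet
protein = expected_chance_matches(8, 20, database_length)  # 20-letter alphabet
print(f"DNA 8-mer: {dna:.2f} chance matches, protein 8-mer: {protein:.2e}")
```

For the same length, the 20-letter amino acid alphabet makes a chance match far less likely than the 4-letter DNA alphabet, which matches the remark that this happens less with proteins.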

F

F

Well, I guess for the proteins, at least, we know their length.

F

But approximately what once they're trying to get a match, because.

F

Okay, well, really good. Lukas can handle this, yeah.

E

Don't know if I can, well, but yeah, let's see what's going on. If there was a software for that purpose, that would have been interesting. But the problem with the matching in general is, even when you run protein-to-protein BLAST, sometimes it shows you duplications of results. They're the same thing, right? I set it to non-redundant, I want non-redundant results, but it still shows me the same thing with the same percentage identity.

E

But when I look it up, I see that sometimes it's from a different locus of the same thing. So that's why it sometimes shows you the duplication. So I'm not sure about the curve thing that you mentioned. That might be a problem then, but yeah, I'll look that up later.

F

Okay, Lukas, one suggestion: we're writing a paper here, and the list of things that you put in the paper needs to be turned into proper English.

E

Oh yeah, yeah, that's like a version 10, it's like a, I didn't want to, I wanted to put it in a different document, but I just attached it as, like, details there. It's not part of the paper. Okay, yeah, yeah.

F

Yeah, so what you're doing your conclusions are.

B

Yeah, there's a specific way to write up results or write up methods, so yeah, but that's something you'll learn here, so.

B

All right, that's great. All right, well, thank you for attending. See you next week. Okay, great, good session. Yeah, thanks. Bye, all.

E

B

Now I'd like to go over a few papers that have come out that have to do with some of the things we talked about today. So actually, this paper has to do with some of the things we've talked about in the past few weeks on human embryoids and, sort of,

A

B

human embryos outside of the normal process of human development. So there's been a host of papers in this area, and it's really kind of been a breakthrough in recent times. But this paper actually focuses on some of the things going on in that stage where the inner cell mass is forming from a blastocyst. So you see the blastocyst here, it's kind of moving out. You have the inner cell mass, the trophectoderm, and they're arguing here that they're able to find, using live imaging, nuclear DNA shedding during blastocyst expansion and biopsy.

B

So this is the blastocyst up at the top. It's starting to form this trophectoderm and inner cell mass, and that's the stage of development that we're in. What I'm pointing to here is the graphical abstract. So they're actually live imaging this. They're using this live imaging technique, they're getting images of this, and they're able to see different things that are going on here.

B

So one thing is mitosis and segregation errors, so cell division is occurring, you're, getting errors in mitosis and errors in chromatin segregation and segregation of the different parts of the cell, as it's dividing.

B

So the contents of a cell split apart and move towards the poles as you get cell division, because you're going to eventually get two cells, and they're going to pull apart, so each has to have, you know, basically a copy of the DNA and the contents of the cytoplasm, or the inside of the cell. And so that segregation process is happening here, and you're observing errors of that. You also get this nuclear DNA shedding during expansion. So as the cell structure expands, as you see here,

B

you get this nuclear DNA shedding that comes from, I guess, the nucleus of the cell. So the nuclear DNA kind of comes off and sheds, and then they do this biopsy and you have more shedding here. So, the highlights of the paper. They're using a fluorescent dye assay, which means they introduce this dye to stain the things that they're interested in, so you can see them clearly under a microscope. Fluorescent dyes enable live imaging of human embryos without genetic manipulation, so they're able to actually use a dye.

B

If you're familiar with voltage dyes in neurons, they'll sometimes use voltage dyes to reveal electrical activity instead of using transgenes, or instead of using other types of assays like recordings or electrodes. So these dyes are actually quite flexible. They're introduced into the sample and they're able to pick up some of these things. The alternative would be using a GFP or YFP transgene, and that has its own challenges in these live samples. Live imaging reveals differences between human and mouse embryo morphogenesis.

B

So there are differences between human and mouse morphogenesis that they're able to observe. I guess they also sample mouse cells that have a similar mode of development, but you're able to observe the differences there. We talked about that in terms of genomics today, but also in cell biology you have these systems that you can compare. They have different processes going on, but in human and mouse the processes are similar enough that you can get a sense of the underlying sort of process.

B

The underlying sort of the I guess, the underlying conditions.

B

Blastocyst expansion causes trophectoderm cell nuclear budding and DNA shedding. So you get this nuclear budding here in this image, and then you get the shedding that comes from this budding. And then mechanical stress from blastocyst expansion or biopsy triggers nuclear DNA loss. So basically what they're arguing is that in this process you're getting nuclear DNA loss, you're getting these mitosis and segregation errors, and this is something that has

D

B

implications for, you know, genetic anomalies in development, perhaps. So that's what they're interested in here. There's a lot of technical detail in this paper that I'm not going to go into; this is just to show that this paper exists. This is from Cell, and this is a recent paper, 2023.

B

The other paper I want to talk about is this one. It's from bioRxiv. I don't know what conference it's going to be at, but it's probably going to be at a conference. This is called "Synergizing geometric deep learning and data-centric methods for improved protein structural alignment."

B

So we were talking about protein structure and protein sequence alignment. We've talked in the past about some of the tools for protein folding, and of course AlphaFold, which is a machine learning technique for looking at protein folding. This is protein structural alignment, and in this case they're using geometric deep learning for this.

B

uh So the abstract reads: structures are replacing the role of sequences, so, as we saw in the meeting, we have these sequences that have a certain amount of information, they're good for conveying what was transcribed and translated from the DNA. So the DNA structure tells us what's in the genome, but then that gets transcribed certain parts of that get transcribed.

B

Generally, the sequences of interest that get put in something like NCBI, or some other centralized database, are things that are biologically interesting. So.

D

B

Usually things that get transcribed. And so we have these different alleles or different isoforms of a gene, and what's being expressed by the gene.

B

But then we also have translation, which means that it's being turned into a protein sequence, or an amino acid sequence. And so, you know, we're interested in the work that Lukas was doing on the amino acid sequence, but there's also the structure. And then we have structure in RNA as well, where there are folds and turns and other types of topological features that are functional. They have functional significance. So a small RNA molecule might be a straight line, and it's a sequence, but in larger RNA molecules you have secondary structure.

B

There is secondary structure, so that actually the sequence together with its structure is the information of that molecule. So this is what they mean by this sentence. Traditional bioinformatics research focuses on sequences because they were reasonably obtained. And so this is, again, the sequence: you can get them from studies. There are ways you can do mass sequencing, so it's cheap to do sequences.

B

It's not so cheap to do structural analysis, and so this is why you can get sequences more readily than structures. Advances in techniques like cryo-electron microscopy, molecular modeling, docking algorithms, and structure prediction software have shifted the focus to structures. So there's microscopy that you can do; you can actually do X-ray crystallography as well.

B

You can get information about the structure of the protein, but cryo-electron microscopy is a little bit more modern than X-ray, and so this is just a way to get the data so that you can actually model the structure.

B

But then, of course, once you have the structure you need to model it. You need to understand how it works functionally, and so that's where a lot of this stuff comes in: molecular modeling, which is where you have a three-dimensional model of the structure; docking algorithms, which are, you know, when you have a cleft in the protein, there are different things, like ligands, that dock in there, and those are biologically important, so it's important to know how those work; and then structure prediction software, again.

B

This is protein folding: what's the conformational state of that shape, and is it biologically viable? So all those things are necessary to know, and you can get these structures on NCBI or some other resource, and you can model them in software. There are different software packages for protein modeling. There's actually also something called Nanome, which is a VR-based protein modeling platform, where you can actually pull protein structures up in front of you and play with them, and do all these things that you can do in traditional programs.

B

It's really interesting stuff. So this is something that we're moving towards now. Given the importance of deep learning in many of these breakthroughs, it makes sense to also explore how it can modernize classic bioinformatics tools. So this is, again, you know, we want to know if we can apply deep learning, or even graph neural networks, or geometric deep learning, I guess, in this paper, to some of these older techniques. And they're older only in a relative sense, because these are like maybe 10 years old,

B

a lot of these deep learning methods. These other methods, like molecular modeling, are maybe 40 years old at most, and then X-ray crystallography, or some of these other techniques you use to get the protein structure, maybe that's 70 to 80 years old, or maybe a little bit older than that. But the point is that it's not a really old science, but it's moving forward.

B

So: however, empirical findings have shown that machine learning-based methods have many pitfalls, resulting in over-optimistic conclusions, including data leakage between test and training data. So again, in our typical deep learning model we have training and test data. We train the model on the training data and evaluate it on the test data, and our model is only as good as the training data. If test examples leak into training, the results look better than they really are. So this is a problem that they're trying to get around, and it's a caveat especially with biological data.
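
[Editor's sketch] One common guard against this kind of leakage, offered only as an illustration and not as what the paper does, is to split by group rather than by individual example, so that closely related proteins never end up on both sides. The protein IDs, family labels, and the `split_by_family` helper below are all hypothetical.

```python
import random
from typing import Dict, List, Tuple


def split_by_family(family_of: Dict[str, str], test_fraction: float = 0.2,
                    seed: int = 0) -> Tuple[List[str], List[str]]:
    """Assign whole families to train or test so no family appears in both."""
    families = sorted(set(family_of.values()))
    rng = random.Random(seed)  # seeded for a reproducible split
    rng.shuffle(families)
    n_test = max(1, round(len(families) * test_fraction))
    test_families = set(families[:n_test])
    train = [p for p, fam in family_of.items() if fam not in test_families]
    test = [p for p, fam in family_of.items() if fam in test_families]
    return train, test


# Hypothetical proteins grouped into hypothetical families.
family_of = {
    "protA1": "famA", "protA2": "famA",
    "protB1": "famB",
    "protC1": "famC", "protC2": "famC",
    "protD1": "famD",
    "protE1": "famE",
}
train_ids, test_ids = split_by_family(family_of)
print(train_ids, test_ids)
```

A plain random split over individual proteins could put `protA1` in training and `protA2` in test, which is exactly the situation that inflates test scores.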

B

Thus, there is a need for new Innovations to make neural networks more intelligible.

B

So: in this paper we have developed Van Gogh, a geometric deep learning-based structural alignment approach that performs on par with the state of the art without ever having been trained on a pair of naturally found homologs. So we talked about homologs, where these are analogous genes or analogous proteins in different species, or they're analogous in terms of being duplicates. So homologs are where they have a similar function, they're just divergent

B

in some way. We adopted a data-centric approach to address deep learning's data limitations by augmenting protein templates into synthetic homologs for training. Our method allows us to supplement homolog data by knowledge-driven augmentation, self-learning of relevant structural features by supervised examples, and protein alignment that is competitive with state-of-the-art methods.

B

So, let's break this down a bit. They have homolog data, which is kind of the standard in proteomics, where you have, you know, comparisons between, say, species. It gives you sort of what we saw in the BLAST searches, where you have two different sequences and you're trying to infer the relatedness of the two sequences, or the similarity. And so it doesn't tell you a lot about function. It doesn't tell you a lot about evolutionary homology,

B

necessarily. It's just that similarity, and so we need to have more information here. So knowledge-driven augmentation is something where we know something about the proteins, their function, and their context, so we can apply that as data augmentation. We also know a lot more about structure. We can supplement with sequence information or functional information, and we can even, you know, backfill this with molecular simulation or with other sources, so we can actually augment our data set in that way. There's also self-learning of relevant structural features by supervised examples.

B

So this is typical of machine learning or deep learning, but not necessarily typical of traditional protein modeling. So, you know, they use these supervised examples, supervised learning, to provide this information to the algorithm, so it can learn new structural features.

A

B

This is very basic deep learning stuff, so this is not anything new, except that in this field it would be an advance. And then this results in protein alignment that is competitive with state-of-the-art methods. So along the way you're able to align proteins, you know, in comparison to other methods, using these sources of information, and it will give you a result. Now, they don't necessarily talk in the abstract about the improvements that they've made relative to traditional deep learning methods.

B

So there are some caveats that they're getting around. I guess they're bootstrapping this with training and with data augmentation to get a good result. So this is the neural network framework that they mentioned. This is github.com/DeepRank/deeprank-core/tree/main/deeprankcore, and that's the place where you can find the code for this. So those are two new papers that just came out. Thanks.