From YouTube: Application of GPUs in Proteomics and Connectomics
Description
Fahad Saeed of FIU presents a talk on Application of GPUs in Proteomics and Connectomics. Pre-recorded session for GPUs for Science 2020 https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/
So, in order to do this, we use multiple modalities from different omics data sets that come from different kinds of systems biology experiments. We then develop machine learning algorithms that allow us to make sense of these data sets, and we go on to develop high-performance computing algorithms that make this process more scalable.
Toward this end, we have been working mainly with proteomics data sets and fMRI-based connectomics data sets. Today I will be presenting mostly the high-performance algorithms that we have been developing for proteomics, that is, mass spectrometry based proteomics, and some of the details about the fMRI-based connectomics; I will be happy to connect offline if needed.
This is a very high-level overview of mass spectrometry, and I will go into a small amount of detail about how the data from mass spectrometry is produced. Assume that these are the proteins that we will try to study. The first step is that they go into the ionization stage, where the specific amino acids in the proteins are charged in the ionization chamber.
After the ionization chamber, the specific peptides go into the isolation chamber depending on their mass, and once they are in the isolation chamber, they go into the fragmentation stage, where these small peptide molecules are bombarded using different fragmentation methodologies. What this does is break up each peptide into very, very small pieces, such that those small pieces can then be analyzed using their mass-to-charge ratio.
The result is a spectrum where the x-axis is the mass-to-charge ratio and the y-axis is the intensity, and each of the peaks represents the abundance of that specific fragment of the peptide in the mass spectrometer.
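As a rough illustration of the data being described, here is a minimal sketch, in Python, of how singly charged b- and y-fragment m/z values can be computed for a peptide. The residue masses are standard monoisotopic values for a handful of amino acids, and the whole example is my own generic illustration, not code from the talk.

```python
# Minimal sketch: singly charged b-/y-ion m/z values for a peptide.
# Monoisotopic residue masses (Da) for a few amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
}
WATER = 18.010565   # mass of H2O
PROTON = 1.007276   # mass of a proton (the charge carrier)

def fragment_mz(peptide: str):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    prefix = 0.0
    for m in masses[:-1]:           # b_i = first i residues + proton
        prefix += m
        b_ions.append(prefix + PROTON)
    suffix = 0.0
    for m in reversed(masses[1:]):  # y_i = last i residues + water + proton
        suffix += m
        y_ions.append(suffix + WATER + PROTON)
    return b_ions, y_ions

print(fragment_mz("GASP"))
```

In a real MS2 spectrum, peaks at (approximately) these m/z positions, with their intensities, are what the downstream algorithms have to work with.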
From the computational perspective, the question becomes: how do we take this specific MS2 spectrum and formulate, or rebuild, the peptide that originally produced it?
The reason that we consider mass spectrometry based proteomics a big data problem is primarily twofold. The first reason is that the mass spectrometers themselves, the instruments, have been getting more and more efficient, and they have been getting efficient in such a way that their throughput has been growing faster than Moore's law.
A
There's.
Another
reason
why
we
considered
the
mass
spectrometry
based
proteomics
a
big
data
problem,
and
it
has
to
do
with
the
kind
of
computational
techniques
that
are
needed
to
process
the
mass
spectrometry
data
right.
So
here
I
will
show
a
very
high
level
schematic
of
why
this
is
the
case.
So
here
we
have
a
typical
mass
spectrum,
mass
spectrometers
right
that
have
produced
this
spectra,
and
these
spectra
can
be
anywhere
from
gigabyte
to
terabyte
level.
As we discussed, database search algorithms allow us to process this data using databases. Now, the protein databases that are used as a reference to deduce the peptides are rather small; they can be hundreds of megabytes. Because of this, many people might say that this is not really a big data problem, which would be true, except that the computational problem we are trying to solve is not matching the spectra to this specific database.
If you look closely at the computational techniques, especially the database search algorithms, you will see that it is not this database that is being used to do the matching. Rather, this database is expanded into a much larger database, known as the theoretical species-specific database, where the expansion depends on the parameters that are given to the search algorithm.
In this specific example, you will see that we have this data set with sequence number one, and for this example we assume that phosphorylation is being requested in one of the parameters of the search engine. When that happens, what the search engine does is take this specific sequence and then enumerate all of the combinatorial possibilities that might be associated with it.
So in this case, you have this sequence, and it is expanded into a set of sequences, where each of the sequences differs in one amino acid: in the first case, the phosphorylation is assumed on the S; in the second case, it is assumed on the T; in the third, it is assumed on the second T; in the fourth, on the Y; and so on and so forth.
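A minimal sketch of this combinatorial expansion (my own illustration, not the speaker's search engine) might look like the following: it enumerates every way of placing up to k phosphorylations on the S/T/Y residues of an arbitrary example peptide, using the standard +79.96633 Da phospho mass.

```python
from itertools import combinations

PHOSPHO_MASS = 79.96633      # added mass of one phospho group (Da)
TARGETS = set("STY")         # residues that can carry a phosphorylation

def phospho_variants(seq: str, max_mods: int = 2):
    """Yield (sequence, modified_positions) for 0..max_mods phospho sites."""
    sites = [i for i, aa in enumerate(seq) if aa in TARGETS]
    for k in range(max_mods + 1):
        for combo in combinations(sites, k):
            yield seq, combo  # each combo is one theoretical candidate

variants = list(phospho_variants("SAMPLTESTY", max_mods=2))
print(len(variants))          # grows combinatorially with max_mods
```

With n candidate sites, the number of variants is the sum of the binomial coefficients C(n, i) for i up to k, which is why allowing even one more modification inflates the theoretical database so quickly.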
A
Right-
and
this
is
just
one
example-
and
this
is
just
one
modification
that
is
being
requested
from
the
search
parameter
when
you
increase
the
number
of
modifications
in
the
parameters
for
the
search
engines.
The
common,
the
expansion
of
this
theoretical
database
is
rather
exponential
and
usually
people
do
their
search.
Engine
runs
using
two
or
three
post
translation
modifications,
just
because
the
results
do
not
get
scalable
with
increasing
size
of
increasing
number
of
parameters
right.
But
this
is
not
the
whole
story.
A
Once
this
specific
database
is
produced,
the
theoretical
database
produced
it.
It
is
then
expanded
into
a
theoretical
spectra
right,
and
this
theoretical
spectra
is
the.
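Tying this to the previous step, a minimal sketch of this second expansion, under my own simplifying assumptions (prefix-fragment, b-ion-like masses only, singly charged, no modifications), could turn each candidate sequence from the theoretical database into a theoretical spectrum, i.e., a sorted list of predicted fragment m/z values:

```python
# Minimal sketch: expand each candidate peptide into a theoretical spectrum.
# Prefix-sum masses stand in for real fragment-ion models.
RESIDUE_MASS = {"S": 87.03203, "T": 101.04768, "Y": 163.06333, "A": 71.03711}
PROTON = 1.007276

def theoretical_spectrum(seq: str):
    """Singly charged prefix-fragment (b-ion-like) m/z values, sorted."""
    mz, total = [], 0.0
    for aa in seq[:-1]:
        total += RESIDUE_MASS[aa]
        mz.append(total + PROTON)
    return sorted(mz)

candidates = ["SATA", "TASA", "AYST"]   # e.g., PTM variants from before
theo_db = {pep: theoretical_spectrum(pep) for pep in candidates}
print(theo_db)
```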
These theoretical spectra are the list against which the experimental spectra are matched. So really, if you think about it, it is not the database itself but the theoretical spectra produced from the database, depending on the parameters being searched, that are used for this peptide deduction.
Now, you will see that the theoretical database produced for matching is up to a terabyte scale or more, and we already have these billions of spectra that are at gigabyte-to-terabyte scale.
The way that these search engines operate is that they assume a lot of filtering mechanisms that allow the methods to be scalable, but it is widely known that those filtering mechanisms lead to dark data in proteomics, which means they leave a lot of data that might never be seen by these search engines. So, in order to solve this problem, the matching cannot be done against these small protein databases, but rather against the large theoretical spectra that are produced, and it has to be an all-to-all matching between the experimental spectra and the theoretical spectra, in order to have a complete understanding and deduction of the peptides in an optimal manner.
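To make the all-to-all matching concrete, here is a minimal sketch, again my own illustration rather than the speaker's algorithm, that scores each experimental spectrum against a list of theoretical spectra by counting shared peaks within an m/z tolerance; production search engines use far more elaborate scoring functions.

```python
import numpy as np

def shared_peaks(exp_mz: np.ndarray, theo_mz: np.ndarray, tol: float = 0.05) -> int:
    """Count experimental peaks within `tol` m/z of some theoretical peak."""
    theo = np.sort(theo_mz)
    idx = np.searchsorted(theo, exp_mz)        # nearest-neighbor candidates
    lo = np.clip(idx - 1, 0, len(theo) - 1)
    hi = np.clip(idx, 0, len(theo) - 1)
    d = np.minimum(np.abs(exp_mz - theo[lo]), np.abs(exp_mz - theo[hi]))
    return int(np.sum(d <= tol))

# All-to-all: score every (experimental, theoretical) pair.
experimental = [np.array([114.1, 262.2, 389.3])]
theoretical = [np.array([114.09, 262.14, 375.2]), np.array([101.0, 389.31])]
for e in experimental:
    scores = [shared_peaks(e, t) for t in theoretical]
    print(scores)   # the best-scoring candidate identifies the peptide
```

The quadratic number of spectrum-pair comparisons, over terabyte-scale inputs on both sides, is exactly what makes this step a target for GPUs.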
So this is what the schematic of our GPU-based algorithm looks like. We do not have time to go into detail for all of it, but here you will see that the spectra are transferred to the GPU side, where only the intensity arrays are transferred, while maintaining the mass-to-charge-ratio spacing of the spectra.
We then had a specific sorting technique that allowed us to sort a very large number of arrays very quickly, and then we do a lot of processing. I do want to get to the detail of turning the 3D spectra into a 1D array, which we were able to do: the idea was that we take these spectra and keep two specific arrays, an intensity array and an index array.
These allow us to go on and process the data while the data is still in a 1D array.
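Here is a minimal sketch of that flattening idea under my own assumptions about the layout (the talk does not spell it out): a ragged collection of spectra is packed into a single flat intensity buffer plus an offsets (index) array marking where each spectrum begins, so that per-spectrum work runs over one contiguous 1D array of the kind a GPU kernel prefers.

```python
import numpy as np

# Ragged input: each spectrum has its own number of peaks.
spectra = [np.array([10.0, 250.0, 31.0]),
           np.array([5.0, 90.0]),
           np.array([400.0, 12.0, 7.0, 66.0])]

# Flatten into one 1D intensity buffer plus an offsets (index) array.
intensity = np.concatenate(spectra)
offsets = np.cumsum([0] + [len(s) for s in spectra])  # spectrum i = [offsets[i], offsets[i+1])

# Per-spectrum processing on the flat buffer (e.g., normalize each spectrum).
for i in range(len(spectra)):
    seg = intensity[offsets[i]:offsets[i + 1]]
    seg /= seg.max()            # in-place: still one flat 1D array

print(intensity, offsets)
```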
This 1D array is then processed, and instead of transferring all of the data back, we transfer back just the data that has been modified by this reduction algorithm, and that saves a lot of bandwidth going from the GPU to the CPU. So once this data goes back, what goes back is only the difference between the spectra as processed and the spectra that changed.
A
It
goes
back
so
since
we
already
know
the
spectra
it
started
with,
this
can
be
used
to
reconstruct
the
spectra
with
the
id
changes.
So
this
is
the
very
high
level
schematic
of
how
the
gpu
can
be
used
for
processing
very
large
number
of
spectra
in
in
an
efficient
manner.
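As a sketch of that difference-only transfer and reconstruction (my illustration, with assumed helper names, not the released code): only the indices and new values of the changed entries need to cross the GPU-to-CPU link, and the host rebuilds the full result from the copy it already holds.

```python
import numpy as np

def diff(original: np.ndarray, processed: np.ndarray):
    """Return (indices, values) of the entries the processing changed."""
    idx = np.flatnonzero(original != processed)
    return idx, processed[idx]         # small payload instead of full array

def reconstruct(original: np.ndarray, idx: np.ndarray, vals: np.ndarray):
    """Host side: apply the delta to the copy it already has."""
    out = original.copy()
    out[idx] = vals
    return out

host = np.array([10.0, 250.0, 31.0, 5.0, 90.0])
device_result = host.copy()
device_result[[1, 4]] = [1.0, 0.5]     # pretend the kernel changed 2 entries

idx, vals = diff(host, device_result)  # only this small delta is sent back
assert np.array_equal(reconstruct(host, idx, vals), device_result)
```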
The execution time that we got was also very encouraging for these data sets: we were able to get speedups that were somewhere between 100 and 400 times, as compared to a naive approach that only gave us speedups of two to four times for these large mass spectrometry data sets. Most of the work here on GPU-based mass spectrometry processing was done by Muaaz Awan, who is one of the organizers of this GPUs for Science series, and all of the code is available online for people to use freely; we will be happy to answer any questions or provide any help that we can.
Our ongoing efforts are toward developing a few of the fundamental computational motifs that we can use for mass spectrometry based proteomics, and there are specific problems that we are working on related to load balancing, localized data processing, and making sure that the resources on these large supercomputers are used in an efficient manner. We run all of our code on XSEDE supercomputers, and I'm going to show you some of the results of the processing that we have been doing next.
The basic idea is that we want to be able to process these MRI data sets in a very fast and very scalable, but also very accurate, manner. The data that you get from these MRI machines is rather large, so our effort has been toward developing these machine learning models and also running these machine learning models using GPUs.
We were able to publish that in Frontiers in Neuroinformatics rather recently, and the reason that I wanted to show this slide is that GPU processing, or high performance computing, for scientific data sets can be very significant. Here you can see that we used our GPU-DAEMON method, which is a generalized method that can be used by different people who might not be very familiar with CPU-GPU architectures or algorithms.
A
Just
using
that
method,
we
were
able
to
take
our
machine
learning
model
and
process
that
in
less
than
40
minutes
as
compared
to
other
methods
that
might
take
seven
hours.
So
a
a
carefully
designed
high
performance,
algorithm
4dbu
can
result
in
very
scalable
performances
for
many
of
these
scientific
data
sets,
which
is
which
is
very
useful,
especially
for
personalized
and
procedure
medicine.
A
Of
course,
without
my
students,
this
would
not.
Any
of
this
would
not
have
been
possible
for
the
work
that
I
presented
today.
Mao
zawan,
who
is
now
who
is
now
a
nurse
at
uc
berkeley,
was
the
major
contributor
towards
these
gpu
based
algorithms.
A
We
had
tabana
swami
and
she
graduated
this
semester
and
she
has
been
working
on
these
big
data
problems
in
for
connect,
comics
and
mohammed
hasi
is
my
current
phd
student,
who
is
working
on
many
of
these
algorithms
distributed
memory,
algorithms
for
large-scale
proteomics,
and
these
are
some
of
the
funding
acknowledgements
from
nsf
and
national
institutes
of
health,
and
we
are
very
thankful
for
to
these
funding
funding
agencies
and
that's
what
and
that's
it
for
now.