From YouTube: PASTIS @ ExaBiome
Oguz Selvitopi
PASTIS @ ExaBiome
Hi everyone, my name is Oguz Selvitopi and I'm with the Performance and Algorithms Research group at LBNL. Today I will be talking about our work on protein similarity search and how to do it on distributed-memory systems using GPUs. This is a joint effort of the ECP projects ExaBiome and ExaGraph, and our tool that does this search is called PASTIS.
Comparative genomic studies examine the biological relationships between different organisms by exploiting the similarities of their genome sequences. A major component of many biological workflows is to find the existing genes by aligning them, by comparing them against a reference database. For example, a common task is to find out the functional or taxonomic content of samples collected from an environment, often by comparing the collected sequences against an established, known reference database.
So it becomes more and more important to build fast computational infrastructure for this kind of comparative genomics, as more and more genomes are sequenced every day, with sequencing costs dropping and the technology becoming more widely available.
The bottlenecks in metagenomics research are shifting towards computation and storage. For those unfamiliar, metagenomics means that the genomes in a sample can come from virtually any type of organism, not just one: they can come from bacteria, from a plant, from an animal, from a human. These are called metagenomic datasets, and they are extremely large, because they can contain any genes, related or not; you do not know anything about them in advance.
There are existing tools in this area. Although these tools are fast and well optimized for shared-memory environments, they are not really well tuned for distributed-memory environments. They usually implement a tool for a shared-memory environment and then try to extend it to a distributed-memory setting, which often does not scale very well.
These are our two tools. The first is PASTIS, which stands for Protein Alignment via Sparse Matrices; it performs the similarity search on sequences in metagenomic datasets. For clustering, we have HipMCL, which is a distributed-memory, high-performance implementation of the Markov clustering algorithm.
This figure shows our pipeline to form the protein similarity graph, or to perform the protein similarity search. At the left we have the protein sequences; from the given set of sequences we select the pairs on which we want to run the alignment. The alignment produces a similarity measure, which we can then use to establish how similar the pairs are.
Usually, at the end, we filter out the pairs that do not meet certain criteria in similarity or coverage, and this finally leads to the similarity graph, or the similarity matrix, however you look at it. This graph can then be used to cluster the sequences, to get the protein families, to discover new protein sequences, and so on.
A major novelty in PASTIS is the treatment of graphs as sparse matrices. The basic information storage and manipulation medium in PASTIS is sparse matrices. They are used to represent the different kinds of information required throughout the search. For instance, here on the left you can see a matrix whose rows represent the sequences and whose columns represent the k-mers. A k-mer is a substring of a sequence, so if a k-mer exists in a sequence, there is a nonzero at that location.
As you can see, these matrices are quite different from the matrices in classical numerical linear algebra. They are actually graphs, and we attach a lot of custom information: we do not just store a numerical value at the nonzero locations, we store custom data types, and the operations on these matrices usually operate on those extended data types.
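To make this concrete, here is a minimal SciPy sketch of such a sequences-by-k-mers matrix. This is illustrative only, not PASTIS code (PASTIS uses distributed sparse matrices and stores custom data types at the nonzeros rather than plain numbers); the function name and the choice of value are my own.

```python
from itertools import product
from scipy.sparse import csr_matrix

# Hypothetical helper: rows are sequences, columns are all possible protein
# k-mers, and a nonzero at (i, j) marks that k-mer j occurs in sequence i.
def kmer_matrix(seqs, k=3, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    # Map every k-mer over the 20-letter protein alphabet to a column index.
    col = {"".join(p): j for j, p in enumerate(product(alphabet, repeat=k))}
    rows, cols = [], []
    for i, s in enumerate(seqs):
        for pos in range(len(s) - k + 1):
            rows.append(i)
            cols.append(col[s[pos:pos + k]])
    # SciPy sums duplicate entries, so each stored value becomes an occurrence
    # count; PASTIS instead stores richer custom types at each nonzero.
    return csr_matrix(([1] * len(rows), (rows, cols)),
                      shape=(len(seqs), len(col)))
```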
PASTIS relies on CombBLAS to perform the distributed sparse matrix operations; this library supports MPI plus OpenMP hybrid parallelism. For the alignment on CPUs and GPUs, we use two different libraries, SeqAn and ADEPT. Neither of these alignment libraries is a distributed-memory library: they are quite fast and optimized for shared-memory environments, and they are not aware that we use them in a distributed manner. So we orchestrate them: PASTIS performs the search and delegates the alignments to those libraries.
So what are the basic challenges of this search? There are two basic types of computations performed in our search: sparse matrix computations and edit distance computations. The former, the sparse computations, are memory bound: they have low computational intensity, a high memory footprint, and a lot of irregular memory accesses. The edit distance computations, in contrast, have high computational intensity and a uniform pattern in computing the edit distance matrices.
The edit distance computations are small and dense, so they perform a lot of computation per byte, while in the sparse computations that ratio is quite low, which is why those have a high memory footprint.
Another challenge is that this kind of search usually requires a huge amount of memory, and the existing libraries in this area have various techniques to deal with this issue, ranging from writing intermediate files to performing the search in stages. The challenge is that the number of candidate pairs that need to be stored and aligned grows on the order of n squared, since we are doing a many-against-many search.
If we increase the number of sequences n, the search space is going to grow with n squared; for example, doubling the number of sequences roughly quadruples the number of candidate pairs. So it is going to require an extensive amount of memory.
We have several different techniques to efficiently perform the search. These are some of the techniques and algorithms we have implemented; they include low-level optimization techniques as well as algorithmic innovations. The first technique we use is the blocked, or incremental, formation of the similarity graph: here we have a novel algorithm called blocked two-dimensional sparse matrix-matrix multiplication, which is used to find the candidate pairs that are going to be aligned. We also have different algorithms and heuristics for load balancing, and then specific optimizations for the GPUs.
When we use the GPUs, we have a technique that can utilize all the resources on the node by distributing the work we do while performing the search.
This figure shows the blocked two-dimensional sparse matrix multiplication. As you can see, the matrices on the left are multiplied to find the candidate pairs in the middle, which are the nonzeros in the overlap matrix. Each nonzero is a candidate pair to be aligned, and then, after the alignment, on the right we have the similarity matrix, which contains the information we need.
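Reusing the sketch above, the overlap matrix can be illustrated as a single sparse product. This is a toy, single-node stand-in for the distributed blocked SpGEMM, and the variable names are mine.

```python
# Candidate pairs from a sparse product: with A the sequences-by-k-mers
# matrix, A @ A.T is the sequences-by-sequences overlap matrix whose
# nonzero (i, j) counts the k-mers shared by sequences i and j.
A = kmer_matrix(["MKVLAHHH", "MKVLAWWW", "GGGGGGGG"])
overlap = (A @ A.T).tolil()
overlap.setdiag(0)                          # drop trivial self-overlaps
overlap = overlap.tocsr()
overlap.eliminate_zeros()
candidates = list(zip(*overlap.nonzero()))  # pairs to align, e.g. (0, 1)
```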
We can limit the memory required by the entire execution of this search, or the memory footprint of the similarity graph, by using a blocked, or incremental, formation of this similarity matrix, or overlap matrix.
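A minimal sketch of the idea, under the assumption that we block the output by rows (the real algorithm blocks the distributed 2D product; this generator is just illustrative):

```python
# Form the overlap matrix one row block at a time, so only a single block
# of candidate pairs is resident in memory; the block count is tunable.
def blocked_overlap(A, n_blocks):
    n = A.shape[0]
    step = -(-n // n_blocks)              # ceiling division
    for b in range(0, n, step):
        yield b, A[b:b + step] @ A.T      # one block of the overlap matrix

for start, block in blocked_overlap(A, n_blocks=2):
    pass  # align this block's candidate pairs, then let the block be freed
```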
This block-count parameter is tunable, so we have control over the amount of memory we use. In this way we aim at maximum utilization: we enable in-memory search for very big datasets, and it also opens up the path for several storage optimizations. The obvious disadvantage is that it increases the runtime compared to doing everything all at once, but that is better than not being able to do it at all; it also increases the communication.
We have different techniques for load balancing. As you might have noticed, the overlap matrix in the previous slide is symmetric, because if sequences i and j are similar, then sequences j and i are also similar. By exploiting this symmetry, we can avoid alignments and even parts of the sparse matrix computations.
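In the toy SciPy setting above, exploiting the symmetry amounts to keeping only one triangle of the overlap matrix (again just a sketch of the idea, not the PASTIS mechanism):

```python
from scipy.sparse import triu

# Keep only pairs (i, j) with i < j: each alignment is computed once and
# its result can be mirrored to (j, i), halving the alignment work.
upper = triu(overlap, k=1, format="csr")
```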
For this blocked, incremental formation of the matrix, we tested two approaches, based on triangularity and on indices. The basic difference is that the triangularity-based approach avoids a lot of the sparse matrix computations, but it can still lead to load imbalance, mainly due to the diagonal blocks of the matrix.
As you can see, the index-based method is able to attain a better load balance than the triangular-based method for all tested block counts. But at the right you can see that the triangular-based method is able to save a lot of computation, and when the number of blocks increases, the effect of this load imbalance shrinks or disappears in the triangular-based method, because the number of diagonal blocks grows linearly with the number of block rows, while the total number of blocks grows with its square.
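A quick illustration of that counting argument (a hypothetical helper, not from the talk):

```python
# With b block rows, a triangular traversal touches b*(b+1)/2 blocks, of
# which only b are diagonal, so the diagonal share (the source of load
# imbalance) shrinks roughly as 2/b when b grows.
def triangular_blocks(b):
    return [(i, j) for i in range(b) for j in range(i + 1)]

print(len(triangular_blocks(4)))    # 10 blocks, 4 diagonal (40%)
print(len(triangular_blocks(16)))   # 136 blocks, 16 diagonal (~12%)
```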
We perform the alignments on the GPUs, so we delegate the alignments to the GPUs, while the high-memory-footprint, irregular computations, the sparse matrix computations, we perform on the CPUs. In the blocked formation of the graph, the CPU has to prepare the candidate pairs to be aligned; it then sends them down to the GPU, which performs the alignments, copies the results back to the host side, and post-processes them to incrementally form the similarity graph.
If this were done naively, the CPUs would stay idle. But the blocked formation allows the CPUs to go ahead and prepare the next batches while the GPUs are busy with the current batch. In this way, we can hide the overhead of the sparse matrix computations.
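The overlap can be sketched as a one-deep pipeline. This is an illustrative, thread-based stand-in with placeholder functions, not the actual PASTIS implementation:

```python
from concurrent.futures import ThreadPoolExecutor

# Overlap CPU batch preparation with "GPU" alignment: while batch b is
# being aligned, batch b+1 is already being prepared on a CPU thread.
def pipeline(batches, prepare_on_cpu, align_on_gpu):
    results = []
    with ThreadPoolExecutor(max_workers=1) as cpu:
        pending = cpu.submit(prepare_on_cpu, batches[0])
        for nxt in list(batches[1:]) + [None]:
            prepared = pending.result()                    # current batch ready
            if nxt is not None:
                pending = cpu.submit(prepare_on_cpu, nxt)  # prep next batch
            results.append(align_on_gpu(prepared))         # align meanwhile
    return results
```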
These are distributed sparse matrix computations, with all the collective communications and the irregular memory accesses, so avoiding those types of operations is really useful, because the sparse computations do not scale well, as opposed to the alignments, which usually scale really well. As you can see here in the table, compared to doing this naively, in the plain way, the blocking achieved around a 30 percent reduction in the overall execution time. On the left you see the strong scaling performance, and on the right the weak scaling performance. The triangular-based approach achieves a strong scaling efficiency of more than 75 percent scaling up to 400 nodes. As for the weak scaling efficiency, there are different components, as you can see here with those lines, but overall, at 800 nodes, it maintains an efficiency of more than 80 percent.
The I/O here, as you see, does not scale well, but you can disregard it, because it usually constitutes no more than three percent of the overall execution time. So it is not really a bottleneck in our tool.
So what did we encounter while porting our code to Perlmutter? We did not face any major issues while porting PASTIS to Perlmutter, as our tool was already able to run on NVIDIA GPUs. But the thing is that PASTIS needs both CPU and GPU resources, and on Summit the nodes' CPUs were relatively slow compared to the GPUs, so the sparse computations started to dominate our execution time.
The alignments were really fast when we put them onto the GPUs, so Perlmutter's faster CPUs benefited us. On the right here you see the Perlmutter comparison against Summit and Cori Haswell, a small-scale experiment on 64 nodes. As you can see, both the CPU and the GPU components run faster than the corresponding CPU and GPU components on Summit, not even to mention Cori Haswell, which is quite lost in this figure.
We also conducted large-scale experiments on Summit and Perlmutter. As for the data sizes we tried, the highest number of sequences on Summit was more than 400 million, and on Perlmutter we tried 200 million sequences. As for the scale, on Summit our largest run used around 3,300 nodes, and on Perlmutter we tested on around 1K nodes. On Summit, our smaller-scale run on 2K nodes with 300 million sequences took roughly four hours.
The larger similarity search, on 405 million sequences, took around two and a half hours. The important metrics here are alignments per second and cell updates per second; those are quite good, and we improved them a lot in our bigger run. This bigger run we submitted for the Gordon Bell Prize, and it was selected as a finalist, so it is going to appear at SC22.
If you are interested, you are more than welcome to attend our talk.
We had similar performance on Perlmutter compared to Summit, but at a smaller scale, tested on 1K nodes and 200 million sequences. On the right here you see alignments per second per node; when we exclude I/O, Perlmutter is quite competitive with Summit. If Perlmutter were bigger, as big as Summit, we would also have submitted those results for the competition. But I think the more important point is the table at the bottom here.
As you can see, these are the times spent in the sparse component and the alignment component of PASTIS. The closer they are, the better, because the overall execution time is going to be determined by the maximum of these two. The numbers in red are the ratio of the alignment component to the sparse component, and on Perlmutter they are quite close.
As for the state of the art, the biggest run reported to date was reported two years ago: a search between two datasets containing 280 million sequences and 40 million sequences. In that search, the alignment rate was computed to be 1.2 million alignments per second. In our tool, on Summit, with our biggest run, we achieved almost 700 million alignments per second.
That is roughly a 580x improvement in this metric, more than two orders of magnitude. And as for the search space, the actual size of the search, we increased that by an order of magnitude, about 15x.
Biological workflows can be both compute intensive and they usually have huge memory footprints; I think that is what makes this problem more challenging compared to, say, chemistry simulations. In our approach, we use accelerators for the alignments, we rely on algorithmic techniques for staying in memory, and we use no intermediate I/O.
We only use I/O at the beginning and at the end, so this makes our approach faster. We also implemented a lot of optimizations to take advantage of the heterogeneous node architectures and to be able to use all the resources on the nodes. At the end of the day, we cut the time to solution from days or weeks to hours.
Thank you for your attention. If you have any questions, I will be glad to answer them.