From YouTube: Reduced/Mixed Precision Optimization Techniques
Description
CJ Newburn (NVIDIA), Xiaoye Sherry Li (LBNL) & Cindy Rubio González (UC Davis) present a panel discussion on Reduced/Mixed Precision Optimization Techniques. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Panel Chair: Hugo Brunie
Hugo: Right now we will have a panel discussion on mixed-precision optimization, and I will be the moderator; I'm a NERSC postdoc here at LBL. Xiaoye Sherry Li is a senior scientist in the Computational Research Division at Lawrence Berkeley National Laboratory. Dr. Li has worked on diverse problems in high-performance scientific computing, including parallel computing, sparse matrix computations, high-precision arithmetic, and combinatorial scientific computing. She is the lead developer of SuperLU, a widely used sparse direct solver, and has contributed to the development of several other mathematical libraries, including ARPREC, LAPACK, PDSLin, STRUMPACK, and XBLAS. She has collaborated with many domain scientists to deploy this advanced mathematical software in their application codes.
Sherry: Thank you, Hugo, for the kind introduction. As Hugo mentioned, I actually worked quite a bit in the past on high-precision arithmetic; now the need is coming to do low precision, as low as possible. So that's interesting.

By the way, we do have quite a few high-precision libraries, and every year we still get more than a thousand downloads of them. So, depending on your application, the precision needs are certainly different.
Okay, so I will talk about our recent effort to use lower precision in sparse direct solvers. The goal is to determine how low a precision you can use, and obviously we want to be safe: in the end you want to get a correct result. Another thing is that we always want to analyze the accuracy of the numerical algorithms under mixed or lower precision.

You want to be able to tell users what guarantees you have with this algorithm. These two goals go hand in hand, and you probably already know about iterative refinement in the dense matrix context.
LU-based direct solvers can safely use mixed precision, and if you do it properly you can get the desired accuracy. The good helper function is iterative refinement.

In the example code here, the methodology is: do the expensive part in lower precision, and for the cheaper operations use higher precision to recover the accuracy lost to the lower precision.
B
So,
for
example,
in
the
dense
lu
case,
the
factorization
is
expensive.
It's
the
order
n
cube,
so
you
want
to
do
maybe
single
precision,
but
then
this
ir
iteration
is
you
compute
the
residual
after
your
first
solution
and
solve
this
correction
term
and
add
back
this
correction
term
to
your
final
solution
that
you
can
probably
iterate
a
few
times
so
in
this
ir
loop,
you
can
see
that
this
matrix,
vector
multiplication
is
cheap.
You
can
do
double
precision
and
triangular
solve
is
not
so
cheap,
it's
but
still
cheaper
than
the
factorization.
B
You
can
do
a
single
precision
and
the
double
precision
here.
The
addition
is
very
cheap
relative
to
the
others.
So
look
at
this
column
you
can
see
in
the
dense
case
you
have
most
expensive
and
it's
n
cube
all
the
rest.
The
next
expensive
ones
are
n
cube
and
square
so
that
you
have
a
lot
of
room
to
do
the
cheap
operation
many
times
before
catching
up
to
n
cube.
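As a concrete illustration of the scheme Sherry describes, here is a minimal NumPy/SciPy sketch of mixed-precision iterative refinement on a toy dense system (illustrative, not the library code):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, iters=5):
    # Expensive step, O(n^3) flops: factorize once in single precision.
    lu, piv = lu_factor(A.astype(np.float32))
    # First solution from the low-precision factors.
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        # Cheap step, O(n^2) flops: residual computed in double precision.
        r = b - A @ x
        # Triangular solves reuse the FP32 factors, O(n^2) flops.
        d = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        # Very cheap step, O(n) flops: add the correction in double precision.
        x += d
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
b = rng.standard_normal(500)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))  # near double-precision level
```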
Sherry: But in the sparse case the situation is slightly different. For the standard 3D model problem, sparse LU costs O(n^2); the residual calculation is cheap, but the triangular solve is actually not so cheap: it's O(n^{4/3}). So the gap between the most expensive operation and the cheap ones is relatively small compared to the dense case.
Here is the ratio: in the dense case, expensive versus cheap is order n, so you have a lot of room to play; in the sparse case, expensive versus cheap is order n^{2/3}, so you have less room, meaning less room to do the iterative refinement. If you need many iterations, the cost catches up very quickly. Numerically, this algorithm is well understood.
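Stated as a worked ratio, using the operation counts quoted in the talk (n is the problem dimension; the sparse counts are for the 3D model problem):

```latex
% Room to amortize refinement = cost(factorization) / cost(one refinement step)
\text{dense:}\quad  \frac{O(n^{3})}{O(n^{2})}   = O(n), \qquad
\text{sparse:}\quad \frac{O(n^{2})}{O(n^{4/3})} = O(n^{2/3}).
```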
Sherry: We have already implemented this in dense LAPACK and also applied it to overdetermined least-squares problems; these were published a while ago. Recently some researchers have used an even better technique for iterative refinement: instead of the plain, simple refinement loop, you can run GMRES on this loop. It's still much faster in the dense case. All of this was demonstrated in the dense case.
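That GMRES-based variant can be sketched in a few lines (a toy SciPy example, not the published implementation; note the residual-tolerance keyword is `rtol` in SciPy >= 1.12 and `tol` in older releases):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, gmres

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 500))
b = rng.standard_normal(500)

# Low-precision factorization, used as a preconditioner for FP64 GMRES.
lu, piv = lu_factor(A.astype(np.float32))
M = LinearOperator(A.shape, matvec=lambda v: lu_solve(
        (lu, piv), v.astype(np.float32)).astype(np.float64))

x, info = gmres(A, b, M=M, rtol=1e-12, maxiter=100)
print(info, np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```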
Sherry: So now, moving to the sparse case. The first issue, as I mentioned, is that the gap between the expensive and cheap operations is relatively small, so you don't have much room to play. Another issue is that in the sparse case you need to do gather/scatter operations, which get in the way compared to the dense case. Here is an example from the SuperLU direct solver.
You want to factorize this matrix A into L and U. All the shaded blocks here are nonzero, and the blank ones are zeros, so you don't store those zeros and you don't operate on them. This picture shows the matrix mapped to six MPI processes, numbered zero through five, in a 2D block-cyclic mapping.
In the GPU implementation of this code, our strategy is to use the GPU in offload mode: the panel factorization stays on the CPU, and depending on the GPU memory size, if it's not big enough, we also keep part of the Schur complement on the CPU. So there is a split between CPU and GPU, and the split point is a parameter depending on how much memory you have; nowadays GPU memory is pretty big, so we can do a lot of the work on the GPU.
The main thing I want to mention here: if you look at the computational kernel, at every step in the sparse case you have to (1) gather the sparse blocks into a dense work array, (2) use this dense work array to perform GEMM, and (3) scatter the dense work array back into the remaining sparse blocks. Steps one and three don't exist in the dense case, but in the sparse case you have to do them. Using tensor cores for the GEMM, for example, is actually relatively easy; it's not so difficult.
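The three steps Sherry lists map onto code roughly like this (a minimal NumPy sketch; the block layout and names are illustrative, not the SuperLU data structures):

```python
import numpy as np

def schur_update(A_blocks, nz_rows, nz_cols, L, U, bs):
    # Step 1: gather the nonzero sparse blocks into contiguous dense arrays.
    Lg = np.concatenate([L[i] for i in nz_rows], axis=0)  # (len(nz_rows)*bs, bs)
    Ug = np.concatenate([U[j] for j in nz_cols], axis=1)  # (bs, len(nz_cols)*bs)
    # Step 2: one large dense GEMM on the work arrays; this is the part
    # that lower precision and tensor cores can accelerate.
    W = Lg @ Ug
    # Step 3: scatter the dense result back into the remaining sparse blocks.
    # Pure index arithmetic, no flops, so it sees no benefit from low precision.
    for a, i in enumerate(nz_rows):
        for c, j in enumerate(nz_cols):
            A_blocks[(i, j)] -= W[a*bs:(a+1)*bs, c*bs:(c+1)*bs]

# Tiny driver with one block column and two block rows.
bs = 2
L = {0: np.ones((bs, bs), np.float32), 2: np.ones((bs, bs), np.float32)}
U = {1: np.ones((bs, bs), np.float32)}
A_blocks = {(i, 1): np.zeros((bs, bs), np.float32) for i in (0, 2)}
schur_update(A_blocks, [0, 2], [1], L, U, bs)
print(A_blocks[(0, 1)])
```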
Hugo: Sorry, Sherry, it's been more than seven minutes. Can you wrap up in 30 seconds?
Sherry: Oh, okay, I'll pretty much stop here, so I'll just show you some results. For example, on the Summit machine we're currently using six MPI ranks and six GPUs, with each MPI rank driving one GPU. I have two sparse matrices, on the order of a million in dimension, and you can see that for the factorization time, double versus single precision, you get something like 48% to 50% faster just by moving less data: doing single precision means less communication in terms of bandwidth. And as I mentioned, gather/scatter actually takes 42% of the time in this case, and 35% in that case, so only about 50-60% of the time is in the fraction you can speed up with single precision, tensor cores, those kinds of things; the gather/scatter cannot benefit. The solve is relatively fast.
Hugo: Thank you, Sherry, for these nice results on iterative refinement.
I'm just telling the audience that you can ask your questions during the talks, and the panelists will answer once all of them have presented. Right now we have Chris J. Newburn (CJ). CJ is a principal architect who drives HPC strategy and the software product roadmap in NVIDIA compute software, with a special focus on systems and programming models for scale. Dr. Newburn is a community builder with a passion for extending the core capabilities of hardware and software platforms from HPC into AI, data science, and visualization. He's delighted to have worked on volume products that his mum used, and that helped researchers do science that previously wasn't possible. So now, CJ, on the mixed-precision tuning working group.
CJ: Thanks. I'm really delighted to have worked with Hugo and many others in a working group on the topic of reduced precision. The key thing here is that it's one thing to have cool hardware and ideas about what you could do in hardware, but what really matters in doing new science is making end-to-end connections: working with people who have the use cases, who are actually in the trenches doing the work, like all of you, trying to figure out how to apply it, and getting a dialogue going back and forth between end-user developers and our developers about how to do that. So we created the working group on this topic. First, Sherry, thanks for already covering the algorithmic-complexity part.
Another factor is really understanding where you can apply reduced precision. We decided to offer many different options in our GPU hardware, supporting FP16, FP32, and FP64, and I'll talk about some of the other variants we have. But understand, to first order: where do you need precision, and where do you need range?
Accumulators tend to want higher precision, more so than other things that people have experimented on with tools. If you find that your range is too broad, you can do some preconditioning: for linear systems, for example, you may be able to rescale to fit into the representable range where possible, and you can test the tolerance with introduced noise in that space as well.
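The rescaling idea can be sketched as a two-sided diagonal scaling before casting down (a minimal, illustrative NumPy example; production codes use more careful equilibration):

```python
import numpy as np

def scale_to_fp16(A):
    # Two-sided diagonal scaling R A C, with entries pushed toward O(1)
    # so the scaled matrix fits FP16's narrow exponent range.
    r = 1.0 / np.sqrt(np.abs(A).max(axis=1))  # row scaling (rows assumed nonzero)
    c = 1.0 / np.sqrt(np.abs(A).max(axis=0))  # column scaling
    As = (r[:, None] * A * c[None, :]).astype(np.float16)
    return As, r, c

# Solve the scaled system (R A C) y = R b, then recover x = C y.
A = np.array([[1e6, 2.0], [3.0, 1e-6]])
As, r, c = scale_to_fp16(A)
print(As)  # all entries now representable in FP16 (max magnitude ~65504)
```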
CJ: One of the things that people like John Stone have tried out is using fixed point for reproducibility. It's actually easier there: you don't have to worry about all the cases that have to be excluded, things are reversible, it can be easier for algorithm development, and there are a number of opportunities there.
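Why fixed point helps reproducibility, in a minimal sketch (the scale factor is an assumed parameter; integer addition is exact and associative, so the accumulated sum is independent of summation order):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(1_000_000).astype(np.float32)

SCALE = 2**20                         # assumed fixed-point scale for |x| ~ O(1)
fixed = np.round(x * SCALE).astype(np.int64)

# Integer addition is exact and associative: any summation order agrees.
print(fixed.sum() == fixed[::-1].sum())   # True

# Floating-point addition is not associative: order can change the result.
f1 = np.sum(x, dtype=np.float32)
f2 = np.sum(x[::-1], dtype=np.float32)
print(f1 == f2)                           # often False
```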
CJ: People have tossed out some ideas, like tuning the encodings for graphics APIs. People have thought about what you need in order to take advantage of hardware that can do a matrix multiply right in hardware, rather than just doing individual vector operations (we call them tensor cores), and you may need the right shape.
On the latest GPU, the A100 hardware, we have opportunities where you can eliminate half of the entries coming in if they're zero and mux those, for as much as a 2x performance gain. We've also looked at things with INT1 and INT8 for signal processing and so on. So there are lots of opportunities here, and you can see at the right, in teraflops, that we've gotten the best performance using an FP16 tensor core that accumulates in 32 bits.
C
If
you
didn't
accumulate
in
32
bits,
then
you
may
never
even
converge
but
larger
problem
sizes.
Things
tend
to
fall
off.
So
what
is
it
that
we
can
do
in
this
space?
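A minimal NumPy sketch of why the 32-bit accumulator matters: once an FP16 running sum grows large, each small FP16 addend falls below half an ulp and the sum stalls, while an FP32 accumulator stays accurate.

```python
import numpy as np

n = 4096
a = np.full(n, 0.1, dtype=np.float16)
b = np.full(n, 0.1, dtype=np.float16)

acc16 = np.float16(0.0)   # accumulate in FP16 (what not to do)
acc32 = np.float32(0.0)   # accumulate in FP32 (what the tensor core does)
for i in range(n):
    p = a[i] * b[i]                            # FP16 product
    acc16 = np.float16(acc16 + p)              # stalls once acc16 >> p
    acc32 = np.float32(acc32 + np.float32(p))  # stays accurate

print(acc16, acc32)   # FP16 sum stalls near 32; FP32 sum is ~40.9
```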
CJ: One of the key things: I kind of started with this, and I actually know many of you in the Berkeley community because I helped start the IXPUG effort back when I was at Intel. This is in a similar vein: gathering together people who are passionate experts to share what works, show how much it helps, share reproducible results, try things out, give and get feedback through community discussion, and be inspired by a range of different applications.
I think the group is well over a year old now, and we'd like to invite anybody who's interested to join: take your turn at spending a half hour or longer presenting some of the work you've been doing and working through it. Our next session is next Tuesday, when we'll be talking about a particular interface we have, CUTLASS, for making better use of the tensor cores. Kate Clark also just started a Slack channel on mixed precision, so you're welcome to join that.
There are a number of libraries and frameworks you can try that make it easier to use reduced precision behind the same higher-precision interfaces. In the DL frameworks, people have been working on this with something we have called AMP (automatic mixed precision): you can just throw a switch and, lo and behold, you get a whole lot more performance.
We have iterative refinement in cuSOLVER, as you were referring to, Sherry. We've done this with QUDA, which is used for Chroma and other apps, and with cuBLAS and cuTENSOR, where you can drop mixed precision in. I expect we can come back to this, perhaps in a broader discussion, but I wanted to offer some highlights of opportunities with iterative solvers, multilevel summation, and figuring out where the precision really matters.
We've had some discussions where physicists know, hey, you really need to care about the water, or this virus has a particular molecule, but it's kind of inert: you have to include it in your model, but you really don't need to worry that much about what's going on there. So treat that with lower precision than something else that's really operative. So maybe we need some sort of new forum, I don't know, for having more communication there, for being able to analyze: what is the science?
What are the physical limits? How can you work through the subsets of species that you should care about versus not? What matters in the algorithm?
How sensitive am I to variation across data sets if I'm trying to use profiling, for example, as Hugo has done with the tool we mentioned, to analyze those inner sets? What kinds of interfaces do we need, and how do you automate this? How do I measure stability in terms of the number of iterations, or whether it converges at all?
Can I measure the noise tolerance by introducing some noise and looking at the results across your validation data sets? And can we work up a catalog: here are the different interfaces for these different operations, and the different precisions for these different SKUs or hardware generations, so that developers know what's at their fingertips to work with.
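Measuring noise tolerance by injection, as CJ suggests, can look like this minimal sketch (the kernel, noise level, and tolerance are illustrative placeholders):

```python
import numpy as np

def simulate(x):
    # Placeholder for the real kernel under study.
    return np.sum(np.sin(x) ** 2)

rng = np.random.default_rng(3)
x = rng.standard_normal(10_000)
ref = simulate(x)

eps_fp16 = 2.0 ** -11        # roughly FP16 unit roundoff
rel_errs = []
for _ in range(20):
    noisy = x * (1.0 + eps_fp16 * rng.standard_normal(x.shape))
    rel_errs.append(abs(simulate(noisy) - ref) / abs(ref))

# If the worst case sits far below the validation tolerance,
# FP16-level input error looks safe for this kernel and data set.
print(max(rel_errs))
```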
Hugo: Our next speaker is Cindy Rubio González. Cindy is an associate professor of computer science at the University of California, Davis. Dr. Rubio González's work spans the areas of programming languages and software engineering, with a focus on program analysis for automated bug finding and program optimization. She is particularly interested in the reliability and performance of systems software and scientific applications. Now, on dynamic analysis for floating-point precision tuning: Cindy Rubio González.
Cindy: Hi, can you hear me? Yes. Thank you for the introduction, and thank you for the presentations before mine. It's not hard to convince everyone that floating-point arithmetic is used everywhere. Unfortunately, reasoning about floating-point programs is often difficult, given the large variety of numerical problems that can occur in these programs, and the fact that most programmers are not experts in floating point. Because of this, a common practice is to use the highest available precision, which often leads to poor performance.
The goal of our work is to develop automated techniques to assist programmers in tuning the precision of their floating-point programs. The idea is to systematically search over the types of floating-point variables to recommend a type configuration that specifies what type to use for each variable declaration.
The goal is that the resulting program should still produce an acceptable answer while being faster than the original program. Let me illustrate our program transformations with an example. Here is an excerpt of a C program that computes the arc length of a function g.
So
I
will
not
go
into
the
details,
but
I
want
to
just
point
out
that
this
program
uses
long
double
precision,
and
here
we
have
been
told
that
by
an
expert
that
there
is
a
mixed
precision
program
that
produces
an
answer
as
accurate
as
the
original
program
while
being
faster,
and
here
is
a
program,
so
the
orbital
structure
is
unchanged
aside
from
some
variable
declarations
and
function
calls.
Here, for example, variable s1 is an accumulator, so it remains in long double to avoid accuracy loss. However, the precision of the remaining variables has been lowered to either double or float. Furthermore, the call to the square root function has been replaced with the corresponding single-precision implementation, sqrtf. Unfortunately, even for small programs like these, it is infeasible to find these type configurations by hand.
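The flavor of the example can be reproduced in a few lines (an illustrative Python analogue of the C program; the real study keeps the accumulator s1 in long double while other variables and the sqrt call drop to single precision):

```python
import numpy as np

def g(x):
    # Toy integrand; the study's real g is a C function.
    return x + np.float32(0.5) * np.sin(np.float32(2.0) * x)

def arclength(n=10_000):
    h = np.float32(np.pi / n)     # step size lowered to float
    t = np.float32(0.0)
    s1 = np.longdouble(0.0)       # accumulator kept in long double
    y_prev = g(t)
    for _ in range(n):
        t = t + h                 # float32 arithmetic
        y = g(t)
        # np.sqrt on float32 stays in float32, mirroring the sqrt -> sqrtf swap.
        s1 += np.longdouble(np.sqrt(h * h + (y - y_prev) ** 2))
        y_prev = y
    return s1

print(arclength())
```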
Cindy: Some of the advantages of Precimonious are that it considers both accuracy and performance, and, because it is black-box, it works on medium-sized, non-trivial programs. It is easily configurable, because you can specify which areas of the program to focus on if you know them, and in our initial evaluations it gave speedups of up to 40%.
The downside of this technique is that it explores multiple configurations during the search, and each of them has to be evaluated, because simply lowering precision does not mean the program will be faster. Also, Precimonious only explores a subset of the search space, so the ordering of the variables affects which parts of the search space are examined. To address some of these challenges, we developed a couple of other tools.
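In spirit, the dynamic search looks like this greedy sketch (illustrative only; Precimonious itself uses a delta-debugging-based search, and `run` stands in for recompiling and re-running the program under a candidate type configuration):

```python
import numpy as np, time

def run(config, x):
    # Stand-in for recompiling and re-running the program under a type
    # configuration; two "variables" may each be float32 or float64.
    t0 = time.perf_counter()
    a = x.astype(config["a"])
    total = np.cumsum(a, dtype=config["b"])[-1]
    return float(total), time.perf_counter() - t0

x = np.random.default_rng(4).standard_normal(1_000_000)
config = {"a": np.float64, "b": np.float64}
ref, best_t = run(config, x)

for var in list(config):                       # greedily try lowering each variable
    trial = dict(config, **{var: np.float32})
    out, t = run(trial, x)
    accurate = abs(out - ref) <= 1e-6 * abs(ref)   # accuracy threshold
    if accurate and t < best_t:                # keep only if accurate AND faster
        config, best_t = trial, t
print(config)
```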
Cindy: We then worked on HiFPTuner, a white-box, hierarchical approach that groups variables based on their usage, so that we only consider type configurations that can actually lead to a speedup, focusing on configurations that do not introduce many cast operations. This still requires program profiling to find dependencies among variables, but it resulted in a considerably faster search and in configurations with higher speedups.
Despite the disadvantages I briefly discussed, the tools I presented are the state of the art in dynamic precision tuning, but there are still challenges to overcome. For example, the type configurations the tools propose rely on the inputs used during the tuning process. Is this a problem for HPC applications, or are there other correctness metrics the applications use that we could leverage to make more general guarantees about these configurations? Also, the current approaches do not scale to long-running applications, those that need to run for hours, days, or weeks, so we need to find further ways to reduce the search space, or incremental ways to apply these techniques.
Third, even though these configurations lead to a speedup, we do not know how far we are from the ideal configuration we could find. Are there other program transformations we could explore, in addition to changing variable declarations and function calls? Is there domain knowledge in the HPC application being tuned that we could use to guide the search? That would be very helpful to figure out. And finally, as a tool developer:
I think this is a great opportunity to connect with application developers and create a collaboration, to put together a collection of applications that we could use for further inspiration and for finding the bottlenecks in these tools.
As I said, we're actively working on these tools, so it would be great to connect with developers who have access to HPC applications that could benefit from tuning. I would also like to make a quick announcement about a workshop I am organizing with Ignacio Laguna from Livermore at Supercomputing. If you are working on topics related to mixed precision and correctness, we welcome your submissions.
Hugo: Thank you, Cindy. That was a nice overview of tools for searching the space of type configurations, and I think it is complementary to what we saw before, which was more in-depth mixed-precision tuning of a specific, broadly used library like SuperLU. Do we have any questions from the audience? Yes, one here: in terms of getting the solver to converge and getting performance?
Sherry: Yes, so the code is very complicated. We started with the double-precision code, double precision and double complex, and we have automatic macros to mechanize the code base and generate a single-precision code. But compared to dense code, there are just a lot more different pieces going on, much of it not related to floating point. A lot of these gather/scatter operations are not floating-point operations, so you need to take care of all this indirect addressing correctly. It's quite an engineering effort, and a lot of that engineering doesn't show up in the dense code.

We can gain some, but I think it's limited.
Hugo: Okay. And we have seen, for example, CJ, you mentioned during the working group that there has been a lot of research on optimizing specific libraries. Do you think, all three panelists, that the path to bringing mixed precision into general applications is to tune specific libraries, or can we leverage, as Cindy said, some metrics from specific applications and apply them to the general-purpose case?
CJ: Yeah, our experience in general has been to follow the bang for the buck. If you're going to make an investment, find something that is common to lots of users, so that you're solving lots and lots of people's end problems. Going after that in library form first makes it easier to justify a larger person-power investment in really making that one thing shine, and while you're doing that, you're likely to learn a bunch of things.
Cindy: I was just going to add that I completely agree that tuning the lower-level building blocks is a first step. I also think there will be optimizations specific to applications that we can always leverage, but that also makes things even more complicated, because often they are not general; they are very application-specific.
CJ: Just to plus-one that: DSLs can really be your friend. Where, again, there is broad applicability within a given domain, using a DSL, as many people at Berkeley and other places have done, and focusing there first, where you're dealing with a set of constraints you can operate within, can be very fruitful.
CJ: All right, I think it's interesting: if you look at what we did in this case, we went for FP16 first and then backed out to higher precision. The lower precisions were good enough for a lot of DL; then we backed off to the higher precisions that are really needed for a lot of HPC, and we kind of compromised with TF32.
I actually skipped a slide as I was talking about that, about where you can use that 8-bit exponent range to get kind of the best of both worlds. Is it okay if I share it? I can't talk and do this at the same time very well.
In this one, different things help in different ways. We found that going back up to doing things at higher precision was helpful, and that finding a compromise was very fruitful: with the TF32 tensor cores you get an 8-bit exponent range like FP32 and bfloat16, but 10-bit precision like FP16.
We also see opportunities, as I've mentioned, with single-bit arithmetic. There was some Gordon Bell work in 2018 that did really well with one-bit arithmetic in an adjacency to HPC, signal processing for radio astronomy. So we do think there are opportunities up and down, and I think we need to explore that whole range: starting in the middle and going up and down from there.
Cindy: I just wanted to add the aspect of correctness here. Yes, I agree, we have seen this trend from ML pushing for mixed precision, but I believe ML applications have very different requirements in terms of correctness, and that is something that usually comes up with all these automated tools: we need to know what correctness means, and these tools are driven by a subset of inputs. So how much error are we willing to tolerate?
Hugo: Yes, that is true. The correctness criteria are sometimes very difficult to get from the application developers; it's a hard problem. We have a question here, which I will ask Sherry: do we need mixed precision because of the convergence of HPC with AI workloads, or because HPC or AI workloads are more thirsty for performance than we can currently deliver? I would also reformulate the question: what pushed you to use mixed precision in SuperLU?
Sherry: Yeah, so for us the motivation is really to reduce communication. If we can use a lower precision, the memory accesses will be smaller and the communication will be reduced by half. That's the real motivation. And if you're talking about HPC and AI together, I think the TF32 format is really attractive, because it doesn't reduce the range, which is really needed for most HPC applications.
If you have only a five-bit exponent, it's really limited. Very often you can do some trick, like balancing the matrix with equilibration, but in most cases those techniques are not helpful. So five bits is too limited; instead you reduce just the mantissa to do things faster while keeping the same range. I think that's a very good compromise between both worlds, and it's also convenient for the application people.
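The trade-off Sherry describes can be checked quickly with NumPy's finfo (FP16 and FP32 are native; the TF32 and bfloat16 layouts, which NumPy lacks, are listed in comments from their published bit widths):

```python
import numpy as np

for t in (np.float16, np.float32):
    fi = np.finfo(t)
    print(t.__name__, "exponent bits:", fi.nexp,
          "max:", fi.max, "eps:", fi.eps)

# FP16:  5-bit exponent, 10-bit mantissa -> max ~6.5e4 (narrow range)
# BF16:  8-bit exponent,  7-bit mantissa -> FP32-like range, less precision
# TF32:  8-bit exponent, 10-bit mantissa -> FP32-like range, FP16-like precision
```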
Hugo: We're seeing more and more hardware floating-point formats, like TF32, maybe driven by application needs. Do you think there will be more need for specific floating-point types like this, and do you think the hardware will answer, like it did with AI?
Sherry: Maybe CJ can answer this, but my impression is that, for example with TF32, there are already tensor cores for this format, and that helps speed things up dramatically, right?
CJ: Maybe just to take that in a slightly different direction, coming back to Florina's question. One of the things we're seeing, and I'll cite HPC at the edge as an example, is from visiting people at some government labs, where it's really important to make the best use of very, very expensive equipment for taking measurements, whether it's microscopy or whatever it might be. It's very easy to get completely bad data because you didn't set the experiment up right.
It's also possible to take only a few samples of something that was really interesting and surprising, but not discover that until much later, when it was sent back to the scientist and they said, oh man, that was really cool, please do this experiment all over again. A lot of folks are telling us they really want a very quick turnaround, almost instantaneous feedback.
What that means is that we're, if you will, downloading a model used for inference into the instrumentation pipeline, and we're also doing other kinds of HPC processing; we might be taking video and doing feature extraction and so on. So one of the connections here is having the same hardware that's both really good at HPC and really good at AI, available at any scale, so you can run it out in smaller instruments near the edge.
Maybe it's at a base tower, where you're looking for a lost person or whatever, and you push out something that's really good at recognizing that person wearing a red jacket; or you can move it back into the data center. Being able to make that trade-off in the same kind of programming model, running on the same kinds of hardware that may just be available at different scales, is pretty cool, and having that sort of model there wherever it's needed throughout your whole system opens up a lot of opportunities. So that's maybe a different perspective.
Hugo: Thank you, CJ. On the question I just asked Sherry, I think both of you can also give your point of view about the convergence of HPC and AI and why we are using mixed precision. Cindy, do you want to give your view on this?
Cindy: Yes. Sorry, is this also about Florina's question, about what is driving the need for mixed precision? Okay. From my point of view, developing these tools, it seems like the main driving force has been performance, and then, by using mixed precision, we have to be aware of any additional numerical problems. That is my take.
So it's not so much about doing mixed precision because of the needs of the convergence; it's mostly about achieving as much performance speedup as we can, and then being aware of what other problems are introduced because of it.
Hugo: So I think we have a complete view on this problem. Thanks, CJ, for that perspective. Do we have another question from the audience? Okay, we will finish soon. If you have a last question or a debate you want to bring up among yourselves, go ahead; otherwise I have one more question I would like to ask.
This one is a bit specific to Sherry's application. You said it is bandwidth-bound, because it's sparse linear algebra, and that mixed precision helps you reduce data movement. We have been talking a lot about making computation go faster with tensor cores on NVIDIA hardware, but for such an application, tensor cores are maybe not the solution, because of the bandwidth aspect. Have you considered using MPI communication with reduced-precision data?
Sherry: Yes. For the single-precision code I was showing results for, that already uses MPI float for the communication.
Hugo: Maybe tricks with compressing the data?
Sherry: Yes, that's a very good question. For compression there are some tools; for example, Livermore has this tool called, I think, ZFP, a compressor. We don't know yet; we haven't experimented with its impact on the final result.
Hugo: Another axis of research. Getting back to what Cindy said: how do you define the correctness criteria for an application? For some of them it's quite obvious, when you have the significant digits of the result and your solver must converge to that many significant digits; for some others it's a bit less obvious. People can join the NERSC users Slack channel, on the mixed-precision channel, and we can talk about it further. With that, I thank all the panelists.
I want to tell the audience that we will have a last talk at the end of the day, wrapping up the two GPUs for Science days, and we will share a form with you asking what you think about these two days. Again, thank you very much, CJ, Cindy, and Sherry, for this instructive panel.