National Energy Research Scientific Computing Center (NERSC) General NERSC Info, 28 Mar 2013

Previous Meeting

⏯

youtube image

►

From YouTube: The Future of High Performance Scientific Computing

Description

The Future of High Performance Scientific Computing presented by Berkeley Lab Associate Director of Computing Science Kathy Yelick at NUG 2013, the annual meeting of the NERSC Users Group.

A

We welcome everybody to the lab. This was a interesting request to talk to the nug, not as the nurse defender so sudeep's, going to tell you all about the future of nurse and I'm, going to tell you about some other things and putting some stuff I've been kind of related to work that I've been doing in research, but to try to tell you where but I think the important problems are in computing as you look forward towards exascale platforms, and things like that, I did want to just welcome all of you to the lab whoops.

A

This is not gonna work, I, just that's! Okay! This is mine, and so here we are at the lab and I.

A

Think if you were like me, you might have gotten stuck on the way in the entrance this morning behind a cement, mixer and that's of course, because we're building a new building and I'll say a little bit more about that in a minute, and you know I, so I won't, since most of you I think almost all of you, not everybody, but almost all of you are Berkeley Lab people I won't go through some of the slides on the Berkeley Lab, except to say that if you look at all the initiative, major initiatives across the lab, they really all do involve computation, as well as other science disciplines and so there's images from some of the work that goes on across the lab or related computational problems to work that goes on in photon sciences, energy and environment, Biosciences, cosmology and physics and and of course, computing itself.

A

So what am I doing now that I'm no longer nurse director I'm at the associate lab director and what that means? Is that I'm actually in charge of the nurse es net and the computational research division? So I think most of you know that as well, but I did want to put in one kind of plug for or discussion about, es net, because I don't think the nurse people are necessarily often layer with es net and for those of you who are at remote institutions or are involved in science projects at other institutions.

A

It's important to understand that the network is also an instrument that you can use in science and at one. It is one that you might want to work with people at nurse who can also hook you up with people at es net in order to figure out how to move large datasets around. So yes, net has just recently gone through an upgrade yes net. Five is in production, which is the first 100 gigabit per se, Transcontinental or continental scale network. It has upgrade capacity in the form of dark fiber. But what really this is?

A

Is it's a it's not like you know, going out and buying Internet services from AT&T or Verizon or one of those companies, because it's really about the ability to send large data sets around, and so, although it is the network that connects all of you to all the other labs and then out to the rest of the internet, because there's all these agreements of peering partners and so on that they that connects IES net to all the other networks in the world.

A

This there are facilities that you can use to really send huge data sets around, and you know the I guess. One of the numbers that I don't have on here is what happens if you tried to send these large data sets like terabyte data sets around on another network. They were I, think Eli dart was recently doing a row. Brian Tierney we're doing a project where they were transferring data between Oregon and Berkeley Lab, and they found that because of some of the loss and the internals of the network.

A

So normal networks will drop packets once in a while and things will get retried and it works fine. But you try to do this with a large science flow in the middle they saw the the the science bandwidth the bandwidth slowed down by a factor of 80. So why is this important to you? If you have a large data set at nurse that you want to put someplace else or another data set someplace else that you want to bring to nurse?

A

You really want to make sure that that network path between the two facilities is well optimized. There's a bandwidth reservation service in the yes Nate called Oscars, and if you're sending large data sets around- and you don't know about that, you should there's also information about how to set up a you know: data transfer nodes or things we have at nurse this science dmz idea to really build a high-speed connection on the other end. So there's information about how to do that. So a little bit about Big Data within do II.

A

So there's a picture: the artist rendition of the new building and some of the specs over there about how how big it is and how expensive it is, and things like that, it's a very it's a very efficient building and it's going to be a very efficient data center scientific computing center. It is one of the probably the most energy efficient ones within do e be most within the office of science, and that is because the temperature that you experience outside is the temperature.

A

Almost a year round here at Berkeley, it's very rare to have very hot temperatures, which means you can use the ambient air temperature to actually cool the computers in the building. And you know we did a little analysis- a back-of-the-envelope calculation in 2010 that was when Hopper was it was when we just had Franklin I had haven't updated this yet for just the hopper configuration, but we had about 200 publications, 250 publications per megawatt year. Ok, so I challenge any other computing Center to produce that many papers in a megawatt year.

A

So this is just abiding the number of publications, but a number of megawatts. We use that year and you know there's a lot of ways. We can look at efficiency and that's what's going to come up when I talk about the future and what we're worrying about in the future machines. So enough about the lab, by the way, I'm also very happy to answer any questions that people have, because it's a small audience so might as well.

A

Ask you take advantage of that all right, so I wanted to talk a little bit about the future of computing and kind of start with by looking back at the past, and I was just actually talking to a computer scientist in human-computer interfaces. We can really cool problems. You know how to get rid of laptops entirely by just projecting things. On your on your hand and and using you know, motions of your fingers.

A

So you don't even have to carry a laptop around anymore and he said oh, but my computer's fast enough, I really don't need anything faster than that which is a common perception. I think that most computer scientists have- and so I like to do this little thought exercise where I say well. What are two of the things that we really care about?

A

Everybody in the world cares about even if you're, not a computational scientist, where you might understand more about the importance of high-performance computers and that is and I've some kind of a smartphone- it's let's say an iPhone and a and searching Google. So if we so in 2013, you know these are commonplace. We use them all the time we use them multiple times a day and if you roll back into 1993, you know what do these devices look like? Well, you know certainly there's a lot of really creative user interfaces.

A

These things there's creative algorithms in both Google. You know, there's asymmetric eigenvalue problem inside of the the page ranking algorithm. There's a you know, there's a bunch of speech, recognition and so on in the iPhone. But if you roll back 20 years, you end up with you know the nervous supercomputer on your in your hand right.

A

So so it you, you needed all those creative people, all those creative algorithms, all that new software, but you also needed faster, smaller, cheaper, denser computers, and so you know, as we move forward people say: oh, we don't need any faster computers, but you do in order to get innovations like this. You really do need computers that are gonna, be much faster, much smaller, not much cheaper and Google. So you know how much you know: what would you need to have Google? Well? First of all, you need a few gigawatts of power.

A

So where do you get a few gig say 30 gigawatts of power, which is about what you would be using in power in 1993. If you tried to build a Google, Data, Center kind of estimated for what you think is inside a single Google, Data Center well, Google, of course, is a green company. They like to advertise. You only use green power. So where could anybody tell me where can you find, let's say, 20 to 30 gigawatts of hydro power, so cheap Green, hydro not necessary, but Green, hydro, power, Canada.

A

Three Gorges Dam in China, so at least that's that's one of the biggest hydro plants. I could find available, and so that's a lot of. What's that not in 1993.

A

But if we halted the progress on computers in 1993 and just had the well progress as such, as is in other things, then that's where we could get that much hydropower. So so now I thought exercise rolling, 20 years forward, and this is always really dangerous because it's really hard to make these kind of predictions. But the first prediction is: there's no personal computers, there's no departmental computers.

A

There are only client devices which are embedded, perhaps and as I said, but you know people are trying to get rid of keyboards and they're trying to get rid of screens, and so we may not even really see computers and there's and then there's the cloud right so and the cloud is we're, including you know, we don't like to call norske cloud, but it's there. It's a place that you can do scientific kind of computing and we don't travel very much because we do a lot more telepresence.

A

Wouldn't that be nice lecturers teach to millions of students. You know we're running one of these MOOC courses on campus at UC Berkeley. You know these are courses that have tens of thousands of hundreds of thousands of students in them. It's a really interesting teaching experience from what I've been told they haven't tried one yet, but don't ever one of the rules is never hand out a homework assignment. That has a mistake in it. That's the first rule of teaching a MOOC, so you know theorems, you know theorems might be proven online.

A

If you aren't familiar with this thing, you look at a webpage called polymath, so you know there's sort of more automatic thing, maybe that one's a little bit more of a stretch- users never login to the nurse system. So this I think is actually going to happen sooner and it's already happening today right. There are a lot of people that actually use a nurse that don't directly log into the systems. Probably most of you who are nursing users in this room actually do login and you submit jobs and things like that.

A

But there are other people that are coming in through science gateways that are using that are actually just accessing the data that comes from nurse simulation, so the climate community, for example, is you know long ago, given up in the idea that everybody who is doing climate science is running jobs at nurse or certainly writing the code, a nurse or other places there they're looking at the data that comes from these, so we have to think a nurse has been really trying to to figure this out how to support more users kind of indirectly who we don't even get to count as user.

A

So we've had a big debate about how we count all these people that use nurse indirectly, computers Intuit what jobs should be run so, okay, this one might sound kind of you know crazy too, but the you know this also is sort of that idea behind some of the gateways. So if you look at something like the materials Genome Project, where you've got tens of thousands of simulations being run, it is not unreasonable for some algorithm to say you know: here's a space of the here's part of the design space.

A

If there's enough structure on it that we think should be filled in by simulations or the user asks queries coming in from a web interface about some particular material and a bunch of jobs, get run and run based on what that is. So you know the idea that you're not directly logging in and submitting batch jobs and so on is I. Think not not such a crazy idea. No users actually visit the other user facilities either. So it's already the case. Why do we have so few people in the room?

A

Because people don't really come to nurse to use nurse right? They, mostly all of you just log in remotely, and what surprised me I was I, was at a meeting where somebody was talking about big data and data and science and data and medicine, and things like that, and this person was actually doing medical experiments and I figured while medical experience. You have to be there right, you have to be there with the subjects and they said well.

A

Actually you can outsource experiments and mouses to China, so you can go and basically rent a mouse or buy a mouse in China or a whole bunch of them.

A

So you buy a hundred mice and you say, run this experiment for me and they say well, you know what, if you don't trust the labs, because you don't know you know who are the people that are running these things and you say: well, you buy three sets of 100 mice in China at different labs and you run the experiments and you get the results back and that people are doing medical experiments that way as well, and we see this at the advanced light source that there is an increasing trend towards moot.

A

You know sending the material or whatever it is to the light source and having somebody there run it and it changes the model of what the user facilities are. So we need to think about what this means for the kind of science that dearie does for big team science and so on and I'll leave that for you to think about okay.

A

So my next kind of high level discussion is about the world of high-performance computing and the politics of it, and this kind of big data versus exascale discussion that has been going on for a while. Now and unfortunately, it's there, it's been cast as a versus in the in the discussions, but I think it's important to go back and think about with India. We, where did where did high performance computing kind of grow up in terms of the the growth of the HPC program with India, and it started with I.

A

Think the comprehensive nuclear-test-ban treaty on the NNSA side, which really said that you have to use modeling and simulation because we're very restricted and kind of experiments that we can do and so that sort of said, the balance between doing data analysis on the one hand and simulation, on the other hand, kind of shifted more towards simulation within the NSA and I. Think that the dothis of science took advantage of that and the Oscar program and said yes.

A

Well, there's a lot of important size problems that can be done with simulation in science as well, and the the focus has been on simulation rather than on data analysis. Now at the moment, whoops, because you know, there's there's these huge growth in data rates coming from CCD technology coming from sequencing technology and so on. We're seeing a shift towards data analysis that there are big data problems that are coming from things like the next-generation light source plans here from the Bell to experiment from from the sequencers at JDI and so on.

A

So it doesn't say that simulation is no longer relevant or that data analysis is the only thing we should be worrying about, but that there is a shift and that we we do need to focus on or that data analysis was never done before, but but just that I think there is a an important shift, at least within the Office of Science, where there are huge experimental facilities.

A

Producing huge data sets, ok and by the way it doesn't really matter what the balance is up here, because both of these things rely on having faster computers and that kind of goes back to your the little iPhone and Google exercise. You really need to have faster computers up or in cheaper, more plentiful computation in order to solve some of these problems. Ok, so, let's see I'll say a little bit about the kind of science, transit and I think these are some of different examples. Then so deep will use.

A

But these are these are examples from nurse but I like to think of the science that we all do at computation as being divided up between large-scale science that is petascale up to exascale simulations. What I call volume of science and volume which is about running massive numbers of simulations? Some people call it capacity computing, but it I think it's actually something a little bit more well defined than that which are people that want to run ensembles.

A

That are very that are closely related to each other, and we need to have support for those kinds of ensemble simulations, whether you're doing uncertainty, quantification or some kind of screening through biology data or materials data or what and then the data analysis side of things where you've got huge data sets. So you know things get oversimplified and I think the one thing I want to make sure everybody in this room understands is that exascale is not only about the top thing.

A

It's about technology needed to solve any of these problems that require more computing performance. So science XK limit models, of course, are very large-scale computations, although they also do run a large number of them. This is just you know a slide about some of the the history of climate modeling at nurse nurse has been involved in client. The IPCC runs a climate since the, but certainly since a are for- and you know going forward, why do we care about faster computing and climate modeling?

A

Well, one of the examples is because you want to do cloud resolution you. If you want to resolve clouds, you need to have a computer that is significantly faster, I, think Gil, isn't here, but I Gayle, compo who's, doing more of the data analysis side of climate change. When we were talking at the BER requirements workshop, he also mentioned that he needed you know about a hundred times more computing power in order to analyze to reconstruct datasets.

A

So he's doing this 20th century re analysis, which is reconstructing data from from very sparse datasets that exists, and he said in order to get some of the effects of get things like cyclones and stuff in back into the reconstructed data he needs to have faster computation, because what happens right now is you're kind of averaging over this very sparse and very noisy data set, and you you'll average out some of these kinds of interesting local events. So here's the materials genome, one and I won't say a lot more about it other than you know.

A

The the goal here is to is to increase the to decrease. That's the amount of time that it takes to get from the design of a new material into manufacturing to cut that in half it's about eight eighteen months, I think you should say right there. The the the delay that doesn't design time is eighteen years David. It is 18 years, yeah, sorry and the you know.

A

The idea is to search through a whole space of related materials and cut down the the interesting part of the space, so that you're not you're going when you go back into the lab and synthesize things that you're not actually or searching for things that exist, that you're not searching through the entire space, and so this gets into the case where you really want a sophisticated interface to being able to drive the simulations. You don't want to submit each one of these jobs one at a time and then in the genomics area.

A

You know the the move of jgi computing into nurse has been an important thing that has affected all the different parts of nurse.

A

That group that came from jgi is now fully integrated into nurse and has affected the kind of systems there's now you know these cluster systems that have been that have been built up for specifically for the jgi workload, but we're also seeing that we're using that for other parts of the space, whether it's physics or other other areas where people have the need for just sort of generic clusters, because they're doing either smaller scale.

A

Things or more data and analytics problems and the knowledge base is this I guess project that's been going now for about a year, I guess within do-e and BER, which is to try to build a to take all of these different tools for analyzing biological data and to put it into a single plate space.

A

So this is the growth historic growth chart of machines that are actually on the Gordon Bell prize list, not on the top500 list, and you know the kind of major transitions that have happened in here were that was the attack of the killer micros, which caused us to go from vector, supercomputers to mpps and then into about 2004 the rest of the world, getting K parallel computing, which is when you stopped Casias the clock, speed scaling of individual processors instead went to multi-core and then the what were John Shelf likes to call the attack of the killer cell phones, which is where you have to really look at more energy, efficient devices and more energy, efficient processors, and by the way, if you look at the top 500 numbers on the same graph, it looks roughly the same on a large scale, so the weather, whatever you're looking at here.

A

You do see this kind of growth in the computing, the computing performance, both from the Linpack benchmark, but also from the gore bell prizes, which are by the way, also, of course, very highly optimized codes. Now I want to make one kind of side comment about the cost of running nurse and the cost of cloud computing, because there is because the cloud providers like Amazon Yahoo Google has done an incredibly good job of making. You think that cloud computing is free, it's only 10 cents per core hour.

A

It sounds really cheap till you give them your credit card and you start running some jobs on it.

A

And then you realize how quickly you can rack up a pretty hefty bill and as part of the Magellan project, which I think we've we've talked about before at least one of the nug meetings, but just to remind you, the estimated total annual cost of running nurse if we were just to buy the computing in the cloud would be about two hundred million dollars a year and by the way that doesn't account for the fact that many of your codes would run a substantially slower on the current commercial clouds and they would in the in the other than they do on the nurse systems, and these are kind of list prices.

A

So they do overestimate the cloud costs, but they also underestimate the cloud costs in many different ways. So, as I said, it doesn't measure the slowdown and it doesn't take into account that you don't get any consulting on the in the cloud for that or scientific computing experts. There's no real account management, there's no software support and all of those things are about a third of nernst budget and and further. You know. Why is this true?

A

So so you know: why is it that Google can't provide computing more efficiently than nurse can because they have a larger scale of computing? Then the nurse and the idea is economies of scale, and the answer is they probably can they can actually buy computing infrastructure at slightly less than nurse? Although nurse cos pretty far up the efficiency curve right, we're also already buying a very large scale systems buying very large quantities of power.

A

The power here at the hill is very, is actually pretty green and also very inexpensive relative to what you would pay and say a traditional commercial setting. So nurse has many of the benefits of cloud computing at scale, but we run at much higher utilization, so over 90 percent utilization, whereas most the cloud facilities are struggling to get over about sixty percent utilization. Many of them run much lower than that and the curse the cost per core hour.

A

You know what from when I started at the end of 2007, actually technically as January of 2008, but in 2007, so that was when Franklin was installed. It was in October of 2007 until we installed hopper and latias last year, when we've had both hopper and Franklin running for a while that the number of core hours went up by a factor of 10 in that in that four year period. In that same period of time, the cost of buying a core hour at Google or at Amazon sorry and their ec2 cloud dropped by 15%.

A

So you know a factor of 10 versus 15%. We all understand that difference. Okay, so I think you, many of you have seen this slide. This is the canonical exascale slide.

A

It says that the main problem we have in getting to exascale is about performance Ansari's about power and how to make machines how to make it possible to actually build a machine that you can afford to turn on, because if you just look at Moore's Law scaling you'd have about 200 million dollars in in power cost just to run just to pay for the power bill at risk. So I'm now going to sort of switch and talk about.

A

You know what I think all of you who are writing codes and worrying about the next generation of architectures and what these systems will look like should be thinking about in terms of the future of these these codes and what are the problems? And the first problem is that communication is very expensive, it's expensive both in time and that's the little table up there in the upper right. It's those are the the annual improvements in 14 point operations per second, which is 59 percent in bandwidth and in latency.

A

Now you say: butBut flops stop getting faster right in 2004. You just told me that, but this is the throughput rate of a single chip, has continued to go up roughly by 59 percent a year. It slowed down a little bit, but- and we are going to have a problem in the next 10 years or so when we're gonna start we're going to start running out of transistors scaling as well.

A

But the the bottom graph then looks at the amount of energy that's used to do different operations within the computer, so this is in Pico joules and if you're doing, arithmetic you're there today at around 100, Pico joules and projecting forward at more like 20, Pico joules accessing something and a register is significantly less energy. But as soon as you go off chip even to local dram memory, you're up to one to two orders of magnitude more in terms of the energy consumption.

A

So given that the problem of exascale is really about saving energy, we need to minimize the amount of data movement. So we also have to be careful to separate bandwidth problems, which is the number of words being moved and from latency problems, which is the number of separate messages that are being moved and, and these things are hard to change. The latency problems are really about physics right. You can't get any better than the speed of light across the machine.

A

Room and bandwidth is about money which Sudeep now understands very well and I'm sure he did before as well. But you know this is really about I mean when you go and talk to the vendors in a negotiation you say: well, we want twice as much bisection bandwidth and they say: okay. Well, you know, that'll cost you substantially more money in order to get that much bandwidth and if there's only a small part of the workload that can benefit from it, it may not may not make sense.

A

Besides that, there's a point of diminishing returns once you've spent 90% of your budget on memory, bandwidth and network bandwidth, there isn't much left in order to take away from computing in order to put computing into the bandwidth machine. So the strategies are slightly different for these different different cost components of communication. When you, when it comes to latency, you can try to overlap it right.

A

You can hide it by doing other things on the computer, it doesn't make the latency go away, but it does make it less painful and less expensive in your at the algorithmic level, whereas in bandwidth the only way you the only thing you can do, it's more fundamental, usually in the algorithms. The only thing you can do is come up with new algorithms that that don't send so much data.

A

Ok, so another thing I stole from John shell, not the picture, but the idea is you know there was this idea more than 10 years ago out the memory wall that we hit the memory wall, the memory wall is going to stop us and the multi-core made the memory wall worse in multi-core didn't cause this problem, but you know bandwidth.

A

The gap between bandwidth and computational capability and a single chip has continued to grow, and but the way to think about this is not as a wall but as a swamp, and you walked in walking to that swamp for years and we're going to continue walking into it, because the amount of the number of 14-point units, the amount of arithmetic performance that you can put on a single processor chip, is going to continue to grow. Much faster than bandwidth is going to grow and there are technologies that we are looking at.

A

That Delia is looking at in terms of optical. You know on chip, silicon photonics in the longer term and the short term trying to figure out memory technologies such as stacking that will hopefully make this bandwidth gap a little bit better, but fundamentally the the this. This is still going to be a problem, so you know that this is slight, as maybe start starting to get old.

A

Now the election is long over, but you know Obama actually understands this problem, and the president's FY 12 budget said that you know one of the things that de needed to do was to minimize the communication between processors and the memory hierarchy by reformulating the communication patterns specified within the algorithm. So now you have to be a little bit careful about taking lessons that you learned in in your scientific work and applying them at home or employing them in another.

A

Setting like in the debate in Denver I, think that Obama might have taken communication avoidance a little bit too seriously. So, let's say the so. They have a few lessons now for all of you who are writing scientific software, designing algorithms or supervising people who are the first one has to really understand the communication limits and for this I'd like to use Sam Williams roofline model, how many people are familiar with the roofline model: okay, yeah hold the co-op's cursive sales papers and the locus of local people. No, this is so.

A

This is a nice way to think about the the fundamental limit of bandwidth in your systems, and you can apply it I'll talk here about what it looks like in in between the memory, the DRAM of a single processor and the processing chip. Although you can apply this to other parts of the memory hierarchy and it's a it's a very simple model, and it's actually what I think people were using intuitively when they're trying to optimize codes to minimize bandwidth, but it's it just kind of captures it in a nice picture.

A

So what is this picture like? First of all, it's important to realize it's a log-log scale. So what is on the x-axis here is the is a property of the algorithm, which is the computational intensity, that is the number of floating-point operations per byte, moved from the memory into the processor chip versus the number of Mount of computation. You do you can do inside the processor chip and the y axis is the attainable, gigaflop rate that you can get for that for that code. Now, what? Why is it called the roofline?

A

Well, the top the flat part of the roof is the peak floating-point performance of the hardware. So the other lines on here all about hardware characteristics so they're fixed for the hardware. The basic plot is fixed for a particular processor and you start with the top line, which is double precision, 40-point peak performance. Those are the things that the benders always tell you this, how fast my processor goes and but then, if you don't actually use fuse, multiply, add instructions on a lot of processors.

A

You drop down by a factor of two remember it's a log scale and if you don't use Sindhi operations, you might drop by another factor of two and if you don't use in structural level, parallelism that is careful scheduling of your instructions. You'll drop down by another factor of 2, that's the without ILP line. So this gives you a sense of how fast you should be going in terms of the floating-point performance of the processor.

A

Now what the the diagonal line is, maybe a little bit harder to kind of get an intuition about, but it is just. It is the the bandwidth between the memory and the processor, and you start with a peak bandwidth. So that's they kind of guaranteed not to exceed number. But if you're, not using software prefetch on a lot of memory systems, you actually won't get that peak performance you might drop by another factor of 2 or so, if you're, not using the Numa architecture. As a structure of the architecture like on on hopper.

A

Many of you who have done kind of careful optimization of the the node code on hopper know that the Numa structure there is very important than you might drop by another factor of 2. So if you kind of cancel out- but you know bytes per second- the bandwidth numbers there and put this on the graph. You end up with these diagonal lines, and so that also limits your performance.

A

So your goal, in optimizing code of course, is to try to move your code over into higher computational intensity, which all of you knew before I told you about the roofline model. But this gives you a little bit more concrete kind of picture of what the limits are. That system, and so some work so I think there's some of Stefan's results here. The GTC things with other people at the lab whole bunch of people have worked on each one of these points.

A

This is optimizing a number of different computational kernels for two different architectures until Nehalem and the nvidia fermi system, and what you can see is that the the performance of these is all over the place, but that you are, you know roughly they do how to match the roofline. That is, if you work really hard at optimizing these codes, you can get them to be bandwidth.

A

Limited I, don't have that it didn't bring this particular graph, but it does also the case that if you take an average nurse spent an average or nurse cap location- and you run it on the Sun on the system- it often is not pegging the memory bandwidth. So it's important to realize that yes, these problems- indeed, as you would expect, is a sparse matrix vector multiply, is indeed memory.

A

Bandwidth, limited stencil operations typically are bandwidth limited, but many times the actual code that people are running is not so there's often a fair amount of headroom in there and understanding this roofline model can help. You number two is understand that, in order to get better better bandwidth utilization, you need to do higher level optimizations. So you know that previous slide was all about optimizing.

A

Little tiny kernels in the code GTC by the way, is off the roof line because it is doing essentially atomic updates on in an irregular pattern, but the real problem is that it's gonna be limited by synchronization, rather than being limited by the memory bandwidth or the floating-point speed of the other processor and DGM dense matrix multiplies up there, but these are little tiny pieces of these large complex applications trying to get them to go as fast as possible on an individual processor, but the the name of the game.

A

Now, if you, if you're, if you're sitting here at the roofline, what can you do? And this is the position we were in a few years ago and in this bebop project? We were looking at these first matrix vector, multiply things we said. Well, you know what do we do? We can't we can't beat the bandwidth on the machine, so the answer is well. You look at a higher level kernel and see if you can avoid bandwidth by optimizing at a higher level. So the example of that is something like sparse matrix vector multiply.

A

Let me just let's see, go to my other picture here for a minute that one which says in a sparse, matrix, vector multiply. What you want to do is do a you, need to read the matrix and then do a matrix vector multiply. You know so multiply each one of those entries in the matrix and what was really discouraging is when we looked at the performance of this I mean this is another way of saying that it was sitting on the roof line is basically the amount of time to do.

A

A matrix, vector, multiply, sparse matrix multiply is limited by the time to read the matrix, so you can't do anything. You've got to read the matrix right, so the question is: can you read the matrix once and take multiple iterative steps, because you are in an iterative solver reading that matrix over and over again? So the idea is we'll pick up a little piece of the matrix and, of course a sparse matrix is really just an unstructured. Graphs will pick up a little piece of our unstructured graph, we're doing nearest neighbor computation.

A

That's a SP M V operation on it. So in order to do the update on that vector which are the nodes in that graph, we need to get a slightly larger region. So we need to get the neighboring points there. So we can compute the next value of the interior of the graph and we need to. If we want to do two steps with one read of the matrix, then we need to get a slightly bigger piece of it. So that means all the edges as well and three steps and so on.

A

So so we actually did this and you can actually you can actually make sparse matrix vector, multiply, go much faster if you do them. K steps at a time that is, you, do a to the K times X, rather than doing a times X but you're. Now you now have a higher-level computation, so this says a to the K times.

A

X is actually faster than doing K versions of a times X, because you're not reading a over and over again now gets a little bit more complicated and I won't go into the details, but you know we have to understand the numerix, because we're going to take this new thing and stick it in tomato in the middle of our iterative, solver particular will put it into GM res so GM R, as is you know, an iterative solver that has a sparse matrix vector, multiply in it in the new algorithm.

A

It's going to have that eight of a K thing in there. You can see that the W equals that vector there. So we stick in our a to the K kernel, there's some other stuff going on over the reductions that I won't talk about right now that also have to do with communication, but so to a compiler person. This looks like kind of a loop interchange idea.

A

We have the the K loop on the outermost that there and we're going to just take that K loop and kind of stick, some of it on the inside there you know, and once we read the matrix, then we'll do K steps, except that it's completely illegal to compiler transformation. We've completely changed the dependencies in the program you no longer get the right answer and- and maybe it's sort of still smells like GM res, but unfortunately doesn't behave like Chan res. So this is how GM res behaves in terms of its residual error.

A

So it's an iterative solver, so you want it to get. You want the error to go away as you go through the iteration counts of the solver, so this is not performance. This is an error that we're measuring here. This is what happens when you take the new communication, avoiding algorithm that uses the a to the K kernel. That is, does one read of the matrix for every K steps, so it runs faster, but it no longer converges. So not.

A

A very useful algorithm, however, turns out that if you use a different basis called a Newton Newton basis, you can get convergence back again and you still are using something that kind of is like a this. This a decay to the K kernel in there, so lots of hand waving underneath this. The high-level point is that you shouldn't be just optimizing. The inner most loops of your code, you need to think about well, could I rearrange something at a much higher level?

A

That would allow me to do less communication, less data movement and by the way you can put this all back together again and it actually does run faster to use the communication avoiding part. Those are the the orange and red bars they're all set to one there. So it's normalized to this, but the the faster version and that's the the slowdown of the original version.

A

So the next trick that we are using to try to to optimize things is to to not get hung up on the owner, computes rule, so I'll tell you a little story about matrix multiply, but this is really not about matrix multiply. It's really about all these.

A

What we're actually not trying to generalize it to arbitrary loop nests, but the the basic idea in which I think a lot of people go into a scientific computation is you've, got a physical domain, we're going to chop up the physical domain, we'll give each processor a piece of that physical domain and they will be responsible for the updates on it. That makes by the way the concurrency control problems really easy. You don't really have to worry about two processors updating the same value, so it's a nice way to organize your code.

A

So this is some performance analysis done with a new algorithm that doesn't just do domain get into composition. It actually makes multiple copies of things in matrix multiply, so this is edgar Solomonic who's, a grad student on campus working with Jim Donnell, his advisor and doing matrix multiply, and this is running on Blue Gene P, and this is a speed. This is running time, so this is the old algorithm, and this is the new algorithm.

A

The new algorithm for reason, I'll, explain in a minute, is called a 2.5, D algorithm, and so it's not just using this idea of chopping up the result matrix the C matrix, if you own C, equals a times B in two separate pieces, but actually doing something more complicated. There's a different problem size with a speed-up shown. So alright, so I wasn't involved in this. But I was watching this work and I said well. What was I surprised about?

A

First of all, the prize that any committee could make matrix multiply, go any faster from an algorithmic standpoint. I mean I know there are people working on you know, making the exponent a little bit lower in terms of Dawsons algorithms and things like that. But this is basic and order. N cube matrix multiply, wasn't really changing the computation in any significant way.

A

It was just changing the data movement, so you can make it go faster and the basic idea was to make copies of the C matrix, have different subsets of processors, updating those copies independently and then combine the results together at the end. So the lesson that is so and there's you know there's a nice theory behind us. This is provably optimal you! uh The lesson. Was you never waste fast memories?

A

So, if you've got now, you may be concerned that Edison or nurse aid of the future systems are not going to have enough memory per core, which is which is always going to be a concern, but they. But if at some point in the middle of the computation, you have a computation that is not using all the fast memory on those systems.

A

You want you to consider doing something that decomposes into finer grained parallelism and then makes use of all of that memory in order to get speed-up and you're doing it not to get more parallelism, you're doing it to reduce communication. So now the question is: can we take this isjust matrix multiply? Can we do something for everybody else who are running other computations in the world, and so this is looking at what matrix multiple actually looks like in an iteration spaces, which is the way compiler writers think about it.

A

So there's three loops right: I, J and K and there's our iteration space, and you can actually then think about where the matrices a B and C fit, because there are projections of that iteration space onto the surfaces, so the C matrix is the top and the bottom. The a matrix is the front in the back and the C. The B matrix are the two sides: okay. So at every point, in the middle of that iteration, space you're going to do, multiply and add which is updating a value from each of the three spaces.

A

The three three spaces that I've shown here that are colored. And so you need to pick up those elements. And so the question is how do I divide up that iteration space in order to minimize the amount of surface area that gets touched by projecting out that interior region? So you can imagine, say the way the proof goes to say will pick up an arbitrary glob of stuff in the middle of this cube and figure out what its projection is and what is the smallest projection and not surprisingly, the smallest projection is actually a cube.

A

So what this says is that you actually want to chop the iteration space, not just by chopping up a and B, which is the obvious way to write.

A

Matrix, multiply and well, you chop up C, and then you kind of you know, send all the blocks of a and B around to get all the updates done, but you actually want to make copies of C and then chop up those copies and say that that divides up the the iteration space and it's called a 2.5 D algorithm, because the original kind of the the kind of obvious implementation based on D main domain decomposition is a two dimensional partitioning of the C matrix right.

A

It's just chops it in two dimensions and this one is actually chopping in the third dimension. So it could be called a three D algorithm, but it's kind of there for technical reasons, there's a an extreme case where the third dimension is really big as big as possible. That's called the three D algorithm, so this is called the 2.5 D algorithms, because it's because it's somewhere in between okay, so you may not care about matrix multiply. You may care about other things, so the question is: can we apply this to other things?

A

Actually, some of my students that I actually do have students again we're new students who are working on some. Some of these ideas have figured out that you can apply it. We've figured out what that you can apply this to end body codes, so just to give you a hint what the idea is. If we've got what we'll just do a really stupid end body code here for purpose of illustration and because it's a lot easier to analyze, so you've got order and particles and you've got P processors.

A

So what is the usual way that you would it would paralyze this code, you take each processor and give it and over P particles, and then you cycle the particles around somehow so that eventually every processor sees every other particle and they do all the updates for the pairwise forces between them and this computational cost is N squared over P and the communication costs an order, P messages and you have to send all the data around everywhere.

A

So it's order n words, so it turns out you can use the same replication idea, so we're kind of replicating all the particles a few times and then, within this smaller group of processors, sending all the particles around so that everybody can do a subset of the updates. So, for example, the first row is responsible for all the pink updates. The second row for the green updates, the third row for the yellow updates and so on, and you can actually prove that you get better performance out of it yeah.

A

But you know I like this quote in theory, there is no difference between theory and practice, but in practice there is, and so what that you know it is this just a theoretical result. Well, the answer is no. You can actually get speed-up numbers from this as well, just as you could with matrix multiply. So it's important to think about. You know how you might paralyze your codes in ways that will reduce the amount of traffic by looking at higher level kernels and in this case, by thinking about other ways than just decomposing.

A

The data structures into independent pieces and the other way to think about what's going on in both a matrix multiply case and the n-body case is kind of a replicate and then reduce you've got you've, got make replicas of your data structures, you're, independently working on partial results, and then you reduce at the end to get the the full answer. Okay, so have we seen this before?

A

Yes, in fact, when I've talked about these algorithms and some audience, they say, but we use that much, that algorithm for matrix multiply and the connection machine, the CM 2. So for those of you who they're not remedy in this room, who are old enough to remember the CM 2 and the mass par machine, those were machines with little teeny, tiny processors and people did indeed use these kinds of algorithmic ideas because they they needed so much parallelism that it wasn't.

A

They didn't have enough parallelism that they just divided up the problem into these different subdomains. They actually have to divide up the computation, not just divide up the data structure.

A

You do have to have an operator like plus, usually that's associative, so that you could do these things independently and you are reordering the way those things are being updated, but not in a very it's, not a very significant reordering of the calculation and actually in the GTC results with Stephan and others that I talked about before, though there was another paper that was looking at synchronization avoidance in particle and soul codes in general and the same ideas being used there.

A

The basic idea is: if you need to have a bunch of processors, updating something simultaneously rather than worrying, about locking over it make a copy of the thing that you're updating everybody updates independently. And then you combine the results together at the end and it gets used in sem d, extensions and GPUs and so on.

A

Ok, so you all know about making messages large or any good MPI programmer knows that you want to send a small number of messages, because each message is very expensive, but the opposite of that is really that you want to also overlap and pipeline your communication. So this sometimes runs contrary because in order to overlap, communication and pipeline, so pipeline means overlapping communication with communication, whereas but also overlap, just means overlapping it with computation.

A

You want to start the communication as soon as possible, which often means you've done you're, not ready to do all of the communication at once. So you start what you can I mean you end up sending more messages in the end, so you know this is what the P gasp ideas are all about and is about really making it easy to do overlap, and it's really about DMA operations that is doing fine-grained very lightweight communication across a global address space. So so this is.

A

This is what these, the P gas languages like UBC and Co, a Fortran chapel and so on. Look like, which is every processor, has a chunk of the memory which is physically what you have in the system, but they can access data anywhere in the system. Simply by doing a read and write, they don't have to ask the other processor to help them do the communication. So I also think of this as we're having to say receive, and so in this model you it is it.

A

It turns out that this is closer to what the hardware actually does, because down inside of a MPI send and receive there's typically a DMA operation going on there, and why do these? These kinds of programming models come up. These global address space models especially come up when you have a very irregular sort of data set. So imagine that you want to compute a histogram on a huge data set that does not fit in the memory of a single processor or even on the biggest shared memory multiprocessor.

A

You can find so you what you do is you take your machine like hopper and you spread the histogram over all the processors and now what you're going to have as you're computing? This histogram is you've, got these keys coming in and you need to put them in a bucket in the histogram and those are going to be kind of random accesses into the middle of the machines memory.

A

And so that's what these global address space programming models are about is making these kinds of things easier to express and actually faster to execute in in general, whereas the MPI stuff, if you're, really working on a physical simulation problem, it is often easier to divide up your domain physically. Even if you're using the replica, the replication idea, you've got some much more structure to work with, and so you you use that.

A

So that's why the MPI codes in practice have been that has actually worked out, it's very painful, to write to program a histogram in MPI, because you don't know when to say receive right if you're the processor that owns the bucket that somebody else, some other processors inserting a key into. It's, not a very natural thing, to figure out how to say receive in that kind of a model. So I think I'll skip some of this, except to say that you know there.

A

This is some work, I'm done on the milk application in QCD, and this is a hung, Jiang Shan and a bunch of others who looked at the comparison between a UPC implementation and an MPI implementation I'm going out to 32,000 cores now. This is looking at a slightly different version of the algorithm. So whenever you compare across programming models, you have this problem that you you write something in a different way, sometimes in a different language, and- and in this case that's indeed what happens.

A

Is you get a different version of the algorithm, but you do get as you can see much better scaling of the performance of QCD, okay, so I think I will uh kind of wrap up and just say there are a lot of challenges that we're facing in the next generation of scientific computing. Scaling is the most obvious, but excess scale is really not about scaling.

A

Excess scale is about figuring out how to use more energy, how to design and use and program more energy, efficient processors, and it is also about synchronization the dynamic system behavior that we're going to see I'll. You see this many of you who run who run very large-scale simulations on hopper, do see. The fact effective me was telling you about this.

A

The other day you know different runs, may have different performance depending on exactly where that job was laid out in the network, more irregular algorithms, as you get to the trying to optimize your algorithms in terms of the amount of computing you're going to do, you tend to you know, move from dense too sparse algorithms from structured meshes to unstructured meshes or to adaptive block structured meshes, so things that give you more irregularity and there's and, of course, resilience which I haven't talked about at all, but is obviously something that we worry about today, which checkpoint, restart and we'll worry about even more in terms of whether you can do a checkpoint restart.

A

But you know, but what's really important is still location, location, location and all of the things that do communication near your code, whether it's communicating up and down between the processor and the memory or communicating between the processors, continues to be a really important characteristic of how you optimize code. So in conclusion, communication hurts and so be careful and try to minimize the amount of communication you do thanks.