From YouTube: Getting Up to Speed on OpenMP 4.0 (Part 4)
Description
(4/5) Ruud van der Pas, Distinguished Engineer in the Architecture and Performance Group, SPARC Microelectronics, Oracle, and co-author of the book "Using OpenMP" (published by MIT Press), presented this tutorial on OpenMP 4.0 for NERSC users.
I'm ready to go again — thank you for coming back. It is a bit odd to have a commercial break at the start of the show, but it will be a technical commercial break, and it will be very brief. The reason is that I'll be showing performance numbers on our SPARC processors, and I guess very few of you are familiar with what that is about. But again, I'll keep this very brief; you can push me in the break, or anytime — I can always talk about it more.

So, we currently have what we call the T5 processor. The T5 processor has 16 cores. Each core runs at 3.6 GHz and has eight hardware threads, so in total one chip has 128 threads. The chip is designed to support an eight-way, glueless, directly connected system: you can have an eight-socket, 1024-thread system with point-to-point connections. You can go bigger too, and then you have a second layer of interconnect in between.
This is a simple architecture diagram. It's not a simple architecture — it's a very interesting cc-NUMA system, because each T5 has its own memory controller — but it's a fairly flat cc-NUMA system. The NUMA ratio is, I think, about forty percent when you go off your socket, but it still pays off to keep your data local.

Now, what's on its way to you — we haven't talked about it much, and this is the extent of the information, because we don't have the final numbers yet. This is SPARC M7, the new processor, and we have a new, redesigned core compared to the T5. There are 32 cores on the chip now, and the clock speed is faster — I can't give you the final speed yet. Again, eight threads per core; that part is pretty much the same. In total that gives you 256 threads per socket, and again this system can go eight-way glueless.
What I want to mention — after this I'll show a little bit more about the architecture — is that it again has a 64 MB partitioned L3 cache on the chip, and the strength of this chip is really in the bandwidth. It has an awful lot of bandwidth: the measured bandwidth is about 150 gigabytes per second, and we can go to much bigger systems. The question is: do you want to do that? How many people are actually willing to pay for it? That's a matter for the business side; technically it can go much further.

What I find interesting is that this is the first chip where we started having accelerators — but not like your common GPU. These are integrated and have specific purposes. What we call the DAX, the Data Analytics Accelerators, are really primarily targeting the database — to accelerate database processing, which, coming from us, doesn't come as a surprise — and I won't say anything more about that. But it's an interesting trend in what you do when you can put more things on the chip. Here is a little bit more detail on how it works.
So you have what we call core clusters, and they are directly connected on the chip. The last thing I want to show is what I think is really interesting; it has nothing to do with performance but with correctness. The name has changed a bit over time; in general we settled on Application Data Integrity, or ADI, which is kind of a funny name. But basically what you do is this: in memory you have your data and you have your addresses, and in addition to that, data now has a specific color.

That's the way to see it, and you can summarize this in just a few words, but I think it is pretty nice: if you try to access data that does not have your color, a trap will be generated, and that prevents — at hardware speed — things like memory buffer overrun problems. If you try to access outside of what you malloc'd, all the nasty memory problems that tools could detect before are now detected at hardware speed.

We have our own tool to do this checking; it's called Discover, and it takes advantage of that. So the huge performance penalty that you usually have with these kinds of checks is pretty much gone, and I think that's interesting: for once it's a correctness feature instead of the performance you're always going for. It's a very simple idea: if you try to load from something that's not your color, a signal is raised. Okay, that was the commercial — I hope it was not too painful.
All right, I'll talk about the myth — and I guess you can guess what it is. Then I'll talk about some of the darker corners of the OpenMP building, then some case studies, and then we'll wrap up. The myth: a myth is something widely believed that's false, and I guess you can imagine what it is here. It is: "OpenMP does not scale." You'll find that in too many places, people saying it, and once you start asking questions, it turns out to be different.

So what I'm going to show you now is going to be a little playful. The plan is: I'll show you an imaginary discussion with an imaginary person who comes to me and says OpenMP does not scale. I will show you my side of the discussion only, but I think it will be clear what the answers are from the other side. So — what are you saying? OpenMP doesn't scale? What does that really mean? Think about it.
A programming model may or may not scale, and many things cannot scale to start with. The implementation could be very poor — say it takes too long to create the threads; then it could be the implementation that doesn't scale. It could be that you're running on the wrong type of hardware for your application's requirements: your application has certain needs, and you picked an architecture that's not really suitable for them — that happens too. Or it could be you:

you wrote something, not knowing what's going on, and that turns out to be a bottleneck. That's what I'll be talking about — how you can prevent the big pitfalls. Here are some questions I could ask. My first question would be: so you have a parallel program, you used OpenMP, and it doesn't perform. I think that's what you're really saying. Okay, I see.

So, did you make sure the program is very well optimized in sequential mode? Because if it doesn't run well on one core, what do you think will happen on one hundred, two hundred, a thousand? It won't get any better; it actually gets worse very, very quickly. So you didn't. Why do you expect the program to scale, then?
"We just think it should — you use all the cores." Thank you. Maybe you should make a speed-up estimate using Amdahl's law. No, that's not the new EU financial bailout program; that's something else. I know you can't know everything, but you'll need to use a tool to find out where you're spending most of your time — a profiler. You didn't; you just parallelized all the loops in the program. Okay. Well, having done that, did you at least try to avoid parallelizing the inner loop in a loop nest? You didn't.
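To make that loop-nest point concrete — this is a minimal sketch of my own, not code from the talk: parallelizing the outer loop pays the work-sharing overhead once, while parallelizing the inner loop pays it on every outer iteration.

```c
#define N 1000
#define M 1000

void scale_rows(double a[N][M], double s)
{
    /* Good: work-sharing overhead is paid once; each thread
       gets a contiguous block of rows. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            a[i][j] *= s;
}

void scale_rows_slow(double a[N][M], double s)
{
    for (int i = 0; i < N; i++) {
        /* Bad: fork/join plus a barrier is paid N times,
           once per outer iteration. */
        #pragma omp parallel for
        for (int j = 0; j < M; j++)
            a[i][j] *= s;
    }
}
```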
Did you minimize the number of parallel regions? You didn't; it just was fine the way it was. Did you look at the nowait clause to minimize the use of the barrier? You never heard of a barrier? Maybe you should read a little bit. So, do all threads roughly perform the same amount of work? You don't know — you think it's okay. Okay, well, I hope you're right.
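Since the nowait clause just came up, here is a minimal sketch of the idea (my example, assuming the two loops touch independent data): each work-shared loop normally ends in an implied barrier, and nowait removes it when the next loop doesn't depend on the previous one.

```c
void update(double *a, double *b, int n)
{
    #pragma omp parallel
    {
        /* No barrier after this loop: the next loop touches
           b, not a, so nobody has to wait here. */
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * a[i];

        #pragma omp for
        for (int i = 0; i < n; i++)
            b[i] = b[i] + 1.0;
    }   /* implied barrier at the end of the parallel region */
}
```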
Did you maximize the use of private data? You just shared all of it — sharing is easier. Okay. And judging by your machine, it looks like you're using a cc-NUMA system; did you take that into account and become a good neighbor of your data? Never heard of that either. And false sharing could perhaps be affecting your performance — never heard of that either. Maybe you should learn a little more about those things.

So what did you do next to address it? Clearly you have a performance problem; what did you do next to address that? You switched to MPI. Okay. Does that perform any better? You don't know — you're still debugging the code. Well, while you're waiting for that debug run, let's look at OpenMP performance. There we go.
Definitely, the ease of use of OpenMP is a mixed blessing. I think those of you new to OpenMP, or with some experience, will find that it's not so hard to get going. The problem is that what you write may be terrible for performance, while in other models you go through a possibly steep learning curve but then you get the reward. With OpenMP it's more subtle: you have different ways to parallelize things, and it may turn out that only one of them is efficient.

Well then, how can you tell? That's what I'm going to talk about: the things to do and the things not to do. The ease of use is a double-edged sword — I still like it, but there are things you need to be aware of, and I think not much is written down about them. With MPI this is very well documented: nobody in their right mind will send a million one-byte messages. Could you? Sure — you just don't do that.
The equivalent thing is true for OpenMP, though the rules are different, and people break them because they don't know them. So let's look at it. Two of the nasty things that happen kind of silently are cc-NUMA and false sharing. The funny thing is — and they're real; they happen, and I'll show you examples — they have nothing to do with OpenMP.

It's the way shared memory systems are designed, and in particular things like cache coherence, which is really nice, but which can hurt you if you don't use it in the right way. So again, nothing to do with OpenMP; I hope to find the time one day to show you exactly the same nasty things in an MPI application as in OpenMP. But if you happen to use OpenMP, it's a very natural thing to run into.
Not so many — that's what I was afraid of, because it's a pretty evil thing to happen to you. I won't go into much of the detail of what's happening underneath, but I have one slide that tries to explain it. You have this cache line, and whenever you need data, that data will come — unless it's extremely large — as part of a line. The size of the cache line is decided by us, the hardware architects: it could be 32 bytes, 64 bytes, maybe 128.

There are very good reasons to have long cache lines, and equally good, different reasons to have short cache lines, so there's no one-size-fits-all; that's why different systems have different design choices in them. But it's a chunk of data — more than you typically need. So if you need, say, one double-precision element, you get a cache line with all the neighboring elements in it, and somebody else who needs a different element that happens to be in the same line will get a copy too. So it's very natural to have multiple copies of the same line floating around.
So far, that's good — as long as you only read, it's fine. But what if you modify it? What if this core decides to modify the yellow element? Now I have an inconsistency: that copy over there is now stale data, and getting access to the right data is handled by cache coherence — that's the underlying system. What will happen is that there are state bits that give the state of that line — clean or dirty or whatever; they have different states in the cache subsystem.

They'll change, so that anybody who attempts to use that cache line sees: wait a minute, I have an old copy, I need to get a new one. And that includes this thread: although we know that its blue element wasn't modified, the hardware can't tell — it will see a dirty cache line and say, okay, I've got to get a new one. Now, that happens all the time, and that's fine — unless it's in the heart of your loop.
If you hit this all the time, in the middle of your innermost loop, then the performance degradation is really bad. I used to have some slides showing how bad the degradation is — take my word for it, it is really bad. So it is something to take into account, and it's called false sharing. The reason for it to exist at all is that these state bits keep track of the status on a cache-line basis, not on a byte basis.

The cost implications of tracking every single byte would be very large: that would add a lot of infrastructure to the design, and that's expensive. So a long time ago somebody decided: we'll do it per cache line. That, in a nutshell, is what false sharing is. So what are the red flags?
We have three conditions for false sharing to occur. It happens when you modify data — technically, it happens on the store instruction. So as long as you only read, it doesn't matter: read as much data as you want, there's no false sharing. As soon as you modify an element, that line gets invalidated. The bad things happen when you have multiple threads and they hit the same cache line over and over again: that line will travel throughout the system, because every thread keeps saying "I need it" — or "wait a minute,

I have an old copy, I need to get a new one" — and that's fairly expensive. So when that happens very often, and at the same time, then false sharing is going to hurt your scalability. The recommendation is: use local data where you can, because then you immediately eliminate condition number one — it's not shared anymore — and all is fine. The way you do that in practice is often that you do your updates on local data, and only when you're done does the result have to be shared.
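Here is a minimal sketch of that idea (my own example, not the code from the talk): a reduction where every thread writes into a shared array indexed by thread id invites false sharing, because neighboring slots sit in the same cache line; accumulating into a private variable and merging once at the end avoids it.

```c
#include <omp.h>

/* Prone to false sharing: sum[0], sum[1], ... share cache lines,
   and every thread stores into them in the inner loop. */
double dot_false_sharing(const double *a, const double *b, int n)
{
    double sum[64] = {0.0};           /* one slot per thread */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; i++)
            sum[tid] += a[i] * b[i];  /* store hits a shared line */
    }
    double total = 0.0;
    for (int t = 0; t < omp_get_max_threads(); t++)
        total += sum[t];
    return total;
}

/* Local update: each thread accumulates privately and touches
   shared data only once at the end. */
double dot_local(const double *a, const double *b, int n)
{
    double total = 0.0;
    #pragma omp parallel
    {
        double local = 0.0;           /* private: nothing shared */
        #pragma omp for
        for (int i = 0; i < n; i++)
            local += a[i] * b[i];
        #pragma omp atomic            /* shared update happens once */
        total += local;
    }
    return total;
}
```

Of course, `reduction(+:total)` expresses the same thing more idiomatically; the explicit version just makes the locality visible.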
That merge step may still give you false sharing, but it's an order of magnitude less of it. So that's the general idea of how to avoid false sharing; there are other ways, but it's something to keep in mind. That's one of the dark, very dark dungeons of the shared memory building.

The other one is cc-NUMA. It's nice to get all that scalable bandwidth, but as I talked about in the morning session, with NUMA comes a sort of responsibility: you have to make sure you run close to your data, and the burden is on you.
One of the painful things — I showed this this morning, but I put it back in again because I'm not sure everybody was dialed in for that session, so bear with me if you're seeing it again already. You've got two CPUs — could be multi-core, whatever — one per socket, and each socket has its own memory controller that talks to its own memory. But thanks to the cache-coherent interconnect, everybody knows what's going on: wherever you need a variable, you'll get it.

You don't have to do anything for that. The only thing is, the time to get it can be vastly different. This is only a two-socket system: if you have a local memory access, that's the best you can do; it wouldn't be so good if the data sits in the memory next to the other socket. As I said this morning, remote accesses are hard to avoid a hundred percent, but you don't want this kind of remote access happening ninety percent of the time. So it's always a bit of a trade-off to decide where your data should go.
So this is definitely important on cc-NUMA systems, and since even two-socket systems are cc-NUMA these days, it pretty much affects everyone. And the cost is not only the longer memory access time: if all threads hit the same memory controller, you can congest that memory controller and take a real hit in performance and scalability. Luckily, OpenMP 4.0 provides support for thread placement — that was extensively talked about this morning. The one thing it doesn't do is place your data for you.
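As a small sketch of that 4.0 placement machinery (my example; the clause and environment variables are the standard OpenMP ones): `proc_bind` plus `OMP_PLACES`/`OMP_PROC_BIND` control where threads land, which is half of the NUMA battle — the other half, data placement, is still up to you, typically via first touch as shown below.

```c
#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* Typically combined with environment settings such as
       OMP_PLACES=cores and OMP_PROC_BIND=spread. */

    /* 'spread' spaces threads out over the places, which also
       spreads memory traffic over the memory controllers. */
    #pragma omp parallel proc_bind(spread)
    printf("thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
}
```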
What I'll show you later is how, in general, you can optimize for cc-NUMA, and I think pretty much every OS uses the first-touch principle to place the data. So what is first touch? You've seen this before. Say I have a loop here, and I'm just initializing a vector to zero. If I don't do anything, one thread will execute that loop, and the data lands in the memory near that thread — for the rest of the lifetime of that data.

The solution is pretty straightforward: you parallelize this loop. For demonstration purposes, say you run it on two threads. What will happen is that both will initialize half of that vector, each will get half of that vector into their memory, and hopefully that's going to be close to the threads — which isn't always the case; I'm talking about the ideal world view here, and I'll show you a little bit more of the gory real world later. But this is the idea.
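A minimal sketch of first-touch placement (my example): physical pages are placed on the first write, so initializing with the same parallel decomposition that the compute loops will later use puts each chunk near the thread that works on it.

```c
#include <stdlib.h>

double *make_vector(int n)
{
    double *a = malloc(n * sizeof *a);   /* no physical pages yet */

    /* First touch: with a static schedule, thread t writes (and
       therefore places) the same chunk it will later compute on. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        a[i] = 0.0;

    return a;
}
```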
Those were the two things: false sharing and cc-NUMA. Now I'm going to go through some case studies. The problem with case studies is that they're always kind of specific, so I try to pick something that's general enough to carry a broader message, rather than being applicable only to a very small subset of users — but they are, by definition, specific. So that's the title of the first case study, and it shows my favorite algorithm. I elaborate on it because, simple as it is, it actually has so many interesting things in it: it's multiplying a matrix with a vector, straight from the textbook.

What you're doing here — this is in C — is taking the dot products of the rows of the matrix with the vector, and that gives you the result. That's pretty much embarrassingly parallel, because all these dot products are independent, and actually any self-respecting automatically parallelizing compiler will do this for you. But if you were to do it yourself with OpenMP, what you want to do is parallelize the outer loop — the loop over the rows of the matrix — using OpenMP.
No surprise: for very small matrices there's not enough work to amortize the cost of the parallelization — and this is where the if clause comes in handy. You could say: if the matrix is smaller than some threshold, don't go parallel. Then your performance on this part of the curve would be roughly the same as sequential: you don't get any improvement, but you don't get a slowdown either. That's how you can use the if clause. And in a certain range of matrix sizes we actually get very good performance here — we even get superlinear scaling.
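This is roughly what that looks like in code — a sketch along the lines described, with the threshold value as a made-up placeholder:

```c
/* Matrix-vector product b = A*c, parallelized over the rows.
   THRESHOLD is a tuning knob (hypothetical value) below which
   the parallel overhead isn't worth paying. */
#define THRESHOLD 5000

void matvec(int m, int n, const double *a, double *b, const double *c)
{
    #pragma omp parallel for if (m * n > THRESHOLD)
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)   /* dot product of row i */
            sum += a[i * n + j] * c[j];
        b[i] = sum;                   /* rows are independent  */
    }
}
```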
This point right here is more than twice as fast, and that's because — and this is quite common on shared memory systems — you get additional cache space as you add threads: more cores means more aggregate cache. So all of a sudden, a data set that didn't fit in the cache before now fits completely, because you have more of it. Quite natural, and nice when it happens. But it doesn't always go like that, and this shows up when I go to large matrices: the curve went flat no matter how many threads I threw at it, and the best speed-up is only about 1.6. Remember,

this was an embarrassingly parallel algorithm — that's not very impressive. So what's going on? Sorry, again it will be technical, and again you've seen this picture over and over — there we are again. In this case, the specifics of that chip: it had eight cores, and each core has two hardware threads, but the way I draw it is very similar to what I showed earlier. This is the cc-NUMA box — I hope by now it's clear that, small as it is, it is a cc-NUMA box.
The interconnect — QuickPath is the name — is a cache-coherent interconnect, but each socket has a portion of the memory, so we're talking about exactly this issue. So what do we do? Well, what you need to do is figure out where the data is. The algorithm itself was parallelized — but where was the data? Well, I was very sloppy: the initialization was sequential, so one socket had all of the data and the other one had none at all.

So, straightforward again — right out of the textbook — what do I need to do? And this is why I said you go back to the drawing board: what does the algorithm do? It takes the dot products of the rows of the matrix with the vector. If I run this on two threads, one thread will access the top half of the matrix and the other one the bottom half — if I use a static schedule. So what I need to make sure of is:
this part of the matrix is in the memory next to the thread that executes on this part, and the other part next to the other one. There's a little catch here: the vector is shared by all of them — all of them read it — so there's no optimal placement for it. I could copy it to all the memories, but that's pretty expensive. You really should try to avoid copying data; the cost of that gets out of hand very quickly. I'd rather rely on some caching:

when you read that vector, hopefully some of it — or all of it — is still in a higher-level cache, and that's usually better. So be very careful with copying, though it could be a solution here. I went the easy way. What do I do? First of all, I take the vectors and I parallelize their initialization, so each half lands in the right place on the right socket. It's not perfect, but it's about the best I can do.
The key part is the matrix. I'm going to initialize the matrix, and I'm going to initialize it in parallel over the rows — exactly as my algorithm accesses it. I was lucky that that was the case; like I said, this is the best case: the way you initialize the data and the way you use it are pretty much the same. The one thing in red here got a little bit ugly, but the same observation applies to the output vector: before you can write your result, the cache line has to be in your cache first, so the locality rules hold for the destination too, not only the sources.
Okay — you could probably barely measure the difference, but I couldn't resist doing this as a demonstration. And then I got better scalability: about 2x. That's still not very high, but take the 2x improvement — I got it by merely changing my data initialization. It wasn't my algorithm; the algorithm stayed the same. And to show you this is generic, I took a SPARC system —

that's an older system, kind of a smaller-scale version of the T5. You've got a bunch of cores, you've got your threads, and I really tried to draw it to look as similar as possible to the other one: the same kind of memory setup, where every socket has its local memory and some interconnect glues it all together. So I took the same code, with the sequential initialization, and again I didn't get any speed-up beyond two threads.

Then I used my initialization trick, and again I got roughly a factor of two performance improvement. When you put that in one chart, you see that although in an absolute sense the performance is different — that doesn't matter — the improvement is roughly 2x in both cases. And this is a small-scale system: these effects matter more and more as you go to larger shared memory systems, with more and more sockets. Right — let's look at
a Fortran example: the original array update. I'm updating an array x, and as it turns out... first, I applied rule number one: always make sure the program runs reasonably well single-threaded. There's no point in trying to parallelize a program that runs like a dog. The reason — I didn't say this earlier — is that running like a dog usually means you're using the memory system badly: you go out to memory way too often, and when you then do that in parallel, you're overloading the interconnect. No matter how fast the interconnect is — a single thread shouldn't be able to saturate it alone — all threads going out to fetch data from somewhere else certainly can. So first make sure the program performs really well single-threaded: play with some compiler options or whatever you like, and then voilà. So this one is okay; it's written in the right way for Fortran, and it accesses a three-dimensional array.
The loops are all nicely formed, and that's fine, but unfortunately there are two dependences: x(i,j,k) depends on x(i,j,k-1), so there's a dependence in the third dimension, and here is a dependence in the second dimension too. So you can't simply parallelize those; maybe you could do something wavefront-like, but as written, I cannot put a parallel do on the k loop or the j loop without changing the code. So if you do it like this — what is the problem?
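The original is Fortran; here is the same pattern as a C sketch (array extents and names are mine). Only the i dimension is free of dependences, so the parallel loop ends up innermost in the nest — note where the implied barrier lands:

```c
/* x[k][j][i] depends on x[k-1][j][i] and x[k][j-1][i]:
   k and j cannot be work-shared as written; only i can. */
void update(int n, double x[n][n][n])
{
    for (int k = 1; k < n; k++)
        for (int j = 1; j < n; j++) {
            /* Work-sharing the only independent dimension.
               The implied barrier at the end of this loop is
               paid (n-1)*(n-1) times - that is the problem. */
            #pragma omp parallel for
            for (int i = 0; i < n; i++)
                x[k][j][i] = x[k-1][j][i] + x[k][j-1][i];
        }
}
```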
Well, the problem is that there's this implied barrier here that will cost you performance; as you add threads, it's certainly going to affect you. Because of the dependences I'm stuck with only one dimension, so I implemented it that way — and indeed it performs like a dog: the performance stops improving very quickly as I add threads, and I was not surprised. Let me show you how I analyzed this.

This is not the only way to do it, but my way of doing the sanity check on any of these hidden performance problems is to use the profiler from our Performance Analyzer and compare the single-thread run with the two-thread run. As I said this morning: whenever I look at performance problems, I start by comparing one and two threads, because only if I understand that can I try to go bigger. So here I'm comparing side by side
the one-thread and the two-thread runs, and I look at the user CPU time, the OpenMP work time, and the OpenMP wait time. That's one of the things our profiler spits out: it tells you how much time you spend doing some sort of, hopefully, useful work, and how much on overhead like waiting. It's a little small, but what you're seeing here is that this function, block3d, goes from 2.7 seconds to two — not very impressive, but it is a little bit faster.

The thing that literally sticks out is the wait time: it goes from almost nothing to over two seconds, on two threads already. That's very, very strange — why would there be so much waiting when all I had was a regular vector operation that I cut in pieces? I didn't understand the wait time, so I looked at the source level, and it confirms what I saw: it shows that loop having a wait time going from 80 milliseconds to 2.3 seconds. That's a huge jump; I really couldn't figure it out and had no idea what was going on.
So I switched to another view our tool has, and I need to explain it because I'll be showing more of these: what we call the timeline. The timeline is shown per thread, and you can load multiple experiments, as I'm doing here: this was an experiment with two threads, and this one, I think, with eight. It shows time from left to right, and each color represents a state.

When you look at it, anything but green is bad news. The blue-green is system time, which here means it was initializing the pages for the data — that was about the only interesting system state at the application level. Each function has a different color, and we do that for every snapshot that we make. What I'm doing is highlighting the bad ones in the legend: the bad ones here are red — that's the barrier — and blue, which is the idle time of the threads.
So what you're seeing here: this was the master thread, and this is the second thread. What I see is, I think, the barrier cost — that's what I expected — but I also see that idle time, the blue, and that's not what I expected. So it confirms what I already saw. When I zoom in, I see that — with a single thread it's kind of boring, but the master is very active —

occasionally there's a spike, but otherwise it's idle time. When I go to more threads it gets worse and worse; this is a little bit more of the same, and when I go to 16 threads I see it really getting out of hand. For too long I couldn't figure it out: where is the idle time coming from? How can that be, on a vector operation? And then I realized: when I have one thread, there's nothing to share — but look at what happens when I have two or more.
This is the piece accessed by the first thread, this is the piece of the second thread, and this is the split in between. With two threads I have one cache line on the seam — only one. But when I go to four threads I've got three of them in the interior, and especially when each thread has a short piece of data, that's very, very likely to hurt, because the impact of those lines is very large relative to the other things you're doing. How do you find out? Actually, as far as I know, detecting false sharing is still like

gazing into a big crystal ball, or wishful thinking — but luckily we have hardware counters now that can help point you at it. Counters are processor-specific: for each processor that you use, you need to figure out what the relevant counter's name is, which can be very cryptic and unclear — it's not an easy thing in general — but the information is there. So on our chip, I looked at the counter that shows me how many cache-line invalidations
I got. Remember the picture with the valid and invalidated lines: this is exponential growth as I increase the thread count, and that's really the smoking gun — over two hundred times higher on just 32 threads. So this is definitely false sharing at work.

Okay, there are several ways you can tackle false sharing. In this case I found something that's more generic: you have the dependences in two dimensions here, but luckily I don't have a dependence in the i direction.

That's shown here — you can see the code here and on the next slide. Basically, what you do is have each thread work on its own three-dimensional subvolume, and then you run into what I call the plus-or-minus-one problem: you need to figure out the start and end values by hand, and while I usually get it right almost the first time, there's always a plus one or a minus one missing somewhere. It's not hard —
it's bookkeeping. This is what I came up with for the code, and now I've actually killed two birds with one stone, because I have one big parallel region; there's no more than one barrier, at the end. Each thread will first ask for its thread id, and then it computes its start and end values — the size of, and its position in, the block that it works on. I mean, it's not hard; you've got to be careful and make sure you get everything right, but there's nothing difficult about it.
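A sketch of that bookkeeping (my variable names; the talk's code isn't reproduced verbatim here): one parallel region, explicit per-thread block bounds, and a kernel routine adapted to take start/end arguments.

```c
#include <omp.h>

void kernel(int is, int ie, int n, double *x);  /* takes start/end */

/* Explicit decomposition over the i dimension: each thread owns
   a contiguous block [is, ie) and calls the kernel on it. */
void solve(int n, double *x)
{
    #pragma omp parallel
    {
        int nthr = omp_get_num_threads();
        int tid  = omp_get_thread_num();

        /* Block size, rounded up, then clipped at n - the
           "plus-or-minus-one problem" lives in these lines. */
        int chunk = (n + nthr - 1) / nthr;
        int is = tid * chunk;
        int ie = is + chunk > n ? n : is + chunk;

        kernel(is, ie, n, x);
    }   /* the one and only barrier */
}
```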
So all I need to do is adapt the routine to accept start and end values, they all call it in parallel, and it does the job. And wait a moment — I got a bonus: this helps load balancing too, because the problem size doesn't usually divide evenly by the number of threads, so otherwise you get load imbalance. Overall, I get about a 4x performance improvement with a very simple change. And always do the sanity check:

now this timeline looks very well behaved. Here — this was, I don't know, four or eight threads — I've got only one barrier, and not all that idle time in between; on 16 threads there's more of the same, and this confirms it. There was one little thing that I never looked into: it still bogs down a little bit when you zoom in, but it's so small I didn't really care anymore. And when we measure those cache-line invalidations — that's the blue line now versus the red one — it pretty much confirms it again:
there's still a little bit of invalidation going on, but that's hard to avoid, and that's fine. So that was one thing I did. But there's another thing you can do, and it's a funny use of OpenMP; once you've seen it a few times you start to get used to it, but I want to explain it carefully. The problem was that I had this whole parallel region embedded here — remember, a parallel region is expensive, and that cost is paid on the order of n-squared times; that's really high.
So what does this do instead? It means that all threads will execute the whole loop nest: they all execute the outer loops, and then they split the work. This is how it executes: they all start running the k and j iterations, then they hit that inner loop — and that was a work-sharing do — so they split that work over the length of the i dimension. And in that way it works.
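In code, the trick looks roughly like this (a C sketch of the Fortran original): the parallel region is created once, outside; the outer loops run redundantly in every thread; only the inner, dependence-free loop is work-shared.

```c
/* One parallel region for the whole nest: fork/join is paid once,
   not (n-1)*(n-1) times. */
void update_outer(int n, double x[n][n][n])
{
    #pragma omp parallel
    for (int k = 1; k < n; k++)          /* executed by ALL threads */
        for (int j = 1; j < n; j++) {
            #pragma omp for               /* split only the i work  */
            for (int i = 0; i < n; i++)
                x[k][j][i] = x[k-1][j][i] + x[k][j-1][i];
            /* implied barrier here keeps the k/j dependences safe */
        }
}
```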
So by now I actually have four different versions. I had the bad one, of course. I tried our auto-parallelizing compiler on it — that's a relatively new capability. I have my block version with a single parallel region, and the last one, which is just the different implementation. And all of them were significantly better than the original version.

What was a kind of pleasant surprise is that the auto-parallelizing compiler does very well: it turns out it generates the same kind of code as I did by hand, and I was quite impressed — it did more than I expected. The funny version wins at first but eventually loses out, and since I was only interested in the overall result, I never looked into what was going on there. That was what I thought was the end of this story — but some things never end, and I don't know why,
but I started playing with software prefetch, which we have in our compiler. The hardware does automatic prefetching, and in addition you can ask the compiler to insert software prefetch instructions. What I saw was that without the software prefetch it is initially faster, and then it falls behind again — no idea what was going on here either. Oh, and this is where those hardware counters can be incredibly useful. I had a suspicion, because when I did that first work I hadn't really looked into
— oh my god — data locality. This is another counter: it measures how often I had a local hit versus when the data was somewhere remote. We also have a counter to tell you: you missed, but the data was somewhere else in the local memory — that's the nice one to have happen. This one is the bad one,

the one you don't want to see too high. What you see is that for up to eight threads it's low — because this processor has eight cores — and when I go beyond that is when I start seeing the remote misses. And what I see, and it's not surprising, is that my so nicely optimized version is the worst in terms of cc-NUMA: I gained on one side and then I lost on the other side. I mean, that's real life, isn't it?
Since many of my misses are memory related, let's look at how that array x — the three-dimensional array — is stored in memory. It's a three-dimensional array, and all Fortran people know, or should know, that Fortran arrays are stored column-first: you've got the first column in memory, then the next one, and the next one. So — these are all imaginary sizes — this is the first column.

The problem is how that lines up with the page size. What if I have some very unfortunate combination of sizes, so that a page spans more than one column? That can happen: a page is one column plus a little bit, or whatever — it's not nicely aligned and cut off at the right sizes.
So that's how things are laid out in memory, and there I am with my threads. What you see is that a page is just one unit, and one like this is heavily contended by both thread 0 and thread 1 — but the page can only be in one place at any point in time, so I have access conflicts, in terms of performance, on the page.

And the blocking scheme — when you look deeper at it, which I won't do here, when you look at where your data is placed — you realize it doesn't cut it; it doesn't help you. So I came up with a hack; it's nothing else than a hack, a proof of concept to see: well, what can I do? What I did is, I made the array four-dimensional, and I'm accessing it per thread.
I need a little bit of padding — a very, very small additional memory requirement — and now, in my parallel region, I get the thread id, figure out again what I need to do, and then access the array accordingly. So it's more complicated, I admit. And here, with first touch, I first make sure that each thread gets its data locally — so this is the initialization; that's new, something I didn't do before — and then the real algorithm is almost the same.

When you look at the counters — and I'll show you the performance on the next slide — you see that this has to work: originally, when I counted the remote accesses each thread had, you see these are modest, because this socket owns the data, and these are high, because those threads don't. With the blue one I distributed the data, and they're pretty much all equal.
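A sketch of that hack (my own reconstruction in C — the original is Fortran, the padding amount is a made-up placeholder, and the three original dimensions are collapsed into one for brevity): the extra, per-thread dimension plus padding keeps each thread's block on its own cache lines and pages.

```c
#include <omp.h>

#define NTHR 16   /* threads, i.e. size of the new dimension    */
#define PAD  8    /* padding (placeholder) so adjacent blocks   */
                  /* don't share a cache line or straddle pages */

/* x[tid] is a private, padded subvolume: thread tid first-touches
   it (placing its pages locally) and later computes only on it. */
void init(int n, double x[NTHR][n + PAD])
{
    #pragma omp parallel num_threads(NTHR)
    {
        int tid = omp_get_thread_num();
        for (int i = 0; i < n; i++)
            x[tid][i] = 0.0;   /* first touch: pages land locally */
    }
}
```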
So this is how cc-NUMA works, and the padding overhead is totally insignificant. Again, it's a hack — I'm not saying I'm proud of it — but it does show this is a way you could do it, and it definitely works; whether you'd do it in real production code, I'm not sure. But many people would have bailed out at the blue line, and persistence helps. Okay.