From YouTube: Using the Intel Compiler on Edison
Description
Tips for using the Intel Compiler on NERSC's Cray XC30 Edison system.
A: Okay, thanks. Next up.

B: Anything in there? Well, yes, that's actually probably the prime one. Despite the various titles here, which we came up with before we got Edison, when we were still wondering what to call this, this is basically about the Edison programming environment and its differences from Hopper. As a matter of fact, it ends up being mostly about Intel, because there really aren't many differences in the other two compiling environments.
B: Before I give you the introduction, I'll give you the caveats: the Edison programming environment is a moving target. We have a lot of requests, as you've probably heard, to make it easier to use, and we have a lot of requests for changes and such.
B: Mostly I'll cover the differences (who cares about the similarities?) between Edison's compiling environment and Hopper's, as they impact a programmer and a code runner, since that's what my experience has been. Then I'll go into a little more detail about the Edison Intel programming environment, which is quite different from the way it is on Hopper and from the other two; talk a little bit about porting from PGI on Hopper to Intel on Edison, since PGI is going to be gone; and talk a bit about performance.
B: Edison supports three compilers, three programming environments: Intel, which is the default (differently from Hopper and Franklin), Cray, and GNU. The PGI and PathScale compilers will not be installed on the system.
B: I didn't think anybody was using PathScale, but I got a request just today from somebody who wanted us to port the PathScale 5.0 beta compiler to Hopper, so there are people out there who still use it. GNU and Cray, significantly (I think in the long run this may be one of the most significant differences), use LibSci by default for the math library routines. As they said in the previous talk, you can use MKL with either of them.
B: I haven't done that yet, and ultimately, if Cray doesn't do it for us, we'll probably set up a module to allow you to link with MKL, as we do on Carver.
B: On our cluster, Intel uses MKL by default, and LibSci isn't available for Intel, at least at this time on this system. We recommend people use -mkl=cluster, at least right now, as a load flag. Though, did you say that it should be a compiler flag too, or do we just need it at load time?
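Under the assumption that the flag works the way it does in the standalone Intel toolchain, the link step might look like this through the Cray compiler wrappers (the source and program names are hypothetical):

```shell
# Hypothetical sketch: building an MPI Fortran code under PrgEnv-intel
# on Edison, passing -mkl=cluster at link (load) time to pull in the
# MKL cluster libraries (ScaLAPACK, BLACS).
ftn -c solver.f90                      # compile step
ftn -mkl=cluster -o solver solver.o    # -mkl=cluster on the link line
```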
B: The old names are still supported as legacy on Hopper; eventually they'll go away and you'll only be able to use the cray- prefixed ones, but it makes sense not to start with the old-fashioned names on Edison. And, as was discussed quite a bit in the previous talk, the Intel OpenMP and hybrid MPI/OpenMP runtime environments do not work, or rather, they do not work efficiently by default.
B: That's at this time; we're sort of noting it here, and maybe Cray will fix it for us. We'll give the workarounds provided by Junjie and Helen later in this talk.
B: How come... am I supposed to move? Oh, there. Okay: Edison math libraries. The GNU and Cray math library is the same as on Hopper: it's the old, tried-and-true Cray LibSci, which has been around forever. As on Hopper, and as it was on Franklin, no special flags are needed; everything links automatically. Intel uses the MKL math library, and again, as I just mentioned on the previous slide, you add -mkl=cluster as a flag at link time to load it. LibSci is currently not available for the Intel compiler.
B: Okay, here are the details on the Cray module name changes. The two significant things are down there at bullets five and six, or sub-bullets five and six: there are two Cray modules that are not yet available for the Intel compilers, PETSc and Trilinos.
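As a sketch of the renamed modules (the specific names here are illustrative of the cray- prefix scheme, not a complete list):

```shell
# The legacy module names are replaced by cray- prefixed ones on
# Edison; for example, the scientific library module:
module avail cray-        # list modules under the new cray- prefix
module load cray-libsci   # formerly xt-libsci on older Cray systems
# Note: cray-petsc and cray-trilinos are not yet usable with PrgEnv-intel.
```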
B: So, converting from PGI to Intel: these are the flags and their equivalents, the things people use pretty frequently. The thing to really talk about, and I'm going to talk about it on the next slide, is what the recommended flag is to produce well-optimized code in general. In PGI it's -fast; in Intel it's -fast with one other flag.
B
This
is
something
I've
just
discovered
myself,
just
in
the
past
couple
days
that
this
produces
the
best
well
optimized
code
at
runtime,
as
well
as
minimizing
the
compile
time
minus
fast,
but
with
minus
no
dash
ipo
with
it.
If
you
just
do
minus
fast,
there
are
problems
with
it
that
I
will
talk
about.
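The recommended invocation can be sketched as follows; `mycode.f90` is a hypothetical source file, and `ftn` is the Cray Fortran wrapper that invokes `ifort` under PrgEnv-intel:

```shell
# -fast turns on Intel's aggressive option set; -no-ipo switches off
# the interprocedural optimization piece of it, which is what blows up
# compile time (and sometimes breaks the build).
ftn -fast -no-ipo -o mycode mycode.f90
cc  -fast -no-ipo -o mycode mycode.c    # same idea for C codes
```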
B: Okay, the -fast option. It's very different between the two compilers. PGI's -fast is, to quote the man page, "a generally optimal set of options chosen for targets that support SSE capability." It's fast, but it doesn't do a lot of very deep analysis; it basically used to be called -fastsse. At one time they were thinking about keeping those two separate, but for years now they've been the same thing.
B
It's
includes
a
lot
of
optimizations,
but
most
significantly
it
includes
as
we're
just
talking
about
inter-procedural
optimization,
which
can
increase
compile
time
by
an
order
of
magnitude
or
cause
it
to
fail
in
order
of
magnitude
gtc.
B: If you compile GTC with -fast, it will literally take an order of magnitude longer than it would if you compiled it with the default optimization, or with -fast -no-ipo. And you do see it fail, because Intel seems to have a hard time keeping track of where all of the routines are, particularly if you're in complicated makefiles using different directories. I have seen it just say, oh, we can't find this .ipo file, or whatever the stuff is that it sticks in there. So, yeah.
B: I've always run my benchmarks against -fast, and before now I just sort of ignored it, because on Hopper it doesn't do significantly better than the default; in fact, it often does worse. But I found, to my shock, when I was running benchmarks on Edison, that -fast, for many benchmarks, particularly the larger ones, was producing faster-running code than the Hopper recommendation, which is the default. I had very cold feet about just recommending -fast, for the reasons I was just talking about, but I thought, well...
B: So that's my recommendation for users: if you want a fast, high level of optimization, I would recommend -fast -no-ipo over the Hopper recommendation, the default, on Edison. And as I was just mentioning, there's no significant improvement from -fast, or -fast -no-ipo, over the default on Hopper; I ran my benchmarks again on Hopper just to double-check.

A: Does an executable carry evidence of...
B: The IBM compiler is very elegantly laid out, a very modular set of compilers, so it would not surprise me. In fact, when you used it, it could call a separate program if you used certain optimizations; rather than being one big chunk of a compiler, it was extremely modular.
B: Okay, this is something we spent a lot of time talking about: the Intel hybrid MPI/OpenMP runtime environment. I'm not going to be talking about hyperthreading in this talk; it was covered very well in the previous talk, and I haven't really experimented with it yet, so I don't have anything intelligent to say about it. As we know from the previous talk, Cray's thread-affinity settings, which make a lot of sense and are a very good idea performance-wise, and Intel's OpenMP runtime environment conflict, because of that awful extra thread.
B: So you have, in essence, two threads scheduled on the same core, which means the job takes twice as long as it should. Here's the current workaround, and I know this works because I've run a bunch of OpenMP benchmarks with it. There might be other things that work too; Junjie and Helen came up with this, and I do get the appropriate speedups for the benchmarks when I use these settings.
B
You
have
two
two
conditions
here:
you
have
omp
num
threads
less
than
or
equal
eight.
In
that
case
you
set
the
kmp
infinity
affinity
to
compact
and
you
run
with
a
pneuma
node
cc
newman
node
flag.
B
If
you
have
greater
than
8
and
less
than
or
equal
16
k
infinity
goes
to
scatter
and
you
use
the
cc,
none
affinity
flag,
you
break
all
affinity
rather
and
again
these
do
work.
There
may
be
other
ways
of
doing
it,
but
this
this
will
work.
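The two cases described above might look like this in a batch script; the aprun process and depth counts are illustrative, and a.out stands in for your executable:

```shell
#!/bin/bash
# Workaround (per Junjie and Helen) for the Intel OpenMP runtime
# conflicting with Cray CPU binding on Edison.

# Case 1: OMP_NUM_THREADS <= 8 -- compact thread placement, bind the
# process to a NUMA node rather than to individual cores.
export OMP_NUM_THREADS=8
export KMP_AFFINITY=compact
aprun -n 2 -d 8 -cc numa_node ./a.out

# Case 2: 8 < OMP_NUM_THREADS <= 16 -- scatter placement, and turn
# off aprun's CPU binding entirely.
export OMP_NUM_THREADS=16
export KMP_AFFINITY=scatter
aprun -n 1 -d 16 -cc none ./a.out
```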
B: Compiler performance on Edison. Yeah, this is the last slide, but I'll probably talk a little bit more than I have written up here.
B: I have a bunch of NERSC-6 and NAS Parallel Benchmarks 3.1.1 codes that I use to just sort of look at compilers and performance and such. All compilers produce significantly faster code on Edison compared to Hopper.
B: When I say significantly: on average you get a two-and-a-half-times speedup on Edison over Hopper, and if that holds up for regular codes, that will probably be the biggest jump in per-processor, per-core performance at NERSC since the acquisition of the C90 way back in the early '90s.
B: Cray and Intel, at least on my benchmarks, only a couple of which even use the math libraries, have quite comparable runtime performance. GNU code runs, on average, about ten percent slower, but again, you can find a benchmark where GNU will beat the other two.
B: And to close up (this is going to be a very short talk, to compensate for the previous ones), this is what I found using a lot of different optimization arguments on these benchmarks.
B: I find that the only difference from Hopper is Intel, for which I recommend -fast -no-ipo, whereas on Hopper I recommend not using any optimization arguments, just the default. Cray: the same as on Hopper, the default, no explicit arguments. GNU: -O3 -ffast-math, again the same as on Hopper.
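Put together, the per-compiler recommendations might be sketched like this (mycode.f90 is hypothetical, and each line assumes the corresponding PrgEnv module is loaded):

```shell
# Recommended optimization arguments on Edison, per the summary above.
ftn -fast -no-ipo mycode.f90    # Intel (PrgEnv-intel, the default)
ftn mycode.f90                  # Cray (PrgEnv-cray): default options
ftn -O3 -ffast-math mycode.f90  # GNU (PrgEnv-gnu), same as on Hopper
```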
B: Though... oh no, I'm not going to mention that; that's not important. So, that's the end.