From YouTube: New User Training: 08 File Systems
My name is Lisa Gerhardt. I'm also in the Data and Analytics group, working with users and helping them run their data-intensive work on HPC. So today I'm going to talk to you about the file systems at NERSC and go a little bit into how you would want to manage your data on these file systems and where things should be placed.
First we're going to talk a little bit about some best practices for the file systems, then a couple of tips for dealing with your data, especially data shared across groups, and then, really briefly, we're going to go into future plans for the file systems here at NERSC. I stole this diagram from Steve earlier. Basically, this is a diagram of all the systems we have at NERSC.
But I really prefer this much more simplified way of thinking about the file systems here at NERSC. What you have is sort of a hierarchy of storage: at the top is the most performant stuff, the stuff that responds the most quickly when you're doing I/O, and as you move down the list, the performance and response become slower and slower. But also, because performance costs money, the capacity gets larger and larger.
Up at the top, the fastest response is just to write to memory; you can't beat that. After that is our burst buffer system, which we'll be hearing a lot more about later. It's very fast: it has a cumulative I/O bandwidth of about 1.8 terabytes per second, but it's also very small, with space for only about 1.8 petabytes. After that you have the local scratch file systems; there's a scratch local to Cori and one local to Edison.
A
On
those
systems,
you'll
always
find
better
performance
if
you
write
to
the
local
scratch.
So
if
you're
on
Edison
write
the
local
Edison
scratch,
Cori
write
the
local
local
quarry
scratch,
but
quarry
scratches
mounted
on
all
systems
a
quarry
scratch
has
a
cumulative
I/o
of
about
700
gigabytes
per
second,
so
fast,
but
roughly
a
factor
of
2
below
the
burst
buffer.
And
then,
after
that,
you
have
our
shared
project
file
system.
This is GPFS, which is now called Spectrum Scale, and it's intended for sharing and for long-term storage. It has a much larger capacity for data that's kept long term, but it comes with slower I/O. And the highest-capacity layer that we have is our HPSS tape system, the tape archive, with about 130 petabytes of storage. It's not a performant file system; it's just for long-term storage.
You can ingest fairly quickly because there's a disk cache in front of it, but once something has gone to tape and you're trying to read it back, you're looking at maybe 100 megabytes, a couple hundred megabytes, per second. So it's definitely not something you want to do streaming I/O from while you're in a compute job, because you'll be burning MPP hours waiting for the tape to retrieve your data.
So if you're accessing data there, you want to stage it ahead of time. You could use our xfer queue, or run something on a login node, before you do any computing. And then down here, sort of off to the side, we have the global common file system. We'll go into this a little further, but it's intended for high-performance software installations: this is where you put your software if you're going to run at scale on the system.
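For example, pulling data out of HPSS ahead of a compute job might look roughly like the sketch below, assuming the Slurm xfer queue and the hsi client; the job name, file, and directory names are placeholders:

    #!/bin/bash
    #SBATCH --qos=xfer            # transfer queue: no compute nodes are held while this runs
    #SBATCH --time=02:00:00
    #SBATCH --job-name=stage_in

    # Pull the input bundle out of HPSS into scratch, then unpack it,
    # so the compute job that follows can read it directly from scratch.
    cd $SCRATCH/my_run
    hsi get my_inputs.tar
    tar xf my_inputs.tar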
The first thing I get asked a lot by users is: where do I put my data? I'm going to do some computing; what file system should I put it on? Maybe it's on project; where should I move it? Generally, if you're doing any kind of heavy I/O, any kind of large reads, any I/O-bound application, the burst buffer should be your first choice.
The burst buffer is basically a layer of extremely fast flash storage. It's very quick to access, but it's truly transient. So what you do is, at the beginning of your job, you have a directive that says: stage this data in to the burst buffer. Then you do all your work on the burst buffer, and at the end you say: stage out, take this output data and put it back on scratch.
There's a specialized command that you put in there, so you don't just cp the data yourself. We're going to go into this in more detail a little later, but it's a batch directive, and you give it the name of the data that you want staged. Then the batch system, before it starts running your job, will stage the data in, and only once that's done will it start the computing, so you're not sitting there with ten thousand nodes waiting for this data to stage.
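A stage-in/stage-out job script looks roughly like the sketch below, assuming Cori's DataWarp batch directives; the node count, capacity, paths, and application name are placeholders:

    #!/bin/bash
    #SBATCH -N 10
    #SBATCH -t 00:30:00
    #DW jobdw capacity=200GB access_mode=striped type=scratch
    #DW stage_in source=/global/cscratch1/sd/myuser/inputs destination=$DW_JOB_STRIPED/inputs type=directory
    #DW stage_out source=$DW_JOB_STRIPED/outputs destination=/global/cscratch1/sd/myuser/outputs type=directory

    # The batch system stages the data in before the job starts and copies the
    # outputs back to scratch after it ends, so no node-hours are spent waiting.
    srun ./my_app --in $DW_JOB_STRIPED/inputs --out $DW_JOB_STRIPED/outputs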
That's if you have one-off data. If instead you're going to run some kind of campaign, where you're continuously reading the same set of data, you can actually get a persistent reservation of up to 20 terabytes. You put the data in there, the reservation persists across multiple jobs, and you can access it that way; you can do up to 20 terabytes on your own.
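A persistent reservation is created by one job and then attached to by later jobs, roughly like this sketch (the reservation name, capacity, and paths are placeholders):

    # Job 1: create the persistent reservation
    #!/bin/bash
    #SBATCH -N 1 -t 00:05:00
    #BB create_persistent name=myCampaign capacity=10TB access_mode=striped type=scratch

    # Subsequent jobs: attach to it by name and read the same data every time
    #!/bin/bash
    #SBATCH -N 100 -t 04:00:00
    #DW persistentdw name=myCampaign
    srun ./my_app --data $DW_PERSISTENT_STRIPED_myCampaign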
The other cool thing about the burst buffer is that the bandwidth scales with the size of your request, and you get your own metadata server. So it's much more forgiving of what people typically call bad I/O, like lots of opens and closes, because it's only your metadata server that's responding to you.
We've seen a lot of different science codes show improvement just by switching. Unfortunately this plot is cut off, but basically it's from the ATLAS group: what they're seeing is their read bandwidth on scratch down here, and then they switch to the burst buffer and, with just a few minor changes, get roughly a factor of four better, and that's just from the improved hardware. You can then go further and really optimize your code to make this much more performant.
Next after the burst buffer is scratch. Basically, this is for data you don't want to stage into the burst buffer: maybe you're not reading it that much, it's not worth the time to figure out how to interact with the burst buffer, or maybe it's too large and you don't want to stage it in for a single one-off. Scratch is a Lustre file system.

Basically, the way it works is there are a couple of metadata servers and then a whole bunch of data storage targets, and you can stripe your file across multiple targets to improve performance; we have some guidelines for how you want to do this. By default, your data is striped across only one data storage device, called an OST, and that's great if you're doing file-per-process I/O.
But if you're reading from one big single shared file, you should stripe it across OSTs according to its size. You can go to our website and just search for "Lustre striping", and that will bring up a whole page about this, but to boil it down, these are the commands and the rough striping that you should use.
If you're in the neighborhood of 1 to 10 gigabytes, you just stripe across a handful of OSTs, but if you're much bigger than that, you want to stripe across roughly 70 OSTs to get optimal performance. The way you do this is with the lfs (Lustre file system) command: lfs getstripe plus the file name will tell you the striping, and if you want to change it, you use lfs setstripe.
The other thing to think about for scratch is that it has limited capacity: if everyone used their quota to the fullest, it would be oversubscribed. So what we do is we go through and purge, and if you haven't accessed your files in 12 weeks, they're automatically deleted. This is something to keep in mind: scratch is for data that's being actively computed on. Once you're done with your computing, you should move it someplace else, either to project, to HPSS, or to your home institution.
The next file system in the performance hierarchy is the project file system. This is a shared group file system. On scratch and the burst buffer, the allocations are on a per-user basis: there's a directory with your name on it that belongs to you. Project directories are on a per-repo basis, so if you're in that repo, you can write to that project directory. It's intended for large data that you're going to need for the next few years, and it's also intended for sharing.
We manage this with quotas: each repo has, by default, a terabyte of space to start with. If you need more space, you can write to us; there's a quota increase form you can fill out on the website, and we'll work with you to see if we can accommodate you. We also have a nice feature in Spectrum Scale where we keep snapshots of the file system for the last seven days.
So if you come in today and accidentally delete a file, you can go to the special .snapshots directory in there and pull the file back out yourself; you don't need to write to us, you can just grab it right then. And if you create a www directory inside your project directory, it will automatically be picked up by our web portals and shared. So it's a way for you to share data out to the web in a really quick way.
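Recovering a file from a snapshot looks roughly like this sketch (the repo name, snapshot date, and file name are placeholders, and the exact project path may differ):

    # See which daily snapshots are available for the project directory
    ls /global/project/projectdirs/myrepo/.snapshots/

    # Copy yesterday's version of an accidentally deleted file back out
    cp /global/project/projectdirs/myrepo/.snapshots/2018-03-01/results.dat .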
Next is HPSS. This is for things like raw data from your experiment that you can't reproduce, or really hard-to-generate simulation data. You should think, when you're putting stuff into HPSS, about how you're going to retrieve it and what you'll want to do with it, because at its heart HPSS is tape. When you transfer something in, at first it hits the disk cache and goes in really quickly, but over time it migrates out to the tapes. So when you come back to retrieve it, it can take a while.
Let's say you transfer in 10,000 files. They all go in together, but those 10,000 files could end up spread across many hundreds of different tapes. You come back and decide you want all 10,000 files out, and all of a sudden you have a traffic jam of all these tapes trying to load. What's better is to think: hey, if I'm going to get these 10,000 files back, I should bundle them together into one bundle, because I'm going to need them all at once, and then just put the bundle in there.
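For example, with the htar utility (the archive and directory names are placeholders):

    # Bundle a directory of many small files into a single archive directly in HPSS
    htar -cvf my_run_2018.tar my_run_2018/

    # Later, pull the whole bundle back in one go instead of 10,000 separate reads
    htar -xvf my_run_2018.tar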
Finally, here's our extra one, global common. Basically, what this is for is software stacks. If your group has a shared set of software that you're all going to be using, we recommend that you install it in your global common directory. Each repo has its own shared group directory in global common, and by default it comes with a 10 gigabyte quota, but we're pretty flexible about that. The reason you want to do this is performance.
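For instance, a group-shared Python environment might be installed there roughly like this sketch; the /global/common/software path layout, environment name, and package list are assumptions, so check the documentation for your repo's actual directory:

    # Build a shared conda environment in the repo's global common directory
    module load python
    conda create --prefix /global/common/software/myrepo/myenv python=3.6 numpy scipy
    source activate /global/common/software/myrepo/myenv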
This is the startup time of one of our Python benchmarks; I forget what concurrency this is, something like 1,500 nodes, or 1,500 processes. Down here, the fastest performance is with Shifter, which we'll hear more about later, and then very close after that is global common, right down around here, and then you have project and scratch. Global common is mounted read-only on the compute nodes, which is part of why it's faster.
Let me just mention that we've deployed a few tools to start dealing with this. The first one is what we call our data dashboard. This is really aimed at helping people deal with when your shared directory hits its quota. I don't know about you, but I used to manage a project directory that was 40 terabytes, and when we hit our quota it was bad news: everyone had to stop and go through and figure out how much data they had, and it was really just kind of painful.
A
If
you
go
to
my
nurse
gov,
there's
a
link
right
here.
It's
the
data
dashboard
and
what
this
will
show.
You
is
every
project
directory
that
you
have
permission
to.
Access
should
show
up
on
here.
I'm
in
consulting
I,
see
I
see
a
lot
of
project
directories.
So
you
shouldn't
see
this
many,
but
we
can
scroll
down.
We
look
at
the
staff
directory
which,
for
historic
reasons,
is
called
NPC
cc,
so
you
can
see
we're
doing.
Okay
in
quota
we've
got
about
10,
terabytes
we're
doing.
We're doing okay, and in inodes we're about two-thirds full. But if we were full and I wanted to clean up, you can go and look at the percentage that's being used by each user. I pull this down, and every time I do this it's a risky click, because I'm often the number one user, and once again I am the number one user in this space. You can mouse over and see I'm using roughly three terabytes.
So if we hit our quota, everyone would be looking at me to clean up. And if I come on here and say, oh gosh, I need to clean up, where do I start, I can get a list of my ten biggest files, and my biggest directories by inode count, and just kind of get a good idea of where to start. So this is a pretty useful tool for managing these shared spaces.
For future plans, we've architected our storage system to try to deal with some of the problems of moving between the tiers. You can sort of think of it like this: right now we have four tiers of storage, and we're going to go down to two. We're going to integrate our high-performance area into one tightly integrated package, and then our longer-term storage is going to become a much more tightly integrated community storage.
These changes are in the works. You should start to see some of these things hopefully within the year, especially for off-platform storage and for our next NERSC system. So that's all I've got. There are a couple of links if you want some further reading, and otherwise I'll take any questions if you have them.