National Energy Research Scientific Computing Center (NERSC) New User Training 2019, 14 Aug 2019

From YouTube: 9. File Systems

Description

Learn about the NERSC filesystems and the best way to use the different filesystems.

Slides for all sessions can be downloaded from here: https://www.nersc.gov/users/training/events/new-user-training-june-21-2019/

A

Hello, everyone, my name is Johnny. You have probably seen my name in your email during the past few days, and I guarantee you'll see a few more emails. One of them will be a short survey, and I would appreciate it if you could reply to it, because that will give us some feedback. I'm going to talk about the NERSC file systems. There are several different file systems at NERSC, and they are designed for different purposes and with different motivations.

A

For example, the burst buffer is currently the fastest one at NERSC. The backend hardware is SSD, and the software layer is Cray DataWarp. If you want to accelerate your application's I/O performance, you should try the burst buffer as your first choice. Then we have scratch, which is a Lustre file system with a maximum aggregate performance of around 700 gigabytes per second. I don't think you can get a number this high anywhere else.

A

The hardware and software at NERSC are, as our colleague Ashley mentioned, highly optimized for HPC users and for your applications. Then we have project. The big difference is that project tends to be permanent and holds your data for a longer period than the burst buffer and scratch. Scratch in turn keeps data relatively longer than the burst buffer: it has a 12-week purging period, whereas the burst buffer is just temporary.

A

You can just grab a temporary instance for your job. Basically, that means that when you launch your job you get some burst buffer space, but once your job has completely finished you can no longer access that space. There is one option to use it in a way similar to scratch, which is to create a persistent reservation. I will cover that later.

A

Then we have HPSS. That is basically the archive storage: we use tape libraries to keep your data as long as possible.

A

Those are the main file systems. We also have two more, global common and global home. They are not designed for your application's I/O, so you shouldn't run your application while it talks directly to these file systems. Instead, they are mainly designed to, for example, keep your source code and compile your code. They are SSD based and have limited quota. Take global home, for example.

A

That is what you see when you first log on to Cori: your home directory. You get a 40-gigabyte quota there. This is a very simplified diagram of the different file systems, and in the following slides I will walk through each file system in more detail. The first one I want to talk about is the scratch file system.

A

Don't get confused: at NERSC, scratch is just the name we use to describe a configuration based on Lustre and HDDs. So when we refer to scratch, we mean this Lustre file system. Lustre is one of the most successful HPC file systems, with 16 years of research and development behind it. Many optimizations and research ideas have been put into production and implemented as real product features in the file system.

A

If you look at the TOP500 list of the fastest supercomputers in the world, you will find that most of them use Lustre as their file system. The current version at NERSC is 2.7. The latest release is a newer 2.x version with more features, and I think we will get the upgrade with the next machine.

A

To understand scratch, or Lustre (we use the two terms interchangeably here), there are a few important concepts. The first is the metadata server, which holds your file metadata, such as file names and directory names. Then you have the OSS, the object storage server, which manages a bunch of OSTs. An OST is an object storage target, which you can think of as a bunch of disks.

A

You can think of those as hard disk drives (HDDs). When we talk about aggregate I/O performance, we are actually talking about the maximum performance observed in the initial phase of the system. So if you run, for example, MPI_File_write_all and want to get maximum performance, it is better to leverage Lustre striping on the file system. This is another diagram to give you an idea of how the scratch file system is hooked up to the Cori compute nodes.

A

At the top are the Cori compute nodes. We have 130 LNet routers, which connect the compute nodes to the scratch file system. In terms of the number of OSTs, we have 248. That means you can stripe your data across at most 248 OSTs, or object storage targets. Then there are the metadata servers. Currently we have five metadata servers: one is called the primary metadata server, and there are four additional metadata servers.

A

I think there are two important things to remember. First, when you have too many files in a deep directory hierarchy and you see poor performance on the scratch file system, you may want to consider moving your metadata from the primary metadata server to the four additional servers.

A

You can talk to us or send us an email if you observe this kind of slow metadata performance. The other thing is that if you have a very large file, like 100 gigabytes or one terabyte, you should consider striping to get optimal I/O performance. We will talk about striping later; it is done with a very simple command.
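The idea behind striping can be illustrated with a small sketch. It assumes plain round-robin striping and a 1 MiB stripe size; the function names are hypothetical and the indices are positions within one file's layout, not physical OST numbers on Cori.

```python
from typing import List

MIB = 1 << 20  # assumed stripe size of 1 MiB


def stripe_index(offset: int, stripe_count: int, stripe_size: int = MIB) -> int:
    """Return which stripe object (0 .. stripe_count-1) holds byte `offset`."""
    if offset < 0 or stripe_count < 1 or stripe_size < 1:
        raise ValueError("offset must be >= 0; stripe_count and stripe_size >= 1")
    # Consecutive stripe-sized chunks go to consecutive objects, wrapping around.
    return (offset // stripe_size) % stripe_count


def bytes_per_stripe_object(file_size: int, stripe_count: int,
                            stripe_size: int = MIB) -> List[int]:
    """Return how many bytes of a file land on each stripe object."""
    if file_size < 0 or stripe_count < 1 or stripe_size < 1:
        raise ValueError("file_size must be >= 0; stripe_count and stripe_size >= 1")
    full_chunks, remainder = divmod(file_size, stripe_size)
    totals = []
    for i in range(stripe_count):
        # Every object gets the shared full rounds, plus one extra chunk
        # if it falls inside the last, partial round.
        chunks = full_chunks // stripe_count + (1 if i < full_chunks % stripe_count else 0)
        totals.append(chunks * stripe_size)
    if remainder:
        # The trailing partial chunk goes to the next object in round-robin order.
        totals[full_chunks % stripe_count] += remainder
    return totals
```

With a stripe count of 1, every byte lands on the same object, so a single server handles all the traffic for that file; with a larger stripe count the same file is spread over several servers that can work concurrently.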

A

Here is a very quick demo of the striping command. It uses lfs, which only works on the scratch file system and does not work on any non-Lustre file system. For example, this is what you see when you log on to Cori. If you type the command lfs getstripe there, to find out how many OSTs are being used by your file, you will see this error.

A

Basically, you cannot run this command on a non-Lustre file system, and the home directory is based on GPFS underneath, not Lustre. That is why you see the error. If you cd to scratch and make sure you see /global/cscratch1/sd and your username, then you are guaranteed to be on the Lustre file system. Then you can run the command lfs getstripe and give it any existing file name.

A

That tells you how many OSTs are being used by your file. By default, any newly created file on scratch, on this Lustre file system, will use just one OST.
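The check described above can also be scripted. This is only a sketch: the helper names are hypothetical, and it simply shells out to `lfs getstripe` and returns whatever the tool prints, without assuming anything about the output format.

```python
import shutil
import subprocess
from typing import List, Optional


def getstripe_command(path: str) -> List[str]:
    """Build the argument list for `lfs getstripe <path>`."""
    return ["lfs", "getstripe", path]


def show_striping(path: str) -> Optional[str]:
    """Return the raw `lfs getstripe` output, or None if it cannot be obtained.

    None is returned when the `lfs` tool is not installed, or when the command
    fails, which is what happens for a path on a non-Lustre file system such
    as the home directory.
    """
    if shutil.which("lfs") is None:
        return None
    result = subprocess.run(getstripe_command(path), capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return result.stdout
```

On a login node one might call `show_striping` on a file under the scratch directory and print the result; on a machine without Lustre it just returns `None`.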

A

Now, how do you change the striping? You can imagine that having more OSTs will potentially improve your I/O performance, because you have concurrent servers serving your I/O requests. To change the striping of existing data, you have to create a new directory with the striping you want and manually move your existing file into this newly created directory. You cannot change the stripe configuration, in terms of the number of OSTs and the stripe size, directly on an existing file.

A

There are some striping recommendations, which you can check in this table. It depends on your file size. For example, if the file is smaller than 1 gigabyte, you can probably just use the default striping. If it is very large, like a hundred gigabytes or even one terabyte, we recommend that you use the stripe_large command. You can also set the stripe count manually, as I demonstrated before: the stripe_large command will just use 72 OSTs for the stripe count, but you can definitely increase the stripe count to 100 or 200. Next, the burst buffer.
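As a rough sketch, the two recommendations stated in the talk can be written as a lookup. Only the two ends of the table are mentioned here, so the middle range is deliberately left as "see table"; the function name and the use of decimal gigabytes are assumptions.

```python
GB = 10**9  # decimal gigabyte, assumed for illustration

STRIPE_LARGE_COUNT = 72  # stripe count used by stripe_large, per the talk
MAX_OSTS = 248           # number of OSTs on Cori scratch, per the talk


def recommended_striping(file_size_bytes: int) -> str:
    """Map a file size to the striping advice given in the talk."""
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    if file_size_bytes < 1 * GB:
        return "default"       # one OST per file
    if file_size_bytes >= 100 * GB:
        return "stripe_large"  # 72 OSTs, can be raised manually
    return "see table"         # intermediate sizes are not covered in the talk
```

For a manual stripe count, `lfs setstripe -c <count> <directory>` sets the layout that new files in that directory inherit; the count cannot usefully exceed the 248 OSTs available.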

A

So why the burst buffer? The burst buffer is designed to accelerate your I/O and also to absorb bursty I/O requests. Of these two pictures, the one on the left is without the burst buffer, or DataWarp, as the I/O accelerator. This is a very typical situation on an HPC file system: if you observe the I/O activity, you will see these kinds of spikes, this kind of bursty pattern. With the burst buffer we are able to absorb those bursty patterns and dramatically improve I/O performance.

A

Basically, the burst buffer is designed for high-IOPS and high-bandwidth applications, and it is very easy to use. Currently you just need to add a few lines to your existing batch script. There are a few important things to note. When you specify that you want to use the burst buffer, you need to tell your job how many resources you want to allocate. This is the capacity parameter.

A

For example, if my job produces about 900 gigabytes, I will probably request 1 terabyte, slightly more than the job will produce. If I request a capacity of 1,000 gigabytes, I will get a bunch of burst buffer nodes at run time. How many nodes will I get? You can simply calculate that by dividing this number by 20, because 20 gigabytes is the granularity of the burst buffer. There are a few more commands that are useful. One is stage-in. Assume your data is on the scratch file system.
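The capacity arithmetic above is simple enough to write down. This sketch uses the 20 GB granularity quoted in the talk; rounding a partial unit up to a whole one is an assumption, and the function name is hypothetical.

```python
import math


def burst_buffer_units(capacity_gb: float, granularity_gb: float = 20) -> int:
    """Return how many granularity-sized units a capacity request maps to."""
    if capacity_gb <= 0 or granularity_gb <= 0:
        raise ValueError("capacity and granularity must be positive")
    # A request that is not a multiple of the granularity is assumed to be
    # rounded up to the next whole unit.
    return math.ceil(capacity_gb / granularity_gb)
```

For the example in the talk, `burst_buffer_units(1000)` gives 50, which is the "divide by 20" figure the speaker refers to.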

A

You use this command so that, before your job starts, the file is staged in from scratch onto the burst buffer. Then, when the job runs, your compute nodes can access that data directly from the burst buffer. You can also use stage-out. That assumes you have produced some new data, or modified the data, and you want to keep the new result.

A

So you want to stage it out from the burst buffer to some relatively permanent space like scratch, which will not be purged for 12 weeks, so it is reasonably safe. Also, the burst buffer space disappears when your job exits. If you want burst buffer space for a longer period, you need to create a persistent reservation.
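The "few lines" added to a batch script for a per-job allocation with staging might look like the output of the sketch below. The directive spellings (`#DW jobdw`, `#DW stage_in`, `#DW stage_out`, `$DW_JOB_STRIPED`) are recalled from the Cray DataWarp and NERSC documentation rather than taken from the talk, so they should be checked against the current docs; the helper name and all paths are hypothetical.

```python
from typing import List, Sequence, Tuple


def datawarp_job_directives(capacity_gb: int,
                            stage_in: Sequence[Tuple[str, str]] = (),
                            stage_out: Sequence[Tuple[str, str]] = ()) -> List[str]:
    """Build burst buffer directive lines for a batch script header.

    `stage_in` and `stage_out` are (source, destination) pairs of file paths.
    """
    if capacity_gb <= 0:
        raise ValueError("capacity must be positive")
    lines = [f"#DW jobdw capacity={capacity_gb}GB access_mode=striped type=scratch"]
    for source, destination in stage_in:
        # Copied onto the burst buffer before the job starts.
        lines.append(f"#DW stage_in source={source} destination={destination} type=file")
    for source, destination in stage_out:
        # Copied back to longer-lived storage after the job ends.
        lines.append(f"#DW stage_out source={source} destination={destination} type=file")
    return lines


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for line in datawarp_job_directives(
        1000,
        stage_in=[("/global/cscratch1/sd/username/input.dat",
                   "$DW_JOB_STRIPED/input.dat")],
        stage_out=[("$DW_JOB_STRIPED/output.dat",
                    "/global/cscratch1/sd/username/output.dat")],
    ):
        print(line)
```

The generated lines go near the top of the batch script, next to the scheduler directives, and the application then reads and writes under the burst buffer mount point during the job.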

A

To create this reservation, which is owned only by you or by a group of your users, you need to submit a few jobs. One job creates the reservation; call it job 0. You submit this job as a batch script, specifying the capacity, the access mode, and the type of space, and you also give the reservation a name. Later you use this name as your burst buffer tag.

A

This is how you use the persistent reservation in your job. Don't forget to delete the reservation afterwards; six weeks is how long we can guarantee the data is safe. The third one is the project file system. For running jobs or doing data analytics, the burst buffer and scratch are good choices, but for sharing large data, project is recommended.
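The persistent reservation workflow described above takes three separate jobs: create, use, delete. The sketch below generates the directive line for each step. The spellings (`#BB create_persistent`, `#DW persistentdw`, `#BB destroy_persistent`) are recalled from the DataWarp and NERSC documentation, not from the talk, and the reservation name is a placeholder.

```python
def create_reservation_directive(name: str, capacity_gb: int) -> str:
    """Directive for the job that creates the reservation (job 0 in the talk)."""
    if capacity_gb <= 0:
        raise ValueError("capacity must be positive")
    return (f"#BB create_persistent name={name} capacity={capacity_gb}GB "
            "access_mode=striped type=scratch")


def use_reservation_directive(name: str) -> str:
    """Directive for any later job that attaches to the reservation by name."""
    return f"#DW persistentdw name={name}"


def destroy_reservation_directive(name: str) -> str:
    """Directive for the final job that deletes the reservation."""
    return f"#BB destroy_persistent name={name}"
```

Keeping the three steps in separate jobs matches the description in the talk: the name chosen at creation time is the only thing later jobs need in order to find the space again.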

A

There is a nice feature of project: you can view your files in a web browser. For example, here I have already put some data under this www directory with my user name, and I can just go to this address to check my files.

A

Okay, so that is project. Then there is HPSS. If you want to keep data forever, for example the data from a paper, or some raw data that you want to reuse later to reproduce your science, you should archive that important data. For HPSS there are some best practices on the website. For example, you should archive the data in the way that you intend to retrieve it later.

A

For example, if you have a set of files and your usage pattern is to retrieve them one by one, you probably want to archive them separately. But if your usage pattern is to retrieve them all together, you probably want to tar those files into a single tar file, archive that, and later retrieve the whole bundle. Finally, there are the common file systems, global common and global home. The technology they use is similar. The purpose of global common is to host software stacks.
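The bundling step for files that will be retrieved together can be sketched with Python's standard `tarfile` module. This only builds the local bundle; the transfer to HPSS itself is not shown, and the function names are hypothetical.

```python
import tarfile
from pathlib import Path
from typing import Iterable, List


def bundle_for_archive(paths: Iterable[str], bundle_path: str) -> str:
    """Pack the given files into one uncompressed tar file and return its path."""
    with tarfile.open(bundle_path, "w") as tar:
        for path in paths:
            # Store each file under its base name so the bundle unpacks flat.
            tar.add(path, arcname=Path(path).name)
    return bundle_path


def list_bundle(bundle_path: str) -> List[str]:
    """Return the sorted member names of a tar bundle."""
    with tarfile.open(bundle_path, "r") as tar:
        return sorted(tar.getnames())
```

One bundle means one object to archive and one to retrieve, which fits the "retrieve them all together" pattern; files that are needed individually are better archived separately, as the talk says.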

A

Say you build some software and you want it to be used by your group. You may consider requesting a global common space. Here is a performance comparison showing that the library loading time is faster on the common space. And finally there is global home. As I said, it is designed for hosting your source files, and you may compile code in this space, but it is not intended for heavy I/O operations.

A

There is also a monthly backup to HPSS, so you don't need to worry about that; it is done automatically. Snapshots are also available, so if you deleted some file and want to get it back, you can probably find it in a snapshot.

A

This is just a summary of what we have covered. As I said, there are different file systems: burst buffer, scratch, project, and HPSS, and then global common and global home. They are really designed for different purposes, and you should understand what you are going to do. If you are going to archive some data, go to HPSS.

A

If you are going to share some data with your group for the long term, use project. If you are going to run your job and do some I/O, scratch and the burst buffer are probably the best places.

A

Okay, and finally there is a nice thing designed by the DAS group, which is the data dashboard. If you go to the my.nersc.gov website, you can get a very clear picture of your data on the NERSC file systems, including home, cscratch, and project.

A

Specifically for the data on project, you can get more insight by clicking this data dashboard. You can view all the projects that you belong to, and you can also click this button to check the detailed usage in terms of percentage of space allocation and inode allocation, and each group's percentage of the space allocation.

A

Okay, the last two slides are a list of resources, so feel free to check those pages. If you have any questions, just send us an email, or you can find me on the first floor. Next we will have Quincy Collier to talk more about our best practices. Thank you.