From YouTube: 09 - Data Storage & Sharing Best Practices
Description
Part of the NERSC New User Training on September 28, 2022.
Please see https://www.nersc.gov/users/training/events/new-user-training-sept2022/ for the training day agenda and presentation slides.
My name is Lisa Gerhardt, and I'm going to be talking about data storage and sharing practices. I'm just trying to find the right button. Hopefully that looks okay to everybody, okay, so I'm just going to soldier ahead. I'm in the Data and Analytics group, and my specialty is file systems and data storage.
Let me get the video going. Okay, sorry about that. All right, so first I'm going to talk about data storage best practices. I wanted to talk just a little bit about the data storage policy here at NERSC. We have a very, very long set of very detailed policies at this link, so if this is something that is interesting to you, I encourage everyone to take a look at it at least once, because it's important to know.
These resources are intended for users with active allocations. It's strongly recommended that if your allocation at NERSC ends, that is, if the project that you're a member of ends, you transfer all the data associated with that project somewhere else where you have access, because we can't guarantee that you'll still have access if you don't have an active allocation. In terms of how data is managed inside of NERSC, the project PIs and PI proxies can request the modification, deletion, or transfer to another NERSC file system of any data associated with their NERSC award.
So if you are running at NERSC under a particular project, your PI can request that any data you generate under that project be moved to another place, be changed to another user, those kinds of things, and that's because the PI is ultimately responsible for the research products that are made at NERSC, and that data is part of the research project. In terms of how files are protected inside of NERSC, they're protected with basic Unix file permissions based on user and group IDs that are set, and which you can view in Iris.
So it's your responsibility to ensure that the file permissions, and things like the default masks that you set, are set correctly to handle the privacy or exposure of the data that you have inside the system. You can always reach out to us.
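As an illustration, a minimal sketch of checking and tightening permissions from the command line; the path here is hypothetical, and the right settings depend on how private your data needs to be:

```bash
umask 077                                 # new files are readable and writable by you only
ls -l /path/to/my_project_data            # review the current owner, group, and permission bits
chmod -R o-rwx /path/to/my_project_data   # remove all access for "other" users
chmod -R g+rX /path/to/my_project_data    # optionally let your Unix group read files and traverse directories
```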
If you have questions about this, we're happy to help, and we also have a long page about file system permissions, because it is a little bit hard to understand. And then finally, the last main policy is that users have ultimate responsibility for managing their data. So if we tell you that a file system is ephemeral, you should back things up.
So let me give you a quick overview of the file systems we have at NERSC. Basically, like a number of other centers, right now we have a tiered set of file systems, where the highest layer, the most performant layer, is generally the smallest in capacity. That's because it costs a lot of money to get performance, and so you trade performance for capacity, because you can't buy more of both.
Here I've laid out a really simplified version of the file systems that we have available at NERSC on both Cori and Perlmutter. At the very tippy top I have memory, which you can sort of think of as a file system, because you can temporarily write things to memory, and then below that we have the scratch file system.
On Perlmutter we have a 32-petabyte flash scratch system. It's very fast; you can get aggregate speeds on the order of terabytes per second, but it's small, so it's temporary. We purge it, and I'll talk a little bit more about that later. And then the layer below that is the layer we call the community file system.
That's a Spectrum Scale file system. It used to be known as GPFS, which might be more familiar to people. It doesn't have as fast a streaming rate; it's intended for longer-term data holding and for sharing data within your project.
You can go to this documentation to find more details if you want. So, starting at the top of the file system pyramid, we have Perlmutter scratch. Like I said before, that's Lustre, and it's one of the most successful and mature HPC file systems.
This is where you would store data that's being actively read or written by jobs on the compute nodes. So if you're doing a lot of I/O and you need fast rates, you put it on Perlmutter scratch. Then, when you're done, you move it to a more permanent area, because we do purge. What purging means is that if files aren't accessed within a certain time, they're automatically removed by our system. It's just a process that runs; it's not something that we do by hand, it happens automatically.
The directories on Perlmutter scratch are user-level directories, so each user has their own directory, and by default it's only user-readable; there's no group readability for the directories on scratch. I'll talk a little more about how you can set that up if you want to, but it's fairly rare. Most of them are just for your own user to write to while you're running jobs. There's a quota on Perlmutter scratch.
There's a default quota of 20 terabytes. We do allow you to go over that for a very short time, by about 10 terabytes, and that's intended so that if you're in the middle of a job and you accidentally exceed your quota, everything doesn't fall apart. You know, if you have hundreds of nodes writing and you're just a couple of bytes over your quota or something, we have this buffer so that you can write everything out, and then, after your job is done, you can come and clean up. After you exceed the quota, you're not going to be able to write any more data to the file system. You can remove data, but you can't write any more, and you'll get an I/O error when you try to write that says you're out of space.
Lustre has a really nice design; one of the reasons why it's such a successful, mature HPC file system is that it has a whole bunch of servers called OSSes, and each one of them talks to dedicated storage targets called OSTs.
By default, data files go to one OST only, and that's because most of the file sizes that folks write are appropriate to stripe across just one OST; they're on the smaller side, and that's fine for small files. It's also great for the kind of file-per-process I/O that's pretty popular here at NERSC. So we try to set a sensible default that works for most everybody, but if you're doing something more sophisticated, or you have really large I/Os or different kinds of I/O patterns, you may want to think about increasing the number of OSTs the file is striped against. We have some helper scripts that will automatically do the striping for you; there's small, medium, and large.
A
They
should
just
be
available
in
your
path
when
you
log
into
promoter
Quarry-
and
we
have
a
table
here
over
here
to
kind
of
guide
you
a
little
bit
about
when
and
how
you
might
want
to
set
the
striping.
So
if
you're
doing
single
shared
file,
I
o
you
are
going
to
want
to
Leverage
The
striping,
because
you're
going
to
want
to
talk
to
multiple
osts
excuse
me
and
as
the
file
size
increases
you're
going
to
want
to
get
more
and
more
osts
involved.
So
you
can
really
push
the
bandwidth
for
this.
If you're doing file-per-process I/O, the files will automatically get spread across all the OSTs on their own, just by default, because Lustre round-robins them across the OSTs. So by default you'll get the kind of layout you want for optimal file-per-process streaming, and you don't really need to do anything until your files get really large, because while there are a lot of OSTs, each one of them has a limited capacity.
So if you have a really large file, you might want to spread it across more OSTs, so that you can keep from being tied up for a long, long time talking to a single OST. You can use these helper scripts, and the way that we usually recommend you do it is to set up a directory.
Say you're going to write out files: you create a directory, call it output, and you would run stripe_small on output. It puts the striping on the directory, and then all the files that land in there inherit that striping. If you want to look and just make sure that it's actually working, you can manually query with this special command, lfs (it stands for Lustre file system), getstripe, and then the path to the striped directory.
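For example, a minimal sketch of that workflow; the directory name output is just an illustration, and stripe_small is the NERSC helper script mentioned above:

```bash
mkdir output
stripe_small output      # NERSC helper script: applies a striping layout suited to smaller files
lfs getstripe output     # query the Lustre layout that new files in output/ will inherit
```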
We have a lot more guidance on our website about this, and you can open up a ticket if you have questions. Moving on to the next layer in our file system hierarchy: the community file system. This is intended for large data sets that you need for a longer period, say on the order of one to two years.
It's set up for sharing, with group read permissions. That means every project gets a directory on the community file system, and it usually has the name of your project. So if you're m1234, it would be the m1234 directory, and the path is /global/cfs/cdirs/m1234, or you can use the $CFS variable if you don't want to write all that out. This is intended for sharing and long-term storage, so it's not for intensive I/O.
If you're going to do a lot of I/O, you should use scratch instead. Data on the community file system is never purged; we back it up using snapshots, so there's a seven-day record of all the data on CFS that you can access yourself. If you delete something and you want to get it back from yesterday, you can go to this website and see how to get it back yourself out of a snapshot.
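As a rough sketch of what that self-service restore looks like; the .snapshots directory name follows the NERSC documentation, while the project name, date-stamped snapshot name, and file name here are hypothetical:

```bash
ls $CFS/m1234/.snapshots/                                    # see which daily snapshots exist
ls $CFS/m1234/.snapshots/2022-09-27/                         # browse yesterday's copy of the directory
cp $CFS/m1234/.snapshots/2022-09-27/results.dat $CFS/m1234/  # restore the deleted file
```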
So you don't need to open a ticket. And then the usage is managed by quotas. A project can actually ask for multiple directories on CFS, and you can give each directory a separate quota. This is pretty useful for large groups that have, say, a simulation group and an experimental group, and they want to split these things up, and maybe one of them needs a really large quota and the other one doesn't. You can take the whole quota you're given and split it between these subdirectories.
And finally, we have HPSS. This is our tape archive system at NERSC. It's for really important data that you want to keep for the long term. So this is things like data from your finished paper that's been published, or raw data you might need in case of emergency, or really hard-to-generate data: maybe it took you a really long time to produce, or it needed some special setup. Any kind of precious data that you want to keep, and that you think you might use again, but not right now, you would put into HPSS.
The thing to remember that's different about HPSS is that it's a tape archive; it's actually a combination of tape archive and spinning disk. All the other file systems are spinning disk: they're responsive, you can list a thing and see it right away, you can get the bytes out of the middle of a file right away. But if you try to do that with tape, there's actually a huge library of tapes in a big rack, and a robot has to go and pick out the tape that you want, bring it over to the reader, and then read the tape. It's fast, but it's definitely not spinning-disk-file-system fast.
So if you're trying to do something in HPSS where you need to access data spread across a bunch of files, you're going to have a bad time. What you usually need to do is get your files out of HPSS the way that you think you're going to want to use them. So you put them into HPSS bundled the way you think you're going to want to read them back, and then later you pull them back out and do the kind of analysis that you're looking for there.
There are quotas for HPSS, just like for everything else, but those are controlled in Iris. You can go to the Iris page, go to the storage tab for your account, and you'll see your HPSS info.
Most NERSC users are a member of a single project, but if you're a member of multiple projects, you can go into Iris and adjust the percentages of your usage that you want to charge to each of the projects you're in, and you can find more details about HPSS here. We also have global common, and that's sort of our file system for software. Why do we have a specialized file system for software? Because we want to optimize library load performance.
So here's a plot, and it's kind of a messy plot, there's a lot of stuff on here, but basically it's a time range on the bottom and then seconds to load on the y-axis. The black dots are global common, and then for the other two, the red is CFS and the green is scratch. You can see it takes considerably longer, in most cases, to load a pretty large benchmark with a lot of libraries from those other two file systems than from global common.
Global common has a block size that's optimized for small files, which software usually is, and it's backed by flash storage. So it's very quick, but it's very small. It's really just intended for "this is where I'm going to put my software stack": I have a conda install, I'm going to put it there because I'm going to be reading it on 10,000 nodes and I want it to load fast. And it's set up similar to CFS in that there are group-writable directories.
So it's a good place to put the software stack that your whole project is going to use. On Cori it's read-only on compute nodes to further optimize it; that's not true on Perlmutter, where it's read-write, but usually what you want to do is install from a login node and then just read from the compute nodes.
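As a sketch of that pattern, installing a project-wide environment under global common from a login node; /global/common/software is the usual NERSC location for this, but the project name, environment name, and package list here are hypothetical:

```bash
module load python                 # NERSC-provided Python/conda module
conda create --prefix /global/common/software/m1234/myenv python=3.10 numpy scipy
# later, inside a batch job on the compute nodes, just read and activate it
# (assuming conda is initialized in your shell):
conda activate /global/common/software/m1234/myenv
```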
Then we have the home directories. These are what you land in when you first log into the system. They're a good place for bookkeeping: put your setup scripts there, those kinds of things. You may sit in there and compile, and then install somewhere else. Everybody gets 20 gigabytes of quota.
We very rarely give out quota increases, because we have all these other file systems that are intended to take the large-capacity stuff. Home is not intended for intensive I/O: don't read or write large files from your home directory during compute jobs. It will not go fast.
The home directories are backed up every month into HPSS. We also have snapshots, so if you delete something and want to get it back right away, you can just go into the special snapshot directory and get it back.
The community file system is quota-managed, and because those are shared directories, and sometimes projects can have hundreds of people, when you reach a quota it's really hard to say who needs to clean up, or where all your space is going. Iris can help with that.
It shows you, with the little bars, which directories are getting close to quota right away, and then, I don't have it here, but you can toggle the usage detail and see which users in the project are using the most data. So when you get close to the quota, you can say, hey folks, we've got to clean up, and you can show a little plot of who's using what.
Individual users can also come here and see where their biggest files and directories are, so they know where to start if they need to start cleaning up or archiving. And here's an example of what a usage report looks like: it's for a particular directory, and you can see there are a bunch of users.
If you're up against a quota, then maybe you would want to talk to this particular user and see if they could clean up a little; you get the most bang for your buck there. And then you can adjust the quotas for the community file system in Iris. This is for a project, the das project, which I'm in, on the storage tab, and it shows all the CFS directories that we have, and here are their names.
So this would be /global/cfs/cdirs/dasrepo; that's what I would see on the file system. It shows who the owner is, and it shows how much of the storage has been given out and how much is being used. And then up here you can see the total storage for the whole project, so this 200 terabytes is split amongst these directories.
What is this, seven... six, seven, eight directories, and spread by how we use it. You can come in here and adjust these percentages if you're the PI. You can also ask for a new directory just by clicking this new button, and then it'll pop up a little box, you fill it out and give the name and who you want to own it, and then it'll propagate to the file system within a few hours and be ready for you to use.
We have a somewhat new tool called the PI toolbox. This is for PIs to come in and adjust permissions in their CFS directories.
You can even change the ownership in here, so it's a fairly handy tool for managing permission drift in the community file system. So now I'm going to move on to best practices for data sharing, and first I'm going to start with the idea of sharing inside of NERSC. I talked a lot about the community file system; this is the main way that a project has to share data among themselves.
You can also have a similar kind of construct in HPSS: you can have a shared project directory in HPSS that can be shared by the whole group.
Another interesting way of sharing data at NERSC is these things called collaboration accounts. These are accounts that are tied to a project, basically, instead of to an individual user, and the PI can control who in the project can access them. You can run a special command and log in as the collaboration account, and then you're just like a regular user, except you're acting as this collaboration account. Groups use it for doing things like managing shared data sets or running shared workloads.
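That "special command" works roughly like this; collabsu is the NERSC command for switching to a collaboration account, and c_myproj is a hypothetical account name (you authenticate with your own credentials):

```bash
collabsu c_myproj      # switch from your personal login to the collaboration account
whoami                 # should now report c_myproj; files you create are owned by it
```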
It turns out to be a really useful tool, because then you don't have to keep chowning data sets when people leave, for things that need to keep going at NERSC. You can also share data out of scratch. It's not as common, because it's a little bit harder to set up, and we really recommend you only share read access out of your scratch directory.
If you want shared writes in a scratch directory, we suggest you set up a collaboration account instead; it's just too confusing managing the quotas and things like that otherwise. But if you did want to make your scratch directory readable by your project, you could change the group so that it belongs to your project and then propagate group-readable permissions. There's an example below of how you do that.
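The slide's example isn't captured in the transcript, so here is an illustrative reconstruction; m1234 and the directory name are hypothetical, and $SCRATCH points at your scratch directory:

```bash
chgrp -R m1234 $SCRATCH/shared_results     # hand the files to your project's Unix group
chmod -R g+rX $SCRATCH/shared_results      # group members can read files and traverse directories
chmod g+x $SCRATCH                         # let group members reach the directory at all
```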
Finally, if you want to share with external collaborators, we have a whole bunch of different ways to do that too. There's public HTML access: by default, if you create a directory called www inside your CFS directory, its contents are available via our portal.nersc.gov. Anything you put in there and make world-readable will be viewable at this URL, so it's a really handy way to quickly share files over the web.
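For instance, a quick sketch; m1234 and the file name are hypothetical, and the URL shown follows the usual portal.nersc.gov project-directory convention, so check the docs for your project's exact path:

```bash
mkdir $CFS/m1234/www
chmod o+rx $CFS/m1234/www                  # directory must be world-traversable
cp plots.html $CFS/m1234/www/
chmod o+r $CFS/m1234/www/plots.html        # file must be world-readable
# then it shows up at something like https://portal.nersc.gov/cfs/m1234/plots.html
```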
We also have Globus sharing. I'll talk a little more about Globus, but basically this is a read-only endpoint where folks can share data via the Globus protocol, and this is what we recommend if you have large data sets you need to share with the general public, like terabyte-sized or larger; you'd really want to look at using Globus sharing.
We also have a network of data transfer nodes. We have four of them that are open to SSH; you can just go to dtn01.nersc.gov to get to them. These are set up with high-bandwidth network interfaces, they're tuned for efficient data transfers, and we monitor bandwidth between NERSC and the other big facilities over ESnet, like Oak Ridge and Argonne, the other national labs, to make sure that things can move around quickly. These data transfer nodes have direct access to the community file system, the HPSS archive, and Cori scratch.
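As a small example of using a DTN directly; elvis is a hypothetical username and the destination path is an illustration:

```bash
ssh elvis@dtn01.nersc.gov                  # log into a data transfer node
# or push a file from your own machine through a DTN:
scp big_dataset.tar elvis@dtn01.nersc.gov:/global/cfs/cdirs/m1234/
```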
So if you want to move large volumes of data in and out of NERSC, or between NERSC systems, we recommend you use the NERSC DTNs. If you want to move data in and out of Perlmutter, you should use the Perlmutter login nodes or Globus, which I'll talk about next. Okay, so Globus is the recommended tool for moving data in and out of NERSC.
It will retry if things fail, it does a checksum on either side to make sure that data integrity is kept, and then it'll send you an email when everything is done, or if there's a problem. So this is really great if you need to move large volumes of data; it's kind of drag-and-drop and come back. You can check on your file transfers and see how they're going, and it just takes care of all the pain of moving the data for you.
There's a web-based GUI, which is how most folks interact with it, but there's also the Globus CLI, command-line scripts that you can use, and we have a module that you can load so you can interact with it. We have a couple of command-line scripts that you can use to move data. Most institutions pay to set up a Globus endpoint, and you can find them and move things between them.
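As a rough sketch of driving a transfer from the Globus CLI; the endpoint UUIDs and paths below are placeholders, not real NERSC collection IDs:

```bash
globus login                             # one-time browser-based authentication
globus endpoint search "NERSC"           # look up the UUID of the NERSC DTN collection
globus transfer --recursive \
    SRC_ENDPOINT_UUID:/global/cfs/cdirs/m1234/dataset \
    DST_ENDPOINT_UUID:/path/at/destination \
    --label "m1234 dataset copy"
```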
But if you're just a person who wants to transfer a bunch of stuff from the NERSC endpoint, you don't have to deal with keeping your scp going in the background, restarting it, and figuring out all that stuff; you can still use Globus. They have this thing called Globus Connect Personal. It's a little piece of software that runs on Linux, Mac, or Windows. You install it, it's usually plug and play, and then it sets up your own little personal endpoint on your laptop, so you can go from, say, the NERSC DTN to your personal laptop and transfer data that way. It won't be as fast as going between the large computing centers, because you don't have the super fancy network, but it will keep trying and it'll keep going, and the data will get there eventually, even if it's a little slow. So it is possible to use Globus even without leveraging an endpoint at a large institution.
So, some general tips for transferring data. We've already said we really think you should use Globus for large transfers. They don't have to be external; they can be internal or external. If you have hundreds of terabytes you need to move from CFS to scratch, you can use Globus for that; it doesn't need to be external.
When you're moving data, some things to think about in terms of performance: it's often limited by the remote endpoint. We rely on our companions at ESnet; they give us, out of NERSC, a really high-speed, very performant network, and that goes fine as long as you're on that backbone, but usually the problem is the last mile.
So that's usually the issue with most large-scale data transfers: problems in the wide-area network. But sometimes at NERSC there can be file system contention, and if you are seeing that, or if you look at the MOTD and see things are degraded, you may want to try a different file system, or try a different time, if you're seeing really slow rates.
Sometimes you can have problems because you're using the wrong directory. Don't expect to transfer a whole bunch of terabytes into your home directory and have it go well: number one, you'll hit your quota; number two, it'll be really slow. So don't use it for transfers.
If you're taking all these considerations into account and you're not getting the performance you expect, then definitely please open a ticket and we'll help you debug what's going on.
I have just a few minutes left, but I want to talk about transferring with NERSC HPSS. I mentioned before that HPSS is special because it's a tape archive. It's great because it gives us a whole bunch of capacity (it's about 200 petabytes that we have in there right now, which is a lot of data), but that makes it a little bit hard to get large amounts of data in and out of it without doing some special things.
So we have some mechanisms set up at NERSC to help you get data out of HPSS efficiently. For one thing, we have a transfer queue that you can use to transfer data in and out of HPSS. You can run up to 15 jobs at once pulling data from HPSS, and you can use our transfer queue to spread those out across all the login nodes and spread the load.
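As a sketch, a minimal transfer-queue batch script; the qos name xfer is the usual name for the NERSC transfer queue, the HPSS file name is hypothetical, and hsi is the HPSS command-line client:

```bash
#!/bin/bash
#SBATCH --qos=xfer
#SBATCH --time=12:00:00
#SBATCH --job-name=hpss_pull

# pull an archive back out of HPSS into the submit directory
hsi get run42_results.tar
```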
htar is a built-in bundling tool that will take your small files and bundle them up into archives that are optimal for tape. And then, if you want to use Globus for this, we have command-line tools for external Globus transfers. We have a whole lot of detail here at this link, so I encourage you to check it out if you're moving large volumes of data in and out of HPSS.
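A quick sketch of htar usage; the archive and directory names are hypothetical:

```bash
htar -cvf run42_results.tar results/run42              # bundle a directory straight into HPSS
htar -tvf run42_results.tar                            # list what's inside the bundle
htar -xvf run42_results.tar results/run42/summary.dat  # pull a single member back out later
```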
So, just to conclude: NERSC has multiple file systems to fulfill different performance and capacity needs, and we have a whole bunch of different ways to share and transfer data. If you want to read more details, please check out our web documentation. Thank you very much.