From YouTube: 10 Data Transfer Best Practices
Description
Part of the NERSC New User Training on June 16, 2020.
Please see https://www.nersc.gov/users/training/events/new-user-training-june-16-2020/ for the training day agenda and presentation slides.
A
So the first thing I want to let you know is that we have a dedicated set of nodes, so-called data transfer nodes (DTNs), that you can use for moving data into and out of NERSC. You can SSH directly into four of those nodes; they're at this address here, dtn01.nersc.gov, all the way through dtn04.
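As a minimal sketch of what that looks like, assuming a placeholder username:

    # log in to the first data transfer node (replace "elvis" with your NERSC username)
    ssh elvis@dtn01.nersc.gov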
A
These servers are optimized for data movement. They have high-bandwidth network interfaces, they've been highly tuned for efficient data transfers, and members of the Data and Analytics Services and Storage groups have worked to make these really, really work well, along with working with ESnet to remove a lot of obstacles in the LAN that are between us and our major science partners. And we have monitored bandwidth between NERSC and other major facilities, such as other DOE labs and SLAC, to make sure that data is moving as optimally as it can.
A
These nodes have direct access to all of the NERSC file systems that were just mentioned. You can log in there and interact with the community file system, the HPSS tape archive, Cori scratch, and so on. You can use them to move data externally to other systems outside of NERSC, and you can use them to move data internally between NERSC systems and HPSS.
B
A
Thanks, sorry about that. Yes, so use the DTNs to move large volumes of data into and out of NERSC, or between NERSC systems. If you are moving a lot of data, we recommend that you use Globus, which is a software stack that's optimized for moving data. Under the covers it's basically parallel GridFTP, which is something that parallelizes data transfer movement for you, and they have a very easy-to-use web-based service on top of that.
A
Basically, it's a GUI, and I'll show it to you in a minute, but it does automatic retry. So you set up the transfer and it'll move the data for you, and if you're moving 10,000 files and one of them fails, it'll retry that file again, several times, until it either succeeds or gives up. Then it sends you an email when it's done, and you can go to the website and check on these transfers. So it's really great for fire-and-forget data transfers.
A
We have a Globus endpoint, and most major educational institutions have a Globus endpoint. If you don't have one at your target, you can set up a Globus personal endpoint, and even that's a little better than just trying to move the data yourself with secure copy, because it automatically does the parallel movement, and it does the retry and the notification when it's done, so that extra stuff really makes data transfer much easier. There's a web-based GUI I'll show you in a minute.
A
We also have some Globus scripts deployed at NERSC for command-line transfers, and Globus also has a REST API. If you wanted to write your own services to poll against it, they have a nice Python software development kit you can use to write scripts. They also have a canned Globus Connect Personal installation that works for Mac, Windows, and Linux that you can install on your personal laptop, if you wanted to move data that way.
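As an illustration of the command-line route, here is a minimal sketch using the separately installed globus-cli package; the endpoint UUIDs, paths, and label are placeholders, not NERSC-specific values:

    # one-time setup: install the Globus CLI and authenticate
    pip install globus-cli
    globus login

    # find an endpoint by name, then request a transfer between two endpoints
    globus endpoint search "NERSC DTN"
    globus transfer SRC_ENDPOINT_UUID:/path/src.dat DST_ENDPOINT_UUID:/path/dst.dat --label "demo transfer"

Globus then handles the retries and sends the completion notification, just as with the web GUI.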
A
So let me just do a quick demo of Globus. This is the Globus web page, it's www.globus.org, and you can go here to this button. The first time you get here, you're going to need to create a Globus account. I've already created a Globus account, so let's assume that you've done that. So you can come here and you can log in, and once you've got your Globus account, you can link it against your NERSC account or any number of other accounts that are here.
A
So here's what it looks like when you log in. Basically, this is the GUI here, and you can go here and search. We have a bunch of endpoints; I recommend the NERSC DTN endpoint. This is the one that points directly at the data transfer nodes. And then here is a listing of everything that's in your home directory. Let's say I just wanted to put something somewhere: I'm going to use this to put this into another endpoint that we have, which is NERSC Cori.
A
And
so
here
it's
loading
the
it's
also
loading
my
home
directory
because
they
both
amount
the
same
thing,
but
you
can
click
on
a
file.
You
want
to
move
and
then
you
can
go
over
here
and
put
a
place
where
you
want.
To
put
it
so,
let's
say:
there's
kind
of
folder
I'll
put
it
into
this
bar
folder,
and
then
you
just
click
on
this
start
and
it
does
the
transfer.
You
can
click
on
this
to
go,
see
the
details
of
the
transfer
and
it'll.
Tell
you
like.
A
Okay, so that's Globus. It's pretty easy to use; I definitely recommend it. So, some general tips for transferring data. For large transfers, we recommend Globus Online if you can use it. Also, if you're moving large chunks of data inside of NERSC, let's say you have several terabytes of data you want to stage to Cori scratch, you can use Globus for that. If you're doing smaller, one-time transfers that are less than 100 megabytes or so of data, you can use scp.
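For those small one-off transfers, a minimal scp sketch, where the username, file, and destination path are placeholders:

    # copy a small file from your laptop to a DTN (destination path is illustrative)
    scp results.csv elvis@dtn01.nersc.gov:/global/homes/e/elvis/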
A
The
globus
is
also
fine
for
small
transfers
too,
and
then
the
the
data
transfer
nodes
are
really
just
for
transferring
data.
So
please
don't
use
them
for
non
transfer
purposes.
Don't
use,
don't
run
compute
heavy
things
there,
don't
if
you
can
avoid
it
them
on
your
work
flows
out
of
it.
Let's
move
those
to
the
work
flow
nodes
and
use
the
system,
login
notes
for
more
more
general
routine
tasks
like
compiling.
A
So
one
thing
to
think
about
when
you're
trying
to
get
the
best
performance
is
that
usually
the
performance
is
often
limited
by
the
the
remote
endpoint.
So
you
know
nurse
works
really
closely
with
the
essent
to
get
the
most
optimal
paths
out
of
our
system
and
a
lot
of
institutions
also
have
optimal
paths
into
their
systems.
But
then
it's
that
last
hundred
meters
or
whatever
to
the
computer
you're
trying
to
get
through
that
usually
is
very
difficult
to
get
that
highly
optimized.
A
So
often
the
remote
endpoint
isn't
tuned
for
optimal
data
transfer,
and
so
you
can
see
low
performances
just
based
on
that.
Occasionally
the
nurse
side
file
system
contention
could
be
an
issue
like
Cori.
Scratch
is
very
heavily
loaded,
so
it's
delivering
data
very
slowly
and
also
don't
use
your
home
directory
for
for
large
data.
I
mean
number
one.
It
wouldn't
fit
number
two.
It
doesn't
perform.
Well,
so
keeping
those
things
in
mind.
A
So one special case I wanted to talk about is transferring into and out of the NERSC HPSS archive. As was mentioned earlier, we recommend you use our tape archive for archiving large amounts of data for a long period of time, and we have pretty extensive documentation on this; I recommend you check it out. So when you're done with your data and you don't think you're going to need it for, you know, about a year, maybe it's data from your paper that you want to keep forever but you don't think you're going to be frequently accessing, that's the time you want to put it into HPSS. And you can do that either on the interactive DTNs, by SSHing into there, or you can use our special transfer queue on Cori to move this data into and out of HPSS. And we have two main ways on the command line that you can move data into HPSS. There's something called HSI.
A
It
works
with
a
put
get
method,
so
you
can
say,
HSI
put
this
file
and
it'll
put
that
file
into
HP
SS.
If
you
want
to
get
it
back
out,
you
use
get,
and
so
you
can
use
that
for
putting
individual
files
into
there.
It
also
has
some
nice
flags
on
there
for
like
conditional
access
like
put
this
file
into
HP
SS.
If
it
if
the
one
in
HP
is
the
one
on
the
spinning
disk
is
newer
than
the
one
in
HP
SS
then
transfer
it
over.
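A minimal sketch of those HSI commands; the file name is a placeholder, and I'm assuming cput here as the conditional form the talk describes (it only stores the file if the HPSS copy is missing or older):

    # store a file in HPSS, then retrieve it later
    hsi put results.h5
    hsi get results.h5

    # conditional put: transfer only if HPSS doesn't already have an up-to-date copy
    hsi cput results.h5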
A
If
not,
don't
do
anything,
so
you
can
do
some
sort
of
you
can
do
some
running
backups
of
your
system
for
things
that
have
changed
generally,
if
you
have
lots
of
files
like
more
than
about
a
about
ten
or
so,
we
recommend
that
you
use
H
tar
to
aggregate
them
up
and
H
star
bundles
things
up
in
the
same
way
that
tar
does,
except
that
the
output
tar
file
goes
directly
into
HP
SS.
It
doesn't
hit
your
spinning
disk.
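A minimal htar sketch, with placeholder directory and archive names:

    # bundle a directory of small files into a tar archive written directly into HPSS
    htar -cvf my_project_2020.tar my_project/

    # list the archive's contents, or extract it later
    htar -tvf my_project_2020.tar
    htar -xvf my_project_2020.tar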
A
It can be really handy if you have, let's say, two terabytes of data that are all in small files: you can use htar to bundle them all up and put the output into HPSS, and you don't need to have space for both of those things before you put it in. And like I said before, when you're bundling these things up, think about how you might want to come and get them back out. If you think you're going to want a whole year's worth of data all at once, maybe you put that into one bundle; if you think you're only going to need, I don't know, the calibration files or something, maybe you bundle those separately. Just try to think about how you might possibly want to get them out when you're putting them away. You can also use Globus for our HPSS system, but we actually recommend that you use our command-line tools.
A
Another thing to keep in mind is tape ordering: instead of, you know, getting one file off a tape, and then the machine has to load another tape, and it's a little robot that has to go down the aisle, pick up another tape, and put it in. If you don't order your requests so that you're getting everything off the tape in order, it can really slow down the process.
A
So
this
is
this
is
me
and
Corey,
and
you
can
get
there's
some
documentation
on
this
on
how
to
do
this,
this
to
dip
to
these
tools.
You
say:
module,
load,
Globus
tools
and
then
you
can
say
which
transfer
files.
So
this
is
a
helper
script
that
will
let
you
transfer
start
Globus
transfers
from
the
command
line
at
nursing,
and
so
you
don't
have
to
use
this
just
for
HP
SS.
You
can
use
it
for
various
other
things.
You
could
script
it
as
part
of
your
job.
Submission
I'll
talk
a
little
bit
more
about
that.
A
If
you
wanted
to
stage
data
and
then
submit
a
job
when
it's
done
and
so
the
way
you
use
this,
is
you
say
you
have
a
source,
that's
where
this
data
is
coming
from,
and
this
is
normally
an
endpoint
UUID
which
you
can
get
from
the
globus
endpoint
page?
Is
this
really
a
long
Godley
good
thing,
but
it
also
knows
a
bunch
of
shortcuts
for
the
nurse
DTN
endpoint,
you
can
just
say
DTN
for
nurse
gage
PSI.
A
So here I'm going to pull these files from HPSS and put them into my scratch directory. I've listed the files that I have in HPSS in this text file, my output here is my scratch directory, and then it's going to pull from HPSS and move the files into my scratch directory. So I can hit return.
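If you'd rather not use the NERSC helper script, the same file-list pattern can be sketched with the generic Globus CLI; the endpoint UUIDs and paths are placeholders, and note that the batch syntax varies by CLI version (older versions read the list from stdin instead of a file argument):

    # files.txt lists one "source-path destination-path" pair per line
    # transfer every pair listed in it between the two endpoints
    globus transfer SRC_ENDPOINT_UUID DST_ENDPOINT_UUID --batch files.txt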
A
We can take a look in this directory, and these files are here, I think, so they've been copied over. And, excuse me, there's also an example script, this stage_data script, which we can give the same arguments, but with the addition of an analysis script, a job script. What this will do is move the data over, and it'll keep checking until the data is transferred, and then it will submit this analysis script job. So this is submitting to the transfer cluster.
A
Just
watch
this
directory
and
what
it's
doing
it's
setup,
the
transfer,
and
now
it's
querying
every
few
minutes
transfers
active,
not
succeeded.
It
submitted
this
analysis,
script
job,
which
is
going
to
use
this
data
that's
in
place,
so
you
can
see
this
this
this
stage.
Data
script
is
actually
included
in
the
the
globus
tools
module,
and
so
you
can
take
a
look
at
this,
and
if
you
wanted
to
use
something
like
this
for
your
own
job
submission,
you
can
just
alter
it
a
little.
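As a rough sketch of that stage-then-compute pattern, here is a hypothetical Slurm job for the transfer queue; the xfer QOS name, script names, and paths are assumptions based on the talk, not exact NERSC settings:

    #!/bin/bash
    # stage.sh: a hypothetical staging job submitted to Cori's transfer queue
    #SBATCH --qos=xfer
    #SBATCH --time=02:00:00

    # pull the input data out of HPSS onto scratch (paths are placeholders)
    cd $SCRATCH/myrun
    hsi get run01.h5

    # once the data is in place, hand off to the compute job that analyzes it
    sbatch analysis.sh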
A
If you have data you need to share with someone, you can set up a Globus sharing endpoint, and then that has a permanent web address that you can use to share this data with either all Globus users, like anyone who has the address and is on Globus, or with a specific subset of just a few Globus users. It's your choice.
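A minimal sketch of granting that access from the Globus CLI; the shared endpoint UUID, path, and identity are placeholders:

    # let one specific Globus user read the shared folder
    globus endpoint permission create SHARED_ENDPOINT_UUID:/dataset/ --permissions r --identity colleague@university.edu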