From YouTube: New User Training: 10 Burst Buffer
Hi, I'm Wahid, and I'm going to carry on from where Charlene stopped, to talk about the burst buffer. So why would you want such a thing, which is a large SSD layer, just to spoil the surprise of what a burst buffer is? The general motivation is that we see a lot of spikes in I/O bandwidth, but it generally tends to be spiky behaviour: people do get bottlenecked by I/O, but it's not as if the file system is seeing that level of I/O contention all the time.
So, while providing capacity via disks is the cheaper option, if you actually want high performance in this sense, then SSD is a relatively cheap option. The motivation is to have some smaller-capacity layer that can handle these kinds of bandwidth spikes, while providing the larger capacity in a normal parallel file system. And then a couple of other comments here. As was probably mentioned already, we have this large file system.
The Lustre file system on Cori, for example, is a huge 30-petabyte POSIX parallel file system, where every user can see every other user's files and so forth, and really that's not a very scalable model for performant file systems.
So the other innovation of these burst buffers is to build file systems on demand, which look like POSIX file systems but aren't shared with people they don't need to be shared with. And the other thing is: we have this high-performance network.
Why not actually put the storage on that network? So, the architecture of the burst buffer at NERSC: it is on the Cori system, so don't try to use this on Edison. The burst buffer sits on the Aries high-speed network, and this structure is obviously replicated several times. You have compute nodes.
You also have conventional I/O nodes that talk to the Lustre file system, which is on a different fabric, a different network. But you also have burst buffer nodes that look very much like the I/O nodes, except that instead of connecting to the file system they have SSDs directly in them. So that's the hardware side. But then there's also the DataWarp software, provided by Cray, that goes with this.
That software is integrated with the workload manager, so you can request pools of storage on this system in quite a flexible way, either to use just within your job or to be persistent across different jobs, and I'll explain a bit about that later. So there's nothing magic about the burst buffer; you just see a file system that you can use at the end of it.
When you want to use DataWarp, you create an instance, and this can be per-job: created by the job that uses it, and automatically destroyed when the job ends, including all the data on there. This is still useful, because it's similar to a local disk: you can stage in data from the file system, run against it, and stage out everything you've produced that you want to save, but you don't have to stage out things like checkpoints and what have you.
Per-job instances exist only for your application while it's running, but we also provide persistent instances, which can be used by other users as well, subject to the normal UNIX file permissions. How long a persistent instance lasts is set by its creator. This is useful if you have something you're going to reuse frequently across different jobs, or that other people are going to reuse.
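As a concrete sketch, creating a persistent instance is done from a batch job with a `#BB` directive (the reservation name `myBB` and the sizes below are illustrative, and the exact directive spelling follows the Cori/DataWarp convention, so check the NERSC pages before relying on it):

```shell
#!/bin/bash
#SBATCH -q regular -N 1 -t 00:05:00 -C haswell
# Ask SLURM to create a persistent DataWarp reservation that outlives
# this job; its lifetime is then managed by its creator.
#BB create_persistent name=myBB capacity=100GB access=striped type=scratch
```

A matching job containing `#BB destroy_persistent name=myBB` removes the reservation again when you no longer need it.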
Okay, so, I don't know if I mentioned this already, but there are different ways of accessing the data on an allocation of burst buffer nodes. One is called striped mode. Here you can have several burst buffer nodes and several compute nodes, and all the files you create are striped across all of the burst buffer nodes and visible to all the compute nodes.
So this is similar to a parallel file system, but created just for your job, or for a persistent reservation. It creates its own file system, and it also has its own metadata server, but just one metadata server for the whole allocation. On the other hand, you can have a mode like private mode, where the files are only visible to the compute node that creates them.
This is much more like a local-disk sort of scenario, and it's still useful for any of those applications that only create data locally and just need a local-disk type of storage. An additional advantage here is that each DataWarp node is its own metadata server, so for workloads that are very metadata-limited this can be a good model.
[Audience question] You're saying, if you don't need the compute nodes to talk to each other, but you still choose to use striped mode? Yeah, basically it's probably only this one factor: if you're limited by metadata, then private mode is more scalable in that respect. But otherwise, you might think:
A
Why
not
always
do
this
because
you
have
the
option
of
seeing
across
the
compute
nodes,
and
you
also
can
obviously
not
share
if
you
don't
want
to,
but
but
yeah
they
only,
and
we
have
seen
cases
where
people
have
like,
for
example,
a
SQLite
database
or
something
that
has
a
lot
of
metadata
operations
that
actually
private
mode
is
beneficial.
So.
Okay, so the way most users interact with this is via job scripts, and I'll show an example in the following slides. You can allocate the space you want, you can choose what files you want to stage in, and then you get environment variables defined which give you the mount point, without having to guess the cryptic path that it's mounted at. There are other interfaces as well, which I'll just mention and won't talk more about here.
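For instance, inside a batch job that requested a striped per-job allocation, the mount point is exposed through an environment variable (the variable name below is the one used on Cori; the file name is illustrative):

```shell
# $DW_JOB_STRIPED points at the burst buffer mount for this job.
echo "burst buffer is mounted at: $DW_JOB_STRIPED"
# It behaves like an ordinary directory:
cp input.dat "$DW_JOB_STRIPED/"
ls -l "$DW_JOB_STRIPED"
```

These commands only make sense inside a running job that has a burst buffer allocation; on a login node the variable is unset.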
Okay, in terms of the SLURM way of interacting: this is just a sort of regular batch script, and then it's got these extra lines that look commented out but aren't, the #DW lines, which are recognized by SLURM. The first defines what space you want: it says you want a thousand gigabytes, its access mode is striped, and its type is called scratch, so the duration is just for the compute job; in this case it's not persistent. Then you might want to stage in some data.
And then you can also define a stage-out, which occurs at the end of the job. Even though it's placed at the beginning of the batch script, it's meant for outputs that you stage out at the end of the job. Then there's the line where your application runs. The input- and output-file arguments here are just examples of arguments your application might take; they won't work for all applications.
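Putting those pieces together, a complete per-job script might look like the following sketch. The paths, file names, and the application's `-i`/`-o` flags are all illustrative, and the `#DW` directive syntax is the Cori/DataWarp form described above:

```shell
#!/bin/bash
#SBATCH -q regular -N 2 -t 01:00:00 -C haswell
# Request a 1000 GB striped scratch allocation for the life of this job:
#DW jobdw capacity=1000GB access_mode=striped type=scratch
# Copy input into the burst buffer before the job starts...
#DW stage_in source=/global/cscratch1/sd/username/input.dat destination=$DW_JOB_STRIPED/input.dat type=file
# ...and copy results back out after it ends:
#DW stage_out source=$DW_JOB_STRIPED/output.dat destination=/global/cscratch1/sd/username/output.dat type=file
srun ./my_app -i "$DW_JOB_STRIPED/input.dat" -o "$DW_JOB_STRIPED/output.dat"
```

Anything else written under `$DW_JOB_STRIPED` (checkpoints and so on) simply disappears with the instance when the job ends.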
You can delete a persistent reservation, and you should do so if you don't need it anymore, but again you have to do it via a batch job. You can also use the interactive queue, which hopefully people have talked about, if you want to submit these kinds of jobs interactively. Then, in order to use the reservation in another job, you just have a line which says persistentdw instead of jobdw, plus the name that you defined earlier.
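A job that attaches to an existing persistent reservation then only needs one directive. Here the reservation name `myBB` is hypothetical, and the environment variable follows the `DW_PERSISTENT_STRIPED_<name>` pattern used on Cori:

```shell
#!/bin/bash
#SBATCH -q regular -N 1 -t 00:30:00 -C haswell
# Attach the previously created persistent reservation instead of
# creating a fresh per-job instance:
#DW persistentdw name=myBB
srun ./my_app -o "$DW_PERSISTENT_STRIPED_myBB/results.dat"
```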
Okay, then just to mention a couple of other tools that are useful. One is just a SLURM command, scontrol show burst, which shows all kinds of information about the available system, including the pools that are available, but also about buffers allocated specifically to you. It can, for example, remind you of the name of a persistent reservation you made earlier, if you've forgotten it. And then there's another useful command that you can run.
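On Cori that first command is simply (run from a login node; the exact fields shown depend on the SLURM version installed):

```shell
# Show burst buffer pools, their granularity and free space, and any
# buffers (including persistent reservations) allocated to you:
scontrol show burst
```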
Perhaps in your job, that is the dwstat command, and there are a couple of scripts provided that show you, for example, which burst buffer nodes you've actually been allocated. So if you're wondering how many burst buffer nodes you have, and therefore how widely your files are striped, you can find that out from this command. Then, I talked about striping here.
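A minimal sketch of the dwstat workflow (the `dws` module name is the one used on Cori; treat it as an assumption if you are on a different system):

```shell
module load dws   # makes the Cray DataWarp command-line tools available
dwstat most       # overview: pools, sessions, instances, configurations
dwstat fragments  # per-server pieces, i.e. which burst buffer nodes you got
```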
Here you basically don't have much choice, unlike with the Lustre file system, about the striping that you get: it's pretty much defined only by the space you request. To give you a little bit of control, we currently have two pools, which have a different "granularity", as it's called. If you request the same space from each pool, it's divided up into these granules across the nodes. (And I think the maths is right now; I had to change it from a previous version of this talk, but anyway.)
Although this is for somewhat idealized tests, there are various performance tips on the burst buffer web pages, for which I'll provide links at the end. One simple example of a tip is to stripe your files across multiple burst buffer servers. As I said, that's controlled only by the space you request, so you may sometimes want to request more space than you need, because it stripes wider. Okay, so in summary: NERSC has a burst buffer for science.
You can get SSD performance, and these on-demand file systems are very flexible, so it's not just like a local disk: you can use it for big shared files, or you can use it in a local-disk-like way. But that flexible nature means you'll probably have to play around a little bit to maximize performance. We're now finding that users generally get good performance and a pretty stable service, especially compared to when we initially installed the system. But a lot of the syntax and the error messages
you get are pretty esoteric, and you can't really just Google for all of these things; probably the only place you're going to find them is the NERSC website. And some aspects of the performance tuning may be different from what you're used to with the Lustre file system, for example. So the message here is: you shouldn't just give up. As with everything at NERSC, just let us know if you have any problems, and there's a bunch of resources there. Okay, I don't know if people have questions. Let me take them.