Description
NERSC Data Seminars Series: https://github.com/NERSC/data-seminars
Abstract: This talk will present an update on the features and future of HDF5 for exascale HPC. Currently, our work focuses on asynchronous I/O and node-local storage caches, but future work will include GPU direct I/O and data movement across the deeper memory hierarchy anticipated on future systems.
A: Okay, so I can waste one minute on an introduction, not that anyone really needs one. Quincey is now with us at NERSC, and Suren is also in CRD, and they've been working on, I guess, the future. So it will be very exciting to hear about where HDF5 can go from here. So hopefully... once they find the password... but yeah, I guess take it away, and we'll see.
B: Sorry, didn't mean to cut you off. Sure thing. All right, thanks everyone for coming; it's great to have you all here. I want to acknowledge the many, many team members we have on this project. It's fairly long-running; we've been renewed by ECP at least once.
We've had a sequence of people in some cases. Primarily it's a collaboration between LBNL, Argonne, and The HDF Group, but we've had some really great interns, particularly this summer; Kaiyuan and John have contributed quite a bit to the effort.
So it's a broad effort across a lot of people. I'll talk briefly about HDF5 itself; many people are familiar with it, so I've tried to pack it down into a reasonably short set of slides. Then I'll talk about how that applies to the ECP applications and the features we have been working on to help those app teams now and in the future, and then think about: well, what does this mean in the longer run? Where are we going in the next few years?
Once we get to the end of our current milestones, how can we continue to carry those forward? So: why use HDF5?
You know, if you ask yourself: how do I deal with I/O at exascale? Do I need to understand the specifics of the MPI standard for I/O? And, by the way, why is my checkpoint taking so long? HDF5 is designed to hide all that I/O complexity so that you can concentrate on your science. That's our goal, right?
We want to help you out by hiding all those moving parts behind really nice, easy abstractions, so that science application teams can just work with the HDF5 API and trust that we'll do the right thing: they'll get their data back, and it'll perform well. So: HDF5 stands for Hierarchical Data Format version 5.
You do I/O on data, of course, according to the data model, but it's built on top of lots of different kinds of backing stores: POSIX, object stores, the cloud, memory hierarchies, whatever you want. And the last thing, which I won't emphasize here, and which we're actually gradually migrating away from, is that we have a high-volume, complex-data-friendly file format that's very well defined. People have written third-party readers and writers for our file format; we're pretty confident that we've described it, and even if the software went away, you could get your data back.
So the ecosystem is, as I say, quite broad. This is just a very small and, in effect, somewhat old subsampling of all the different teams and tools that are working with HDF5 data today and over the last 20 years. I think our first release was in 1997, so we've been doing this for more than 20 years.
Okay, conceptually speaking, HDF5 is a lot like XML: it's self-describing, it has an extensible type system, and it has a lot of rich metadata that users and the software can apply to the data you create. It's also designed to be high performance, compact, and scalable, like binary flat files.
Sometimes we call HDF5 the PDF of science: it's a standard interchange format, it contains a lot of different kinds of data in one container, and it's hierarchical in a lot of ways.
You can also make it a true graph, if you really feel like building your links in whatever way works for you; but mostly it's a lot like the directories in a file system. And we provide lots of random-access, subsetting kinds of capabilities, so it's similar in some ways to databases, although it's much friendlier to science data. So it's a broad intersection of all of these.
It's not a true superset of any of them, and it has capabilities that are not included in any of these related concepts and technologies.
So, from a data model perspective, files are containers. We're gradually moving away from "a file is a file on the file system" and opening that concept up: it could be an object store system with many, many objects inside a container, and you treat those as if they were a single container for the objects within it. But conceptually speaking it's a file; it has a set of objects that belong to that file.
The core objects inside HDF5 are datasets: basically, a multi-dimensional array of homogeneous data elements. In order to understand how that works in the file, we need to store a description of it, so we have to have some specification of the data elements themselves, and of how big the array is.
The first component, for the data elements, is what we call datatypes. In this case I'm just naming one: a 32-bit little-endian integer. You can make these arbitrarily complex: nested compounds, variable-length sequences, array fields, the whole shebang, quite complex if you'd like complex data elements. But it's also very efficient to store floats and ints, and we describe those and let you do I/O on them.
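A minimal sketch of what such a datatype looks like through the C API; the particle_t struct, file name, and sizes here are illustrative, not from the talk:

```c
#include "hdf5.h"

/* A hypothetical record type, used only for illustration. */
typedef struct {
    int    id;
    double pos[3];
} particle_t;

int main(void)
{
    hsize_t three = 3;
    hid_t   arr_t = H5Tarray_create2(H5T_NATIVE_DOUBLE, 1, &three);

    /* Compound datatype mirroring the in-memory layout of particle_t.
     * A plain 32-bit little-endian integer would just be H5T_STD_I32LE;
     * HDF5 converts between memory and file representations for you. */
    hid_t cmp_t = H5Tcreate(H5T_COMPOUND, sizeof(particle_t));
    H5Tinsert(cmp_t, "id",  HOFFSET(particle_t, id),  H5T_NATIVE_INT);
    H5Tinsert(cmp_t, "pos", HOFFSET(particle_t, pos), arr_t);

    hsize_t dim   = 10;
    hid_t   space = H5Screate_simple(1, &dim, NULL);
    hid_t   file  = H5Fcreate("particles.h5", H5F_ACC_TRUNC,
                              H5P_DEFAULT, H5P_DEFAULT);
    hid_t   dset  = H5Dcreate2(file, "particles", cmp_t, space,
                               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    particle_t buf[10] = {0};
    H5Dwrite(dset, cmp_t, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

    H5Dclose(dset); H5Sclose(space); H5Fclose(file);
    H5Tclose(cmp_t); H5Tclose(arr_t);
    return 0;
}
```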
For the array-ness of the array, we need to store how many dimensions it has (we frequently call that the rank) and the sizes of each dimension. In fact, HDF5 allows any dimension to be unlimited in size, so you can extend an array in HDF5 in any dimension you'd like, not just the slowest-changing one. That one is the typical "append images to a movie" notion of things, but you can actually extend in all the other dimensions too.
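For example, an extendible dataset pairs a maximum-dimensions array with chunked storage; a sketch (names and sizes illustrative):

```c
#include "hdf5.h"

int main(void)
{
    /* Start with zero "frames"; any dimension may be H5S_UNLIMITED,
     * but here only the slowest-changing one grows. */
    hsize_t dims[2]    = {0, 1024};
    hsize_t maxdims[2] = {H5S_UNLIMITED, 1024};
    hid_t   space = H5Screate_simple(2, dims, maxdims);

    /* Unlimited dimensions require chunked storage. */
    hid_t   dcpl = H5Pcreate(H5P_DATASET_CREATE);
    hsize_t chunk[2] = {1, 1024};
    H5Pset_chunk(dcpl, 2, chunk);

    hid_t file = H5Fcreate("movie.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t dset = H5Dcreate2(file, "frames", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Append one frame: grow the dataset, then write into the new slab. */
    hsize_t newdims[2] = {1, 1024};
    H5Dset_extent(dset, newdims);

    hid_t   fspace  = H5Dget_space(dset);
    hsize_t start[2] = {0, 0}, count[2] = {1, 1024};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t   mspace  = H5Screate_simple(2, count, NULL);

    static float frame[1024];
    H5Dwrite(dset, H5T_NATIVE_FLOAT, mspace, fspace, H5P_DEFAULT, frame);

    H5Sclose(mspace); H5Sclose(fspace);
    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);
    return 0;
}
```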
Finally, to organize these concepts, you put the datasets in a group. We don't just want a pile of them lying around in the file; we want some structure and hierarchy that means something semantic to the users. So we provide groups (the folders here) and links (the arrows), so that users can build a semantically meaningful, usually science-meaningful, structure out of the objects in the file. And every file has a root group.
It's very much like a file system. Just as you can add hard links to a file, you can have links to an object from more than one place in an HDF5 file. You can also have something like soft links, and links that refer to objects in other HDF5 files. But unlike normal file systems, in HDF5 files you can create graphs and cycles, do whatever you want. I don't necessarily recommend that, because it can become confusing for users parsing the file, wondering what happened and why it's all tangled up.
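A sketch of the three link flavors in the C API; the file and path names are illustrative:

```c
#include "hdf5.h"

int main(void)
{
    hid_t file = H5Fcreate("links.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t grp  = H5Gcreate2(file, "/results", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Hard link: a second name for the same object, as in a filesystem. */
    H5Lcreate_hard(file, "/results", file, "/alias", H5P_DEFAULT, H5P_DEFAULT);

    /* Soft link: a name that resolves by path and may dangle. */
    H5Lcreate_soft("/results", file, "/latest", H5P_DEFAULT, H5P_DEFAULT);

    /* External link: refers to an object in another HDF5 file. */
    H5Lcreate_external("other.h5", "/data", file, "/remote",
                       H5P_DEFAULT, H5P_DEFAULT);

    H5Gclose(grp); H5Fclose(file);
    return 0;
}
```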
But it is possible to create custom graphs if you have a need for that in some way. So all of these things together lay out something that, hopefully, is semantically meaningful and kind of standardized for an application. To augment the basic objects, the groups and the datasets, we provide attributes. They carry user metadata to decorate, or add information to, those baseline objects. They're similar to key-value pairs, in that each attribute has a name that's unique for that object, so you can have multiple attributes on an object.
If you find yourself doing one of those kinds of things with an attribute, it's probably better to create a dataset or some other structure in the file, and then use one of the reference datatypes in HDF5 for an attribute that points at, refers to, that other object or group hierarchy you're trying to work with.
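A sketch of that pattern with the HDF5 1.12+ reference API; the object and attribute names are illustrative:

```c
#include "hdf5.h"

int main(void)
{
    hid_t file  = H5Fcreate("refs.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate(H5S_SCALAR);
    hid_t table = H5Dcreate2(file, "table", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Make an object reference to the dataset... */
    H5R_ref_t ref;
    H5Rcreate_object(file, "table", H5P_DEFAULT, &ref);

    /* ...and store it in a small attribute instead of duplicating data. */
    hid_t attr = H5Acreate2(file, "points_to", H5T_STD_REF, space,
                            H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, H5T_STD_REF, &ref);

    H5Rdestroy(&ref);
    H5Aclose(attr); H5Dclose(table); H5Sclose(space); H5Fclose(file);
    return 0;
}
```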
So, in a nutshell, really fast: this is the HDF5 data model. There's a lot more depth in here; I did not tell you about all the different varieties of datatypes, or some of the more obscure things you can do with links and whatnot. But going forward, you can at least apply these four basic objects, files, datasets, groups, and attributes, to problems that you hit, or when people talk to you about HDF5.
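For reference, all four objects show up in a few lines of C; the file, group, dataset, and attribute names here are illustrative:

```c
#include "hdf5.h"

int main(void)
{
    /* File: the container. */
    hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Group: hierarchy under the root group. */
    hid_t grp = H5Gcreate2(file, "/timestep_0", H5P_DEFAULT,
                           H5P_DEFAULT, H5P_DEFAULT);

    /* Dataset: a 2-D array of doubles. */
    hsize_t dims[2] = {100, 100};
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dset  = H5Dcreate2(grp, "temperature", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Attribute: small key-value metadata decorating the dataset. */
    hid_t ascl = H5Screate(H5S_SCALAR);
    hid_t attr = H5Acreate2(dset, "units_kelvin", H5T_NATIVE_INT, ascl,
                            H5P_DEFAULT, H5P_DEFAULT);
    int flag = 1;
    H5Awrite(attr, H5T_NATIVE_INT, &flag);

    H5Aclose(attr); H5Sclose(ascl);
    H5Dclose(dset); H5Sclose(space);
    H5Gclose(grp);  H5Fclose(file);
    return 0;
}
```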
We plan to productize a set of HDF5 features that are appropriate for that time frame and set of machines; to support, maintain, and release HDF5; and then also to do some planning for the future. We don't want to just run out at the end of our funding and then go, "well, sorry, guys." So hopefully we'll get to that.
Here are two slides of these: there are a lot of teams that work with HDF5 and rely on it to one degree or another. Some of them are completely reliant, and others are more like, "well, we have several different output formats and HDF5 is one of them; can you guys help us out?" So: lots of different teams, lots of different locations in the DOE.
Many different aspects, too: not just simulation, but machine learning as well.
We also work with a bunch of the ST (software technology) teams, to support them, build infrastructure, and collaborate kind of horizontally. It's not necessarily that they will always use us, but they're building tools, and we should leverage them, or they're going to leverage us in some way.
So if the apps are so focused on the performance aspects of things, well, why are they using us, then? We always tell people: you're going to lose a little bit, hopefully not a lot, of performance when you use HDF5 or some other I/O middleware. And they really don't want to play with your I/O middleware; they really want to do their science. They're not middleware developers, and I/O just doesn't produce results; it's not compute in that sense, it just preserves the results.
So this kind of thing looks unexciting. We like it, but it's not exciting to them. And, realistically speaking, application teams shouldn't need to know the details of all this I/O middleware. It's like saying, "oh, MPICH didn't perform well; go over there and optimize it." You don't tell that to the app teams; you try to do it for them.
So that's our goal. The app teams just want someone knowledgeable to fix it, and sometimes they're not quite certain exactly what would best help them. They say, "we trust you guys, you're smart people." We work hard to build those relationships and build up that trust, in order to make intelligent decisions, and when we say, "hey, it would be really good if you guys did this," they go, "oh, okay, sure, we'll try that." They don't have to come up with all the ideas.
We spend a lot of time trying to talk to app teams and learn about what it is they're trying to do, and why, and then say: "okay, here's how we could help you," or "I see you have a problem; maybe you guys should change your code a little, and we'll add some tweaks into HDF5, and together we move forward." And part of our responsibility, really, is to look not just at today but five to ten years out.
They want their data back. They're not going to run their binaries on new machines; they're going to recompile or update their software. But they want their data back. Some teams especially: some of the nuclear weapons labs have data from before the test ban, so they plan to keep data, in certain circumstances, for quite a long time.
So as part of ECP HDF5 we said: okay, fine, great. We will go out and build a certain set of features (we'll talk about those); we're going to spend time talking to the app teams and tuning our software to meet their needs; and, as a side effect, we decided it would be really smart to have a performance test suite for HDF5, some set of small I/O kernels and benchmarks that we can run on current and new systems, so that we can tell: are we doing okay here?
Did the performance fall off? What is this special case? Is there a reason why this got slower or faster? You know, do some decent software engineering on the performance regression side of things. And we also spend a lot of time thinking about what's going to happen in the future, talking to software and hardware teams, and thinking on our own about what's coming on new systems. So I'll hit the first four of these in more detail and then skip out to the future.
So this first one finally rolled out at the beginning of this year: it's called the Virtual Object Layer. It's a nice abstraction layer within HDF5 that redirects the I/O operations, the things that touch a file (or a container, as we say today), into what we call connectors: virtual object layer connectors, VOL connectors. It sits right underneath the API level, so immediately when the app calls dataset create, we jump into the VOL connector and ask it to do the dataset create operation.
It happens at kind of an object-oriented interface, so connectors implement these methods for the various kinds of objects in the data model and the operations on them: reading and writing data elements, and other things. They're very nice for apps, because they can be transparently invoked from shared libraries. You just have to set the environment variable; app code doesn't have to change and doesn't have to be recompiled. It's great, as long as the app is linked with the newest version of HDF5, the one that has support for the Virtual Object Layer.
They can just pass all their data directly through something completely new, retargeted onto a completely new storage system or a new mode of operation, without rebuilding, without recompiling at all. These are nice; they allow you to stack them and build up chains of connectors.
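In practice the two environment variables involved are HDF5_PLUGIN_PATH and HDF5_VOL_CONNECTOR; they're usually exported in the job script, but a self-contained C sketch can set them before the first HDF5 call. The connector string below follows the async VOL connector's documented form and should be treated as an assumption, and the plugin path is a placeholder:

```c
#include <stdlib.h>
#include "hdf5.h"

int main(void)
{
    /* Normally exported in the job script; setting them here, before any
     * HDF5 call, has the same effect. Path and connector string are
     * illustrative assumptions, not values from the talk. */
    setenv("HDF5_PLUGIN_PATH", "/path/to/vol-async/lib", 1);
    setenv("HDF5_VOL_CONNECTOR", "async under_vol=0;under_info={}", 1);

    /* Unmodified application code: this open now routes through the
     * dynamically loaded VOL connector. */
    hid_t file = H5Fcreate("app.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    H5Fclose(file);
    return 0;
}
```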
Pass-throughs are optional, and you can stack as many of them as you like ("zero or more", in regex terms). There's exactly one terminal VOL connector, and it stores the data either in the native, traditional file format, or maybe in an object store system, or in the cloud, or in whatever file format you invent; this is all very pluggable.
There's a nice, well-defined public interface for people to write VOL connectors, and probably 10 or 20 folks have written connectors that they're using today with this interface. And this is one of the foundational building blocks for the next three features: without the Virtual Object Layer we couldn't do them, at least not the way we've done it.
So this is a core infrastructure upgrade, a huge one, that allows us to add capabilities to HDF5 that had no planning when it was designed. You can post facto change a lot of HDF5's behavior without rebuilding HDF5 and without rebuilding your app: just retarget to a completely different storage system and keep going.
So with that foundation in mind, we added support for asynchronous I/O, and this is a pass-through connector. It uses background threads (we use the Argobots threading package from Argonne to help with the scheduling and to organize those threads), and again it's totally transparent to the app: you don't have to recompile, and you don't have to make any code changes if you don't want to. It just executes those I/O operations in the background on a thread. There are no servers, nothing; it all just runs inside your app, and as long as you've got a spare core, or you can spare a little bit of time on a thread, this works out great.
So it just says: "oh, great, I will go open that file for you; here's a placeholder so you can keep going." Then, create an object: "that's all great, I'll go do that; here's a placeholder so you can keep going." Go write some data: "okay, sure, great." We add all these things into the task queue, and then we monitor to see if the app is idle.
Idle in the sense that it's not making HDF5 calls; it's gone off to compute. Then we start dequeuing and executing the I/O operations in the background. So hopefully we've decoupled the I/O from the compute cycle, and we can hide it as much as possible.
Hopefully we can eliminate some, or maybe all, of that I/O time with asynchronous execution and really speed up the app's view of what's going on with I/O. At the very end of each compute cycle it starts some I/O, and that's probably a little bit of overhead; sometimes you can get it down very close to zero, but there's still a little bit of overhead for I/O. Then it comes back and starts its next compute cycle, which is ideally overlapped with all the I/O from the end.
You still see this one I/O block at the very end: eventually we have to close the file and flush the buffers and everything else before the app terminates; we can't see into the future. But you still get a significant time savings, and it grows: the more iterations through the compute-I/O cycle you make, the more opportunity we have to save I/O time for your application.
There are certain operations in HDF5 that are essentially read-modify-write, or that involve a sequence of operations that have to happen in order to perform your action: update some metadata over here, update some metadata over there, and then come back to you with a new object. So we're trying to decouple anything that could potentially touch the disk, reading or writing, from the app. It doesn't always win; sometimes synchronous is fine.
Sometimes you have enough memory to buffer your data, and the OS would have done it just as well.
Okay, so this async VOL connector has, effectively, two modes of operation. One is what we've been calling implicit, for when you don't want to modify your app at all; that's what I've been describing. You can just transparently (dynamically) link to the async VOL connector with the environment variable, and it has a fairly conservative async behavior. It understands that you're going to expect your buffer to be reusable when the dataset write returns, and it will block to make certain we get that done; but any of the metadata operations can happen asynchronously. Same with reads: we'll execute those effectively synchronously, so that the app can read the buffer when it comes back from the dataset read.
So here is what it looks like. On the left-hand side is the implicit mode: existing HDF5 calls, and the user doesn't do anything different with their code; they just point at the async VOL connector. On the right-hand side, if they really want to manage this in a more explicit way, they can create a new event set object (this es_id) and then pass it in to all the same operations. It's the same on both sides here, except that we're aggregating those asynchronous operations into this event set as the user proceeds along.
Maybe this is a checkpoint: they create a file, create a group for the checkpoint, dump a bunch of datasets and data in there, and then, at the end, either they compute for a while longer and allow all the data to be written out, and then wait on that event set at the end of their compute; or, if for whatever reason they're trying to guarantee that the data is on disk at a certain point, they can wait earlier on that event set.
C: So, you know, parallel, or massively parallel, actions against data are interesting. Does an event set come with a descriptor as to what type of metadata transactions might be required?
B: No, it's an in-memory object, and it's really kind of boring: it's just a bag full of tokens for the operations that you executed asynchronously. It just sits there managing all those tokens for you in a nice, programmable, easily manageable way.
The really big advantage, we feel, for application developers is that there's a single token, this event set ID, that they have to manage. Put as many things in there as you want to happen during this set of asynchronous operations; you only have to touch and keep track of one ID, and internally, within the VOL connector, we manage the dependencies. So we'll guarantee that the file gets created before you use the file to create the group, and likewise the group must get created before the dataset gets created and the data gets written to it. In some cases we can parallelize things: if there were 10 datasets in the group, we could fan those out, because they only depend on the group getting created; they don't depend on each other.
Some things are more sequential, but at least it's asynchronous and offloaded into the background one way or the other. We manage all these dependencies; we correctly handle collective parallel metadata I/O, all the goodness there; that's all fine. And this set of code will execute and produce results identical to the one on the left, the implicit sequence, which is identical to what you would get if you ran it serially, synchronously, without the async connector.
Okay, so moving on: the other aspect that I mentioned earlier is system- and topology-aware I/O. I'm certain I've missed some locations where data could be, as well as some connections between them, but this is already gnarly enough. We've got all these different places where there's a memory buffer, effectively: RAM or disk or tape or however you want to think of it. They're all connected together, and they're getting deeper and deeper over time.
Ten years ago we just had CPUs and a parallel file system, and it was pretty straightforward; users knew what they were doing. Today we've got all of this running around, and tomorrow there might be more or less of it. Not every system has a burst buffer or node-local storage or is connected to the outside world, but conceptually there are a lot of pieces moving around here.
So with that in mind, and building on all the technologies we've talked about so far, we have this caching VOL connector. It's primarily focused on node-local storage today, but with a pluggable design we can evolve it toward caching at any level in that hierarchy. We could say: this I/O gets performed to node-local storage and then returned to the app as it's occurring. And we're just in the process of implementing an update where we can stack the caching connector on top of an async connector, so that, in the background, it can be evicting things from, or prefetching things into, the cache it keeps on node-local storage. That's this part here in the bottom left: you can stack these VOL connectors.
You can build anything you'd like, and the app is completely unaware of all of it. You can sidestep modifying HDF5 and modifying the app, and build up stackable connections across the memory in your system, the different locations where data could reside. We think that's where we're going in the future; I'll talk about it a little more in the last few slides.
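A sketch of what a stacked configuration can look like, again expressed in the connector string. The key names and the async connector's registered ID (512) follow the cache and async VOL connector READMEs and are assumptions here, as are the paths:

```c
#include <stdlib.h>
#include "hdf5.h"

int main(void)
{
    /* Stacking is expressed in the connector string itself: the caching
     * connector on top, the async connector (ID 512) underneath, and the
     * native connector (0) at the bottom. All values below are
     * illustrative assumptions; check your connectors' documentation. */
    setenv("HDF5_PLUGIN_PATH",
           "/path/to/vol-cache/lib:/path/to/vol-async/lib", 1);
    setenv("HDF5_VOL_CONNECTOR",
           "cache_ext config=cache.cfg;under_vol=512;"
           "under_info={under_vol=0;under_info={}}", 1);

    hid_t file = H5Fcreate("stacked.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    H5Fclose(file);
    return 0;
}
```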
So another kind of twist on this is what we call subfiling. A single shared file is traditional for HDF5, but it's sometimes slow, with lock contention and other I/O bandwidth difficulties. So what happens instead: rather than storing a single shared file, behind the scenes (underneath the covers, from the user's point of view) we shard it up into a set of pieces, subfiles, and then create another metadata file that describes how it all fits together.
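The prototype described here later shipped as the subfiling virtual file driver in HDF5 1.14; a minimal sketch of enabling it under that released API, where passing NULL selects the default configuration:

```c
#include <mpi.h>
#include "hdf5.h"
#include "H5FDsubfiling.h"

int main(int argc, char **argv)
{
    /* The subfiling VFD needs MPI_THREAD_MULTIPLE. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_mpi_params(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    /* NULL takes the default subfiling configuration. */
    H5Pset_fapl_subfiling(fapl, NULL);

    /* To the application this is still one logical file; on disk it is a
     * set of subfiles plus a small metadata file describing them. */
    hid_t file = H5Fcreate("big.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```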
We get better use of the parallel file system and hopefully reduce the lock contention issues, to improve performance. In theory, anyway; this is very much prototype, hacked-together code. But on Cori we can see some moderately significant speedups with this very prototype code: 2x, 3x, and, not quite 10x there, 6x. So we have some pretty high hopes that this could work out.
So, finally: okay, fine, GPUs are coming. This part works fine, because we've got CUDA data transfers and other similar technologies with HIP or oneAPI or whatever; but this part doesn't work at all yet. If you've got GPU-private memory, you pretty much have to send the data back over to the CPU and then get it out through the CPU's memory into some file system. But NVIDIA and other vendors are working on how to change that.
We have a virtual file driver that speaks GPU Direct Storage, GDS, and it has worked out really, really well; it's a drop-in replacement for the POSIX VFD.
It's a nice single call to enable from the app, and it works perfectly: it passes all the HDF5 regression test suites, and it's ready for beta testing if you feel like trying it out. You have to have a GDS-capable machine and all that goodness, but it is there and available for people to work with, and the performance can be pretty good. The green bars over here are the GDS read and write rates; we're still working on why the read is not quite so good, but this is very early.
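A sketch of enabling the driver. H5Pset_driver_by_name is the generic HDF5 1.14+ mechanism for loading a VFD plugin; the driver name string "gds", like the CUDA buffer handling, is an assumption based on the vfd-gds prototype rather than something stated in the talk:

```c
#include <cuda_runtime.h>
#include "hdf5.h"

int main(void)
{
    /* Ask for the GDS virtual file driver by name; the plugin must be on
     * HDF5_PLUGIN_PATH. The name "gds" is an assumption here. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_driver_by_name(fapl, "gds", NULL);

    hid_t file = H5Fcreate("gpu.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    hsize_t dims[1] = {1 << 20};
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "field", H5T_NATIVE_FLOAT, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* With GDS the write buffer can live in GPU memory: no staging copy
     * through host RAM. */
    float *dbuf;
    cudaMalloc((void **)&dbuf, (1 << 20) * sizeof(float));
    H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, dbuf);

    cudaFree(dbuf);
    H5Dclose(dset); H5Sclose(space); H5Fclose(file); H5Pclose(fapl);
    return 0;
}
```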
Very preliminary: a single thread and one GPU. But we are showing that GPU Direct Storage can outperform staging through the CPU.
The unfortunate part is that it only works in serial HDF5 right now, because we're replacing the POSIX driver inside HDF5; and in HDF5, when we want to do parallel I/O, we rely on the MPI library to do that for us. We just make MPI-IO calls and, boom, magic happens. So right now the developers on both the OpenMPI and the MPICH teams are making progress on supporting GDS I/O in the MPI libraries, which will in turn enable HDF5 to do parallel I/O from GPU-native, private GPU memory in the future. As soon as those get stable and roll this capability out for MPI, HDF5 won't have to change; we'll just invoke the MPI-IO call, and it will correctly take the buffer directly from GPU memory out to disk.
Again, playing with these ideas: the system topology keeps changing and updating, and in the future we've got this gnarly diagram. We really don't think that application teams and application developers have enough time, or the knowledge and desire, to go write custom data movement pieces for their apps; they're going to have to port them from one machine to another, and they're going to be constantly dealing with this. So this is a real opportunity for us.
So we've been discussing over the last few weeks (we have to find a good name for it) a data movement DSL. You'd like to come up with some high-level description: where is the data in this system? How are those locations connected? What are the various properties of those locations? This one is high bandwidth; this is how big the memory is, or the burst buffer, or whatever. And then, okay...
We want the apps to be able to create some very high-level description of what they'd like to have happen, and then we build up a nice stackable set of VOL connectors inside HDF5 and apply those policies and descriptions to that stack of VOL connectors, where you've got the components; I've kind of been showing you the components along the way. We think in the next year or two maybe we'll be able to implement some good pieces of this and really allow the apps to do this stuff. Okay.
C: Alongside those four, which are very good... could there be resource expectations, or... well, "expectations" could be either the addresses of specific resources that are required, or performance expectations.
B: Yeah, sure, okay. I mean, like I said, we're very new at this; we don't even have a concrete sketch of the language yet. So yeah, sure, resources would be good to apply in there too.
You could add in a few more connections, like connecting CPU-private memory to another CPU's private memory (MPI communication, there). We could sit here and draw arrows in here and talk, but the notion still stands: the app teams don't want to have to think about this craziness.
They just want some, hopefully good, default. The idea we're trying to play with is: when you install HDF5 on a system, there should be some default description that the system folks install for that machine, saying "this is how our system operates"; it's not going to change really rapidly. And there should be a way for an app team to override that, or to emphasize certain aspects over others, but basically the default behavior should be a whole lot better.
So this is where I finish up. We're funded by the DOE, with lots of great teams and lots of hard work by teams at Argonne, Berkeley, and The HDF Group, as well as interns over the summers from North Carolina and Northwestern. Any more thoughts, comments, questions? I can flip back over here to the fancy diagram.
F: I think, over the last five years at least, for every problem I've seen specified, a computer scientist has come along and said that a DSL is the answer to it. So I've just been a little bit wary of signing on to yet another DSL.
B: Yeah, you have to use this guy, whoops, this guy: the GDS virtual file driver, at the bottom level. Virtual file drivers are below, effectively within, the native VOL connector. But yeah, you have to choose that driver.
The nice thing about it is that (we're still working with NVIDIA on exactly how to make this work out in detail) it seems possible to auto-detect whether the buffer is actually a GPU buffer or a CPU buffer, and we might be able to just make GDS the default. That seems wild, so I'm not certain about that; but at least it's possible to tell which pool of memory your buffer is in.
I mean, we could let you do it; it's not our problem, man. No, we've got to convince Jack and Doug and, you know, all the people, to update the OS on Cori's GPU nodes to be GDS-compatible.
We've been talking about it, yeah. I don't think it's a really near-term idea for them yet; but if we had more users like you, then we could say: hey, we have real users; do you think you can prioritize this a little bit higher?
C: Hey Quincey, I have a different question, more on the attributes side of things. You know, HDF is, luckily, a multi-stakeholder undertaking, with lots and lots of people interested in it. But I'm wondering whether or not HDF has thought about bolting in provenance at a more fundamental level.
One way to do that (I'm just throwing out ideas) would be providing a foothold for ORCID or other identifiers, so that when an action or an event set happens, it could be connected to either who made it happen or why.
B: Yeah, I would volunteer that we've done some work in that area; we have a prototype. Flipping back here to the VOL connector diagram: we have a prototype provenance pass-through VOL connector. It's designed to record, I guess just to record, the operations that occur on an HDF5 container, a file, and then log that however you'd like.
C: Using, like, GUIDs, and POSIX, or...?
B: Right now it's actually more of a baseline implementation that does plain-text logging. We were figuring out how to enable Darshan logging; Suren, I don't know if we ever finished that?
D: Yeah, we hadn't finished that, because we kind of asked for future funding for it. One of the things we want to do is use more standard provenance libraries and formats, such as RDF and related standards. So that's a work in progress, or somewhat near-term future work, yeah.
C: I mean, there's some simplicity, and reason, in treating key-value pairs just in the abstract; but, recognizing how people actually wield datasets, you could potentially bolt on some easy features.
D: Yeah, there is a lot of provenance work already out there, so we can take advantage of it: RDF, and SPARQL-type querying made available on these. Oh yeah.
B: Well, thank you all again. If you want any more information, contacting Suren and me, or anyone else on the team, is more than welcome; we'd love to talk and hear about other use cases and interesting ideas.