From YouTube: Lustre ZFS & Supercomputers by Brian Behlendorf
All right, hello everybody. My name is Brian Behlendorf; like you heard, I'm a computer scientist at Lawrence Livermore National Laboratory, and I founded the ZFS on Linux project. I wanted to talk to you guys about the use case we're using ZFS for and how it's become instrumental to our machines there. But first I wanted to mention that I really think the BCC stuff is really cool and I'm gonna look into that from the previous talk, because that kind of analysis is exactly what we need to drill down on a problem.
So I'm pretty excited about that. This talk, however, is gonna go in a different direction, right, toward the biggest-scale machines out there, which is something we do at Livermore. So let me talk about that for a minute. Lawrence Livermore National Laboratory: we're located in the Bay Area here; we're actually only about 40 miles east of here.
So I work at Livermore, which is one of the NNSA laboratories. Los Alamos is probably another one you're familiar with, right; that's the home of the Manhattan Project in World War Two, the atom bomb came out of there, and Sandia is the other big laboratory. There's a lot of overlap between the research done at a lot of these facilities, but they're all research and development. So this is Livermore.
It's about a square-mile campus, like I say, just located east of here. It was founded in 1952, actually, by the University of California.
Interestingly enough, Livermore used to be a World War Two naval air base. So what was sited here originally was a radiation laboratory that was managed by the University of California, and that grew effectively into Lawrence Livermore National Laboratory. So the laboratory was originally set up in this capacity in 1952, kind of as a counterweight to Los Alamos. Like I say, the Manhattan Project started at Los Alamos, and they did a lot of the nuclear weapons design work in the country at the time.
In fact, they did all the nuclear weapons design work, and then in 1952 people thought it was a good idea that maybe we should have two laboratories working on this. So one of Livermore's primary missions is as a nuclear weapons design lab. They have other missions, but that's one of them, and like I say, they were set up as a counterweight to Los Alamos, to, you know, act as a sounding board to review designs and whatnot.
Basically, one of the main missions at the laboratory is a program called stockpile stewardship. Stockpile stewardship is basically our laboratory certifying that the nuclear weapons in the US are safe, reliable and effective, that they're gonna work, and this is a pretty hard problem, it turns out, because we used to do this in the US by actually testing the weapons, right. Every year we have to report to Congress that the weapons will work, and this used to be done by testing them. That can't be done anymore, right: in 1996 there was a nuclear test ban treaty signed by the US, as a little bit of background, which basically said no more above-ground nuclear testing. So we don't do that anymore, and the way we still certify the stockpile is with simulations, in fact HPC simulations, high performance computing simulations. That's one way we do that.
So that's part of our mission, but I don't want to let you think that's our only business; we actually are a research and development laboratory. So one of the other big projects operating at Livermore right now is the National Ignition Facility. I don't know if anybody here has heard of the National Ignition Facility, right? This is probably one of the coolest projects at Livermore right now; it's basically a giant laser, right. The National Ignition Facility is a fusion research experiment, and it's designed to simulate
you know, the temperatures and pressures that are required for nuclear fusion. So you're talking about hundreds of millions of degrees, billions of atmospheres of pressure, for a fraction of a second, to get fusion. This might look like a picture of the warp core from the fairly recent Star Trek movie, but in fact this is a picture of NIF, all right.
Yeah, I'd like it; it doesn't look that Star Trekky. Well, anyway, we do a lot of other research at the laboratory, but those are two of our primary missions at the moment. We actually have a long history of scientific computing going back to support those kinds of technical missions, all the way back to the 1960s, actually a little bit before that. So, like I said, we were founded in 1952, and then we brought in the first HPC machine to the laboratory in 1953, so pretty much immediately after the doors were open.
You know, we started deploying hardware there, and that was a UNIVAC 1, so going way, way, way back. They have a nice computer history museum associated with the laboratories where you can see some of these things, and the UNIVAC 1 is a beast of a machine, right. It's, I don't know, seven feet by 14 feet, and the whole thing has, you know, 5,000 vacuum tubes in it, something on that order, and a thousand words of memory. That was the UNIVAC 1. But anyway, historically we've deployed the latest, greatest supercomputers at the laboratory.
So supercomputers these days are ranked, you know, on a list called the Top 500 list. It is exactly what you think it is: it's a list of the 500 fastest supercomputers in the world at any given time. The list is put out twice a year, once at Supercomputing and once at International Supercomputing, so every six months, and the systems are ranked using a benchmark called Linpack, which is kind of interesting and kind of an accident of history, actually, right; it's not that this benchmark was designed for this purpose.
It's just the one that everybody happened to use, and it happens to be a pretty good measure of, like, the computational performance of a system. So you could imagine that one way to measure a computer system, right, would be to multiply out all the processors, right: I've got a million processors and they're all this fast, and that's the theoretical peak performance. Well, Linpack actually gives you a measurement of the actual delivered performance to an application, right. It does this by solving a series of linear equations spread over the memory in the system, right.
So it measures things like the effective memory bandwidth of the system, like how fast the interconnect is, how fast it can pass messages around, what the CPU speeds are. All those things come in to give you a measure of, like, the computational performance of the system. Again, it's an arbitrary benchmark, but it's the one people have settled on for twenty years now.
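To make the peak-versus-delivered distinction concrete, here is a rough back-of-the-envelope sketch in Python; every number in it is a made-up, illustrative value, not a figure from the talk, and real Top 500 entries report the measured Rmax alongside the theoretical Rpeak.

# Illustrative only: theoretical peak vs. Linpack-style delivered performance.
# All figures below are hypothetical round numbers, not measurements.
cores = 1_000_000          # hypothetical core count ("a million processors")
clock_hz = 1.6e9           # hypothetical clock rate
flops_per_cycle = 8        # hypothetical FLOPs per core per cycle

rpeak = cores * clock_hz * flops_per_cycle   # multiply it all out: theoretical peak
rmax = 0.75 * rpeak                          # assume Linpack delivers some fraction of peak

print(f"Rpeak = {rpeak / 1e15:.1f} PFLOP/s")
print(f"Rmax  = {rmax / 1e15:.1f} PFLOP/s ({rmax / rpeak:.0%} of peak)")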
At the moment, Livermore has had two systems in the last ten years on the list; we've had more than that going back, but like I say, Blue Gene and Sequoia here had the number one slot in the last ten years, which is pretty cool. I've got to work on both of these systems and they're pretty neat. DOE, like I say, has a complex of laboratories, and at the moment there are actually quite a few machines spread out over that complex that are in the top ten. Oak Ridge has a machine called Titan.
It's about 17 petaflops. There's the Sequoia machine, which I just mentioned, at Livermore, and LANL and Sandia have a machine that just deployed called Trinity. So, interestingly enough, this list used to be dominated by machines at the Department of Energy, HPC machines, but in recent years that's not been the case so much, all right. Some of the biggest machines we're seeing now are actually coming out of China, with just monster processor counts and performance numbers on them. So we've got a little competition now, which is cool.
We wanted to be running commodity hardware as much as possible, because supercomputers are all big and expensive and they used to be proprietary; there was only limited volume of them, right. I mean, we'd really love to be running commodity chips, whatever they're making millions of, right; we get much better prices on that. So there was kind of a strategic shift in the last 10 or 15 years to Linux clusters. For that reason, we build our systems
on top of Red Hat Enterprise Linux, because, once again, it's an enterprise distribution. We've looked at others; Ubuntu is definitely an option for this, but we happened to settle on Red Hat a while back. And on top of our distribution we add some HPC-specific functionality that you don't find in most Linux distributions, right. So that means things like low-latency interconnects; InfiniBand is the current interconnect of choice these days, mainly because it's got such low message-passing times and great bandwidth.
But one of the key ideas around Linux clusters, also, is doing this with open-source solutions, right. Given the options, we really want to use open source, like ZFS or Lustre or Linux, right. We have a substantial investment in these machines and, you know, we have the staff to maintain and work on them, and we want to be able to work on the guts. So we're a big fan of open-source solutions and of engaging with the communities around them as much as possible.
So scientific simulations like the ones we're talking about, and there are lots of different simulations we do on these systems, can easily generate petabytes of data, right. They just generate enormous data sets and they need to be stored, and there's quite a lot of variety in these data sets, actually. We have data sets that range from millions of small files to, you know, really huge multi-terabyte or even multi-petabyte files, right. We give the users quite a lot of latitude in how they write out their data sets.
You can imagine that if you're doing a big scientific simulation and you're reading a data set back off disk or something to continue calculating, and you read in some wrong values, right, and you continue simulating with that, then suddenly, voila, you've discovered new physics, right. Not so good, right; we don't like that. It's a whole different scale of problem than a couple of pixels being wrong in my Facebook image or whatever, right; it's much more important. So data integrity: right up our alley, a big thing we care about.
On the back end, our file systems also really require high I/O throughput for checkpoints. This might not be obvious initially, but for systems that are the scale of Sequoia or something, some of the biggest systems, right, you can be talking about having a petabyte of memory on the system, and that petabyte of memory needs to be written out at periodic intervals, because, while the systems are reliable for the most part, they're huge and parts do fail on them. So you might lose a node, right, and you don't want to lose the whole calculation, right.
You don't want to lose the whole simulation, so you periodically want to be writing out these snapshots, or checkpoints, so you can restart in the case of a failure. Now, you want that to be fast, because the scientists, today, don't care about writing out data. They want to do the computation; they want to get the answer, right. So they need to strike a balance between how much of their time they're spending calculating and how much of their time they're spending doing I/O.
So we want to make sure we deliver good bandwidth on the back end, so they can minimize those I/O times as much as possible, and you can imagine writing out a petabyte of data takes a little bit of time even on a fast file system. And then, we can't just optimize for I/O throughput, as nice as that would be, all right.
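To put a rough number on that, here is simple arithmetic using the round figures mentioned in the talk (a petabyte of memory, roughly a terabyte a second of file system throughput); it is an illustration, not a measurement.

# Checkpoint-time arithmetic with the talk's round numbers. Illustrative only.
memory_bytes = 1.0e15            # about a petabyte of system memory to write out
bandwidth_bytes_per_s = 1.0e12   # about a terabyte per second to the file system

seconds = memory_bytes / bandwidth_bytes_per_s
print(f"~{seconds:.0f} s (~{seconds / 60:.0f} minutes) of pure I/O per full checkpoint")
# -> roughly 1000 seconds, about 17 minutes, every time a checkpoint is written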
At the end of the day, they're gonna want to visualize this data, so the requirement there is that they actually have decent interactive performance.
Also on the file system, we use tools like VisIt, which are really good at providing visualizations, like the one you can see up there, of the data in the file system for a particular simulation, and that requires decent interactive performance, not something like a batch job. Most of the jobs run on the system are batch jobs, so they don't feel the I/O time for checkpoints and whatnot.
So, as I mentioned, Lustre is our tool of choice to handle these workloads. Lustre is a scalable, distributed, parallel file system; you can think of Lustre as a POSIX file system that's mounted on all the nodes in the cluster and provides coherency across the cluster, which is a surprisingly hard problem. But Lustre is hardware agnostic. Again, we like this because, you know, we like to deploy whatever the current new hardware out there is, or the best solution for it; we have lots of vendors that bid solutions for hardware.
For us, this stuff gets competitively bid, so having them be able to bid whatever they think the right solution is, is really, really flexible and powerful for us. Again, we like open-source software, right; Lustre is all open source. It was originally put under the GPLv2, and this has, you know, the usual advantages of open source: no one company controls Lustre, which again is good for us.
There aren't that many parallel file systems out there, but we really want to be able to work on the guts and, you know, develop this out in the open with other HPC sites, and this helps protect our investment, our substantial investment in storage and whatnot, right. We want to be sure that this isn't going to go away on us one day. Plus, there's a large, active development community around Lustre; it's really gotten a fair bit of traction. It is used probably predominantly on most HPC systems
these days, all right; seven of the ten top supercomputers, for a lot of years, have used Lustre. And, like I said, it's a POSIX-compliant file system, and while this might sound like a small thing these days, all right, there are lots of people moving away from this model, right; in the cloud in particular, providing POSIX semantics for a distributed system really just isn't done, because it's a hard problem. But our use case is a little bit special here. We care a lot about POSIX compliance because, as I mentioned, we're an old laboratory
that's been doing simulations for a long time, and we have a lot of old codes, and we'd really like to keep running those codes, right. There are scientists and code teams that have been working for decades, in some cases, on these codes, and they're really not so interested in rewriting the code or re-implementing it every few years because we found some great new model for how to do this, right. So POSIX compliance is a big deal for us. We'd like to shed it at some point, so I don't want to close the door on that.
Lustre has community organization, or at least Lustre community involvement, I should say. There are two organizations that do Lustre support and help development and organize things for the community. OpenSFS was set up in 2010, kind of as a place for vendors and the community to get together, much like OpenZFS is today. All right, everybody gets together, we talk about technical issues, we discuss the systems we're building, all those things.
EOFS is the European counterpart to this, right, but the same basic idea; it's kind of a smaller community, necessarily, but a very diverse community of people who run supercomputers like this. So, architecturally speaking, there's a lot of data on this slide and we don't really need to talk about most of it, but architecturally speaking, the thing to take away from Lustre is that, fundamentally, it's split into metadata and data, and this is kind of the important bit. So, way back when, when Lustre was created, all right.
We can get much better throughput for the system, which is really, really important to us, but at the same time we do care about it being a POSIX-consistent file system, right, so we have to do that kind of thing. So a balance was struck to keep them separate, and a complicated locking system was developed to get POSIX coherence out of this. The result is a really high-performing file system for data, and, you know, you get good metadata performance out of it too.
I should say that the metadata performance is being improved, and one of the recent features in Lustre is distributed metadata. In the picture here you've just got one MDS, the metadata server, but in newer versions of Lustre you can have more of these, right, and you can scale out the metadata, which is a big deal; like I say, we care about not just throughput but also interactive workloads.
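To make the metadata/data split a little more concrete, here is a toy sketch in Python; the field names, stripe layout and OST names are invented for illustration and are not Lustre's actual structures or API, but they show the division of labor: the metadata server holds names, attributes and the striping layout, while the bytes live as objects spread across the object storage targets.

# Conceptual sketch only; not Lustre code. The MDS answers "what is this file
# and where does it live"; the OSTs hold the actual data objects.
mds = {
    "/scratch/run42/checkpoint.h5": {          # hypothetical path
        "mode": 0o640,
        "stripe_size": 1 * 2**20,              # 1 MiB stripes (hypothetical)
        "osts": ["ost0003", "ost0011", "ost0027", "ost0045"],  # 4-way stripe
    }
}

def ost_for_offset(layout, offset):
    # Which object storage target holds the byte at this file offset?
    stripe = offset // layout["stripe_size"]
    return layout["osts"][stripe % len(layout["osts"])]

layout = mds["/scratch/run42/checkpoint.h5"]
print(ost_for_offset(layout, 5 * 2**20))   # the byte at offset 5 MiB lands on ost0011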
So this worked really well for us for a long time, ever since we deployed Lustre back in 2003, I think; we had a machine come in called MCR, which was one of the first deployments of Lustre, maybe the first deployment of Lustre at scale, and that served us really well up until about 2013, and then we started looking at Lustre and saying: maybe it's not gonna scale where we need it to scale. This is the Sequoia system, a high-level architectural view, and it's a big system, right.
This came up on the books as something we were going to field and deploy, and we were looking at how we were going to develop, or deploy, a 54-petabyte file system, a single file system, not 54 petabytes of small file systems, right; one 54-petabyte file system with almost a terabyte a second of throughput. And while Lustre scales really, really well, it probably wasn't gonna scale well enough for this, so we needed to look at why that would be so.
The reason we were worried in particular is that, even though Lustre handles lots and lots of servers pretty well, we're talking about numbers in the hundreds or maybe low thousands, right; you can deploy a thousand storage targets, right, but beyond that you start running into some limitations. The clients have to individually track each of these storage targets and manage some small amount of state for each one.
So the more storage targets you build up, the more work you're putting on the clients, and that's using up resources that might be used for compute instead of managing storage targets. So what we really wanted to do was build bigger storage targets with Lustre, and we couldn't do that, it turns out, for really, really good historical and technical reasons. As I think everybody here is probably well aware, writing a file system is hard, right.
In particular, writing a file system from scratch is hard, so Lustre focused on the areas where it was good initially, which was handling distributed workloads and parallelism and that kind of thing, and then built on existing technology for actually writing the blocks to disk and actually reading them out, which wasn't the key problem they were trying to solve initially, right. So we built on the legacy ext file system for Linux, and that worked great.
Initially, we extended it and added features that later went into ext4 on Linux; things like the multi-block allocator came out of Lustre, and we exposed some interfaces to get transactional object semantics out of ext, which was great, and this worked for a really, really long time. But the problem is, we also inherited the limits of ext, all right, and one of those limits is maximum file system size.
It used to be originally around two terabytes; that got pushed up to eight terabytes, but fundamentally it still isn't very, very big, particularly when you're talking about deploying a 50-odd-petabyte file system, right. You're talking about 7,000, 8,000 servers, something like that; too many, too many for the clients to manage.
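That server count falls out of simple arithmetic on the numbers just mentioned:

# Back-of-the-envelope check using the figures from the talk.
filesystem_pb = 54    # one ~54 PB file system
max_ost_tb = 8        # ext-era cap on a single storage target

print(f"~{filesystem_pb * 1024 / max_ost_tb:.0f} storage targets")
# -> ~6900 targets, in line with the 7,000-8,000 servers mentioned above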
And from our point of view, this was the first machine coming in at this size, but they were only gonna get bigger, right; it's not like this was the end of the road.
The next one's gonna be twice as big, all right. So, ZFS to the rescue; we finally get to the ZFS bit after the background up here. ZFS is just the perfect fit for Lustre; this is the bottom line, right. This is exactly the thing we needed to solve this problem on our storage back end, right. It has all the things we need, right.
It's scalable; like I just said, that's a big deal for us, because we always want to build them bigger. It's manageable, another big deal for us, because, you know, we're gonna have thousands of these things and we want them to be as easy to manage as possible. Performance: performance actually works really well for us with ZFS, because it's a copy-on-write file system, which isn't something we had before. Like I said, a lot of our workload is writing out checkpoints to disk.
So it's a lot of writing, and it's a lot of writing from a lot of random processes doing random writes, and the fact that ZFS is copy-on-write lets it serialize all that to the disks, which is great for performance, right. Whereas before, with something like ext, we'd be writing blocks all over the place, just based on where the allocator wanted to put them, with ZFS we can stream them. That's great.
It kind of ticked all the boxes for what we wanted for a back-end file system for Lustre. But, like all things, there were a couple of problems, all right. Problem number one: the Lustre server storage layering had to be redesigned a little bit. While the original layering did consider having multiple back ends instead of just ext for Lustre, the reality of the situation is that, by only having one for a long period of time, the layering got broken here and it got broken there, and things ended up in the wrong layers.
So this is work that was actually tackled by the core Lustre developers, and it took a lot of releases to refactor this, because it turned out that a lot of these assumptions went pretty deep in the stack. But eventually, you know, the stack got refactored for Lustre; the core Lustre developers did this work over a bunch of different companies, actually. Originally this work started while Lustre was a project primarily run out of Sun, later acquired by Oracle, and then it moved to another company called Whamcloud.
But this work continued over all of those various companies. Problem number two: we didn't have ZFS on Linux. That was kind of sad, because, you know, Lustre is a Linux file system, and, you know, we wanted to use ZFS because we saw how cool it was. But there's good news, right: we actually didn't need all of ZFS for Lustre. This is maybe something that's not so clear from the higher layers, but Lustre actually ties in at the DMU layer in ZFS; it's a first-class consumer of the DMU.
It doesn't layer on top of the POSIX file system; it doesn't layer on a volume. The interfaces provided by the DMU are pretty much exactly what Lustre needs, so we tied in there, and this is work that was done at Livermore, actually, to bring ZFS to Linux. And, as you all know, we didn't stop there, right; we saw how cool this was and we wanted the rest of it, so we implemented the POSIX layer too, and the volume manager. So, for the Lustre ZFS implementation:
technically speaking, the on-disk format for Lustre is compatible with the POSIX layer. We did this for a couple of reasons. We wanted to be able to debug the file system pretty easily, and it was convenient to be able to mount the same data set that Lustre is using to store its objects as a file system, and rummage around and look for stuff, right. You can use all the normal system utilities on it:
hex editors, dd, whatever you want, because you can easily inspect the file system. And we got to leverage all the features of ZFS from the, you know, DMU down, basically. So it turns out things like compression were actually a really big win for us early on. This surprised us initially, because a lot of the HPC workloads depend on I/O libraries, and the libraries have compression algorithms built into them, but they didn't work as well as you might have expected.
Actually, we got huge improvements in compression on the file systems, at no cost to performance, just by turning on compression, so that was a cool win.
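For anyone who wants to try the same thing, turning compression on for a ZFS dataset is a one-line property change. The sketch below is illustrative only: the dataset name is hypothetical, and lz4 is just a commonly used choice, not necessarily what these systems ran.

# Illustrative: enable compression on a (hypothetical) dataset and check the ratio.
# Requires the zfs CLI and appropriate privileges.
import subprocess

dataset = "tank/lustre-ost0"   # hypothetical dataset name

subprocess.run(["zfs", "set", "compression=lz4", dataset], check=True)
ratio = subprocess.run(
    ["zfs", "get", "-H", "-o", "value", "compressratio", dataset],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(f"{dataset} compressratio = {ratio}")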
Like I said, we're layered on top of the DMU, and this work exposed a lot of assumptions we had on the client side in Lustre. You know, things like the maximum object size happened to be hard-coded to a couple of terabytes; that was an ext limit. The block size we had assumed was the page size over the years, right, because it always had been with ext. It turns out it's not always the page size with ZFS, all right; it can be much, much bigger.
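A concrete way to see that mismatch (again an illustrative sketch; the dataset name is hypothetical) is to compare the kernel page size with a dataset's ZFS recordsize, which defaults to 128 KiB and can be set larger.

# Illustrative: page size vs. ZFS recordsize on a (hypothetical) dataset.
import os
import subprocess

page_size = os.sysconf("SC_PAGE_SIZE")     # typically 4096 bytes on Linux
recordsize = subprocess.run(
    ["zfs", "get", "-H", "-o", "value", "recordsize", "tank/lustre-ost0"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(f"page size = {page_size} bytes, ZFS recordsize = {recordsize}")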
So at this point, we've deployed pretty much all ZFS and Lustre at Livermore in support of our HPC workloads. We've got about 10 file systems deployed, with maybe 100 petabytes in production. There are new file systems being deployed at the moment, actually, but we're pretty much all in on the ZFS and Lustre thing, and it's been working out really, really well for us. But we're not the only ones, while we were the only ones to do this initially.
It turns out that a lot of the other HPC sites have decided that this is really a good idea too and they should be doing the same thing. It took a little while, because it takes a long time to design, build and deploy an HPC system, but Los Alamos now has two 14-petabyte file systems of the same design, the San Diego Supercomputer Center has 7-petabyte file systems, and there are vendors out there now selling these systems, right.
Anybody who wants to can buy them, pretty much, and there are lots of smaller deployments at universities and research labs, which are cool, but, you know, not the big file systems, and it's mainly the big file systems where you need ZFS, right. So, Aurora: we're not done. While Sequoia was, like I say, the first step along this route, Aurora is a big system designed for the 2018 timeframe, right. It is going to be three or four times bigger than Sequoia, depending on how you measure it, right, and it's got 150 petabytes in the file system.
It's going to be a gigantic machine, right, and it turns out that, because of the work that was done with ZFS, Aurora is going to be built on Lustre and ZFS; it's one of the reasons they can build the file system for this machine. So this is driving quite a lot of the current development, at least on the HPC side, for ZFS, right. There's work going on to improve Lustre's integration with ZFS in some areas here, and then there are, you know, features that are being added to the core ZFS code to improve HPC workloads.
You know, things that might not seem so important on smaller systems are really important for big systems. Things like inode quota accounting, right: when you've got a billion or 10 billion files in your file system, being able to get those numbers, and to quickly get that information, is really important; like, which user out there has a billion files?
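Once that accounting is in place, the "which user has a billion files" question becomes a simple query. A sketch of how that might look with the ZFS CLI is below; the dataset name is hypothetical, and per-user object counts only appear on releases that include the object accounting feature mentioned here (older releases report space used per user but not file counts).

# Illustrative: per-user accounting on a (hypothetical) dataset.
import subprocess

out = subprocess.run(
    ["zfs", "userspace", "tank/lustre-mdt0"],
    capture_output=True, text=True, check=True,
).stdout
print(out)   # one row per user; inspect or sort to find the heaviest users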
So I have a short video, if you're interested, of about a minute. Some of this stuff is kind of abstract, so let me show you a concrete example of what Sequoia looks like, or particularly Grove; this is the storage file system we built for Sequoia. We happen to, you know, give each one of these clusters a name, right, and the file system gets a name, and this is a time-lapse of the build-up of the whole file system. So this is the 50-odd-petabyte file system getting built.
We don't actually publish any of that stuff. We'd like to, but it's actually a lot of work to gather that data and number-crunch that data, so no, we don't publish anything. We do track a lot of it internally for drive failures, but it's not available.
Good question. We've used a bunch of things over the years, ranging from 10 GigE and 40 GigE to InfiniBand, all right. Most of our stuff at the moment is InfiniBand attached, because we get really good bandwidth out of it. So it's not uncommon for, like, one of those storage nodes to have the previous-generation QDR InfiniBand links coming out of it, and, you know, well, Grove had 768 of those nodes, right.
Yeah, I'm sorry, the question was what the layout of the drives is like on the systems. So the individual servers for the Sequoia system are attached to NetApp JBODs, actually; well, they aren't quite JBODs: those NetApps are exporting volumes that we're running ZFS on top of. At the moment the NetApp is taking care of the RAID for those, on the Sequoia systems, but the new systems we're bringing in actually take that next step, right, where we're going to a full JBOD system with ZFS, and it's going to handle the RAID-Z there.
So the question was: if it was easy to publish the drive data, would we? I think in general we'd like to be open, so if it was easier, we'd certainly consider it. I don't know that our rules permit us to do that in a lot of cases, certainly not for some systems, which are classified, but for other systems, which are not, yes, maybe there's some way to at least look at that.
Flash: so, for systems like this, we're actually looking at a solution for that. I'm sorry, the question was how do we use flash, and the answer is that at the moment we don't, right. So we don't use flash at the moment in any of the Sequoia systems, but we're looking at it for new HPC systems, in a technology called a burst buffer.
Yeah, so we've actually got a lot of work going on right now, with Intel actually, to get that done and to get fault management fully integrated on Linux and working. So there are pull requests outstanding and development work being done to make that work smoothly, because that is a big concern, and that's absolutely one of the reasons we didn't do full JBOD systems initially: it's a lot of work to get that right.
Yeah, yes, I should have mentioned the name for that. So that's up on GitHub now; it's called the Flux project, or Flux framework, I suppose; you can find it on GitHub. It is certainly a work in progress that's being developed. It's designed for scalability for our largest systems, and the hope is to replace SLURM with it, but it's still something that's being developed. Check out the project, you know, ask questions there; we'd love to get more attention on it.
So I should have mentioned, Aurora is a system that's not being deployed at Livermore; it's actually gonna be sited at Argonne, so just outside Chicago will be the Aurora system. DOE likes to spread the supercomputers around, pretty much; this $200 million procurement is a good one for that site. I don't know offhand what their design is for the storage, but I know there are people here who can probably answer that question.
Yeah, so the intention is to deploy these as five 8-plus-2, or six 8-plus-2, RAID-Z groups. We could go bigger, but 50 drives is about reasonable for us on a server; it's one enclosure's worth of drives. The nodes actually see quite a bit more than that: they also see all the drives from their failover partner, because we do do failover around these systems, and often they'll see paths for multipath devices; we'll have multipath to every one of these devices. So it's 60-odd drives, doubled for the failover partner's drives, times two paths.
So you might see 240 different devices on the system, and, you know, Linux holds up under that, but it does strain a little bit when you've got that many devices attached to it. Certainly Linux has gotten much better about that, I should add, in recent years; it used to be that it would just fall over and die, but not anymore.
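That 240 figure is just the layout from the answer above multiplied out:

# Device-count arithmetic from the numbers in the answer above.
drives = 60      # roughly one enclosure's worth of drives on a server
failover = 2     # each node also sees its failover partner's drives
paths = 2        # multipath: two paths to every device

print(drives * failover * paths)   # -> 240 device nodes visible on one server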
I don't know that there's a public roadmap for when those features will ship or anything like that, but all the development for those features is being done out in the open; well, at least a lot of those features are being done out in the open, in pull requests, and if you want to join the conversation or, you know, help with getting that stuff done, that's a possibility.
GitHub is probably the place to go: either the OpenZFS project on GitHub, things will be posted there, or, for features that are being developed first on Linux, ZFS on Linux. Just look at the open pull requests or the issues, search for what you're interested in, and you'll probably find a ticket that will point you in the right direction.