A
You're doing ICT for research, and it's very different to doing it for the enterprise, and I hope to highlight that a little bit here. Really, what we need at the end of the day is both scale and performance at large. And so we end up in the space where we have to create a fabric for this. We are actually the workshop that researchers go off and build things from.
A
So these are the spectrums we sometimes have to deal with. If you look at it from a money point of view, ninety percent of the research income comes from only ten percent of the researchers, and that's a proxy for the rough size of their agendas. So we have to deal with these peak guys and what they need to do, and we have to deal with the long tail, which you can see.
A
You want to be an organization that doesn't take all the risk itself; you interact with other organizations to achieve your end goal, and so I'll talk about permeability a bit. This whole multidisciplinary research thing, which research has been doing for a long time, is really that same pattern. And if you're familiar with eResearch, we sometimes talk about these four paradigms; they're paradigms for discovery.
A
Chris and I had a nice beer reviewing this the other day, so I thought I'd fling it in. Okay, so I'm going to go through those three points in a bit of detail and then kind of come back from there. Okay, so: the peak versus the long tail. We've all seen these graphs, right, and we have to be careful that we don't leave dead bodies behind.
A
So
if
your
ICT
in
the
enterprise,
you
tend
to
worry
about
the
long
tail,
how
you
can
commoditize
something
make
something
cheapest
/,
but
for
a
number
of
stuff
you
have
and
all
your
bodies
that
end
up
lying
at
all
the
guys
doing.
Experiments
at
the
peak
end
if
you're
a
traditional
HPC
center,
I'm
going
to
try
and
look
at
into
you
at
the
time,
and
you
tend
to
be
the
opposite
way
around.
A
You only think of the peak, the superstar researchers, and you don't do anything for the long tail; so you're not supporting desktops or Windows at all, right, and your bodies are all at the long tail. You can see the bodies on that; it's a bit vague, but they're there, right. And obviously the two meet up at some point.
A
We need to be able to create some fancy supercomputer, or some sort of fancy device, for some peak guy on the fly, just as we need to be able to provide the rigorous ITIL-based services that researchers need at the end of the day. All researchers are actually on the long tail, because even peak researchers use well-established tools that they couple together in some integrated way to create their fantastic device. Right: permeability.
A
The example I've got here actually comes from oil, and how oil is made. Permeability is the idea that you've got some structure, which is one organization; in this case it's the rock or the sand. I think this thing has a laser pointer; maybe it doesn't. The yellow bits, right?
A
And then another organization might be the blue, which is the fluid in this case, and when the two come together in a certain environment, they can go off and create something new. In this case, given the carbon, and some pressure, and time, and temperature, and everything else, we end up with oil and gas: together they've created something new.
A
So permeability is this property of being able to have that sort of behavior, where you can let something else come in, and together you make something better. Multidisciplinary research is a term that's often used in research, because the greatest discoveries always come at the edges, the boundaries, of disciplines. I said that quite badly; I guess what I'm really trying to say is, think about what happens if you're working in a field that's well established.
A
So, about two or three hundred years ago, a guy managed to work out a technique to machine lenses really well; he put them in a tube and he gave it to some people. First, people started looking up at the sky, and eventually people started looking down low, and biology started, because we were able to take pretty pictures, or see for the first time the structure of leaves and bugs and everything else.
A
So microscopes are really important, and telescopes are really important, because they let us see something that we had not been able to see before. If you're a peak researcher, you're building your own new microscope, right? I'll come back to the microscope in a tick. But what was really important was a piece of technology: machining brass, but also machining the lens.
A
The lens was the really important part that created the paradigm shift that allowed many observations to be made from these new microscopes. Okay, this might be a bit small, but I'm just going to highlight a few lines which are really, really important. This is a graph from 1875 to now of innovations. If you follow IEEE, you might have seen the publication around this early this year, or last year. That sort of got me thinking.
A
I want to present this in a slightly different way. What it talks about, if you understand how interest works, or how compound growth works, is comparing the growth rates of various innovations. So in here we've got things like the speed of traveling over oceans and all those sorts of things; I've chosen the ones to do with, I guess, electronic technology. This blue line: can anyone guess what that blue line is? If you can't read it, it's Moore's law, right!
A
So from roughly this time we've been sitting on Moore's law, and I'm going to come back afterwards and say that's pretty much dominated what this third paradigm is, and why it exists. This green line: can anyone who can't read it tell what that is? It's the number of devices on the internet, if you like, the Internet of Things, right. And this thing has been going for 50 years; Moore's law, 50 years running.
A
If you took that number and extrapolated this graph out another 50 years, you'd be at 150 billion as a number. Okay, it's massive growth; nothing in mankind's history has been like it. The Internet of Things is presently like it, and is expected to keep on going like that for probably another 50 years. The very bottom line down there is the same sort of line for imaging sensors, and sensors generally; now that one's subjective, whether it's going to follow the trend.
A
The point is that these two combine, and I guess with hard disk storage (I should probably add that in, along with SSD and flash storage, but they would be along the same spectrum) we can probably see it will have a similar growth. These are leading to what's probably the technology tooling that leads to the fourth paradigm, which is all these data-centric things. So I'm just going to go to the next slide.
A
Don't worry: being a mathematician and computer scientist, I refuse to not put equations in any of my talks, so I've got equations. The first paradigm, which is just your microscope, is all about observations: what I get is my answer, and I can count things, right. The second paradigm is where we started creating physical laws, like Newton's, and the laws behind normals, like statistics and everything else; mankind could compute more, and we ended up with equations.
A
The third paradigm is really about where computing enabled us to compute really big and complicated models. Now the f has gotten really, really large: it's complicated, it's a system of equations, it's whatever else, and we need a computer to compute it in a reasonable time; if we didn't, we'd never make the discovery. So the microscope here is this f, the sort of thing a computer can process to give us the y. The fourth paradigm is where the big is actually on the y: it's actually on the observations at the end, and it's so large.
A
The human can't get knowledge out of it by themselves and needs something else to help, and computers and storage are helping us do that. The fourth paradigm really comes this way, in these forms, and these are terms which you might be starting to feel the pressure from in your own research groups or your own businesses.
A
The one that it's really attributed to is data mining, and in data mining we actually flip the equation around: from the big data, the y, we're actually trying to work out what f is. The data tells us what the function, the model, is, and we use algorithms to do that. So we need fast computers to do that, and we need fast storage to do that, with the two working in unison.
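To make that concrete, here is a hedged sketch of the y = f(x) framing in my own notation (a reconstruction, not a slide from the talk): in the third paradigm the model f is the big, expensive part, while in the fourth it is the observations that are big, and the model itself is what gets inferred.

    % Third paradigm (simulation): f is known but huge; computing y is the work.
    \[ y = f(x), \qquad f \text{ large and costly to evaluate} \]
    % Fourth paradigm (data mining): the observations are huge; f is the unknown.
    \[ \hat{f} = \operatorname*{arg\,min}_{f \in \mathcal{F}} \sum_{i=1}^{N} L\bigl(y_i, f(x_i)\bigr), \qquad N \text{ very large} \]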
A
In my own work and disciplines, this is actually the most interesting story, and it actually tries to join both: large amounts of data with large models. The people who've been doing this style of stuff for a long time are your atmospheric guys, the guys predicting weather, right? They have complicated, physics-based models that need serious computing to run, and they have tons of observational data, which is now the other problem, and they've got to mash the two together.
A
Now the stress on your system is even worse, right: it's not just data mining, and it's not just big stupid computers; it's the two coming together, and this is really interesting, because both are big. And, lo and behold, both of these are only useful where we know what we're looking for; some human intuition needs to exist.
A
First, to know that we're trying to do this physics, or trying to do this style of data mining, or whatever else. There are certain things which the human mind is still better at, and will always be better at, where you just need the intuition; our brain is wired a certain way to do things. And visualization is still really, really important, and so, to that end, where I was hoping to take you guys was what we call the CAVE2 facility. This thing is beautiful.
A
Imagine having really, really large LCD TVs, like you've got at home, wrapped around you, and it's really, really bright. There's something like 80 million pixels' worth of real estate, and drawn up here are some histology images. You know, if you go to pathology, they take a slice and they've stained it.
A
You can see things; it's that good that one whole pathology image is actually only reduced in size by a factor of ten. If this thing were ten times higher resolution, we'd be showing the native resolution; that's how high-resolution histology images are. Most researchers or clinicians using them have to use Google Maps or Google Earth type applications, where they're scrolling and scrolling and zooming and zooming, and they do a lot of that to get through without these sorts of facilities.
A
Sorry, hello? Yeah, well, there is some lighting, mood lighting and stuff, that does exist on it. So we were hoping to go there, and there are some great entertaining things; if you ever come to Monash again, or visit us again, I will be happy to do it. Unfortunately, we weren't able to go there today; some Deans bumped us off for their own uses. So, the 21st century: this is why we talk about the 21st century microscope, and this is what the microscope looks like today.
A
Then we have the filters, and how you do the zoom and everything else; that's actually the software we run on big computers to process it, right. So R@CMon is a cloud infrastructure, which is, I guess, essentially my baby; and MASSIVE, which is the other thing, is our GPU-based, imaging-based supercomputer, which is effectively now built on top of our cloud infrastructure. And then there's how people interact with it.
A
So there's the CAVE2 facility, and it's bringing those desktops and that rich environment all the way to the individual's own laptop, and people orchestrate to create integrated tools along this way, and data has to move in and out through all of this. So our data-providing infrastructure has to store all this stuff and make it fast into here, and faster into here, and shareable and permeable, so that others, from other researchers and institutions through to the general public, can get to it, or be constrained from it.
A
So, back to the extremes, or spectrums. We need self-service, because the peak guys are going to do it themselves; we don't do what they do, right? They've got the researchers and the applied guys to do it. Front ends, you know: the SOE is sort of dead, right, in certain ways; it's the front ends that have emerged.
A
We need the infrastructure to be accessible: there's no reason why only my Monash guys should be able to do this sort of thing; any researcher in Australia should be able to do it. And we need aggregation of cost: we need institutions to buy rights and then give those rights out to their own researchers, so Monash is only a tenant in all these things. And then the paradigms: well, we need scale.
A
Clearly we need low latency and bandwidth and all the things that Sudha was talking about, so that we can make these fantastic first-to-market, first-discovery types of things. So Ceph is that fabric for us, as is OpenStack, and so is Neutron, right. We all sort of need to be in this software-defined world, so that we're not asking our IT guys to go off through a job ticket to go and do something.
A
It's got to be on a dashboard, or through a Python script, where they orchestrate the hardware they need, for the purpose they need it, right. And so it's more like a fabric, it's more self-service, and it's those researchers, not us, who will actually create the verticals. That's really different to how most enterprises think. Now, MeRC is our research center itself.
A
You know, which is just the cloud technology, the high-performance computing, and all the bits and pieces; and it's also people, including all the IT guys involved. We're the workshop people come to because, you know, we have better tools and better capability for trying to help make it all run.
A
Okay, so what we do is we conceptualize some, you know, infrastructure products; we've come up with these terms at a national scale, or a state scale at least. They've got some properties, but essentially, you know, there's stuff directly connected to the cloud, Ceph storage sort of comes off here, and those sorts of things. But essentially we have one or two great big Ceph clusters that provide into this space, and Blair will go through some of that detail.
A
The interesting one, probably, is one that we've developed in-house called MyTardis, which is very purposely around data management for instrument data. It's agnostic to the instrument, and there's something a bit special in that it allows the researcher to control the accessibility and the life cycle of, you know, a piece of data that comes off an instrument. It also allows the facility to do the same.
A
I know that's nothing compared to the hundreds of petabytes some other people have, but it's still quite significant, I guess. The other point I want to raise is that we're pretty new at Ceph, actually. I was going to ask you how long we've been doing it: two years? We're very new. We took a big punt, right, to go down this path, and we made it work for us.
D
How does this thing work... hey, cool, all right. So, sorry if I missed anybody in the list here; I was trying to put this together late last night, going through emails. There's quite a long list of people there: obviously myself; Jerico, who I don't think is here today; a few who are in the audience; and Craig.
D
So at Monash, one of the interesting things, I guess challenges, that we've had with Ceph, and some of the other technology around here in the cloud space, is that we have a corporate IT group and also the research center, which have for a while been, I mean, organizationally in the same space, but operate very independently, or at least in a very siloed fashion. And so we've been trying to break down some of those walls, and that's why we've got a long list of people here.
D
Stole it off me the first time around, so yeah. Of course, it all began with the cloud. We're one of the nodes of the NeCTAR research cloud, which is a national project in Australia that was funded through a Super Science scheme from the Rudd government, back just post-GFC, and that set up this national cloud.
D
So in early 2013 we had the first node of R@CMon deployed through NeCTAR, so we had our own local cloud, and everything was awesome, except nobody had any persistent storage, unfortunately. NeCTAR funded all the compute and ephemeral storage, enough for all your servers to run, but nobody had any volumes. There was object storage, but none of our users knew what to do with it. So, fortunately, we had a bit of object storage hardware.
D
The existing national object storage at the University of Melbourne was not under capacity pressure, and we had no way to federate with Swift at that point. So we had this spare Swift storage hardware and decided to try and stand Ceph up on it, and NeCTAR were gracious enough to let us use it, because it was technically their hardware. And so that's when we started, on Cuttlefish. I was actually just trying to remember this earlier today: Bobtail, Cuttlefish and Dumpling were all released in 2013.
D
It was a fairly prolific year for Ceph, and that was really, I think, when Ceph started to be taken seriously. We actually looked at it when we were planning our cloud node, and it was just when Inktank had come out of DreamHost, and the website was this horrible dinky little thing, and it all just looked a bit too fragile, really. And then, like three months later, we looked at it again and said: oh, now I can take this seriously.
D
Yeah, so now we'll just go through a bit of show-and-tell of the stuff we've got on the ground. We've got three Ceph clusters now running at Monash. The first one was the one that we built out of the Swift hardware: eight Dell R720XD nodes. You'll notice there's no SSD in them; of course, we didn't bother buying that for Swift, it wasn't a requirement at the time.
D
Actually, they've done pretty well; I mean, that gave us the confidence to keep moving ahead with Ceph, even despite the trade-offs with the hardware there. It basically met the user expectation tests: you could have, say, a Windows server on the cloud doing a Windows Explorer file copy at over 300 megabytes a second. You know, so that was a good tick. That cluster currently has about 60 terabytes used; there's actually 135 terabytes of storage committed there.
D
If we look at our OpenStack Cinder, we're quite over-provisioned, but we have some new capacity coming, which I'll talk about in a moment. We run that at two replicas, and that's actually never been a problem, touch wood. Of course, it's kind of a smallish cluster, so the failure rates are, I guess, a little bit in our favor there.
D
RDSI was the sister, or the brother, program to NeCTAR, which was all about research data storage, and so we created this computational storage product, which was the Ceph storage to go with monash-02. We changed a few things; we did it on purpose this time, so we added SSDs for the journals.
D
Although we still have internal journals on some of it; we're waiting for the new approach to really go GA, because this cluster is in production. So this one is currently about 300 terabytes usable. We've got a few OSDs spare; we've actually got a bunch of nodes of the same configuration ready to add in later, once capacity starts ramping up on it.
D
So then we did the big one. This one is the RDS cluster that we're doing for public-facing object storage and virtual NAS, and so on. This guy, of course, got dedicated mons. It has a cache tier for the RADOS Gateway; the cache tier is just spinning disk, but faster disk. It has a mix of 56-gig and 10-gig networking. The storage servers themselves, the main OSD nodes: 33 of those, again R720XDs.
D
Each of those also has an MD1200 JBOD attached to it, so it's 144 terabytes of raw storage per node, and that's all hanging off only two 10GbE links, which Sudha won't be very happy about. But, you know, one of the really nice things about Ceph is it lets you play with that: we had a very clear budget for what we needed to meet to get our price per terabyte, and we were able to tweak around that and choose where we were making those trade-offs.
D
We also went back to dual-socket, of course, with these big nodes, and a bit more RAM. I think if we were doing this all again, we would probably go denser. That was one thing that was a real concern at the time, and there was a lot of conflicting information about making those hardware choices: you know, you need this much RAM for so many OSDs.
D
If we had gone with, you know, sixty-drive chassis or something like that, that probably would have gone into maybe a three-and-a-bit or four rack footprint. The other thing that was different here is that we moved over to RHEL, and this was actually the first big RHEL 7 deployment at Monash, and it still really is. That, of course, cost us a few issues along the way as well, with just getting to learn things like NetworkManager and so forth.
D
You've got the cache tier nodes over here, and this is roughly the set of data pools that you've got; there's a few of them left out here, but I think I've captured the main ones. So, coming out of those external drives, we've got our RBD storage at three replicas. We've also got a set of test data pools on those disks with replicas, so they're using the same CRUSH ruleset, those pools.
D
You've got the 8+3 erasure coding coming out of the internal drives in the chassis, and on top of that is the cache tier for the RADOS Gateway buckets as well, coming off the internal hard drives on the cache nodes; and those cache nodes also have the CephFS metadata on them. And then, going a layer up, we've got a few hypervisors which offer the presentation services, where the big user interaction is: we've got virtualized RADOS Gateways.
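As a rough sketch of how a pool layout like that is expressed (the profile and pool names here are hypothetical, not the actual Monash ones; the commands are the standard Ceph CLI), an 8+3 erasure-coded pool comes from an erasure-code profile plus a pool created against it:

    import subprocess

    def ceph(*args: str) -> str:
        # Thin wrapper over the ceph CLI; assumes an admin keyring on this host.
        return subprocess.run(["ceph", *args], check=True,
                              capture_output=True, text=True).stdout

    # k=8 data chunks plus m=3 coding chunks: survives any three OSD failures.
    ceph("osd", "erasure-code-profile", "set", "rds-ec", "k=8", "m=3")
    # The PG counts are illustrative only; they need sizing against the OSD count.
    ceph("osd", "pool", "create", "rds.buckets.data", "2048", "2048",
         "erasure", "rds-ec")

The replicated RBD and cache-tier pools would then get their own CRUSH rules pointing at the external and internal drives respectively.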
D
I'll show you our RADOS Gateway HA architecture on the next slide as well, because that was actually one of the challenges that we had. Okay, we wanted a high-availability service, but, as far as I know, there isn't a recipe out there to go and build one of those things.
D
So how do we do it? DNS round-robin gives us the scale-out; we've got keepalived for failover of the HAProxies, so that they're not a single point of failure; and then, of course, RADOS Gateway just scales out as you like as well. That's gone really well, and Jerico did a really awesome job putting all that together from my diagram.
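For anyone unfamiliar with the pattern: the service name simply resolves to several front-end addresses, and clients spread across them, with keepalived shuffling a failed HAProxy's address onto a surviving box. A minimal sketch of what round-robin DNS looks like from the client side (the hostname is hypothetical):

    import socket

    # A round-robin record returns multiple A records for the one name;
    # clients picking different entries is what provides the scale-out.
    addrs = {info[4][0]
             for info in socket.getaddrinfo("objects.example.edu", 443,
                                            proto=socket.IPPROTO_TCP)}
    print(addrs)  # several HAProxy addresses, each protected by keepalived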
D
So, as I said before, we've got a little bit of new capacity ready to go in the monash-02 zone: we've got another ten nodes ready to be added when we need them, so that cluster will go to 27 nodes. And another nine in the big RDS cluster: at the moment we've only deployed 24 of those nodes to begin with, because three petabytes of storage is still going to take quite a long time to fill up. And then we're also refreshing.
D
The monash-01 cluster is the one that didn't have any SSDs in it or anything before. Dell actually have a really nice box in the current range, the R730XD, which gives you sixteen 3.5-inch drives in a 2RU chassis that can be OSDs, two disks in the back for the OS, and we've chucked NVMe in for the journals. So that'll be interesting.
D
So, some of the pain points and nits. Ceph itself has been really quite solid, I think: no major problems, no data loss. That's a big thing that's important for us. But it can be quite opaque when things do go wrong. You know, the cluster is either OK, everything's hunky-dory, or it's WARN, and WARN can be anywhere from "okay, I'm almost fine..."
D
"...don't worry about me" to "no, you really need to look at what's going on here, because I'm not healthy at all". This example here is a little ceph -s snapshot I took when we had a network partition in the big RDS cluster, and, you know, there's really not a lot of clues to go on here. It's a lot of information to take in all at once.
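When the summary is that opaque, the usual move is to pull the more detailed views and work downwards. A minimal triage sketch using standard read-only Ceph commands (the order of attack here is just one habit, not a prescribed procedure):

    import subprocess

    def show(*cmd: str) -> None:
        # Print the output of a read-only ceph query for eyeballing.
        out = subprocess.run(cmd, capture_output=True, text=True).stdout
        print("---", " ".join(cmd), "---")
        print(out)

    show("ceph", "health", "detail")   # expands HEALTH_WARN into per-item lines
    show("ceph", "osd", "tree")        # shows which OSDs and hosts are down/out
    show("ceph", "quorum_status")      # shows which mons are actually in quorum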
D
Obviously we had a mon down, so that was the first place we went to look, to try to figure out what was going wrong with the mons, and then it became clear: oh, that guy can't talk to the other guy, and these two don't know where he is; okay, there's something screwing with the network. That's where we ended up, I think.
D
So one of the things that we're doing, in that picture I showed you before, is we've got virtual NAS boxes using RBD as their storage back end, and the question is how you optimize that picture. We've done a bit of work in that space recently, looking at, okay, what if we want to run, say, a ZFS box on top of Ceph?
D
It was the case that PG repair was just like roulette: you know, if the primary OSD was the one that was bad, then you'd copy the bad data to the others. I think that may have been fixed already now, so that's good. But, I mean, our current policy is basically, even for a single uncorrectable read error on a drive, to just kill the whole drive, and I guess...
D
...that's also a nice thing to standardize on, because you can just do that for any disk problem; but I'd be interested to hear how other people handle that. I guess part of the reason we do that is also because we're using RAID controllers: we're using virtual RAID-0 drives for our OSDs. That works just fine, and you get the extra benefit of the write cache, but you do have to be aware that it introduces another layer of complexity in operations, between Ceph and the hardware.
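Operationally, "killing the whole drive" on the first read error amounts to something like the following (a sketch; the OSD id is hypothetical, and the real workflow would also involve the Dell RAID tooling mentioned next):

    import subprocess

    osd_id = "12"  # hypothetical id of the OSD whose disk threw the error

    # Mark the OSD out so its placement groups re-replicate elsewhere,
    # then stop the daemon; the physical drive gets swapped at leisure.
    subprocess.run(["ceph", "osd", "out", osd_id], check=True)
    subprocess.run(["systemctl", "stop", f"ceph-osd@{osd_id}"], check=True)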
D
So when it comes to doing debugging and so forth, you then have to mess around with the Dell tools to reset virtual drives and so forth. The other thing: we're actually going through a problem at the moment, debugging a performance issue on that big cluster, and it sort of seems like...
D
...actually, we don't yet have a really standard way to look at these problems. When I was trying to look for things, I came across a nice wiki page where somebody had started and proposed a framework to do this, but it hasn't gone anywhere yet; I think that would be a really useful resource.
D
Well, we do really like the idea of getting transparent compression, because a lot of the data sets we deal with, at least for the cases where it's an export to a researcher's desktop, are generally quite compressible. Where we control the application front end, like with MyTardis, then we can do some of that at the application level; but given the further complexity, I would be happy to throw all that stuff away and just go to CephFS only, even without having compression and so on.
D
We are looking at it. So, for the virtual NAS devices, we'd use TSM for the file backups there. We still need to implement something for the RADOS Gateway, but that, on the face of it, seems like a relatively easy problem, because there's plenty of open-source tools out there that can ship stuff out of an S3 endpoint and put it somewhere else, whether it's file or whatever. And we have large tape libraries and so forth.
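As an illustration of the kind of open-source tooling being referred to, a few lines of boto3 are enough to drain a bucket out of an RGW S3 endpoint onto a filesystem path (a sketch: the endpoint, credentials, bucket, and destination are all hypothetical):

    import os
    import boto3

    # RGW speaks the S3 protocol, so boto3 only needs endpoint_url redirected.
    s3 = boto3.client("s3",
                      endpoint_url="https://objects.example.edu",
                      aws_access_key_id="ACCESS",
                      aws_secret_access_key="SECRET")

    dest = "/mnt/backup/rds-bucket"  # e.g. an NFS export staged onward to tape
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket="rds-bucket"):
        for obj in page.get("Contents", []):
            target = os.path.join(dest, obj["Key"])
            os.makedirs(os.path.dirname(target), exist_ok=True)
            s3.download_file("rds-bucket", obj["Key"], target)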
D
So we can, you know, dump it onto, say, an NFS export that's then HSM'd, or something like that. No, not so, no: we're not doing asynchronous replication or anything like that, or RADOS Gateway regions, at this point. But Colin, who has now gone, has been asking a lot of questions of vendors about that sort of stuff, because we are definitely interested in having a backup copy in another data center.
D
Why? You know, Venkat talked about the converged architecture; I think you start to see the need for that once you've got to that scale. At the sort of scale where you're just supporting some block storage for a medium-sized cloud deployment, doing a complete forklift upgrade or refresh of your storage nodes is, you know, not that big a deal; but for something like what we're talking about, it does matter.
D
Ordering of mons, OSDs and so on. And we always picked a canary node and did that one first. At least for minor versions, that's fine: maybe leave that running for 24 hours, make sure there's no hiccups, and then do the rest. But we've never been bitten; actually, I guess, I watch the ceph-users list pretty closely, and I'd already at least learned of any gotchas before we even wanted to do an upgrade.
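A hedged sketch of that canary-first, mons-before-OSDs ordering (the host names and the upgrade mechanism are hypothetical stand-ins for whatever config management actually does the work):

    import subprocess
    import time

    MON_HOSTS = ["mon1", "mon2", "mon3"]   # hypothetical; mons are upgraded first
    OSD_HOSTS = ["osd1", "osd2", "osd3"]   # the first entry acts as the canary

    def upgrade(host: str) -> None:
        # Stand-in for the real mechanism (package update plus daemon restart).
        subprocess.run(["ssh", host, "sudo", "yum", "-y", "update", "ceph"],
                       check=True)

    def healthy() -> bool:
        out = subprocess.run(["ceph", "health"],
                             capture_output=True, text=True).stdout
        return out.startswith("HEALTH_OK")

    for host in MON_HOSTS + OSD_HOSTS:
        upgrade(host)
        while not healthy():       # let recovery settle before the next host
            time.sleep(30)
        if host in (MON_HOSTS[0], OSD_HOSTS[0]):
            time.sleep(24 * 3600)  # soak the canary for a day before the rest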
A
All those sorts of things. We had a lot of cases where researchers would come and say: look, I need to do this sort of de novo genomics mapping, which kills disk; and so we were getting a lot of requests for, essentially, fibre channel links to be created for, you know, x, y and z. So we kind of knew the sorts of problems that we needed to know we could solve with this infrastructure, and it became pretty clear that some of those alternatives wouldn't actually really help for short random reads and things like that.