Description
OpenShift at NASA Case Study with Jeff Walter at OpenShift Commons Gathering
Red Hat Summit Boston 2019
To give you a broad brush: we're in the midst of a big system evolution right now, so I'll describe some of our challenges and also how we're using some of these open source technologies, in this case specifically OpenShift, and how we're applying it to some of the projects we're working on. But before I do that, I wanted to introduce some of our technical staff here with me today. We have Aaron Kirby, who is one of our senior developers; we have Brian Kinney and Sean Coogan, who I know are in the audience, raise your hands guys, they're back there; and Thomas Simmons, who's one of our senior system administrators. We also have our Red Hat support person, who is under contract to us for a year. So if you have any questions about government procurement or management or strategy or boring things like that, you can talk to me. If you want to actually have technical discussions, you should probably talk to one of them.
Okay, hands up, I have to ask this question: how many people are aware that NASA actually does earth science research? Wow, okay, that's way more than I expected, because usually people are like, "Wow, NASA does earth science? I thought you just sent stuff to Mars and put people in space."
So that's really good, because shortly after NASA's inception in the late 1950s, they realized: hey, we're going to be doing planetary exploration, and what better planet to explore than the one we happen to live on? NASA is in fact one of the largest earth science research organizations in the world. The Earth Science Division's budget is probably eight to ten percent of NASA's total budget, about 1.9 billion dollars a year.
So we do quite a lot of things. The key science questions we address in the Earth Science Division are mostly around how the global Earth system is changing and what is causing those changes, how the Earth system will change in the future, and how Earth system science can provide societal benefit beyond just basic research. Those are the primary science focus areas, and it really covers the whole gamut of earth science: basically anything you can detect from space, using the vantage point of space to look at the entire Earth system as a whole, is what the Earth Science Division focuses on. At the Atmospheric Science Data Center we cover a couple of these areas: atmospheric composition, the energy cycle, and climate variability and change are primarily the space where we play, but the whole division covers the entire gamut.
So you can take multiple different measurements at simultaneous points in space and time, and some of these things are the size of a school bus; they're large. We have everything from things of that size all the way down to CubeSats, sort of 2U, 3U, 4U kinds of things, and a lot of those are in development. Now, NASA is moving away from the big flagship missions of the kind I described toward more small and medium-sized missions, for a couple of different reasons. One is that the big flagship missions are really expensive and take a long time to develop and launch. Smaller missions make us a little more nimble: we can go from mission concept to actually having something flying in a much shorter period of time. It also makes us a little more flexible with respect to changing political winds.
Having more smaller missions allows you to still achieve the science and be able to respond to those things, and it's also a risk reduction strategy, because sometimes these things do fail. We have some of the best engineers in the world building and launching these things, but occasionally there will be a launch failure, or maybe an instrument or spacecraft failure. That's a lot easier to absorb if it's a small or medium-sized mission than if it's a big giant flagship mission. It's still a big impact any time a mission fails; it hurts, people feel it, but it allows us to keep moving forward without big giant holes.
Now, the reason I'm telling you all that is because, on the data system side, more smaller missions create more data complexity for us than fewer large missions, and it's already a very complex space. There's a variety of different instruments and they're all different: the data looks different, it's organized differently, they're measuring different things. So it takes an already complex problem and just increases the complexity for us, and we have to figure out how to address that challenge.
Okay, so how do we manage all of this data? At the program level, the NASA earth science data program actually has 12 different data centers: Distributed Active Archive Centers, or DAACs, is the vernacular we use. The DAACs are co-located with centers of science discipline expertise, and they archive and distribute the data products that get generated. When I say data products, I mean everything from the raw data all the way through the particular geophysical parameters that happen to be getting measured. Some of the DAACs are located, like ours, at a NASA center.
Some of them are located at universities, where they may have particular expertise in a specific science discipline, and some of them are located at other agencies: one of our data centers is managed by the US Geological Survey, and another one is managed by the Department of Energy. So it kind of runs the whole space there. Now, this diagram shows where the DAACs are, but it also shows another term we have, called SIPS: Science Investigator-led Processing Systems.
What the SIPS do is take the raw data; that's where the scientists come in. The scientists develop the algorithms that take the raw data and process it into the data products that have all the geophysical parameters, the measurements. When I say geophysical parameters, I mean things like ozone concentration, sea surface temperature, vegetation types on the surface, all the kinds of things you might want to measure about the Earth system. Then they deliver the data products to the DAACs, and in some cases, like ours, the SIPS are co-located with the DAACs and are part of the DAAC's scope. So in addition to the DAAC function, the archive and distribution services, we also do a lot of the processing, and I'll talk a little bit more about that later.
So, data policy. I won't read this to you, but there are sort of three core tenets of the NASA earth science data policy. The first and most important is that all NASA data is free and open to the public all over the world. It's all made available at no cost: if you can get it, you can download it, you can hold it at your site, you can have it. This is something NASA has been very proud of for a long time; they really are committed to sharing this data with the world.
The second thing is that there's no period of exclusive access. After a mission launches, they do a mission checkout period for a certain amount of time to make sure the instruments are functioning properly and things like that, but the core science team is not allowed to hoard the data while they look at it, make all their publications, and only then share it with everybody else; they can't do that. The third thing is that NASA makes available, or will make available, all the source code that's used to generate these data products.
Now, this isn't like the way most of you think about how source code gets distributed. In most cases this particular brand of source code is not sitting on a NASA GitHub somewhere where you can just go download it; you sort of have to ask for it. If you ask, you can get it, but this is changing. Some of the scientists chafe against it a little bit, because they feel like it's their intellectual property, and a lot of them have put a significant portion of their careers into developing these algorithms. But just as the attitude around open-source software has changed, they're starting to change this a little bit too.
So at the Atmospheric Science Data Center, what do we do specifically? First, we provide data archive and distribution services for a variety of scientific data providers. As I mentioned earlier, we do operational science data production, where we take the algorithms from the data producers and actually manage the generation of those products in the production environment. We do science production infrastructure hosting, where for some of our customers we don't do the operational processing, but we do host their infrastructure.
We have a big data center, a computer room, and we'll host their infrastructure, but they're the ones responsible for their operations. We have research computing infrastructure hosting for some of our local Langley researchers: a certain portion of the infrastructure is dedicated to letting them get on and do their analyses. And then we have a web hosting environment, with a lot of informational websites but also a variety of web applications where you can do things like searching for particular data you might want, plus applications that apply services to the data: reformatting, subsetting, those kinds of things. So we do a variety of those sorts of things, but we have big challenges at the moment.
Okay, so our data center has been around for 25 years or so, and during that time we've evolved quite a bit. Obviously we're not using the same hardware or much of the same tech we were using back in those days, but from an architectural point of view, we support multiple different groups, and we have a lot of stovepipes and physical complexity in our environment. Basically, what it comes down to is: we've done a lot of stuff to our house, and now it's really time to clean the attic
A
You
know
and
do
a
lot
of
remodeling
here
we
have
kind
of
an
unsustainable
physical
complexity.
You
know
Verity
of
different
clusters
that
don't
necessarily
communicate
with
each
other
but
need
to
in
many
cases,
and
it
just
it's
difficult
to
manage,
and
all
these
environments
have
their
own
configurations,
their
own
versions
of
software.
All
these
things
we
have
a
really
challenging
IT
security
and
network
environment.
A
So
NASA
is
a
really
big
target
for
hackers,
as
you
as
you
might
imagine,
and
we
take
IT
security
very
seriously
as
we
should,
but
in
you
know,
in
a
world
where
you
know
NASA,
this
stuff
isn't
getting
easier.
The
the
the
agency
is
levying
more
and
more.
You
know
IT
security.
You
know
kind
of
lock
down
certain
things,
requirements
on
us.
What
they're
doing
that,
while
simultaneously
our
primary
function,
is
to
push
our
stuff
out
into
the
world,
so
this
creates.
A
You
know
a
little
bit
of
friction
as
you
as
you
might
imagine,
but
our
CIO
is
very
good
about
working
with
us
on
this
stuff,
but
it's
it's
hard,
I
mean
and
we've
been
it's
created.
A
lot
of
churn.
Excuse
me
churn
in
recent
months
Multi-tenancy: as I mentioned earlier, we perform a variety of different functions. We have our DAAC environment, the archive and distribution part; then we have the processing part; then the infrastructure support part. So we have a lot of different people with different, and sometimes conflicting, needs and requirements in the system. And we have an aging storage environment.
We hold roughly six petabytes of data at our data center; the program as a whole has roughly 25 to 30 petabytes, and we have about six of that. But the storage is starting to age. It's GPFS file system based, broken up into six or eight building blocks, and backed by a tape archive, an SL8500 kind of thing. There's a lot of managing between those environments and deciding what to back up, and then we send stuff off-site to Iron Mountain, because for disaster recovery we have a requirement to send certain things at least 50 miles away. It creates challenges, and it's a lot of churn.
So there's a lot of manual effort to manage that storage, and we're trying to explore different ways to deal with it. Plus, a lot of people's applications are tightly coupled to the way the storage is organized, and that limits our flexibility in moving things around and optimizing the storage. All of these things combined make it sometimes difficult for us to innovate quickly and add new functionality, from an architectural point of view.
Again, it's nobody's fault, but over time you add a little bit and you add a little bit, and you find that to a certain degree you've painted yourself into a corner, which is a little bit of what we've done. So from a strategic point of view, we're trying to evolve our environment to get out of this ditch a little bit and have a more modern sort of environment. Strategically, we want to reduce the coupling between applications and the physical environment, reduce our physical heterogeneity and virtualize away those various complexities, eliminate the stovepiping, and improve our reliability and robustness. Again, in a stovepiped environment, any particular application is susceptible to single points of failure: my head node failed, or my cluster switch failed, and that takes stuff down.
We want to improve our ability to rapidly respond to new requirements and deploy new functionality quickly. Again, because of IT security requirements and our architectural issues, sometimes this is difficult, and staff get frustrated and our users get frustrated, because we can't move as rapidly as we otherwise could. And we want to stay aligned with our program direction: at the program level, there's a big push to start putting as much stuff as we can into the Amazon cloud.
I won't go into the whole story about that, because that would take another hour or two, but it's creating a bit of churn and the direction is a little uncertain. Whatever happens, though, we want to make sure that from an architectural point of view, and from the perspective of our applications and our staff's skill set, we're ready to align with whichever direction they end up going. So that's part of what we're doing, as I mentioned.
Okay, so the second thing is our data ingest and archive. Our current state is that we actually operate two separate ingest and archive systems. I could tell you the back story about why that is, but about halfway through it you would feel your life force slowly draining out of you, because that's what it does to me every time I think about it. So there are two separate systems. One of them is called Angie (don't worry about what the acronyms stand for); it was created in-house twelve or thirteen years ago.
It's very monolithic. It has served us pretty well, but it has a lot of one-off workflows and one-off submission interfaces. Technologically it's obsolete: it's based on an old version of Java Enterprise, JBoss kind of stuff, and it's challenging to maintain and operate; there's a lot of heroic effort involved. We painted ourselves into some corners with it by being a little too accommodating with our data providers, instead of enforcing interface specifications and telling them: nope, this is what we need from you, this is how you submit.
Certain data sets have different latency requirements: they're required to get through the ingest process and be made available to users for distribution within a certain period of time. If there's a heavy load on the system and one of those latency-sensitive things comes in, there's no way to deal with that; it just has to sit and wait, which is a big problem.
The second system we have is one called ECS, which was developed by the program office at Goddard. It currently runs at three of the DAACs. It was started in the late 90s by millions of developers, and there are billions of pages of documentation; the thing was a monster for a while, and it still kind of is. It's not like it used to be in the old days, but it still takes a lot of money.
I used to work in that program office, so I know how much money it took to maintain this thing, and your eyes would kind of roll, like, really? So they'd like to retire it, and we want to help them retire it, because it also creates a lot of churn for us, since we have to operate and configure it. So we started prototyping, and this is an activity that we did with the Red Hat Innovation Lab
this past fall, actually, to build a prototype of a new ingest and archive system, one that's very modular and very service based. We can create provider-specific workflows, but the vast majority of the functions these systems perform are fairly generic. So it's something that is just a lot simpler, a lot more modular, and it performs these basic kinds of steps: you have a submission, you do some checksums on it, you do some validation on the metadata, you figure out where to write it to storage, and then you store it and you're done. There are a lot more complicated things going on in there than that, but at the end of the day, that's what it's doing, and some of these stages take longer than others.
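The prototype itself is not something I can show here, but just to make that staged flow concrete, here is a minimal sketch in Python of what one pass through those stages could look like. The file names, metadata fields, and helper functions are hypothetical illustrations, not the actual system's interfaces:

    import hashlib
    import json
    import shutil
    from pathlib import Path

    def checksum(path: Path) -> str:
        # Stream the file so multi-gigabyte granules never have to fit in memory.
        digest = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def validate_metadata(meta: dict) -> None:
        # Stand-in for metadata validation: require a few core fields.
        for field in ("collection", "granule_id", "start_time"):
            if field not in meta:
                raise ValueError("missing metadata field: " + field)

    def ingest(submission_dir: Path, archive_root: Path) -> dict:
        # One submission here is a data file plus a small JSON metadata record.
        data_file = submission_dir / "granule.dat"
        meta = json.loads((submission_dir / "metadata.json").read_text())
        validate_metadata(meta)                      # validate the metadata
        digest = checksum(data_file)                 # checksum the submission
        dest = archive_root / meta["collection"] / data_file.name
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(data_file, dest)                # write it to archive storage
        return {"granule_id": meta["granule_id"],
                "sha256": digest,
                "stored_at": str(dest)}

In the real system each of those stages runs as its own service, which is what makes the per-stage scaling described next possible.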
So one of the things we're trying to do is auto-scale the various service deployments, so that services get deployed where and when they're needed. Certain things, like checksumming these large files, are computationally intensive, so they may take a little longer, and as the queue backs up, we autoscale based on the queue size.
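As a sketch of what that queue-driven scaling can look like, here is one way to nudge a worker deployment up and down with the Kubernetes Python client; the deployment name, namespace, and sizing rule are placeholders, and in practice a Horizontal Pod Autoscaler driven by a queue-depth metric would be the more declarative route:

    from kubernetes import client, config

    def scale_for_backlog(queue_depth: int,
                          per_pod: int = 10,
                          name: str = "checksum-worker",
                          namespace: str = "ingest") -> None:
        # Roughly one worker pod per `per_pod` queued submissions, capped at 20.
        replicas = min(max(1, -(-queue_depth // per_pod)), 20)
        config.load_incluster_config()  # use load_kube_config() outside the cluster
        apps = client.AppsV1Api()
        apps.patch_namespaced_deployment_scale(
            name, namespace, {"spec": {"replicas": replicas}})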
This is still a prototype, and it's still very much a journey. The team is working on it now, so it's not operational, but this is where we're going with it in the next year or two. Okay, so the last bucket is science data production. Current state: we support multiple missions, and we do the science data processing for them. Each one has its own bare metal cluster, with its own versions of the OS and Python and C and libraries and all this other stuff that has to be maintained. And again, I talked earlier about single points of failure: you lose one thing on those clusters and it's hard to adjust. These clusters are also unable to share idle hardware resources.
When these missions come online, they get a hardware budget, and it gets refreshed every three to five years. They usually buy based on what they think their peak capacity is going to be, or as much as they can afford, and they only need that peak capacity periodically. They'll update their algorithms and go through a reprocessing: they'll go back and say, okay, I want to reprocess my entire mission, go back to the beginning of the mission and reprocess all the data. That creates a heavy load, that's the peak load, and they want to get it done in a reasonable amount of time. But those activities don't happen that often, so a lot of the time that hardware is just sitting there running at ten, fifteen, twenty percent utilization at the most.
A lot of the software that manages the processing was built, in some cases, ten or fifteen years ago. It's Perl 5 based and it's getting pretty old. Scenario configuration is manually intensive: it's a lot of work to adjust these things to add algorithms or to change the way certain things fit together, and it's also difficult to track history and provenance.
In the very early days of this kind of thing, we had systems that were built with a relational database as the backend. We found that was really tough, because every provider, every scenario, has its algorithms chained together in complex ways, and it became really difficult to manage that with a relational database: you had to normalize the schema, change the schema, and it was just kind of a nightmare.
The reaction to that was to swing completely in the other direction: we're not going to use any kind of back-end database if we don't need it to manage the processing. And while that's technically true, it makes it difficult to really see what happened. You have to use the end products as a kind of proxy: oh, I see this stuff on the system,
so therefore I'm going to guess that this is what happened. That also wasn't really a good place to be. But now we have a new mission coming online. JPL, the Jet Propulsion Laboratory, is managing and building the mission, but we are going to be the data center for it, and we're also building the processing system, because we have a long history with the principal investigator, and he likes what we do. This is a really interesting program and a really interesting mission.
They aren't part of NASA, but downstream you have epidemiologists who will take this information and cross-reference it with geocoded birth and hospital records to research health outcomes related to air quality. So this is a little more complicated a scenario than what we've done before, and we're building this system on OpenShift; we're in the process of building it now.
In fact, we have a critical design review for this in two weeks. Here are some things that are different about it. One: all the science algorithm executables that get delivered to us are going to be container images. We've experimented with this, and it looks pretty good. We also think we can do batch job management with native OpenShift capabilities.
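Just to illustrate what "native batch job management" can mean here, this is a rough sketch of submitting one containerized algorithm run as a Kubernetes Job through the Python client; the image name, namespace, and resource numbers are placeholders rather than our actual mission configuration:

    from kubernetes import client, config

    def submit_algorithm_job(granule_id: str) -> None:
        config.load_kube_config()
        container = client.V1Container(
            name="science-algorithm",
            image="image-registry.example.com/sdp/l2-algorithm:1.0",  # placeholder image
            args=["--granule", granule_id],
            resources=client.V1ResourceRequirements(
                requests={"cpu": "4", "memory": "8Gi"}))
        job = client.V1Job(
            metadata=client.V1ObjectMeta(
                generate_name="l2-" + granule_id.lower() + "-"),
            spec=client.V1JobSpec(
                backoff_limit=2,                      # retry a failed run twice
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="Never",
                        containers=[container]))))
        client.BatchV1Api().create_namespaced_job(namespace="sdp", body=job)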
The way we do this today is with cluster and resource management software; specifically, we're using Univa Grid Engine, or Oracle Grid Engine, or whatever it is now (it used to be Sun Grid Engine, but it's changed hands). Basically, it has queues, and you just send stuff to a queue and it batch processes it, with some prioritization schemes in there. For certain things, we have to combine high performance computing and high throughput computing approaches.
What I described is more the high throughput approach, where we're not really concerned with the runtime of any single job; we just need to be able to run a certain amount of production through the system over a period of weeks or months. The HPC approach is the more traditional one: how do I use the computing infrastructure to get this one application to run as rapidly as possible? That's what the chemical transport model is doing. It's based on MPI, and we're just now starting to explore, given this environment and this paradigm that we have now, how we get an MPI application to run in OpenShift, and what the best way to containerize it is. I know some folks have done some work in that regard, but it's still a little unclear to us how that is going to work.
This approach has a couple of benefits. The first is that these scenarios lend themselves really well to being represented as a graph. Again, you have complex rules, and these algorithms chain together in complex ways, so the graph is useful for managing the actual processing. But then, when it's done, you also have a representation of your history, your provenance of what happened: what inputs were used, what versions of the algorithms were used, what versions of the data were used.
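As a toy illustration of that idea, here is a tiny Python sketch where each processing step records the inputs it consumed, so the same structure that drives the processing can later answer provenance questions. The product and algorithm names are invented for the example:

    # Each node records the product/algorithm version produced and the inputs it consumed.
    provenance = {
        "L1B_radiances_v3.2": {"inputs": ["L0_raw_2019-05-01"]},
        "L2_aerosol_v1.4": {"inputs": ["L1B_radiances_v3.2", "ancillary_met_v2.0"]},
        "L3_monthly_v1.0": {"inputs": ["L2_aerosol_v1.4"]},
    }

    def lineage(product: str, graph: dict) -> list:
        # Walk the graph backwards to answer "what went into this product?"
        seen, stack, order = set(), [product], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            order.append(node)
            stack.extend(graph.get(node, {}).get("inputs", []))
        return order

    print(lineage("L3_monthly_v1.0", provenance))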
Resource sharing and accountability: one of the things we want to do with these environments is allow one provider to use another provider's resources. The way I want this to work is that if you come in as a data provider, a mission, and you say, I'm coming in to buy X amount of capacity, then whenever you want to use that capacity you're always guaranteed access to at least that amount. But if mission number two over here isn't using their capacity,
great, now I can use mission number two's capacity and get my job done quicker. Then, if mission number two comes in and says, okay, actually I need some of my capacity back, things can expand and contract that way, which is what we'd really like to do. And then we want all the science data production managed by a single system instance.
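One way to approximate that guaranteed-floor-plus-borrowing behavior on OpenShift, purely as a sketch, is to run each mission's guaranteed work at a normal priority against its own resource requests, and to run opportunistic work that borrows idle capacity under a lower PriorityClass so it can be preempted when the owner wants its share back. The class names, image, and numbers below are illustrative, not our actual configuration:

    from kubernetes import client

    def mission_pod(mission: str, opportunistic: bool) -> client.V1Pod:
        # Guaranteed work runs at normal priority against the mission's own share;
        # opportunistic work runs at low priority so it can be preempted when the
        # owning mission asks for its capacity back.
        priority = "borrowed-capacity" if opportunistic else "mission-guaranteed"
        return client.V1Pod(
            metadata=client.V1ObjectMeta(
                generate_name=mission + "-proc-",
                labels={"mission": mission}),
            spec=client.V1PodSpec(
                restart_policy="Never",
                priority_class_name=priority,  # assumes these PriorityClasses exist
                containers=[client.V1Container(
                    name="processor",
                    image="image-registry.example.com/sdp/processor:latest",  # placeholder
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": "2", "memory": "4Gi"},
                        limits={"cpu": "4", "memory": "8Gi"}))]))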
Again, I mentioned that we have multiple different systems running independently that do this kind of work, but a lot of the functions they perform are common. So in the system we're building here for MAIA, what we're trying to do is make it so that the mission-specific complexities are well encapsulated; you can focus your changes in those areas, and all the common functions can be handled by one instance. The operator really just gets one view and doesn't have to deal with six, eight, ten different things running simultaneously from a system perspective.
Okay, so just to summarize where our future is: what we're doing currently, and over the course of the next couple of years, is continuing to transition our applications and systems to OpenShift. This is very much a journey that we're on; we're not there yet, and we're still figuring stuff out, but we like what we see and we're pretty excited about it.
One of the lessons learned from the past, one of the things we want to avoid going forward, is being caught unable to pivot; we want to set ourselves up so that we can pivot and respond more easily, because things are always changing and we don't know what's going to happen, but this will hopefully make us more agile in that regard. And then, from the program point of view again, the program, like I said earlier...