From YouTube: CI WG demo: DataNet Federation Consortium
Date: 1/20/17
Presenter: Mike Conway
Institution: Renaissance Computing Institute (RENCI)
South Big Data Hub
A: So I'll give a quick demonstration, but, as I told these guys, I'm not really here to demonstrate CyVerse per se; the Discovery Environment is developed by CyVerse. What I really want to do is put it in the context of what we're doing in DFC: what we think it means, and what's happening here and going forward that may be of interest conceptually.
A: We see the data hub as an emergent network of nodes. What I mean by that is we're less focused on developing a site than on developing software that can be used to form constellations of sites. In my mind this can happen around the centralized nature of data hubs as they exist today, and the idea is that we want to embed policy management and openness all the way from scratch space to published reference collections.
A: The idea is that data can be my data before it's our data, and we want to make sure we can handle that data in one context through all the transformations that can happen before it's finally visible in something like a publish endpoint. So again, just like the internet: we want to see whether data grids and cyberinfrastructure actually behave like the Internet as we know it.
A: So the idea is that local nodes freely associate and are able to network and federate together while keeping local control. I'm emphasizing that after the visit to the Northeast data hub, where they're really interested in party-to-party data sharing mediated by agreements. We want to take that idea and apply it to the data hub concept. So maybe the big data hub is the center of focus of this chart.
A: But then there is a network that emerges around it, and it can connect to, or be independent of, that central hub while using all the same infrastructure and concepts, so that one can join the larger collective at points, or at will. It has to service the entire data lifecycle, and it has to support discovery by appropriate audiences, which implies things like the DataONE catalog and things like that, borrowed shamelessly from Reagan.
A: So, to facilitate that, we're working on what is, in effect, the Apache web server of data. Some basic things that are easy to grok: it's open source and ubiquitous, and it's packaged for the hardware and software ecosystem that's out there in research land, where people are doing real work. Beyond that, what are the facilities that are needed? Policy management, again, to support that lifecycle, as Kenton was indicating.
A: Okay, so my other foot, as it were, is in the consortium, and we want to rely on the consortium model for the underlying data grid. From scratch space all the way out to these published reference collections, it has to rest on an integration with things like storage technology as well as high-performance computation. These are some of the people we're working with on the consortium side; note Intel, who just joined, and we're going to be integrating this technology with Lustre.
A: ...that data has been handled according to the prescribed policies and so forth. All this is really saying is that within that network it's not all open; we have to design something that allows these kinds of restrictions, so that little clusters of this network can work on little islands of policy.
A: So, if you will, we looked at the DE from CyVerse as a starting point for the data workbench part of this, which is, you know, tools for data access and sharing and basic metadata management, as well as a model for accessing computation, including their model of bring-your-own-compute, which sounds a lot like what Kenton is talking about. That means researchers can dockerize tools, throw them in there, and have their environments execute what researchers provide for analyses on the data that sits inside the environment. That can also include, as we've been talking about, shipping the computation to the data within this environment, again controlled by policy, so that if it's medical data or sensitive data, we know that only certain kinds of computations happen, and how to process the results of analysis. So we looked at the DE as providing a service layer to that.
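The policy-controlled computation described here can be sketched roughly as follows. This is a hypothetical illustration, not a DFC or iRODS API; the collection paths, policy fields, and app names are all invented for the sketch.

```python
# Hypothetical sketch of policy-gated computation: each collection
# carries a policy listing which app images may run against it, so
# sensitive (e.g. medical) data only ever sees approved computations.

POLICIES = {
    "/dfcZone/home/medical": {"allowed_apps": {"deid-pipeline:1.2"}},
    "/dfcZone/home/open": {"allowed_apps": None},  # None = unrestricted
}

def may_run(app_image, collection):
    """Return True if policy permits running app_image on collection."""
    policy = POLICIES.get(collection)
    if policy is None:
        return False  # no policy registered for this collection: deny
    allowed = policy["allowed_apps"]
    return allowed is None or app_image in allowed
```

In a real grid this check would live server-side in the policy engine, not in client code; the point is only that the decision is driven by per-collection policy, not by the app.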
A: We can exploit and adapt that for these use cases. Again, Kenton's presentation had a really good overview of some of the use cases that the data hubs are going to have to support, and the attractiveness of the CyVerse stack is that it's a functioning embodiment of a lot of those use cases, so it's kind of a central organizing component for us there. We'll also include other interfaces and access methods. So, quickly, on status.
A: The last thing we did was with Bakinam and John Goodall at the University of Virginia: we took her hydrology workflows, dockerized them, and we can run them in this architecture. But we also created a gateway so that things like HydroShare can launch apps using the same infrastructure, so it becomes a generalized service in the same way.
A: Yeah, and also, just recently, this was presented as a BRAIN Initiative project by the UNC Neuroscience Center. These are three environments we've created; the first environment is linked to GPU resources, and researchers are doing imaging of mouse brains using GPUs. So again, we can do this all through the same set of environments and services.
A: So we're working with Nirav and the CyVerse folks, who are putting together an official community DE as a software package, and then we're working with the consortium to extend the pluggability of those services toward the core architecture, over on the iRODS side, so the extension points are metadata curation, indexing, and discovery.
A: So, for example, we just met with DataONE to see about formalizing a sort of pluggable DataONE member node software stack, and the formation of collections across distributed grids, so that you can have collections in multiple places that all appear as one collection but still have all the policies in force at all the various nodes; plus standards and commodity approaches for data sharing, and the idea of à la carte installation of components. This is more about: I'm a shop and I want to do data sharing, so I need this piece.
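The "collections in multiple places that appear as one collection" idea can be sketched as a toy merge of per-zone listings. The zone names and listing format are invented for illustration; a real grid would resolve this in the catalog, not in client code.

```python
# Hypothetical sketch: merge per-zone listings so that objects
# replicated across federated zones appear as one logical collection,
# while remembering which zones hold each copy. Local control and
# policy stay with the zones; this is only the unified view.

def unified_listing(zone_listings):
    """Map each logical object name to the sorted zones that hold it."""
    merged = {}
    for zone in sorted(zone_listings):
        for name in zone_listings[zone]:
            merged.setdefault(name, []).append(zone)
    return merged
```

So a user browsing the federation sees one collection, while the view records which zone each replica lives in.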
A: I need this piece to run as services on my hardware, or out in the cloud, but on my own computational infrastructure, the way we run the Apache web server to serve web pages. I don't have a ton of time, and again I wasn't really going to demonstrate the DE so much, but for those of you who have not seen it, the visual model is a workbench where you have data. It does all the sorts of things you've seen over and over for sharing data: metadata management, setting ACLs and things like that, upload and download.
A
It
also
has
a
view
of
apps
where
people
can
bring
their
own
pieces
of
computation.
For
example,
here's
a
Baca
noms,
a
hydrology
workflow
app,
it's
a
docker
eyes,
damage
that
goes
in
the
environment
and
then
you
can
stage
data
to
it
run
analysis
run
the
app
and
then
the
results
of
the
app
shall
appear
in
the
analyses,
and
you
can
run
stuff
both
on
high
throughput
environments
like
condor
as
well
as
use
they've
integrated
this
with
the
agave
science
api
c.
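The stage-data, run-app, collect-results cycle just described might look roughly like this. The image name, mount points, and parameters are hypothetical, not the actual DE or HydroShare integration.

```python
# Hypothetical sketch of launching a dockerized analysis app against
# staged data: input is mounted read-only, and results go to a
# writable directory that the workbench then surfaces as "analyses".

def docker_command(image, data_dir, results_dir, params):
    """Build (but do not run) the docker invocation for one analysis."""
    return [
        "docker", "run", "--rm",
        "-v", f"{data_dir}:/data:ro",     # staged input data
        "-v", f"{results_dir}:/results",  # analysis output
        image,
    ] + list(params)
```

Handing a command like this to `subprocess.run`, or to a scheduler such as Condor, is what turns a registered image into a runnable app.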
A: You can also run apps on high-performance computing, like at TACC. So that's the idea here; I know that was really quick.
A
I
think
you
guys
have
already
seen
cybers,
but
the
idea
is
we've
taken
that
out
of
cybers
it's
now
an
independent
software
stack
it's
being
integrated
with
the
data
grid
and
we're
working
towards
a
future
where
there
is
this
interface,
as
well
as
a
generous
rest
service
API
that
can
be
installed
with
pieces
as
needed
at
an
institution
at
a
business
and
that
will
integrate
with
their
high
performance
storage
that
they're
buying
from
DD
N
or
EMC
or
okay,
so
I'm
going
to
stop
there
I
know.
B: Maybe you have examples of the type of applications that Jerrod is running in his environment?
A: Oh, I wish he were here to characterize that. I know that they do a lot of genome-processing pipelines, there's a lot of work with DNA analysis, and I think they do a lot of spatial analysis. It's really open-ended, though. That's another way of saying I don't know.
B: Now, iRODS itself is a federated data grid, so you can go to a data object which is in any of the underlying zones; it gives you the zone of zones and things like that. Can you do the same thing for the apps? Can apps come from multiple zones, so you can look at them and use them across zones?
A: Currently, that use case is not supported. That is something we were talking with CyVerse about: adding federation to other aspects of what they do, because they already have use cases for that. Yeah.
A: And I think the focus we have right now would be the definition of what we're calling a computational resource as a sort of packaged, installable piece, so that you could have a storage grid but then identify certain nodes as being a sandbox for the computation. We had a conversation with Kenton before about this, and found, pleasingly, that they had already thought of it: could we take Brown Dog services and, instead of moving the data to them, ship them down to where that data is at rest?
B: What's an NDS share? But in terms of finding where that data is at rest: is there a way for a site to say, here's the sandbox, so that when you ask for an app to be run, I could pick that up and use the same sort of infrastructure to push it down to one of those compute resources, for example? That would be really, really interesting, if we could be a provider, if you will, of that kind of functionality, and what that would entail.
B: If you're storing... maybe you are already doing it, I'm just being a devil's advocate here: if you store an app as a data set in iRODS, you automatically inherit all those facilities which iRODS provides, with access control, replication, versioning, and all those utility things: authentication, authorization, auditing. Yeah.
A: We're interested in that for preservation, as a preservation task, because, you know, the DE will already do some of that. I may be out of time, but if I'm looking at an analysis that has run, the idea is that it's repeatable; we keep data about what happened and what app was run, and I can see the parameters. I can tune them and relaunch it to achieve repeatability, which, as you know, is a big thing everybody's looking for. So the idea of being able, at a point, to preserve what happened and keep the computational docker image with the data is at least a step towards having a complete preserved image of whatever the task was.
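The repeatability idea in this closing answer can be sketched as a small provenance record kept next to the results; the field names here are invented for illustration, not the DE's actual schema.

```python
# Hypothetical sketch of a preserved run: keep the app, the exact
# docker image digest, and the parameters alongside the results, so
# the analysis can be inspected, tuned, and relaunched later.

from dataclasses import dataclass

@dataclass(frozen=True)
class RunRecord:
    app: str           # which app was run
    image_digest: str  # exact computational docker image used
    params: tuple      # (name, value) pairs, visible to the user

    def relaunch_with(self, **overrides):
        """Same app and image, tuned parameters: a repeatable rerun."""
        tuned = dict(self.params)
        tuned.update(overrides)
        return RunRecord(self.app, self.image_digest,
                         tuple(sorted(tuned.items())))
```

Pinning the image by digest rather than by tag is what makes the preserved computation reproducible even if the tag later moves.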