South Big Data Hub Data Sharing & Infrastructure Group, 20 Jan 2017


From YouTube: CI WG demo: NDS (National Data Service)

Description

Date: 01-20-17
Presenter: Kenton McHenry
Institution: National Center for Supercomputing Applications (NCSA)
Midwest Big Data Hub

A

So with that, I'm going to introduce Dr. Kenton McHenry. He is a senior research scientist at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign. He is a deputy director of NCSA's Scientific Software and Applications division, the principal investigator of the DIBBs Brown Dog project, and co-lead of NCSA's Innovative Software and Data Analysis group, which works with researchers to build novel tools and services in support of scientific data needs. And if there's a story behind the Brown Dog name, I'm not familiar with it.

B

I touched on it last time. It has to do with the fact that there are a lot of different ways to manipulate data. It tries to wrangle those, so it's basically a mutt of different kinds of software. So that's why I named it that.

A

Right, great. Well, with that, I think we're ready to hand it over. Thank you.

B

Thanks. So today I'm going to talk a little bit about the National Data Service (NDS) consortium and the kind of activities we're doing there, and show you what we've been developing as part of that.

B

So the NDS activity is largely about supporting the publication of data, scientific data specifically, and linking it to the papers that talk about it and that produced that data. Overall this leads to its reuse, so that other scientists, potentially even in different domains, can discover that data and do things with it more easily. The overall goal is advancing discovery within science: making new discoveries that are currently difficult much more doable in the future.

B

The story that I've been leading off with these days is from the Washington Post from a year ago, and it highlights what we want to see more of. It discusses an effort out of NIH to bring together a bunch of different data sources that cross different countries. The data was brought together over a number of years to make a discovery in the brain that led to some insights into how schizophrenia is triggered. The article touches on how difficult it was to do this even now.

B

It would have been pretty much impossible to do even a few years back. The program officer from NIH talks about how many scientists don't want to give out their data, how they want to be buried with their data, and how it was a major accomplishment to even do this. In terms of what we're doing with the NDS, it's about enabling these kinds of things so that they're much more frequent, less-worthy-of-a-story kinds of scenarios. This is kind of the way things will be.

B

Pulling together different data sources is just the way things are done. With regards to publishing scientific data, there are a couple of aspects to it. The most common one is that you've got to have somewhere to put the data, somewhere to store the bytes, and I think a lot of people tend to focus on that. But that's really only half of the story. Everything else that comes after that is crucial. In these presentations I just throw up a couple of ASCII bytes and ask people: what is that right there?

B

If anybody out there can answer me, what does that mean? Anybody? Mike? Basically, if you've watched The Hitchhiker's Guide to the Galaxy, it's the number 42, the answer to everything. What I try to highlight here is that it's difficult to say what those bytes, those bits, mean without information around them, without the knowledge that it was ASCII. In more complex scenarios you need the format of the file and indices over collections of files.
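
As a small illustration of the point about bytes needing context (a sketch added here, not something shown in the talk), the same two bytes can be read as the ASCII text "42" or as quite different integers, depending on the encoding and byte order you assume. The variable names are illustrative only.

```python
# Two raw bytes with no accompanying metadata.
raw = bytes([0x34, 0x32])

# Interpretation 1: ASCII text, which is the reading intended in the talk.
as_text = raw.decode("ascii")

# Interpretations 2 and 3: a 16-bit unsigned integer, in either byte order.
as_little_endian = int.from_bytes(raw, byteorder="little")
as_big_endian = int.from_bytes(raw, byteorder="big")

print(as_text)           # 42
print(as_little_endian)  # 12852
print(as_big_endian)     # 13362
```

Without a record of which interpretation applies, all three readings are equally valid, which is why metadata and format information matter as much as the storage itself.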

B

Those let you find what's where. Then there's metadata around the data, so that you know what's in what, and all kinds of other things around access control, data transfer, data transformation, analysis, and so forth. All of these are services above and beyond just the storage of those bytes, and they are crucial for data publication and, eventually, data reuse. So we focus on that. Once you get down into that world of data services, there's a huge landscape of things currently being researched and developed to deal with these needs.

B

For a scientist getting into the world of putting together a data management plan, there are a lot of different tools out there. A lot of them are redundant. A lot of them are contained within specific domains and not really known beyond that. That makes it difficult to use these things, and also to connect them together in cases where something more could be done by taking one aspect of one tool and combining it with an aspect of another to do something even bigger.

B

So the National Data Service consortium is really engaged, with regards to publishing data, in navigating that landscape of services and tools that are out there and in addressing the interoperability of those tools and services.

B

And so we try to work closely with the Research Data Alliance (RDA) on that aspect, in terms of defining what those standards are for interfaces and interoperability, and on our end trying to implement those in some of the components that we work closely with. In that, we've engaged with a number of universities that attend our consortium meetings, the supercomputing centers, the national labs, and cyberinfrastructure efforts in the DataNets and the Data Infrastructure Building Blocks (DIBBs) out of NSF.

B

We are, as we are doing here, engaging with the Big Data Hubs, EarthCube, the RDA, and so forth. We're trying to map the landscape of what's out there, pull together pieces, and address interfaces and protocols that we could potentially implement, or at least motivate the implementation of, towards making it easier to connect these components together. The last thing I would mention is that we also engage with publishers themselves.

B

Since we are talking about publishing data, we engage with the likes of Nature, Science, Elsevier, and so forth, and representatives from them typically attend our meetings. In terms of the cyberinfrastructure components that we engage with, the participants in our activities, we try to highlight certain aspects. One is that they're each trying to address some big data challenge.

B

That challenge is out there in terms of meeting some need in access control, data movement, and so forth, something that's needed by the scientific community to use data more effectively. The other is that, at some level, they're figuring out how, within their current funding sources or by going to other funding sources, to specifically address interoperability: not just one-off scenarios, but doing it in a way that can meet more than one need through some sort of interface.

B

That is, an interface that many people could potentially use. So some of our work with regards to community activities has centered around mapping the landscape, as I mentioned: breaking things down into the individual pieces of a data infrastructure, such as authentication, transfer, storage, curation, analysis, exploration, and so forth, and mapping these onto some of the components that are out there. Where are there gaps?

B

Where are there ten different things doing the same thing? Those indicate where an interface really could be used, so that one could pick one and swap it out over time if needed, try different things out, and see what meets their needs best. Overall we really work towards a concept that I think first came out of the RDA: doing for this data infrastructure world what happened for the internet long ago.

B

There were lots of components vying for each piece of the internet back then, and it was an open question until it was eventually accepted that we should use TCP/IP, and HTTP and HTML and all these other protocols came to be. Then things, for the user at least, became cohesive, in the sense that they could pick any browser they wanted and it would still work with any other technology in the background. A web administrator could pick any web server they wanted, whichever they felt was best or knew best.

B

It would still work with everything else in this picture as well. So we're doing that and, taking the name from the original DataNet solicitations out of NSF, moving towards a "data net", where it's basically the same kind of thing: web servers are replaced by archive technologies, and other services like DNS are replaced by transfer services, transformation services, and so forth. This is specifically aimed at the needs of the scientific community, but perhaps at some point it addresses the needs of the general public at large as things evolve. And so this leads into where we have been moving.

B

Development-wise, with some seed funding from NCSA and some efforts out of SDSC and Argonne National Laboratory, we've been working towards developing some tools to foster this movement towards interoperability. The first one we call NDS Labs. It basically has three components. The NDS Labs Workbench is the main software component, and what it largely is is a sort of app store for these data management tools and services.

B

It's a catalog of these things that are being researched and are under active development, perhaps not finalized at the moment. A user, say a new project that's looking for data management tools, could go here, find tools that meet their needs for curation, sharing data, and so forth, and actually deploy them. The tools are contained within this app store as Docker containers and managed through Kubernetes.

B

A person can find the tools they want, run them, run ten of them, and try them out. What that looks like is that you add a tool to your workspace and select launch. It pulls the dependencies of those tools with them. So in this example here, Clowder, out of DataNet SEAD, pulls MongoDB, RabbitMQ, and so forth. For Dataverse it pulls Postgres, Solr, and so forth. These are all Docker components as well, so you can launch them and try them out for yourself.
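
To make the "it pulls the dependencies with them" step concrete, here is a minimal sketch of resolving a launch order from a catalog. This is not the NDS Labs Workbench code or its catalog format: the `CATALOG` dictionary and the `launch_order` function are hypothetical, and only the tool names come from the talk.

```python
# Hypothetical catalog: each tool lists the services it depends on.
# The entries mirror the examples named in the talk; the structure is assumed.
CATALOG = {
    "clowder": ["mongodb", "rabbitmq"],
    "dataverse": ["postgres", "solr"],
    "mongodb": [],
    "rabbitmq": [],
    "postgres": [],
    "solr": [],
}


def launch_order(tool, catalog=CATALOG, seen=None):
    """Return the tool and its dependencies, dependencies first.

    Assumes the catalog has no circular dependencies.
    """
    if seen is None:
        seen = []
    for dep in catalog[tool]:
        if dep not in seen:
            launch_order(dep, catalog, seen)
    if tool not in seen:
        seen.append(tool)
    return seen


print(launch_order("clowder"))    # ['mongodb', 'rabbitmq', 'clowder']
print(launch_order("dataverse"))  # ['postgres', 'solr', 'dataverse']
```

In a real deployment each name would correspond to a container image started by the orchestrator; the sketch only shows the ordering logic.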

B

If you go to this YouTube video down here, which I won't show now for lack of time, you can see a demonstration of this actually being run for a couple of use-case scenarios that we showed at one of our previous workshops.

B

So that's one of the tools we've been developing. NDS Labs also provides allocations on cloud resources scattered across SDSC, NCSA, and others, to help more advanced users try out new technologies and work towards interoperability. I'll just mention that the NDS Labs Workbench is also meant for that interoperability development aspect. It supports various web-based IDEs, methods of accessing data on these tools, accessing terminals, and so forth.

B

So you can actually debug these things, or perhaps build a tool that crawls ten different archives with ten different protocols to try to make a sort of search engine for data, and try that out in that environment. Then there's collaborative support: providing development support for these tools through developers who work across a number of different projects, from DIBBs to DataNets to SI2s and so forth.

B

That's at NCSA and other organizations as well. So that's one of the tools. I'll finish up with the second tool, which is just ramping up. It's a portion of this activity that we call NDS Share, which is more along the lines of what a national data service might look like. It's a sort of portal to all the data that's out there, regardless of what archive technology is behind it.

B

Whatever storage the data is actually on, you can think of it as a kind of Google for scientific data. What we've been building there is a resource that would foster that. In terms of archive technology and data publication, there are a lot of tools out there. I won't go into too much detail because I'm certain I'm running out of time, but Globus Publish is one example. It's an extension of Globus transfer where you can publish datasets.

B

You upload a dataset, add some metadata, and at the end of the day get a DOI or some other handle. With that, you can then reference that dataset. DataNet SEAD does a similar thing. It's got a Dropbox-like interface. You put data there and it's the same kind of process: it'll find a repository for you and you get a DOI. Dataverse has been around for a while.

B

You can do that with it too. And so what we built for this kind of world is a useful tool that can be leveraged by all of these things, for something that's becoming more and more prominent these days: the need to run analyses next to datasets. Trying to be agnostic to technologies, but building something that each can leverage, we have begun working on this tool that we refer to as a data DNS. It's analogous to a traditional DNS, which maps URLs to IP addresses.

B

What it does is map the digital object identifiers (DOIs) of datasets to the specific locations, whether one or more, where that dataset can be found, and it maps further, all the way down to a path on that system where the data can actually be mounted. This allows any of those tools to take that information and, without actually having a data store themselves, reference that dataset and launch tools next to the data.
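
The lookup described here can be sketched as a simple mapping from a DOI to one or more (host, path) pairs. This is only an illustration under assumptions: the `Location` class, the `REGISTRY` contents, the `resolve` function, and the DOI and host names are all made up, and the actual data DNS service may store and expose this information quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    host: str  # system that holds a copy of the dataset
    path: str  # path on that system where the data can be mounted


# Hypothetical registry: DOI -> one or more locations. All values are made up.
REGISTRY = {
    "10.0000/example-dataset": [
        Location(host="storage.site-a.example", path="/data/example"),
        Location(host="storage.site-b.example", path="/mnt/archive/example"),
    ],
}


def resolve(doi, registry=REGISTRY):
    """Look up every known location for a dataset DOI."""
    try:
        return registry[doi]
    except KeyError:
        raise LookupError(f"no data DNS entry for DOI {doi!r}") from None


# A tool would pick a location and start its analysis environment there.
for loc in resolve("10.0000/example-dataset"):
    print(f"launch tool on {loc.host}, mounting {loc.path}")
```

The design point is that the publishing tool only needs the DOI; where the bytes live, and where to start a notebook or container next to them, comes from the lookup.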

B

So if the data is too large to move, say a terabyte or more, you leave it where it's at and launch a Jupyter notebook next to it, an RStudio notebook next to it, a Docker container next to it, and so forth. We did a demonstration of this at Supercomputing. This is a web portal interface for looking at a data DNS entry: its citation, where it's located, and you can launch a notebook from here. You press the button and it brings you to a Jupyter notebook.

B

You can run it and get some sort of visualization of some dataset at some remote location. But the idea here is to leverage it in all these data management technologies that currently exist. In the Globus Publish case, that means imposing on it these little buttons down here as a way to launch some sort of tool next to the data. DataNet SEAD, same kind of thing: it imposes that capability right on it.

B

Dataverse, same kind of thing. We're working on doing this through means that minimize the effort for each of these activities. We're working on a bookmarklet, which contains JavaScript inside of a bookmark, that you could run on Dataverse, Globus Publish, or DataNet SEAD, and it will automatically stick this button on them without the developers of each of those projects having to do anything.

B

So that's the angle we're taking on that, and it'll be one component of this Google for data that NDS Share will be at some point. This is a first step. I'm going to end there. Those are two of the technologies we're working on at the moment. There are YouTube links in these slides, which will be online, so you can actually see them running. If there are any questions, I can take them now.

A

I have a question. How can each researcher in the big data community contribute to this, to NDS?

B

Specifically to one of the things I showed, like Labs or the data DNS?

A

In general. I know NDS, but I don't know how each researcher, such as those doing big data research, and what we provide can be connected, how we can try to link the different outlets.

B

Yeah, so on our web page there's a little link for submitting pilot efforts. If you do that, there's a little form you fill out to propose what you think will be beneficial in connecting some components and developing some technology around that. We review those periodically, and with that provide resources and work with you towards enabling that kind of thing. So that would be one way to do it.

B

The other would be attending the workshops that we have every six months and engaging with us there, I would say.

A

Hello, this is Riggan. I had a quick question about the status of the RDA testbed you proposed.

B

I don't know that that's gone very far yet.

B

So the idea would be that the NDS Labs could serve that purpose, but I don't believe we've done much follow-up in terms of making that happen yet. The NDS Labs Workbench is still in its alpha stage. The beta release will be in the next month, and that would be when it's more stable. But I don't believe we've followed up too much on that at the moment. Very good.

A

This is Lea. I really liked the app store storefront that you showed. In our hub, we're having a series of meetings for strategic planning for the South Hub with our PIs. My counterpart, Renata Rawlings-Goss, and I and our teams came up with a similar conclusion, that that was needed. Is this something that you've been working on with the Midwest Hub to provide? Is this something that could be expanded so that all the hubs work with you on it? And also, how does it relate to what XSEDE provides in their tools and resources?

B

So, in terms of need, yeah, we have seen that need across the board, and specifically from EarthCube. We've been involved with the EarthCube architecture committee, and this kind of thing came up there after the fact too, so I showed it to them as well. As I mentioned to them, it's open source, we welcome contributions from anybody, and anyone is welcome to skin it as they wish for their specific endeavor.

B

What we're trying to do is catalog these tools as Docker containers, and however that's done, they can be shared across any of these instances at some point. So I think for the greater good, however it's branded, it's good to have, and it will eventually help us build up this tool base of all these different data management technologies.

A

Are these tools open source, or is there a mix of open source and closed systems? Is it just for academic researchers, or could private sector folks also come in and use these tools?

B

So currently it's all open source. Potentially proprietary stuff could go in there, but we're not addressing that at the moment; that's a whole different can of worms. There's plenty of stuff in the scientific community that's open, so we're tackling that at the moment.

A

Okay. Well, I'd like it if Mike and I, and maybe Stan and others, could have a follow.