South Big Data Hub Data Sharing & Infrastructure Group, 6 Apr 2018

From YouTube: CI WG demo: HydroShare

Description

Date: 04/06/18
Presenter: Dave Tarboton
Institution: Utah State University
West Big Data Innovation Hub

A

While we're going through all this, if there's anything you want to think about or contribute on, I'll just put that out there. Rather than take up too much time, I will go ahead and introduce Dave, who's here from Utah State and is part of the HydroShare program, a project that's collecting hydrological information across the whole country, I believe.

I think we've used some of it even during the Harvey situation down here. He's the principal investigator for HydroShare and a lead in civil engineering for water up at Utah State University. So rather than read all the stuff in the agenda notes, I'll go ahead and turn it over to you to tell us about it.

B

Great, thanks. Thanks for the chance to talk; I'm really happy to be sharing this with you. HydroShare is a web-based system for sharing hydrologic data and models, with specific functionality aimed at making collaboration easier for hydrologists. It's been developed over the last six years now, in collaboration with CUAHSI, to support initially the data management and publication needs of the hydrologic research community, and then growing into model sharing.

We can think of it as the hydrology community's contribution to the transparency and research reproducibility movement. It's funded by the National Science Foundation under its SI2 program, Software Infrastructure for Sustained Innovation. Their programs have changed names now, but it was funded under the old name.

B

It's operated by the Consortium of Universities for the Advancement of Hydrologic Science, Inc., which is what CUAHSI stands for, with the project that I'm leading effectively developing it. Other co-PIs on the development of it are Ray Idaszak at RENCI and, for the second round of funding, Shaowen Wang at the CyberGIS Center at the University of Illinois. So RENCI knows us quite well.

I wanted to just, in the next slide, talk a bit about what CUAHSI is: a non-profit consortium of about 130 US universities whose mission is to shape the future of water science by strengthening interdisciplinary collaboration. This data sharing activity of HydroShare is part of that broader community effort, so that's effectively the audience that this cyberinfrastructure is targeting. Our motivation is really collaboration, so this slide is intended to emphasize the collaborative nature of hydrologic research.

B

There's the need to combine information from multiple sources to do analyses that may be data and computationally intensive (but you still need collaboration even if they're not) and to address the Grand Challenges: avoiding flooding, avoiding water shortages, and things like that. On the right you see just a screenshot of the HydroShare website, which I'll be talking through a little bit. It's open access: anybody can post data in it and use it for collaboration. So it's really a user-contributed, bottom-up type of system, similar perhaps to the way people would share videos on YouTube and pictures on Snapchat or one of the other social media sites.

When you think of the data and models used by hydrologists, they are quite diverse. You have time series data. You have a lot of geographic information that may be gridded, in the form of rasters, or in vector or linear format, in what we refer to as geographic features. You may have multidimensional space-time data, and then you may have that aggregated together into model programs and model instances. We distinguish between the program, which is the code that actually implements the computations, and the model instance, which would be that program plus the data for application of it at a specific watershed, for example. So we've designed HydroShare to hold a wide variety of the data of interest, in the formats that hydrologists like to use.

B

So first, let me do this slide, which steps through what HydroShare is. First, it's a platform for sharing and collaborating, really exchanging information in terms of computer files: file storage with Dropbox-ish functionality, and hopefully as easy to share information as Dropbox. But then we want to add information: metadata descriptions, access to that metadata through an API, the capability for web apps and social functions, and formal publication of the data to get digital object identifiers, so that it can be identified in citations, with enhanced trust in the findings.

So that's some of the value-added functionality that we're building on, with the goal really to advance the science by enabling the community to easily and freely share the products resulting from the research: not just the scientific publication, but also the data and the models used to create them. It's based on a fairly carefully designed resource data model that uses the Open Archives Initiative Object Reuse and Exchange standards.

B

The pattern with that is that everything is referred to as a resource, and we use the word resource because that can describe quite generally an object one might be sharing, comprised of computer files. It may be data, it may be a model, it may be a combination of both, and that can get grouped into an aggregation.

So there's an aggregation that says certain objects are collected together, and at the bottom here we've got the core part of our schema, with the Dublin Core elements in the metadata. If you want to know more about that, there's a paper at that link. I should also point out that the slides are actually available in HydroShare itself; I should have pointed that out on the slide at the beginning. You can search for the keyword BD hub WG 2018 instead of the long unique identifiers.
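The resource pattern described in the talk (a resource as an aggregation of computer files plus descriptive metadata) can be sketched roughly as follows. This is a minimal illustration, not HydroShare's actual schema: the class names, fields, identifier, and file names are all hypothetical, and only the metadata keys (`title`, `creator`, `subject`, `description`) are taken from the Dublin Core element set.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ResourceFile:
    """One computer file aggregated into a resource (hypothetical type)."""
    path: str
    size_bytes: int


@dataclass
class Resource:
    """A resource: an aggregation of files plus descriptive metadata.

    Illustrative only. The metadata keys used below are Dublin Core
    element names; everything else is an assumption made for this sketch.
    """
    identifier: str
    resource_type: str  # e.g. "time series", "model program", "model instance"
    metadata: Dict[str, str] = field(default_factory=dict)
    files: List[ResourceFile] = field(default_factory=list)

    def add_file(self, path: str, size_bytes: int) -> None:
        """Aggregate one more file into this resource."""
        self.files.append(ResourceFile(path, size_bytes))

    def total_size(self) -> int:
        """Total size in bytes of all aggregated files."""
        return sum(f.size_bytes for f in self.files)


# A hypothetical model instance: the program plus the data for one watershed.
instance = Resource(
    identifier="example-0001",
    resource_type="model instance",
    metadata={
        "title": "Example watershed model run",
        "creator": "A. Hydrologist",
        "subject": "hydrology",
        "description": "Model program plus input data for one watershed.",
    },
)
instance.add_file("model/run_model.py", 2_048)
instance.add_file("data/precipitation.csv", 10_240)
```

Keeping the metadata separate from the file list mirrors the idea that the same description can travel with the aggregation regardless of what kinds of files it holds.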

B

The long identifiers would be a bit of a challenge to type in. So I've just got a couple of screenshots of some of the interface with HydroShare. I recognize these are lightning talks, so I can't do too much of it. The individual user, after they've logged in, can go to the My Resources page and create new resources; that's where they get to post information into the system. iRODS is used as the underlying storage layer.

So if you've got data in iRODS, you can actually create a resource directly from iRODS data, or you can upload it. What we're working on for the current re-funded project is the ability to pull in data from other third-party storage systems, perhaps Google Drive or Dropbox or systems like that.

B

Then, after you've created a resource, it's got a number of features. The landing page for the resource shows the authors, the owner, the type, when it was created, and citation information; for example, if it's been published with a DOI, that'll be given there. There's the abstract that the user who created the resource wrote. And then for each resource you can manage the access, so information can be private or public.

You can give people permission to just view or to edit. There is commenting and rating; that's been somewhat underused, but we built that in with the idea of trying to add social value to resources. You can do things like organize resources into collections, and you can also create different versions. But one of the interesting things is you can open them with compatible web apps. The concept here is that apps can be effectively any web-based system that can act on resources through the application programming interface, to support both visualization and analysis.

Anybody can establish an app and then register it in HydroShare. If it gets approved by CUAHSI, it'll appear on the apps landing page, but even if it doesn't get approved by CUAHSI, or is still in the process of being evaluated, it's still available for people to use.
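The access rules mentioned above (a resource is private or public, and individual people can be given permission to view or to edit) can be sketched as a small permission check. This is a minimal sketch under stated assumptions: the `Permission` levels, the `ResourceAccess` class, and the rule that a public resource is viewable by anyone are illustrative choices, not HydroShare's actual implementation.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict


class Permission(IntEnum):
    """Hypothetical permission levels, ordered from least to most access."""
    NONE = 0
    VIEW = 1
    EDIT = 2
    OWNER = 3


@dataclass
class ResourceAccess:
    """Access settings for one resource (illustrative, not the real model)."""
    owner: str
    public: bool = False
    grants: Dict[str, Permission] = field(default_factory=dict)

    def grant(self, user: str, permission: Permission) -> None:
        """Give one user an explicit permission on this resource."""
        self.grants[user] = permission

    def permission_for(self, user: str) -> Permission:
        """Effective permission: owner, explicit grant, or public view."""
        if user == self.owner:
            return Permission.OWNER
        granted = self.grants.get(user, Permission.NONE)
        if self.public:
            # Assumption: making a resource public lets anyone view it.
            return max(granted, Permission.VIEW)
        return granted

    def can_view(self, user: str) -> bool:
        return self.permission_for(user) >= Permission.VIEW

    def can_edit(self, user: str) -> bool:
        return self.permission_for(user) >= Permission.EDIT
```

For example, a private resource owned by one user is invisible to everyone else until the owner either grants view or edit permission to specific people or flips it to public.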

B

One of the apps that we're putting quite a lot of energy into is a deployment of JupyterHub with Jupyter Python notebooks, because that gives really general capability to have, let's say, entry-level programmers write and execute code in the system, where all of the libraries and dependencies are effectively resolved for them. They can extract data from HydroShare, do their work, and then save it back into HydroShare, including the notebook itself, and then let other people pick up on working on the notebook in a collaboration.

B

I know this is a fairly technical crowd, so I wanted to go a bit into how the system works. At a high level, there are really three parts to it. The main entry point is a Django website; the technology that we've used is a software stack built on Django, and that's used to support the loading of information, the editing of metadata, and the discovery of resources, and to organize and annotate your content and manage access.

If you want to think of this as parallel to perhaps the way a PC works, you can think of this as the file explorer. Then we've got iRODS as effectively the interface to the storage layer, and that's to allow data to be held in a federated data store. So while HydroShare provides some capacity, there's also capacity for other, perhaps heavy or big data users to establish their own federated iRODS server that will then appear in a seamless way to the website: a distributed file store that's analogous to the different hard drives on a computer.

And then, lastly, there are the web apps that provide actions on resources, and that's where the real power and extensibility comes from, because anybody can set up a server to operate on resources that are held in iRODS through the API.
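The three-part layout described above (a website as the entry point, a federated storage layer behind it, and apps that act on resources only through an API) can be sketched with a toy in-memory model. Everything here is hypothetical: `StorageZone`, `Catalog`, and `line_count_app` are stand-ins invented for illustration and do not correspond to Django, iRODS, or HydroShare interfaces.

```python
from typing import Dict, Tuple


class StorageZone:
    """Stand-in for one storage server (one 'hard drive' in the analogy)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects: Dict[str, bytes] = {}

    def put(self, path: str, data: bytes) -> None:
        self._objects[path] = data

    def get(self, path: str) -> bytes:
        return self._objects[path]


class Catalog:
    """Stand-in for the website: tracks which zone holds each resource."""

    def __init__(self) -> None:
        self._zones: Dict[str, StorageZone] = {}
        self._locations: Dict[str, Tuple[str, str]] = {}

    def add_zone(self, zone: StorageZone) -> None:
        """Federate another storage zone into the catalog."""
        self._zones[zone.name] = zone

    def register(self, resource_id: str, zone_name: str,
                 path: str, data: bytes) -> None:
        """Store the data in the chosen zone and record where it lives."""
        self._zones[zone_name].put(path, data)
        self._locations[resource_id] = (zone_name, path)

    def read(self, resource_id: str) -> bytes:
        """The 'API': callers never need to know which zone holds the data."""
        zone_name, path = self._locations[resource_id]
        return self._zones[zone_name].get(path)


def line_count_app(catalog: Catalog, resource_id: str) -> int:
    """A toy 'web app' that acts on a resource only through the catalog."""
    return len(catalog.read(resource_id).decode("utf-8").splitlines())


# Two federated zones: a default one and one run by a heavy data user.
catalog = Catalog()
catalog.add_zone(StorageZone("default"))
catalog.add_zone(StorageZone("big-data-group"))
catalog.register("res-1", "default", "/a/flow.csv", b"t,q\n1,2\n")
catalog.register("res-2", "big-data-group", "/b/grid.csv", b"x,y\n1,2\n3,4\n")
```

The app receives only a resource identifier, so adding a new storage zone or a new app does not require changing the other parts, which is the extensibility point made in the talk.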

B

There are a number of examples of those already. There's SWATShare, which is actually at Purdue, running with HUBzero. There are apps at CyberGIS at the University of Illinois; those happen to be offline right now, because they were on the ROGER system, which is being rebuilt. And you can have apps that take advantage of standard systems that come out of, say, Unidata in the atmospheric sciences for accessing multidimensional data.

We decided that for the multidimensional data format in HydroShare we were going to just use NetCDF files, because that's widely used for that.

B

So, a couple of slides to end here. This is just a bit about our statistics. We keep track of the users that sign up, and we also keep track of how frequently they log in and whether they're active or not, the primary audience being the US hydrologic research community, but it's open to international use. We're also trying to keep track of who the people are, in terms of being able to report to the National Science Foundation and other organizations.

Then we're also looking at the number of resources that have been added to the system and their types, to understand how people are using things. This is just a fairly small snapshot from the metrics tracking system that we have to be able to understand what's going on. So this just summarizes some of the points that I've made. It's a web-based system for data and model sharing.

B

It can access multiple types of hydrologic data using standards-compliant formats. There's a discovery mechanism, which I didn't show, but one of the pages there was Discover, and if you go to that, you enter keywords, and it uses Solr-based discovery.

You can share models, and to the degree that the models can be executed by apps, you can execute them. It facilitates ease of access to high performance computing, and that really comes from the data going into a system. I'm traveling to EGU in Vienna next week, and a group from the HydroShare team are going to be connecting HydroShare to the Cheyenne supercomputer, which is part of NCAR, to try to further the collaboration around a model that's running there.

So we're really seeking to frame the data as social objects that people could use for collaboration, and trying to be interoperable with other data and modeling systems, with ultimately the goal being to advance hydrologic understanding more rapidly. That's the picture of some of the team outside at one of our events, and a lot of credit goes to all the people who've done all the work.

A

Let's open things up to questions from the audience.

A

This is great. I just actually took a peek at it online. I think it said something like 9 percent of the users are professionals. Can you let us know what types of companies they work for? Are they engineering firms or construction firms, do you know?

B

Well, I don't know offhand. We would have to go and look at the list and try and categorize it. I'd think that it's probably people who are involved with, say, for example, the American Water Resources Association, where we've had presentations quite a bit; people who are involved with Esri and GIS software; and then there are the consultants involved in the business of solving water problems and water forecasting.

A

I'm speaking to construction firms interested in water management. Is it easy for them to sign up?

B

Yes, and right now there are no limitations. Anybody can sign up and get an account in a matter of a few minutes.

There's always a danger, when you make a system free and easy, about whether it's going to get overwhelmed by a use that wasn't necessarily the primary one that you intended it for. But we've got a strategy where we basically give each user a free quota of 20 gigabytes, and then if somebody needs more than that, basically they just need to talk to CUAHSI about it. If they are, say, NSF-sponsored PIs in the Hydrologic Sciences program, then CUAHSI would bend over backwards to try and accommodate them. If they're from a company that's really got a lot of money, then we'd try and figure out a way to get into a negotiation where there can be some sort of funding for the cost of whatever they want to do.

A

All right, so we have a question in the chat here; I'll go ahead and read it: are you using any ontologies, or otherwise preparing for utilization of AI/ML on these vast data resources?

B

We're not using any formal ontologies. Well, at a rudimentary level we are: our resource data model describes each metadata element formally from a namespace, and where we've got terms that we can pull from Dublin Core or things like that, we are using them.

We found that for quite a lot of concepts we've defined our own terms, so that may not necessarily be all that helpful if everybody effectively defines all the words they're going to be using themselves. But that's definitely an issue that we're trying to be sensitive to. We would like it all to be effectively machine-readable.

If you've got any ideas or suggestions to help us with that, that would be a good thing to follow up on.

A

All right, I don't see anything else pop up in the chat window on that, but go ahead. Oh yeah, actually my question was for you. I really enjoyed the presentation, by the way, and I have a couple of follow-up things that I'll be emailing David about. But I was wondering if you have any thoughts on HydroShare and Whole Tale, and maybe even our Workbench: where there's commonality and maybe a place to converge. I think that would be interesting, the ability to migrate.

Whole Tale, I think many people here know about it, but essentially we're using containers as a way to have reproducible computation and publishing of data, results, and the computations. So I think with this, I mean, we have the ability to subscribe to iRODS data; it would be again data discovery and bringing those things in. But it would be an interesting thing to do, as well as perhaps even possibly getting some of those apps into containers so that we can run those in an environment that will capture that.

So I do think there are a lot of parallels that we can work on here.

B

Right, I would like to learn a lot about that. I haven't heard about Whole Tale before. We're using Docker containers quite widely: the system itself is split up amongst a bunch of Docker containers, but also in our JupyterHub environment some of the models are being put in Docker. So that's one level at which we're using them. The other one is that Tanu Malik at DePaul University in Chicago has an EarthCube project.

That's developing what she calls Sciunits, which is a containerization procedure where you can go through a sequence of steps executing programs. It will record all of those, as well as record all of the dependencies, and allow you to put all of those in a container that she calls a Sciunit. You can then push that container into HydroShare, and somebody else could download it to a different platform and re-execute it and reproduce the results, I think.

A

B

Of lying around in 100 yeah, it.

A

Sounds like it's ripe for the picking on that, because all those things I think are possible, and it would just be interesting to see how we could federate across things like this. I had an additional question. We're probably running over, but the one that I had was: this is mostly users bringing in data. Do you also support data streaming in from sensors and other sources that are then leveraged by both the modeling and data integration portions?

B

We do, to a limited extent. We've got a couple of what we refer to as community high-value data sets, such as outputs from the National Water Model, that we are actually supporting on separate iRODS servers at RENCI, and we provide access to that to apps that can get launched from HydroShare. So that's one sort of connection to high-value data.

The other is, well, HydroShare was developed with user contributions in mind because CUAHSI, prior to the start of HydroShare, had developed the CUAHSI Hydrologic Information System, which is really designed for streaming data coming in from experimental watersheds, stored in a HydroServer, which is a system that has a relational database to hold this data and publish it using a standard, now an OGC standard, called WaterML.

So we have the ability to reference data that's streamed into the CUAHSI HIS, but a lot of that functionality to support the hydrologic community comes from that other set of functionality that CUAHSI supports.

A

All right, well, thank you very much for a good talk and lots of information. I should get in touch with you a little bit about Christine's suggestion, as well as, I don't know if you know about the National Data Service stuff that we have in the Workbench there, which might also be an interesting area to look into.

B

I did see some of it, but I didn't know.

A