South Big Data Hub Data Sharing & Infrastructure Group, 7 Jun 2019

From YouTube: CI WG demo: UMN Cyberinfrastructure Support for Water Resources

Description

Date: 6/7/2019
Presenter: Jim Wilgenbusch
Institution: Minnesota Supercomputing Institute
Midwest Big Data Hub

A

For those of you who don't know Jim, he is at the University of Minnesota, where he leads their computing center up there and is involved with many interesting projects. I've known Jim for a few years through this and through several Stampede reviews. So.

B

Let's just say that.

A

And I will not say don't come back and talk, because they are always interesting. But Jim is talking about support for water resources that they're doing at the University of Minnesota.

B

So great, thanks, Nile, and I certainly hope that I get down south sometime in the fall. Perhaps given the weather forecast that I just heard it's and me down there.

B

I've got a lot of stuff to talk about, some of which I'm an expert in, a lot of which I'm not, but I think that's the case with all of us when we start to get into these big cyberinfrastructure projects. So what I hope to do in the next 15 minutes is at least give you some idea of what we're doing, in case there's interest in going further.

B

Of course, we've got some time for questions at the end, and then, you know, I'd love to see emails from you if there's something that you'd like to go deeper on, and we'll certainly be able to do that. So that's my way of apologizing a bit for going through a lot of information in very little time. I want to start off with the Water Resources Center, just to introduce a few of the players here.

B

The Water Resources Center is at the University of Minnesota, and its mission, as it says here, is really to advance the science of clean water for Minnesotans through innovation, workforce development and knowledge exchange. Beyond the state of Minnesota, we're also part of NIWR, which is a national organization that, as you can see, involves all of the US as well as its territories. So from a water perspective, we have this overarching mission and purview, and we fit into a larger national effort.

B

Our focus is at the intersection between water, land and people; that's our niche. In particular, we're looking at things like urban stormwater, agricultural and rural watersheds. Oops, sorry, that moves back really easily, doesn't it? I'll try to move that forward. What just happened?

B

Is somebody else controlling this? Carl, I seem to have lost control. Oh, there we go, great. Decentralized wastewater, groundwater, drinking water, and surface water and aquatic ecosystems. I'm not moving it. Who is? That's the funny thing about.

B

I'm going to take you through two projects, just to give you a taste of the scope of what we're dealing with specifically around water. The first one is a socio-economic survey project. Specifically, we're looking at surveys of about 4,000 Minnesota farmers; they're across five of these designated areas, and I've listed the areas there. The PIs are listed here, and we are just starting with this project.

B

It's a great area for us to go into, because this socio-economic survey data has a whole set of challenges and opportunities that we've actually touched upon in some other spaces, so you'll see the relationship there. But we've just literally launched this project within the last couple of weeks. Another one that we're a little further into is the integrated watershed economic modeling. Basically, this has the goal of looking at ways that different scales of crop management can impact nitrogen reduction goals.

B

We're also looking at this from a temporal standpoint, to try to understand how these scales intersect with different wet and dry years.

B

So we have a good amount of temporal data on these, and we're really looking at ways that we can increase measures of water quality to also address things like sediment and phosphorus. So that project is off and running. Our third one has remote sensing as its data, and here we're looking at Minnesota water quality in particular. In this case we actually have a working website up and running, the lakes quality project, or the lakes quality browser, and you can go in right now.

B

The URL is pasted there at the bottom of the slide, and you can play around with looking at various aspects of the lakes in Minnesota. Moving forward, we're really interested in developing better means for integrating these data with some of our other spatially distributed data, as well as temporal trends around lake water quality.

B

So the types of data are as you might expect. In project one, the socio-economic survey data are really in the form of numeric responses, typically stored in Excel spreadsheets. With respect to the second project, the integrated watershed management project, these really get a little more diverse. We've got, you know, spatially explicit data.

B

We have weather data that we're bringing in, and then we have specific information on the management practices that are applied. In the last project, we're dealing with raster-based GeoTIFF images and polygon shapefiles that we have to deal with at different scales, and then, again, just standard data sheets that we're all pretty accustomed to dealing with.

B

One of the challenges that we found in getting into this project is that the array of tools was kind of diverse and, in particular, they were commercial for the most part. That presents some fairly unique challenges when it comes to restrictive licenses and how we make some of the analyses and workflows that we develop shareable to a larger community, and thereby repeatable as well, for people who are interested in repeating those analyses and potentially extending them.

B

So that's probably not unfamiliar to many of you here developing CI. So really, in a nutshell, the overarching requirements that we had were to be able to accommodate licensed software, potentially installed by the user; build some new data analysis pipelines, in other words, convert some of these existing, maybe commercial-software-based, workflows into Python or R; streamline the data cleaning part, so that there could be more of an automated component to getting data into a platform; make it easy to find and analyze

B

these relatively diverse data types; analyze data over broad geographic areas and also longitudinal time spans; and move data between being completely private and open. I know, I'm a big fan of open data, but the reality is that in many cases there are very valid privacy considerations around the data that we're collecting, especially when it comes to survey data.

B

This is sort of where I switch gears a little bit and flip into platform mode. We've been developing a platform that really focuses on agricultural informatics, but it has a lot of attributes that fit well, as we found when we began to work with the Water Resources Center, specifically in that it enables these public-private research collaborations quite nicely.

B

It accommodates a broad set of different data types. GEMS is the name of the platform, and that name represents the different types of data that GEMS is able to ingest: genomics, environmental, management and socio-economic data types. It's built explicitly to accommodate temporal data, data dispersed over time, as well as data dispersed over space, and that was also something that fit really well with some of the requirements that we had in working with WRC, the Water Resources Center.

B

Also, I think, importantly, we're getting some real-life examples and some real emphasis from various groups looking at the future of agriculture, and we were called out more recently in this report dealing with science breakthroughs to advance food and agricultural research by 2030.

B

How do we do it? Probably in ways very familiar to most people on this call. We leverage a lot of the growing ecosystem of tools that make things like this much easier. In particular, we are, from the ground up, a containerized system. I'll talk specifically about how that helps, but there are a number of tools, some of which I'll call out, some of which I just won't have time for, that have greatly helped us advance past some of the obstacles that impede regional data sharing efforts.

B

One of the first things I want to focus on is this containerization foundation and how, specifically, it's helped us around platform portability. We built the platform, GEMS, specifically so that it can be run on clusters, workstations of modest size and laptops, and everybody here who's involved in any form of virtualization knows that,

B

yes, you can run containers in these areas. It's made things particularly nice for us for a number of reasons, some of which I've tried to highlight here by emphasizing where, in particular, being able to operate on these different platforms is important. Obviously, on the compute-intensive side of things, having the ability to run on a large clustered infrastructure is very useful, and being able to also run on, you know,

B

high-end workstations gives you the flexibility to work with groups who are very sensitive to where their data are going. You could potentially run the platform behind their firewalls, with less emphasis on data sharing but more emphasis on the capabilities of the platform to manage data and the tools that are available. And then I have to emphasize that, from a developer standpoint, we started this way because we wanted to be able to develop this platform whether or not we were connected to one another.

B

So just having that portability gives you an opportunity to develop tools and other features without necessarily being connected to a single system. Switching a little bit now to the specifics of the platform, one thing that I wanted to emphasize here is the importance of some of these other efforts that are happening, that you're probably aware of, and if not, I wanted to make sure folks were. Right away, we knew one of the issues that we would have is:

B

how do we determine who you are? We didn't want to be the gatekeeper in that respect, and so we leveraged something called Globus Auth, and that gives us the ability to rely on an investigator's home institution to validate their identity.

B

If they're not connected through InCommon or through some other means to Globus, then Globus is connected to Google, and one can also use an ORCID iD, which again greatly facilitates at least that first step of figuring out who you are and believing it. The second thing that we really wanted to be able to do, and we've made some great improvements over the last six months, is making it easier for people to find data.

B

So our tools are specifically geared around being able to select geographic areas of interest and then discovering what datasets might be available for you to play with or to potentially integrate into your analyses. I'll talk a little bit more about that piece, because there are a couple of layers in the way we make data discoverable that I think are interesting and somewhat unique. And this sort of gets to what happens once you've selected data.
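
A minimal sketch of the discovery step described above: select a geographic area of interest, then list the datasets whose extent overlaps it. This is not the GEMS implementation. The `BBox` and `DatasetRecord` classes, the `discover` function and the catalog entries are all hypothetical, and the overlap test assumes plain bounding boxes in decimal degrees that do not cross the antimeridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Geographic bounding box in decimal degrees."""
    west: float
    south: float
    east: float
    north: float

    def intersects(self, other: "BBox") -> bool:
        # Two boxes overlap unless one lies entirely to one side of the other.
        return not (
            self.east < other.west
            or other.east < self.west
            or self.north < other.south
            or other.north < self.south
        )


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical metadata-only catalog entry; no data values live here."""
    name: str
    extent: BBox


def discover(catalog, area_of_interest):
    """Return names of catalog entries whose extent overlaps the selected area."""
    return [rec.name for rec in catalog if rec.extent.intersects(area_of_interest)]


# Illustrative entries only; names and coordinates are made up for this sketch.
CATALOG = [
    DatasetRecord("lake_clarity_raster", BBox(-97.5, 43.4, -89.4, 49.4)),
    DatasetRecord("farmer_survey_region_a", BBox(-96.5, 43.5, -94.0, 45.0)),
    DatasetRecord("out_of_state_trial", BBox(-100.0, 30.0, -98.0, 32.0)),
]

twin_cities = BBox(-93.6, 44.7, -92.8, 45.2)
matches = discover(CATALOG, twin_cities)
print(matches)
```

Searching only the catalog records, not the data, is what lets a dataset be found before anyone has been granted access to it.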

B

There are fairly easy and intuitive ways that many of the platforms today allow you to browse and to filter and so forth. We have a way that you could copy the data, once they're discovered, to the container where your analyses are running, or you could preview the data before you move it into your analysis space. But, importantly, there's also a way to ask whether you can have access. We disentangle the metadata from the actual data, recognizing that, in many cases,

B

there needs to be some understanding about how those data will be used before those data can be inspected or analyzed by a third party. We call that selective data sharing, so that you can share your data more selectively, and that could also involve a timed embargo. So imagine the case where you just want to wait, or you have to wait for some reason, before you can share that more broadly with potential collaborators. We've done a little bit of work getting further into this.
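
The selective sharing and timed embargo described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the platform's actual access model: the `Product` class, its field names, and the rule that data open to everyone once an embargo date has passed are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set


@dataclass
class Product:
    """Hypothetical record: discoverable metadata plus access rules for the data."""
    title: str
    owner: str
    approved_users: Set[str] = field(default_factory=set)
    embargo_until: Optional[date] = None  # None means no timed release is planned

    def metadata(self) -> dict:
        # Metadata stays discoverable even when the data themselves are not.
        return {"title": self.title, "owner": self.owner}

    def grant(self, user: str) -> None:
        """Owner approves a request, e.g. once a data-use agreement is in place."""
        self.approved_users.add(user)

    def can_read_data(self, user: str, today: date) -> bool:
        if user == self.owner or user in self.approved_users:
            return True
        # Assumption for this sketch: everyone else gets access only after
        # the embargo date, and never if no embargo date was set.
        return self.embargo_until is not None and today >= self.embargo_until


survey = Product("Farmer survey responses", owner="pi_a",
                 embargo_until=date(2020, 6, 1))
print(survey.metadata())
print(survey.can_read_data("collaborator_b", date(2019, 6, 7)))
survey.grant("collaborator_b")
print(survey.can_read_data("collaborator_b", date(2019, 6, 7)))
print(survey.can_read_data("anyone_else", date(2020, 6, 2)))
```

Keeping `metadata()` independent of `can_read_data()` mirrors the point made in the talk: a third party can find a dataset and ask for access without being able to inspect it.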

B

Now, as we develop specific collaborations, we are developing custom tools. These tools are again built within this containerized framework and allow people to do some common filtering of some of the data that's available, data that's specifically geographically explicit or has a temporal component. This is just an example of one project we're working on, where people need to be able to overlay different types of satellite imagery and understand better what crops might be of interest, based on these different bands,

B

the spectral bands that are available in the data sets that we have in that working group. Very importantly, we build this on a Jupyter notebook, so we're heavily leveraging an interface that's increasingly, fortunately, familiar. You can see my bias already: I'm a Jupyter fan, and there are a lot of people who are now using this. So it gives us a little bit of a head start in terms of onboarding

B

folks. It obviously has the advantage of being shareable as well, and thereby being replicable, so that people can share notebooks and workflows and pretty easily get to work with one another.

B

We're also contributing to Jupyter by adding some support for a VNC remote desktop environment, so through Jupyter you can get a full desktop. This is great for a lot of our projects, like the ones that I just mentioned, because they've got a lot of workflows that don't necessarily fit easily into a Jupyter notebook, and this is a good way for us to at least accommodate some of their analyses without having to completely retool everything.

B

You know, there's a bunch of work that we've had to do in terms of the permissions and things like that to make this work right with respect to the datasets that are part of the GEMS platform. There's one more thing that we're also putting in that's unique to our platform. I know other platforms do this,

B

but our approach to it is that we're creating this wizard-like data integration tool that takes people through uploading what we call products, because we refer to those broadly as both data and analyses or workflows. I'm kind of skipping over some steps, but I want to give you, again,

B

a sense of what's happening here. This has so far been very focused on ag and ecological data, and so, as part of this process and this attempt to better integrate datasets, we pay attention to whether these data might already be part of a standard ontology or taxonomy, or some known vocabulary, so that we can begin to put some structure around what it is that you might be talking about,

B

you know, when your data set has things like a breeding program associated with it. If we tie that in with an existing one, that might help as far as standardizing and integrating with other datasets.

B

Another thing that we're adding some value to is what my stats professor, when I was in grad school, used to drill into my head: garbage in, garbage out. We want to try to figure out ways to prevent that by looking very early on, in the upload phase, at where there might be some data errors. One way we do that is we've provided some generic tools to try to alert someone to potential outliers.

B

You can kind of select, as you begin to upload your data, some of these different tools for visualizing the data before it's actually been registered to the GEMS platform, to try to keep potential errors out of there. Likewise, and this is definitely in the same vein, this applies to every project that I've been involved in that includes spreadsheets.
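
A minimal sketch of the kind of generic outlier alert described above, run on a column before it is registered. The talk does not say which method the platform uses; Tukey's interquartile-range fences are swapped in here as one common choice, and the `flag_outliers` function and the sample values are hypothetical.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return (row_index, value) pairs that fall outside Tukey's fences.

    The fences are Q1 - k*IQR and Q3 + k*IQR. Flagged rows are only
    candidates for review; nothing is dropped or changed.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]


# Made-up measurements; the sixth value looks like a misplaced decimal point.
nitrate_mg_per_l = [4.1, 3.9, 4.3, 4.0, 4.2, 41.0, 3.8]
flagged = flag_outliers(nitrate_mg_per_l)
print(flagged)
```

Returning row indices rather than silently removing values keeps the decision with the person uploading the data, which matches the "alert" framing in the talk.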

B

There are inevitably multiple spellings of the same thing, which can often confound things and take anywhere between a couple of hours and a couple of weeks or more to correct, so that you can actually get on with the inference that you're interested in. So we have some tools that look for potential spelling errors and allow the end user to easily correct those before the data are registered in the GEMS platform. Last, but certainly not least, there are overarching metadata standards.
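
The check for multiple spellings of the same thing could look roughly like this. It is a sketch using Python's standard `difflib` module, not the platform's tool; the `suggest_spelling_fixes` function, the similarity cutoff of 0.85 and the crop labels are assumptions for illustration.

```python
import difflib
from collections import Counter


def suggest_spelling_fixes(labels, cutoff=0.85):
    """Suggest a preferred spelling for labels that closely resemble a more
    frequent one. Only suggestions are returned; the user decides what to apply.
    """
    counts = Counter(label.strip() for label in labels)
    canonical = {}    # lowercase form -> preferred spelling
    suggestions = {}  # variant as entered -> suggested preferred spelling
    # Visit the most frequent spellings first so they become the canonical forms.
    for label, _ in counts.most_common():
        close = difflib.get_close_matches(
            label.lower(), list(canonical), n=1, cutoff=cutoff
        )
        if close:
            suggestions[label] = canonical[close[0]]
        else:
            canonical[label.lower()] = label
    return suggestions


# Made-up spreadsheet column with case, plural and typo variants.
crops = ["Soybean", "Corn", "soybean", "Soybeans", "Corn",
         "Alfalfa", "Soybean", "Alfafa", "Alfalfa", "Corn"]
fixes = suggest_spelling_fixes(crops)
print(fixes)
```

A stricter cutoff misses more typos and a looser one starts merging genuinely different short labels, so in practice the threshold would need tuning per column.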

B

When it comes to the product itself, we now follow six metadata standards that are widely accepted, and we put some additional data around your data to help us group things where relevant and help end users also find data that might be relevant to their.

B

This is where I just wanted to mention that

B

we do separate the actual core data from the metadata that we've either extracted or you've explicitly entered. This was important for us in working with some of our industry partners, where they were definitely interested in sharing data, but they also needed to know a little bit more about how it was going to be used before they shared it. The idea behind separating these data is that it gives someone the ability to find the data through our various search mechanisms and then work through an agreement with the data holder

B

if there are concerns about how the data are going to be used. This last step exemplifies what that might look like, where you could specifically change some of those attributes, such as the team members who have access to it, and you can see various metadata components that would be part of what was advertised to a broader community. One last little bit here.

B

What I wanted to mention is that we've got concerns in the ag space, and I think this is true of other spaces, specifically about who owns the data and how it's protected. Importantly, I think this really gets into areas that are administrative, technical and physical in terms of protecting the data, and these are things that we're looking at. I'd be happy to go into more detail in another venue.

B

The other way, again on the technical side, that we're looking at these, and we're being helped by a number of things that are out there, is data in motion, data in use and data at rest. And then, just importantly, you can do a lot to protect data from those standpoints, physical, technical and administrative.

B

But if you have data privacy laws in your state that allow people to make requests on certain types of data and render them public, even though the constituents that own the data may think otherwise, then those protections really don't matter. So we've had an important thing, a couple of years in the making, happen here in the state of Minnesota, and that is that we have afforded private, non-public status to these types of agricultural research data that reside in our platform. This is very important, again, from the standpoint of developing confidence around data that otherwise hadn't been afforded

B

those same levels of protection. For future work, we're looking at better defining our API. We actually have APIs now, but we haven't advertised those in ways that let other platforms more easily pull or push data into GEMS. Likewise, we need to do a lot more work when it comes to being able to access other folks' data, and so we're doing some work in that space.

B

We also have a great group right now who are looking at integrating IoT platforms, and the data that they collect, with the GEMS platform, and so we've got some really cool proofs of concept going on right now. I was just over on the St. Paul campus, where some of this work is being done, and there was a coterie of students actually scattering field sensors all over the campus there that, I'm hoping, didn't get run over by lawn mowers today.

B

But this is an area that we're excited about. The last future thing that's very much on our roadmap is federating. We have an existing collaboration now with Stellenbosch University in South Africa, and we're beginning to really develop and, excitingly, make plans to codify this federation so that data could actually be transmitted over very broad geographic areas. So we're excited about that, and there are other collaborations along the federation lines that we're working on right now.

B

Many thanks to those people who are doing the work, in particular the GEMS team, many of whom are shown here, though not all of them, and then to Jeff Peterson in particular, who's the water expert, as well as Adam Wilkie, who's just recently joined the team.

B

These are the folks who, if you have any questions about anything related to the three projects that I just mentioned, know the answers and are the experts on what the University of Minnesota in particular is doing around water resources. With that I'll say thank you, and also, if you're a water person, be sure to save the date, because we've got a cool event going on here a year from now in the Twin Cities.

B

It's absolutely lovely here in June. I highly recommend it, especially if you live in Austin; it's much cooler right now, and it's not too cold. So with that, again, I'm really happy to take any questions.

A

This is Melissa. While people might be thinking about questions, I just want to say thanks, Jim, that was great. I also want to note that Jeff Peterson, who was on the previous slide, is the incoming PI for the Midwest Big Data Hub group up at Minnesota. Thanks.

A

Thanks for that presentation, Meredith.

B

A

Really appreciate the user focus on discoverability and meeting people where they are with the workflow. We're always interested in what sort of successes you've had, or opportunities you think are out there, for community feedback on data quality and actually looking to improve, you know, the fundamental state-issued or nonprofit-issued data sets. Could you maybe speak a little bit to that? You mentioned linking to, like, standard ontologies, yeah.

B

I hear you on that, Meredith. That's kind of been one of the great things working with this group. Before we started building things, and when I say building things I mean building technical things, we first started building community, and one of those communities is the IAA, the International Agroinformatics Alliance. Because ag, you know, very clearly transcends borders,

B

we wanted to make sure that when we began to address some of these things like ontology, we had broad community input, recognizing that we weren't going to be able to force everyone into one particular ontology and that we needed to hear what was out there and adapt to it. So we've been working through that group, specifically around data standards and ontologies, and that's been extremely useful. It's opened us up to a lot of groups here in the US and abroad who are really the leaders in that space.

A

You know, presumably spent, you know, minutes or hours or a huge chunk of their career working on the data. Have you seen any successful mechanisms of their, you know, review of the data getting back to the data stewards, or is that potentially an opportunity where the hubs could add some value?

B

Yeah, I think, in fact, I'm glad Melissa brought up Jeff and his role, because at the University of Minnesota our focus going forward will be on water resources, and one of the things that we specifically put into the proposal was the idea of convening first. That convening will be specifically around water resources and the data standards and needs of that community, so that whatever development we do would be focused on those needs.

B

And so that's top of the list. Melissa could echo this, I'm sure, but the award has just been made official, and there hasn't been an official announcement yet because there are press people, various communications people at universities, who are putting that together. So that's going to be one of our first focuses, though. So that's really.

A

It's great to see, and we're fans of NIWR, especially, you know, with Sam Fernald, who's on our steering committee in the West, being the outgoing president.

A