South Big Data Hub Data Sharing & Infrastructure Group, 20 Jun 2019

From YouTube: CI WG demo: Building Cyberinfrastructure for Multidisciplinary, Multiscale Agroinformatics

Description

Date: 11/2/2019
Presenter: Jim Wilgenbusch
Institution: Minnesota Supercomputing Institute
Midwest Big Data Hub

A

You know about me, so I'm going to skip over this. The problem is agroinformatics, and it's actually a good problem to have: sensors, sensors everywhere. It's extremely true if you work in the ag space. It's not atypical of any of the other spaces, but there really are sensors everywhere.

A

They are collecting data across a huge range of scales, from little sensors that are out in the field to satellites that are up in space.

A

We have data coming in in all kinds of forms, so that's the challenge, the opportunity, and the problem. When we look at it a little further, we see that the data, not surprisingly, are fairly siloed: across institutions, individuals, and subject matter. There are some glimmers of hope, in the sense that people also have some sense of what metadata they're collecting or what general vocabularies they might be using to describe these data.

A

But I can say from personal experience, especially when we start talking about developing cyberinfrastructure, that the fact that these things exist, and that there are many of them, leads to important implementation questions. Which one do we follow, for example? That gets into interesting arguments and cultural warfare that is sometimes difficult for CI developers to sort out by themselves.

A

When you get into the actual data, you also discover that the data are broken in some form or another. That's probably not at all a surprise, but it is a tremendous challenge from the standpoint of building cyberinfrastructure for domain-specific activities.

A

In fact, I would propose that one of the biggest obstacles to realizing some of the promises of the big data revolution is that data are messy. The non-sexy side of big data is rolling up your sleeves and figuring out how to clean it. As a small example, in some data sets that we're collecting, digital data sets that have actually been published, we see things like this, where the management conditions are full of different languages, different spellings, different capitalizations, and so forth.

A

There are different ways that you might say "low nitrogen" that could have a pretty significant impact if you started to do analyses on these data. So the broken data issue is certainly a huge challenge. It's not specific to ag or digital agriculture, but it is tremendous in terms of working on these problems.
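To make the "low nitrogen" point concrete, here is a minimal Python sketch of how spelling and capitalization variants split one management condition into several categories, and how a simple normalization pass collapses them again. The labels and the alias table are invented for illustration; they are not taken from the data sets shown in the talk, and this is not the platform's own cleaning code.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical raw labels, invented for this example.
raw_conditions = [
    "Low Nitrogen", "low nitrogen", "LOW  NITROGEN",
    "low-nitrogen", "Low N", "High Nitrogen",
]

# Hypothetical alias table: normalized variant -> canonical label.
ALIASES = {"low n": "low nitrogen", "high n": "high nitrogen"}

def normalize(label: str) -> str:
    """Casefold, unify separators and whitespace, then apply known aliases."""
    text = unicodedata.normalize("NFKC", label).casefold()
    text = re.sub(r"[-_]+", " ", text)        # treat hyphens/underscores as spaces
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    return ALIASES.get(text, text)

before = Counter(raw_conditions)                      # counts per raw spelling
after = Counter(normalize(c) for c in raw_conditions) # counts per cleaned label

print(len(before), "distinct labels before cleaning")
print(dict(after))
```

Without the cleaning step, a group-by on the raw column would treat each spelling as its own treatment, which is the kind of analysis impact described above.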

A

I just wanted to throw this out there because this isn't a new problem, right? It has been around for a little over a decade now. Jim Gray commented in 2007 that the tools for capturing data, both at the mega-scale and at the milli-scale, are just dreadful. Here we are again, a decade later, and I think it's the same now.

A

Anyone dealing with data, especially, I think, in the ag space, is also seeing that there are privacy concerns, a lot of them. This is just a quick scan of some of the relatively recent literature on these questions: Who owns the data? Who's reaping the benefits? What are some of the big privacy concerns around how data are being shared and aggregated?

A

I think when looking at infrastructure, and developing CI in particular, these are really important issues to look at because they can be major roadblocks to getting work done. I say that because I've done a good amount of work in the academic health space, where we have pretty well-defined boundaries around handling data and how to do it, and likewise for financial information.

A

There are federal laws, and in both of these cases there are also state laws, that govern how those data are stored and shared. But there isn't a whole lot right now in the agricultural space. There are some groups, many of them industry-led, that are beginning to develop some prototype standards, but it's still largely the Wild West in terms of policies and rules around how to handle agricultural data.

A

We worked for about two years in the state of Minnesota to get legislation on the books. It just passed this August: data that are stored on our platform, which I'm going to describe in just a minute, are considered to be private or nonpublic. That's important because there are a couple of ways that people can get at data once they go onto state-run systems, and that would potentially keep people from contributing to common repositories.

A

These are similar to FOIA or the data privacy acts that people have within their own states for requesting those data. So we're now able to put data onto the platform that I'm going to describe and give some assurance of privacy for those data, which has been extremely important in terms of working with farmers and private companies.

A

A last general challenge that we face is really the scope of the things that we want to do. The data sets are incredibly diverse: genomic, environmental, management, and socio-economic (hence the name GEMS) are the data that we're interested in making interoperable, so that we could make broad inferences about various phenomena related to the food value chain.

A

But, of course, we also have this thing called time and space. These are really important elements, and things that we've considered in developing the platform. This tries to illustrate that a lot of people have done things like this, where environment-by-genomic data have been used very effectively over time and space. But we're interested in extending this inference over many more data types and doing simultaneous modeling of these disparate data types.

A

So, to break it down, we feel like there are two general problems. One is technical: we need to develop tools to facilitate easy data ingest and analysis.

A

We need to develop cyberinfrastructure that really scales from small data, which is actually a lot of the problems that we work on, to big data. People typically think of "big" in terms of volume, but we all know that there's more to it than that. And we need to develop models, and there's a lot of room in this space, that span diverse data types and also time and space.

A

On the social end, and you could definitely make an argument that this also bleeds into technology, I would say: develop standards that will be useful to the community, while also recognizing that data are going to be messy. If your platform doesn't deal with that reality, then it's probably not going to work well. Also, promote FAIR data (I think everybody here knows what FAIR is) while respecting data privacy concerns. That's critical in the ag space because 64% of R&D in ag is now done in the private sector. That's a recent change: in the last decade, we flipped over to funding more research in the private sector in ag.

A

And it's a hopeful story. The hopeful story is that we start off with visions of what we might want to have, and we need all of these intervening technologies before we can actually realize the airplane that moves large numbers of passengers fast. Without those intervening technologies, it won't work in the same way.

A

We have, in 1792, the Farmers' Almanac, which tries to help the farmer make better decisions about what they're going to do in ag. We feel like there have been a ton of intervening technologies that make it realistic for us to develop a platform to address some of these questions, questions that maybe we put to the Farmers' Almanac before, but for which we could use more data-driven solutions in the future. So the timing is right, and I think, again, I'm preaching to the choir here.

A

A lot of people understand that these obstacles have fallen, and we have a lot of useful tools to leverage, so we're doing that. I want to make sure that we have time with this group, in particular, for people to ask questions.

A

But this is a smattering of the tools that we're using. Everything is containerized to support these principal focus areas: data transfer, data interoperability, data analysis, and data sharing. From the very beginning, we've also focused on making sure that, by using these containers, the GEMS platform is portable and can run on clusters, workstations, and laptops, whether you're running Linux, Mac, or Windows, or running someplace up in the cloud.

A

The reason for that, really, is that there obviously are different values to running on these different platforms. If you have to do long-running, compute-intensive jobs, you want the cluster. Where there are serious privacy concerns, which we are actually supporting, we're not going to host the platform; it's going to be hosted behind company firewalls. And then, of course, for developers, it's important to be able to move the platform onto laptops so people can move around and make changes to the platform very easily.

A

The specific contributions that we're making can be wrapped up into two parts. GEMS share, essentially, is what controls access to the data: who sees it, when, and what they specifically see. Somebody commented earlier that this sounds a little bit like what iRODS might do, and that's right.

A

In terms of smart sharing, GEMS share is really a close fit to iRODS, but it doesn't have nearly the broad functionality that iRODS does. It's really lightweight by design, and it's conceivable that it could be replaced by iRODS in the future.

A

It does have some features, like data versioning, which have been really important so that, as people register data products, they can roll back easily to other versions.

A

It supports this notion of open, private, and pooled data sets, which are also really critical for some of the things that we're engaged in. And it really goes beyond data, in the sense that what we call products include not just data sets but also workflows. Right now, a workflow largely means a Jupyter notebook, but we have also worked on wrapping up other, more complicated workflows that operate outside of a notebook.

A

GEMS tools are much more bespoke (as people seem to be saying now, rather than "custom") things that allow us to use some of the niceties of JupyterHub to deliver web-based applications within the platform for data cleaning, computing, and data interoperability.
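As a rough illustration of the GEMS share ideas described above (open, private, and pooled products, with registered versions that can be rolled back), here is a minimal Python sketch. It is not the GEMS implementation: the class names, the `can_view` rule, and the assumption that a pooled product is visible only to its owner and listed pool members are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from enum import Enum

class Visibility(Enum):
    OPEN = "open"
    PRIVATE = "private"
    POOLED = "pooled"

@dataclass
class Product:
    """A registered product: a data set or a workflow (hypothetical model)."""
    name: str
    owner: str
    visibility: Visibility
    pool_members: set = field(default_factory=set)
    versions: list = field(default_factory=list)  # oldest first
    current: int = -1                             # index of the active version

    def register_version(self, content: str) -> int:
        """Store a new version and make it the active one."""
        self.versions.append(content)
        self.current = len(self.versions) - 1
        return self.current

    def roll_back(self, version: int) -> str:
        """Make an earlier version active again without deleting later ones."""
        if not 0 <= version < len(self.versions):
            raise ValueError(f"no such version: {version}")
        self.current = version
        return self.versions[version]

    def can_view(self, user: str) -> bool:
        """Assumed rule: owner always; anyone if open; pool members if pooled."""
        if user == self.owner or self.visibility is Visibility.OPEN:
            return True
        if self.visibility is Visibility.POOLED:
            return user in self.pool_members
        return False

trial = Product("maize_trial", owner="alice",
                visibility=Visibility.POOLED, pool_members={"bob"})
trial.register_version("v0: raw upload")
trial.register_version("v1: spelling corrected")

print(trial.can_view("bob"), trial.can_view("carol"))
print(trial.roll_back(0))
```

Keeping every version in the list and moving only an index is one simple way to make rollback cheap; a real system would also need to record who registered each version and when.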

A

What it looks like: can everybody see this okay? I'm not giving a live demo during a lightning talk, so this is essentially a screenshot of an older version where somebody is uploading data. This is just a view of the data that you can click through to correct spelling errors, whatever these might be.

A

We use some increasingly sophisticated algorithms to match these terms, to give people a good idea of where there might be spelling errors, where they might have meant one thing but instead got another, and to easily correct those in a first pass before those data sets get registered.
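The talk does not say which matching algorithms are used, so the following is only a sketch of the general idea, using the similarity matcher in Python's standard `difflib` module as a stand-in. The function name `suggest_corrections`, the 0.8 cutoff, the rule that a rarer spelling is mapped to a more frequent one, and the sample column are all assumptions made for this example.

```python
from collections import Counter
from difflib import get_close_matches

def suggest_corrections(values, cutoff=0.8):
    """Suggest, for each rare spelling, a similar and more frequent spelling."""
    counts = Counter(v.strip().lower() for v in values)
    # Most frequent terms first; frequent spellings are assumed to be correct.
    ranked = [term for term, _ in counts.most_common()]
    suggestions = {}
    for i, term in enumerate(ranked):
        more_common = ranked[:i]  # only compare against terms ranked above
        match = get_close_matches(term, more_common, n=1, cutoff=cutoff)
        if match and counts[match[0]] > counts[term]:
            suggestions[term] = match[0]
    return suggestions

# Hypothetical uploaded column, invented for this example.
column = ["irrigated", "irrigated", "irrigated", "irigated",
          "rainfed", "rainfed", "rainfeed", "irrigated"]

suggestions = suggest_corrections(column)
for wrong, right in suggestions.items():
    print(f"{wrong!r} -> did you mean {right!r}?")
```

The output is a set of suggestions rather than automatic rewrites, which matches the click-through correction step described in the talk: the person uploading the data still decides.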

A

Likewise, we do some guessing now on, I think we're up to, 16 ontologies and vocabularies. We notice column headers that fit some of those, as well as some of the data within them, and we make some suggestions. But there's a pull-down menu where the user can change that if we got the guess wrong: AGROVOC, say, as opposed to the crop research ontology.

A

A couple of projects that we're working on operate across the G by E by S space. On a national level, in G by E, we're engaged heavily now in the Genomes to Fields project, which is supporting a group of collaborators across 23 states. On a global scale, we're involved in a multi-peril risk analytics project that exercises all four of the GEMS, if you will.

A

This is what our team looks like, and this is what we do, and actually I'm going to be headed there right after this meeting. We've also hosted some large international groups; we're on our third and already preparing for our fourth large group meeting.

A

This group has been growing, and they have been giving us outstanding feedback to really drive the tool and keep it grounded in important ways. Here is a little description of what that is, specifically from the IAA standpoint. Of course, we're at a university, so we've got an awesome opportunity to train students, and we're doing that now through our BICB program (Bioinformatics and Computational Biology). We've gotten some fantastic students out of that program who are working with us now. You can find more information at our newly minted website.

A

We've got a brochure also and, of course, feel free to reach out to me if you have any questions. I don't know if I went way over ten minutes, but I'm happy to take some questions.

A

Yeah, why don't we go ahead. We can have plenty of questions if people feel like hanging around, but we're down to about 13 minutes here, so I want to respect people's time.