South Big Data Hub Data Sharing & Infrastructure Group, 2 Nov 2018

From YouTube: CI WG demo: IMaD (Integrative Materials and Design)

Description

Date: 11/2/2018
Presenter: Ben Blaiszik
Institution: University of Chicago
Midwest Big Data Hub

A

It's called Integrative Materials and Design, and our goal is really to connect various pieces across the Midwest materials data ecosystem. We are, as you see, part of the Midwest Big Data Hub.

A

So just diving in here, as I mentioned, the high-level things we're looking to do: we're looking to connect researchers (so connecting people first, whether those are academics or industrial researchers), data services, and software and tooling across materials science, starting in the Midwest and then branching out.

A

Another goal is to provide simplified and unified access to high-value materials datasets that are generated by the partner institutions. Our third goal is to inform the community and the public about all the great materials informatics work that's being done by these groups, and by other groups as well, through a series of videos, articles, webinars, tutorials, workshops, and more.

A

As you see here on this map, the partners are shown: Northwestern, UChicago, which is the lead institution, the University of Illinois at Urbana-Champaign, Michigan, and Wisconsin. We've also partnered with some researchers at NIST and some industrial researchers at Citrine Informatics, and I'll describe a little more of those interactions as we go forward.

A

So, some of the connections that we're building between researchers. We see that as very critical, making sure that people are talking to each other in this little ecosystem. The first is with the Materials Informatics Skunkworks at the University of Wisconsin. The Chicago and Argonne team is holding weekly meetings with this group, mostly around the discovery of metallic glasses.

A

But it's also to coordinate software development and machine learning projects and make sure we're not duplicating work. As a result, we actually submitted a joint CSSI proposal that we are still waiting for word on. And, as Jim mentioned, it's a great opportunity to start training the next generation. The Skunkworks is a group of about 20 undergraduates at Wisconsin who are interested in materials informatics, and we see there's quite a good opportunity there to train them. We're working with Citrine Informatics as well.

A

So we have bimonthly meetings with this group to discuss data service integrations and joint machine learning projects. We've had two joint papers over the last year with this group, and we now have a project funded by Citrine as a result of this collaboration. Just recently, in the last two weeks, we had a materials microscopy data workshop held at Northwestern, basically working through challenges toward capturing data from various instrumentation, especially around microscopy data.

A

So we had representatives from Zeiss, JEOL, Citrine Informatics, academia, and funding agencies, and we'll have a follow-up workshop at NIST in 2019. The nice part here was that we had representatives from every one of our spoke project teams at this meeting, so we had a nice chance to have everybody talk and have things move forward. We're also working with the National Data Service to build integrations between the Materials Data Facility effort and some of the services that NDS is building and, of course, to again coordinate team software development. We've seen this as a recurring theme: if you're not talking, you're often duplicating what other people are doing.

A

So we're really finding that to be important here. Another aspect is the community outreach that we're working on. We have had, through the IMaD project, two different interns, Stephanie Fox and Austin Keating, who are part of the Medill School of Journalism science communication program. We've had them visit each partner site and create videos that showcase the people at those institutions, the facilities, and the data services, to really drive awareness of all of those. You can see them here on the right.

A

We have a set of four videos that are already released and we have another set of I believe four or five that will be released in the coming two weeks covering each one of the partners.

A

So the other piece that we've been working on is leveraging the Materials Data Facility effort to connect data services. Just to give you a real brief overview: the Materials Data Facility is building data services to allow researchers to publish data regardless of size (so maybe terabytes of data), data type (so it could be heterogeneous data), and location (it could be on distributed endpoints).

A

We're also looking to automate data and metadata ingest, so you can think of automatically indexing piles of files that are materials related, and to enable unified search and discovery across data sources that come into MDF, but also data sources that we index from the community more widely.

A

And so, in particular with IMaD, the concept that we've really latched onto is this MDF Connect flow, where we have many inputs being piped through the MDF Connect service and sent to many different outputs. Really the goal there is to make it easier for users to deposit their data from where they're collaborating into many services on the other side, all from one location. We heard this many times: there are, you know, 10 or 15 different data services that are important to the materials community.

A

Researchers don't really want to have to deposit their dataset in 15 different ways, so we're trying to help allay that problem. As you see, the inputs we've been able to focus on through IMaD are largely centered around the partners. So the 4CeeD service is built at Illinois, Materials Commons is at Michigan, and we've also been working with NIST. Of course, we have other integrations that make it very simple for users to get data into this pipeline. The data is then sent through the pipeline.

A

A series of extractions and transformations is performed on that data. Extraction means trying to pull out things like crystal structure or other material properties. Transformation means we transform the data into a form that is amenable to deposit in other services. So we could deposit into other MDF services, like Publish to get a DOI, or our search service to allow querying and aggregating of that data. But beyond that, we can also deposit into services like the NIST Materials Resource Registry, Citrine Informatics' Citrination platform, NanoMine, and others that are in development.
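
The many-inputs, many-outputs flow described here can be sketched roughly as follows. This is only an illustration under stated assumptions, not the MDF Connect implementation: the `Submission` shape, the toy extractor, and the two destination transformers are all hypothetical names invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Submission:
    """A dataset entering the pipeline (hypothetical shape)."""
    title: str
    files: Dict[str, str]  # file name -> text content
    metadata: Dict[str, str] = field(default_factory=dict)


Extractor = Callable[[Submission], Dict[str, str]]
Transformer = Callable[[Submission], dict]


def extract_file_types(sub: Submission) -> Dict[str, str]:
    """Toy extractor: record which file extensions are present."""
    exts = sorted({name.rsplit(".", 1)[-1] for name in sub.files if "." in name})
    return {"file_types": ",".join(exts)}


def to_search_record(sub: Submission) -> dict:
    """Toy transformer: a flat record suitable for a search index."""
    return {"title": sub.title, **sub.metadata}


def to_registry_record(sub: Submission) -> dict:
    """Toy transformer: a minimal registry entry."""
    return {"name": sub.title, "n_files": len(sub.files)}


def run_pipeline(
    sub: Submission,
    extractors: List[Extractor],
    destinations: Dict[str, Transformer],
) -> Dict[str, dict]:
    # Step 1: extraction enriches the submission's metadata.
    for extract in extractors:
        sub.metadata.update(extract(sub))
    # Step 2: transformation builds one payload per selected destination.
    return {name: transform(sub) for name, transform in destinations.items()}


if __name__ == "__main__":
    sub = Submission("GaN etching run", {"run1.cif": "...", "notes.txt": "..."})
    payloads = run_pipeline(
        sub,
        [extract_file_types],
        {"search": to_search_record, "registry": to_registry_record},
    )
    for destination, payload in payloads.items():
        print(destination, payload)
```

The point of the shape is that the user submits once, and adding a new output service only means adding one more transformer to the `destinations` mapping.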

A

So we have a variety of ways you can do this. You can do it through a web form which is shown here on the right.

A

We ask for a very minimal amount of metadata, like title, authors, data location (so we can go get that data and index it), tags, and description, and then we ask you where you want to sync that data to. If you want to send it to different places, you just click these boxes, and you only have to do it from one place. The other thing we've been able to build is a Python client that allows you to automate these processes.

A

So if you have a large collaboration with hundreds of different datasets that are going to be deposited, you can often use this type of functionality to automate that. You see that all the functionality that was shown in the web form is also available through the Python client.
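
As a rough illustration of what automating hundreds of deposits might look like, here is a minimal sketch. It deliberately does not use the real MDF Connect Python client, whose API is not described in this talk: `build_submission`, `build_batch`, `submit_all`, the dictionary schema, and the example locations are all hypothetical, and the actual sending step is left to a caller-supplied function.

```python
from typing import Callable, Dict, Iterable, List


def build_submission(
    title: str,
    authors: Iterable[str],
    data_location: str,
    services: Iterable[str],
) -> dict:
    """Assemble the minimal metadata the web form asks for (hypothetical schema)."""
    authors = list(authors)
    if not title or not authors or not data_location:
        raise ValueError("title, authors, and data_location are required")
    return {
        "title": title,
        "authors": authors,
        "data_location": data_location,
        # De-duplicate the requested destinations, like ticking each box once.
        "services": sorted(set(services)),
    }


def build_batch(
    datasets: Iterable[Dict[str, str]],
    authors: Iterable[str],
    services: Iterable[str],
) -> List[dict]:
    """Create one submission per dataset, sharing authors and destinations."""
    authors = list(authors)
    services = list(services)
    return [
        build_submission(d["title"], authors, d["data_location"], services)
        for d in datasets
    ]


def submit_all(submissions: Iterable[dict], send: Callable[[dict], str]) -> Dict[str, str]:
    """Send each submission; record failures instead of stopping the batch."""
    results: Dict[str, str] = {}
    for sub in submissions:
        try:
            results[sub["title"]] = send(sub)
        except Exception as exc:  # keep going so one bad dataset does not block the rest
            results[sub["title"]] = f"failed: {exc}"
    return results


if __name__ == "__main__":
    # Placeholder locations, not real endpoints.
    datasets = [
        {"title": f"Run {i:03d}", "data_location": f"example-endpoint:/runs/{i:03d}/"}
        for i in range(1, 4)
    ]
    batch = build_batch(datasets, ["A. Researcher"], ["publish", "search"])
    print(submit_all(batch, send=lambda sub: "queued"))
```

In practice, the `send` function is where a real client call would go; keeping it as a parameter makes the batching logic easy to dry-run before anything is deposited.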

A

The other thing that we've seen is important is integrations with the places where researchers are doing their collaboration right now. So if a researcher has their data in Google Drive, we make it very easy to send it through MDF Connect. We have integrations with Box, Dropbox, figshare, and others.

A

The one I do want to highlight specifically is our connection to 4CeeD, since they're a partner of our spoke. You see here on the left a screenshot of their project space, where you have, in this case, gallium nitride etching with a given pressure, and you may have some files associated with that. Basically, the 4CeeD project space is an active space that you share mainly among local collaborators.

A

And what we're seeing is that those researchers also want a way to export that data outside of 4CeeD for the community to have access to later. So we were able to work with the 4CeeD team and Ben Gillespie at Illinois to build a very simple publication flow from 4CeeD to MDF Connect. You see that the user can select a repository to send to; they select Materials Data Facility, and then they fill out a little bit of that same metadata and click.

A

And then everything is sent out to the community, as the user wants. I'm just going to show you a very quick video of how this works, just to show you how easy it is. So you log in with your institutional credentials; you see there are hundreds of different institutions you can log in with. I'm going to use Chicago. You then select to become a contributor, and then you're going to fill in a little bit of the metadata, like the title, the authors, institutions, and the data location.

A

So in this case I'm going to choose data that's sitting on a Globus endpoint. I'm going to pop over to Globus, grab the folder link, copy that in, and add that to the data location. Then I'm going to tell it where I want to send it: I want to publish it, and I want to send it to Citrination.

A

Now you can follow along as our service is parsing through it, and here in a second you'll see that it's sent to Globus Publish. So now you get a DOI, and you can cite that in your papers and such, and that is a persistent location. Then down here, you see that it was sent to Citrination. We pop over to Citrination, and the nice part here is that it's not like you get the same functionality. You don't get another DOI; you don't get the exact functionality you would get from depositing into Globus Publish.

A

You actually get things like features that are important for DFT calculations, like the functional, the cutoff energy, and a few other things, and these are also now in the Citrination platform, which allows you some easy access to machine learning capabilities.
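
To make the idea of extracting DFT features concrete, here is a minimal sketch. The talk does not say which DFT code or file format is parsed, so this assumes a VASP-style `INCAR` input file, where `ENCUT` is the plane-wave cutoff energy in eV and `GGA` selects the exchange-correlation functional. The function names and output keys are hypothetical, and the parser is simplified (one `KEY = value` pair per line).

```python
from typing import Dict


def parse_key_values(text: str) -> Dict[str, str]:
    """Parse simple 'KEY = value' lines, dropping '#' and '!' comments."""
    params: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].split("!", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip().upper()] = value.strip()
    return params


def extract_dft_features(text: str) -> Dict[str, object]:
    """Pull out the two features mentioned in the talk, when present."""
    params = parse_key_values(text)
    features: Dict[str, object] = {}
    if "ENCUT" in params:
        features["cutoff_energy_eV"] = float(params["ENCUT"])
    if "GGA" in params:
        features["functional"] = params["GGA"]
    return features


if __name__ == "__main__":
    example = """
    ENCUT = 520   ! plane-wave cutoff
    GGA = PE      # PBE functional
    ISMEAR = 0
    """
    print(extract_dft_features(example))
```

A real extractor would need to handle more input formats and edge cases; the sketch only shows how a few well-known tags can become searchable, machine-learning-ready fields.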

A

So that kind of wraps up the high-level summary of what we've been working on with IMaD. As I mentioned, we're working to build the connections between various data services: Materials Commons (which is still in development), 4CeeD, the Materials Data Facility, Citrination, and various services at NIST. We expect many of these service connections will become available over the next quarter.

A

As I mentioned, the video series will be online next month, and we'll be rolling out our website at the same time. We're also looking to start building connections to other big data hubs and other materials science groups around the country. So with that, I'll open up for questions.

B

All right, thank you very much. So we'll turn it over for questions for Ben.

B

When we start talking about community there, any plans security for actually getting things like food in there and I'm going to conclude, I mean things that have been transformed beyond their necessarily their original biological.

B

Meaning that their organism, cultured from an organ like a possum or put them right back towards. Oh.

A

Yeah, I'm trying to share some slides here, but I think I'll just give up for now. I think I heard the question as: are there hopes that we could apply this to a different domain at some point? Certainly, we are interested in applying this to various domains. As we're building these data services, there are only a few layers that are really domain-specific.

A

You know, the underlying cyberinfrastructure is largely generalizable. The pieces that are not generalizable are largely within that MDF Connect service, which I labeled as converters or transformers. Those pieces would need to be changed for different domains to understand different file types and service integrations in different areas, but everything else is really very domain-agnostic: handling things like large data transfer, publishing datasets, getting a DOI. A lot of those things are very generalizable, so we're definitely looking for collaborators in other spaces to build, you know, MDFs of other things.

A

I hope I got the question right, but I had a little hard time hearing.
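
The split described in this answer, with domain-specific converters plugged into otherwise domain-agnostic infrastructure, can be sketched as a small registry. This is an illustration only, not how MDF Connect is built: `ConverterRegistry`, `index_dataset`, and both toy converters are hypothetical. The XYZ converter relies on the XYZ convention that the first line holds the atom count; the FASTA converter counts header lines beginning with `>`.

```python
from typing import Callable, Dict, List

Converter = Callable[[str], dict]


class ConverterRegistry:
    """Maps file extensions to domain-specific converters."""

    def __init__(self) -> None:
        self._by_ext: Dict[str, Converter] = {}

    def register(self, ext: str, fn: Converter) -> None:
        self._by_ext[ext.lower().lstrip(".")] = fn

    def convert(self, filename: str, content: str) -> dict:
        ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
        fn = self._by_ext.get(ext)
        if fn is None:
            # Unknown file types are still tracked, just not parsed.
            return {"file": filename, "parsed": False}
        return {"file": filename, "parsed": True, **fn(content)}


def index_dataset(files: Dict[str, str], registry: ConverterRegistry) -> List[dict]:
    """Domain-agnostic step: walk the files and delegate parsing to the registry."""
    return [registry.convert(name, content) for name, content in files.items()]


def xyz_converter(content: str) -> dict:
    """Materials example: the first line of an XYZ file is the atom count."""
    return {"n_atoms": int(content.splitlines()[0].strip())}


def fasta_converter(content: str) -> dict:
    """Other-domain example: count sequences by their '>' header lines."""
    return {"n_sequences": sum(1 for line in content.splitlines() if line.startswith(">"))}


if __name__ == "__main__":
    materials = ConverterRegistry()
    materials.register(".xyz", xyz_converter)

    biology = ConverterRegistry()
    biology.register(".fasta", fasta_converter)

    print(index_dataset({"h2o.xyz": "3\nwater\nO 0 0 0\nH 0 0 1\nH 0 1 0\n"}, materials))
    print(index_dataset({"seqs.fasta": ">a\nACGT\n>b\nTTGA\n"}, biology))
```

Only the registered converters change between the two domains; `index_dataset` and everything downstream of it stay the same.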