From YouTube: CI WG demo Enhanced Robust Persistent Identification of Data (ERPID) & FAIR Digital Object Framework
Description
Persistent identifiers are commonly used for long-term identification of publications (DOIs), published data sets (DataCite), and even people (ORCIDs). However, PIDs could have more utility throughout the data lifecycle. ERPID and the FDOF are looking at ways to track workflow and provenance information with PIDs that could enable universal data interoperability and full reproducibility of computational workflows.
Date: 04/03/20
Presenter: Rob Quick
Institution: Indiana University
Midwest Big Data Innovation Hub
F
Hi, Christine Kirkpatrick, a proud co-chair of this working group and deputy director at the West Big Data Hub, also at the San Diego Supercomputer Center, where I work on other fun projects that came out of this working group with John and Melissa. Fun to see all of you, and I'm glad we can still be connected at this strange time, struggling to be a contributing member of society — and very glad it's Friday.
H
That's me — I was not able to unmute. My name is Ge Peng, and I'm a researcher at NC State University. I'm always interested in data quality and the FAIR digital object framework. Rob had put in a note about this particular working group meeting a couple of days ago at a FAIR seminar series, so I'm just kind of listening to see what's going on, and it sounds like a very exciting working group with very different stakeholders.
D
Sure, that sounds great — thank you. So one of the big things we've been working on: a week ago tonight, NSF reached out and said that they would like the hubs to work on developing an NSF RAPID COVID Commons. This would be an information source highlighting the NSF RAPID grants, which are fast-moving, as the name suggests. There are currently 22 NSF RAPID grants that have been announced regarding COVID, and they're planning on more, including in the Convergence Accelerators.
D
So we've been working with all the hubs — we're very grateful that we're such a solid community — to put together a proposal to develop this NSF COVID Commons, starting with the RAPID grants that are announced and then looking at connections to the Open Knowledge Network, the Convergence Accelerators, and maybe some other analytics and tools. So that's what's been keeping us super busy all this weekend — it's making me dizzy — and the beautiful thing is we have letters of collaboration from all four hubs, from OSN, and John —
D
Thank you for helping on that as well — and Christine as well, and everybody, and Renata and the whole family. So that's what we're up to; we're hoping to get that submitted today or Monday. We're still swinging for a couple of things — a couple of facilities statements here and there — and sizing what we think is going to be needed for this. But it's very exciting, and a great opportunity for us to work together in a unique and valuable way, which we've been looking for in this COVID area ever since it started.
D
I'm sure someone will talk about the All-Hubs Summit that was moved from May to October; as a concept, now we're trying to figure out the right venue, the right way to do that. You know, I was going to go on vacation next month to South Carolina, and the people with the condo said, here's the letter from the governor saying that if you come here you have to quarantine for two weeks — so you can't go inside the condo. I'm like, well, that doesn't seem like it makes a lot of sense.
B
I'm happy to give an update on the South Hub — what Florence was talking about was very salient; that's just this week. We've also walked a long road for the last year now working with NSF Harnessing the Data Revolution: the projects that are funded through that Big Idea, and all of the PIs involved in those, looking forward to a development meeting.
B
So we're switching that meeting very rapidly. We did a survey of the community, as all of the hubs have, about a virtual meeting. We were going to have a three-day, in-person meeting that was going to be partially a PI meeting and partially a forward-looking meeting, with recommendations around collaboration and coordination in the data science space. Because of everything that's happening in the world — and it is happening this month — we've switched it to a fully virtual meeting.
B
We're going to be running it with the same collaborative feel, but with the three days being fully online. So that's going to be an interesting update on how the community may synergize, and this could play into some of the work that's going into COVID down the line as well.
A
Actually, we were working on rescheduling a meeting earlier this week — it seems to be going around — and we realized, oh man, if everybody doesn't have to be in the same place on the same day, we don't actually have to have the meeting all on the same day. So we're going to work on spreading it out; that's the change we're making.
F
I could do that, and then maybe Melissa won't mind — I'll do this, and she could do OSN later. That's great. So for the West Big Data Hub, Meredith's put together a list of just the favorite resources for people, at westbigdatahub.org — hope I got that right — so you can go there and you can see things like — oh shoot, I'm going to get all the nouns wrong — the US Digital Response.
F
We put this up real quick; it's a clearinghouse for if you need volunteers — yes, the US Digital Response, pardon my mangling the name. I'm just going to paste this one into the chat for people who haven't seen it. We've been thinking as a hub: what can we do in this time? It's probably not to spin up new efforts, but to highlight, and to try to sift through, everything that is bombarding our community — everyone's an expert, everyone's doing stuff.
F
So we've been very involved with GO FAIR — we host the US coordination office — and starting this Wednesday, with our first Zoom-bombing experience for many of us, we launched our four-part webinar series. We actually only started 10 minutes late; we got into a new room and it all worked out. You get the introduction to the Virus Outbreak Data Network this coming Wednesday at 9 a.m. Pacific, 12:00 Eastern. We have Dr. Mirjam van Reisen —
F
I hope I said that right — who will be presenting on how they're building training capacity in Africa, especially as they try to gear up and confront COVID-19 in a much different and under-resourced situation — even more under-resourced than what we're experiencing, of course — and trying to make sure that the data is born FAIR, so that it can be quickly aggregated and insights mined from it.
F
For part four we're going to have three different people — including Microsoft, Natalie Meyers at Notre Dame, and then a couple of people here at the Supercomputer Center — looking at various ways that they're mining heterogeneous data: so not just from journals, but also geospatial and all kinds of omics data, and doing things like building knowledge graphs or even word clouds to look at trends in the response. So we're pretty excited about it. I'm also part of the Virus Outbreak Data Network, and, along with Rob, part of the RDA COVID-19 working group, which is immense.
F
We have, I think, 300 experts signed up at the moment. It's spawned off five different working groups, from omics to social science to community outreach and some other things I can't think of at the moment. Rob and I, along with one or two other people, are helping the co-chairs navigate the very aggressive timeline that the European Commission has given for some policy input they need. And then, last but not least, Florence mentioned the RAPID that she's working on.
F
We also have gotten an inquiry from NSF asking: are there federal data sets out there that would help researchers if they were open? As we find them, if we could let NSF know, they can do what they can to try to help facilitate all of the data that researchers need for modeling what is happening, or for doing the research — but I don't want to take too much time, then.
D
You can click through our homepage to our COVID site, and on there there's an event, as an example, being led by Bari — I'm Italian, so I like to say it like that, but it's B-A-R-I — and they're having a "what are the other effects from COVID" webinar on April 24th. Now that this stuff is virtual, anyone can participate — I asked them, is it okay if anyone participates? They said yeah — so I think this is an opportunity for us, as all the hubs working together, to highlight these opportunities with each other.
C
There we go — sorry, still having — it's okay, I'm getting the hang of this spacebar thing. Yeah, so the OSN: we're having great success with moving toward automated — or centralized, and some automated — management of the hardware, and managing the software: when we have updates to the software, pushing those things out. Really great headway. We're also working on our Trusted CI assessment for cybersecurity, and we've got a number of new use cases that we're onboarding as well. The next big push, in addition — so, the leadership team met this week.
C
I can come back and do a presentation for this group on our use cases, because one of the efforts we're doing for late summer — and it's really necessary — is getting some more information out about the use and users of the OSN, and actually being able to articulate them as case studies, so people can start to say, hey, we know of a project that really could use the OSN for X purpose. So again, a handful of new use cases.
C
We've got a whole bunch now, with allocations anywhere from ten to a hundred terabytes and larger. We actually have a group now working on spinning up a petabyte worth of data. That's going to take some time, because they're going to get funding to recruit — to pay for — a whole new pod for that. So it is exciting times: we're open and ready for business, we're looking for new users, and, in fact, Tyler —
A
This morning, one of the neat things he brought up was that, like any project, you have a set of expectations about how people are going to use what you've got, and we're now in that period where people are coming to us and saying, "we want to do this," and it's like, whoa, we never thought of that. The user community is inventing new uses as we go, which is kind of fun.
H
This is Peng — can I make one point?
A
Certainly.
H
I'm still closely connected with our data centers, so we've been actually looking at how users can use our data if they need to — especially meteorological data, climate data. If they find the data are hard to get or hard to use, we'd like to hear about it. So along those lines, I was just wondering whether you are aware of situations like that and would like to contact me; I will be more than happy to forward the request to the data center management.
H
No — I work at the NOAA data center. I do some research on data stewardship and FAIR data, things like that, but currently the data center management actually are looking for feedback in terms of, you know, if and how users — especially with COVID-19 — can use our data if they need to: whether they have difficulty finding it, or don't understand how to use it or how to integrate it into their systems.
A
Hearing none, I'll do a quick introduction. Rob Quick is the Associate Director of the Cyberinfrastructure Integration Research Center at Indiana University. He can talk a little bit more about who he is and what he does, and he is going to talk to us about persistent identifiers. You've seen the abstract, so I'm going to just turn it over to Rob and let him do a better job of introducing himself.
E
Very good — and you can hear me okay? Yes? Okay, very good. So thank you for the introduction. I will talk a little bit more about my position and what I'm doing, but I wanted to start with this short story and picture — and this is a real picture of me and my daughter. As I was preparing slides for this presentation yesterday, she asked me what I was doing.
E
But really, I think this is the reality for a lot of people now: they're working from home and have various different interruptions. So please do excuse me if you hear some laughter in the background or such, because I am working from home, as many of us are, and have interruptions occasionally. If there's an interruption, I'll just say "science" and stop and go play. Ooh — but how do I forward this next slide?
E
There we go. I'm going to talk a lot about persistent identifiers. One form of persistent identifier that everybody's familiar with at this point is the DOI. "PID" is just kind of a general description, a DOI being one form of PID with its own specific standards and metadata behind it. The PIDs that I use are going to be very generalized — just persistent identifiers.
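The DOI-as-one-kind-of-PID relationship can be seen directly: every DOI is a handle, and the public handle proxy exposes a documented JSON REST endpoint that resolves a handle to its typed values. A minimal sketch — the `hdl.handle.net` API endpoint is real; the sample record below just mirrors the documented response shape for illustration:

```python
import json
import urllib.request

HANDLE_API = "https://hdl.handle.net/api/handles/"

def pid_url(handle: str) -> str:
    """Build the handle proxy's JSON REST URL for a handle or DOI."""
    return HANDLE_API + handle

def extract_values(record: dict) -> dict:
    """Flatten a handle record's typed values into {type: data} pairs."""
    return {v["type"]: v["data"]["value"] for v in record.get("values", [])}

def resolve_pid(handle: str) -> dict:
    """Resolve a handle/DOI over the network (requires connectivity)."""
    with urllib.request.urlopen(pid_url(handle)) as resp:
        return extract_values(json.load(resp))

# Offline illustration using the documented response shape:
sample = {"responseCode": 1, "handle": "10.1000/1",
          "values": [{"index": 1, "type": "URL",
                      "data": {"format": "string",
                               "value": "http://www.doi.org/"}}]}
print(extract_values(sample))  # {'URL': 'http://www.doi.org/'}
```

The same call works for any handle-based PID, which is exactly the "generalized PID" point: the resolution machinery does not care whether the identifier is a DOI or some other handle.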
E
So when I first started with RDA — it was back, oh, it was in Amsterdam, and I think it was P4, the fourth plenary; it may have been a bit earlier — somebody said in one of the presentations, "Data, data everywhere, nor any drop to drink," and everybody recognizes that from Samuel Coleridge's Rime of the Ancient Mariner, with the insertion of data in place of water. Now, the question we've been asking ourselves in the RAPID project — RAPID again being Robust Persistent Identification of Data — is not only —
E
So let me say first of all that there are only 15 slides here, and it looks like about half that many names contributed to this presentation, so you can see what a collaboration this is. They all have contributed both intellectually and with actual slides, so they deserve as much of the credit for this work as I do myself. I am at Indiana University; I am the Associate Director of the Cyberinfrastructure Integration Research Center.
E
This was previously known as the Science Gateways Research Center; we've really changed our view to look at all of cyberinfrastructure, and at integrating the cyberinfrastructure that's available to researchers into a usable format — something people have heard many times now — so that scientists can do their research and not worry about the technology, not become IT experts along the way. I'm also the principal investigator for the NSF project called Enhanced Robust Persistent Identification of Data.
E
— universal interoperability, I mean. I'm also with XSEDE; I run the Extended Collaborative Support Services science gateways portion, so I have a connection to the cyberinfrastructure there. On the RDA side, I'm part of the Technical Advisory Board, along with Christine, and I am the co-chair of the RDA Data Fabric Interest Group, which is really where a lot of these ideas were fleshed out. In fact, the Data Fabric Interest Group had all these puzzle pieces, and we said, well —
E
Can we make this a real fabric that is then useful for the community? So I'm going to start really big here and make a suggestion: that there are three main eras of IT, and that we are in the middle of the second era and moving towards the third. That first era was basically from the invention of computing and transistors to about 1995, and in this era there were really many computers and many data sets.
E
Occasionally, a single computer was connected to a single data set, usually via a mounted drive, but for the most part all computers operated heterogeneously, and all data sets the same way. Of course, with 1995 and the proliferation of the internet, we went into this new era where there was now a single computer and many data sets — the "single computer" being that all computers could talk to each other —
E
— with a single communication protocol. And you may recall Sun had a marketing slogan: "The Network is the Computer." So in this era, from 1995 until sometime, hopefully, in the near future, datasets are still heterogeneous, but really there's a homogeneous computing structure. Now you can probably guess, from era one and era two, what the third era may be — and again, 2025 is just a projection; I think it may be —
E
There will be schemes for it sooner; whether they'll be widely accepted — that might come after. It is one single computer and one single data set, and by that I mean there will be interoperability of all heterogeneous data, meaning that you can interact with all heterogeneous data the same way. This is actually from a small white paper by George Strawn. For those of you who don't know George, make it a point to meet him: he's with the National Academy of Sciences and just a wonderful person to talk to.
E
He was part of NSFNET and really the formation of that initial networking technology that came to be the internet over time, and I hope that everybody here gets a chance to meet George. We have kind of a motivation for making all that data into a single heterogeneous data set, and that motivation is really the Internet of Things and machine learning. The Internet of Things has made sensors inexpensive, so you now have all this data; machine learning requires — will require — a better data infrastructure.
E
— if we're really going to see how far we can go, where we can push these new machine learning techniques. And even when data remains in silos, global FAIR data infrastructures could automate that data-wrangling step, which, as anyone who has spent time working through the pre-processing leading up to an analysis knows, is the majority of a data scientist's time. Whether this 80% figure is realistic or not, I'm not sure.
E
But that's the estimate in some areas. And we really see this emergence of open science — an emergence that at least some advocates are saying will have the same impact as, and will rival, the original scientific revolution. Making science open and available to all can really have a massive impact, as everyone starts using the data available and, in fact, the resources and instruments available.
E
Also, it's easy to see that if we have those three eras, and the data infrastructure coalesces as the internet and the web did, that data infrastructure will be revolutionary. Interestingly enough, the two capitalized words here, Internet and Web, coalesced around two specific things, and those were protocols: the internet around TCP/IP, which basically meant all devices could talk to all other devices without having any specialized software — all you have to understand is TCP/IP — and the web around HTTP —
E
— the protocol that allowed every web browser to understand bits in a certain sequence and to make them into something readable by a human. So data sharing and interoperability has kind of a long history, and this history — and why we're moving towards FAIR and open data — combines several things. One is the technology advances: we've gone from thousands of transistors on a chip in the 1970s to now a billion transistors on a chip, and the networking technology — fiber-optic and laser communications — has gone from —
E
— well, I wrote down megabits as the first figure, but I remember having a 300-baud modem, which means hundreds of bits per second, to now experimental petabit-per-second networks. Disk prices have dropped tremendously: it was half a million dollars for a gigabyte of data in 1981, and now it's about three cents per gigabyte, as you can get a four-terabyte drive for about $100. These great performance increases have enabled data-intensive science, and things like machine learning and its complex algorithms are now realistic where they weren't before.
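The storage-price figures quoted here imply a drop of more than seven orders of magnitude; a quick arithmetic check of the numbers as stated in the talk:

```python
# Storage cost per gigabyte, using the figures quoted in the talk
price_1981 = 500_000.00   # ~half a million dollars per GB in 1981
price_now = 0.03          # ~three cents per GB today

drop_factor = price_1981 / price_now
print(f"price per GB fell by a factor of about {drop_factor:,.0f}")
# roughly 17 million-fold

# Sanity check against the $100 four-terabyte drive figure
per_gb = 100 / 4000       # 4 TB is ~4000 GB
print(f"$100 / 4 TB = ${per_gb:.3f} per GB")  # $0.025, close to three cents
```

The two figures in the talk are consistent with each other: $100 for 4 TB works out to 2.5 cents per gigabyte.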
E
At the same time, society and government have been moving forward. In 2011, a US federal interagency committee said, basically, that we can now store more data than we can effectively process. In 2013, the US presidential science adviser signed an executive order requiring that all federally funded research be openly available — leading, again, to open science. Then, in January 2014, there was a workshop at Leiden University.
E
That was led by Professor Barend Mons, who was actually part of the VODAN meeting earlier this week, and the result of that meeting was really the definition of FAIR data. I think everybody in this room probably understands what FAIR data is, but just to rehash — and I actually have some details on the next slide — that's Findable, Accessible, Interoperable, and Reusable. And then this one is a little less well known: there was a National Academy of Sciences paper last year —
E
— it was in 2018, so I guess two years ago now — that included a recommendation that all research products be made available according to FAIR principles. So we're really pushing along the lines of this open science, with these FAIR principles as a way to allow open science to happen. And how do you reach interoperability?
E
Interoperability has for a long time been a tool used by computer scientists to create new levels of abstraction. You have high-level languages and interpreters that solve the interoperability problem for heterogeneous computers; you have the internet, which solves the interoperability problem for heterogeneous networks. And the question is: can this digital object architecture, which is at the base of these persistent identifier schemes —
E
— can that solve the interoperability problem for heterogeneous data? And just to reiterate the FAIR principles: what I've done on this slide — in fact, I did this with Luiz Bonino from Leiden — is separate out what's in green in the FAIR principles; these are the FAIR principles word by word. What's in green is what can be accomplished purely at a technical level, and then, in black —
E
— what needs the community to provide some solutions, though the technology can still aid in those things. You'll see things here like the first one: "metadata is assigned a globally unique identifier." We can do that with technology; that technology has existed for a long time. The persistence part, though, is not a technical solution: the persistence part requires a community to say that they will house these registries of persistent identifiers for long periods of time.
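The green/black split can be made concrete. "Globally unique" is the technically solved half — for example, a random UUID needs no central coordination to avoid collisions — while "persistent" is the community half: nothing technical keeps the identifier-to-object mapping alive over decades. A sketch (the registry entry and URL below are hypothetical, purely for illustration):

```python
import uuid

# "Globally unique" is solvable with technology alone: a version-4 UUID's
# collision probability is negligible, so no central authority is needed.
metadata_id = uuid.uuid4()
print(f"assigned identifier: {metadata_id}")

# "Persistent" is not a technical property. This mapping only survives if
# some community commits to hosting the registry long-term; the entry
# below is a hypothetical example, not a real record.
registry = {str(metadata_id): "https://example.org/dataset/42"}
print(registry[str(metadata_id)])
```

The same division runs through the rest of the FAIR principles: identifiers, protocols, and metadata schemas are technical; longevity, governance, and vocabulary agreement are social.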
E
I'm aware of Named Data Networking, and I did a compare-and-contrast for the ERPID proposal without digging into it too far. The one thing that is the major difference is that Named Data Networking actually changes the base-level networking protocols, right? So you need a change in the networking. What the digital object architecture does is operate within the —
E
— within the existing networking framework. Now, I would say that the goal is the same for many projects — that all water be drinkable — and it's more likely going to be a combination of good ideas from many of them. I can't say enough — I don't know enough about Named Data Networking to know what part it plays, or whether it solves some of these issues.
E
However, I think that some convergence of several of these technologies is going to be the right answer, and we know what the solution is: the solution is that all water is drinkable. How we get there — I think the devil is in the details, right? Thank you. What I know of Named Data Networking is that it's very powerful; the downside is that it takes changes at the networking layer, and whether that is palatable to the community remains to be seen.
D
So, just a little more I want to share on that. I used to work with Christos Papadopoulos on this a little bit when I was at Internet2 — and now he's at DHS, I think; he was in Colorado. The interesting thing, when we did a meeting at NIST about it, was that they had some posters with tanks on them, and of course, when you ask what the use case is, they say, oh, you know, tactical.
D
You know, they can't tell you anything, but I think what they're looking to do is see if it actually provides a more secure, more valuable data networking and finding opportunity in an environment that they can manage — because, you know, it's the military. So there may be some interesting things coming out of that.
D
So I just wanted to share that; we'll see what comes out in public, but it might provide some interesting new opportunities. For those who don't know NDN, I think the idea is that you put the identifiers on the front of the data packet, so you look for the data packet rather than, you know, the IP addresses, and, as it's being shared, you have to change how you're doing networking. But it's interesting, so it may be a future innovation that could be valuable, yeah.
E
— the network is going to need to be involved; it's going to need to be aware of what's happening with the data, if that change takes. So again, I won't go too in depth, because what I know of Named Data Networking is not enough to say too much, but I do think that the network is going to have to be aware of what's happening with data and be part of the solution.
H
I have a comment. So you mentioned a number of times that all waters are drinkable. I'm wondering — because there's always a balance between resources and results — whether we should make all water available, but only some water drinkable: some water maybe we don't need to drink, we just need it to flush the toilet. Do you think of it like that?
E
Sure. In making all water drinkable — and another important thing here — it doesn't mean that anyone can drink any water, right? I think it's here, in point A1.2: there needs to be authentication and authorization also, meaning, basically, that you still have to live behind that authentication and authorization scheme, because there's a difference between open data and data that anyone can use at any time without some authorization and authentication. Did that answer your question?
H
In a way. I guess I'm wondering whether, philosophically, it's too much of a goal to require all data to be interoperable at the same level. Does that make sense to you?
E
Oh yes, but I look at this with, say, my idea of what happened with the internet, right? The power of the Internet and its networking protocol, TCP/IP, is that all devices connect the same way. The network doesn't need to know whether you're on a cell phone or a laptop or an Internet-enabled refrigerator; the network is the same.
E
Yes, so there's a lot to be done, both on the technical side and the community side, to get to that stage, and some groups will come along faster than others. You can see, again, when we went to that one-computer model: as the internet developed, it became too expensive not to be a part of it, right?
E
Okay, so I've gone on here already 25 minutes, I see, and I'm only about halfway through my slides, so I'll try to go through this quickly. The good thing is that the people who are really thinking about this — the brains behind this — are the people who were involved with the internet. I mentioned George Strawn, but also Robert Kahn, who is the executive director of CNRI, which created the Handle System. The Handle System is what DOIs use as their technical underpinning.
E
PIDs — persistent identifiers — point to digital objects and provide a kernel of metadata: some state data about the object. This can be resolved very quickly: you get information, then the kernel of metadata, and that kernel metadata is just enough metadata to do some basic operations. Much like HTTP enables only a few very basic operations — things like GET, HEAD, POST, and DELETE — there's a kernel of metadata that allows you to interact with, and learn more about, each of the digital objects.
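The "kernel of metadata" idea can be sketched as a small typed record that a client inspects before deciding whether to fetch the object at all. The field names below are illustrative — loosely in the spirit of the RDA PID Kernel Information work mentioned later, not its exact schema — and the handle and URL are hypothetical:

```python
# A hypothetical kernel-of-metadata record for one digital object: just
# enough typed state data for a machine to decide what to do next.
kernel = {
    "pid": "21.T11998/0000-001A-3905-F",           # illustrative handle
    "digitalObjectType": "text/csv",
    "dateCreated": "2020-04-03T17:00:00",          # typed, parseable date
    "location": "https://repo.example.org/obj/42", # hypothetical endpoint
}

def is_actionable(record: dict) -> bool:
    """A client can act on the object only if the kernel fields it
    relies on are all present in the record."""
    required = {"pid", "digitalObjectType", "dateCreated", "location"}
    return required.issubset(record)

print(is_actionable(kernel))  # True
print(is_actionable({"pid": "21.T11998/xyz"}))  # False: kernel too thin
```

The point mirrors the HTTP analogy in the talk: a tiny fixed vocabulary of state data is what makes generic, object-agnostic operations possible.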
E
A digital object can be anything. We have the Digital Object Interface Protocol, which plays the same role as HTTP does for webpages: it facilitates those operations. And DOs can be anything that you represent digitally: as people know, there are PIDs for publications; DataCite does data now; ORCID puts PIDs on people, basically. So anything that has a digital representation can have a PID. The PIDs involved in our testbed —
E
— don't have to come from the testbed itself; they can be DOIs, or PIDs issued by other agencies, as long as they're persistent, globally unique, and resolvable. So I think that covers that. One thing to say about the digital object architecture here: it's a services-based infrastructure only — basically the technical components. It doesn't say anything about the modeling of the data objects themselves; for that we look to things like the FAIR principles and the RDA working group on PID Kernel Information.
E
These relationships are defined outside of the technical realm. You can say that "location" represents the location of the digital object, for instance, as part of the PID, but to say what other objects or what other metadata is available — that's left up to things like the FAIR principles and the PID Kernel Information.
E
Now, the ERPID testbed, which is entering its second year of production, has the basic components, and really what we did was take puzzle pieces from various different organizations and stitch them together. It works with PIDs — a handle service. A handle service, as I said earlier, is the same handle-based software that issues DOIs; the DOI is the big one, but there are a few others that are used outside of academia. It's the same software; the handle server is exactly the same.
E
We had a data type registry. Data typing is important because, if you're having this machine-to-machine communication, you have to know the form of the data if you're going to act on it. So, for example, if you have a "created on" date in the state data of the PID, you need to know that it's coming in some standard form that the machine can then parse and act on. So the data type registry is where you register the types of the state data that you're going to get from resolving the PID.
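The "created on" example can be made concrete with a toy registry. In a real data type registry the types themselves get PIDs; here the type identifiers and entries are purely illustrative, just to show why registered types make kernel values machine-actionable:

```python
from datetime import datetime

# A toy data type registry: maps a type identifier to a parser.
# Real registries assign PIDs to the types themselves; these two
# entries are illustrative only.
TYPE_REGISTRY = {
    "date-iso8601": datetime.fromisoformat,
    "integer": int,
}

def parse_typed(value: str, type_id: str):
    """Parse a state-data value according to its registered type, so a
    machine can act on it instead of guessing its form."""
    try:
        parser = TYPE_REGISTRY[type_id]
    except KeyError:
        raise ValueError(f"unregistered data type: {type_id}")
    return parser(value)

created = parse_typed("2020-04-03T17:00:00", "date-iso8601")
print(created.year)  # 2020 -- now comparable, sortable, filterable
```

Without the registered type, "2020-04-03T17:00:00" is just a string; with it, any client that can resolve the type knows how to act on the value.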
E
This helps you make decisions, then, as to whether that is useful. So the components of RAPID are really: a handle service for issuing and resolving persistent identifiers; a data type registry for recording what those data types are, for machine actionability; and a mapping service we're working on that prevents the refactoring of repositories. We understand that we're dead in the water if we go in and say, okay, you've got to change your repository schema to do this, that, and the other thing before it can be part of this.
E
Before it can be part of this structure, we really have to have something that maps an existing repository to the digital object architecture, and we're working on that with DSpace — that's the one we're working on right now. And then there's the operations protocol, which is really the hinge of it all — what I call the Digital Object Interface Protocol — and this allows both basic and extended operations.
E
The
basic
operations
are
much
like
HTTP
they're,
a
in
probably
the
one
that's
most
used
will
be
get
because
you
you
find
out
some
metadata
about
an
object.
You
want
to
bring
that
object
to
your
computing
source
or
you
know
your
screen,
so
you
can
review
it
whatever
it
may
be,
but
there
are
the
basic
crud
operations
which
we've
enabled
in
a
rapid
Craig
to
the
create,
create,
update,
delete
and
I.
Always
a
read
I
always
forget
are
so
yeah,
so
so,
basically,
this
are
two
services
that
would
be
globally
network.
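The basic operations just described can be sketched as a toy in-memory stand-in for a DOIP-style service offering create, read, update, and delete on digital objects. The prefix is hypothetical; a real service would mint handles and speak the digital object interface protocol over the network.

```python
import itertools

class DigitalObjectService:
    """Toy in-memory sketch of a service exposing the basic CRUD
    operations on digital objects identified by PIDs."""

    def __init__(self, prefix: str):
        self.prefix = prefix
        self._counter = itertools.count(1)
        self._objects = {}

    def create(self, metadata: dict) -> str:
        # Mint a new PID under our prefix and store the object record.
        pid = "{}/{}".format(self.prefix, next(self._counter))
        self._objects[pid] = dict(metadata)
        return pid

    def read(self, pid: str) -> dict:
        return dict(self._objects[pid])

    def update(self, pid: str, metadata: dict) -> None:
        self._objects[pid].update(metadata)

    def delete(self, pid: str) -> None:
        del self._objects[pid]

service = DigitalObjectService("20.500.12345")  # hypothetical prefix
pid = service.create({"name": "raw-data.csv"})
service.update(pid, {"checksum": "abc123"})
```

The point of keeping the operation set this small is exactly the HTTP analogy: many independent services can implement the same few verbs, and any client can talk to all of them.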
E
There would be many of these, not just the E-RPID service, which is a testbed running on two servers sitting in Indiana. It would be a networked, or federated, series of these handle and data type services, along with a protocol that functions with them. And let me talk about our use cases: we have a wide variety of them. They include some weather modeling, some climate surveys in the Taiwan area, and some genomics projects; we actually had Jetstream as part of it, with its virtual machine images, and we had Galaxy as part of the project, where we were looking at their workflows. But the one I want to center on and just mention here is the science and engineering applications gateway, or SEAGrid. If you go to erpid.seagrid.org you'll find our testbed enabled in a science gateway.
E
What it does, basically, is assign PIDs for every aspect of the workflow. So each input, the raw data, is assigned a PID. At this point it's assigned when the workflow has started; however, you could imagine a PID being assigned to data as it comes out of the instrumentation. There's some data preparation software, or some pre-processing software, that is also assigned a PID; the intermediate data products, again, get PIDs; and on down the line, until you have your output and visualization, which are also assigned PIDs. This, in the end, could all be wrapped in a DOI, which could then be published externally. You wouldn't want to use DOIs at every step of the way, because there are some requirements there that get pretty heavy. And you can imagine following a series of PIDs, and that allowing you to totally reproduce an experiment; in fact, even the resource that it runs on can be given a PID.
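The workflow chain just described can be sketched as a small provenance graph: every stage gets its own PID, linked to the stages it was derived from. All PID values here are hypothetical.

```python
# Each stage of the workflow (raw data, pre-processing software,
# intermediate products, outputs) gets its own PID and records which
# PIDs it was derived from.
workflow = {}

def add_stage(pid, role, derived_from=()):
    workflow[pid] = {"role": role, "derivedFrom": list(derived_from)}

add_stage("hdl:20.500.12345/raw-1", "raw data")
add_stage("hdl:20.500.12345/prep-1", "pre-processing software")
add_stage("hdl:20.500.12345/inter-1", "intermediate data",
          ["hdl:20.500.12345/raw-1", "hdl:20.500.12345/prep-1"])
add_stage("hdl:20.500.12345/out-1", "output and visualization",
          ["hdl:20.500.12345/inter-1"])

def lineage(pid):
    # Walk the derivedFrom links back to the raw inputs: this chain of
    # PIDs is the path to reproducing the computation.
    found = []
    for parent in workflow[pid]["derivedFrom"]:
        found.extend(lineage(parent))
        found.append(parent)
    return found
```

Following the lineage of the final output recovers every input, tool, and intermediate product, which is what makes full reproduction of the run possible.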
E
This is all behind the scenes for the first run, so the researcher coming in for the first time doesn't even realize this is happening. Where this becomes important is when they publish: they have a path to reproducibility, to show basically the entire computational workflow and how they got the results. And I know I'm running long on time, but I did want to also talk about the FAIR digital object framework. This is what several people in the US and Europe are working towards.
E
There's a FAIR digital object working group; I have the link on my final slide here. But this combines the digital object architecture, which is over here on the left, even closer with FAIR interoperability. It pulls in some linked data concepts for semantic relationships between the metadata, and then you have a digital object that is FAIR. I would suggest looking back over Barend's presentation for more on FAIR digital objects, but the idea is basically that everything I just talked about is given
E
some semantic relationship, pulling it from the digital object architecture and combining it with FAIR. It also has a specific format, an RDF format, that again is relatively available and usable. The general idea here is that you have a FAIR digital object, wrapped in a PID, that an agent can act upon once they decide that that digital object is fit for the purpose they're looking for, and they can get that from the metadata.
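As a rough sketch of the shape such a record might take, here is a minimal JSON-LD-flavoured FAIR digital object: the PID, some typed state data, and linked-data-style relationships pointing at richer metadata. Every identifier and property name here is hypothetical, chosen only to illustrate the structure, not taken from the FDO specification.

```python
import json

# Hypothetical FAIR digital object record: identifier, typed state
# data, and semantic links an agent can follow before deciding the
# object is fit for purpose.
fdo = {
    "@id": "hdl:20.500.12345/object-7",
    "@type": "FairDigitalObject",
    "createdOn": "2020-04-03T12:00:00Z",
    "checksum": "9e107d9d372bb6826bd81d3542a419d6",
    "hasMetadata": {"@id": "https://example.org/metadata/object-7"},
    "isPartOf": {"@id": "hdl:20.500.12345/collection-1"},
}

# Serialise to the wire format an agent would fetch and act on.
record = json.dumps(fdo, indent=2, sort_keys=True)
```

An agent resolving the PID gets this record, checks the typed fields against the data type registry, and follows `hasMetadata` only when it needs the full description.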
E
That metadata is part of the state data that you can get from resolving the persistent identifier. So, going back, and realizing that I'm going 20 minutes long here: on the next era of IT that we talked about, I will just say that I think this final era is coming, whether through the technology we're working on, or a hybrid technology, as I mentioned, that includes networking, or some commercial group is going to develop something before we get this out the door.
E
It remains to be seen, but I think we are moving towards that third era, where we have basically a single data set: the technologist who is writing the client can interact with all data through one method, or with one protocol, basically. So, realizing that I hurried through the last half of my presentation, I will, if I have time, take questions, I think.
A
Just being respectful of people's time, what I'm going to do is give a quick note on upcoming presentations, and then people can hang around the virtual podium afterwards but are also free to leave. So I'll just make a note: we'll be meeting again on the 1st of May. Stanton Martin is going to be talking about the National Microbiome Data Collaborative; Stan's at Oak Ridge. And then you heard earlier from Christine Kirkpatrick, who's...
A
E
F
Just wanted to say, Rob, I really appreciated your presentation. I especially appreciated the last couple of slides, which really wrap up some not only complicated concepts; to be experiencing this as it develops in real time, there's so much to sift through about, you know, what is worth following and what's going to emerge as the way to do it. It's really nice to have a recent retrospective of what has emerged as the things that we should be building around, so I appreciated that a lot.
F
Also, I really loved the drawing at the beginning, and I think that one was just a timely way to frame the discussion. But I also think the more that we can support each other in our varying working styles, and acknowledge that it's a different time, and that, yes, your kids might be in the room, or you might be sharing a home office with a spouse, and things are different; I think that's just really good to interject and be real about with each other.
E
A
We'll get more virtual, right? Yeah. Here's a thing that wove in and out of your talk, and I also loved the diagrams, especially in the last couple; a nice way to present it conceptually. But, you know, kind of forever there's been a worldview that says we've got to organize the data better and better, and then there's a different worldview, especially with the rise of machine learning, that says, yeah, you know, just put it in a pile and we'll figure it out.
E
So I would point you towards more of the details of the PID kernel information. In my opinion, you want the minimum set, available for all data, that allows you to do those base operations, and the base operations are your first-year computer science ones: you want to be able to retrieve an object, and you want to be able to put some basic trust in it and determine if it's the object you're looking for. So you need something like an MD5 checksum.
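A minimal sketch of such a deliberately small kernel record: just enough to retrieve the object, check that it is what you asked for, and find the heavier metadata elsewhere. The field names and URLs are hypothetical.

```python
import hashlib
from datetime import datetime, timezone

def make_kernel(location: str, payload: bytes, metadata_url: str) -> dict:
    # The kernel stays tiny so billions of records can be resolved fast.
    return {
        "digitalObjectLocation": location,
        "checksum": hashlib.md5(payload).hexdigest(),
        "createdOn": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        # Pointer to the full metadata, which lives on its own server.
        "metadataPointer": metadata_url,
    }

def verify(kernel: dict, payload: bytes) -> bool:
    # Basic trust: does the object we fetched match the kernel record?
    return hashlib.md5(payload).hexdigest() == kernel["checksum"]

kernel = make_kernel("https://example.org/data/1", b"example bytes",
                     "https://example.org/metadata/1")
```

The checksum gives the basic trust check, and the pointer keeps the heavyweight metadata out of the record that has to operate at internet scale.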
E
You need a created-on date. And it has to be very small; it has to be very small for a very technical reason, which is that this has to operate at internet scales, right? So you're looking at billions of these objects. If you try to put all of your metadata into what's associated with a PID, then all your time is going to be spent churning, trying to get through tons of metadata. So there are really two sets of operations.
E
Then, for all of these similar data objects, you're going to have to get some more in-depth profile, and that metadata still probably lives on a metadata server, not within the greater environment that allows you to quickly operate on billions of objects. But you need a pointer in that PID kernel as to where you can find more, right? So when you're searching for an object, you want to know if that object is suitable, or fit for use, yeah.
E
So it comes down to what is searchable, what is reasonably searchable, and then what it takes to build the necessary level of trust before you're going to do some extended operation or some analysis with an object. So from a purely technical point, you want to keep that kernel as small as possible, simply as a performance issue; but you also have to have that greater detail, and that greater detail is going to tell you,
E
you know, the units on the number you just picked up, so you know what you're actually calculating. All that needs to be recorded too, but you don't necessarily need it for every object. What you need for every object is where it's located, where you can get more data about it, and some very base metadata that allows you to establish trust that the object you're getting is the object you asked for.
A
E
The internet, I will say, absolutely did exist before that, though the genius of it really became widely used at that point, right? Which is, as I said: the network doesn't care if you're using your cell phone, your laptop, your server in your data center, or your kids' toys that have an IP address. It talks to them all the same way; the network itself treats them the same way. And that's what we really need: clients to be able to treat data all the same way.
E
A
E
E
I'd say, anyway: you've had protocols that allowed communication between like devices for as long as you've had computing; well, not quite as long, but nearly as long as you've had computing. The genius is that the network doesn't care; it can talk to you no matter what operating system or what hardware you're running, right, yeah.
G
Can I ask a question about why? What's your justification for really wanting resolvable identifiers, as opposed to just global identifiers, for such fine-grained objects? You've got data sets that have a hundred thousand files, and now you've got a resolvable ID for some intermediate data product that you'd really be hard-pressed to understand without understanding the whole data set. So why do we want to have a resolver service to be able to get to every single one of those individual things, without having to find the main data set?
E
Let me say that you can do that. That doesn't necessarily mean that it's useful to do that, or that you have to do that, right? At this point we're still showing that you can: PIDs themselves are very inexpensive, and the resolution is very quick, as long as the kernel metadata doesn't get big. It's going to be useful in some cases and not useful in others. What we found in the SEAGrid project is that you may want to start at the intermediate product, all right?
E
So you may want to start at the pre-processed data, you may want to start from the raw data, or you may want to actually change just a component of the input file. So you don't want to change the molecular model, for instance, that you're submitting, but you do want to change the parameters that are given to the application to run it. So you divide them in two: this is the instructions, or the analysis, and this is the actual data.
E
Then you can change one without changing the other.
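The split just described can be sketched as two runs that share a model PID but swap the parameters PID, so only one component is re-identified. All PIDs here are hypothetical.

```python
# First run: the analysis instructions and the input data each carry
# their own PID, so they can change independently.
run_a = {
    "model": "hdl:20.500.12345/model-1",        # the molecular model
    "parameters": "hdl:20.500.12345/params-1",  # the run parameters
}

# Second run: new parameters, same model.
run_b = dict(run_a, parameters="hdl:20.500.12345/params-2")

# Only the swapped component differs between the two runs.
changed = sorted(key for key in run_a if run_a[key] != run_b[key])
```

Comparing the two PID sets immediately shows what changed between runs, which is exactly the provenance question a reviewer would ask.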
So, again, this is useful in some cases and not useful in others, and the granularity is really going to be up to the researcher, or the community, in terms of what is useful. In fact, one of the projects that we looked at early on was called the Perseus Digital Library, and the question they had, with text analysis, is: do you want to put a PID on a chapter? Do you want to put it on a page?
E
Do you want to put it on a letter, or a word? What is the right granularity? And I think, and this is not what they came to, but I think the applications are going to define what granularity is useful, right? So the applications that will use this input data either exist or they will exist, and you're going to solve the granularity problem based on the workflow and the application that you have running.
E
So if you're looking at billions of PIDs for $50, it's inexpensive to put PIDs on there. So the usefulness isn't going to be defined by a technologist; the usefulness is going to be defined by the research community and, in fact, I think in the end by the applications developers, because you're going to want to feed the application what the application needs.
G
I'm really questioning just the resolvable part. Putting global identifiers on really small things so that you can reference them, you know, I think is something we can do, right? You don't want to do that with a DOI, but things that don't need a resolver service, just having those PIDs that can be connected to a DOI level or to a higher level, I think has a place in this game too. So I'm curious about that; you know, this is sort of... there are two sides.
G
E
These run on a very minimal schema and, yes, I think there are still a lot of open questions as to where the usefulness, or what level of usefulness, this will be. Will it be that in the end we decide it's really only useful to put PIDs on data when it's published and put somewhere like DataCite, or is it useful to use PIDs at this workflow level, which is what we're doing?
E
I do encourage anybody who has more in-depth questions to get hold of me. I've put, and I think this is still up, I think I'm still sharing, links to the E-RPID project itself, the testbed, and the FAIR digital object group; and then, if you're really interested and want to listen to me talk about this for an hour, with a little bit more of the technical detail instead of this shortened, high-level overview, I've put in a videotaped presentation there.