From YouTube: CI WG demo: Open Science Framework
Description
Date: 02/17/17
Presenter: Matt Spitzer
Institution: Center for Open Science
South Big Data Hub
Matt Spitzer: Because of our involvement in the Reproducibility Project: Psychology, and most recently the Reproducibility Project: Cancer Biology, whose first findings were published a few weeks ago, we are managing those projects as part of our research on research practices. But the vast majority of our resources go toward building a technology ecosystem that we use and promote. That primary system is the Open Science Framework, and I want to give you a little context on what that is.
We build components and parts of the tool that directly address all of these needs, but we don't build them in the sense that you have to come use our tool and nothing else. We want to meet researchers where they are: everyone who has a workflow has preferences for how they do things, how they manage their data, and how they store their data. So the way we do that is to connect to other tools. We connect to citation tools, study design tools, data storage tools, and repository tools.
A lot of these connections are in place already, and a lot of them are being built today, by us and by the community of open-source developers. So if you'd like to use Dropbox, and you'd like to use Dataverse or figshare or Dryad, you can use those tools, but use them in a way that's connected, so that you eliminate the data silos.
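To make "connected, not siloed" concrete: a public project's connected storage providers are visible through the OSF's public API v2. Below is a minimal Python sketch; the endpoint shape follows the public API, but the project GUID is a placeholder and the live request is optional.

```python
# Sketch: listing the storage add-ons connected to an OSF project
# via the public API v2. "abc12" is a placeholder GUID.
OSF_API = "https://api.osf.io/v2"

def providers_url(node_guid: str) -> str:
    """Build the API v2 URL that lists a node's connected storage providers."""
    return f"{OSF_API}/nodes/{node_guid}/files/"

def list_providers(node_guid: str):
    """Fetch provider names for a public project (requires network access)."""
    import json, urllib.request
    with urllib.request.urlopen(providers_url(node_guid)) as resp:
        payload = json.load(resp)
    return [item["attributes"]["provider"] for item in payload["data"]]

print(providers_url("abc12"))  # https://api.osf.io/v2/nodes/abc12/files/
```

Each entry returned by that endpoint is one connected provider (osfstorage, dropbox, and so on), which is the "connected workflow" idea in API form.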
That's a lot of what we work on: helping researchers create a more fluid way to manage their data, so that at the end of the day, when their funder or their publisher or their institution says, now we want you to curate that data, preserve it, archive it, and share it, it's not a burden to the researcher. It's essentially a button press, because the data was already being shared, for example into an institutional repository.
A pilot project we're working on does exactly that. Again, I'm really just going to talk about the institutional approach, but to set that in a little bit of a bigger picture of what we're doing: the campus tool is called the Open Science Framework, which is comprised of the common services that we all need to operate on the internet, things like authentication, file upload, and file rendering. We built it in a modular way so that we can actually build lots of different systems on top of it.
For instance, we have our own OSF Preprints, which we've now released, and from that we're building multiple versions of preprint services for different communities. We've put up the infrastructure for SocArXiv, engrXiv, PsyArXiv, and AgriXiv, which just launched earlier this week, using that same toolkit in a modular way. It allows us to run a preprint service, and it allows us to stand up institutional tools for schools like NYU and Notre Dame.
It also allows us to build registries and other repositories of data for a lot of different groups that don't have the funds to go out and build original infrastructure to solve these needs. They want to use a tool that everyone else is using, and we can provide some really nice synergy across the top of them with integrated search, and similarly for institutions. This came out of some recent work with Notre Dame especially; last year we began highlighting this additional layer on the OSF to really solve these challenges.
That gap between which research is funded and what gets published is often a black hole at a lot of institutions. Where is the data going? Is it never going to be archived? Could it ever be published? So we're providing visibility there, and insight into how collaboration is happening. Interdisciplinary research is a really growing trend, and institutions are sometimes struggling to provide tools and frameworks for a biologist and a computational scientist to work collaboratively; we think we can help with that.
And then one of the most fascinating use cases I've come across is actually providing access to the workflows themselves. What I mean by that is, on the research data services side of your libraries, there are professionals who can help you curate and archive data into good repositories. But most often they're asked to work with researchers when they're handed an email with a file attached to it, which is completely out of context, and it may not even be the file they should be helping you with.
So with an open workflow tool like the OSF, you can actually invite that curation expert into your project space temporarily, over a read-only link, which allows them to interact with the research right where it's happening, where the workflow tools are all connected. All the information in this slide deck, by the way, is on a site whose link I'll share at the end.
The link is on the first slide as well. We're working with a lot of institutions to do this. Here is our current partner base using the OSF, plus a few that will be added soon. A number of them are here in the Southeast, UVA among them, and we've had conversations with lots and lots of others, including folks down at Georgia State, North Carolina State, and others who are piloting and testing this in various ways.
They're able to use our public infrastructure to help promote the sharing of the research and data they have, so that others can find it, making it more discoverable. And it starts with using the OSF. I'm not going to cover in depth what the OSF is as a standalone tool, since hopefully some of you know about it, but it is free, so anyone can go create an account today at osf.io.
That would take you to an institution's page, and if you entered your credentials, you could log in directly with your institutional ID. We also have an ORCID integration for login as well. These institutional pages simply add another layer on top of what's already on the OSF. UVA is a good example: we had probably about a hundred users of the OSF there, we turned this on, and it provides a public hub of the research that's being shared at that institution, some of which may be active research.
I can see who's working with whom, and how many projects they're working on, and I can get some basic information about each project. This is all up to the researcher's discretion: you can have a completely private project, but if you make it public, it will show up here and be discoverable. We provide persistent identifiers for projects and files, so that if you embed a link in an article, readers can get back to that project, and that project may be the first place someone discovers digital research by you or at your institution.
You can also search across all of it. If I search for "bias", I find the two projects dealing with bias in their title, description, or metadata tags. So I can easily search across the entire collection of research here at UVA to find a topic I'm interested in.
I don't have to go to 500 department websites to see who's studying it; I can just look here on the project page itself.
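That kind of search is also available programmatically through the API v2 filter syntax. A small sketch, assuming the documented `filter[...]` query parameters; nothing here is specific to the UVA page itself.

```python
# Sketch: build the API v2 query behind a "find public projects about X"
# search, using the OSF's filter[...] parameter convention.
from urllib.parse import urlencode

def search_public_nodes(term: str) -> str:
    """Build a URL that finds public OSF projects whose title mentions `term`."""
    params = {"filter[title]": term, "filter[public]": "true"}
    return "https://api.osf.io/v2/nodes/?" + urlencode(params)

print(search_public_nodes("bias"))
```

Fetching that URL returns a JSON list of matching public projects, the same set a visitor would see on the search page.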
If we go back to the project page itself, it's the same as the OSF has always been: a project to store data, with the ability to structure the project across multiple collaborators. We just add a little bit of metadata, the affiliation with UVA, and you can have multiple affiliations.
I picked out two use cases for the OSF that I thought might be relevant for general use, and they turned out to be really relevant given Ben's discussion earlier. A lot of times people see the OSF as a place to store data, and although being a data repository is not really our primary goal, we simply want to complement the other tools that are out there: Amazon S3, Dryad, Dropbox.
One is the fish guy; you can read the Wired article about him scanning fish. His goal is to scan all fish, and that's a lot of scanning. His institution told him they didn't have a place for him to store that data, so he started storing it on the OSF, which makes his STL files renderable and interactive in the browser. This is one of the scans; a lot of these are very, very small species. And the use case here is that other researchers are 3D printing these scans to study them at larger sizes. There are several hundred fish scanned on there now, and other researchers are contributing to it as a community-sourced project, putting all their scans of fish into this project on the OSF. Because he's quite well-known on Twitter, you get a lot of people talking about what they're doing with these fish scans. So it's an interesting use case.
The other one is not so much about the data itself, but about what you get from putting data into a public, shareable repository. A lot of institutions worry about citations, and certainly that's an important part of the story, but a more significant part of the story is the impact of that research beyond citation, and any public project on the OSF gives you that kind of visibility.
There are lots of other features I'm not going to go into; these are pulled from the table of contents of our help documentation. But if you have interest in any one of these functions of the OSF, I'd be happy to answer questions about them.
A couple of quick pilots that I wanted to make sure folks were aware of: both of them involved Notre Dame, but we've now expanded with other groups. The first one was actually one of the original starting points for what the institutional work could be. Notre Dame wanted to build a service to preserve the research code that is used to analyze data and prepare datasets in a lot of disciplines, working through their Center for Research Computing.
They were going to build an ingest engine where researchers could come in, drop in their Dropbox data and their GitHub files, and run things in a Docker container, which would then be preserved for other researchers to use at a later time. They were beginning the process of figuring out how to build all the API connections to the different services they would need to do that, which is where we came in.
The image is kind of cut off here, but on the upper left is the OSF with a custom command-line interface built into it, connected directly to the high-performance computing center at the Center for Research Computing. It allows the researcher on the OSF to identify a connected data source, in this case a Dropbox file, and a script, which could live in a completely different connected location such as a GitHub repo, and then execute that script on that data, running in a Docker container.
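The execute-a-script-on-connected-data step described above can be sketched as assembling a `docker run` invocation. The image name and mount layout below are assumptions for illustration, not the actual Notre Dame configuration.

```python
# Sketch of the "run a script against connected data in a container" pattern:
# mount the synced data read-only, mount the fetched script, run it inside
# a throwaway container so the environment itself can be preserved.
def docker_command(data_dir: str, script: str, image: str = "python:3") -> list:
    """Assemble a `docker run` invocation for one analysis step."""
    return [
        "docker", "run", "--rm",
        "-v", f"{data_dir}:/data:ro",       # connected data source (e.g. Dropbox sync)
        "-v", f"{script}:/work/run.py:ro",  # script fetched from e.g. a GitHub repo
        image, "python", "/work/run.py", "/data",
    ]

print(" ".join(docker_command("/tmp/dropbox", "/tmp/analysis.py")))
```

Because the container image pins the environment, preserving the image alongside the inputs is what makes the computation re-runnable later, which is the point of the pilot.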
There was a question in the back; to address it: this is being done with an NDS grant, and it's in a pilot phase. You can go look at the presentation at that link, and we hope to move this out of pilot, perhaps adopted by some other centers for research computing, maybe late this year. The other pilot, which is also very interesting and which I touched on a little bit earlier, is connections to institutional repositories.
The second phase of that would be a push of data from the OSF into the institutional repository, using a custom metadata form associated with that repository, so that researchers don't have to package something up in a different tool and email it to a repository representative. They can simply push a button to submit to their institutional repository; we would populate the form with as much metadata as we already have, and they would fill in the gaps in order to get it into the repository, CurateND in this case.
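The "populate what we have, fill in the gaps" step amounts to a simple metadata merge. The field names and helper below are hypothetical, for illustration only.

```python
# Toy sketch of pre-populating an institutional-repository deposit form
# from metadata the OSF already holds, and flagging what's still missing.
def prefill_metadata(osf_record: dict, required_fields: list) -> dict:
    """Copy known fields into the form; list the gaps for the researcher."""
    form = {field: osf_record.get(field, "") for field in required_fields}
    form["_missing"] = [f for f in required_fields if f not in osf_record]
    return form

record = {"title": "Fish scans", "creators": ["A. Summers"]}
form = prefill_metadata(record, ["title", "creators", "rights", "subject"])
print(form["_missing"])  # ['rights', 'subject']
```

Only the `_missing` fields go back to the researcher, which is exactly why the button press is cheap compared with packaging everything up by hand.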
We believe very strongly that connecting the workflow is really critical to enabling change, and the change specifically is the transparency and, ultimately, the reproducibility of research. If you follow any of the reproducibility debates, it's often not that the science being done is wrong; oftentimes we can't even reproduce our own steps from a year ago in our own lab, much less expect anyone else who wants to build on our research to do the same.
B
It's
not
that
people
are
falsifying
information
is
simply.
We
don't
have
complete
enough
information
to
produce
similar
results
of
a
paper
that
we
studied
and
want
to
build
on
so
starting
with
this,
but
we
think
there's
a
lot
more,
that
we
can
do
by
connecting
to
other
services
and
I.
Think
that's
really
one
of
the
call
to
action
that
I'd
be
curious
about
this
group.
Giving
us
feedback
on
is
how
else
could
this
model
of
providing
a
sort
of
Commons,
forever
research
tools
and
services
affect
the
goals
of
the
hub?
B
B
Not by locking anyone into any one particular service, but by connecting as many services as we need and providing a place for researchers to move data around very easily. Actually, I always say don't move your data around: leave it where it is, but connect your workflow to where the data should be. So if your data should live in Dryad, great; just connect it to your research project in a way that lets you cite it, share it, and collaborate on it.
So we're doing a lot of these different things, and our goal is to build the infrastructure and let the branded versions of these services be community-generated. SocArXiv, for example, is a group of sociologists that wanted a preprint service; instead of their fundraising a million dollars to build it, we can provide it for free, and we can duplicate and parallelize that infrastructure very, very easily.
It's done at a reduced cost, and it's part of that community. Another thing, and I think Mike can weigh in on the metadata discussion, is our project with ARL: the SHARE project, which you can find at share.osf.io. What this is, is a metadata harvester and normalizer service.
We are currently sourcing data from 146 different repositories, including things like the NIH Commons, institutional repositories, and a lot of publisher repositories. These are brought in through their APIs, and the metadata is harvested and normalized. That's a big job. It's being done as much as we can by automated filtering, but we also have a crew of about 40 data curation associates, mostly helping to initially normalize data at institutions, so that it can be more easily normalized afterward by each community, because they know the most about their own data.

This is actually feeding our preprint service: we're aggregating all of the preprints from arXiv, bioRxiv, and PeerJ into OSF Preprints, in addition to the services we're standing up on top of it, so that you have a single location to search two million preprints. You can search for other data events too; these are mostly research events, by the way: publications, grants, and data repository submissions.
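The harvest-and-normalize step SHARE performs can be illustrated with a toy field-mapping normalizer. The source names and field mappings below are invented for illustration; the real service handles far messier records.

```python
# Toy version of the normalize step: map each source's field names
# onto one common schema, keeping track of where the record came from.
FIELD_MAPS = {
    "sourceA": {"dc:title": "title", "dc:creator": "contributors"},
    "sourceB": {"name": "title", "authors": "contributors"},
}

def normalize(source: str, record: dict) -> dict:
    """Rename a raw record's fields into the shared schema."""
    mapping = FIELD_MAPS[source]
    out = {common: record[raw] for raw, common in mapping.items() if raw in record}
    out["source"] = source
    return out

print(normalize("sourceB", {"name": "Scan all fish", "authors": ["Summers"]}))
```

The per-source mapping is exactly the part that needs a human look the first time a repository is added, which is why each new source gets reviewed by a developer.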
You can search on that and filter by funder, by source, by all different kinds of things, and you can set up a notification feed. One of the really cool things we're just now piloting, and the last thing I'll show, is a pilot project with UC San Diego to use SHARE to create an institutional dashboard of research events.
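Filtering research events by funder or source, as on such a dashboard, amounts to a structured search query. SHARE's search was Elasticsearch-backed at the time; the field names below ("title", "sources") are assumptions on my part, so treat this as a sketch of the query shape rather than the exact SHARE schema.

```python
# Sketch: an Elasticsearch-style bool query for research events,
# optionally narrowed to one source institution.
def share_query(term: str, source: str = None) -> dict:
    """Build a query dict matching `term`, optionally filtered by source."""
    must = [{"match": {"title": term}}]
    if source:
        must.append({"term": {"sources": source}})
    return {"query": {"bool": {"must": must}}}

q = share_query("coral", source="UC San Diego")
print(len(q["query"]["bool"]["must"]))  # 2
```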
The Scripps Institution is the primary partner there. This is really just our first prototype of doing this, and we're hoping to expand on it in the future. So what I'll leave you with is: what can you do? Hopefully your research teams can use the OSF as a tool today to solve collaboration and sharing challenges, and you can explore using the OSF at your institution.
Mike: Yeah, this is Mike. I think this is really nicely done, and I'm really interested; there are a lot of similar qualities to the stuff we've worked on. Are you participating in that NDS universally accessible public API effort, with standard metadata and handles, being able to expose yourself to metadata harvesting from NDS? Are you tapped into that project?
Matt Spitzer: I believe so, though I'm not as directly involved with that piece of it. I believe there is some connection there, because of the grant we had to do that original dashboard, which is NDS-related. But certainly our intention is that the APIs are for everyone: both the OSF and SHARE are completely open and can be built upon to harvest this data into other pages, so that's certainly the intent behind it. The SHARE site itself is actually kind of bare-bones.
Mike: Exactly, yeah. While you were talking, I started poking around on GitHub and looking for extension points and things like that. I think this stuff is really neat.
It's really well done. I need to do a little digging around in the APIs, but I could certainly see, at least for some of our DFC purposes, being interested in mining what the extension points are, so that data held in services like DFC could be exposed. We also work a lot with the Dataverse people, especially at the Odum Institute here at UNC, so I know that's very big. Anyway, all I'm saying is that as you were talking,
I was kind of looking around in the background, and it looks like it's extremely well done, and I might try to ping you later offline, if that's possible, just to talk some about that. But I think this is the kind of thing, and a question generally for the data hubs is: are they looking for a sort of public face, or a commons? What's the roadmap for how you embody what the big data hubs look like? That's officially vague and I'm rambling, so I'll stop there.
Matt Spitzer: That's an interesting question. We surface all that information, and we're trying to build a front end to improve some of the presentation; it's certainly always a challenge to keep that up to date. So if you have any suggestions or issues there, certainly let us know, and we can work with you directly.
Audience member: Forgive me, just a little aside about this kind of aggregation. You seem to be asking for a lot of consent from institutions before they are registered. Is that a matter of policy, or is it a beta-release thing? I just looked at my own institutional repository, and I think we're not in there, but sitting right there on the home page we have an RSS feed and some similar things, and I presume such services exist pretty frequently. Are you expecting at some point to just harvest such services that are openly available?
Yeah.
B
B
We
do
offer
if
you
go
to
the
share
OSF
site,
you
can
register
your
repository
and
it's
the
information
and
we'll
build
a
harvester
for
it.
Each
one
has
to
be
looked
at
individually
by
a
developer
to
again
normalize
the
metadata.
We
in
some
cases
will
actually
go
out
and
request
permission
to
harvest,
because
it's
a
central
link
for
the
previous
was
a
good
example.
B
So
I
think.
The
first
step
is
that
you
want
to
put
your
data
into
it
and
how
to
contribute
is
just
a
good
register
your
source
and
then,
if
you
have
other
sources,
you
can
point
it
to
way.
You
can
either
contact
us
and
we
can
perhaps
approach
them
or
the
easy
snap
is
actually
for
you
to
let
the
other
repositories
know
hey.
We
all
put
our
data
in.
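Feeds like the RSS and OAI-PMH endpoints the question mentions are exactly what a harvester consumes. Here is a minimal sketch of pulling titles out of an OAI-PMH-style `ListRecords` response; the XML sample is handmade for illustration, not from a real repository.

```python
# Sketch: extract dc:title values from an OAI-PMH ListRecords response.
import xml.etree.ElementTree as ET

SAMPLE = """<OAI-PMH xmlns:dc="http://purl.org/dc/elements/1.1/">
  <ListRecords>
    <record><metadata><dc:title>Scan all fish</dc:title></metadata></record>
    <record><metadata><dc:title>Bias in surveys</dc:title></metadata></record>
  </ListRecords>
</OAI-PMH>"""

DC = "http://purl.org/dc/elements/1.1/"

def harvest_titles(xml_text: str) -> list:
    """Pull every dc:title out of a ListRecords response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(f"{{{DC}}}title")]

print(harvest_titles(SAMPLE))  # ['Scan all fish', 'Bias in surveys']
```

A real harvester would fetch the feed over HTTP and page through resumption tokens, but the parse step is this simple, which is why openly exposed feeds lower the barrier to being included.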
Host: On March 31st we have presenters John Moore and Florence Hudson from the Internet2 community, supported by the NSF regional hubs, and also on March 31st we have Alex Feltus and Claris Castillo presenting their SciDAS, national cyberinfrastructure for scientific data analysis at scale, so we're very excited to have that.
We will only have one session in March, because the big data hubs are meeting for their annual meeting at NSF. Then, after a few more sessions, we will be bringing the demos to a close, and if people have ideas for actions that we need to take, we will transition to working-group actions from then on. So with that, thank you, Karl, for helping to get the logistics set up, and thank you all for attending. Have a great weekend. Thank you, everyone. Thank you.