South Big Data Hub Data Sharing & Infrastructure Group, 2 Mar 2017

From YouTube: CI WG demo: Dataverse

Description

Date: 03/02/17
Presenter: Gustavo Durand
Institution: Harvard University
Northeast Big Data Hub

A

Description: Dataverse is an open-source platform built for creating data repositories for sharing and publishing research data. The project's community focus has evolved from small social science data to the big data needs of research, which has led to features focused on computation, big data storage, and replication.

A

We have joining us today Gustavo Durand from Harvard University. He's the technical lead and architect for Dataverse and has been working with Dataverse since 2006, leading the overall architecture and technical design of the application. He mentors and provides code reviews for fellow developers, assists the Dataverse community in contributing features that are important to them, and works closely with the development project managers in leading the team and project. So, without further ado, I'll turn it over to Gustavo to tell us more about Dataverse.

B

Thanks. So you guys can see the slides, right? Yes? OK. So basically I will be talking about the project. I'll give an overview and discuss some of its features and the technology we're built on, then shift over to talk a little bit about our process and some of the benefits of how we're doing that, and then about some of the collaborations we've had. I'll end by discussing a little bit more about the community overall.

B

So what Dataverse is, is an open-source platform to publish, cite, and archive research data. It's built to support lots of different types of data, users, and workflows, and in the next few slides, when I talk about the features, I'll divide them up by those three categories. Obviously features might overlap, but I tried to place them where they best fit.

B

A couple of the benefits of being open source, and I'll talk about this a little bit later, are the transparency we're able to have with the community and the fact that we're able to collaborate. It's not all developed by us; it's also developed by the actual users, and there's a sense of ownership of the product within the community. We've been working on this since 2006 here at Harvard. Our funding comes partially from within IQSS, and we also work on getting different grants and collaborations that help with adding a lot of these features.

B

We have a core team here of about 15, which includes everybody from the UI/UX team and designers to management and technical leads like me, all the developers, obviously, and even our curation and metadata specialists. Then we also have, as I mentioned, a lot of contributions from the community itself, so the team expands and grows over time remotely with the community.

B

OK. So, as I said: data, users, and workflows. First, some of the features we have to support different kinds of data. One of the interesting parts of being open source and having this large community, but also one of the challenges, is that we have lots of different types of institutions and organizations who want to use it.

B

So they have a wide variety of data. One of the things that Dataverse does is provide persistent IDs and URLs for all your datasets, but some organizations want to use DataCite for the persistent IDs, some want to use Handles, and there's actually another group that wants to use something called da|ra. So we have it built so that the ID is modular and you can add on different providers as needed.
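The modular identifier design described above can be pictured with a small plug-in sketch. This is only an illustration, not Dataverse's actual code: the class names, the `PROVIDERS` table, and the sample prefixes are hypothetical, and real providers also register each identifier with the remote service, which is left out here.

```python
from abc import ABC, abstractmethod


class PidProvider(ABC):
    """Hypothetical plug-in interface: one subclass per identifier service."""

    scheme = ""    # short prefix such as "doi" or "hdl"
    resolver = ""  # base URL of the public resolver

    @abstractmethod
    def mint(self, authority: str, local_id: str) -> str:
        """Build the identifier string for a dataset."""

    def resolver_url(self, pid: str) -> str:
        # Drop the "scheme:" prefix and append the rest to the resolver.
        return self.resolver + pid.split(":", 1)[1]


class DoiProvider(PidProvider):
    scheme, resolver = "doi", "https://doi.org/"

    def mint(self, authority, local_id):
        return f"doi:{authority}/{local_id}"


class HandleProvider(PidProvider):
    scheme, resolver = "hdl", "https://hdl.handle.net/"

    def mint(self, authority, local_id):
        return f"hdl:{authority}/{local_id}"


# An installation would pick one provider in its settings; supporting a new
# service means adding one more subclass and one more entry here.
PROVIDERS = {p.scheme: p for p in (DoiProvider(), HandleProvider())}

# Placeholder authority and local ID, made up for the example.
example_pid = PROVIDERS["doi"].mint("10.5072", "FK2/ABC123")
example_url = PROVIDERS["doi"].resolver_url(example_pid)
```

Keeping the rest of the application behind one small interface is what would let a third service be added without touching dataset code.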

B

We also make sure we generate a citation for any dataset and provide proper attribution. We're compliant with the FAIR principles and the general data citation principles in the way we do the citations of data, have landing pages and machine-readable pages, and things like that. We do offer domain-specific metadata, and what we mean by that is this: every dataset has a core set of metadata that's needed for the citation, such as the title and the author.

B

You also want a description, keywords, and things to describe it. Beyond that, the software itself allows installations to create and use different metadata blocks, as we call them, to support different domains. So if you're in astronomy, there's a block that has specific fields built upon standards that have to do with astronomy.

B

If you're in social science, we have a block for social science. We work with domain experts and try to add more domain-specific metadata blocks as we can, and the system is built to be able to add them dynamically without needing a new release of the software.
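One way to picture metadata blocks that can be added without a new release is to treat each block as data loaded at runtime rather than as code. The sketch below assumes that model; the `MetadataBlock` class, the registry, and the astronomy field names are invented for illustration and are not the format Dataverse itself uses.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataBlock:
    name: str
    fields: list
    required: set = field(default_factory=set)


REGISTRY = {}


def register_block(block):
    """Make a block available at runtime; no code change is needed."""
    REGISTRY[block.name] = block


def validate(record, block_names):
    """Return a list of problems: missing required fields, then unknown ones."""
    allowed, required = set(), set()
    for name in block_names:
        block = REGISTRY[name]
        allowed.update(block.fields)
        required.update(block.required)
    present = set(record)
    problems = [f"missing: {f}" for f in sorted(required - present)]
    problems += [f"unknown: {f}" for f in sorted(present - allowed)]
    return problems


# Core block needed for the citation, plus a hypothetical domain block.
register_block(MetadataBlock(
    "citation", ["title", "author", "description", "keyword"],
    required={"title", "author"}))
register_block(MetadataBlock("astronomy", ["instrument", "wavelength_range"]))
```

A dataset in an astronomy collection would then be checked with `validate(record, ["citation", "astronomy"])`, while a social science collection would pass a different list of block names.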

B

There's also versioning of datasets, so that as you publish and create different versions, an earlier one that someone has cited is still accessible to users. As for file storage, we're able to store the data either on a local file system or, if you run on OpenStack, in a Swift object store. What we're doing now at Harvard is running on Amazon and using S3 as our storage solution.
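The versioning behavior mentioned above, where a cited earlier version stays reachable after newer ones are published, can be sketched as an append-only list of snapshots. This is a simplified model under that assumption; the `Dataset` class and its integer version numbers are illustrative and say nothing about how Dataverse numbers or stores versions internally.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Dataset:
    pid: str
    versions: list = field(default_factory=list)  # published snapshots, oldest first

    def publish(self, files: dict) -> int:
        """Freeze a copy of the current files and return its version number."""
        self.versions.append(dict(files))
        return len(self.versions)

    def get(self, version: Optional[int] = None) -> dict:
        """Return the requested version, or the latest when none is given."""
        if not self.versions:
            raise LookupError("nothing published yet")
        if version is None:
            version = len(self.versions)
        if not 1 <= version <= len(self.versions):
            raise LookupError(f"no version {version}")
        return self.versions[version - 1]


# Placeholder identifier and file contents, made up for the example.
ds = Dataset("doi:10.5072/FK2/EXAMPLE")
first = ds.publish({"data.csv": "v1 contents"})
second = ds.publish({"data.csv": "v2 contents", "readme.txt": "notes"})
```

Because published snapshots are never overwritten, a citation that points at version 1 keeps resolving to exactly the files that were cited.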

B

We also need to support different kinds of users. Different institutions want different ways of logging in. We provide a native login, using a password that you create through the Dataverse application, but we also use Shibboleth to connect to a lot of the different universities, so I can log in, for example, with my HarvardKey and not need a whole separate login. And we also have OAuth, which we primarily added for ORCID.

B

But with it, with just a little bit of extra work, we also got a Google login and a GitHub login. That's configurable for each installation of Dataverse, so you can decide per installation which of the sign-in options you want to include. At Harvard, we allow any of those. Texas Digital Library, for example, only allows people who are associated with a Texas university that's part of the consortium, so they only have the Shibboleth login. Users can also be of different sizes.

B

You might have an individual researcher who has data that they want to put in a dataverse, or you might have an institute, like IQSS for us, or you might have a journal. There's the ability to embed dataverses within other dataverses, so you can basically mirror the hierarchy of your organization with different dataverses to categorize the different data you have. With that, we also have branding, first of all at the installation level.

B

So Harvard Dataverse looks different than Texas Digital Library, which looks different than Odum, which looks different than Scholars Portal in Canada, et cetera. But then each individual dataverse within an installation can also have a little bit of branding that says, yes, this is my data as a researcher, you know, John Smith. And then we also have widgets, so that you can take a view of a dataverse and actually embed that in your website.

B

So if you have a personal website, you can put in a widget with a listing of your datasets or a specific dataset. The other area of variety that I mentioned was workflows. We have lots of different workflows.

B

We have a very robust permission system, which allows us to create all these different workflows. So, for example, Harvard Dataverse, and I use that as my primary example because it's the one I'm most closely associated with, allows anybody to upload data and create dataverses and datasets.

B

Another installation might say, you know what, we only want certain users to create dataverses or datasets. This also has to do with access controls and terms of use, because once you've uploaded data, you may have different permissions that you want for who can download the data, whether they can request access, and things like that.

B

Permissions can be given to users individually or to groups of users. You can have explicitly defined groups, or you can create groups based on Shibboleth; for example, we can create a group here that's all Harvard users. Or you can create groups based on IP.

B

There are different publishing workflows. Journals, for example, really care about the ability to have someone create their datasets and add them, but they don't want them to be published until they've had a chance to review them. They can manage that with the permissions, saying that only certain people can publish but anyone can add a dataset. Whereas if I create my own individual researcher dataverse, I want to be in control of everything, so I can create, publish, and do everything I need to with that dataset.
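The three kinds of groups and the journal workflow described above can be sketched as follows. This is a minimal model, not Dataverse's permission system: the class names, the permission strings, and the identity provider hostname are all hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    ip: str = "192.0.2.1"  # documentation-range address used as a default
    idp: str = ""          # identity provider asserted at login, if any


class ExplicitGroup:
    """A hand-maintained list of user names."""

    def __init__(self, names):
        self.names = set(names)

    def contains(self, user):
        return user.name in self.names


class IdpGroup:
    """Everyone who logged in through one institution's identity provider."""

    def __init__(self, idp):
        self.idp = idp

    def contains(self, user):
        return user.idp == self.idp


class IpGroup:
    """Everyone connecting from an address range given in CIDR form."""

    def __init__(self, cidr):
        self.network = ipaddress.ip_network(cidr)

    def contains(self, user):
        return ipaddress.ip_address(user.ip) in self.network


class Collection:
    """A dataverse-like container holding (grantee, permission) pairs."""

    def __init__(self):
        self.grants = []

    def grant(self, grantee, permission):
        self.grants.append((grantee, permission))

    def allows(self, user, permission):
        for grantee, granted in self.grants:
            if granted != permission:
                continue
            if isinstance(grantee, User):
                if grantee.name == user.name:
                    return True
            elif grantee.contains(user):
                return True
        return False


# Journal-style workflow: anyone from the institution may add a dataset,
# but only the editors group may publish it.
journal = Collection()
journal.grant(IdpGroup("idp.university.example"), "add_dataset")
journal.grant(ExplicitGroup({"editor"}), "publish")
journal.grant(IpGroup("10.0.0.0/8"), "download")
```

An individual researcher's collection would instead grant every permission directly to that one user, which gives the "I control everything" case from the talk.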

B

We also have, and it was developed for journals but it's of use to anybody, the idea of private URLs. When a dataset is in draft version, you can provide a private URL that, with a token, allows somebody to anonymously go and look at the dataset and therefore provide anonymous peer review. We also have a bunch of different upload and download workflows. You can upload files via the browser.
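The private URL idea mentioned above comes down to an unguessable token that maps to one draft dataset. The sketch below assumes that reading; the `/private` path, the class, and the helper are hypothetical and are not the URL scheme Dataverse uses.

```python
import secrets
from urllib.parse import parse_qs, urlparse


class PrivateUrls:
    """Issue and check unguessable links to draft datasets."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")
        self._tokens = {}  # token -> dataset id

    def create(self, dataset_id):
        # 32 random bytes, encoded with URL-safe characters only.
        token = secrets.token_urlsafe(32)
        self._tokens[token] = dataset_id
        return f"{self.base_url}/private?token={token}"

    def resolve(self, token):
        """Return the dataset id for a valid token, else None."""
        return self._tokens.get(token)

    def revoke(self, dataset_id):
        # Drop every token for the dataset, for example once it is published.
        self._tokens = {t: d for t, d in self._tokens.items() if d != dataset_id}


def token_from_url(url):
    """Pull the token query parameter back out of a private URL."""
    return parse_qs(urlparse(url).query)["token"][0]
```

Since the reviewer never logs in, the server learns nothing about who followed the link, which is what makes the review anonymous in this model.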

B

Also, if you have a Dropbox account, there's a button to add from Dropbox: you go ahead and log in to your Dropbox and add files that way. And one of the recent things we've worked on, and I'll talk a little more about it when I get to the collaboration slide, is support for big data packages.

B

We are now able to rsync data over and not use the browser or Dropbox, because they have HTTP limitations on size. So we use rsync to transfer these larger big data packages. In general, one of the big things about Dataverse is interoperability, so we provide a lot of APIs. We provide SWORD for deposit, which is a simple standard for depositing content, but we also have a very robust native API.

B

Our goal is to make anything that you can do through the UI doable through the APIs as well. We're not quite there; we're maybe at 90% or something. But, for example, I can use the search API to do a search and then download the results through the API. I can also upload and publish via the APIs.
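The search, download, and publish calls mentioned above could be scripted roughly like this. The endpoint paths and the `X-Dataverse-key` header follow the Dataverse API guide as best I understand it, so they should be checked against the guide for the version you run; the helper names and the demo hostname are made up. The functions only build requests and send nothing.

```python
from urllib.parse import urlencode
from urllib.request import Request

TOKEN_HEADER = "X-Dataverse-key"  # header that carries the user's API token


def _request(url, api_token=None, method="GET"):
    request = Request(url, method=method)
    if api_token:
        request.add_header(TOKEN_HEADER, api_token)
    return request


def search_request(base_url, query, kind="dataset", per_page=10, api_token=None):
    """Build a search for published items matching the query string."""
    params = urlencode({"q": query, "type": kind, "per_page": per_page})
    return _request(f"{base_url}/api/search?{params}", api_token)


def download_request(base_url, file_id, api_token=None):
    """Build a download of one file by its numeric database id."""
    return _request(f"{base_url}/api/access/datafile/{int(file_id)}", api_token)


def publish_request(base_url, dataset_id, api_token, release="major"):
    """Build a publish action; publishing always needs a token."""
    if release not in ("major", "minor"):
        raise ValueError("release must be 'major' or 'minor'")
    url = f"{base_url}/api/datasets/{int(dataset_id)}/actions/:publish?type={release}"
    return _request(url, api_token, method="POST")
```

Each returned object can be passed to `urllib.request.urlopen` to perform the call, so a journal system could chain a deposit and a publish in one script instead of sending users to a second site.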

B

We've used that a lot with journals, where they have publishing for the papers and can automatically, through the APIs, deposit their data into Dataverse at the same time, doing it as one operation instead of having to go to multiple sites to do all the work they need to do. We also provide harvesting, which allows different dataverses and other services to exchange metadata back and forth.

B

For example, at Harvard Dataverse we set it up to harvest the metadata from all the other Dataverse installations as they come online, so that you can go to Harvard Dataverse and search for everything. We don't grab their data, so when you find something interesting and click on the link, it sends you straight over to their installation, and they're able to manage their permissions that way.

B

So, being the tech lead, I always have a quick slide about the technology. I'll go through it quickly, and we can come back to it if there are any questions afterwards. We're a Java web app. We run on GlassFish, and we use Java Standard Edition 8 and Enterprise Edition 7. We use a lot of different modules for the presentation, business, and storage layers. In the back end, we store things in a Postgres database.

B

We use Solr for indexing and search and, as I mentioned before, you can store files in the file system, Swift, or S3. I mentioned our transparency, and that's one of the nice things about open source. In our development process, all our issues are in GitHub. We also use a tool called Waffle, which shows issues as they transition during a sprint. When something comes in, it comes into the Inbox. When we start looking at it, we move it to the backlog, and we work in sprints.

B

Every two weeks we have a meeting to decide what we're going to work on for the sprint. We put things in there, and they get moved over by developers as they work on them. When developers are finished, they move them to code review, and I or one of the other developers reviews it.

B

It then goes through a QA process, and when it's through, it gets to Done. We don't have time to go through these two links, but the first shows our roadmap and recent and upcoming releases, and the second is a link to that Waffle board. This image shows it all collapsed, but if you go to it and click on the double arrows, it'll open up and you'll see the specific issues in each of those parts of the process.

B

I mentioned the collaborations that we have. The two in bold are the ones that are more interesting, I think, to you guys. The first is with SBGrid, which is the Structural Biology Grid.

B

We worked on adding this rsync support to be able to get large data in and out of Dataverse, and we're working with them to be able to replicate it across different places. If you're working in France, we can replicate it to a server there, and you can do your computation work there: it's closer, and you might have access to that environment but not to some other environment. We've also worked with the Massachusetts Open Cloud; that's how we got the Swift storage.

B

We worked with them to get the Swift storage enabled and, through Swift, to allow compute access to your data. The other collaborations are ones that we often talk about in these presentations. I don't think they're super relevant here, but in general they're interesting for different reasons. The first one was for Handle support, which was one of the persistent identifiers.

B

We actually worked with a couple of different organizations on that, managing work that DANS first started and CIMMYT then finished up, so that we could merge it into the core. ResearchSpace was all about creating a Java client library for the APIs, to be able to use the APIs through Java. And we've been working with the provenance group to add provenance; we're working on that in the current sprint, actually, and trying to finish some of it. We linked to that.

B

So, our community. I mentioned that we have many different installations. We work on the software here at IQSS, and we also run one of the installations, the Harvard University installation, in collaboration with the libraries. But the software is downloadable and can be installed anywhere, and we currently have 32 different installations around the world.

B

That's what these big dots are on the map. Those are the ones that are in production and willing to be on the map, but there are a lot of others that are testing it out, and our community continues to grow that way. Because we have all these organizations and everyone's interested in different features, we've had over 50 different people contribute code, and we have hundreds of members in the community, which includes these developers but also researchers, librarians, and data scientists. We meet with them regularly.

B

We have a Dataverse Google Group so that we can have email threads and communication. It's great to see, because when that first started, most of the time people would ask questions and the responses would generally come from us. But as our community continues to grow, before we're even able to get to a question, especially if it's from Europe and comes in during our off-hours, responses come from other members of the community. You can see the community itself growing and sharing knowledge with each other.

B

We have a community call every two weeks, so if you're at all interested, the next one is not next Tuesday but the week after. And then we have an annual community meeting that we host here at Harvard, where people come in and we have presentations about what people in the group are working on and what we're working on, we present the upcoming roadmap, and we have discussions and lunch and things like that.

B

Here's a picture of our community from the last community meeting, and I think that's what I've got. Our next community meeting is June 13, 14, and 15, so if you're interested and want to join us, let us know. If you have other questions, obviously feel free to ask now, but we're always available through support or these other channels.

A

Very nice. Thank you very much. So I'll turn that over for questions around the horn.

A

If not, I've got one, so I'll go ahead and toss this out.

A

It's Niall Gaffney. I'm one of the co-PIs on a project called Whole Tale, where what we're doing is trying to bundle data with the containerized applications and environments that produced it, for reproducibility. One of the challenges that we're going to be facing fairly soon is how to bundle these and put them into repositories. So have you had any experience with bundling both application and data into a searchable, discoverable environment for both aspects of that? Or is there anybody else in the Dataverse community who's looking into that?

B

Yeah, we are definitely looking into that. I mean, in Dataverse itself you can upload any files, so you're definitely encouraged to not just upload your data but also upload the software and code that you used to get the results you've been getting. We have also talked with, I don't know if you've heard of them, a group called Code Ocean.

A

Mm-hmm.

B

They're a for-profit company, but they work on this idea of bundling the software and data together and then providing, I think they don't call it a container, they call it a compute module, I think. But the basic idea is to bring them together for reproducibility.

B

We have proposals in with them to get a grant so we can work with them. The one thing we want to make sure of is that we're not developing something specific to them but to that kind of product, so that if someone has an open-source one, we can connect with that as well. But we definitely want to support that idea. Reproducibility is very important within the data world and within Dataverse itself, so we're working with those groups.

A

Absolutely. And I was just going to say, it's sort of twofold: it's reproducibility from the point of view of being able to redo it, but also being able to discover how people did what they did, and that's a different aspect of search, which we're a bit challenged right now to define. We're very closely tracking Code Ocean as well to understand what they're doing. I think we've actually had some conversations with them on things.

A

So I think it might be good in the near future to start to come up with some, you know, not that everybody will have a standard, but proto-standards that we can have for some of these environments, so that we can have transparency even between the environments.

B

Yeah.

C

Hi, this is René Bastón. I'm the executive director of the Northeast Big Data Hub. Gustavo, a quick question for you: is Dataverse primarily used in academic data sharing applications, or are you guys doing applications that move beyond just the academic domain?

B

I'd say it's beyond the academic domain. I mean, it is definitely about research, and academic research in that sense, but it's not just for universities. For example, the Netherlands has an installation for their national social science organization, called DANS. There are a few other organizations that deal with agricultural studies and things like that, across Mexico and France and the rest of the world, that aren't necessarily university based. It's definitely academic for the most part.

B

We have also talked with some other groups. Red Hat, for example, is interested in possibly using Dataverse for some of their internal processing of data, which is obviously very different than what we're used to in the academic world. So our goal is to produce it as a platform that could be for anything.

B

Obviously we come from an academic environment, so a lot of the user feedback that we can more easily get and have access to comes from there, but we're definitely open to and interested in being broader if we can.

C

Well, we're working on a number of things that span outside of academia. It's really at the intersection of academia, government agencies, for-profit companies, et cetera, and I'm a firm believer in not reinventing the wheel. So I would love to come up, and it's particularly easy since you guys are in the Northeast, and maybe talk a little bit more about how this might be applied to some of the things that we're thinking about.

B

Yeah, definitely. Get in touch with us, either me directly or through any of those links, or, you know, the project manager or some of the other people. We can set up a meeting and figure something out.

C

Sounds great. Thank you.