South Big Data Hub Data Sharing & Infrastructure Group, 3 May 2019

From YouTube: CI WG demo: CADRE (Collaborative Archive & Data Research Environment)

Description

Date: 5/3/2019
Presenter: Valentin Pentchev
Institution: Indiana University
Midwest Big Data Hub

A

I, for one, was really excited when I saw that Valentin Pentchev, who's from Indiana University, their Network Science ... oh gosh, I should have looked up what this stood for. I just know where you're from, anyway.

IUNI. He's been involved in several, I think, very interesting and kind of leading-edge projects over the years. A lot of us will know him from [unclear]. He's here today to talk to us about CADRE (and you'll have to tell us how you like that acronym pronounced), the Collaborative Archive & Data Research Environment, and so I will turn it over to you, Valentin. Thank you.

C

My full name is Valentin Pentchev. Thank you very much for inviting me here. I hope everybody can hear me okay and see my slides; if not, please blink twice. Okay, thank you. I am new to this group, but I am a veteran of the Midwest Big Data Hub. I was there at the first charrette, through the wild, wild west rules of engagement. Three years ago, we at the Network Science Institute were presiding over the network science spoke, when spokes were still a thing, and we're even participating in a data science ring of the Midwest Big Data Hub.

C

I was recently elected to the steering committee of the Midwest Big Data Hub, and I am one of the three co-PIs from IU for the new phase, which we are anxiously anticipating.

C

What I am talking about today is a different project, called CADRE. That's how it can be pronounced: the Collaborative Archive & Data Research Environment, which is the product of a project for which we received funding from the IMLS, the Institute of Museum and Library Services, called the Shared Big Data Gateway for libraries.

C

The reason we are discussing this: there is a dilemma facing most libraries, which cannot provide researchers with sustainable, standardized access to licensed data sets as well as the freely available data sets out there. We have been experiencing this problem at Indiana University ourselves, by means of implementing the infrastructure and allowing our researchers, who come from various backgrounds, to access this data. We realized that for anybody outside of the R1 institutions this is almost impossible, and even for the R1 folks it is possible but requires a very large investment in resources.

C

That's how, by the way, we were able to convince them to go with us, and we created CADRE, again a product of the Shared Big Data Gateway. What it is: it's a cloud-based platform that will provide secure access to bibliometric, scientometric, and informetric data, for now, to a large body of university researchers.

C

We are collaborating with the Big Ten Academic Alliance, and a lot of their members are funding the project. But the idea is to have an open tier as well, so that everyone out there, be it a research library, a community college, or just the interested public, is able to access those resources alongside the Big Ten and the rest of the R1 institutions. This provides a solution to the main problem of implementing big data sets and getting access to both licensed and open source data.

C

In CADRE, we also try to target (and I will speak more about this in a minute) various kinds of researchers, with various levels and depths of understanding of computer technology in general. So there will be a graphical interface that will guide people through the creation of queries and the exploration of the data. A little bit more on the second point: when I came here to Indiana University in 2015, we were one of the few institutions that had purchased the data.

C

This is the Web of Science data, from then Thomson Reuters, now Clarivate. The original plan was for the data to be stored on a non-networked computer, chained to a basement wall, with a sign-in and sign-out sheet. It took us about three years to convince the administration of the university that we could solve the technical challenge of creating a secure enclave on one of our compute clusters, where the data can live and be accessed by researchers while still adhering to the vendor specification. I've talked for a while.

C

Are there any questions, or anything that I can clarify, before moving forward?

C

A little bit about the project partners. We are IUNI, the Indiana University Network Science Institute, in collaboration with the IU Libraries and the Big Ten Academic Alliance, the BTAA. All of their members are on board with the project, and nine out of the Big Ten, which are actually 14, pledged monetary support not only for the duration of the project but for three years afterwards.

C

The main reason for that is that all those data sets are received in XML or JSON format, on a hard drive in the mail, and we argued in the project proposal, successfully, that it will be way cheaper for the big institutions to work with us in a consortium and membership model rather than try to recreate it for themselves.

C

We're also proud that this will be cheaper for the big institutions and will make it possible for everyone else. As a rough estimate, it costs institutions from half a million to a million dollars on average to start accessing the data. We also have the support of three out of the four Big Data Hubs. We are hoping that the fourth one will join us as well, and this is the beginning of a project that will hopefully span way beyond [unclear].

C

Quickly, a few words about IUNI. We are what I call a unique startup: a cross-campus, transdisciplinary institute inside of a big educational institution. I started there when I came from California, where I was involved in multiple other startups, and I have tried to keep the startup culture alive at IUNI, now for the fourth year. Our mission is to strengthen the theories, methods, analytical tools, and practice of network science, but also to foster a collaborative, interdisciplinary approach to understanding the complex problems of our society.

C

The startup culture here is important because IUNI was formed intentionally outside of every school at IU. We don't have a dean, we don't fight for indirects, we don't issue degrees, and we don't have students. What we do have is the chance, through projects like this, to try to bring collaboration to people who care about network science.

C

In general, we have a team of IT professionals, which I manage (I'm the IT director of the Institute), but we also have a team of research scientists who are free to choose their own research paths and who are intentionally kept in the same room with the IT staff, where all the noisy and messy ideas get bounced around and help us move forward.

C

An overview of the project goals. First, we set out to understand user needs and expectations through a wide collection of user stories and through product owner council meetings scheduled for the entire project duration.

C

One was just conducted at the University of Chicago, sorry, at the BTAA in Chicago, last weekend, and we came away with a lot of interesting guidelines. We also do as much outreach and communication as possible through meetings like this one. I've worked in IT for long enough to know that the main drawback of many IT projects is people like me, who like working with expensive, shiny toys, build them, and then let researchers use whatever was built.

C

We decided to take the opposite approach: engaging the research community and trying to build specifically what researchers are telling us they care about. So the first constituency of the project is people from informatics and computer science who care about application programming interfaces, notebooks, and access to raw data. These are usually folks who have lots of PhD students and postdocs at their disposal and need access to the data itself.

C

There is also a big science-of-science community who, although not as resourceful as the computer science folks, usually know how to access relational databases and cloud-native technologies and have very specific research questions to answer. And, finally, our tertiary constituencies are research libraries, instructors, and the general public.

C

What we have for all of them are three different approaches to the same data. For the first ones: access to the raw data, in the form that it comes from the Web of Science through Clarivate, or through the Microsoft Academic Graph, through US patent data, and so on. This is XML, JSON, and comma- and tab-separated files. We provide access to dynamic-schema and cloud-native technologies, like U-SQL on the Microsoft Azure side and Athena and Glue on AWS. Then we allow them to spin up their own Databricks or Spark clusters using HDInsight or EMR.
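To make the "dynamic schema" idea concrete, here is a minimal Python sketch of schema-on-read over JSON records: the field types are discovered while reading instead of being declared up front. The field names and sample records are invented for illustration and are not taken from any of the data sets mentioned in the talk; the cloud services named above do this at a much larger scale.

```python
import json
from collections import defaultdict


def infer_schema(lines):
    """Return {field: sorted list of type names seen} for JSON-lines input."""
    seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


# Hypothetical records; real vendor files have far richer structure.
SAMPLE = [
    '{"id": "p1", "title": "Network science", "year": 2015}',
    '{"id": "p2", "title": "Big data", "year": "2016", "cited_by": 3}',
]

schema = infer_schema(SAMPLE)
for field, types in schema.items():
    print(field, types)
```

A field that shows up with more than one type (here `year`) is exactly the kind of irregularity a fixed relational schema would reject at load time.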

C

If those terms don't mean anything to you, they are intended for people who do this on a daily basis. For the second type of researcher, we are currently engaging the Research Data Commons center here at IU to research the best databases and data sets to answer specific queries. Since the creation and gathering of user stories showed us how different and heterogeneous the research is, we figured the different queries to those data sets would be best answered by different technologies.

C

We plan, and currently have in the enclave, a relational database. We are bringing in a graph database and testing a few of those listed here (Neo4j, Dgraph, AgensGraph), and we also utilize the same cloud-as-a-service technologies mentioned above.

C

For people who don't know how to write SQL and Cypher queries, the main part of the project, or one of the main parts, is a guided query-building interface. The query builder, a web interface currently being designed, will allow researchers to generate and create their own queries and also, one very seldom discussed part of it, will be able to suggest the best technology. The user will have control over it at any point, of course.

C

Our initial tests show that relational databases answer some queries marginally faster, graph databases answer different kinds of questions in a way better and more comprehensive manner, and some of the distributed computing and cloud-native technologies are very useful in very specific data cases.

C

To understand why: the relational database that currently runs on a pretty big server at Indiana University was taking weeks to answer some queries that include multiple joins and the creation of citation databases, to the point that we had to take into account the regular maintenance of the big machines here at IU to make sure that the queries would be able to finish.

C

The same query, as a well-created distributed job on a cluster in the cloud, usually takes about 40 to 50 minutes, versus the few weeks on a relational in-house data store.
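The guided query builder and the citation joins described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table and column names (`papers`, `citations`, `year`, `journal`) are hypothetical, SQLite stands in for the relational store, and `suggest_backend` is a toy heuristic, not the project's actual selection logic.

```python
import sqlite3

ALLOWED_FIELDS = {"year", "journal"}  # hypothetical filterable columns


def build_citation_query(filters):
    """Turn {"field": value} filters into parameterized SQL.

    The query counts incoming citations per paper with a join between
    a papers table and a citations edge table.
    """
    clauses, params = [], []
    for field, value in sorted(filters.items()):
        if field not in ALLOWED_FIELDS:
            raise ValueError(f"unsupported filter: {field}")
        clauses.append(f"p.{field} = ?")
        params.append(value)
    where = f"WHERE {' AND '.join(clauses)} " if clauses else ""
    sql = (
        "SELECT p.id, COUNT(c.citing_id) AS times_cited "
        "FROM papers AS p "
        "LEFT JOIN citations AS c ON c.cited_id = p.id "
        f"{where}"
        "GROUP BY p.id ORDER BY times_cited DESC, p.id"
    )
    return sql, params


def suggest_backend(hops):
    """Toy heuristic (an assumption): multi-hop traversals go to a graph store."""
    return "graph" if hops > 1 else "relational"


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE papers (id TEXT PRIMARY KEY, year INTEGER, journal TEXT);
CREATE TABLE citations (citing_id TEXT, cited_id TEXT);
INSERT INTO papers VALUES ('p1', 2015, 'J1'), ('p2', 2015, 'J2'), ('p3', 2016, 'J1');
INSERT INTO citations VALUES ('p2', 'p1'), ('p3', 'p1'), ('p3', 'p2');
""")

sql, params = build_citation_query({"year": 2015})
rows = conn.execute(sql, params).fetchall()
print(rows)
print(suggest_backend(hops=3))
```

Filter values are passed as bound parameters and field names are checked against an allow-list, so a form-driven interface never splices user input directly into the SQL text.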

C

Another big part of the enclave, or the shared big data access gateway, is a federated login (this is already working and available to use), which allows us to pass authentication to the educational institutions themselves. This is, first, to make our life easier, but also to make sure we are able to restrict access to appropriate resources.

C

Most of the Big Ten Academic Alliance partners have now cleared access to Web of Science data from Clarivate Analytics. A few of them, however, have slightly different requirements than the rest. Based on the federated login and the institution that answers the login request, we can figure out, and further restrict, who has access to what data sets. The enclave includes proprietary data, like the Web of Science, and mixes it with open data like the Microsoft Academic Graph, US open data, a few newspaper publications, [unclear] records, and so on.
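A minimal sketch of the idea that the answering institution decides which data sets a user sees might look like the following. Everything here is an assumption for illustration: the identity-provider names, the data set keys, and the entitlement table are hypothetical, and a real deployment would read these from the federation's signed assertions, not from a dictionary.

```python
# Data sets open to every authenticated user (hypothetical keys).
OPEN_DATASETS = {"microsoft_academic_graph", "uspto_patents"}

# Hypothetical entitlement table keyed by the identity provider that
# answered the login request.
LICENSED_BY_IDP = {
    "idp.member-a.example.edu": {"web_of_science"},
    "idp.member-b.example.edu": set(),
}


def datasets_for(idp):
    """Data sets visible to a user authenticated by the given identity provider."""
    return OPEN_DATASETS | LICENSED_BY_IDP.get(idp, set())


def can_access(idp, dataset):
    """True if a user from this identity provider may query the data set."""
    return dataset in datasets_for(idp)


print(can_access("idp.member-a.example.edu", "web_of_science"))
print(can_access("idp.member-b.example.edu", "web_of_science"))
print(can_access("social-login.example.com", "microsoft_academic_graph"))
```

Unknown identity providers fall through to the open tier only, which mirrors the free tier described for users outside the member institutions.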

C

The other big part is the research asset commons, which gives people the ability not only to run queries, create data sets, use tools, and generate their own containers, but also to share and save metadata, queries, results, visualizations, [unclear], and so on. Again, through a granular security system, people will be allowed to save their own data so they can continue later, share data with the researchers who are part of their own group, share it with the entire organization, or, again, vendor license permitting, open it to the world.
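The sharing levels just listed (own data, group, organization, world) can be modeled as an ordered scale with a license cap on top. The sketch below is an assumption about how such a check could work, not the platform's implementation; in particular, capping licensed material at the organization level is a placeholder for whatever the vendor agreement actually permits.

```python
from dataclasses import dataclass
from enum import IntEnum


class Scope(IntEnum):
    """Sharing levels, ordered from narrowest to widest."""
    PRIVATE = 0
    GROUP = 1
    ORGANIZATION = 2
    WORLD = 3


@dataclass(frozen=True)
class Asset:
    name: str
    owner: str
    contains_licensed_data: bool


def widest_allowed_scope(asset):
    # Assumption: licensed material may not leave the organization.
    return Scope.ORGANIZATION if asset.contains_licensed_data else Scope.WORLD


def share(asset, requested):
    """Return the scope actually granted, never wider than the license permits."""
    return min(requested, widest_allowed_scope(asset))


licensed = Asset("citation_counts", "alice", contains_licensed_data=True)
open_asset = Asset("coauthor_graph", "alice", contains_licensed_data=False)
print(share(licensed, Scope.WORLD).name)
print(share(open_asset, Scope.WORLD).name)
```

Using an ordered enumeration keeps the rule to a single comparison: the granted scope is the smaller of what the user asked for and what the license allows.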

C

This will ensure reproducibility, replicability, provenance, and transparency of the data, since every process and permutation is, first, very well documented and, second, DOIs are issued at every step where data gets changed. This will allow for the creation of publication and education packages that can be easily traced by a single link to the enclave, or to the shared big data gateway, and that have not only the data but all the tools and libraries required to work with it.
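One way to picture an identifier being issued at every step where the data changes is a chain of records in which each step's identifier is derived from its parent and its parameters. The Python sketch below is an illustration only: a truncated SHA-256 hash stands in for whatever persistent identifier the platform issues, and the operation names are hypothetical.

```python
import hashlib
import json


def step_id(parent_id, operation, params):
    """Deterministic identifier for one processing step (stand-in for a persistent ID)."""
    payload = json.dumps(
        {"parent": parent_id, "op": operation, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


class Provenance:
    """Append-only log of the steps applied to one data set."""

    def __init__(self, source_name):
        self.steps = []
        self._record(None, "load", {"source": source_name})

    def _record(self, parent, operation, params):
        sid = step_id(parent, operation, params)
        self.steps.append(
            {"id": sid, "parent": parent, "op": operation, "params": params}
        )
        return sid

    def apply(self, operation, params):
        """Record a new step whose parent is the most recent step."""
        return self._record(self.steps[-1]["id"], operation, params)

    def lineage(self):
        """Operations in the order they were applied."""
        return [step["op"] for step in self.steps]


prov = Provenance("example_dataset")
prov.apply("filter", {"year": 2015})
prov.apply("aggregate", {"by": "journal"})
print(prov.lineage())
```

Because each identifier depends on the parent identifier, two analyses that share a prefix of steps share the same identifiers for that prefix, which is what lets a single link trace back to the inception of the data.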

C

If any of you have tried to replicate a data analysis using technology tools that are a few years outdated, you know the complexity of making sure you have the exact system that the researcher had. The research asset commons, with the unique identifiers, will ensure that you do.

C

You can trace every permutation and change to the data that was made since the inception, and you can use the published tools on your own data set or, vice versa, use your own tools on data sets that we already have. Yes, this is basically it, in a very quick nutshell; a lightning talk, I guess. This is our main diagram, with a few of the systems I already mentioned: the authenticated federated system. If you take a look, it includes a custom login.

C

We are trying to incorporate Google, Microsoft, and Facebook accounts as well, for those who are not part of educational institutions but still want to access the somewhat limited but available free tier of the enclave. Then there is the research asset commons, where people will be able to share and save their data, and the web query interfaces that will allow access to various resources and different kinds of databases.

C

There are all the tools under the research asset commons, which I don't have highlighted, which allow people to start either notebooks or labs or, through an API, use their favorite programming tool to access the same data. And on the right side (I have to move my camera window here), as mentioned, we have the raw data format of the seed data sets: the Web of Science, the Microsoft Academic Graph, and USPTO data. We hope to be able to increase this exponentially once we make it work with multiple data sets as well.

C

This is the first question I get from every library that has seen the project: Can we include this data set as well? Can you put in PubMed? We just got a request from the IU School of Music to process some really old musical scores, with the ability to search in Latin.

C

All of this is being processed into multiple databases, again relational and graph databases at this point, not to exclude any other technology.

C

[unclear] and also, in the future, allow access to local institutional resources. This is the reason I have been following the Open Storage Network since the beginning and am very interested in it, because technologies like the Open Storage Network will make it possible for institutions to use their own compute infrastructure and pull those huge data sets up and down. I think this is it from me. On behalf of myself and the rest of the leadership, I want to thank you for the opportunity. Here are a few ways to contact us.

C

One last thing that I forgot to mention, which was just announced at the product owner council: we're starting a CADRE fellowship program for anyone who is interested in using our data sets. You can go to cadre.iu.edu and find the fellowship program link. Please distribute this to your students and researchers who are interested. We will be giving access to preliminary data to all our fellows, and we'll also be able to sponsor six of them, six teams, to come present with us at ISSI 2019 in Rome.

C

We are doing a workshop and a tutorial, almost a full-day event, at this very prestigious informetrics conference in 2019.

A

Is that wasn't a pun.

B

May I ask a question? (Sure, please.) This is a great presentation. So, you know, I work with Trusted CI at IU, and also with the Research Security Operations Center, and one of the things we're looking at, because I work on accelerating cybersecurity research into practice, is getting the data sets that cybersecurity researchers need and giving them access to those. So there are two different problems. One is actually finding people who can share cybersecurity-related data, you know, who are willing to do that, static and dynamic, and the other is actually making it available.

B

So is that something that would fit into your framework or not? Because when you say libraries...

C

It would.

B

Okay, that's interesting.

C

The project is a two-year project, which we honestly believe we will be able to keep working on in perpetuity through the membership model, but for the duration of the project we promised those three data sets that I mentioned before, and we'll have to deliver on those. The idea, however, is that this will pave the way for expanding the resource to multiple other data sets. And again, this is the first question I get: everyone has data sets they want to share, or at least make available to their users.

C

Each of them, by the way, we've learned, has its own challenges and requirements, so I'm sure the cybersecurity ones will be very different from the mostly open Web of Science, but we are very open to taking a look. (Okay, okay.)

B

This is Jay.

B

It's a great presentation today, thank you. You mentioned licensed data sets. In your context, and I'm also thinking about security data sets, what do you mean by licensing?

C

The Web of Science was originally licensed by Thomson Reuters, then Clarivate Analytics, and I just saw that they renamed themselves again, to Web of Science Group.

C

It costs a few hundred thousand dollars to have, and once you have it, again, you're delivered a hard drive packed with XML. So originally we were one of the few institutions that had purchased it, here at IU, and we knew about other labs, at UIUC, at UC, and the rest, that had purchased their own.

C

With all of them, we tried to open some sort of collaboration and were stopped by the vendor, who told us we cannot share data, although we had each licensed and purchased those data sets. Then the BTAA, the Big Ten Academic Alliance, took on a collective agreement with Clarivate to provide it to all of its members, and now we are allowed to share it. That is what I mean by the licensed data set.

C

So once you log on to CADRE, if you are in the network of the Big Ten, you will be able to access the licensed data set, the Web of Science. If not, we also have a free version of it, which is the Microsoft Academic Graph. That is an open data set and can be accessed by anyone.

C

We hope to have an even more granular security system, so that, beyond the Web of Science, data sets licensed by a certain lab in a university, or by multiple universities, can be brought in as well.

B

Could I ask a question that's sort of related? You had a box in the diagram that was labeled, I think, granular permissions. How do you manage those permissions, or granular data set permissions? So if I want access, and I have rights to certain things, how do I communicate those rights, or how do you find them out?

C

Well, there's a box of rights; most of them are data sets you have access to. Again, this is a very young project: it was supposed to start in September and actually started in January. Currently we base this on the institution. We know that the University of Iowa has access to the Web of Science; Iowa State, however, does not. So, using the federated authentication system, we pass the authentication to the appropriate institution and get the answer back. If you log on as part of IU or UI, you get access to the data.

C

If you log on as part of Iowa State, you don't. A second form of permissions is when people are allowed to create their own teams, assign them to projects, and then share data with them: people in my lab, people in my university, or I want to share this with everyone. We allow users to do that. However, there is another check: is there proprietary data, which is not allowed to be shared? If anyone wants to make derivatives of their work based on the Microsoft Academic Graph, which is an open public resource, we would allow this.

C

If it would include the Web of Science, we won't, until further review by the vendor.

B

That does make sense. Just one little follow-up: how do you define a team? Is that something that's an internal definition to this project?

C

Yes, it is an internal definition. And again, this is not ready yet; this is planned to be there. But the idea is that I can invite people to join me, or I can add to my team people who are already part of CADRE. (Thank you.) So we would allow multiple people to be members of multiple teams, which do not need to be limited to the same institution. To collaborate, they should be able to form teams that are cross-institutional. (Thanks.)

A

Any other questions?

C

Well, thank you very much. I will share the link to the CADRE fellowship in the chat. Please take a look and spread the word as much as you can. We need all the information we can get, because, again, we have working proofs of concept that we demo (by the way, the link I will share will have another link to the recorded video demonstration we did a few days ago), but this is a two-year project in its infancy that we are shaping as we go, hopefully with the feedback of the researchers who use this data.

A

Super. Well, I know I'll be following up with you, and I'm sure you're connected to Melissa already on the OSN. So where we're able to help, we look forward to connecting with you on that, and I will definitely let people know about your fellowships; I think they will be very sought after. One final thank you again, Val, for your presentation and for taking time on a Friday afternoon. That was really interesting, and we appreciate your time. Thank you.

C

Oh and thanks for your support.