South Big Data Hub Data Sharing & Infrastructure Group, 11 Nov 2016

From YouTube: CI WG demo: HIVE (Helping Interdisciplinary Vocabulary Engineering)

Description

Date: 11/11/16
Presenter: Jane Greenberg
Institution: Drexel University
Northeast Big Data Hub

A

You know, acknowledgments first: there are a lot of people involved in it. Hi folks. I may just move on here. HIVE is about making data discoverable, interoperable, usable, and reusable. It's a technology and an approach, and the focus is all on what we call semantic metadata.

A

You can think about indexing data, or finding semantic metadata to represent data, and rarely is one controlled vocabulary or semantic ontology enough. For instance, you may have a data set that needs geographic information, taxonomic information, and topical information, and you may also need multiple types of topical information. That's part of the motivation. The next is that your taxes and my taxes are paying for the creation of ontologies.

A

There are hundreds of ontologies out there. A lot of them are being created by federal agencies, and increasingly they're being published on the web in publicly released data formats and made available for machine activities. So HIVE takes advantage of that, and of the fact that human indexing is costly when one person individually curates while working with multiple vocabularies. HIVE uses SKOS to integrate these existing ontologies, or ontologies that you may be creating in-house, in a linked data, Semantic Web type of format. So the acronym, I love it.

A

You could think about the little bee going out to the different vocabularies and bringing back the right term or concept to semantically represent your data. This is just an example of the architecture, and I think I pretty much said this already, but HIVE allows you, as researchers, to draw terms from multiple vocabularies or semantic ontologies. It's done using machine learning. I'm going to say a little bit more about HIVE 1.0, which worked with KEA++, which is a machine learning

A

Algorithm that came out of the University of Waikato, and Joan is going to talk more about the new work. We're moving right along, so that you could actually plug in these different machine learning algorithms, depending on your types of data. HIVE 1.0 is dating back a while here, but this is just to give you a picture of it. On the home page here, you can see that you can register multiple vocabularies, and this is the HIVE demo. You can take

A

The HIVE technology and approach and put it into your own system, and it can look very different. For instance, the LTER, the Long Term Ecological Research Network, uses HIVE technology daily, and their interface does not look like this. We have several different vocabularies here.

A

The next here is an example of the concept browser, where you could select multiple vocabularies and search them in one fell swoop for your concepts. So here's a search for the word precipitation in AGROVOC, which is a vocabulary of the UN Food and Agriculture Organization, and in LCSH, and you see the keyword in context for precipitation. Then there is the third approach, which is the automatic indexing approach. Follow those little green dots here in HIVE 1.0: one, you select your terminologies; two, you choose your file, or you can put in

A

A URL, if it's something on the web; and three, you start the processing. You can see that HIVE has been around for a little while, and the original demonstration grant that we had was actually from IMLS, with the Dryad project. So here's just an example.

A

A scientist has published an article and they're putting their data into the Dryad repository. I'm just going to assume that people have heard of Dryad, but it's a repository for data underlying published research, and I have been very involved in that. So the scientist goes to HIVE, selects the vocabularies, in this example just three vocabularies, and comes out with a tag cloud. Some of these terms are very good, and clicking on them and adding them to the metadata to represent the data is good.

A

Some of these terms are not so good, and that has to do with the machine training, which we could do a better job at. Just in the interest of time, here is an example of what the original architecture looks like, combining several open source technologies. It was Java-based originally. What you've been looking at is the original demo site, and we will give you an update there. The code was originally on Google, and we've moved pretty much all the way over to GitHub, and I'm

A

Sorry that I don't have the correct URL there. Mike is going to say a little bit about HIVE in iRODS, actually, in a minute. Just to show you, Joan has a demonstration at RENCI, and we just put in several ontologies here. When you think about the time it takes somebody to look through multiple vocabularies, it's ridiculous. We should be doing this stuff

B

Is always really readable.

A

And we should be able to have some kind of interaction here with at least person, ultimately stuck in return. So Mike do you want to just go in your life now for human.

C

Does terrible don't.

A

So we're doing taxing and.

C

I actually sort of did this in the previous meeting, so this is essentially the same setup, but I want to just put in context how HIVE fits and where we see it in the system. So next one there, Joan. The whole idea here is that on the DFC side, at the core is the iRODS server. It's not really good for doing discovery, but what it's really good at is being the canonical data and metadata repository.

C

So you can have policies that control both of those things, so the stuff is always there, and then we can treat indexes as ephemeral. We can project different parts of the grid out to different indexes. So you might have this collection being indexed through Elasticsearch.

C

You might have another collection that is indexed and put into a triple store. This is where HIVE fits in, specifically in this area of semantic metadata and linked data, and also being able to have triple store representations of the data inside the catalog. So that's the point of that, and then the next slide. Where HIVE comes in is that you can have policies that describe metadata templates, so any data in this collection has to be curated with certain kinds of metadata, which can include curation.

C

For example, you must apply N number of terms from these vocabularies, or these vocabularies are suggested for human curation. The idea would be, okay, we want AGROVOC and we want MeSH as available vocabularies, and curators can be cued to search across those vocabularies to find applicable terms. They're applied to the data, and then on

C

The application of those terms, on the editing of metadata, or on the ingest of new data, automatically extracted terms can be applied from HIVE, which can then trigger indexing into, for example, a triple store. Next slide. Okay, the whole point of this, really the nugget here, is this idea of virtual collections. This is based on the idea that you can federate these grids, so we have a global namespace, and here we're using HIVE and marking up collections with semantic terms from vocabularies that the HIVE service provides.
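The metadata-template policy described above (a collection requires some minimum number of terms from named vocabularies) might be checked along the following lines. This is only a sketch under stated assumptions: the `TemplatePolicy` class, its fields, the `check_policy` function, and the sample terms are hypothetical illustrations, not part of the iRODS or HIVE APIs.

```python
from dataclasses import dataclass, field


@dataclass
class TemplatePolicy:
    """Hypothetical policy: how many terms a data object needs per vocabulary."""
    required: dict = field(default_factory=dict)  # vocabulary name -> minimum term count
    suggested: tuple = ()                         # vocabularies merely offered to curators


def check_policy(policy, applied_terms):
    """Return a list of readable violations; an empty list means compliant.

    applied_terms maps a vocabulary name to the terms applied to one data object.
    """
    problems = []
    for vocab, minimum in policy.required.items():
        count = len(applied_terms.get(vocab, []))
        if count < minimum:
            problems.append(f"{vocab}: {count} term(s) applied, {minimum} required")
    return problems


# Illustrative policy and curation state (made-up values).
policy = TemplatePolicy(required={"AGROVOC": 2, "MeSH": 1}, suggested=("LCSH",))
terms = {"AGROVOC": ["precipitation"], "MeSH": ["Rain"]}
print(check_policy(policy, terms))
```

In a real deployment the check would presumably run as a server-side rule on ingest or metadata edit; here it is a plain function so the logic is easy to see.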

C

HIVE provides, again, the tools for human curation, as well as tools for automatic extraction of terms using natural language processing, which seems to fit really nicely into what Brown Dog is contemplating. But the idea is that you can federate with another site that is also using vocabularies applied to their collections.

C

These two institutes can then federate together, and you can index across the data in both those collections and project that into an index. Then, for example, you can do a SPARQL search and actually see data that is spread across distributed collections, and using a tool like SPARQL they appear and behave as one collection. And I think the really cool thing is for the folks that were at the Northeast Hub.

C

What I'm most excited about is taking this exact concept that Jane is talking about and doing things like applying data sharing agreements as computer-actionable policies between these nodes in the federation. So you can have one collection that, to the outside user, looks like one coherent collection, but it's actually distributed geographically between organizations that maybe have agreements in place. And then you can find those terms using tools that real people use, because real people are not going to use the dumb query language in iRODS; they're going to use SPARQL or Elasticsearch, or something like that.

C

And so again, I've only mentioned HIVE in passing, but the idea there is that this is why we were interested in HIVE, because if we didn't have a tool like HIVE we would have to build it. And so then Joan is going to talk about how we're going to save the world.

C

Short and sweet: that's where we see the very important place of HIVE in doing rich search. So you can search by narrower terms or parent terms or things that are like this term, and how amazing is that? Okay.

B

I don't know about saving the world here. Tough act to follow there. What I do want to talk about primarily is what we're doing now with the current version of HIVE, 2.0. This is basically a rewrite and redesign of HIVE 1.0 that we started.

B

These are just snapshots of what Jane showed you in the 1.0 version. This is a Java implementation that runs on a Tomcat server. There are probably, I don't know, at least four different types of data stores underneath this, and multiple APIs, and needless to say it is rather complex. One of the obstacles that I think we ran up against was that it was becoming increasingly difficult to enhance, improve, and add capability to HIVE 1.0, and one of the reasons for this was, in part, when this was developed.

B

This version of it, I think, was wrapped up around 2011 or 2012. A lot of those Java libraries that it's built on are outdated, and the source code for some of them isn't even available. So the effort to upgrade all of these libraries is significant, that and the fact that it is not a very extensible, modular design under the covers. The other issue that came about, and this is one that I think Jane encountered, was that we were running into difficulty finding Java-skilled programmers.

B

It seems these days that everyone is jumping on the Python bandwagon, and probably with good reason. It is the class I teach over at UNC. It is a relatively easy language compared to Java. It is an interpreted language, unlike the compiled Java language, but I have to say it is easy to pick up. I

B

Think it's going to be easier to find folks to work on this. The other thing that's really important is that there are an awful lot of really good Python libraries emerging to handle natural language processing, so there's no shortage of library support for the Python language. So we made the decision, I guess it was

A

Really AG another.

A

Have metal discussion and.

B

They would job in time that a.

A

In that area, in Python.

B

So a lot of it, though, was just addressing the skills availability to keep going, to be able to extend, and to be able to have more of a plug-in architecture for new algorithms and vocabularies. One of the things that has always been an issue is being able to import vocabularies into HIVE 1.0. It's a difficult, time-consuming, somewhat painful process.

B

It is not something that you can easily hand off to someone who does not have development experience, so that is something we've always been focusing on. It has been simplified a fair amount, because more of these libraries do seem to be supporting the RDF format, the Resource Description Framework format. If you can at least have a common format, that's a substantial head start toward being able to simplify this import process. And also the data model: HIVE 1.0 had four different data stores.

B

We have one. That doesn't mean that we won't need to add others as we continue with this, but start small, start fast, make it work. The other thing that we need to do is improve the web services. There is a REST API for HIVE 1.0. That is what we have provided to Mike, so that API is available, but the intent here is to do the same thing with HIVE 2.0 and improve the interface. The one for the Java version is a little bit cumbersome.

B

It could be a great deal more streamlined for the non-developer to use. And then, as I mentioned before, one of the things that drove the decision to redesign in Python was the availability of all these tools and libraries, i.e., a whole lot less code for us to write. Okay.

B

As for what's already out there: HIVE 1.0 runs on Tomcat, which is a very powerful web and application server. There is a very lightweight web framework available in Python called CherryPy. This was recommended to me by another faculty member at UNC who has used it before. So far, so good. The Drexel administrative support has been wonderful in getting this up on their server, so it's working pretty well so far.

B

There is the Natural Language Toolkit (NLTK), which does a lot of the heavy lifting in terms of actually processing documents: doing the parsing, the stemming, the filtering, the classification. So we don't have to do that.

B

This version of it does not use the KEA or Maui algorithm that was used in 1.0. That is a very robust algorithm. I think it looks at something like 15 different features of each keyword that it extracts in order to rank the relevance

A

Relevant education, as.

B

Of a keyword for metadata generation. So we didn't use that. We did actually contact them to see if we could find out about a Python version, but that wasn't in the works. That would be a good thing to do, to try to get a Python version of that algorithm. But the author of that had actually taken a look at this RAKE algorithm, and she said that she thought it was pretty good, but it needed some enhancements. But RAKE is fast.

B

It probably only works with maybe three features when it tries to evaluate the ranking of a term, but that's a good start, and the code is available there, so you can add what you want to it. It's basically open source. RDFLib has been invaluable, because the libraries, I'm sorry, vocabularies that we've imported into HIVE

B

2.0 so far were all in RDF format. RDFLib did most of the work; we didn't have to write anything. And then SQLite, which is a very lightweight relational database that you probably all have on your phones and on your laptops, it's pervasive, is what is being used here for this version of HIVE 2.0. This is very much modular, with an API, so if we had to scale up to a larger database, that can easily be done, but we're starting out with SQLite.
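To make the RAKE discussion above more concrete, here is a minimal sketch of RAKE-style keyword ranking in plain Python. It assumes the commonly described form of the method, in which candidate phrases are split at stopwords and punctuation and each word is scored by degree divided by frequency. It is not the HIVE 2.0 code: the stopword list is a tiny illustrative stand-in, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; a real run would use a much fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "is", "are", "with", "by"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9\-]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text):
    """Score each phrase as the sum of degree(word) / frequency(word), best first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences within the phrase, itself included
    scores = {}
    for phrase in phrases:
        scores[" ".join(phrase)] = sum(degree[w] / freq[w] for w in phrase)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


sample = "Annual precipitation and stream discharge in the long-term ecological research sites."
for phrase, score in rake_scores(sample)[:3]:
    print(f"{score:5.1f}  {phrase}")
```

Because the score depends only on word frequency and phrase length, longer phrases tend to rank higher, which is one reason a production system would likely add further features or filter candidates against a controlled vocabulary.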

B

So these are intended to be just a few screenshots to show you how, at least from this demo page, it mirrors the same functionality as in HIVE 1.0. This is the home page. If you have the slides, when you click on the blue link, it will take you to the demo. This is a list on the home page of the vocabularies that have been imported into 2.0. It's a little hard to read, but it's the Unified Astronomy Thesaurus, I think the next one is the US Geological Survey.

B

There is a metals vocabulary that one of Jane's students created, and then the last one has to be the smart one. Yes.

B

Yeah, that was one of the ones you gave me. So all these vocabularies were in RDF format. You go to these websites, and they give you a link to download the file. I wrote some Python code to use the RDFLib library to process these. Basically it goes through, parses all the content, and generates actually a very large graph database, a graph database of triples. The triple structure defines the concepts and their relationships.

B

It's also a pretty big database. My first test was actually with the Unified Astronomy Thesaurus, and that has, I think, something like eighteen hundred concepts, and I don't remember the number of relationships, but the graph database that got generated was about ninety meg. It also had thirty thousand plus triples. So I started looking through it and thinking, this is an opportunity for improvement here, ninety meg is kind of big. And I realized an awful lot of the triples have nothing to do with what I was doing.

B

It was really kind of extraneous information for our purpose, so I did pull those triples out. Then I also took the graph database and converted it to a relational database, a very simple database with three tables, I believe, and when I did that it became 1.1 meg, with four thousand triples, and it's fast.

B

That's not to say that future requirements might not mean that some of those triples have to come back in, but it's not a difficult thing to do the conversion from graph to relational. So that was actually one of the nice improvements from HIVE 1.0, where we got it down to one database, a small database, because some of the HIVE 1.0 databases are quite large.
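The graph-to-relational conversion described above could look roughly like the following sketch, which keeps only label and hierarchy triples and loads them into three small SQLite tables. The talk does not give the actual schema, so the table names, column names, predicate names, and sample triples here are all assumptions made for illustration; the triples are assumed to have been parsed from RDF already.

```python
import sqlite3

# Hypothetical, already-parsed triples: (subject, predicate, object).
TRIPLES = [
    ("c1", "prefLabel", "Cosmology"),
    ("c2", "prefLabel", "Big Bang theory"),
    ("c2", "altLabel", "Big Bang"),
    ("c2", "broader", "c1"),
    ("c2", "modified", "2016-01-01"),  # extraneous for browsing and search, so dropped
]

LABEL_PREDICATES = {"prefLabel", "altLabel"}
RELATION_PREDICATES = {"broader", "narrower", "related"}


def build_database(triples, path=":memory:"):
    """Load only the triples needed for browsing into three small tables."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE concept  (id TEXT PRIMARY KEY);
        CREATE TABLE label    (concept_id TEXT, kind TEXT, value TEXT);
        CREATE TABLE relation (concept_id TEXT, kind TEXT, target_id TEXT);
    """)
    for subject, predicate, obj in triples:
        if predicate in LABEL_PREDICATES:
            db.execute("INSERT OR IGNORE INTO concept VALUES (?)", (subject,))
            db.execute("INSERT INTO label VALUES (?, ?, ?)", (subject, predicate, obj))
        elif predicate in RELATION_PREDICATES:
            db.execute("INSERT OR IGNORE INTO concept VALUES (?)", (subject,))
            db.execute("INSERT INTO relation VALUES (?, ?, ?)", (subject, predicate, obj))
        # Any other predicate is treated as extraneous and skipped.
    db.commit()
    return db


db = build_database(TRIPLES)
print(db.execute("SELECT COUNT(*) FROM label").fetchone()[0], "labels kept")
```

Keeping the filter as two explicit predicate sets makes it easy to bring dropped triples back in later if requirements change.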

B

This is the Browse Vocabulary page, again similar to 1.0. You select a vocabulary, and on the left-hand side you're going to see a hierarchical view, a tree structure. All of these vocabularies that come from RDF are hierarchically organized. Most of them are organized around top concepts, so the vocabulary definers or authors themselves define the top concepts of the vocabulary, and we use those as the starting points. There usually aren't very many of them, maybe 20 or 25 of those.

B

In this example, you can see that I've used the USGS one. I highlighted a term in there. When you click on it, what you're going to get on the right is the detailed information. So from RDF you get your preferred label for a term or a keyword, you get alternate labels, you get notes if they're provided, broader concepts, narrower concepts, and related concepts, so you can then select inside of those. I'm not sure if I'm going to lose one.
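The browse view described above, a tree that starts from top concepts plus a detail pane with labels and broader and narrower concepts, can be sketched as follows. The concept records, their identifiers, and the hierarchy are made up for illustration and are not taken from any real vocabulary; the function names are likewise hypothetical.

```python
from collections import defaultdict

# Hypothetical concept records shaped like SKOS data: labels plus broader links.
CONCEPTS = {
    "c1": {"pref": "Cosmology", "alt": [], "broader": []},
    "c2": {"pref": "Big Bang theory", "alt": ["Big Bang"], "broader": ["c1"]},
    "c3": {"pref": "Cosmic inflation", "alt": [], "broader": ["c2"]},
}


def narrower_index(concepts):
    """Invert the broader links so each concept lists its narrower concepts."""
    index = defaultdict(list)
    for cid, record in concepts.items():
        for parent in record["broader"]:
            index[parent].append(cid)
    return index


def tree_lines(concepts, root, index, depth=0):
    """Return indented preferred labels for the subtree under root."""
    lines = ["  " * depth + concepts[root]["pref"]]
    for child in sorted(index.get(root, []), key=lambda c: concepts[c]["pref"]):
        lines.extend(tree_lines(concepts, child, index, depth + 1))
    return lines


def details(concepts, cid, index):
    """Collect what a detail pane might show for one selected concept."""
    record = concepts[cid]
    return {
        "preferred": record["pref"],
        "alternate": record["alt"],
        "broader": [concepts[b]["pref"] for b in record["broader"]],
        "narrower": [concepts[n]["pref"] for n in index.get(cid, [])],
    }


index = narrower_index(CONCEPTS)
# Top concepts are taken here to be the ones with no broader concept.
top_concepts = [cid for cid, rec in CONCEPTS.items() if not rec["broader"]]
for top in top_concepts:
    print("\n".join(tree_lines(CONCEPTS, top, index)))
print(details(CONCEPTS, "c2", index))
```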

B

This is an example. This is the actual demo, with the Unified Astronomy Thesaurus. If you look at the Big Bang theory, you can come over here and find a little extra information: the broader concepts, and these are narrower concepts. You can go through, and basically, if you're looking for particular terms and you want to confirm that the term being used for metadata is correct, you can get such detailed information about it.

B

Is nice yeah I mean.

B

Running off trucks, it's Ted over and.

A

Joan just launched this last week.

B

Well, I had much help. I have to say the sysadmin at Drexel has been wonderful, because we had some logging issues that were not related to the code. But we've got five minutes left. Okay, searching vocabularies, and I didn't talk about this before. This is just a case where you can pick the vocabularies you want to search, and you enter some kind of concept, keyword, whatever. It does a wildcard search, and it's going to list over here the different vocabularies that have a concept that contains that.
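A wildcard search limited to the selected vocabularies, as described above, could be expressed against SQLite roughly like this. The single `concept` table with `vocabulary` and `label` columns is an assumed, simplified schema rather than the actual HIVE 2.0 one, and the sample rows are only illustrative.

```python
import sqlite3


def search_concepts(db, keyword, vocabularies):
    """Substring (wildcard) search for keyword, limited to the chosen vocabularies."""
    if not vocabularies:
        return []
    placeholders = ", ".join("?" for _ in vocabularies)
    # Escape LIKE wildcards in the user's input so they match literally.
    escaped = keyword.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    sql = (
        "SELECT vocabulary, label FROM concept "
        "WHERE vocabulary IN (" + placeholders + ") "
        "AND label LIKE ? ESCAPE '\\' "
        "ORDER BY vocabulary, label"
    )
    return db.execute(sql, [*vocabularies, "%" + escaped + "%"]).fetchall()


# Illustrative data only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE concept (vocabulary TEXT, label TEXT)")
db.executemany("INSERT INTO concept VALUES (?, ?)", [
    ("AGROVOC", "precipitation"),
    ("LCSH", "Precipitation (Meteorology)"),
    ("LCSH", "Rain and rainfall"),
    ("MeSH", "Chemical Precipitation"),
])
print(search_concepts(db, "precipitation", ["AGROVOC", "LCSH"]))
```

SQLite's `LIKE` is case-insensitive for ASCII text by default, which is why the lowercase query also matches the capitalized label.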

A

Replaces the tense attack that yeah and.

B

Then here's indexing, and again this is very similar to what we were doing before. You select the vocabularies of interest, you enter the URL for the web resource or the document, and click Index. This is the word cloud, and it shows which keywords were selected. It's like a regular tag cloud, where the font size of the keyword reflects the ranking.
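One simple way to turn keyword rankings into tag cloud font sizes, as in the word cloud described above, is a linear mapping from score to point size. This is a sketch of that general idea, not the demo's actual rendering code; the function name, the size range, and the sample scores are assumptions.

```python
def font_sizes(ranked_terms, min_pt=10, max_pt=32):
    """Map each term's score linearly onto a font size between min_pt and max_pt."""
    if not ranked_terms:
        return {}
    scores = [score for _, score in ranked_terms]
    low, high = min(scores), max(scores)
    span = high - low
    sizes = {}
    for term, score in ranked_terms:
        # When all scores are equal, fall back to the midpoint size.
        fraction = (score - low) / span if span else 0.5
        sizes[term] = round(min_pt + fraction * (max_pt - min_pt))
    return sizes


# Made-up (term, score) pairs standing in for indexing output.
print(font_sizes([("precipitation", 16.0), ("stream discharge", 4.0), ("rainfall", 10.0)]))
```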

B

Lastly, and this is just my own working set of next steps, I do need to harden the web services API. It is a REST API. Right now the browser has a lot of JavaScript that issues Ajax calls to the REST API; the back end talks to Python, Python talks to the database and generates JSON that gets sent back, and it gets converted into the web page. That API does need to be hardened for use

B

By, you know, Mike's infrastructure, but I haven't done that yet. That's coming. Okay, I do want to add machine learning, because that was in the original version, though.

B

Are libraries without the plug-in right and.

C

Then we need to get that into Brown Dog, just to show how these things can flow into each other. That's what I really hope for.

A

We have Howard get to the whoops on this Thanks. You can't have it back in either kind of I.

C

Don't think I wanted to.

A

So, with five minutes left on a Friday here,.

A

A relief great thank you is wonderful.

A

Much does anyone have any questions or for the three presenters, so we can't hear you one genetic.

A

The mics everyone on the lines.

C

It's not exactly.

A

HIVE 2.0, does it still have the capability where you could put, like, a paragraph in and it would tell you which ontology it's most likely to be?

B

It will, but it will be sort of a a wild card collection of it, but it.

B

Focuses on key words and phrases as opposed to a paragraph.

A

No, but you put that paragraph in and it will parse it, right?

B

A paragraph has in it being the document from which you're ganna do. Yes, it.

B

Get a text from so how many yeah? Okay, let's try.

A

Moving up I've.

B

Only got four at the moment.

A

Yeah you're always thought we should load in a sound.

A

That's right well,.

A

The thing about HIVE to remember is that you can have your own HIVE. Anybody could take the

A

Approach and the technology. As I said, the LTER folks use HIVE, and, you know, I don't know how often they use it, I could check again, but they have their own HIVE. It looks very different. Brian has an instance of HIVE, and it's a prototype. There's some legal group in Italy that's using it, and they have their own vocabularies. But the demo HIVE could be a service. You know, we had lots of vocabularies.

B

So, and that is a next step, my last one: add more vocabularies.

B

Didn't need to put in Library of Congress and see how that works. That's that mesh neuroscience.

A

That's my particular interest, among others. Any other questions? I will have questions, and I know you and I are following up afterwards. I do want to say, we right now have on the books Mike Conway and Kenton doing additional demos on December 9th. I'm going to be out of town, with the IEEE summits and other conferences, and given the holidays, would people want to postpone those demos to the new year? Okay, Mike has nodded yes. Kenton,

A

Do you feel okay postponing the MDX to the new year? Carl can work with you to set up a date.

C

Fine by me. Excellent.

A

All right, well, I want to thank everybody. If you still have questions for the presenters here, feel free to continue; it's just that I have a next call and have to jump on. Thank you all. This is fantastic. I really
