South Big Data Hub Data Sharing & Infrastructure Group, 17 Feb 2017

From YouTube: CI WG demo: Biodiversity Open Data Landscape

Description

Date: 02/17/17
Presenter: Jennifer Hammock
Institution: Smithsonian Institution
South Big Data Hub

A

Today, we're really excited about the two presentations we're going to have. The first is Dr. Jennifer Hammock from the Smithsonian Institution, and she is a project manager at the Encyclopedia of Life. A lapsed chemist and marine biologist, she has been learning biodiversity informatics on the job for four years. Jennifer and I met, maybe that was the beginning, four or five years ago, when you were starting on the road of citizen science and crowd-sourced data collection as well. And Dr.

A

Matt Spitzer will follow, with the Open Science Framework for connecting institutional research. Matt is a community manager at the Center for Open Science, a nonprofit tech organization building open source public goods infrastructure to improve the transparency, openness and reproducibility of research. And I think I forgot to mention, I'm Lea Shanley, for those who may not know me, the co-executive director of the South Big Data Hub. So welcome, all, on a beautiful Friday afternoon. We're very glad to have you here.

A

Jennifer, I forgot to mention, her title is "Biodiversity Open Data Landscape: A Beautiful Mess." And with that, Jennifer, please take it away. Thank

B

you, Lea, and thank you, everyone, for coming. I haven't been to as many of these meetings as I'd like, but I try to keep up on the Slack, and I think this is a little different from most of the presentations you've had. I've seen a lot of solutions presented, and I am, I guess, going to have more of a use case representing my research community and our data. I do have one or two things to crow about, but mostly, I think it turns out, I'm bringing you my problems.

B

So thank you for coming, and I hope you will have lots of feedback and suggestions for me to pass on to my people.

B

So the first thing to mention about the biodiversity data sphere is, as big data goes, it's not that big, but it is messier than most big data, at least that I have read about. So you're talking about,

B

you know, some two million species, less than a trillion records, in terms of a thing that has a bunch of metadata in it but represents one simple occurrence. They come from two sources, primarily. There's a long history of natural history work, a lot of museums and herbaria, that kind of institution, going back a long way, and the numbers that indicate the messiness are things like this: 200 years.

B

So imagine research being done over a period of about 200 years, since sort of late in the career of Carl Linnaeus, who started naming organisms, and you can imagine the variety you might get in recorded data. And a lot of it is written down. The vast bulk of the breadth of information we need is written down, so it has only recently been digitized and is being reborn digitally, increasingly.

B

Some of it is born digital nowadays, but we are primarily concerned, actually, with the old stuff, because the new stuff to a greater extent takes care of itself. Our job in my community is to knit everything together, including historic data, which is important for things like climate change and other global trends, for which you require a baseline to compare what's going on now with what we would have expected without human intervention.

B

So: a lot of old data, a lot of data that's inherently complex, but again we're only talking, for the most part, in the millions of records. I'm not going to delve, for those who are wondering, too much into barcodes or other molecular data. We touch that sphere, but that's another sort of born-digital, much more organized subsphere of biodiversity data, so most of the problems are outside of that area.

B

One of the two big sources, the museum specimens, looks something like this. This is a carefully staged but otherwise realistic depiction of museum specimens, the material from which the data originate.

B

This is the first department, at the Smithsonian here in DC. Same museum, different department: this is entomology, and again these are their prettiest specimens, but this is representative of the depth and numbers. And finally, this is the invertebrate zoology department, which is everything in the sea that's not a fish or a mammal, pretty much, so this is perhaps the widest evolutionary variation that you see in the

B

These guys are one big source of the raw material that we're working with. The other one is born digital and often online, and it goes by many names. Citizen science is probably the most familiar. This is a snippet of activity from a platform called iNaturalist. This is where ordinary humans, who may be professional biologists but usually are not, go and identify an organism and place it, with a photograph to voucher what it was, on a map with a timestamp.

B

This is a representation of iNat data at the global scale, but to give you a sense of density, this is our neighborhood here in DC. This is the typical density of pins you'd find in an urban area on a platform like iNaturalist, of which there are maybe a dozen sizeable ones over the globe and a couple of hundred smaller boutique platforms

B

assembling this kind of data. And again, each of these records is primarily a species name, a location, a date stamp and something to vouch for it, usually a photo.
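
The record shape described here (a species name, a location, a date stamp and usually a photo) can be sketched as a small data structure. This is only a minimal illustration: the class name, field names and sample values are assumptions made for the example, not the schema of any platform mentioned in the talk.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Occurrence:
    """One observation: a species at a place on a date, with optional evidence."""
    species: str                        # scientific name as reported
    latitude: float                     # decimal degrees
    longitude: float                    # decimal degrees
    observed_on: date                   # the date stamp
    voucher_url: Optional[str] = None   # usually a photo (hypothetical field name)

    def is_vouchered(self) -> bool:
        """True when the record carries something to vouch for it."""
        return self.voucher_url is not None


# Illustrative values only, not a real record.
record = Occurrence(
    species="Cardinalis cardinalis",
    latitude=38.8913,
    longitude=-77.0261,
    observed_on=date(2017, 2, 17),
    voucher_url="https://example.org/photo/123.jpg",
)
unvouchered = Occurrence("Cardinalis cardinalis", 38.9, -77.0, date(2017, 2, 18))
```

Keeping the voucher optional reflects the "usually a photo" wording: a record without one is still a record, just a less verifiable one.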

B

This is a snapshot of GBIF, which is the largest aggregator of data of this kind, and it does include iNaturalist, for example, and a number of museum sources. The Smithsonian's in there, and the big European natural history museums of London, Hamburg, Paris and so on, they're all in there. The biggest single provider to this map is called eBird, a citizen science platform. Because birds have more charisma than any other group, they provide about a third of the data overall on this platform.

B

So a third of this data is from within the past several decades, in a group that represents less than one percent of biodiversity. I

B

mentioned some of the data was in text form, and that is primarily, right now, in the process of being structured. For the most part, it has been scanned already. This is a snapshot from the Biodiversity Heritage Library. You will find an image of a page, and you'll see the OCR. OCR keeps evolving, as I'm sure you're all aware. The quality right now is not bad, so most of the text is available in here, although it does depend to a certain extent on typeface, and that depends to a certain extent on vintage.

B

So older texts can be a little bit more problematic to scan.

B

Nowadays, they are done processing most of the printed text and they're moving on to other kinds of content, such as handwritten text, which is usually dealt with by crowdsourcing, and also image tagging and identification, which often also requires crowdsourcing, because to find the metadata associated with an image you kind of have to go crawling through a book. So they have some representation in Zooniverse, which is a general-purpose online citizen science and crowdsourcing platform, where you can do anything from identifying galaxy types to figuring out what this,

B

what this spider illustration is, and the Smithsonian has a bespoke platform of its own where BHL images are treated. So they're still busy, and they are getting to the point where they are going after the higher-hanging fruit.

B

So altogether, this stuff is connected in something we describe as a biodiversity knowledge graph. This image is from one of the pillars of our community, who is the futurist among biodiversity people. His name is Rod Page, that's his blog down at the bottom, and he is trying to drag us into the 21st century.

B

The various items around that ring are all important for biodiversity data. They are interconnected in all the ways indicated and in some other ways; I think he didn't want to clutter the graph too much. The darker lines are the more easily structured and linked data.

B

Most of this, and this is the one accomplishment I really wanted to crow about, most of this data is mostly open. Sorry, most of these categories of data are mostly open, so you can get to it. There usually isn't a paywall, and usually there aren't restrictions on reuse of this data, so we don't have a lot of security concerns or ownership concerns in this area. The exceptions to that, which are, for the most part, about half open, are publications (scientific journals) and photos, be they collection photos or amateur wildlife photos.

B

Some of them are available for reuse, and some of them are not. It's a matter of business model in both cases, and that landscape is shifting fairly rapidly.

B

But it's complicated. The driving force behind the opening of all that data was the Bouchout Declaration, which was in imitation of the Berlin Declaration of a few years earlier. The Bouchout Declaration dates from 2014, and a large number of biodiversity institutions and a number of individual researchers all signed it as a commitment to making their data and their knowledge open, reusable and shareable in order to advance biodiversity sciences more quickly.

B

A number of governments are following suit in a general sense for science, including the US government, as I'm sure you're all familiar with. The primary vehicle that biodiversity folks are using to open up their information is the Creative Commons system of licenses and other statuses. All that I'm showing here are licenses.

B

These allow reuse of information that is in copyright without seeking permission from the rights holder. They're legal licenses, they have legal code available, and these are different flavors, which have different restrictions upon reuse. The Creative Commons team has also made available other stamps which assert a lack of copyright.

B

So, for example, if you own something but you are going to deliberately place it out of copyright, you can use a CC0 stamp, and if you know that something is public domain, either because of the source, having originated that way, or because it has aged out, you can apply a public domain stamp.

B

So I'm going to talk to you now in a little bit more detail about this area of the graph, because I'm more familiar with it. This is the traits-based area of the graph. A trait for a biological organism can be many different things, from a native attribute like body mass or body length, or flower color for plants, to preferred habitat, so they can be tangled up in behavior. My project, the Encyclopedia of Life, as well as a number of smaller projects, has been aggregating traits at the species level.

B

We are trying to make this information available for reuse and for better indexing of the rest of biodiversity knowledge. Traits come from a number of different places, but an important one is modern publications, either data papers or published datasets. There are a number of publishers out there that are helping individual researchers as well as institutions make this data available. It is still in an interesting variety of structures and formats. From, for example, Ecological Archives or Dryad, you can download datasets.

B

And those datasets could be anything from XML to, most often, something tabulated like a TSV file or a spreadsheet in Excel format, to a PDF of a table, which, you know, usually will be selectable, and you can get the data out somehow. But there's still a wide variation in how individual researchers are expressing what they think of as structured data.
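
To make the variation concrete, here is a minimal sketch of reading two small tabular files, one tab-separated and one comma-separated, whose authors labeled the same columns differently. The header aliases and sample rows are invented for illustration; real published datasets vary far more than this.

```python
import csv
import io

# Hypothetical header aliases: different authors label the same column differently.
HEADER_ALIASES = {
    "species": "species",
    "scientific name": "species",
    "taxon": "species",
    "body mass (g)": "body_mass_g",
    "mass_g": "body_mass_g",
}


def read_trait_table(text: str, delimiter: str = "\t") -> list[dict[str, str]]:
    """Parse delimited text and rename known headers to a common set."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for raw in reader:
        row = {}
        for header, value in raw.items():
            # Unknown headers are kept as-is rather than silently dropped.
            key = HEADER_ALIASES.get(header.strip().lower(), header.strip())
            row[key] = value.strip()
        rows.append(row)
    return rows


# Two made-up datasets carrying the same kind of information.
tsv_text = "Taxon\tMass_g\nMus musculus\t19.3\n"
csv_text = "Scientific name,Body mass (g)\nRattus rattus,150\n"

tsv_rows = read_trait_table(tsv_text)
csv_rows = read_trait_table(csv_text, delimiter=",")
```

An alias table like this only handles renamed columns; it does nothing for units, merged cells, or tables locked inside a PDF, which is where most of the manual effort would go.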

B

So this is an example of a page on the Encyclopedia of Life, the project which I support, and this is a summary page showing you an array of structured data records that we have available for this shark. This is just a summary view, but to get a closer look, this is what an individual record might look like. These are two different records for different species. So, you know, onset of fertility is typical of a vertebrate.

B

Cell mass is more common for a microbe like sea sparkle. Records come with metadata, but we aggregate from everywhere, and metadata vary extremely widely. We have an ever-expanding vocabulary of metadata terms, and we borrow them wherever possible from fairly mature online ontologies, many of which are produced by something called the OBO Foundry. But we have found that all existing ontologies in our domain are incomplete, so we frequently have to make up terms and place them in our ad hoc vocabulary.
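
The "borrow where possible, make up a term otherwise" practice can be sketched as a lookup with a fallback. Everything here is an assumption for illustration: the URIs are placeholders on example.org, not real ontology identifiers, and the function is not how the Encyclopedia of Life actually manages its vocabulary.

```python
# Placeholder URIs only: these are not real ontology identifiers.
BORROWED_TERMS = {
    "body mass": "http://example.org/borrowed-ontology/body_mass",
}
AD_HOC_TERMS: dict[str, str] = {}
AD_HOC_BASE = "http://example.org/adhoc/"


def resolve_term(label: str) -> tuple[str, str]:
    """Return (uri, source), preferring a borrowed term over an ad hoc one."""
    key = label.strip().lower()
    if key in BORROWED_TERMS:
        return BORROWED_TERMS[key], "borrowed"
    if key not in AD_HOC_TERMS:
        # No existing ontology covers the term, so mint a local identifier.
        AD_HOC_TERMS[key] = AD_HOC_BASE + key.replace(" ", "_")
    return AD_HOC_TERMS[key], "ad hoc"


borrowed = resolve_term("Body mass")
minted = resolve_term("cell mass")
```

Recording the source alongside the URI makes it easy to list the ad hoc terms later, which is what one would need in order to retire them in favor of a maintained vocabulary.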

B

We're hoping someday to deprecate all of this and adopt terms from somebody's maintained vocabulary, but that's an ongoing thing with us. I have a visual of this, which I'm going to try and jump out to. Let me know if this is working.

B

It's still thinking at my end.

B

Yeah, this is

B

Probably too small to read, but this is a single relationship between three species: a host plant, a disease microorganism and an insect vector that carries the disease. All the nodes are important pieces of metadata that are attached to this relationship, including things like provenance (how we know various things about these organisms) and lots and lots of context: which life stage of the insect, which tissue on the plant, what the symptoms of the disease are, and so forth.
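
One common way to hold a relationship like this, with its surrounding metadata, is as a list of subject, predicate, object statements. The sketch below is illustrative only: every identifier and predicate name is made up, and it does not reproduce the diagram shown in the talk.

```python
# Each statement is a (subject, predicate, object) triple; all names are illustrative.
triples = [
    ("interaction:1", "has_host", "taxon:host_plant"),
    ("interaction:1", "has_pathogen", "taxon:microorganism"),
    ("interaction:1", "has_vector", "taxon:insect"),
    ("interaction:1", "vector_life_stage", "adult"),
    ("interaction:1", "host_tissue", "leaf"),
    ("interaction:1", "source", "citation:placeholder"),
]


def describe(subject, statements):
    """Collect every predicate/object pair attached to one subject."""
    result = {}
    for s, p, o in statements:
        if s == subject:
            result.setdefault(p, []).append(o)
    return result


interaction = describe("interaction:1", triples)
```

Because context and provenance are just more statements about the same subject, new kinds of metadata can be attached without changing the structure.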

B

Oh, here it is. Okay, sorry about that. So that's the kind of records that we get.

B

We aggregate here at EOL from a number of smaller providers who are specialists in a given type of data, or individual curators, some with particular subject areas, some with particular taxonomic areas, types of organisms that interest them. Some of our data is distilled out of what we would call occurrence data. For example, OBIS, the Ocean Biogeographic Information System, contains records of organisms at sea.

B

So it's just the same kind of thing: a species, location and timestamp at sea. They often will also have depth information, and so you can distill out of that a depth range per species at which it is found. You can determine whether it's a shallow-water species or a deep-sea species. So the variety of sources is broad. The number of sources is quite large, several hundred already, and we've only been at it with a small crew for two years.
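
Distilling a per-species depth range from occurrence records can be sketched in a few lines. The record layout, the sample values and the 200 m cutoff between shallow and deep are all assumptions chosen for the example, not figures from the talk.

```python
from collections import defaultdict


def depth_ranges(records):
    """Map each species to (min_depth_m, max_depth_m) over its records.

    `records` is an iterable of (species, depth_m) pairs; records with
    no depth (None) are skipped.
    """
    depths = defaultdict(list)
    for species, depth_m in records:
        if depth_m is not None:
            depths[species].append(depth_m)
    return {sp: (min(ds), max(ds)) for sp, ds in depths.items()}


def zone(depth_range, deep_threshold_m=200.0):
    """Label a range 'deep' if even its shallowest record is below the threshold."""
    return "deep" if depth_range[0] >= deep_threshold_m else "shallow"


# Made-up records: (species, depth in meters); None means depth was not recorded.
sample = [
    ("Species A", 5.0),
    ("Species A", 30.0),
    ("Species B", 1200.0),
    ("Species B", None),
    ("Species B", 900.0),
]
ranges = depth_ranges(sample)
zones = {sp: zone(r) for sp, r in ranges.items()}
```

Classifying by the shallowest record is one possible choice; a species that ranges from the surface to great depth would need a third label, which this sketch does not attempt.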

B

So this is the landscape that we operate in, and there is one more source of complexity that might be adding data to this bucket in the near future. Of course, everything these days is affected by AI, and one of the two things AI is doing, of course, is learning to read. Since we have all these text sources of information, the community is beginning to annotate our biodiversity documents in anticipation of being able to reason over them with AI. That's

B

possibly more work than regular document processing, just because natural language in this context is a little bit unnatural, so the knowledge that the existing systems may have in interpreting human speech may not work very well with scientific jargon from a hundred years ago. So it's still in the experimental phase. I personally have been involved in three different research projects on biodiversity document annotation in the past three years, so it's very active, but it's proving complicated to work out. And the one last thing is machine

B

vision, which of course has also gone through some radical advancement in the last couple of years. Pl@ntNet is the first product that I have seen come to market that has really benefited from deep learning, but the potential is becoming clear for crowdsourcing the identification of individual organisms out in the world by anyone who has a phone. It's not equally tractable with all organisms, but with flowering plants, with anything larger than a centimeter or two that can be captured well enough on a phone,

B

it is potentially, someday, possible to get an identification from a photograph, if you take it well enough. Currently photo quality, which is primarily a human problem, is probably the biggest bottleneck to getting this kind of thing to be more productive in terms of number of observations. And that is what I have for you.

A

Well, thank you very much, Jennifer. Do folks in the room or folks online have questions?

C

If we're might be yep.

A

Your mics communis can hear me I'll.

C

Get closer so when you were showing me do I guess it was maybe to put this to slaughter hat all the knobs or bacon, keep going give up and out on the back order. One more! Oh.

C

Yeah, well, the one that that might be out there. Sorry it anyway, the one that had all the observation people.

B

Oh yeah yeah yeah, these two probably yeah.

C

I was interested to note that there seem to be a bunch of observations in the middle of the ocean. Oh yeah.

B

Yeah! Sorry that this will crumble into one and.

B

These are primarily the museum side rather than the citizen science side. This is showing you where research vessels have gone in the ocean.

B

So I think it is the French who operate out of Tasmania into the Antarctic, for example. The Americans and the Europeans, of course, are responsible for that cloud in the upper middle. So in the patterns that you can see, there's a lot of bias in where things have been sampled, but that's for a variety of reasons. The way cities are lit up on every continent, that's mostly citizen science.

A

Right. Well, it's for a separate project, but I do want to add, there's a program called Adventure Scientists that engages citizen science volunteers that will go to extreme locations in

A

They engage sailors that did a transatlantic wait and.

A

There are groups that do data collection there that aren't professional. Oops, it cleared. Does

C

What more empty space and more profitable Sahara I believe.

B

That is, and most of the arid African interior. Russia is not as empty as it looks. They have just been slow to share.

A

We're screen McDowell making comments.

A

Does anybody online have questions? We'll give you a moment to unmute your phone. I

C

have the mic over at UNC. So I'm curious about, you know, trying to match this up with what the data hubs are interested in. So one question would be how much what's going on here overlaps with NSF efforts like DataONE or seed or carrot hop, or, you know, how it intersects with things like efforts to build national-scale catalogs.

C

So, for example, how are we promoting findability of this kind of data? And, you know, I'm looking for intersection points: if the data hub were a thing that brought together researchers and citizens with this data, to discover it, manipulate it and so forth, are there any ideas or visions about what's lacking and what's already covered? At

B

a simple level, this has been somewhat addressed. For example, sources of exactly this kind of data are usually in contact with one another. So for occurrence data in the US, for example, there is a portal called BISON. Their slogan is Biodiversity Information Serving Our Nation, I think, and they are a hub of GBIF, which is this map, so they are already hooked together. There's a similar one for American marine data, which is just called OBIS-USA, which is a node of the OBIS repository. In terms of leveraging

B

this data against orthogonal datasets within the US or other nationally organized data communities, I think that's in its infancy. But the barrier, I think, is more of a subject matter one than a geographic one. I think a given data type is better organized across the world than different data types are organized within a nation, if that makes sense.

B

There another question in there yeah that.

A

A question right, yeah.

C

The math interpretive signs that go into my last point that you have I was curious if there are efforts to solve that challenge of where you have perhaps similar data, but have that different structural metadata being stored in two different locations, how you can provide a more.

C

way to search across multiple databases, by having perhaps databases of the metadata that exists in all these other repositories. There have been efforts to look at that at Chris, especially where we have, you know, similar species stored in two different laboratories that might have differently structured data, but would be useful to a researcher. I

B

can think of two areas where I know enough to respond. One is within this very kind of data, this particular biodiversity data of the species in the place at the time. We tend not to aggregate metadata so much as really try to get all the records represented in aggregator databases, and then the problem does become duplication, because there's more than one layer of aggregation.

B

So there are a number of small databases, for example, aggregated by BISON within the US, and some of them also belong to international organizations that may aggregate their data also. So by the time it gets up to GBIF, it's there twice, and people are gradually getting better at their use of identifiers so that we can more easily deduplicate that data.
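
When records carry stable identifiers, removing the copies that arrive through more than one aggregation path is straightforward. This is a minimal sketch under that assumption; the field name `occurrence_id` and the sample records are hypothetical, and records that lack an identifier are simply kept because they cannot be matched.

```python
def deduplicate(records, id_field="occurrence_id"):
    """Keep the first record seen for each identifier; keep records lacking one."""
    seen = set()
    unique = []
    for rec in records:
        rec_id = rec.get(id_field)
        if rec_id is None:
            unique.append(rec)   # cannot be matched, so it is kept as-is
            continue
        if rec_id not in seen:
            seen.add(rec_id)
            unique.append(rec)
    return unique


# The same made-up record arriving by two routes, plus one with no identifier.
incoming = [
    {"occurrence_id": "abc-1", "species": "Species A", "via": "national portal"},
    {"occurrence_id": "abc-1", "species": "Species A", "via": "international network"},
    {"species": "Species B", "via": "small database"},
]
deduplicated = deduplicate(incoming)
```

Without identifiers, one would have to fall back on comparing name, place and date, which is far less reliable and is presumably why better identifier use matters so much here.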

B

But that's the state this particular data type is in currently. And, oh, do I remember what the, oh yes, the aggregation of metadata is more outside my ken, but my understanding was that that was one of the roles of DataONE, for example, in the US. But someone more savvy might be able to help answer that question. I

C

Can make a comment on it? We've got a project here, red-suited assessor, a database is called a data version, basically about aquiline, made metadata and searching through it right now. We work with people from a neuroscience community think helps kids connect to try to try to map a fair amount of metadata from different location together. So there I mean I think that's the only way. I know exactly what I.

B

I have a question for that response. I'm sorry, I didn't catch your name. Oh, Howard.

B

Hi, Howard. Do you know if that's primarily done through people aspiring to directly map one field to another, or are they constructing vocabularies in between to describe the relationships among fields?

C

In the case of the neuroscientists at the moment, what they're trying to do is what they think of as harmonization, which some of us might think of as semantics. It's just going to try to take a bunch of their key-value data, right, and map the keys onto some sort of common key and try to map the values onto a common set, so that you can make comparisons. An

C

Example or something called stats, which is can't remember what it stands for, is basically it's a personality assessment, prescriptive thing: to sort of pinch, a certain number of questions passionately so they're trying to map other assessments into that assessment of traumatic values.

C

You can have some common basis that prepares a little. Are you trying to have some sort of semantics of ontological similarity or what exactly kind of comparing.

B

Yeah, it is still difficult. We in the biodiversity community actually think of looking at neuroscience, and the biosciences in general, as looking about 10 years into the future in terms of ontology development. That's how far we think they are ahead of us.

B

Not surprised that so hard are yeah.

C

It's: what are you feeling better still.

C

Yeah, oh sorry,.

B

Was that to me well.

B

That makes me feel worse, actually, but I appreciate the warning. Yeah.

A

Before we move on, are there any last questions for Jennifer?

A

Okay, hearing no other questions, thank you very much, Jennifer. We really enjoyed that, and we will be making the recording of this available to post, so we'll be able to share it again.