South Big Data Hub Data Sharing & Infrastructure Group, 6 Apr 2018

From YouTube: CI WG demo: NIEHS Office of Data Science Efforts

Description

Date: 4/6/18
Presenter: Mike Conway
Institution: National Institute of Environmental Health Sciences (NIEHS)
South Big Data Hub

A

Most people here know Mike because he's been at many of these meetings, so he probably doesn't need too much of an introduction. He's going to talk about the data science efforts at, and I'm not going to try to pronounce it, NIEHS, because I'm sure everybody pronounces that acronym differently. He'll talk about what's going on in the Office of Data Science there, what the priorities are in the office, and where they're going. So I'll turn it over to Mike to tell us all about it.

B

Hey, guys. For some of this, I'm not going to talk about a particular product or development effort, because it's a strange world, but I've sort of jumped over the fence, and we're now in a position to sort of eat our own dog food, if you will. I remember there was some discussion during the last meeting, I think this was with Renaye, talking about:

B

are we producers or consumers? I was mentioning the fact that we're now sort of straddling that line. I want to talk about what we're doing, why we think the hubs are important, and what we are looking to both get from and contribute to that effort in the context of the hubs. So with that: I'm now working at NIEHS, the National Institute of Environmental Health Sciences, which is part of the larger NIH organization. We're based here in RTP, and we have a beautiful campus

B

that I contend is actually the model for Hawkins Lab in Stranger Things. If you guys don't know, the Stranger Things people went to Jordan High School. So if you see a picture of our lab and compare it to Hawkins Lab, you tell me. Okay. Previously I was working with Dr. Moore and Dr. Rajasekar

B

in the DICE Center, and then with the DataNet Federation Consortium, where I know a lot of you folks from. I became involved with the BD Hubs effort when I was at RENCI, and that has continued as I've moved over to NIEHS. It's interesting for me, and I'm still navigating it, coming from the NSF side of things and now working on the NIH side of things: how much things are actually flowing back and forth across those boundaries, as well as the boundaries between what we do and what vendors do, and so forth.

B

So this is really an interesting area to be in. So: where we are, a little bit of what we're doing with ODS and what the NIEHS Commons is, and again, why the BD Hubs efforts are important to us, and about contributing back to this community.

B

So this is from our mission statement: accelerate scientific discovery and collaborative research, and improve public health, through the application of scientific data and knowledge. With the wonderful presentation we just saw, just remove the word "hydrology" and insert "public health", and I know there are even public health overlaps with the hydrologic community. You'll see that we are actually all working on the same problems when you step back and look at it from the multidisciplinary level. And that, again, is where we were five years ago with the DataNets,

B

which was multidisciplinary research. So, for example, the controversies with GenX in the water here in North Carolina: that's a hydrological problem as well as a public health problem. So you can definitely see how these overlap. So this is really, at the very highest level, what we're concerned about. With toxicology and the things we do at NIEHS, we worry a lot about life science and genomics, how the environment impacts the genome, and the health effects of environmental exposures.

B

So we're dealing internally with a lot of high-throughput technologies that are creating science data, then applying computation to those datasets to derive data, and then also beginning to preserve, distribute, and share that information, with the focus on FAIR. I'm sure FAIR has popped up more than once in a lot of these conversations.

B

But FAIR is sort of our focus now: findable, accessible, interoperable, and reusable. And this is in the context of what's going on, and I know there are lots of folks wearing this hat on the call as well, where NIH at large is really looking at these problems of big science and how the cloud impacts it: how we manage health science data, share it securely, compute on it securely, control it, and make it findable and discoverable.

B

So we in our group are very much tracking not only what's going on in the BD Hubs group, but also what's going on with these larger efforts. We're tracking it in terms of orienting our own efforts internally, as we deal with internal data, with an eye towards this opening up into this sort of shared, FAIR-based environment: bringing in large, high-throughput datasets,

B

you know, next-generation sequencing, all these sorts of technologies; managing that data; managing it in terms of compute; creating datasets that are preserved, with authenticity, provenance, and all the Reagan Moore things that I like to channel for you guys. So it's really a nice place to be. As I'll say, we're starting internally. What we're dealing with a lot now are the internal datasets coming out of our core labs, so cryo-EM, mass spectrometry.

B

We have studies where tissue samples or different things are being sent out by the PIs to different core labs, where different sorts of assays are done. All that data is returned to the PIs and integrated and analyzed to create studies on, for example, the cancer effects of a certain chemical, or we're doing studies on all kinds of environmental exposures.

B

So it is not just FASTQ files that we are concerned with. It's this larger idea of an information commons and a knowledge network, taking in all these different aspects of health science and bringing them together, the point being to increase the speed at which we can find new knowledge. So you have this larger picture that we are all driving towards in our own domains, but it's always "towards", right? We're perpetually "towards". But we're starting with some more prosaic concerns, and I borrowed this from Reagan.

B

This is the idea that data has a lifecycle, and a lot of these projects are more concerned with the right two boxes, if you will, of this data lifecycle. Even as we look at the NIH Commons effort, that is really oriented towards these right two boxes. We're an organization that right now is internally way over on the left-hand side, which is managing data

B

coming off of our core labs, and managing researchers trying to assemble different kinds of assays and different sorts of datasets to create publications and to derive conclusions. All of that is occurring on this left-hand side of the chart, and even just dealing with those first three or four boxes, our work is cut out for us. So again, where we are is more on this left-hand side.

B

We're very much concerned now with internal-facing data, with an eye towards what's going to happen. I would actually submit that a lot of the entities that are involved with projects like the BD Hubs are, in reality, living in that side of the space. But what we're concerned about is: how do we manage this data over on the left-hand side so that it can transition over to that right-hand side as FAIR data, where we've kept provenance, where we know where the data came from, and where we can make assertions about its authenticity?

B

We can also have the repeatability to validate a study if questions come up. And since this data is also very expensive to acquire, how can we reuse it? Those sorts of things. So in our architecture right now, internally, we're really dealing with repositories for this core data and managing the delivery of this core data to our researchers.

B

Previously, what you have, and we've seen this in a lot of places, is DDNs and NetApps and so forth that are just chock-full of terabytes of sequence files and other sorts of datasets. We begin to lose value in that data because people can't really describe where that data came from; the metadata is not complete. So we're dealing with immediately getting a handle on these sorts of things. So we have, as with HydroShare, a grid that is based on iRODS.

B

We do have an eye towards federation mechanisms as an important way that we can extend the Commons out in various directions, both towards external collaborators and potentially other parts of NIH. But also, interestingly, we're looking at how we can use this sort of architecture to place resources inside of different labs, who may be sitting inside a science [unclear], doing work, but be able to draw data from these distributed nodes into a central core to do computation, processing, replication, and so forth. So it's data tiering, in a sort of different way, with the labs being the first tier.

B

And then we are very much interested in CWL and getting to some common processes for describing workflows. So, like most people: Docker and Singularity, and the ability to run workflows on one platform and move over to a larger, more capable platform. That matters as we move from things like MiSeq and NextSeq to the bigger, scarier things like NovaSeq, and need to avail ourselves of more computational power.

B

That's where the agility in our workflows and pipelines comes from, and in that, using the ability to interrogate Docker and Singularity containers and so forth to capture parameters, show reproducibility, and establish provenance and chain of custody.
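
The idea of capturing parameters and container details to support reproducibility and chain of custody can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the tooling NIEHS uses: the function names, the record fields, and the `example/aligner:1.0` image reference are all made up for the example.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(container_image, parameters, inputs):
    """Build a provenance record for one workflow run.

    container_image: reference to the image the step ran in (hypothetical)
    parameters: dict of parameters passed to the workflow
    inputs: dict mapping input name -> raw bytes of that input
    """
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "container_image": container_image,
        "parameters": dict(sorted(parameters.items())),
        "input_checksums": {
            name: sha256_of(data) for name, data in sorted(inputs.items())
        },
    }


def record_fingerprint(record):
    """Hash the fields that matter for reproducibility (timestamp excluded)."""
    stable = {k: v for k, v in record.items() if k != "recorded_at"}
    return sha256_of(json.dumps(stable, sort_keys=True).encode("utf-8"))


if __name__ == "__main__":
    rec = provenance_record(
        "example/aligner:1.0",  # hypothetical image name
        {"threads": 4, "reference": "hg38"},
        {"reads.fastq": b"@read1\nACGT\n+\nIIII\n"},
    )
    print(json.dumps(rec, indent=2))
    print(record_fingerprint(rec))
```

Two runs with the same image, parameters, and input bytes produce the same fingerprint, which is one simple way to assert that a result was derived the same way; any change to a parameter or an input changes it.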

B

Similarly, we're very concerned with starting to develop and utilize all of the different ontologies and taxonomies that are out there. So we're in the process of evaluating and adopting tools to manage our ontologies and taxonomies, and we're actually currently wrestling with the kinds of data we can extract from the workflows and how we integrate human curation into this, both to start describing the experimental methods that are being used, the questions that are being asked, and the outcomes that result from the data, and to link them to publications. None of this, again, should be surprising to anybody.

B

But again, within the BD Hubs context, what we're really interested in is future-proofing. We look at the BD Hubs, and we are also very interested in GA4GH and other groups, in terms of future-proofing and confirming the architectural choices we are making now internally, so that as we begin to open up to the world, we're in a place where we don't have to redo everything, if this makes sense. And I think a lot of people are in the same place we are.

B

This is just to show, not necessarily for you to be able to read this chart, where we are: we have all these basic elements either in place or in pilot, but we're choosing among technologies. For internal identifiers, are we looking at minids? How are we archiving data, so things like BDBag: is that appropriate for us? And then choosing ontologies and metadata, how we describe our data, and then processing pipelines from all of these different data sources.

B

Our concern right now is our epigenomics core and all of the various sequencers that we run, and then adding to them things like cryo-EM, mass spec, and other ways of analyzing data, delivering them to PIs, and ensuring long-term FAIR access.

B

So in terms of BD Hubs, I looked at the rings and the spokes, and I think our concerns, it's like Maslow's hierarchy, our concerns are the ring concerns: where are the highway signs and where are the guardrails? So here are some of the things that we as an organization are looking at: very much the NIH Commons and what's going on there,

B

and what we learned from the DataNets, since that's where I came from. We're looking at peer institutions, so we're looking at what the hydrological communities do. We're also looking at what NCI is doing with their cancer data commons [unclear], and also what vendors are doing, although I think this group, and the concerns this group is expressing, are way ahead of where the vendors are.

B

If you talk to vendors sometimes, they won't know what FAIR is, even people who are dealing in things like storage technology. FAIR is not something that has really penetrated a lot in the vendor community. So I really do think that the people who know about this, and the real conversations, are happening in the context of what the hubs are doing. Guardrails: this is the other thing.

B

The other ring of what the hubs are doing is looking at things like security and compliance. We're very concerned with data usage agreements, and again, one of the reasons we're internal-facing right now is that when you start opening things up to external collaboration, that's like going from two kids to three kids or something like that.

D

It's not a...

B

...linear progression of complexity, yeah. And so that's kind of where I am. Again, I didn't want to get too much into the nuts and bolts, but we have been really interested in the various presentations of the various technologies, and again, sitting on the other side of the fence, as part of NIEHS,

B

we are not in the business of creating products, but we are in the business of being parts of communities, utilizing projects at a level of maturity, and bringing them together to create coherent systems that can deliver this sort of FAIR, or FAIR-plus.

B

We kind of focus on FAIR plus computable, because I think the tools we need mostly exist in the open-source community, in the kinds of things that NSF has produced, and in the kinds of things that NIH is producing in their Commons efforts, and the hub can be a resource for us in terms of navigating this. I will sort of stop there. We could go into the mind-numbing detail, which I know we all love to do, but I was trying to keep it more towards, I guess,

B

my reaction to some of the conversations I've heard recently about the hubs and their role. I will stop sharing there and then pass it back.

A

Thank you very much. So before I dive into any questions, I'll open it up to other people here.

C

So you haven't had privacy and trustworthiness and things like that on there. Are they...?

B

They are not a separate topic at all. The reason I don't have them on there is that it's not my expertise area. But I do know, for example, if you look at the NIH Data Commons efforts, that there are whole sections, whole key concerns, of the NIH Commons efforts on things like ethics, privacy, and the science drivers and scientific use cases for this data. So it had more to do with the fact that that's not an area I can cover very well.

C

OK, because FAIR might have an E at the end.

C

Interesting. Thank you very much.

C

This is Matt from UC Davis. I have a question for you. I was interested to hear that you are thinking about ontology and vocabulary management tools. Can you talk a little bit about that and what that means, and how you interact with other ontology communities like OBO, or whether you're the right person to talk to about that, actually?

B

If I could, Rashaam, I don't know if you have your microphone available, but I would invite Rashaam to maybe comment, because she's part of the Office of Data Science as well and is sort of leading the effort on ontologies. Or Stephanie, if she's on, then perhaps they'd like to comment.

E

I'll briefly say that we are looking at several biomedical ontologies and identifying which ones can serve our needs. So that's one aspect. The other aspect is how to manage these ontologies, so we are looking at several different tools and are in the process of evaluating them. Maybe Stephanie can add further on that.

D

Hi, I'm Stephanie. The health sciences controlled vocabulary has had a lot of fits and starts, as you can probably imagine, given the multidisciplinary nature of the topic. We pretty much encompass, or can potentially encompass, just about every discipline that is out there in one respect or another. There are currently efforts underway to re-energize a multi-institute approach, along with EPA and others within the environmental science community, to try to come up with an approach for developing a health science vocabulary. There are definitely ontologies that already exist in the biomedical space

D

that obviously we would be able to incorporate, and not try to reinvent the wheel on. It's about understanding where there might be gaps in vocabulary, for example gaps related to exposures, that we would want to work on developing. And obviously [unclear] for exposure may have some areas of difference in how EPA might need to implement that. But it is definitely a topic area.

D

I am now calling this the year of controlled vocabulary in our space, because we have been trying to bring it up over the past, gosh, I think it's been at least 10 or 12 years. We have tried to get our institute to begin thinking about controlled vocabulary, and it has now become apparent, just within our institute, that we have a lot of data systems and applications where people have continued to reinvent metadata or some type of controlled vocabulary to help with coding of their content. We have situations where one group calls it "organ" [unclear], and we need to rectify that type of situation.

D

So yes, there's an effort underway.

C

Thanks very much.

A

All right, I just had one quick question. It's sort of greedy to ask it, but I'm going to do it anyway. A lot of what I keep seeing in NIH and other spaces for a lot of this work is that they keep talking about cloud, and you talked about cloud and vendors and things. What do you feel is NIH's view on what a cloud is? Because it's one of those questions where everybody's got an opinion, but nobody's seems to be the same.

B

I'll make a technical answer, and then I will definitely deflect all the policy answers and so forth.

A

I'm not after the policy answers. I'm thinking, you know, what are they looking for when they say the word "cloud"?

B

To me, there's no such thing. I think that really what we deal with more is agility in terms of where data is stored. So even internally, it could be as prosaic as: we've got data on this NetApp and we've got data on the DDN; how do we treat that under a common view? And that's our interest in iRODS. We could have data in some location where it's on the DDN and then you have a replica, or you push stuff out to Amazon, or onto Glacier, or something like that. So is that cloud? I think cloud is a meaningless term. I think what's more important is that data

B

has independence of location, right?

B

Data has an identifier that persists no matter where that location is, and then it's really more about this portability of execution of analyses. So I know that in the NIH Commons,

B

a lot of the effort is really more oriented towards minimizing data egress charges by moving computational units to where big datasets sit, right? And that's the same problem we have if we want to process data locally, or process it on some high-capacity resources within NIEHS, or send stuff to Biowulf, or something like that. So I think "cloud" is misleading. I think it's the standards and open APIs. Yeah.
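
The point about location independence, an identifier that persists no matter where the bytes currently live, can be sketched as a tiny catalog. This is an illustrative Python sketch under stated assumptions and is not the iRODS API: the `Catalog` class, its methods, the identifier format, and the storage URLs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """One logical dataset: a stable identifier plus its current replicas."""
    identifier: str
    checksum: str
    replicas: list = field(default_factory=list)


class Catalog:
    """Maps persistent identifiers to whatever locations hold the data now."""

    def __init__(self):
        self._objects = {}

    def register(self, identifier, checksum, location):
        """Record that `location` holds a copy of the identified dataset."""
        obj = self._objects.setdefault(
            identifier, DataObject(identifier, checksum)
        )
        if obj.checksum != checksum:
            raise ValueError("checksum mismatch for " + identifier)
        if location not in obj.replicas:
            obj.replicas.append(location)
        return obj

    def relocate(self, identifier, old_location, new_location):
        """Move a replica; the identifier itself never changes."""
        replicas = self._objects[identifier].replicas
        replicas[replicas.index(old_location)] = new_location

    def resolve(self, identifier, prefer=None):
        """Return a location, favouring one that starts with `prefer`."""
        replicas = self._objects[identifier].replicas
        if prefer is not None:
            for location in replicas:
                if location.startswith(prefer):
                    return location
        return replicas[0]


if __name__ == "__main__":
    catalog = Catalog()
    # Hypothetical identifier and storage locations.
    catalog.register("id:example/run42", "abc123", "netapp://lab-a/run42.fastq")
    catalog.register("id:example/run42", "abc123", "s3://example-bucket/run42.fastq")
    print(catalog.resolve("id:example/run42"))
    print(catalog.resolve("id:example/run42", prefer="s3://"))
```

Callers only ever hold the identifier, so data can move from a lab filer to central storage or to an object store without any analysis code changing, and a job can ask for the replica closest to where it runs.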

A

That was sort of what I was going for, because, yeah, I like to pick on that word. But I think it's one of those things that is good to talk about, especially in this hub environment, because at some level we are the cloud, and the important thing to remember on that... Oh, please.

B

I'm sorry, no.

C

No, go ahead, Mike.

B

Okay. No, I was just going to say, I think what's important is that we would like to build systems that are not walled gardens, that we can have other people consider to be a cloud. Even if it's sitting on just a Linux rack server in our computer room, we can still say it's cloud in terms of how you perceive it, right?

C

Yeah, I was going to offer up a definition for cloud. Would this work for people? And this would be from an agency perspective: "We are not buying new machines." It seems more and more that it comes down to that. Put it somewhere, another place. That's right.

A

Or an as-a-service model, right? It's, you know, we're not going to be isolated to being our own island.