From YouTube: CI WG demo: SciDAS
Description
Date: 03/31/17
Presenters: Claris Castillo (RENCI) & Alex Feltus (Clemson)
Institutions: Renaissance Computing Institute & Clemson University
South Big Data Hub
A: Alex Feltus and Claris Castillo will present SciDAS, a national cyberinfrastructure for scientific data analysis at scale; so we are now looking into the future. Claris is a senior networking systems researcher at the Renaissance Computing Institute (RENCI) here at the University of North Carolina at Chapel Hill, and Alex is an associate professor in Clemson University's Department of Genetics and Biochemistry and CEO of Allele Systems. Okay, take it away.
A: Thank you. So, SciDAS is designed to improve flexibility and accessibility to national resources, helping researchers more effectively use a broader array of these resources. SciDAS will be developed using large-scale systems biology and hydrology use cases, but it is applicable to many other domains. So with that I'll hand it over to Alex and Claris.
B: We haven't even gotten to the point, since we started in February, of deciding how to pronounce our own acronym; it's really, really new. But primarily, this is an end-user-driven project where, using experimental tools, we're trying to build production systems that can do what the NSF is asking and actually generate usable scientific results. And so, one of the big problems we're trying to solve...
B
Is
that
even
scientists
who
are
very
geeky
and
understand
how
to
use
high
performance
computing
and
things
like
that
they're
having
trouble
processing
their
own
DNA?
This
is
me
I'm,
one
of
these
people
I'm
an
end-user
and
we're
constantly
trying
to
figure
out
why
our
nodes
aren't
working
or
jobs
are
failing
and
we're
dealing
with
I
have
a
hundred
and
fifty
terabytes
of
data,
and
we
filled
up
instantly
by
trying
to
deal
with
some
of
these
practical
practical
issues.
B: So there are people having problems processing their data from a data perspective, like, again, myself and one of my colleagues; and then there are the people building these amazing systems, which I really didn't know much about until a few years ago, other than just using whatever my campus had to be able to process data.
B: These amazing systems... Claris will talk about some of the systems that will be used, but I found that a lot of times we don't know what these systems are; they aren't marketed to scientists very well. Sometimes they don't meet my needs, and sometimes it's just that I don't really know how to get access to them (what hoops do I have to go through?), and sometimes that keeps me from using them.
B: So, next slide. The solution, and the way we wrote this proposal, is to embed active end users, like some of the people who are PIs and co-PIs on this proposal, who are trying to process data at the tera- to petabyte scale, looking at it from an end-user perspective, with so much raw data having to go through the whole system; and then to embed those users on software design teams with cyberinfrastructure developers, like the ones we're working with at RENCI, in agile design teams.
B: So what we're doing is sharing our destiny on NSF reports, and really we're developing the system while we're building it. Next slide. What we're doing on the design team, for the distributed cyberinfrastructure component, is gluing systems together to allow us to discover domain data, move it fluidly across networks, and launch scientific workflows in the way a scientist is comfortable doing, even if it's not the best way, and so improve flexibility and access to national and global resources.
B
This
is
our
solutions,
problem
and
I
shouldn't
miss
this
I,
didn't
say
at
the
very
beginning
is
that
this
is
NSF
funded
project,
and
this
is
Clemson
rincey
at
Chapel,
Hill
and
Washington.
State
universities
are
the
primary
side,
but
we're
collaborating
with
people
already
around
the
country.
Next
slide.
B: This, again, is user- and engineer-focused: we're trying to have the embedded scientists stress-test the system. A big part of this is that I find that, using big data sets in genomics (which is what I do, genomics and genetics), the general-purpose systems don't always function the way you expect them to, and neither do the networks that move data around. Next one. I put these dumb animations in here, and the concept here is what we already have on the team.
B: We have some plant biologists, but also some people interested in looking at the Earth from space, in remote image analysis, and we're really marrying these people with the people who work at RENCI, the technical people at Internet2, and a lot of partners: the computer scientists, the network engineers, the storage engineers, the visualization people, the HCI people. We're really trying to bring all this together to create a village that can get some work done. Next slide; and so there's the concept.
B: So, you know, "if you build it, they will come": we're trying to change that concept here to "they will help build it while using it." Okay, so here's our little construction of a baseball field where we can all play together, the network engineer and the geneticist; they're going to sow it together, after mowing the lawn and constructing the field. We're really trying to do this instead, trying to alleviate the problem of building a massive system that is great, right, but doesn't quite get advertised correctly.
B
It
doesn't
get
utilized
to
its
full
potential
and
we're
trying
to
do
that
in
education,
as
the
user
really
trying
to
respect
the
inefficient
unoptimized
habit
space
of
the
domain
scientists
that
that's
that,
if
you
don't
do
it
that
way,
you
can
turn
fine
is
off
because
they
don't
want
to
change
their
their
bad
habits
next
slide
and
so
for
stress
testing.
You
know,
I'm
a
biologist
and
a
big
big
part
of
the
data
that
we're
running
through
the
system
as
we
develop.
It
is
biological,
and
this
is
just
a.
B: This was taken yesterday: a snapshot of the data in one repository, the NCBI Sequence Read Archive in Maryland. When we move data through Internet2 to our systems right now (and I've been doing this for years, actually), it just passed the magic 10 quadrillion A's, T's, G's, and C's: base pairs that have been generated with a not-new, eight-year-old sequencing technology that really produces huge data sets. You can see this is an exponential curve. This is a lot of data.
B
A
base
pair
is
a
byte,
and
so
you
can
look
at
this
as
petabytes
over
10
petabytes
of
data
that
have
been
generated
and
I
see
this,
as
you
know,
hitting
X
of
ice
if
it
hasn't
already
it's
just
not
in
this
repository,
if
it's
out
there,
if
I
think
I
saw
in
John's
slide
that
there
was
on
during
12
petabytes
moved
across
skynyrd
at
to
last
and
2016
times.
I
got
that
right.
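The back-of-the-envelope conversion the speaker is using (one base pair stored as roughly one byte, uncompressed) can be sketched in a few lines; the numbers below are the ones quoted in the talk, not independent measurements:

```python
# Scale estimate from the talk: 10 quadrillion base pairs at roughly
# one byte per base pair is on the order of 10 petabytes, uncompressed.
base_pairs = 10 * 10**15            # 10 quadrillion A/T/G/C characters
bytes_total = base_pairs * 1        # ~1 byte per base pair
petabytes = bytes_total / 10**15    # decimal petabytes
print(petabytes)  # → 10.0
```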
B: You could have a very small number of scientists moving this data around, mining it, and meeting that capacity just from this one repository; so there's a lot of data to crunch. Next slide. Okay, and so, as the biologist: you know, I've got stacks of hard drives in the office that we had received from colleagues to process their data. You can't do that anymore. And so one of the stress tests is that we're actually trying to develop biological results while we're using the system.
B
We
in
my
lab
in
a
state
of
Sigma
and
watch
the
State
University
of
Kali
and
some
other
people,
we
are
generating
G
Direction
patterns.
We
build
these
graphs
of
G
interactions.
This
picture
here
shows
the
little
dots
are
inter
genes
and
line
between
them
or
gene
interactions,
and
we
process
that
10
petabytes
of
data
at
at
NCBI.
That's
a
big
place
where
we
draw
from
in
other
places,
choose
to
the
polenta,
generate
these
kind
of
networks,
and
so
this
is
a
massive
computational
problem
to
do
so.
B
Really
stress
testing
it
from
an
algorithmic
and
just
raw
data
perspective.
Men
are
in
what
we're
doing
next
slide,
please,
if
so,
we're
really
trying
to
this
kind
of
get
ridiculous
here
and
see
how
much
we
can
stress
test
the
system
and
we're
trying
to
I
focus
a
lot
of
plants
and
sort
of
a
human
perspective
too,
and
really
trying
to
branched
out
into
the
Tree
of
Life
and
pull
big
datasets
from
different
nodes
on
the
Tree
of
Life
different
organisms.
B
There's
some
numbers
here
showing
you
like
this
is
actually
pretty
dated
that
there
are
38
terabytes
of
green
plant
law
data
that
we
can
process.
We
probably
right
now.
If
we
restored
process
and
raw
data
will
be
generating,
you
know
3,
petabytes
or
so
intermediate
intermediate
files,
and
we
don't
have
that
kind
of
storage
we're
trying
to
deal
with
a
data
processes
in
this.
You
know
how
far
we
can
push
the
envelope
when
we're
generating
this
data.
B
This
is
sort
of
a
model
with
the
grant
we're
going
to
be
publishing
the
genome
interaction
patterns
that
are
interest
to
different
groups
of
scientists
and
biologists,
medical
scientists
that
could
be
mine
for
information,
good
blood,
and
this
is
a
we're
going
to
detail
here.
But
this
is
a
lot
of
the
workflows
that
the
comfort
zone
for
from
my
group
is
where
we've
been
developing
Pegasus
workflows
and
run
on
the
open
science
trip.
B
The
open
science
kids
allowed
it
to
really
scale
up
from
a
pretty
robust
data
center,
a
Clemson,
and
that
we
can,
you
know,
branch
out
into
exceed
resources
as
well,
but
we
already
have
robust
workflows
like
this
one
that
generates
gene
expression,
matrices
Pegasus,
workflows
that
are
functional
now,
with
Qi
the
weekend,
we're
going
to
be
testing
into
the
NSF
cloud
instance.
It's
like
chameleon
and
cloud
lab
and
then
really
moving
to
OSD
for
production
purposes
and
using
Internet
is
a
big
data
mover
for
a
data
next
slide,
and
so
we're
not
just
about
biology.
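For context on the Pegasus workflows mentioned above: a workflow is a directed acyclic graph (DAG) of jobs with data dependencies, which a planner then maps onto execution sites such as OSG. The sketch below is a hypothetical illustration of such a pipeline's structure in plain Python; the stage names are invented for illustration, and this is not the actual Pegasus API:

```python
# Hypothetical sketch of the DAG structure behind a pipeline like the
# gene-expression-matrix workflow mentioned in the talk. Plain Python
# illustrating job dependencies, not the real Pegasus API.
from graphlib import TopologicalSorter  # Python 3.9+

# job -> set of jobs it depends on (all names are illustrative)
dag = {
    "download_sra": set(),              # pull reads from NCBI SRA
    "align_reads":  {"download_sra"},   # map reads to a reference genome
    "quantify":     {"align_reads"},    # per-sample expression levels
    "build_matrix": {"quantify"},       # combine into one expression matrix
}

# A workflow planner dispatches jobs in a dependency-respecting order:
order = list(TopologicalSorter(dag).static_order())
print(order)  # → ['download_sra', 'align_reads', 'quantify', 'build_matrix']
```

A real Pegasus workflow adds file-level data dependencies and site selection on top of this ordering, which is what lets the same DAG run on a campus cluster, OSG, or a cloud testbed.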
B
This
is
the
collaboration
for
many
many
different
University
and
universities
and
we're
seeing
if
we
can't
like
find
some
some
cross-pollination
between
the
disciplines
with
this,
but
also
be
able
to
engage
Hydra
share,
learn
from
what
they've
done
in
their
database,
but
also
be
able
to
use,
add
some
more
data,
crunching
power
to
Hydra
share,
and
so
there's
many
many
aspects
to
it
and
Hydra
shares
that's
a
really
cool
resource,
and
so
we're
we're
seeing
how
we
can
further
enable
this
next
leg.
I.
C: First among our modules is iRODS, which is open-source software that provides tools to manage data from acquisition to archival and reuse, throughout the whole data lifecycle, and which has been widely adopted by government, industry, and academia. It's also the back end of some other prominent ACI efforts, like CyVerse (it used to be iPlant). We then also have ORCA and ExoGENI. ORCA is control software that was developed in collaboration with Duke to orchestrate resources across federated resource providers.
C
A
musicians
provider
could
be
cloud
providers
or
neighbor
provided
link
into
the
tool,
and
currently
he
wants
a
rose
electricity
in
Genova
to
genie,
a
genie
testbed,
a
studying
more
than
25
and
more
than
20
network
providers
and
in
a
to
Jeannie
Orca.
Allow
scientists
to
create
virtual
infrastructure
of
slices
tailored
customized
to
their
needs,
bringing
together
in
a
computer
source,
attic
storage,
the
social
network
all
connected
to
layer,
2
networking
and
distributed
across
multiple
cloud
or
cloud
providers
around
the
nation
and
internationally
actually,
and
so
when
we
have
these
two
foundational
technologies.
C
It
is
behind
cider
each
atom,
and
this
is
a
project
in
where
we
integrate,
is
an
NSF
for
the
projects
that
completed
one
a
year
ago.
But
in
in
this,
we
integrated
pegasus
the
workflow
management
system,
develop
III
and
USD
with
Orca
to
demonstrate
that
typical
applications
could
dynamically
in
legal
changes
in
infrastructure
to
adapt
to
changes
in
the
workload
like,
for
example,
is
a
scientific
workflow
needed
to
the
three
for
transferred
data
from
an
external
data
depository
or
needed
to
scale
out.
C
We
could
dynamically
the
workflow
itself,
the
scientific
application
to
dynamically
instantiate
Network
links
the
to
enable
the
data
transfer
and,
as
you
remember,
Alex,
a
representative
workloads
that
we
are
going
to
be
using
to
drive.
These
projects
are
Pegasus
entity
worthless,
currently
running
in
osg,
the
type
of
program
which
is
in
the
same
pain
of
the
previous
one.
C
With
these
two
in
layers
being
separated
like,
for
example,
we
demonstrated
and
do
something
in
supercomputing
that
we
could
prioritize
the
traffic
of
a
or
the
transfer
of
a
certain
data
based
on
the
metadata
associated
with
the
file,
and
this
is
what
this
integration
of
idols
and
or
attack
enable,
among
others,
many
others
Sdn
basic
limitation
in
techniques
that
we
developed
on
the
radii,
and
here
we
are
facing
the
federal
scale,
challenge
that
Alex
and
Stephen
have
brought
to
us.
We
are
going
to
be
building
and
hardening
these
efforts.
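The metadata-driven prioritization described here can be pictured as a small policy function: given a file's metadata tags, choose a transfer priority class. This is an illustrative sketch only (the tag names and classes are invented), not the iRODS rule language or the actual demo code:

```python
# Illustrative sketch of metadata-driven transfer prioritization, in the
# spirit of the SC demo described above. Tag names and priority classes
# are hypothetical, not from iRODS or ORCA.
def transfer_priority(metadata):
    """Map a file's metadata tags to a network priority class."""
    if metadata.get("project") == "urgent-analysis":
        return "high"                      # time-critical project traffic
    if int(metadata.get("size_gb", 0)) > 100:
        return "bulk"                      # very large files take a bulk path
    return "normal"

print(transfer_priority({"project": "urgent-analysis"}))  # → high
```

In the real system, a decision like this would be taken by the data grid's policy engine and enacted by the SDN layer (e.g., by provisioning a dedicated link for "high" traffic).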
C
Previous
efforts
is
to
allow
scientists
like
Alice,
to
build
collaborative
infrastructure
across
a
much
more
richer
and
complete
ecosystem
that
will
include
cloud
lab
tag,
a
million
open
site
grid
and
its
genus
and
be
tightly
integrated
with
a
wide
area
network
in
data
Greece
idols
that
would
be
initially
deployed
in
Renzi,
Washington,
State
University
and
clamp
them
to
meet
the
needs
of
the
specific
applications
of
the
biology
community
and
the
hydration
community.
The
procedure
by
hydro
share.
C
We
will
also
be
extending
these
while
creating
new
approaches
so
that
we
can
provision
containerized
scientific
applications.
So
now,
in
the
past,
we
have
been
using
virtual
machines
as
the
unit
of
deployment
of
scientific
applications.
That's
the
case
for
at
Eugenie
and
indicate
for
cloud
like
a
million
we're
kind
of
webs.
C
What's
going
to
be
building
new
mechanisms
so
that
instead,
we
can
containerize
applications
and
throw
on
a
containers
which
is
obviously
going
to
make
the
system
more
efficient,
more
more
portable,
and
we
hope
is
going
to
take
us
to
to
the
next
level
of
the
system
in
the
1574
efforts
that
we,
the
water
meter
width
and
this
figure
shortage
is
a
ten
thousand
review
and
we
have
started
this
project
is
like
a
month
ago.
So
we
have
been
a
heavy
discussions
on
what
we
should
be
doing
and
how
we
would
do
it.
C
Is
that
software
infrastructure
effort
led
by
what
is
the
syllabus
in
intention
to
set
a
biology
community
and
that
is
currently
deployed
in
a
more
than
100
sites,
so
I'm
as
pyro
sure
have
needs
to
em
drawn
or
enable
the
it
accused
the
computation
against
the
data
that
they
host
and
we
plan
side
that
we
hope
to
try
this
to
be
the
platform
for
that
and
that,
if
sorry,
we
don't
have
a
thank
you
climb.
How
is
that
possible?
Thank.
D: So I'm curious, and this is for Alex or Claris: I guess it's interesting to me that what you've built to support the biology seems like it's going to be more generally applicable. I mean, do you see this... you've got the genomics community as well as the hydrology community; any thoughts beyond that, in terms of what you might be able to support with this?
C: That's exactly right. From an infrastructure point of view, we're really designing the system to be extremely generic. For example, if you look at the top of the architecture, we're looking at enabling the provisioning of containers, and whatever is within those containers is completely immaterial to the infrastructure; the same goes for hosting data. So the system is, by design, generic, to serve the broader community.
D: That's great, Claris. I'd be very interested in some further conversations; just as I was pointing out, we're looking at what our next investment is going to be. Alex and I have already started this conversation, and it would be great to bring you in on it.
E: So, you know, we have for a long time talked about trying to create generic infrastructure, and I think we're doing a pretty good job of getting there. But no matter how generic it is, I think the inherent level of complexity of what we're trying to get done, creating infrastructure for a community of science, almost requires those special people, those unicorns that Joan was talking about, who can help.
E
You
know
the
communities
of
science
take
advantage
of
generic,
but
not
personalized
software,
and
so
I'm
wondering
you
know,
I
wonder
if
we
could
start
I
don't
know
thinking
about.
Maybe
is
the
right
term
to
use
about
how
far
you
can
go
afield
from
let's
say
you
build
something
it
works
for
a
particular
community
like
cybers,
did
I
think
it's
been
quite
successful.
You
know
building
an
infrastructure
for
a
community
of
science
and
more
than
one
community
of
science.
E
So
how
you
know
the
question
is,
but
yet
in
talking
to
drop
the
level
of
alcohol
hand-holding
is
quite
extensive.
So
my
question
is:
is
there
a
way
of
crew
I'm,
not
sure
if
Howard's
still
there
is
there?
A
way
to
do
a
Facebook
similarity,
you
know
friends,
map
of
scientists
and
science
domains
so
that
we
think
about
how
far
a
domain
is
from
another
domain
to
kind
of
get
an
idea
of
how
hard
it's
going
to
be
to
migrate.
Cyber
infrastructure
do.
B: One model that we have, just to focus on the Tripal databases: Tripal is a technology for repositories storing genomic data, but it's built around lots of different communities within the life science community that are a lot more agriculturally focused. So there would be, like, a tree community, and different ones; actually, the insect i5k is there too. And I think that is the way to tether the users together: through these repositories that they're already using.
B
And
then
you
know
putting
these
these
tools,
these
air
coding
generic
tools
into
their
hands
that
meet
meet
their
needs.
So
I
think
that
there's
ways
to
do
that
without
trying
to
like
you
know,
just
kind
of
naturally
go
through
from
a
repository
perspective
that
human-computer
interaction
and
interface
is
out.
There
that's
a
way
to
build,
build
ease,
but
I
was
bringing
this
up,
but
I
think
that
it's
you
know
defining
the
word
community
is
a
very
difficult
thing,
especially
writing
about
biology.
A: This is Florence. That might be something that's an opportunity for the hubs over time, right? Because we're going to be seeing these different communities, and we're going to keep trying to enable cross-hub, or cross-whatever, collaboration. It could be that we're able to build that map over time; it's just going to take time. But that could be an interesting objective for the hubs working together. What do you all think?
D: I had a follow-up, adjacent to Stan's question. So I heard about mapping, and we're talking about mapping data sets, and then we're talking about unicorns and their ability to help enable this science; and to me that sounds more like mapping expertise than the datasets themselves. If you have one specific community of practice that is using the infrastructure and is quite successful, and you have perhaps a new community that is joining, that community does need that expertise.
E: So, if there is a metric that allows you to say these two data collections are close, even though one is for herpetologists and the other one is for left-leaning journalists (no, I'm just making this up), you know, maybe there's something to be said there. I mean, I admit we haven't done anything to characterize the expertise; I fully admit that. I'm just saying we've got to use what we've got.
C: I think, Alan, that tends toward this sort of cross-pollination that you talked about, right, which we're doing with the hydrologist community: we're getting a faculty member involved from UNC to also understand how these two communities have similar data set needs. So I think we could say that we are trying to understand how these two communities, their data needs and the types of data that they have, are alike, and how they could use the system. Is that correct, Alex?
B: Yeah, and I think, look, this is where this project is epic: we're in a good place, and we can all work together, and that's the most important thing. But I look at everything from a data perspective, right? I've just hacked through my years as a scientist, you know, scaling up my research. So I hear what you're saying about generalizing systems and software, but I look at ways to generalize the data, and in biology that's pretty easy.
B
We
have
the
evolutionary
tree
to
relate
data
together
and
I
would
look
at
different,
triple
database
repositories
and
the
ncbi
has
different
ways
of
aggregating
data
around
the
Tree
of
Life
and
so
from
a
domain.
Scientists
biologists
perspective.
That's
the
way
I
would
tether
together.
The
researchers
is
through
the
Tree
of
Life
evolutionary
perspectives
and
then
in
other
we're
doing.
I
have
visions
with
sie.
Das
is
mapping
a
lot
of
the
cyber
infrastructure
to
be
able
to
look
at
complex
gene
interactions
across
the
Tree
of
Life.
G: So this is Renee, just to chime in a little bit on the discussion. I can't say too much about it, because we're still writing the proposal for it; it's not secret, it's just not fully baked yet. But one of the things we're working on, rather than focusing on the underlying infrastructure or databases or repositories or a lot of heavy metadata, is trying to build a resource as the first step of our big data resource map.
G: The hubs are all thinking about putting together, in some coordinated manner, different versions of big data resource maps. So we're trying to do something that's as practical as possible, with as little lift as possible, and what we're focusing on is the notion of domain experts posting their challenges, the problems that they're trying to solve, along with the data, and trying to create interactions between them and data science teams. You know, frequently, with our machine learning efforts...
G: What happens is a dataset is published, and then you say, "okay, go out and figure out what to do with this data" to the data scientists and machine learning folks. So we're trying to do it the other way around: create an ongoing dialogue between the domain experts and the data scientists, where the data is a central part of that dialogue, and then have the community start to create solutions on the data science or machine learning side that actually solve those problems.
G: So this may be, as part of what everybody's been talking about, a way of starting to gather the data about what the different challenges are in different domains, and who the types of folks asking for specific solutions are; and then you can kind of map it to datasets and the underlying infrastructure.
E: That's helpful. So it seems to me like what you're doing is a very early chunk of what some people call the blackboard architectural model, where people post things up on a blackboard and just basically share ideas and partial solutions.
E: It's an old idea (it's fairly old; I think it was published some years ago), but it's actually getting some traction again, because now, with smart APIs, you can start thinking about being able to post a solution and reach out to data in kind of a lightweight fashion. So it's almost like a mash-up toward a solution.
G: Yeah, that's in fact the general direction the thinking is going in. This is actually the first time I've heard of the blackboard architectural model; I'm not sure whether the folks who are helping me put this together know of the model, but that's good to know too. Happy to fill you in once they've thought it through a little bit more, because it sounds like this is something where there are lots of moving pieces.
E: So, a lot of what I'm saying is informed by my fairly recent experience with this NIH effort that has taken kind of an odd turn, but an odd turn that I find particularly intriguing. There are four teams being funded across the country to try to pull together very disparate data sets and attack...
E
Basically
fundamental
questions
around
bio,
human
biology
and
the
idea
was
floated
that
would
suite
like
a
blackboard
system
where
we
post
stuff
and
we're
not
trying
to
provide
answers
or
provide
complete
sources
of
knowledge
or
anything
else.
We're
just
going
to
let
this
kind
of
emerge
over
time.
It's
real
simple
metaphor,
and
so
the
reason
I'm
saying
you
can
reach
out
to
the
data
is
that
we
have
a
hackathon
coming
up
fairly
soon
and
what
we've
been
charged
to
do
is
to
take
all
of
the
data
sources.
E
We
have
and
put
them
up
behind
an
API,
a
smart
API
and
expose
them
and
then
we're
going
to
do
queries
across,
and
so
it's
a
really
just
object,
intriguing
concept
because
it
doesn't
require
me
to
do
a
lot
of
I
mean
it
looks,
I
think
it'll
be
inefficient
and
it
isn't
what
you'd
want
if
you're
trying
to
get
answers
fast
at
highly
complex
problems,
but
it
does
permit
you
to
kind
of
start
to
assemble
a
lot
of
stuff
without
tying
everybody
down.
I
think
the
way
to
do
it
fully
specified.
G: That's the general idea that we're going to pursue. We started to look at a lot of different efforts; as an example, very metadata-intense efforts: when they're up and running, they're quite good solutions, but the challenge is getting them up and running across each new domain, particularly if the domain is very different. So we're trying to think of a way of doing this that's much lighter.
E: In a lot of ways, it's analogous to this data lake concept that people have been bandying about recently, which is: don't try to get all your data organized and into a central database; just pour it all into receptacles that you can control, and eventually somebody will start working on some part of it that will help inform something else, and in an emergent fashion you'll start to organize the data. So this is kind of even more generalized, which says: don't even put it in a receptacle, just set up an API.
D: But this is where linked data comes in handy, because you can get sort of emergent annotations across data sets, against the linked data specification, and with the appropriate tools that allows your users to annotate not only the datasets themselves, but also the relationships between them, which may be handy, yeah.
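The point about annotating both datasets and their relationships can be pictured as subject-predicate-object triples, the basic shape of linked data. A minimal sketch, with invented dataset names and predicates (plain Python rather than an RDF library):

```python
# Linked-data-style annotations as subject-predicate-object triples.
# Dataset names and predicates below are hypothetical illustrations.
triples = [
    ("datasetA", "hasFormat", "FASTQ"),       # annotation on a dataset
    ("datasetB", "hasFormat", "NetCDF"),
    ("datasetA", "derivedFrom", "datasetB"),  # annotation on a relationship
]

def annotations_for(subject, triples):
    """Return all (predicate, object) pairs describing one subject."""
    return [(p, o) for s, p, o in triples if s == subject]

print(annotations_for("datasetA", triples))
# → [('hasFormat', 'FASTQ'), ('derivedFrom', 'datasetB')]
```

Because relationships are just more triples, new links between datasets can accumulate emergently, as the speaker describes, without changing any schema.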
B: Okay, can I ask a question along these lines? From an Internet of Things perspective, and I'm thinking about DNA sequencers (right, I'm just focused on the biology): what happens when every hospital, everywhere there's a blood test, when every research lab is generating, what's the term, brontobytes of DNA sequence data that you're never going to be able to annotate, never going to be able to store? It's going to have to be ephemeral.
E: You know, I think that's an existential question, and I don't mean to be facetious about it. I think it's very hard at this point to come to grips with the fact that we're going to be producing data (I mean, I think we're already there) that we're never going to look at, and with how we hook that into something useful.
E: Well, maybe there will emerge some kind of quantum storage device and we can put front ends on it, or something; I just don't know. But you know what has been the most frustrating part of my career, which is getting long in the tooth, I guess: we always try to engineer things to work really, really well, and I'm inclined to try to figure out ways of getting things engineered to work in a half-assed fashion. I hope it gets better.
B: And it's not... this is not a science fiction problem for me, or an existential one. I'm a guy who, right now, has 170 terabytes or something like that of storage, and I need to process petabytes of data. And, you know, I can try to go out to national compute resources to be able to do that, but I'm just one dude. What happens when there are, like, 100,000 people who need to do this?
B: I think that's kind of what SciDAS, what I want it to be, is getting at: some of these concepts of not trying to do too much, and being able to decide what you want to keep and what you want to erase. That's what the other intent here is: I want to be able to have the data in a repository, then process it and delete it, and not worry about having to download it again; that kind of experimental, more ephemeral type of experimentation. Yeah.
A: I think collaborating with NIH and those guys will be interesting over time, as they get more into precision medicine, precision cancer too, when they're trying to leverage all these different pieces of data: clinical research from around the planet, in a hundred different languages, if they use cognitive computing and AI to interpret and translate it and then put it together with clinical research data, environmental data, and your plant and animal genomics data, all this stuff coming together to create context.
D: I think this leads right into the linked data, and then the metadata and the provenance. So you're going to have datasets that you can't keep because they're too big, and you're going to process them, and they're going to produce results, and you're still going to need to know how you got to those results; so you basically have a shadow of the data that was there. At some point, you may need to go back and recreate it all.
F: It may be that we just have to start to be comfortable with more uncertainty. I mean, this is something in the DataBridge, right? So we're trying to build... let's say we have a hundred thousand data sets and we build a network on them, okay, and we say to the users: okay, give us all your data sets and we'll give you the ones that we think are the most similar.
F
But
the
problem
is
there's
a
dozen
at
least
or
maybe
more
like
50
ways
to
define
some
learn
and
and
there's
no
actual
answer
right.
There's
no
ground
truth.
Nobody
knows,
nobody
can
look
at
a
hundred
thousand
data
sets
and
say:
oh
yeah,
these
are
the
ones
are
most
similar
if
they
could,
but
I
wouldn't
move
on
and
do
something
else.
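One concrete example of the many possible "similar" definitions the speaker alludes to is Jaccard similarity over keyword or metadata term sets. This is an illustrative sketch only, with invented keyword sets, and is not DataBridge's actual algorithm:

```python
# One of many possible dataset-similarity definitions: Jaccard similarity
# over metadata keyword sets. Illustrative sketch, not DataBridge's method.
def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

genomics = {"dna", "sequencing", "expression"}
hydrology = {"streamflow", "rainfall", "expression"}
print(jaccard(genomics, hydrology))  # → 0.2 (1 shared term out of 5 total)
```

Swapping in cosine similarity over term vectors, or a learned embedding distance, gives a different ranking over the same datasets, which is exactly the no-ground-truth problem being described.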
C: ...because it's linked data. The fundamental algorithms that have enabled Google to do what it does work because the data are linked, because the web pages link to each other, and the ranking they can derive is based on popularity, which is a metric of success or usefulness; it can suggest, perhaps, which data set you should prefer. If the data sets are brought into a linked data model, then you can just apply the techniques being developed in computer science to build the next Google.
A: NIH, yeah, went off and built the whole system. But I'd actually suggest connecting with the Google person about what you're talking about, and then following up with you on that. It is 4:30, so we'll take one last comment or question, and then we're going to close for the day.