South Big Data Hub Data Science Round Tables, 28 Nov 2016

From YouTube: Harnessing the Data Revolution: A Perspective from the NSF

Description

SBDH-HarnessingtheDataRevolution-Roundtable

A

We have a full house today here at the Renaissance Computing Institute, so we're very excited. Welcome to the inaugural South Big Data Hub Data Science Roundtable. For those who may wish to live-tweet this event and quote Chaitan, please use hashtag #SBDH16 (South Big Data Hub 16) or hashtag #BDHubs. I'm Dr. Lea Shanley, the co-executive director of the South Big Data Hub here at the Renaissance Computing Institute in beautiful Chapel Hill, North Carolina. For those of you who may not be familiar, the South Big Data Hub builds R&D communities of practice and accelerates partnerships among government, industry, and academia for those who apply data science and analytics to help solve regional and national challenges. The South Big Data Hub is part of a network of four hubs launched by the National Science Foundation in 2015 and co-sponsored by our host institutions and other partners.

A

We manage the South Hub jointly with Georgia Tech, the Georgia Institute of Technology in Atlanta, and we serve 16 states in the South region, from Texas to Delaware and everywhere in between. We have 500 members from universities, nonprofits, corporations, foundations, communities, and communities of practice. Before we get started, I'd like to introduce Dr. Stan Ahalt. He's the director of the Renaissance Computing Institute and one of the two principal investigators.

B

Thanks, Lea. Yes, I'm so pleased to see everybody here in the room, and also to be hearing all the beeps as people come online. Would you estimate, Stephanie, how many people are online at this point? Yes, 35 or 40 more people? 36. We're going in the right direction. So it's my great pleasure to introduce Chaitan Baru from NSF. He's currently on assignment as Senior Advisor for Data Science in the CISE Directorate at the National Science Foundation. I've known Chaitan for a long time. He's a wonderful colleague and does very interesting scientific research himself.

B

He's a distinguished scientist and associate director of data initiatives at the San Diego Supercomputer Center, which is part of UC San Diego, and he has worked on applied and applications-oriented research problems, all of which are related to data management and data analytics. He's been part of a lot of national initiatives: the OpenTopography project, Cyberinfrastructure for Comparative Effectiveness Research (CYCORE), NEON, GEON, and the list goes on and on. Needless to say, he is an incredible asset for NSF. You know, NSF has steered with his leadership, and we are very pleased to have him come down to Chapel Hill. Tomorrow he's over at RTI giving another talk, and we will be in Durham for another South Big Data Hub initiative with a group. So I welcome everybody here. Thank you so much. Just a reminder.

B

As you come online, please do mute, because we'll hear you typing or eating or doing whatever else you're doing. Thank you so much, and with that, I'll turn it over to Chaitan.

C

Alright, then, thanks very much for the gracious introduction. I just wanted to say that the beeps also sound when people exit, so they might just be leaving right now.

C

Okay, so I think I'm going to take about 30 to 35 minutes to give you an overview of this whole area of Harnessing the Data Revolution. It's one of the Big Ideas that NSF has recently put out. The layout of my talk: I will just mention what the NSF Big Ideas are, talk a bit about current data programs, and then try to spend more time on future directions, that is, where we should be going. And as Lea mentioned, we want to have discussions, Q&A, and so on. Clearly, the ideas are still forming. We are not saying things are cast in stone here, so we are open to listening. You are the folks who are doing the real work here at the cutting edge in this community.

C

So this is just the CISE organization, the computer science directorate. There are four divisions. I'm assuming you're all familiar with this, and where that person is blanking out, so there are two of us.

C

The NSF Big Ideas: this is something that was recently put out by NSF. Hopefully you all had a chance to see it. The goal was to come up with a set of really exciting initiatives that would catalyze public appeal and attract partnerships with industry, private foundations, and academia, so every sector. By the way, I should also say that I'm very happy with this Big Data Hub event that you're doing, and I'm reminded of that because when we talk about partnerships, there we have the Hubs program.

C

The goal was also to put forward the agenda for cutting-edge research.

C

The process by which this was done: each directorate had discussions on what they thought would be the big ideas for the next 5, 10, 15 years from a directorate point of view. These were all bubbled up at an assistant directors' retreat that happened across all the directorates, and then these were filtered down to the set of ideas that we're seeing, through refinement and collaboration across directorates. So here,

C

On the top are the six big research ideas. Obviously, if it were left to me, I would have been happy if there was one idea and everything else was connected to it, but there are six: Navigating the New Arctic; Work at the Human-Technology Frontier; Understanding the Rules of Life, basically, you know, predicting phenotype; the Quantum Leap; and then Windows on the Universe. And you know I have data science in the middle, and actually you can already see that many of these, if not all, have huge data requirements as well, right? So they will be interconnected.

C

Then there are also some process ideas, and these are more to do with how NSF will go about doing its business. So the notion of convergent research is that almost any problem that we are looking at as a community now is interdisciplinary; we have a lot of disciplines coming together. So there's this notion of convergence, and the notion that the problems we are looking at, whether it's smart cities or climate change or precision medicine, all really need interdisciplinarity. The process idea is just that NSF recognizes this need, which might imply that future programs that come out will have more and more of these inter-directorate kinds of initiatives.

C

The second process idea is, you know, how do we do programs that increase diversity in science and engineering? The third is about this notion of midscale. If you look at NSF today, the standard programs might get up to, say, 20 million dollars or so, if they're the big STCs and those kinds of things. And then there are the major research equipment projects, which start from around 200 million. So there's this gap between 20 million and 200 million, where there could be a lot of interesting work that could be done, but there is really no vehicle to make that possible.

C

So that's the identification, to say: let's create some mechanisms in between, say in the 50 to 70 million dollar range. That's very relevant, of course, to the Big Ideas, and certainly to the Harnessing the Data idea. And then the last one is about creating some kind of an innovation fund that would be discretionary and would be used to jump-start new areas.

C

So those are the Big Ideas, and they are all very interconnected. This circle shows, on the left-hand side, all the Big Ideas; on the right-hand side are all the directorates; and the lines are just saying that they're all connected, as you can see from the jumble. But if you look at the Harnessing the Data idea, it connects to actually all the directorates. So that's important; everybody would be interested in that. And then, if you look at CISE, it has links to all the ideas. That's very interesting and exciting.

C

It's a very interconnected thing, so any of these ideas that we push forward has to be multi-directorate.

C

So now, let's talk very quickly about existing programs at NSF. When we look at our data science and data programs in general, we like to use this quadrant diagram to show the four basic areas in which investments are being made. One is, of course, foundational research; another is cyberinfrastructure; a third is education and workforce development; and then there is collaboration and partnerships. Overlying all of that are policy issues such as open data.

C

I won't go through all of these, but I'll just show you the names here. The Big Data research program, which I actually have a hand in helping coordinate, is cross-foundational. We have many program officers who are a big help on this, and we have folks involved from every directorate. That's the one I'm more familiar with, and then there are these other programs that we also have, which are all on the foundational research side.

C

On the cyberinfrastructure side, I'm sure you are familiar with these programs.

C

And some of the directorates have their own, or at least in the past have had their own, CI programs. The most important education and workforce development activity is the NRT, the traineeship program; in the past it was IGERT. This certainly has a data science track, and it always continues to be an area of interest.

C

And finally, this is a new area for us: in collaborations and partnerships, this is really the Hubs program. We are very happy with it. Thank you all for doing a great job of it. Everyone hears about this everywhere, and I think it already looks like it's going to be a program that's going to be successful. The goal is really to create this multi-sector collaboration and help jump-start new activity.

C

Ok, so that's all I'm going to say about the programs. Now, cyberinfrastructure is a key component, of course, in any of our strategies.

C

The other thing I wanted to mention is at the federal level; again, this is my way of just giving you some quick context. There are multiple interagency coordination groups in different technical areas under the White House Office of Science and Technology Policy. One of the coordination groups is in Big Data. That's an interagency working group that I co-chair, along with my colleague Allen Dearry. We put out a report back in May of this year, just laying out what the issues are for federal R&D. I just wanted to point out the seven topic areas; it will all again look familiar.

C

Creating next-generation capabilities. Understanding trustworthiness of data; that's a big issue when you start thinking of decision-making. Building the big data cyberinfrastructure, and the issue of sustainable preservation. A big issue, of course, is privacy, security, and ethics.

C

Education and training: as you're familiar, almost every university of any size is looking at starting programs in data science, at the undergrad level and the master's level. There's a huge demand for these kinds of skills right now in industry. So how do we do things in the short term, but also, what's the long-term strategy? And finally, creating and enhancing connections in the innovation ecosystem, and once again, our Hubs and Spokes program, from the NSF point of view, responds to that. Again, note that this is a federal thing, so it's meant for all agencies to consume and do as they wish with these recommendations, so different agencies might have different programs that respond. So, coming back to the Harnessing the Data theme.

C

What it highlights is research across all of the NSF directorates; as we saw, there are interconnections from Harnessing the Data to all directorates. We use this three-circle diagram to show the key activities that would be involved in this area. One is the theoretical foundations, so it involves math. The other is the systems foundations. And the third is called data-intensive research. This is research with data in all of the domains, maybe biology, and so those would be the key areas in terms of the research. But then education is also a big aspect, and then, as I said before, there has to be some cyberinfrastructure that supports this, skewed towards the kind of data science research and the whole explosion in data science.

C

If you think about this in an architectural way, right, you think of a layer diagram for how we might architect an open national infrastructure. You can think of the core infrastructure at the bottom, some services in the middle that allow you to exploit that, and then there are applications.

C

The infrastructure really has to be storage, and it's not just storage; there has to be compute that goes alongside it, so it's storage and compute together.

C

So, as I say, there could be multiple stakeholders or multiple partners involved. At the scale at which this kind of activity is going on, first of all, there are already activities, and secondly, NSF is not saying that we are the only ones; the enterprise is too big. So there are partners at the campus level. There's certainly a role for commercial, which could be the cloud providers and others. There could be national things, and certainly there are international activities that you might want. One of the substrates that we might exploit here is the sunk infrastructure that NSF has already funded in the form of CC-NIE, you know, the high-speed networking that is on campus and across campuses, GENI networks, and so on. Those could form the substrate on which all of these are built.

C

In terms of services, there are core services, right, just basic things like authentication. But then there is a whole set of interesting services that we could be developing, related to discovery of data, access to data, deep analytics, semantics, and integration in a very broad sense. There's a lot of action there.

C

Of course, as you go from the top to the bottom, you get more sharing, so the bottom could be shared by multiple directorates' projects, and the top is more disciplinary.

C

A big issue in all of this, of course, would be things like governance, policy, and how we manage it.

C

So that's the broad context for it. All of this should be created in the context of applications, and the applications are science. So, ideally, what we would like to do, then, is to pick some of the big national priority research areas that NSF has identified and essentially use them as customers. So: UtB and NBOs, Understanding the Brain and the national brain observatories, for neuroscience; INFEWS, the nexus of food, energy, and water systems; the third one is Smart and Connected Communities; and then the MREFCs, these large facility projects like the ocean observatories.

C

Ideally, to get the most bang for the buck, and also to make it really interesting research in this area, all of this should be able to serve a general set of needs. With this vision, we are also trying to get away from some of the siloing that has happened. I mean, right now, the way MREFC projects are done, each project does its own cyberinfrastructure, and so the question here is: could we do something more generic that could be leveraged? That seems to make sense, and it also allows us to focus on the more interesting things, right? If you can make these more common and take care of this, then maybe we can be working on novel services.

C

The next thing is that this environment would be characterized by open data and open systems. Okay, well, I shouldn't say just open source; it's open data too. If someone was in the audience, they'd say you never said privacy. Of course, privacy, and a lot of this sharing is going to be very, very interesting and tricky, because, I mean, there's a whole lot of interesting data where privacy matters. The moment you have data about people or anything, there are special issues. How do we handle them? That's a good challenge.

C

So, ok, let's talk a little bit about the future. Now, I'm a data guy, so I just love it. I mean, with data I can retire without any problem, just working on data. Reagan has retired and is still working, so I think there's plenty to do. But if you want to think in terms of what the landscape looks like, one can posit that there will be these data-intensive aspects of every discipline. Internet traffic analysis is one; data center logs are another.

C

Then we have this new discipline of data science. The data science discipline is really about studying data per se, the whole lifecycle of the data, and, as I say here, the theory of data science: representation issues; modeling, as in statistical modeling but also data modeling, all of those kinds of things; statistics; machine learning; and technological or pragmatic issues. Certainly, if you put all of that together, that's a new discipline. I also like to think about translational data science. Translational data science, then, is the notion of applying data science techniques to solve real-world problems.

C

So then there is a connection from data science to data-intensive computer science: you might have so much log data that you have to deal with it using some statistical techniques. Similarly, you have data-intensive geoscience, and so that's the role that translational data science plays.

C

It's kind of important to have that concept, because for decades I have always been concerned about the professionals who really make things happen, right? They're the guys who are doing the plumbing and really making sure that the HPC or the data is happening, and there's no way to recognize them in the university system. They're transparent; it's like contract workers, the way Microsoft puts people on contract. So owning this activity as a first-order activity, I think, would be important, just as you have translational medicine. Years ago, at SDSC, I was trying to push this idea of translational data science.

C

So how do we make progress? One mechanism that is of use for NSF is for the community to express its needs. If you look at the traditional science disciplines like astronomy, they tend to get together as a community and identify what's the next big instrument: we need to have this kind of storage system, it has to be able to accept petabytes at a time, and that becomes the vision of what we need. So I think in data science, we as a community need to be able to articulate what the things are, what the infrastructure is, that is needed to make real progress.

C

So, you know, identifying what the needs of science and data science are: the idea would be to have community workshops to identify these needs. The needs, I think, go from what hardware and platforms you need, to what the data sets are, so the data sets themselves, to what the infrastructure is.

C

Just to mention, we've been working with industry on a topic that I mention in a couple of slides, with R.V. Guha, who is from Google, and he coined this phrase, the "hello world" for big data. So what is the "hello world" for big data? "Hello, big data" is: you walk up to a petabyte and say, let's take 20 terabytes out of this and do some modeling, and give that as a homework. Whereas at a lot of the web-scale companies, they do this every day. So we have challenges like that. So the data themselves are part of it, and then you might build software stacks and also distributed testbeds. For something like Smart and Connected Communities, there might be a testbed. There is this project, the Array of Things, and there are many other smart city projects out there, and so a testbed that you could use for smart cities could be part of the infrastructure, and so on. Actually NEON, the National Ecological Observatory Network, when it was originally envisioned, the vision actually was to have an open system so that others can plug in.

C

For example, here's a grant that we just awarded under the Big Data program to a couple of guys from Virginia Tech and the University of Miami. They are creating this testbed for smart cities, and this last sentence here says the testbed is intended to be open access, to be able to support both research and the whole scientific institution, as well as other users requiring non-proprietary access.

C

So let me now just highlight a few of the things that would be part of this vision going forward. I already mentioned theoretical foundations and systems, and so on, so just to dig a little bit more into that: NSF recently held a workshop on this topic of theoretical foundations of data science. There is a workshop report you can get.

C

So I just pulled out a few statements from the report. Theoretical foundations are fundamental for industrial applications and scientific understanding. There is a demand for training in this area. By the way, this workshop invited sort of one third, one third, one third: people from theoretical computer science, but really sort of machine learning; statistics folks; and math, equally divided among those communities. It was sponsored by CISE as well as DMS. It was a group of theory folks, but it was very interesting to see the very applied and practical issues that they had.

C

So data science as a broad discipline covers everything from the experimental design phase and data collection all the way to data analysis. The foundations should have strong interfaces to application domains and connect all the way to the apps. They called for interdisciplinary collaboration between theoretical computer science, math, and statistics. There are problems like reproducibility and privacy.

C

And then they also discussed a number of modalities by which, if there were to be centers of this type, they could be launched, and one idea was to leverage something. Big theory, one more, maybe one more.

C

Okay, so that's on the theoretical foundations side, and these are a few things I made up on the infrastructure, systems, and engineering side, right? So the infrastructure is about storage. It's about data, as I mentioned. Then there is what the basic data services would be, and that's the "hello world" example I already gave. And the idea is, you know, it's not just queries to the data; you want to have dialogues. I want the system to do some storytelling: innovative ways in which you can have multiple interfaces to the same data. And in this model, I mean, SQL just becomes assembler.

C

Another idea was machine learning systems, and there's a meeting that, again, I'll mention in the next slide. We had some folks from industry, including the vice president for cognitive systems from IBM, so this is Watson, and they were talking about how it's very important to think about how to build generalized machine learning systems, because really, what's going on right now in the industry is building a machine learning system for A, another for B, another for C. They can clearly see the writing on the wall: this is not scalable, and the cost of maintaining multiple vertical systems keeps growing. So they would really like academia to start talking about how you build a generic machine learning system. What are the things that are transferable from one domain to the other? That would help us build a more generic architecture. So that's this notion of machine learning systems, and then there are ideas like reproducibility. That's one of them I put in, but I think we are only at the beginning.

C

I've been talking to some of the folks who've been doing this in their research, and, you know, people haven't fully thought it through. How do you test it? Right now, reproducibility systems will say: okay, look, I took all this stuff from this machine, ran it over there, and I got the same number. But if I want to ask, how did you get that number, what happened? They're not there yet, I think. It needs more engineering and usability.

C

If we're talking about the data ecosystem, then there are lots of sources where data can come from. We have research data management, and you can think about this as NSF funding this, right? So NSF funds independent researchers. A small group, a database community, will get together and justify the need for a database in a certain area. Stan mentioned OpenTopography when he introduced me. That's a project funded by NSF; actually, that's for just airborne lidar, so if you are an airborne lidar person, you know, you go there.

C

There are certainly institutional-level repositories. So I said NSF, but the institutional repositories would be non-NSF; maybe universities have their own repositories, or regional networks, and there may be large community ones. You must not think only of the small databases; that's actually the entire Wyoming supercomputer building, where there are petabytes.

C

So if I have all of these kinds of data, what I would like to do is put in a set of services that allow me to do discovery and access, to have an umbrella. And I think this is also part of that same vision. If you think back to that picture of the Harnessing the Data vision, you want to populate it at the bottom with all these different kinds of data, but then you want to have services that provide you integrated access. So one activity that's going on right now, that we are trying in this area, is, you know, could we create an open knowledge network?

C

A knowledge network is something that connects a number of entities together, based on whatever relationships might have been discovered or defined or inferred. So there's machine learning and modeling involved, but somehow you create linkages between things, and you create this network that shows the semantic relationships between things, which could then be used to answer all sorts of complex questions.
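As a minimal sketch of the knowledge network idea described above, assuming a simple in-memory store of subject, relation, object facts: the class, its method names, and the relation labels ("funds", "hosts") are hypothetical illustrations and not part of any system mentioned in the talk. The two example facts restate what was said earlier about OpenTopography.

```python
from collections import defaultdict


class KnowledgeNetwork:
    """Toy knowledge network: entities linked by labeled relationships."""

    def __init__(self):
        # Maps each subject entity to a list of (relation, object, provenance).
        self._edges = defaultdict(list)

    def add_fact(self, subject, relation, obj, provenance="defined"):
        """Record one link; provenance is 'defined', 'discovered', or 'inferred'."""
        self._edges[subject].append((relation, obj, provenance))

    def neighbors(self, subject, relation):
        """Return the entities reachable from subject through one relation."""
        return [o for r, o, _ in self._edges.get(subject, []) if r == relation]

    def follow(self, start, relations):
        """Answer a multi-hop question by following a chain of relations."""
        frontier = {start}
        for relation in relations:
            frontier = {o for s in frontier for o in self.neighbors(s, relation)}
        return sorted(frontier)


kn = KnowledgeNetwork()
kn.add_fact("NSF", "funds", "OpenTopography")
kn.add_fact("OpenTopography", "hosts", "airborne lidar data")

# "What kinds of data are hosted by projects that NSF funds?"
print(kn.follow("NSF", ["funds", "hosts"]))
```

A real open knowledge network would need entity resolution, provenance tracking, and web-scale storage; the sketch only shows how linked facts let a question be answered by walking relationships.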

C

So we called this meeting; we couldn't come up with any name for it, so we just called it "Entities, Facts, Questions". That was a meeting under NITRD, which is the interagency program. It was a meeting with people from industry, academia, and agencies, and that's where we had the VP from IBM. We had a voice from Amazon, and a few agency people, and a few others.

C

The motivation for this was, you know, we definitely want to move, and we are moving, towards more natural interfaces. As it gets more complex, you've got to have easier ways, and in industry it's things like Siri and Cortana, all these interfaces. That's kind of the future that we are going into. We want to have a networked data and information infrastructure capable of supporting integration of information, so that you can do question answering. You're going to have simple interfaces, and machine learning and knowledge representation can be instrumental in providing access to these complex, model-driven narratives. That was the motivator.

C

And the idea is the Open Knowledge Network. Can we create open, web-scale knowledge networks to make possible research and innovation on an entire class of new data, context, and inference?

C

So I'm down to my last few slides. Another area that's really important, and is also getting a lot of attention, is this notion of ethics and fairness. As we go more and more data-driven, there's a tendency to just believe the data, or just the machine's conclusions. Well, there might be biases in the data. For example, we just funded a project under the Big Data program on algorithmic fairness, and there was recently a workshop on "Data, Responsibly". There is a new AI and machine learning working group under OSTP that is actually looking at this whole issue.

C

So I think the key idea here is, how do we embed ethics? The only thing I'll say here is, I think you really need to make it integral. It's not a question of saying, let's take all the curriculum that we have and just tack on one more course on ethics, and the kids will learn something. It has to be deeply embedded.

C

And my last slide, I think, is just to talk about some upcoming events and activities. We have just funded the National Academy of Sciences to run this workshop on envisioning the data science discipline. It's CSTB, the computer science board at NAS, along with the statistics board and the Board on Science Education. So what we want them to do is step back and do some blue-sky thinking. If data science were a new discipline, what would it look like? Not "how do I make it fit in my university", but envisioning the whole discipline. As part of that would be, if there's an undergraduate discipline in data science, what would it look like?

C

And also, I have a feeling that there will be a big role for community colleges in this. So what do you do for a two-year program as preparation for the undergrad? But also, you might be able to get a two-year degree and get a reasonable job. They're going to address all of that. It's going to be a workshop, followed by an interim report, followed by a final report.

C

We have another workshop coming up under the NITRD umbrella, on metrics for assessing value, and then next year we're going to have the Big Data PI meeting, which we had last year for the very first time. So this is the second Big Data PI meeting, but this time we are combining it with the annual meeting of the Hubs. So I'm open for any questions.

A

We have people listening through the WebEx and people in the room. We'll start with some questions here, and Stephanie will signal us if there are questions from those participating on WebEx. If you're on WebEx, please type your questions in the box there. And we'd like to create a discussion, so don't just ask questions, but offer your comments and thoughts on this, and start responding to each other as well. So with that, first question.

B

It's not a stretch to call these socio-technical systems, and, you know, whenever we go to these workshops, we hear that the technical part's easy; the hard part is the social part. So, to what extent is studying these kinds of interactions, and facilitating community engagement to develop and use cyberinfrastructure, part of this?

A

Okay, let me just repeat.

C

The question that gist of it is what I talked about sounded a lot more technical. What about the socio-technical, because I always comes up front and believe me I know this.

C

But I have a couple of different kinds of answers. One, and maybe the best answer, is: as we think about this initiative, say the Harnessing the Data initiative I mentioned.

B

We would like to do.

C

This with some applications in mind right so in.

B

The beginning, you.

B

Might work with the Directorate.

C

To say, okay, let's take and I think in that context we should definitely worry about the social aspect.

C

That's one. The.

B

Other thing I would say is, I don't know, I feel.

C

Like we know this now. I mean, this is something we didn't know for many years, but I think we kind of know this now. So hopefully we're not gonna make the same naive mistakes. And actually, in that context, I want to mention that one of the projects that we funded is David Ribes, University of Washington, who has been funded to do a socio-technical evaluation, an ethnographic study of the hubs, because let's not be glib and think this is all going to be successful.

C

Social scientists observe us and tell us. So, yeah, we're aware of it, and to basically summarize again: we're aware of the problem, and one way to do it is to bite off a little bit at a time, so a few communities, and be aware.

A

What about the trustworthiness standards? Did I miss a.

A

So you already... so it was basically: what about.

C

Trustworthiness standards. So thanks for that question, because we talked about exactly that with Reagan before the talk, and I.

B

Really like that,.

C

Discussion we had, which is that Reagan has actually looked in great detail at an ISO standard on trustworthiness.

C

113 criteria. So I think, in the context of Harnessing the Data, you really want to think about that as a research program.

C

Right, so maybe it's some fifty to seventy million dollars, but it's a big research program. So in that sense trustworthiness at one level is not an issue, because all of these are just people working on it. On the other hand,

B

You might say.

C

That's so I think.

B

What I would say.

C

Is yes, I think it's about time for us to think in very specific ways about trustworthiness, and these ISO standards or other methods may be something to use as we.

B

Have the discussion noon.

C

Before the talk, this could also be a highly interactive, complex issue, because what's trustworthy in one case.

B

May not be trustworthy.

C

In another case. Or even if something is not so trustworthy, you might actually want to use it, if it's the only thing. So understanding.

B

So I'm really interested in a number of questions, but the one that just jumps out at me, where I have to ask the thinking behind it, is this idea of data science as a discipline, because right now I think data science is viewed by many on many campuses as an integrated solution, and I can't think of.

B

In academia we don't change quickly, and computer science is what.

B

Have one of the oldest computer science.

B

Probably still being argued about on this campus they're going to be surprised.

C

We want to be in your face, so the question is, you know, data science as a discipline might be a controversial.

C

The thinking was so.

B

Now we work very closely with our statistics.

C

Example, our stats people are approached almost every day by the statistics community, saying: why.

B

Don't you just.

C

Declare that stats is data science.

C

And our own statistics people will say: well, it's bigger than that. That's not all of data science; there are all these other things. And I think in computer science, I would say, yeah.

B

Data is only one.

C

So that's why I put up that box with all the things that you have to think about when you think about the end-to-end data life cycle: it's theory, it's the modeling, it's management, algorithms. Now, where is the one program that gives you that? So we purposely, for the sake of this workshop, said: let's.

C

This hassle of people saying.

B

Okay, I can take these two.

C

Courses from here and those three courses.

C

That's why it's a visioning, blue-sky kind of exercise, where you step back and just talk about it. By.

B

The way we do have some.

C

Trepidation here, because.

B

We are not sure.

C

If we can get the right kind of people from the community. So we're trying, as best as we can, to have some dialogues with our NAS colleagues to make sure we get the right kind of people, who are willing to think, just for the sake of the workshop, that there's a new discipline, rather than saying: well, where will I teach this, and how will I get credit if I teach in that guy's department?

C

So it's just like stepping back and saying what.

C

Was no science as.

B

For the exercises now.

C

People can take different lessons from the report. People say well.

B

That map's all.

C

The way to my campus. Somebody.

B

Might say you know what this is great I'm going to.

C

Start a new department. And there are also other countries: I was just in India a few weeks ago, and, you know, people everywhere are thinking about new things and departments, very much elsewhere as well. We don't know, but yeah, I totally.

B

Like NSF used to have ACI as a separate thing, as OCI.

C

That's that's a mean, so the question was and I said yourself Oh see I. Will there be an OD addict? That's the main question, because that shows that you never 7. So I was joking with my colleagues that actually.

C

Figure out exactly how. But if you found out I.

C

Some parts of CNS else is here and the DMS possibly make new. Yes,.

B

I would be out of an.

C

The question was about the ML systems that we talked about and what would be the plan.

C

Them generate I, don't.

B

Know I mean right.

C

So I think that's really what the research issue is: could you.

B

Kind of a generic framework.

C

In which you could slot in some of these algorithms, so the framework.

C

By that, followed by this; there are different algorithms for doing those. Where it comes from is actually what our colleague from IBM mentioned in our meeting. They said, if you look at Watson for Jeopardy, it was a system that was built for that. Then they took that and they created Watson for oncology. That's the system for doing oncology, which is actually a different system.

C

Some of the same engineers, some shared experience, but what they found is, now they're doing Watson for the insurance industry, Watson for this and that, and their frustration is they're having to start from scratch for everything. And in fact it was industry people who said, and the Amazon person there also agreed, there's gotta be a better way; there must be some more systematic way of doing this. So, to me, it's an open question how you would generalize.

B

Your last comment kind of struck me as reflecting something that I know already. I'm a domain expert. I pretend to be a data scientist once in a while, but really I'm a domain expert, and the hard part about doing the work that I'm interested in is this: NIH, which is often the source of my data, does mandate data sharing, but the problem with the data is that the metadata hasn't been captured to allow a non-expert to use the data realistically. And, you know, as an example.

B

You know, I could just make one up. I'm a physician, so I can say, you know, we could have a data point about people's weight. Okay, actually weight isn't uniform anyplace. Does it mean your lightest weight in the day? Does it mean dressed? Does it mean with shoes? It means all these things, and that metadata is routinely lost. And in fact, I would go as far as to say: one of the major sources of data for genetics is dbGaP, and in fact, the loss

B

of the metadata is intentional by the people that submit the data, because they don't really want other people to use it. I mean, that's a cynical, provocative statement, but it's not far from the truth. And so, you know, when you talk about needing to rebuild Watson, it's because that domain expertise wasn't there in the design, and so I'm curious if you have thoughts about how to, you know, bring in the expertise of the domain expert to data science.

C

I guess the question is about how we bring in domain experts to help with the data science process and systems, and capture the metadata well. So that's also some discussion we had earlier. It's clear that capturing metadata is going to be... and it's interesting, actually, one could argue that.

B

You know we haven't fully explored the.

C

Whole space of metadata, because you can easily come up with examples where, for just one piece of data, there is a... and there may be other cases where the data doesn't need them. So there's a whole range of things: not just understanding what that range is, but also understanding what kind of processing I can do with the data, given that I have this metadata.

C

Actually, the way I would say, that's captured in some sense is anyway I, one of the words I had there was.

C

And that, in a way, all falls under this: context. Metadata is part of it. I think we need to now build learning contexts.

C

The domain science side I agree, I mean that's exactly how we want to make progress. That's why, when we, when we do these kind of exercises, we should do it.

C

And I think we need to understand this as a system, to say: okay, how do we take what we learned from these guys and see which part of it is general, which part is... And another of those things, which Raj and others have experienced, is when.

C

To the domain folks like you and talk about things, every domain person will say: I'm unique; nobody else has the problems that I have. And when you go to them and say, no, actually you are not unique, I've seen this problem elsewhere, they.

B

You develop this method of saying. Yes, you are unique.

B

That interaction that needs to.

C

Happen, I think the answer is you gotta.

C

Projects and vehicles by which the domain folks continue.

B

The understanding of the method.

B

Is the context.

C

If you go back to some of the earlier things I said, you know, all these situations where sometimes this data is good enough to use in an analysis and sometimes it's not: as a data science or computer science technical person, you don't know that. That's very domain-specific. Now, maybe you could learn that over time by observing what's done, but initially it has to be the domain person who says: look, right now that's all I've got, it's okay, let's use it. But then there's metadata there as well, right? When I do the analysis with that data,

C

I need to capture that that's what I did, so that downstream processing understands that the processing was done using that particular.

A

So Jim Beach asks: The quality of commercial applications and user environments is increasing and accelerating in usability and user experience; for example, you mentioned Siri. Applications deliver the value of data science repositories and integration to science stakeholders. Yet science domains are highly constrained by grant funding levels in delivering nimble end-user applications to researchers and students. Where would the resources come from to take the value of the infrastructure envisioned and implement it as translational (open-source) fancy applications, which are expensive and require external funding to maintain and support? Or how can that large financial barrier be lowered to deliver the value of your vision to science domains?

C

Well, thanks a lot Jim.

C

Going back to the open meeting that we had, the whole idea started exactly because some of the folks from the community came and said: just as we were falling behind in cloud technologies, it looks like we are falling... the academic community.

B

Is falling behind.

C

So the session was either we.

C

To create something, that's open and have an initiative.

C

And see what could be done, and industry could help, but academia can at least do work that's compatible with what's going on. And as I mentioned, we.

A

Were able to attract.

C

Very good people from industry as well, and they were all very supportive, because they see the value of academia doing things at the same level that they are doing things, so that the people that we train can then... So.

B

I would say the way to do it is to actually embrace.

C

The way industry thinks in these.

C

Think it's not that academia has all the answers, so I think it would be good to get engaged. My own feeling I can say this is very preliminary. We are still in a lot of discussions with vendors and so on, but let's see yeah we are. You know it's possible that industry will help us in some ways getting right question.

B

Of resources, I hope.

C

You know who your local, congressman and Senator is I mean I come from California, where we we don't even see where up senator is very close connections and I think you should go and I think as an NSF employee I'm not allowed to actually delete that.

C

Yeah so question is: does NSF want to put money into translation? Is nice so.

C

That is something I made up, I put up, but actually I found out, again, as I say, I was in Chicago last week, and Bob Grossman there has a group, and actually, if you go to the website, it says what we do is translational science. So there.

B

Are some people.

C

Who were picked up.

C

In a way you could argue that what NSF has done is, they have somewhat funded that with the ITR program, so.

B

I think yes, more.

C

Of the ITR style.

C

Does translational medicine, but why doesn't it so I think that's a great question, because if we really think of data and data science, as this mega thing that we have to that, we have then we should translating averaging using that problem should be from.

C

Not just something you just.

B

Being enthusiastic about writing a proposal that would utilize IBM, Watson's cloud infrastructure and attacking the very problem that we're going to try to figure out how to build it, Cortana only it's Watson for genetics and we've got lots of data.

B

I think experiences I've had in panels, and you know it's a bunch of computer scientists or a bunch of that's.

C

Not that's just.

B

Oh no, it seems to me that somehow.

B

Technical problem we need to figure out how to get around, because I agree with you: I think a huge amount of the rewards and impacts are going to come from translational data science, and yet we have no.

C

See it's about supporting translation.

B

I totally agree.

C

So I think that's what I feel: that we are entering a new era, and at the highest level you can think of this as an organization like NSF.

B

Recognizing this.

C

Change is happening and saying: okay, that's a real, bona fide activity. I think that's the advantage of giving it a name: then the people who do it get respect.

C

And then you can have programs. Whether it happens or not, that's above my pay grade, but I think that's like highlighting that.

C

And actually, you talked about reusing things. There, I think one of the spokes that we just funded, the Big Data Spoke, is actually using Watson with the Encyclopedia of Life and actually doing exactly the kind of Q&A things I mentioned. I think it's.

C

In that area. And the first time I actually mentioned this concept of translational data science was at the Big Data PI meeting last year, and in that slide I had a thing that said: CI reuse. I don't know how many of you got funding under that, but there used to be a program under OCI for CI reuse. Actually, I got some money out of it. It was the concept that if you build some cyberinfrastructure in.

B

Then you are saying.

C

That I want to take that and use it in another project, there was actually a pool of money. So these kinds of ideas.

C

We know that, I mean, one shot is not everything; you need to iterate on it, so you have, like, examples.

B

Of a certainly an area.

B

Is there a change in thinking on software maintenance and upkeep? You're talking about a large stack of cyberinfrastructure, which is.

B

You see all these things because.

B

The question is about software.

C

Maintenance so I.

B

Think the kind of big ideas we.

C

Are talking about here are research. They fall under the category of typical NSF research programs, so.

B

We may be putting.

C

Out hardware and we may be putting out software, but it will all be in the context of research. Software sustenance right now falls under... there is a program called SI2.

C

That's where you would go if we had software that needs that has been community adopted. This.

B

So if you have, if.

C

You have software that you've demonstrated a community is using and has adopted, and that you need to sustain, there is a separate program at NSF.

B

I see a lot of amplification sites where you're doing, you know, whole new research here, whole new research there, and none of these parts are designed in a way that they fit together in that stack.

C

So that is a hope. That's certainly this kind of Harnessing the Data program: the hope is that, by creating this whole stack from hardware through the applications, we will create some initial set of capabilities and services.

C

Hardware may come and go. That then translates either throughout... and what's the vehicle for that? It will have to be other programs. So suppose you created a set.

C

That were really cool, like, say, Watson talking to the Encyclopedia of Life, and the biology directorate might say: wow, that's something we want to... At the end of the day, this will still be fundamentally about research, not really there.

C

Longer term life, something that you create under a program like this. If, if a particular science area picks it up and.

B

Again, like I, said one way to sustain it.

C

Is through the software program that is already there, or it could be through a directorate that says, I mean, a science directorate, for example,

C

Because they find it in their own best.

B

There's another spin on Jason's question, which related to software, but it's the reproducibility side: making software a first-class object.

C

The question is about software being a first-class object. I would have to say that's already there. I mean, actually, if you look at the data management plan, it actually has wording about what you need, which was taken from the software program. I think that Dan Katz, or somebody, actually created it originally, and it's been used by multiple programs. So.

B

That least, is.

C

A recognition that software also needs to be treated like data and the.

B

Reproducibility.

C

Is a different question. In fact, let me mention it: this is something I'm very interested in. Everybody talks about reproducibility; when do we actually get to do it? So I've been thinking quite a bit about what would be the role of, say, NSF as a funding agency, and asked some folks. These are still very early ideas, but, you know, it would be interesting if we could provide some kinds of incentives.

C

And actually it's somebody from the community told me is rather.

B

Than incentives.

C

You make it, in some sense... that is, rather than saying, if you produce a reproducible result you will get an award, you say, if you don't produce something reproducible you won't get all your money. Whichever.

C

It I have a feeling unless either journals or something like that, miss princess or maybe.

B

Somebody in your department.

B

How they would do it.

C

There, but exactly.

B

How do you, how do we know that.

C

So this is a question about how do we make sure that we.

C

That's a different so.

B

The other question.

C

Is how do we foster collaboration to the right and.

B

Actually, that came up in our open.

C

Meeting as well, actually the Microsoft person who.

A

Actually raised this issue saying: look, there are proprietary.

C

Knowledge and we should think about how we might interact the answer. Is they.

B

Are open thinking.

C

About this things, I don't.

B

Know they are the question.

C

The answer to the other one is well: if it's proprietary and nobody knows about it- free fell in the world, stand so.

C

However, I think it's not just about general machine learning per se, I think we are talking about building systems, so we.

B

Have a data system.

C

That is doing the X problem. A system may be doing a bunch of things. Maybe it's cleaning.

C

A bunch of things so you'd like to look at all the stuff that it's doing and see how could you generalize that and take it to the next one? My understanding is that's at.

B

Least, that's one level at which.

C

You are people building machine learning, not just at the.

A

I'd like to thank Dr. Chaitan Baru. I'd also like to thank Carl Gustav and Stephanie Suber, who have so wonderfully handled the logistics for this meeting and the WebEx. Also.

B

Thank those of you who are on the line and through.

A

This remotely. We will be hosting these monthly. If you would like to serve as a presenter or panelist, or would like to host one of these data science roundtables for the South Hub at your institution, please let me know; for those on the WebEx, I did type in my email. Also, the South Big Data Hub infrastructure working.

B

Group will be.

A

Friday at 3 o'clock. Carl has so kindly, in coordination with Reagan Moore, organized a series of demos, so we'll be doing one to two demos each week, or every other Friday, from now through the next couple of months. So with that, thank you all. Dr. Chaitan Baru will be staying on for the next hour.