From YouTube: NUG Meeting 2014: Skinner
I think we all agree that there's no doubt that the acquisition and analysis of huge data sets is really transforming the way we do science, and the way that centers like NERSC are operating. Our first speaker, David Skinner, who is getting wired up here, really has been a pioneer in integrating data-producing facilities with NERSC.
Thank you, Richard. Welcome, NUG members. For any of you who were at Supercomputing this last year, you more than likely heard a talk by Kathy Yelick called "More Data, More Science and Moore's Law." This was a very interesting, thought-provoking talk that Kathy gave at Supercomputing, one that unveiled some of the first real, what I would call substantive, directions about what we are going to do in data. And let me apologize first by saying that "data" is not a very descriptive term. I understand it means numbers, and for different people here in the audience
that probably means different things. So we concatenate a lot of meaning into this one word, "data," but it can be how you are gathering the data, how you are logistically managing the data, how you're analyzing the data, how you're curating the data, how you're making the data do great science after you're done with it. Those are all things to keep in mind as you look at these slides, to try to take a broad view of data.
So NERSC's strategy is a real simple one, and sometimes, you know, simple guiding principles are what really make an organization excel. Internally, staffing and other decisions that are made within NERSC really come back to this real simple sort of criterion for where we should go and what we should do. And we're at a time now, in people's generation of data, storage of data, and data policies coming from different areas, that we really are asking those sorts of questions again.
The second area is simulation and data analysis at extreme scale. Data analysis is the second of two major thrusts in the current NERSC strategic plan, and I'll be talking about those right now. I wanted to give kind of a macro view first. This isn't a NERSC slide, so don't attribute this to NERSC design; this is from a community report called Scientific Collaboration for Extreme-Scale Science. And you see in here, let's see where the laser is, some familiar things: computing facilities.
Down here, storage. These are sort of bread-and-butter topics for NERSC overall, but in the overall ecosystem of collaborative science and research, these are parts of a lot of different resources that coexist at different strata: the physical layer of facilities, the kind of middleware layer of things that join different science projects together, and then these higher-level knowledge-seeking and collaborative environments that are really tremendously important to scientists.
So in taking in this whole ecosystem, I think it's really important for resource providers, for centers, for all of you, to take a broad view of this and figure out: how can we make these components work together really well? The idea of a computing facility as a standalone entity that you would, you know, take your punch cards to, show up at the computing facility and run your computation, that activity is less and less common.
So the last slide was sort of a community view on this; this one is more from the NERSC view. These are some speeds and feeds from science topics and projects that are currently active within NERSC, broken out. You know, the only term worse than "data" for describing things is probably "big data," but we'll use it, and this slide is meant to sketch out kind of what we mean.
By that we mean data that comes in volume, data that moves very fast, data that comes in a lot of different types, and data that needs to be checked, or that has holes or gaps or errors in it. These permeate the scientific and programmatic concerns that NERSC has, and there are some drivers here pushing this in directions that make it a concern that all of us need to really begin to ponder in terms of strategy.
Okay, so in some sense people at NERSC are able to think of all of these things as being in a project directory someplace, and that's what I mean: all of these have project directories, but they come from very, very different places. That diversity has been a real strength of NERSC going back a long time, in that we're kind of the place where a lot of the nation's science data, across widely different science disciplines, meets, and best practices can be shared, in the most optimistic case.
We can actually do some really cool data fusion, bringing data together from different areas and coming up with new discoveries. So these are some of the current big drivers, and they come from all over the place, and if you work at NERSC nowadays, one of the things you spend some time thinking about is how this all fits together.
The other motivating slide, and I have a couple more in terms of where we are now, has to do with reckoning the growth rates that we're seeing with detectors and sequencers; this slide compares those two to where CPUs and memory are going. So if we want to keep the end-to-end process of big team science moving, we need to avoid pileups. We need to avoid impedance mismatches where the detector can't talk to the processor, and so on.
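A quick way to see why the slide is worrying is to compare doubling times directly. The doubling times below are illustrative assumptions, not figures from the slide; this is just a minimal sketch of how an exponential gap opens up between detectors and processors:

```python
# Illustrative comparison of detector data growth vs. processor growth.
# Doubling times are assumed placeholder values, not figures from the slide.

def growth_factor(years, doubling_time_years):
    """Multiplicative growth after `years` for a fixed doubling time."""
    return 2.0 ** (years / doubling_time_years)

cpu_doubling = 2.0       # assumed Moore's-law-like doubling time (years)
detector_doubling = 1.0  # assumed faster, "super Moore's Law" doubling

cpu_decade = growth_factor(10, cpu_doubling)            # 32x in a decade
detector_decade = growth_factor(10, detector_doubling)  # 1024x in a decade

# The mismatch itself grows exponentially: after a decade the detectors
# have pulled ahead of the processors by this factor.
gap = detector_decade / cpu_decade
```

Under these assumed rates, a pipeline that was balanced ten years ago is off by a factor of 32 today, which is the "impedance mismatch" in concrete terms.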
So very specifically, there are some technology trends we can get into that are sort of behind that last slide overall. But technology trends don't mean a whole lot if they're something that came and went and, I'll tell you, I never dealt with it. I myself barely ever dealt with CUDA; I've sort of dodged the CUDA bullet in some ways. But there are some bullets
you can't dodge, and this is one, which is the amount of data that comes across our border every day. Every blue dot is a day; there are high-bandwidth days and low-bandwidth days, but over the course of decades the trend of exponential increase in data movement is quite clear. And I like that, because it tells us, you know, maybe where we'll be in 2016, if things continue the way they are right now. The traffic is driven really by automated data pipelines and large-scale processing: genomics, the Large Hadron Collider, and increasingly image processing.
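That kind of extrapolation can be sketched in a few lines: fit the daily totals in log space, where exponential growth is a straight line, and read the fit off at a future date. The data below is synthetic (an assumed 10x-every-two-years rate plus noise), standing in for the real border-traffic record:

```python
import numpy as np

rng = np.random.default_rng(0)
days = np.arange(3650)                  # ten years of daily transfer totals
slope = 0.5 / 365.0                     # assumed 10x growth every two years
log_tb = 1.0 + slope * days + rng.normal(0.0, 0.2, days.size)  # log10(TB/day)

# Exponential growth is linear in log space, so an ordinary least-squares
# line fit recovers the growth rate despite the day-to-day scatter.
fit = np.polyfit(days, log_tb, 1)

# Extrapolate two years past the end of the record ("maybe where we'll
# be in 2016, if things continue the way they are right now").
future = np.polyval(fit, days[-1] + 2 * 365)
projected_tb_per_day = 10.0 ** future
```

The high- and low-bandwidth days wash out in the fit; it's the slope over decades that carries the planning signal.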
So why image processing? Well, telescopes, microscopes, light sources: all these things in essence are cameras of one sort or another, and they generate large amounts of image data. Data comes here, I like to think (at least, I'm open to feedback from other people), because NERSC gives a secure, reliable, fast, open and flexible place for scientific data. We've done a really good job keeping out of the crypto-card business thus far, and a whole bunch of other little things like that.
If you go over to the ALS, there are some people with new cameras and very ambitious data strategies. Overall, there are some people who will abide perfectly well for a long time without having to reboot their thinking around data, and that's fine, and that's also something that NERSC is really accustomed to. We have such a broad user base overall that we're not trying to corral everybody in the same direction, to get on the bleeding edge or things like that.
We, as you know, support Fortran and lots of things that have been around for a long time, so a portfolio approach is good. For some people this is tomorrow; for some people this is "why didn't we start planning this five years ago?" And so this cycle really starts with these detectors, and detectors, that is, CCDs, are on a super-Moore's-Law data ascent.
There are stages here of managing and sharing data, which could be within a team, then to larger teams, and ultimately making these facilities a secure, integrated, real-time and sort of programmable resource. It used to be sort of far-flung to manage somebody using a computer and a light source and ESnet all together at the same time; that was a sort of heroic act of scheduling and getting your right allocation at the right time.
So in some sense this is real simple: if you want to complete that whole loop there, you just put data on the web, right? And this is an example from Kathy of how my kids actually do research for their homework and things like that: if you want to learn about something, you just go to Google and you put in the name, and sure enough, knowledge.
So let's look at some ways that we can do a little bit more than just Google for things. Some of these new data methods are certainly in discussion, coming out of requirements reviews and a lot of other discussions; in fact, I mentioned that part of the reason not everybody's here is because they're off talking about these things in DC and other places. So you can read as much or as little of this as you want.
I hope you find a few things in there that are interesting or relevant to you. It's not exhaustive, but my main intent in having such a verbose set of words there is just to get across that this is not a question of how many disks to buy, right? This is not a storage capacity issue that says we just need a certain amount of storage in the mass store.
All the data, people want to do really interesting things with. They want to do things like deep search, where you could ask genuinely interesting scientific questions about data, sort of in their native format, and get answers from them. There's a whole bunch more here as well that we'll sort of get into. But this massive pile of requests will surely get kind of winnowed down over time, as we recognize which technologies really work well and which ones
maybe we can forgo. And it sort of seems to me the right time for the community to be asking how they see these new data methods being delivered. Is this a collection of tools, or is this a collection of APIs, or is it both? Certainly some of the R&D efforts that I've seen before in high performance computing have had kind of a tools, middleware type of approach.
And you know, it's not clear to me that a single big monolithic tool, or a collection of tools, is going to be able to address this wide collection of things. One of the reasons is that a lot of the computing that people want to get done, and the data analysis that people want to get done, they don't necessarily even want to see; they want that to happen automatically, in the background, as part of a workflow. They want the answer.
They want to be able to make good scientific decisions, or achieve scientific insight, through knowledge that is delivered to them from data. So where we are right now, at least, is in a space where there are, I think, over 220 scientific data and computing application programming interfaces that are written in a sort of modern RESTful format.
You can go to some of these URLs to find them, and so whether it's access to data or access to services, a lot of these sorts of things are becoming available through programmatic means, which feeds exactly into what I was talking about earlier: build your own science capability using the resources that you want from different facilities. APIs are a great way to do that.
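The programmatic-access pattern being described can be sketched with nothing but the standard library. The endpoint, path, and field names below are hypothetical placeholders, not a real entry from those API catalogs:

```python
# Minimal sketch of the RESTful-API access pattern: encode a query as a
# GET request and expect JSON back. The service URL and field names are
# hypothetical, invented for illustration only.
from urllib.parse import urlencode
from urllib.request import Request

BASE = "https://api.example.org/v1/materials"  # hypothetical service

def build_query(formula, fields):
    """Build (but do not send) a GET request for one material's properties."""
    params = urlencode({"formula": formula, "fields": ",".join(fields)})
    return Request(f"{BASE}?{params}", headers={"Accept": "application/json"})

req = build_query("Fe2O3", ["band_gap", "density"])
# A real client would then decode JSON from urlopen(req) and feed the
# result straight into the next step of a workflow: no shell, no files.
```

The point of the RESTful shape is exactly that last comment: the response plugs directly into the next stage of an automated workflow.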
So I want to tie this into simulation; I'll kind of bounce between simulation and data, as if they were separate topics, a little bit. This is an imagined, modern scientific workflow from the Scientific Collaboration for Extreme-Scale Science report that came out, and this is the S3D one.
It's written a little small up there, but if you know S3D and GTC, this is sort of their workflow, and there are a lot of technology components called out here as to where storage and distributed computing sit, where data analytics happens, and things like that. But it's broken into these sort of two streams of post-processing and in situ processing.
The most important thing, I think, about in situ processing is that it's no different than post-processing; it's just earlier, right? And so part of what we're working out here is to move these tools and processes that used to happen after the fact, and build them in. There are a variety of reasons you might want to do that; probably the most compelling one is just time to solution.
So bringing in situ processing and post-processing together requires a lot of the tools and data methods that I described on the last slide, and if you're a software design thinker, you might be motivated to kind of look at this overall workflow and figure out: how do these parts fit in, who are the stakeholders in the different components, and where do they come together?
So this is an animated rendition of a similar workflow, but here motivated by beamline science, where there's a data pipeline that moves data to storage and computing, with a prompt-analysis component that allows people very quickly to run simulations, and to bring simulations and measurement together to compare them.
So this is, you know, not far in the future. If you're doing small-angle x-ray scattering at the ALS, you know that you need simulations to overlay on your curve, sort of before you're done doing the experiment, and so the speed with which we can drive this cycle is crucially important to that sort of beamline science.
Being able to reuse and analyze previously collected data, to simulate with new models, to discover relationships across data sets: this has been going on for a long time, but I think the interesting example here is quasicrystals, for which a Nobel Prize was given some time ago. The person who discovered aperiodic tilings in quasicrystals was staring at an electron micrograph and saw something that they weren't able to fit together into their point-group
symmetry knowledge. That's something that you could detect across large amounts of data, and so being able to discover relationships across data sets with mathematical analyses has tremendous upside potential overall, given the amount of things that people have discovered simply by happening to stare at the right micrograph at the right time. Then there's being able to fuse data together from other disciplines. This is not a new phenomenon, certainly, but big data in the commercial sense has really brought
statistics and computing together to be able to do this in a way which is much, much faster than had happened a lot of times before. Let's mention here, with that, machine learning. So one of the goals of this model of discovery with machine learning is to take what would be a postdoc's, or a substantial
human-invested, research activity, that is, "go do principal component analysis, or support vector machines, or random forests on this data; come back and report what you found." Both in the private sector and in scientific research, the idea of building such models in a scalable, automated way through machine learning is not going to replace the postdoc or the graduate student or anything like that, but it may allow everybody to move further, faster, by getting those models generated without a manual process.
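One of the analyses named above, principal component analysis, is easy to script so it runs unattended inside a pipeline rather than as a hand-driven task. A minimal NumPy sketch on synthetic data (the random samples stand in for a real science data set):

```python
# Scripted PCA: the kind of analysis that can run automatically in the
# background of a workflow. Pure NumPy; data is synthetic for illustration.
import numpy as np

def pca(data, n_components):
    """Return the top principal directions and per-component variance."""
    centered = data - data.mean(axis=0)
    # SVD of the centered data gives principal directions in vt.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variance = (s ** 2) / (data.shape[0] - 1)
    return vt[:n_components], variance[:n_components]

rng = np.random.default_rng(42)
# One strong direction of variation plus small isotropic noise.
samples = rng.normal(size=(500, 1)) @ np.array([[3.0, 1.0, 0.0]])
samples += rng.normal(scale=0.1, size=(500, 3))

components, variance = pca(samples, n_components=2)
# The dominant direction is recovered automatically, with no human
# staring at scatter plots to find it.
```

The "come back and report what you found" step then reduces to inspecting `variance` and `components`, which a workflow can do on every new data set as it arrives.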
So this is a good example of that. If your job is to count cyclones or storms in a large climate data set, you might use visual inspection: go through and say, hey, I know what a storm looks like, I'll go through and find how many storms there were, and we can count those up and graph them over time. Well, it turns out there are pretty good mathematical descriptions of what
the vorticity in a storm looks like, and things like that. So instead of retrieving this data and having some sort of manual or partially manual process to count those things up, that can be replaced by analysis that's moved to the data, so that the counting of storms or other features in the data is automated.
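The automated version of "I know what a storm looks like" can be sketched as a threshold plus a connected-component count. The field and threshold below are synthetic stand-ins; a real detector would apply physically motivated criteria (vorticity, pressure minima, and so on) to climate model output:

```python
# Sketch of moving the analysis to the data: threshold a field and count
# connected blobs as storm candidates, instead of eyeballing plots.
import numpy as np

def count_features(field, threshold):
    """Count 4-connected regions where `field` exceeds `threshold`."""
    mask = field > threshold
    seen = np.zeros_like(mask, dtype=bool)
    count = 0
    for i in range(mask.shape[0]):
        for j in range(mask.shape[1]):
            if mask[i, j] and not seen[i, j]:
                count += 1                 # found a new feature
                stack = [(i, j)]           # flood-fill its full extent
                while stack:
                    y, x = stack.pop()
                    if (0 <= y < mask.shape[0] and 0 <= x < mask.shape[1]
                            and mask[y, x] and not seen[y, x]):
                        seen[y, x] = True
                        stack += [(y+1, x), (y-1, x), (y, x+1), (y, x-1)]
    return count

field = np.zeros((50, 50))
field[5:10, 5:10] = 1.0      # two synthetic "storms"
field[30:40, 20:25] = 2.0
```

Run over every timestep of a simulation, `count_features` produces exactly the count-over-time graph the manual process was after, with no human in the loop.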
Beyond the storms, there's also finding kind of the inherent structure in data. That is, if energy, or in this case mass, is being convected along particular routes within a system, sometimes it's easier to think about those routes than it is about every single mesh point in the entire sample.
So this is really about machine learning here. Another unifying abstraction that comes across a lot of these projects at NERSC that are working with data in new ways is what I might call radical scaling. By that I mean doing data fusion and reconciling data that comes from very, very different spatial, temporal or domain areas. These aren't ordered chronologically, but Genomes to Life certainly had this viewpoint a long time ago: let's take sequences and make them useful to biology.
KBase is very much involved in that sort of activity right now, building models on genomics data. Microbes to Biomes has a few different goals, but I think the most concretely stated one that I've heard is to study the microbial biome around plants in the same way that the human microbiome has recently been mapped, looking at all the different critters that live on and in people: to be able to do that for plants.
The frontiers approach, the energy, intensity and cosmic frontiers in HEP, is a bridging-across-very-disparate-scales problem: being able to automatically analyze the image data, to move from pixels to models, which then feed back into these sorts of activities. And this is a very generalized description of what's happening at a lot of light sources and other places, which is: I have a machine that can take a lot of high-resolution images of a system that's difficult to image.
Nowadays, you know, they drop macromolecular machines through a very powerful laser that lights them up and destroys them at the same time, and all of those samples are at some arbitrary rotation, so their rotation in space is not known, and you get a little bit of diffraction out of each one. But this sort of process here is taking lots and lots of such images to build a single unified model. I'll talk a little bit about this beamline-to-browser scale, which is really looking at:
how do we connect very, very fast data instruments to commodity data instruments, laptops, things like that? How can I work with terabytes of data that come from a light source, effectively, in a remote way? Then there's being able to couple data from some of the world's best climate models down to solve regional problems, and lastly, the materials genome.
Both of these topics we'll hear about later today. The materials genome has this sort of initial goal of replacing materials design with search: being able to computationally survey vast expanses of possible materials and then answer questions that are design-focused by searching through them. All this is moving from the supercell level up to slightly larger levels, where things like batteries are being considered, and engineered materials for sunlight-to-fuels; you know, moving from materials to machines.
This means going up from defects, to functional electronic materials, to nano and mesoscale phenomena. So there's a lot of upward possibility in these things, and one of the unifying aspects of how we see scientists working with data is being able to accommodate really radical shifts in scaling. Here are some of the tools that we're talking about, and that we're hearing about from user requirements in the data space. Overall, this one's sort of a no-brainer: big, fast file systems.
And focus on end-to-end scientific workflows. So on the variety and veracity part of the data slide that I showed before, I thought this is an interesting example: one of the science gateways that runs at NERSC is AmeriFlux. They have these towers collecting a wide variety of environmental, ecological and climate data, towers that have a wide variety of sensors on them, and these sensors marshal their data to NERSC and other locations, where they're organized and analyzed. And, you know, I'm not a climate scientist.
There are very specific prefigured questions built into the AmeriFlux agenda, about things that they want to measure, carbon and things like that. But we have lots of examples of really amazing scientific discovery that comes from the organization and analysis of data where it's not necessarily obvious where it's going. So in this case, you know, people trying to look at the noise signal in one of their antennas
took an approach where they began a real systematic analysis of that noise, to figure out where it's coming from and what it's about, and they ended up telling us something about the overall structure of the universe. So I'm tremendously optimistic that sensor data, combined with simulation data, combined with the capability to organize and analyze that data, has big upsides. Being able to reuse and reanalyze previously collected data is another cross-cutting theme in these projects. You can take all the data you want and file it away on a big...
So the Materials Genome Initiative, and its kind of spear-point project, which is called the Materials Project, has been in the news quite a bit recently, and I've already described it a little bit. But I think some of the key things that are not necessarily obvious, unless you come at it from a computing center perspective, are about durability of data and curation of data; that is, taking an active view, within a project, about how this data is going to be used later.
How can we maximize its impact and people's ability to leverage it? So, who here has heard of a data management plan? A few people, yeah. So think about reuse and reanalysis and things like that when it comes to data management plans. There are new requirements on scientists about what they should plan to do with their data, and for a lot of you that means you'll be writing data management plans in your grant proposals and things like that. And I just want to point out that keeping the data in case
somebody wants it (and gosh, they're going to have to email me, and then we'll figure out what to do with it) is better than throwing the data away, and it's probably better for your funding agency too. But building in a plan to reuse and reanalyze the data is tremendously more forward-looking in terms of what can be done with all this data.
Multimodality is another area that merits concern, and NERSC is a great place to look at multimodality, because we have such a breadth of science users and science topics. This example that Kathy gave is from the brain: people looking at the structure of the brain, and there's quite a bit of activity in this through the BRAIN Initiative that President Obama announced. But the brain is a big, complicated thing, and it'll take people some time to figure it out,
I'm sure. One of the routes toward that is looking at data that comes from different modes of interrogation, and an example that I'm a little bit more familiar with, compared to work in brain and neurology, is in bioimaging. If you take one sample and you're able to assay it in two or three different ways, there's the process of taking those different assays, which may not even be registerable or directly comparable initially, and being able to overlay them, so you get multiple modes of data at every point or object.
So, to give you a kind of big-picture view of this, zooming out from NERSC's concerns overall: there are people within the Office of Science looking at advancing scientific knowledge discovery, and we hear about knowledge systems a little bit through KBase and things like that. But this really is a broad area.
We'll probably be talking about data-to-knowledge for quite a long time, but I want to connect this sort of upper level to what people are doing with that data. And so this is a collection of activities that I'm hoping the NERSC users in the audience can kind of take in and look at how they might leverage them. If you see areas that you think are sorely lacking, there's still time to influence some of these discussions and agendas, too.
So if you're familiar with the Materials Project and KBase and some of these other things, you'll see real attention to end-to-end scientific processes, doing what we can with human-computer interaction. That's a tough bullet point, in particular for scientists, who are not all that well known for making sleek, intuitive interfaces overall, but there's a lot of work to be done there
if we want to move the data that we're collecting into this sort of picture of being able to inform decisions. And so, from a resourcing perspective, we have computing resources, simulation data instruments, and data analytic appliances (maybe these two are the same thing, actually), but in this whole process, being able to discover, analyze, present and then deliver knowledge back to scientists is sort of the big-picture goal here. So how do we get there?
I'll tell you first: I don't know. But it's really fun being in discussions about this topic with people who are trying to figure it out, and I think just in the last year in particular we've made a tremendous amount of progress. So these are some of the puzzle pieces as they come together here for extreme data science.
So this is, I should have underlined "concept," but this is a concept that Kathy presented at Supercomputing last year. There are other concepts out there too that don't have this XDSF name, but we're all sort of imagining what sort of resources and capabilities would drive this. There was an interesting set of slides that came back from DC yesterday, where there are some discussions going on; the technical term that was used in some of the slides
there was "the data thingy." So put on your algebra hat here and just consider this to be a variable that stands for something. But it's something that we already know a little bit about: a facility that can handle the data that comes from the Large Hadron Collider, that can handle the data that comes from the Joint Genome Institute, that can handle the data from the ALS, from these different sources, is something that there's quite a track record on already. And so the question is: if we look at where detectors and data trajectories are going, is this sort of the right model for the future? What can we learn from the models that exist right now, in order to inform where we go from there?
If you apply for time at NERSC, you can get computing time, you can get an allocation of disk, you can sign up for various other smaller things, but it's really about computing time, mostly, overall. Part of XDSF is broadening that to include data services, storage and analytics, and network services, which I'll describe a little bit now. The outer context for this is really about data science, computer science, mathematics and machine learning, and there's lots of software engineering that's part of this as well.
Right now, other than "do you need to do data transfer over the WAN," there isn't much that we really ask about networking. But in this sort of future space that we're imagining, asking the network to do a lot more than just be there is kind of what this comes down to. And so adaptation, increasing flexibility, and really pushing the limits of what networking technology can do is imagined to happen in this case by making some of those choices available in software. Right now, you have very little choice at all about how the network works.
It's either there or it's not. What the changing space of applications really brings out here is that, in order to get there, we need to expand our thinking to include both these seven giants of data and the seven dwarfs of simulation. The giants here are a reference to the National Academies report on frontiers in massive data analysis, and these are not fully unknown things to us.
The data structures that we use, I mean, the kind of bread and butter of NERSC in some ways, are, you know, MPI, OpenMP, Fortran, C, Python, those sorts of things. But the level at which we interact with the data, in many cases, doesn't need to be at that kind of bare-metal sort of level, and so software and workflows that allow us to kind of move up a level can be really useful. And so FastBit and FastQuery, from John Wu, Arie Shoshani and other folks in the Scientific Data Management group, have allowed scientists to kind of stop messing with their data as much; that may be one way to look at it. They can ask queries against the data without having to get down into the data at a shell level, file level, that sort of thing.
So this is a tremendously powerful tool, to be able to interact with data using data structures that are not the flat files written out of a simulation or that come from an experiment. Tigres, from Deb's group, is moving in the direction of scientific workflows that abstract concurrency and parallelism and other things, sort of behind the scenes.
These allow people to work with larger and larger data sets. There are going to be a lot of technology choices ahead as well, and this is a slide from Kathy that compares a compute-intensive architecture and a data-intensive architecture, which I will not go through fully. But look at these kind of core concepts here: maximizing bandwidth density near the computing versus, you know, bringing more storage capacity near the compute, or embedding the compute into the storage; these are different tasks that call for different technology.
First of all, you know, those login nodes: having more conduits and more pipes to a machine like Hopper is something that we can treat through, for instance, the network connectivity of the batch nodes, and other ideas like that. Designing our own CPUs is considerably harder, but there's tremendous flexibility in a lot of other areas about how we adjust NERSC resources to really drive data-intensive architectures.
These machines are, you know, fully utilized all the time, so we're exploring and examining these trade-offs about how interactivity can work. One of the secrets to NERSC's success, I think, overall, has been thus far pretty good isolation between all these different users: when you get a node for your batch job, it is, as much as we can make it, isolated from others, and so your performance expectations, the kind of error situations that you can reach, all of those things are less wild and wooly.
So the last thing I want to touch on is how this is all programmed, and this is really Kathy's passion overall: which programming models are going to excel at delivering on these sorts of challenges? And again, rather than declare a winner or give you my own advice about this, I'd say that the answer is obvious.
It's probably both, and NERSC is heavily involved with both at this point. If you are somebody who wants to make sure that these technologies are persistently and indefinitely available as they are, come to NUG and tell us about it. If you want to help NERSC position itself for more advanced computing technologies, come talk to us about that too.
And these are really important if the mission goals of some of the detectors being built out there are going to be satisfied: to be able to have that kind of end-to-end science happen very quickly, and to be very, very user-focused, and, in addition to that, innovating on the simulation side. We'll be hearing later today about the Materials Project in detail; you know, this example is really all about data flow.
These ideas aren't that complicated: look at collaborative HPC workflows that produce web-based, durable data assets. Whether you have an algorithm that can produce those, or you need a team of scientists collaborating through some social mechanism to do that, there are a lot of ways to get there. But we have great collaborative potential in data by bringing those resources together in a way that they're not just living in your home directory or your project directory; they become resources for the larger community.
So, if you throw away all the data points that don't match, you might be interested in the surface area of that kind of sub-selection overall, or its volume, or other things. And so those indices, the things that you might be interested in, like pressure, temperature, carbon monoxide concentration, those can be given to FastBit, and indices can be precomputed, so that the queries that are done after the fact become very, very fast.
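The core idea can be sketched in a few lines of NumPy: precompute one boolean bitmap per predicate of interest, so a later query is just cheap bitwise ANDs over the bitmaps rather than a scan of the raw files. This is only an illustration of the precompute-then-query pattern, not the real FastBit API or its binned bitmap encoding:

```python
# Sketch of the FastBit idea: index-build phase runs once, ahead of any
# query; the query phase is then a bitwise combination of the bitmaps.
# Synthetic data; variable names echo the examples in the talk.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
pressure = rng.uniform(900.0, 1100.0, n)      # hPa
temperature = rng.uniform(250.0, 320.0, n)    # K

# "Index build" phase: one boolean bitmap per predicate of interest.
bitmaps = {
    "pressure<1000": pressure < 1000.0,
    "temperature>300": temperature > 300.0,
}

# "Query" phase: combining precomputed bitmaps is a bitwise AND, and the
# matching record ids fall out of np.flatnonzero without touching the
# raw columns again.
hits = np.flatnonzero(bitmaps["pressure<1000"] & bitmaps["temperature>300"])
selectivity = hits.size / n
```

The payoff is that the expensive pass over the data happens once at index time; every subsequent question against those variables costs only bitmap arithmetic.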
Really, I'll be the first to admit that I was slow to pick up on it. I got into computing because I was interested in fast simulation, and for a long time I was glad that the backups were there and the file systems worked and stuff like that. But it is really a paradigmatic shift.