From YouTube: CI WG demo: Google Cloud Public Datasets Program
Date: 2/1/2019
Presenter: Shane Glass
Institution: Google
West Big Data Hub
A: Shane is a program manager at Google in the Cloud Developer Relations group, where he leads the Public Datasets Program, and that program, as I said, is about facilitating high-demand public datasets in order to make it easier for researchers to access them, uncover new insights, and do things that they can't do otherwise. That's really the point, I think. I've seen several examples of BigQuery in action, and it has a lot of very nice capabilities. Before joining Google, Shane was a project manager on NOAA's Big Data Project, and he currently serves as a public affairs officer in the US Army Reserve. He received his bachelor's degree from the University of South Carolina and a master's from the University of Maryland University College. I'm looking forward to hearing all the details here, so without further ado, please take it away, Shane.
B: Thanks, Nile. Yeah, I think you really summarized the program nicely: enabling people to do things that they otherwise wouldn't be able to do — or maybe would be able to do, but only in a way that's really compute- and labor-intensive and probably not worth doing otherwise, just because of the amount of work that goes into it.
B: You know, BigQuery has some really great capabilities and some really great potential in serving some of these workloads, especially for structured data, but it can also do some cool things with unstructured data as well. I think we'll cover that a little in the demonstrations at the end — that'll be a nice teaser to hopefully keep people sticking around until the end. We'll also do an overview of some of the public datasets we have in the program and, of course, the fun part.
B: At the end, the demonstrations. I was telling Carl before everyone joined that I have tried to keep Murphy from attending this presentation, as Murphy's Law of tech demos is a well-known phenomenon to me. So most of those are pre-recorded, and hopefully we will get through this as smoothly as possible — but I'm sure I've managed to break my last demo, which is supposed to be live.
B: So, let's start by just going over the program itself — but I want to start with the scientific process. Not the eight-step scientific process that we all learned in middle school or elementary school, but the scientific process as we know it today. This is roughly what it looks like — I don't think this is a crazy, off-base way of describing it: you have to discover the dataset and where to access it.
B: In my time at NOAA, one of the things we continually heard from users was that in order to use NOAA data, you have to know what data you're looking for and where to look for it. NOAA does a fantastic job of sharing the data they produce for their mission, but it can be difficult at times to discover it, because there's no central catalog for all NOAA data, and even most of the individual line offices don't have a centralized catalog.
B: But let's say you get past that point: you've discovered your dataset and where to access it. Now you need to be able to access it in a bulk, machine-readable format. Hopefully the OPEN Government Data Act, which was recently signed into law, will help with that — it does have a requirement for machine-readable data publication — but in the meantime, data can be difficult to find and to access, and you have to write a program to parse all of your data into whatever analysis tool you use.
B
That
is,
if
it's
bigquery,
if
it's,
if
it's
Excel,
which
is
a
perfectly
legitimate
analysis
tool
with
its
if
it's
tableau
whatever
it
might
be,
you
don't
have
to
load
the
data
into
some
database.
Using
this
program,
you
just
wrote
you
need
to
manage.
You
need
to
update.
You
need
to
maintain
you
need
to
secure
this
data
and
this
database.
B: You need to update your data regularly. Then you probably want to link your data or your research with a private dataset — something you've produced in your lab and are looking to do some kind of unique analysis on — and now you have to go back through this whole process to find these public datasets and these private datasets again and bring them into the same place, or go through the process twice. And then you need to analyze it, which to me is the fun part.
B
You
need
to
share
your
data
and
then
you
need
to
visualize
and
communicate
your
results.
That
I
like
needing
a
cup
of
coffee
after
describing
this
like
this
is
a
long
kind
of
complicated
process
and
I.
Think
that
part
of
our
goal
here
in
the
public
data
sets
program
is
to
alleviate
a
lot
of
this
burden
from
the
scientist
and
I
guess.
My
question
to
you
would
be
well
what
if
someone
else
did
these
steps?
B: What if someone else discovered the data and where to access it, kept it updated, kept it maintained, and kept it in an easy-to-link place for public data — and then, on the back end, helped you share your data? That would sound pretty nice, right? That would allow you to do the analysis and to visualize and communicate your results, which is, I think, the fun stuff — you get to the fun part.
B: That's really the crux of it: the purpose of the Public Datasets Program is to take the burden of these steps away from you. What we found is that we have hundreds of users who are all doing these same steps in parallel, all repeating the same process. So we in the Public Datasets Program take on this process of working directly with data providers, onboarding these datasets, bringing them in, and keeping them updated.
B: Keeping them maintained, making sure they're well described, that they link out to legitimate metadata, that they link out to docs and to source pages for further questions — and helping people share data, so that they can focus on their analysis and on visualizing and communicating their results. We think this lets you do a lot more science.
So
almost
a
year
ago,
I
previously
managed
the
NOAA
Big
Data
project
working
for
Ag
currents
over
at
NOAA
and
still
work
with
them
pretty
closely.
Today,
as
one
of
our
providers
and
my
background
is
in
in
data
and
analytics
and
and
the
reason
I
I
like
to
include
that
here,
is
that
you
know
data,
the
difficulties
of
finding
a
data
set
are
very
personal
to
me
and
net
when
I
was
working
on
my
Master's,
it
took
me
to
find
a
dataset
that
would
work
for
the
type
of
analysis.
B
I
was
doing
was
big
enough
to
help
me
learn
how
to
build
these
models,
and
that
would
give
me
kind
of
a
full-featured
model,
but
wasn't
so
deeply
complex.
That
I
had
to
be
a
PhD,
climatologist
or
I
had
to
be
an
expert
in
that
field
to
really
know
how
to
work
with
the
data.
So
you
know
this
kind
of
challenge
of
finding
and
helping
people
discover
public
data
and
and
helping
people
share
their
public.
It
is
is
something
that's
very
personal
to
me
in
my
background.
B: There's this concept that data is the new oil: oil in its raw form has value, but the real value comes after refinement — much like data. When you refine data into insight, that's really where the value comes from. So making it as easy as possible for all users to get right to the refinement step is, I think, the crux of the value proposition we're offering here.
B: We think this is done best by providing scalable, centralized processes, vetted best practices, cross-functional launch teams, and unified messaging to help bring everybody in — and by lowering these barriers to entry, whether that's not knowing how to work with NetCDF, GRIB, or HDF files, or a lack of subject matter expertise. We can't make someone a subject matter expert in a dataset, but we can explain to them, "hey, here's what the shorthand in that column name means," and that goes a long way.
Sort
our
current
catalog
has
about
115
public
datasets.
It's
a
pretty
ballpark
estimate,
we're
kind
of
onboarding,
more
all
the
time
and
so
I
think
last
I
looked
as
I
was
preparing
these
slides.
It's
115!
You
could
see
this
like
fun,
scatter,
shot
of
other
people's
logos
that
I
get
to
take
credit
for
on
these
slides.
B
B
B: So, more than a thousand tables are currently available in BigQuery for public use; I've got some high-level metrics here on the next slide. Of those 115 datasets, roughly 100 are in BigQuery and the other 15 are in Google Cloud Storage, with about 2,000 tables across them. We have 42 billion rows across those more than 2,000 tables, and that number grows every day as we continue to keep datasets updated.
B
Some
was
as
frequently
as
hourly
and
more
than
six
petabytes
of
data
in
Google
Cloud
storage.
So
this
is
primarily
things
like
satellite
imagery
of
looking
back
down
on
earth.
We're
we're
also
typing
some
really
exciting
discussions
with
astronomers,
with
with
with
kind
of
different
different
subsets
and
different
users
around
different
communities,
and
so
so
there's
a
great
variety
of
datasets
in
here
that
they
surprisingly
could
have
some
like
really
exciting
impacts
for
our
users.
B
It
probably
knows
a
good
bit,
if
not
Stephen
your
secret's
safe
with
me
and
and
feel
free
to
to
take
notes
here
and
no
one
will
ever
know
so
the
just
kind
of
a
brief
overview
you
know
so
bigquery
is
this:
serverless,
sequel,
implementation
and
I.
Think
server
list
is
kind
of
a
misuse
kind
of
misnomer
in
the
community
right
now.
It's
not
that
there
are
no
servers
behind
it.
It's
that
you
don't
have
to
manage
the
servers.
You
don't
have
to
add
servers.
B
If
you
need
to
get
more
capacity,
you
don't
have
to
turn
off
servers
if
you
have
excess
capacity.
It's
that
this
kind
of
scales
seamlessly
with
what
your
needs
are
and
that's
why
I
kind
of
say,
seamlessly
scale
without
me
without
the
manual
management.
You
know
if
you
need
to
do
a
multi,
petabyte
analysis.
You
can
do
that
big
query
and
if
you
need
to
do
a
five
kilobyte
analysis,
you
can
do
that
in
bigquery.
B
Both
will
be
fast
and
pretty
responsive
and
and
both
will
kind
of
scale
to
meet
your
demand,
while
only
charging
for
what
your
actual
usage
is.
So
it's
it's
a
really
nice
data
warehouse
and
access
layer
with
a
really
nice
clean
sequel
implementation
to
either
slice
datasets
like
really
huge
datasets
that
can
either
be
locally,
downloaded
or
fed
into
GCP.
We've
got
some
great
demos
on
that
later
of
kind
of
some
best
practices
of
doing
that.
That
I
think
will
be
really
helpful
for
the
community.
B
One
of
the
other
benefits
of
bigquery
is
that
it
allows
you
to
simply
share
the
data
you
want,
but
it's
secure
enough
to
protect
the
data
you
don't
so.
These
kind
of
user
defined
access
control
logs
and
the
high
availability
and
redundancy
of
bigquery,
combined
with
this,
like
very
cheap
storage
and
the
pay
for
what
you
use
model.
B
So
if
all
of
your
users
currently
have
a
copy
of
the
same
data
set
downloaded
and
on
their
personal
hard
drives
right
now,
and
you
don't
have
to
worry
about
potentially
a
hundreds
of
different
users
or
tens
of
different
users
or
keeping
the
same
data
set
updated
having
to
pay
for
storage
for
the
same
data
set
10
20
30
times
having
to
having
to
say
well.
Are
you
using
the
latest
version
of
this?
Well?
No
that
this
version,
you
never
download
it.
B
This
kind
of
single,
centralized
copy
updates
for
everybody
and
and
that
kind
of
allows
you
to
to
more
effectively
and
more
quickly
share.
The
latest
version
of
the
data
and
I've
touched
on
this
a
little
bit.
It's
a
pricing
model,
but
it
Abel's
you
to
focus
on
science,
so
really
afford
it's:
affordable,
storage,
pretty
pretty
intense,
free
tears
and
usage
face
pricing,
so
10
gigabytes
a
month,
free
storage
and
bigquery.
Your
first
terabyte
of
scanned
data
is
free
and
it's
$5
a
terabyte
beyond
that
so
I.
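(A quick worked example with the rates quoted here, which may have changed since: a project that scans 3 TB of public data in a month pays nothing for the first terabyte and $5 for each of the other two — $10 in query charges for that month.)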
B: My opinion is that you get this really great performance for really large datasets, and you get integration with a lot of other scientific tools — whether it's Kaggle, whether it's pandas in Python, R, or Tableau, whether it's joining with other datasets that are either public or private — while letting you focus on the science and not have to worry about the infrastructure on the back end. In fact, I think just yesterday we rolled out a really exciting update for BigQuery: the BigQuery sandbox.
B
That
kind
of
locks
you
into
this
free
tier
bigquery
and
helps
you
be
sure
that
you
know
it's.
It's
designed
to
help
users
get
started
and
become
familiar
with
it
in
a
really
fast
easy
to
sign
up
way.
It
doesn't
require
you
to
put
down
a
credit
card,
so
it
keeps
you
in
this
free
tier
of
storage
and
of
queries.
I
mean
it
does
have
some
some
other
limitations
behind
it.
B
I
think
tables
are
only
persistent
for
kind
of
30
or
45
to
60
days,
and
the
number
of
top
I
had,
but
a
really
really
nice
tool
that
you
know
my
my
kind
of
former
colleagues
in
the
federal
government
have
told
me
is,
if
something's
really
exciting
for
them
not
having
to
put
down
a
government
Purchase
card
for
something
that
you
know
is
is
not
necessarily
a
set
price
or
a
the
same
skin
system
price.
Every
month,
we've
got
some
really
great
case
studies
available
on
our
website
that
you
know
you're
all
very
intelligent
people.
B
You
certainly
don't
need
me
to
read
them
to
you
on
kind
of
the
usage
of
science
data
for
genomics
for
chemistry,
for
climatology,
meteorology
economics,
patents
kind
of
so
a
lot
of
great
and
stuff
content
in
there.
I
mean
encourage
you
to
check
some
of
that
out.
That's
of
interest
to
you,
okay,
so
that's
I,
think
a
pretty
good
overview
of
what
bigquery
is
now
how
bigquery,
how
can
bigquery
support
open
data
use
I've
kind
of
grayed
out
the
business
one
on
here?
B
We
use
this
slide
to
talk
some
pretty
generalized
audiences
right,
I,
don't
know
that
this
is
particularly
applicable
to
this
there's
audience
I'm
more
than
happy
to
ungrate
if
I'm
wrong,
but
I
think
that
you
know
we
can
we
work
really
well
with
the
public
data
side.
Specifically
works
really
well
with
with
researchers
in
joining
public
data
from
multiple
sources,
with
with
each
other
or
with
kind
of
private
internal
data,
to
conduct
your
analysis
and
with
data
providers
we
talked
earlier
about.
B
You
know
providing
that
one
simple
copy,
providing
read
access
out
to
users
and
allowing
them
to
kind
of
have
this
really
fast,
really
simple
access
without
having
to
scale
the
oncome
services.
What's
really
nice
about
this,
the
challenge
we
found
when
I
was
at
NOAA
was
that
the
more
popular
a
data
set
was
the
harder
it
was
for
NOAA
to
share
that
data
set
the
reason
for
that
being.
If
one
user
wanted
to
copy
that
data
set,
let's
say
they
wanted
a
copy
of
a
one.
B
Terabyte
data
set,
no
one
had
to
send
out
one
terabyte
worth
of
data
to
that
user,
but
if
a
thousand
users
wanted
to
all
use
that
same
one,
terabyte
data
set
NOAA
then
had
to
send
out
a
thousand
copies,
which
ends
up
being
a
thousand
terabytes
of
that
same
data
set.
What's
nice
about
bigquery.
Is
that
when
you
make
this
available
and
read-only,
you
don't
have
to
worry
about
scaling
the
bandwidth
to
meet
that
thousand
person
demand
the
user.
Pay
is
for
the
query
costs
that
they
incur
so
the
project.
B
That's
the
billing
account
behind
the
project
that
accesses
your
data
is
charged
for
that
query.
So,
if
you
make
your
data
public
and
someone
you've
never
heard
of
before,
comes
in
and
queries
it,
you
don't
have
to
pay
for
their
access.
You've
only
paid
for
the
storage
and
they're
paying
for
their
access
in
their
analysis,
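(To make that split concrete, a rough illustration using list prices from around the time of this talk as an assumption: the owner of a 1 TB public table pays only for storage — at roughly $0.02 per GB per month, about $20 a month — while a researcher who scans the full table once pays about $5 for that query, or nothing if it fits within their free first terabyte for the month.)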
B: So, I just want to give a high-level overview of the public datasets we have in the program. If anybody sees their favorite dataset on here,
let me know! That's a line that only works with audiences that are really passionate about seeing their favorite dataset — which I love, because that's this group. And if you see your favorite dataset missing, shoot me a note: I'd love to talk and learn more about what the use cases are and how we might be able to support them. This is a generic snapshot of some of the datasets we have, out of our catalog; you can see the link there in the bottom right.
B
We
call
this
GCP
marketplace.
We
host
whole
ton
of
solutions
in
there
from
pre-built
virtual
machines
that
have
now
all
these
different
tools
built
into
that,
but
this
is
also
a
restore
our
lessening
of
the
public
data
sets
you
can
see.
This
is
a
really
diverse,
offering
here
from
from
Nitsa
traffic
fatality
data
when
we
were
supporting
them
on
their
solving
for
safety
challenge,
you
can
see
NEXRAD
level,
two
kind
of
radar
imagery
from
NOAA.
You
can
see
blockchain
data
from
some
of
the
most
popular
crypto
currencies
in
the
world
sunroofs.
B
So
you
know
it's
kind
of
solar
data
or
for
how
much?
How
about
some?
We
expect
your
your
your
roof
to
see
at
a
given.
You
know
kind
of
date
and
time
in
the
year,
and
we
even
have
Major
League
Baseball
pitch
by
pitch
data,
and
so
a
you
know
a
offering
of
pretty
diverse
data
sets
so
we're
pretty
excited
about,
and
in
particular,
or
we
find
that
our
weather
and
climate
data
set
so
really
popular
and
the
reason
for
that
being
that
users
understand
what
temperature
means.
B
I
can
promise
you
that
everyone
in
Pittsburgh
knows
what
a
zero
degree
temperature
means.
After
this
week
and
I
can
really
promise
you
that
everyone
jicama
knows
what
that
means,
assuming
they
got
as
warm
as
zero
degrees.
So
you
know
what
we
find
is
that
users
from
a
really
broad
range
of
industries,
from
retail,
from
from
hospitality,
all
the
way
over
to
climatology
and
meteorology,
find
use
cases
for
these
data
sets.
B
In
fact,
weather
and
climate
data
sets
from
NOAA
were
three
of
the
ten
most
heavily
used
data
sets,
and
we
measure
that
by
the
terabytes
of
data
scan
I'm,
estimating
that
those
data
sets
those
three
data
sets
or
maybe
a
couple
gigabytes
in
size,
maybe
10.
If
I'm
really
rounding
up,
we
saw
more
than
a
petabyte
of
data
scan
out
of
those
data
sets.
You
know
from
from
my
experience
or
know
we
found
that
that's
anywhere
between
you
know,
kind
of
30
and
300
times
more
data
sort
of
out
of
here.
B
Then
the
NOAA
serves
out
of
there.
It's
it's
tough
to
do.
The
kind
of
one-to-one
comparison
is
they're,
just
very
different
systems,
but
you
know
it's
they're,
really
popular
more.
You
know
we're
finding
and
NOAA's
finding
that
you
know
we're
helping
amplify
kind
of
that
accessibility
for
users
without
the
taxpayer
or
the
researcher
having
to
pay
for
that.
Whether
in
climate
data
sets
were
we're
also
our
most
popular
data
set
in
terms
of
the
average
daily
users
and
we're
two
of
our
six
most
frequently
used
data
sets.
B
So
it's
daily
average
average
queries
per
day,
yeah,
there's
a
little
bit,
I'll
freely
admit,
there's
a
little
bit
of
cherry-picking
going
out
in
the
sample
size.
Right,
like
you
know,
it's
it's
two
of
six
of
our
most
frequently
as
datasets.
It's
also
two
of
seven
or
else
I
would
have
said
three
of
seven,
but
I
think
that
this
really
clearly
illustrates
that
you
know
we're
we're.
Helping
NOAA
meet
this
kind
of
level
of
demand
that
existed
long
before
we
kind
of
started
working
with
that.
C: Perfect, thank you. So, my name is Florence Hudson. I work with all the hubs, and I also work for the NSF Cybersecurity Center of Excellence at Indiana University, where I lead a program called TTP, which is cybersecurity research transition to practice. As I'm working with cybersecurity researchers, some of them are saying — one of them in particular, at RIT, the Rochester Institute of Technology, said — "I need intrusion-alert data to test my machine learning algorithms for cybersecurity." So I'm kind of on the hunt for datasets like that.
C: Or, when I work with the smart grid folks, everybody wants PMU data — synchrophasor data — and a lot of this stuff is very confidential: DoD this, DoD that. So it's not readily available. But do you have, or expect you might have, datasets like this that are non-confidential, that can be public datasets? I'm thinking I would send this link — I was just looking at it, and I saw there were some datasets there.
C: There was one — this probably isn't what I need, but there's a "VM-Series next-gen firewall bundle," as an example. It says the word "firewall," and I go: maybe there's some security data. So I'm thinking that maybe I would send this to the researchers who are asking me for datasets and say, "do you see anything that might be useful?" Or, if you don't have anything: what do you think — is that a good idea or a bad idea?
B: I think one of the things we're seeing is that a really popular use case for public datasets is training machine learning models in general, because you just need a lot of data to do that, and most people don't have a lot of disk space just lying around to test ideas with. I think that's a really awesome use case for public datasets. My email is on the first slide — I'll be happy to go back to it later — and please do write.
C: And I've got to share this with you, because the other thing I found for one of these researchers is that there's actually a conference called CAMLIS, the Conference on Applied Machine Learning for Information Security — and, as you know, if machines are learning, there's data involved. So it's on my list to reach out to more of the people who present there. We could create something rather interesting that could really support some of this AI and machine learning for cybersecurity research going on.
B: And so these are some of our very recently added unstructured datasets, and they're all datasets I'm really excited about, despite the fact that I clearly forgot to replace the grainy National Water Model picture over here on the right with something a little less grainy. What we're seeing over here on the far left is the HRRR.
B: That's the High-Resolution Rapid Refresh model. I'm going to butcher the details here, but I'll try my best: I believe it's a two-and-a-half-kilometer-resolution, high-frequency weather model over the United States. We just started bringing this on, and I think there are some really cool use cases for it in the private weather enterprise, but of course in the research space as well. And you see this really nice, beautiful picture of planet Earth in the middle.
B
This
is
from
NOAA's
goes.
Seventeen
satellite
in
cooperation
with
NASA
NOAA
launched,
goes
sixteen
and
seventeen
within
the
last
few
years
to
replace
kind
of
their
previous
generation
of
geo
orbiting
satellites.
So
it
goes.
Sixteen
now
sits
over
the
eastern
half
of
the
u.s.
kind
of
focused
on
you
know
the
eastern
coast
in
the
Atlantic
Basin.
Whereas
go
seventeenth
now
over
the
western
half
of
the
US,
we
have
both
go.
B
Sixteen
and
seventeen
data
on
our
platform,
including
the
geostationary
Lightning
mapper,
which
is
really
really
cool,
I,
gives
you
you
know
kind
of
detailed,
lightning
strike
information.
That
is
the
first
time
kind
of
this
kind
of
instruments
flown
on
a
geostationary
satellite
on
the
far
right,
as
I
alluded
to
earlier,
is
representation
in
the
national
water
model.
This
is
two
and
a
half
million
points
of
kind
of
continuous
stream
flow,
soil,
moisture,
snowpack
data
over
the
continental
United
States
as
well.
B
So
these
are
all
of
the
we've
talked
a
lot
about
bigquery
and
we'll
talk
a
lot
about
how
we
can
use
bigquery
to
work
with
some
of
these
datasets
in
a
minute.
But
these
are
some
of
the
datasets
that
that
we're
hosting
in
Google
Cloud
storage
that
have
some
really
really
great
scientific
applications
as
well.
B: Okay, so you've listened to me going on for quite a long time — I appreciate that. If you've taken your headset off and walked away to do something more productive, I can't tell the difference, and you still get credit for it. But I think this is the fun part: I've got a handful of demonstrations here, and we'll go from what I think is easiest to most challenging, or most advanced.
B: Don't worry — you're all going to leave as experts in this, and you'll be able to wow all your friends, colleagues, family, and maybe even people you don't like. So let's start with SQL. I'm assuming that most people here, if not everyone on the call, are familiar with SQL, but if not, I want to start from a baseline just to make sure we're all on the same page about what I mean. SQL has been around since 1976, I think.
B: One of my coworkers is going to give me a hard time if I got that wrong. The real basic premise of SQL is the SELECT ... FROM statement: SELECT names the columns — the variables — you want, and FROM names the table you're selecting those variables from. That's the basics of what you need, in SQL and in BigQuery, to access and start working with data.
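As a minimal sketch of that shape — the table is one of the real BigQuery public datasets used later in this talk, but treat the column names as assumptions and check the schema in the UI:

    -- Minimal SELECT ... FROM: pick two columns from a public table.
    -- (Column names are assumed from the table's published schema.)
    SELECT
      species,
      plant_date
    FROM
      `bigquery-public-data.san_francisco.street_trees`
    LIMIT 10;  -- keeps the result set small for a first look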
B: Okay, so you've mastered the SELECT ... FROM statement and you want to do something a little more advanced. You say: "this is really great, but what about data that I want to meet a certain condition?" That's where you would add a WHERE statement, which says: give me rows where these columns from your SELECT statement meet these given parameters. So you've stepped up to the WHERE statement and you're able to say, "I want these data, but only within a certain time period" — but boy, I'd love to see them in a certain order.
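Continuing the sketch — a WHERE filter plus the ORDER BY he's about to introduce (same assumed schema):

    -- Filter rows with WHERE, then sort them with ORDER BY.
    SELECT
      species,
      plant_date
    FROM
      `bigquery-public-data.san_francisco.street_trees`
    WHERE
      plant_date >= TIMESTAMP '2015-01-01'  -- only newer plantings
    ORDER BY
      plant_date DESC;                      -- most recent first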
B: That's ORDER BY. Then there's aggregation: what would you be averaging, and how are you grouping the results? And let's say you want to get really fancy and add a JOIN statement. This is where you pull in data from two or three different tables and bring them all together on a common key — some kind of common identifier that's unique for each row — and you can still use all of these other statements that we talked about before.
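Those two steps might look like this — the aggregate runs against the same assumed table, and the JOIN skeleton uses placeholder tables rather than any specific dataset from the talk:

    -- Aggregate with GROUP BY: count and average per species.
    SELECT
      species,
      COUNT(*) AS tree_count,
      AVG(dbh) AS avg_diameter  -- dbh: trunk diameter (assumed column)
    FROM
      `bigquery-public-data.san_francisco.street_trees`
    GROUP BY
      species;

    -- JOIN skeleton: combine two tables on a common key.
    -- (table_a and table_b are placeholders, not real public datasets.)
    SELECT
      a.id,
      a.measurement,
      b.label
    FROM table_a AS a
    JOIN table_b AS b
      ON a.id = b.id;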
B: Okay, so with that out of the way, we now have a clean baseline of SQL, and the first thing we're going to do is figure out what the most famous trees in San Francisco are — because I know everyone woke up this morning thinking, "you know what's really been bugging me? What are the most famous trees in San Francisco?" Good news: we have an answer for you. How do we do that?
B
B: You can click on the link, it'll load this query for you, you can run it, and you can learn what the most famous trees in San Francisco are. That's one of the things we've done in Marketplace to help people get started: providing these sample queries and letting users click through and run them. The idea is that it gets you a good start on the dataset — it gives you a place to start, and you can play out from there.
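A query in the spirit of that sample — "famous" here meaning landmark status, with the legal_status value being an assumption about this table's labels:

    -- San Francisco's landmark ("famous") street trees.
    SELECT
      species,
      address
    FROM
      `bigquery-public-data.san_francisco.street_trees`
    WHERE
      legal_status = 'Landmark tree'  -- assumed label for landmark status
    ORDER BY
      species;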
B: So, okay, what's next? That was the simplest version of this, so here's something I think is a little more advanced. What we're doing here is taking hurricane data from NOAA's IBTrACS, which is essentially the international community coming together and agreeing that this cyclone — this hurricane, this typhoon — was in this given place at this given time. Apparently that's shockingly difficult relative to how easy it sounds.
sounds.
B
The
query
was
actually
cashed
ahead
of
time.
That's
why
it
only
took
you
know
three
hundredths
of
a
second
to
run.
It
I
had
recently
loaded
it,
but
I
think
one
of
the
cool
things
that
that
bigquery
can
do
is
connecting
out
to
data
studio.
This
is
Google's
data
visualization
platform,
and
it
allows
you
to
kind
of
connect
out
and
visualize
this
data
very
quickly.
What
I'm
doing
here
is
I'm
just
going
in
and
I'm.
instead of leaving the hurricane center point as text, we're changing it to a latitude/longitude so that BigQuery interprets the data properly. Then, with a few clicks, I dragged in this new observation and we visualized the path of the hurricane very quickly. You can see the path the hurricane took, and the shading is done by distance to land at each point.
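A sketch of the kind of query behind that map, assuming the NOAA hurricanes public table and its column names (latitude, longitude, dist2land, and the season encoding are assumptions — check the schema before running):

    -- One row per observation for a single storm, ready for mapping.
    SELECT
      iso_time,
      latitude,
      longitude,
      dist2land  -- distance to land, used for the shading in the demo
    FROM
      `bigquery-public-data.noaa_hurricanes.hurricanes`
    WHERE
      name = 'MARIA'
      AND season = '2017'
    ORDER BY
      iso_time;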
B: You can drill down and see some of these individual points — there's a lot of overlap here — and see the individual parameters for a given point you might want to look at. So I think this is a really cool way to go from having data to having a visualization really quickly. And the nice thing about Data Studio is that it's entirely free, it's super easy to share, and it's entirely web-based.
B: All you have to do is send someone a link to your visualization; once they have the link, they'll be able to view and interact with it without having to download any additional software or tools. I love Data Studio — I don't say that because I work for Google, though it doesn't hurt — I love it mostly because I am a self-described data visualization enthusiast.
B: I even have a favorite data visualization professor, but let's move on from that before I embarrass myself more. Okay, so I think that was the intermediate step: we found the dataset we were looking for, we subsetted it to find just the data we want, and then we went out and visualized it — and that's a really important part of the scientific process, especially for those of us, like myself, who are really visual people. But what if I want to do something even more advanced than that?
B: So let's talk about the GOES-16 data — the GCS datasets, the unstructured data in Google Cloud — and how you would discover those datasets. I'm going to go out to a random search browser — I'll just pick Google; I don't know why I landed there — and you can see that one of the first search results is the Marketplace page we've talked a little bit about. Here we'll load the Marketplace page for GOES-16.
B: You can see in here we have a description of the dataset and some links out to related data and to some of the tools to work with it, and if you click on this link down here, it'll take you to the bucket that has the raw data in it — the raw NetCDF files as they're produced by NOAA. Which is great, except if you don't know exactly which dataset you want to use, or exactly what subset of the data you want to look for.
B: If you have the file naming convention memorized, and some of these other very particular things memorized, and you're willing to search through the bucket — by all means, don't let me stop you. But I think there's an easier way to do it. So if we go back — I'll go back just a little bit here — you can see, as we go through the Marketplace page,
I'll click on the big blue button at the top, because that's what we want you to click on — that's why I put it there — and you can see we're loading a metadata index of the GOES-16 data. What we've done is parse out the metadata from the self-describing NetCDF files and put it into BigQuery, to let you search and dig through it. You can see a preview of the data here, and one of the exciting things — you can see it at the very end —
is that it gives you the link to each file as it exists on Google Cloud. So, much like we did with the Hurricane Maria data, you can subset by the points that make up the corners of each image's bounding box, you can subset by time, you can subset by the channel of the dataset, and it'll give you a list of the files that match that result.
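A sketch of what that metadata query can look like. The table and column names here are assumptions about the GOES-16 metadata index described above — the real index may differ, so inspect its schema in the BigQuery UI first:

    -- Find Level-1b radiance files for one hour of GOES-16 data.
    SELECT
      dataset_name,  -- the NetCDF file name
      base_url       -- gs:// path to the file in Cloud Storage
    FROM
      `bigquery-public-data.noaa_goes16.abi_l1b_radiance`
    WHERE
      time_coverage_start BETWEEN '2017-09-20T00:00:00Z'
                              AND '2017-09-20T01:00:00Z'
    ORDER BY
      time_coverage_start;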
B: So you can go in, subset this data in BigQuery, grab these file names, and begin to create these really cool images. The most popular way I've seen people do this — and I think the way we've done it for a very long time — is to take the data and manually download it to your local computer. And if it ain't broke, don't fix it, right? But I think it might be broke, and here's why.
B: We've gone in, and I'm going to subset the data down to one hour's worth of files from GOES-16, just the Level-1b products. As I scroll down here, you can see — again, I cheated; it was cached, which is why the results came back so fast — you get the individual file names. What you're not seeing is that about a hundred files come up in this result, and that's just one hour's worth of data.
B
So
if
you
wanted
to
visualize
all
of
Hurricane
Marija,
for
instance,
you're
looking
at
days
worth
of
data,
yes
over
given
points,
but
you
know
you're
looking
for
much
a
much
higher
level,
a
much
higher
volume
of
data,
and
so
we
could
take
this
out.
We
could
go
to
our
terminal
and
we
could
use
the
GS
util
SDK
to
download
these
data
directly
to
our
to
our
computer
and
it
would
work
you
would
get
the
data
on
there.
B: But if you're working from home like I am today, because I don't like being cold, then you have to compete with your family members streaming Netflix or Hulu, or with your internet service provider maybe not having a great day, and it can take a while. So that's one way to do it, but I think there's a faster way, and that is manually downloading the data into a VM.
B: This works, and it's a very similar process. We'll go and subset the data again — again, I cheated; it was cached ahead of time, but this result comes back in a few seconds — we can copy the file names out, and we can pull them into a VM in Google Cloud. I'd click over to my console, to the part of the web UI where I can launch a new VM, and I'd create an instance; I really just need the basic parameters.
B: I'm going to enable some of these APIs to allow me to download this data, set a few other settings, and create a VM. Now, the cooking-show magic behind the screen is that it takes about a minute to spin up — to create — one of these VMs from thin air, but no one wants to watch that.
B
You
know
if
you're
watching
a
Martha,
Stewart
cooking
show
no
one
wants
to
watch
the
pie
sitting
in
the
oven
for
an
hour
and
that's
why
Martha
Stewart
magically
pulls
out
another
copy
of
the
another
pie
from
underneath
the
counter.
So
if
you
see
here,
you'll
see
that
I
magically
pull
out
my
VM
from
underneath
the
counter
so
about
a
minute
later,
I
can
come
in
and
I
would
use
the
exact
same
command.
I
used
on
my
terminal,
the
benefit
here
being
that
the
cloud
SDK
is
already
preloaded
on
all
these
VMs
and
I.
B: Well, what if I told you there was an even faster way to do it? Wouldn't it be nice if we could automatically create a bunch of VMs, run some code, and then have them shut themselves down? And wouldn't it be nice if each of these steps auto-scaled itself, much like BigQuery does — pulling in more resources when you need them and turning those resources off
when you don't? We've actually already looked at a lot of these steps, and I have taken credit for them — please don't tell Lak; well, he does know, but it doesn't hurt to not remind him. So we have these steps in here, and he goes through them: if you wanted to plot a step-by-step image of Hurricane Maria as it was captured by GOES-16, you could do that in Python using the pyresample package, and he includes the code on GitHub.
B: But what if, instead of having to spin up the VM manually — he talks about this a little bit in the blog post — and instead of having to preconfigure these VMs, every time a new image came in you could automatically stand up the VMs you need to process it and add to the image, instead of taking the time to create one JPEG at a time and string them together?
B: What if this all happened in the background? That's the benefit of Cloud Dataflow: it lets you connect to these datasets and use them in a more repeatable process. I don't have time, unfortunately, to go through all of this — I thought you'd probably enjoy the other demos more, and in retrospect I may have been wrong — but the code's available on GitHub.
B: If you could go from having to run all those steps to just doing your analysis and visualizing it, and let us take care of the rest on the back end, I think that helps everyone get where they want to be and get to the fun part of this. As I mentioned earlier, I'll share my contact information by email with Carl, and I'll be sure to include a link to this blog post and to the GitHub page as well.
B: It's linked throughout the post, but I think it'll just be easier to have it right in front of you. So with that, that's all I have. We've got about ten minutes left, and I'm more than happy to take some questions — and if we run out of time, of course, I'm also more than happy to take those over email.
A: We've got a couple of the executive directors on the line — I saw Melissa Cragin is on, and I believe Meredith Lee — so Midwest hub and West hub, perhaps they'd like to chime in first before the other folks jump in.

C: Sure, thanks, Leah. This is Meredith. I can definitely say that we have already benefited from partnerships and collaborations with Google Cloud and all the different public datasets that we worked together to put on that nice snapshot Shane showed — from the Department of Transportation, for example.
C: Amy Unruh from Google in Seattle actually flew out to a hackathon in the Midwest and was mentoring and serving as a topical chair for some of those efforts, so it's been really great so far. And actually, to build upon Steve's question and look at future collaborations, I wanted to ask Shane and any other Google folks on the line about that snapshot of all the logos that you showed at the very beginning — and apologies...
B: So I think we focused initially on these national-level datasets, one, because they tend to be really broadly useful — they tend to give you at least a good map of the entire country. We've onboarded a handful of city datasets, but frankly we've found that we're hit-or-miss on some of the usage for those, and we don't have infinite resources, so we tend to focus our efforts where we can have the most positive impact.
B
You
know
that
being
said,
you
know
we
recognize
that
they're
also
gaps
there
too,
and
and
we'd
love
to
work
with
you
guys
to
help
fill
some
of
those
in
and
and
I'd
love
to
hear,
obviously,
in
a
longer
forum,
then
the
next
few
minutes
kind
of
what
that
looks
like
and
how
we
can
help
work
with
you
guys
on
that
I
I'm
really
appreciate
you
calling
out
Amy,
because
she's
awesome,
I
love
working
with
Amy.
She
does
really
a
really
amazing
stuff
and
I
almost
forgot
to
mention
her.
C: We're looking forward to phase two, and given your self-professed enthusiasm for data visualization, I think it's a great match moving forward. And the sandbox is super exciting — I think that's going to go a long way in showing that early value proposition for some of the city, regional, and federal connections.
C: This is Steve. I can't so much comment on the regional datasets, but I can give you a little glimpse of what I've been doing, and that's been working on chemical and biological data. I've been working to get PubChem, from NIH, loaded; and through the EPA, the Environmental Protection Agency — they have what's called the DSSTox database, part of the ACToR datasets — we're getting all the toxicology data associated with molecular content.
C
I've
we've
already
got
the
Kemble
database
from
the
EBI
european
institute
of
bioinformatics.
Of
course,
we
have
all
the
short
Kemble
data
available,
which
is
the
molecular
content
from
patents.
So,
basically
most
of
the
molecules
that
have
been
patented
and
we
also
have
I'm
working
on.
We
have
the
orange
book
data
from
the
FDA
and
I'm
working
with
the
genus
to
try
to
get
that
updated,
which
would
be
like
an
international
or
global
dictionary.
C
You
might
say,
or
encyclopedia
of
drugs
available
globally
and
we're
also
getting
all
the
G
wast
data,
which
is
the
genetic
data.
So
we're
working
to
get
all
that
data
loaded
in
as
well.
We
hope
to
have
a
announcement
in
August
and
at
that
time,
we'll
be
presenting
a
lot
of
the
scientific
data,
sets
that
we're
focused
but
I've
personally
been
focused
on
working
with
partners
and
as
well
as
some
new
tools
such
as
crime.
A: Sounds good. So, I know we're reaching the top of the hour, so I just want to let everybody who needs to run off to the next meeting go ahead and do that. But I wanted to thank Shane and the Google folks for joining us today, and I'll stick around here if others want to have discussions about the path forward or other pieces. The only other thing I wanted to say is that we are swapping around the order, I think.
B: Well, thank you. I really appreciate the opportunity to speak today. I actually am one of those people who has to run to the next meeting, but like I said, I'm happy to share my contact information with Carl, so please feel free to reach out to me. I'd love to continue the conversation.