South Big Data Hub Data Sharing & Infrastructure Group, 7 Sep 2018

From YouTube: CI WG demo: data.world (Bringing People and Data Together)

Description

Date: 9/7/2018
Presenter: Patrick McGarry
Institution: data.world
South Big Data Hub

A

I would like to welcome Patrick McGarry. He's currently building the thriving data community around data.world as their head of community, and prior to this he served as the director of community for the Ceph open-source project at Inktank and later for Red Hat after a successful acquisition. [inaudible] Patrick enthusiastically helps companies to understand and adopt open source ideals through community engagement, conferences and events. So with that, and if he's had a chance to share his screen, I'll hand it over to you, Patrick.

B

All right, let me get the screen sharing up here.

B

All right, I think you should be able to see my screen now, correct?

A

Looks great. Excellent, Patrick.

B

All right, well, thank you very much for having me here. I know this is a short talk, so I'll try and get us out of here on time. I hadn't planned on talking a lot about data.world specifically. I usually have the enviable position of being on the community side of the house, which means I don't have to make the company any money, so I just get to play with the cool toys and all the cool people doing stuff.

B

So this was a little bit more aspirational, although I think some of this may be new, so we can speed through that and then have a bit more of a discussion if people would like to do that. So I'll jump right in. I always like to start with a few factoids at the beginning of some of my talks. Some of these may be new, some of these may not be new; they're just some of my favorites, especially this one.

B

More data was created in the last few years than in the sum of all human history. This is my favorite factoid to start with, because it's true every year I give a talk. Every year it just keeps continuing to ramp up and ramp up, headed towards the zettabytes, which I'm looking forward to. And of all of that data that's being created, a lot of it is being done in the open, with more than eighteen million open datasets, depending on whose count you tend to believe.

B

Now, that's more than 10x the number of websites that existed when Google launched. If you compare the world of documents on the web to the world of datasets, the problem isn't that we don't have the data. There's tons and tons of data that's being generated, and in many cases it's being shared. The problem is that we can't find and use that data.

B

It's either hiding on a dusty FTP server somewhere or, even worse, off on somebody's laptop, which is a major bummer. And so you start looking at the idea of the Semantic Web. Like I said, the web was creating links between documents; the Semantic Web is creating links between data. And this is not a new concept. It's been around for a couple of decades.

B

Tim Berners-Lee, I love to listen to some of the talks that he gives, talking about: hey, yeah, I invented the Internet, it's kind of cool, it's had some impact on the world, but really, if we can do the same thing with data, it will probably have more like 10x of an impact on the world. The history of usage of the Semantic Web, though, was always: okay,

B

Well, go get your PhD, and then we can do some cool stuff. The barrier to entry was just so high that the general masses, that long tail of usage, just never was able to jump on that bandwagon.

B

But the interesting thing is, and data.world is not the only one doing this for sure, we're trying to take the ideas and the technologies of the Semantic Web and linked data in general and provide more accessibility, and we're seeing some really interesting things that are coming out of this.

B

Datasets together are creating much more value than they ever would apart, and we're starting to bridge the gap between the world of open data and the world of proprietary, closed, corporate, or however you want to put it, siloed type of data that you may find interesting. I always like to bring up this example: well, sales and marketing data for your company is important.

B

People like to use that. But now what happens if, instead of having to do your own demographics research, you could just take all of the very feature-rich data that the US Census collects on a periodic basis, immediately merge that into your sales and marketing data, and start asking so much more relevant and interesting questions? That's the kind of thing that we do, and we're seeing a lot more of that just across the industry.
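
Editor's sketch: the census merge described here can be illustrated as a simple left join on ZIP code. This is a minimal sketch in plain Python, not data.world's implementation; the record layouts, field names, and all numbers are invented for illustration.

```python
# Hypothetical sales records and census-style demographics, keyed by ZIP code.
# All field names and values are invented for illustration.
sales = [
    {"order_id": 1, "zip": "78701", "amount": 120.0},
    {"order_id": 2, "zip": "78701", "amount": 80.0},
    {"order_id": 3, "zip": "77002", "amount": 200.0},
    {"order_id": 4, "zip": "99999", "amount": 50.0},  # no demographics available
]

demographics = {
    "78701": {"median_income": 90000, "population": 9000},
    "77002": {"median_income": 70000, "population": 16000},
}


def enrich(sales_rows, demo_by_zip):
    """Left-join each sale to the demographics for its ZIP code."""
    enriched = []
    for row in sales_rows:
        demo = demo_by_zip.get(row["zip"], {})
        enriched.append({
            **row,
            "median_income": demo.get("median_income"),
            "population": demo.get("population"),
        })
    return enriched


def revenue_per_capita(enriched_rows):
    """Total sales per ZIP divided by that ZIP's population."""
    totals, population = {}, {}
    for row in enriched_rows:
        if row["population"] is None:
            continue  # skip sales that could not be matched
        totals[row["zip"]] = totals.get(row["zip"], 0.0) + row["amount"]
        population[row["zip"]] = row["population"]
    return {z: totals[z] / population[z] for z in totals}


enriched = enrich(sales, demographics)
per_capita = revenue_per_capita(enriched)
```

Once the two tables share a key, a question such as "where do we sell the most per resident?" becomes a one-line aggregation over the joined rows.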

B

We've also done some interesting work with, like, the CDC. Obviously the ideal is: what if cancer researchers from London, LA and Tokyo could all share their work seamlessly, so that we could get a multiplication of effort? But obviously, for everyone here, this is not news to you folks. We're moving the needle on all of this. So the open data movement is crazy.

B

Unfortunately, we're at a very interesting nexus of the academic researchers, the open source world, the open data world, the individual data enthusiasts, as well as the corporate interests and the public sector: governments, municipalities, NGOs, foundations. And every single person that I talk to in any of those groups always says that, you know,

B

Data access across their organization or their community is still incredibly difficult. Even when you look at the idea of data science as this new, exciting buzzword or new, exciting industry, the AI folks don't talk to the ML folks, who don't talk to the deep learning folks, in a lot of cases.

B

It's the idea of these silos that is incredibly difficult. We're trying, as an industry, I think, to create this idea of a data-driven culture, and that's really difficult to do when you consider what the landscape looks like. Are you a lone wolf data scientist? Well, okay, you can consume some stuff, you can do some work, and you share your insights, but how much impact can you have on the [inaudible]?

B

And so you start looking at modern data teamwork, and what does that look like? Okay, well, maybe you're a small or medium business, or maybe you're a small research team. You've got five or ten people. And then you start looking at who the modern companies are that have been building themselves from the ground up to be data-driven. Google and Facebook have thousands of data workers,

B

However you might define that. Amazon in particular is very good at having a lot of people digging into numbers all the time. But obviously there's the multiplicative ability of that long tail. This is the community approach. This is: let's do with open data what open source has done with code for many years, successfully. One of my favorite examples of the idea that when you bring people and data together, some really exciting things can start happening, is closer to home for data.world.

B

That was, obviously, the recent hurricane activity that hit Houston so hard. There was a lot going on, and there were a lot of people trying to jump in and help who didn't really know what to do. And so there was this mobilization around Hurricane Harvey that did some really cool stuff. People started saying: all right, the phone networks are overloaded.

B

We can't get ahold of people. And so people were actually sitting on their rooftops and tweeting: hey, I'm here, the water levels, thankfully, have stopped rising, but we can't get anywhere, we can't get out. And so there were actually a couple of different groups that came together and started doing some natural language processing on Twitter.

B

Saying: okay, let's find the people that are in dire need, let's find the people that are screaming for help. Then, as they started to work their way through that list, they started being able to have real impact on groups and say: all right, let's mobilize individuals and coordinate efforts to get people rescued or get people help. And then all the way down to, after things had calmed down a little bit, looking at well water reports and things like that.

B

So there was all of this, from NLP, to the impact of where things were hardest hit, all the way down to who still has clean drinking water. You started seeing a lot of really interesting and impactful things happen when you put a broad community, not an organization, not an individual, but a broad community, to work tackling the data when it was made available. [inaudible] from the SP group

B

Did some interesting analysis, and I think that's what you see on the right-hand side there: the picture of the hardest-hit neighborhoods, based on a number of different factors. And then she obviously shared all of her data, so that other people could double-check and say: oh, hey, did you think about this? And we've actually seen multiple projects that were spawned off of her work.

B

That started saying: well, I want to do deeper analysis on public works impact, or I want to take a look at individual commercial housing, or what have you. So it was really interesting to put the power of linking data and people together. So what's next? I will share a little bit of data.world, and if people want to ask questions or whatnot: we're a free resource, we're a very good hub model kind of thing.

B

We're frequently described as the GitHub for data. So if you want to do things in public, for the good of mankind sort of thing, create a free account and go to town. That's all you've got to do. The way that we stay in business is, for people that want to have a lot of private datasets or really large datasets,

B

Then we ask them to help us keep the lights on. But really, as the community guy, I'm more excited about what we can do. We have worked with groups to help fight human trafficking, and we've worked with foundations and NGOs.

B

We worked with the XPRIZE Foundation to try and help their global learning initiative. We're always interested in what we can do to help proliferate the world of open data. And, more importantly, I would be remiss if I didn't mention some of the work we've been doing around data practices.

B

The data practices community was something that we started back last November, where we basically tried to gather a bunch of visionary thinkers around the worlds of semantics and data journalism and data visualization and open source, all of these different people. We put them in a room and just shook it to see what would happen, and we came out of the other side with a lot of thoughts around: hey, what can we do to start breaking down some of these silos? What can we do?

B

What interesting things can we draw from other people's successes to help the data ecosystem and the data community start to thrive a lot more?

B

One thing that we focused on at that gathering and in those discussions was that, hey, the old waterfall model of development for software was really kind of broken, and it wasn't as good as it could be. And so along came the Agile Manifesto and the agile movement, which I'm pretty sure most people have heard about at this point. It turned software development on its head and did a lot for modernizing software development to where we think of it today. And so we did the same thing with the idea of the data community.

B

So we created this manifesto for data practices. It's a set of values and principles that describe what modern, ethical data teamwork looks like. Since then we have, I think, over 1,500 signatories and some really notable authors on there, everybody from DJ Patil all the way to the folks working on the Jupyter and R communities, people like Brian Granger, Fernando Perez, et cetera. So it's really been interesting and impactful, but more than that, we wanted to move beyond words on a page.

B

So if you remember nothing else of what I said today, just remember datapractices.org. We're in the process of moving that out into the community.

B

It's been community based forever, but we just threw it up on an Amazon instance in data.world's farm so that we could get it out there and share it with the world. As we continue to engage with people and this continues to evolve, we're starting to develop exercises and workshops with the community to help bring modern data teamwork into places that may not be quite as modern as they would like to be. And this is all a community effort.

B

It's not a data.world thing.

C

Patrick, just a quick question, since you mentioned remembering something: are you going to make this slide deck available so that we can refer back to some of the things that you've mentioned, or should we be taking serious notes here?

B

No, I did send the slide deck to Carl, although I think we're missing a slide at the end there about data.world. But most of this that I would care for you to remember is all going to be made available, I'm pretty sure. Carl [inaudible], probably.

D

But we can distribute the slides, certainly.

B

That would be great, thanks very much. And it's great that you asked the question, because I'm done with my rambling now. I hit my 10-minute lightning talk, and so now, if anybody has questions, I'm happy to make this more of a discussion.

C

Yeah, thanks, Patrick. That was really great to hear about what you're doing, and I think the problems that you highlighted are, from at least my perspective, right on. [inaudible] We hear a lot about big data, but I think capitalizing on the promise really requires breaking down silos. I'm just curious, to that end, what might be some examples. You mentioned NLP, which would be processing a stream of data, extracting information from it and organizing it.

C

But do you have any other examples where you're taking maybe existing data, or you're looking at something on a more longitudinal basis,

C

Where you are charged with archiving historic components of it over time?

B

Yeah, well, I mean, that's a broad question with a lot of moving parts, but there are definitely a number of different things that I could use as examples here. As far as archival goes, we've seen a lot within the governmental space in particular, and I'm thinking now especially of ecosystem-type, environmental data. So we've had a number of [inaudible] governmental groups that were afraid that their

B

Data was suddenly going to disappear, and so they put it up on data.world, so it would be in a third-party repository. To further those ends, we've actually been working with Zenodo to make sure that we could allow our community to mint their own DOIs, so that they could have a better archival story.

B

We already version all of the data that comes into data.world and surface all of that stuff, but we're working to increase our capabilities there and allow people to have more interactivity with those versions, rather than just being able to download or revert, and to start looking at how we can create the idea of a data pull request kind of thing. So those are some of the things that we're doing in terms of data archival.
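
Editor's sketch: one way to picture the "data pull request" idea is as a reviewable diff between two versions of a table. The code below is a minimal sketch in plain Python and is not how data.world implements versioning; the function name `diff_versions`, the `station` key, and the sample readings are all hypothetical.

```python
def diff_versions(old_rows, new_rows, key):
    """Summarize what a proposed new version changes relative to the old one.

    Rows are dicts; `key` names the field that identifies a row.
    """
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    return {
        "added": [new[k] for k in new if k not in old],
        "removed": [old[k] for k in old if k not in new],
        "changed": [(old[k], new[k]) for k in old if k in new and old[k] != new[k]],
    }


# Hypothetical water-quality readings: version 2 corrects one value and adds a station.
v1 = [{"station": "A", "ph": 7.1}, {"station": "B", "ph": 6.8}]
v2 = [
    {"station": "A", "ph": 7.1},
    {"station": "B", "ph": 6.9},
    {"station": "C", "ph": 7.4},
]

proposal = diff_versions(v1, v2, key="station")
```

A reviewer could inspect `proposal` row by row before accepting the new version, much as a code reviewer reads a diff before merging.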

B

Looking at streaming data and the things that you can do with that, we're also looking at things like hardware failure data. Some of you may be familiar with the Backblaze dataset, which takes a look at a really wide number of hard drives and how long it takes them to fail. We're actually working with the Ceph community right now. They've built in a collection agent so that they can anonymously gather

B

What types of hard drives are involved in large storage clusters. We're working with people like Cisco and a couple of others to start getting some data out there to show what the failure rate looks like, which usage profiles lead to faster failure, and things like that. So there's always interesting stuff going on out there. I hope I answered your question.

D

This is Johnny. I had a question; I'm trying to frame it. Early in the talk, you talked about the Semantic Web, which was a way of organizing links to data that's spread around the world. As you describe data.world, it feels like you simplify things by concentrating all the data that your overlay gives access to in one place, effectively. Do you think that is likely to change over time, so that you're mixing data across the world?

B

Oh, absolutely. Even now, the default, for ease of use, is that you drag and drop your spreadsheet into data.world, or you connect up your database via a JDBC connector or whatever. But we're already seeing a lot of our clients and partners virtualize the metadata into data.world, while all of the data lives where it already lives. So we're not moving the data into data.world for a lot of our datasets.

B

Originally we wanted the data in a single place, so we could start doing some interesting things in a more simplistic way as we continued to build and scale. But we're to the point now where we're starting to do VPC deployments and, like I said, that virtualized approach: let's get the metadata in, so that your data doesn't have to move. This is especially important for very large datasets or for regulatory-type datasets with fintech or insurtech type of ramifications.

B

So yes, I absolutely see a world in which we tend to index the data but not necessarily move it in-house as we go forward, because the Semantic Web stuff is really important to us. Being able to be the Amazon recommender system, where it says: hey, we see your data has this certain shape, here's a couple of things that might be related or might help you to enhance what you're doing.

D

Go ahead. No, please carry on. In the same vein,

D

You surely have something interesting to say about, I'm sorry, moving data. If I want to analyze a chunk of data, I need to move it to somewhere where I can analyze it, or there are resources locally to analyze it. How do people generally get the cycles to apply to the data?

B

Well, yeah, that's still a hotbed of contention. Do you bring the data to the analysis or the analysis to the data, right? It's interesting working with the AI and ML communities, because they tend to think in very different ways. I like working with some of my friends at Google, because their favorite saying is: I forgot how to count

B

That small. When they're working with massive data, you can't move it around, so you've got to figure out ways to get the compute to the data. But it's interesting, the innovative tricks and tools that are coming out, especially in the cloud wars, right, with Amazon versus Google versus Microsoft versus Oracle. Everybody's got a story to tell about why their cloud is better.

B

I think that we're only going to see more and more sophisticated options when it comes to data and analysis and how the two shall meet, so I don't have any very strong opinions. Probably, if it's massive data, you want to bring your compute to it. But for me, looking at the data science landscape, it seems like the more you can do to take a slice of your data or a sample of your data, the better off you're going to be. Trying to do analysis on terabytes or petabytes of data

B

Just isn't practical in most cases.
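
Editor's sketch: taking a fixed-size sample from data too large to hold in memory is commonly done with reservoir sampling (Algorithm R), which reads the stream once and keeps only `k` rows. The speaker does not name a specific technique; this is a minimal illustration under that assumption, with a hypothetical function name and an integer range standing in for a large dataset.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Uses Algorithm R: a single pass over the data, holding only k items.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row  # replace an existing item with probability k/(i+1)
    return sample


# Stand-in for a large dataset: any iterable works, including a file or a database cursor.
sample = reservoir_sample(range(100_000), k=5, seed=42)
```

Because the function only iterates, the same call works on a generator that streams rows from disk, so the full dataset never has to be loaded or moved at once.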

D

Hi, sorry, I think I missed a good part of your talk. I wanted to be here, but I just had other commitments.

D

So forgive me if you've covered this already, but I've been following your project with great enthusiasm. I'm just curious if you're linking up and standardizing identifiers at all and linking back toward ontologies. Specifically, I'm thinking about something along the lines of Wikidata and how you might interface with Wikidata, so that we can begin to develop essentially standards around the Semantic Web and linkages.

B

Yeah, definitely. I will answer this question to the best of my ability, but keep in mind, I am neither a data scientist nor an ontologist, so take what I say with a relative grain of salt. That said, we have started to roll out some of our more semantically focused features. We already have in the system the idea of matching.

B

A good example is, if you upload a dataset and it has a column that's called zip and it's all five-digit numbers, we'll ask you: is this a US ZIP code? And if you say yes, we start to infer a certain amount of semantic meaning from that. We'll be able to say: hey, do you want to bring in the city or the state or the census tract? So you can immediately enhance your dataset automatically, based on some of our in-house ontologies.
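
Editor's sketch: the column-matching behavior described here (a column named "zip" whose values are all five digits is probably a US ZIP code) can be approximated with a simple heuristic. This is only an illustration, not data.world's matching logic; the function names, the 95% threshold, and the two-entry lookup table are assumptions.

```python
import re

FIVE_DIGITS = re.compile(r"\d{5}")

# Tiny illustrative lookup; a real system would use a full reference table.
ZIP_LOOKUP = {
    "78701": {"city": "Austin", "state": "TX"},
    "77002": {"city": "Houston", "state": "TX"},
}


def looks_like_us_zip(column_name, values, min_fraction=0.95):
    """Guess whether a column holds US ZIP codes from its name and its values."""
    if "zip" not in column_name.lower():
        return False
    cleaned = [str(v).strip() for v in values if v is not None]
    cleaned = [v for v in cleaned if v]
    if not cleaned:
        return False
    hits = sum(1 for v in cleaned if FIVE_DIGITS.fullmatch(v))
    return hits / len(cleaned) >= min_fraction


def enrich_with_place(rows, zip_column):
    """Add city and state to each row when its ZIP code is in the lookup."""
    enriched = []
    for row in rows:
        place = ZIP_LOOKUP.get(str(row[zip_column]).strip(), {})
        enriched.append({**row, "city": place.get("city"), "state": place.get("state")})
    return enriched


rows = [{"zip": "78701", "sales": 10}, {"zip": "77002", "sales": 7}]
if looks_like_us_zip("zip", [r["zip"] for r in rows]):
    # In the workflow described above, this is where the user would be asked to confirm.
    rows = enrich_with_place(rows, "zip")
```

The heuristic only proposes a meaning; asking the user to confirm, as the speaker describes, guards against five-digit columns that are not ZIP codes.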

B

We're also starting to get to the point now where we will start building the tools for people to bring in their own custom ontologies, whether this is UPC data from commercial real estate, retail stuff, or whether that's the fisheries data from the Pacific Northwest, or whatever it may be.

B

There's a lot of work to make sure that we can do custom ontologies. We have definitely talked with and about DBpedia and some of those types of things, and we're still working out what that tool is going to look like and how many linkages there might be. That said, data.world has definitely taken the stance of: if something exists and it's doing a good job, let's not reinvent the wheel.

B

When it comes to analysis, we tend to integrate with people like Microsoft Power BI or Tableau or Google Data Studio or whatever. We don't want to build our own analysis and be yet another tool on the landscape for already tool-fatigued people. So if there are ontological tools out there, and that was the subject of our CTO's doctoral thesis, or his postgraduate work, I'm sure he's definitely aware of it, and we're trying to not reinvent the wheel again. So I hope that rambling was close enough to an answer to get me off the hook, but keep in mind, I am not a data scientist.

D

All right, well, just a point of clarification, because you said DBpedia and I said Wikidata, and they're similar but quite different animals. I just want to make sure you know Wikidata is really doing an amazing job of lining up identifiers, and I think that it's probably something you guys could actually make good use of.

B

I'm sure. I mean, both, I'm sorry, I misspoke, but both of those have been in the discussion. There's a long list of people. There are also people that are doing visualization around ontologies, like Ontodia. We have a long list of people that we're trying to work in integrations and stuff with.

B

My background is in open source, so the Wikimedia Foundation is near and dear to my heart. I would love to work closely with Wikidata to make sure we're doing the best that we can for the community.

A

Okay, well, super. If we don't have any other questions, that's probably a good segue into the last part of our agenda. Thank you again, Patrick, for your presentation. We really appreciate that, and I can't wait to read through datapractices.org.

B

Oh, thank you very much. I appreciate the opportunity, and it was great to join you guys.