GitHub CodeConf 2015, 30 May 2017

From YouTube: Microsoft Azure and Hadoop, an OSS Story, Jennifer Marsman - CodeConf 2015

Description

Jennifer Marsman, a software developer and technical evangelist, discusses how Microsoft supports the Hadoop community and their deployment as part of Azure Data Lake.

About CodeConf
CodeConf improves the software community by providing a forum for thought-provoking talks and forging social connections. The third installment of the CodeConf series took place in Nashville in 2015. Attendees came together to discuss open source, best practices, documentation, and community.

For more information on this year's CodeConf, go to:
https://codeconf.com/

A

My name is Jennifer Marsman and I have been with Microsoft for about 13 years now, so I am essentially a grandma at Microsoft. One of the things that I have noticed over my time at Microsoft is that there is a real culture change happening. Especially in the last two years, I think we're really seeing the battleship turning in a new direction, especially in regards to open source software.

A

So my teammate Casey spoke yesterday about our open sourcing of .NET and the Roslyn APIs, and some of that cool stuff, and so today what I want to focus on specifically is our cloud story.

A

So some of the openness that we've got going with our cloud, which is specifically Microsoft Azure. Azure, from the beginning, has been a very open platform. When it was very first launched, even at the very beginning, there was support to put up websites in this cloud using not only, you know, ASP.NET like a good little .NET developer would, but also PHP and Ruby and Python, and all these other languages as well. We have Node.js support.

A

We have all kinds of good stuff, and then there's also the capability to put operating systems up, so run whole virtual machines in our cloud, and we give people several options. For Windows operating systems there's a whole lot of different Windows options, but we've also supported Linux from the very beginning, and there's a number of different distributions we support. We support Ubuntu, we support CoreOS, OpenLogic, SUSE, so there's a number of different options up there and available, and different versions of each of these different options.

A

So again, from the very beginning, it's been doing that. And then there's also something called Azure Mobile Services, and that's specifically kind of a back end as a service when you want to do mobile development. From the beginning, we've supported not only Windows Phone, but also Android and iOS devices. So there are APIs you can call, and code you can write in Objective-C to call into Azure Mobile Services, and what that gives you is all the things that you typically need when you're doing phone development.

A

So, typically, you need like a database in the cloud right, especially if you want to sync among multiple things or you don't wanna have to store a lot of things on your phone, so it gives you database capabilities.

A

It gives you auth capabilities, it gives you push notifications, and again, it's not just through our Windows push notification framework, but we also integrate with Google Cloud Messaging, which is how you do push on Android, and Apple's push notification service. So just very inclusive and kind of open on Azure from the very beginning, from the get-go, and we continue to kind of add new things in there. So one of the newer things we have added is the support for Big Data.

A

So in the Big Data world there is something called Hadoop. It is an Apache project, Apache Hadoop, and currently it is what the industry uses for big data. So for the rest of the talk, I'm gonna focus on kind of what we've done in our path and cultural journey, specifically with Hadoop. All right, so I mentioned Hadoop is used for Big Data. Now, just to level set:

A

Let's make sure we're all on the same page about what Big Data is. I know some people, when they're new to it, are kind of like, okay, where's the cutoff? Is it just, like, data, and then that plus one is big data? Like, how does that work? So when we're talking about Big Data, typically we're talking about the three V's of Big Data, and the first of those is data volume.

A

So when you're working at a certain quantity of data, that's considered big data. And obviously, if you have petabytes, those kind of things, just massive amounts of data, exabytes, or even terabytes, that's just a lot of data to play with, and those kind of things are considered a big data problem.

A

So when you think about industry trends right now, like the Internet of Things, in those scenarios, there may be sensors on various things and they're all sending data up to the cloud, and then we want to do analytics or something interesting with that data.

A

So a couple of examples: I know there are some cases where farmers are using little sensors, like on their fields, and the sensors can measure things like light and rainfall and wind speeds and all of that stuff, and then push the data up there, and then they're able to make intelligent choices,

A

like which fields to plant and those sort of things. There's also a really amazing connected cow story that I really want to tell, but there's not time. I think I've already told some people at the conference, but go to YouTube and look up connected cow. There's a guy, Joseph Sirosh, and there's, I think, a video clip of him presenting it at Strata in New York, and it's awesome. It's amazing stuff. So volume is the first thing. The second is data velocity.

A

So what I mean by that is when you have a problem where data is streaming in in real time, and you want to be processing it in real time or near real time. In those scenarios, that's also a big data problem, and the scenario I always think of for this specifically is in a hospital. So I think it was about five years ago, I attended, fairly close to here, the Kentucky Celebration of Women in Computing, and one of the women that spoke there was a PhD.

A

I forget if she was in academia, there's a lot of healthcare stuff in Kentucky, but she spoke and told us how you can actually predict when a heart attack is going to happen in a hospital if the right sensors are hooked up. So think about when you go in the hospital: they put all those sensors on you and they're measuring your blood pressure and your heart rate, and all these other,

A

you know, millions of things, and so they actually have the knowledge to be able to detect when cardiac failure is going to occur, but we weren't doing it because we couldn't keep up with the processing. And that just infuriated me, right? Like the fact that we had the ability, like we could potentially save lives. We could predict that people were going to have heart attacks, but we weren't doing it because it was just too much processing overhead to hook up, like, every,

A

you know, every person who came in to these types of things and then do all that monitoring and processing. So that's another thing that got me really, really passionate about big data, like five years ago, whenever this happened. And then the last V is variety. It's also sometimes called versatility and some other V's. But essentially what I mean by this one is the concept that you have data from lots of disparate sources. So a scenario where you might want to do something like this is March Madness.

A

Let's say we want to write, like, a machine learning algorithm. You know, we're gonna up our chances in the pool and do some machine learning to try to figure out who's going to win the March Madness tournament or whatever. And okay, so now my lack of basketball knowledge is just going to start to show here, but we probably want a whole bunch of different kinds of sources of data for that.

A

Let me just kind of make it more generic, because my baseball knowledge, or my basketball knowledge, is going to fail me here. Let's say football or something, so look at a football scenario. Since it's played outdoors, you might want to grab weather data, right? Because you know, when a team from the south comes up to Michigan, where I live, and tries to play in the snow, you'd hope they're going to have some problems. So maybe weather is a factor in how well people play. Maybe things like injuries?

A

You want injury reports, and grab that too. You need individual stats. You need the team collectively, all of those different stats, and then, of course, the features that would go in differ depending on what sport you're looking at. But you know, all the stats that everyone tracks like crazy when you're trying to pick out your fantasy team, all of those things need to play in. And then another thing that might be an interesting factor is just, like, raw emotion.

A

You know like I know when Michigan plays Michigan State and like Michigan State won the previous year, like Michigan comes back with a vengeance right, they are mad. They want to win this year, so things like that. So there's all these different factors and stuff where we might want to draw in and use those together, and so that can also be considered a big data problem.

A

So all these three V's, either by themselves or in combination, are kind of what make up the big data realm, and this is this big data thing that we want to conquer. So at this point, big data is out there. People are doing it, people are using it. So we're faced with a strategic decision. Microsoft wants to be involved with this. We want to be able to provide, you know, big data solutions to our customers. Like, what do we do? What do we do about big data?

A

So Microsoft is... oh, I got a few laughs. I did not think I would. I'm like, Sex and the City, I'm not sure that will fly with this target audience, but I'll give it a try. So, okay, we got like one laugh. So that, by the way, is Matt Winkler, who is the head of the HDInsight team, which is the Hadoop running on Azure team.

A

But in this scenario Microsoft does have experience with big data. We essentially have done, like, stuff in the past. There's Bing: Bing runs one of the largest data centers in the world, where we're keeping, essentially, you know, our copy of the internet for doing, you know, search engine stuff. We also have Azure, of course, and in Azure we've been monitoring the health and bringing machines up automatically when they fail and all that stuff for a long time.

A

We have all the telemetry data that comes in from Office and Windows, and that has a lot of users and stuff, so we've been managing that for a while. And then things like Xbox Live, oh my gosh, all of the Xbox Live users out there, and just handling that. So we did have experience in the space of big data. And because we had this expertise already, it really became a question of, you know, build versus buy, right?

A

We had the expertise, so we could potentially build something ourselves, or we could choose to buy something or adopt, you know, a solution that's already out there in the open source world. And so I'm really excited that we did actually make the decision to use what the industry is already using, and I think that kind of speaks to some of the culture change that we're seeing at Microsoft, because ten years ago, I don't know if that would've been the decision. So we decided to go ahead and adopt Hadoop.

A

Now Hadoop is an Apache project. It is open sourced in that manner, and essentially what Hadoop is, is distributed processing, right? So you spin up a big cluster of machines and you have a main controller, and it kind of uses the MapReduce pattern and farms out a whole lot of work to all these different worker nodes, and then reduces down and gives you some output. I'm way oversimplifying things.

A

If you want to talk about it in more depth and talk about some of the other pieces of the Hadoop ecosystem, because there's a lot of other stuff in there, a NoSQL database, which is HBase, and Hive, and real-time processing and all this other stuff, so if people want to talk about it more later, come find me, because I'm happy to go geek out a little bit more on this.
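
The MapReduce pattern described above can be sketched in a few lines of Python. This is a minimal, single-machine illustration under stated assumptions, not Hadoop's actual API: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) and the sample input are hypothetical, and a thread pool stands in for the cluster's worker nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group every emitted value by its key, between the map and reduce steps."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks, workers=4):
    """Farm the chunks out to workers (threads here), then reduce the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_words, chunks))
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical input: each string plays the role of one block of a large file.
    chunks = ["big data big cluster", "data volume data velocity"]
    print(word_count(chunks))
```

In a real cluster the chunks would be blocks of a distributed file and the workers would be separate machines; the shape of the computation (map, group by key, reduce) is the part this sketch is meant to show.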

A

But essentially this is what it is and what it does, and Hadoop had essentially gotten a lot of momentum in the industry. Cloudera and Hortonworks were already kind of building enterprise-quality

A

Hadoop distributions. Facebook was using it, and I think there was enough momentum in the industry that we decided that, yes, it's reached escape velocity and we're going to go ahead and adopt Hadoop and have that run on Azure. And one other point I want to make: just in the realm of the database world, you see a lot of companies like Microsoft and IBM and Oracle, and we all have, you know, these proprietary solutions for databases, right, Oracle and SQL Server and stuff like that, and Facebook had adopted Hadoop.

A

So this Hadoop infrastructure was running kind of alongside, you know, at the same level, at Facebook scale, with, you know, best-of-breed things like Oracle and SQL Server. So it just kind of speaks very highly to the quality of open source software and the kind of amazing things that people are doing with it. So I thought that was really awesome. All right, so we're going with Hadoop. We know that. So the next question was: how do we do that, right? To fork or not to fork?

A

How do we move forward? Do we branch for our own distribution and maintain our own distribution of Hadoop from Apache? Do we go with one of the existing ones out there? I mentioned that Cloudera and Hortonworks already were maintaining their own, you know, enterprise-grade distributions of Hadoop, and so those were options, and essentially we decided to go with that.

A

So again, Microsoft chose not to build it ourselves, but to use an existing thing, which was kind of cool, and so we partnered with Hortonworks and we're using their enterprise distribution and running that in Azure, in our data centers. And one of the reasons we chose Hortonworks specifically is that I think we were very aligned in kind of how we felt about this, and Hortonworks have always said that their mantra is Apache

A

first. So all the great stuff that they're doing, they make sure it gets back into Apache Hadoop, and so that kind of mentality of, you know, rising waters lift all boats, right? We're all in a community, we're going to help each other out and make it better for everyone, and that was very much aligned with how we were feeling about Hadoop as well. All right, so the next question, now that we decided we're gonna go with Hadoop: we want to make it run, and it doesn't run on Windows.

A

So we're going with the solution, we're gonna work with Hadoop, and Hadoop was written in Java. You know, Java: write once, run anywhere, right? Uh-huh, yeah. It didn't work on Windows well, so the first thing we had to do was go forward and actually make it work on Windows, and that kind of got us into the open-source community step by step. So the very first thing that had to happen: the team that formed the Big Data team at Microsoft was our existing data team.

A

So this was a group of folks from the SQL Server team. These guys are, you know, PhDs in database theory and query optimization and stuff like that. But we didn't have a lot of Java expertise, so kind of starting out,

A

we had to, like, learn Java and make sure people kind of got up to speed with the language and adopted another language. And then the second step was just actually participating in the open source community, so kind of getting to know the culture and understanding how it works and what the right etiquette is and that sort of thing.

A

And then we started submitting issues and patches and that sort of thing and contributing a little bit, and the first priority there was just ensuring that Hadoop would run on Windows. So, you know, do kind of the baseline work to get it working on Windows. And then, after we got it working on Windows, the next leg was to actually get it working well on Windows. Hadoop running on Linux still just massively outperformed Hadoop running on Windows, and so we're like,

A

okay, you know, what can we do to up our game here? And so we ended up having to dive really deep into how the JVM interacted with NTFS and the specific Java APIs that are calling into, you know,

A

the Windows APIs, and so it was really interesting. You know, you had to do some low-level stuff, but we ended up doing a lot of good work there and then contributing it all back to the main Apache Hadoop to make Hadoop run really well on Windows. And then we even got to the point that some folks became committers into Hadoop, which is pretty cool.

A

So for those of you who came to this conference to learn more about the open source community and may not be familiar with the term: when you're a contributor, you can submit bugs and issues and, you know, here's a request, or submit my code fix, and that kind of thing. And then committers are the ones who actually have write access to the source code, so they're the ones that approve those things and actually can write them in.

A

So we got to the level that some folks at Microsoft are actually committers on Hadoop. So that's great to, you know, kind of have that vote of confidence that we were good contributors there. All right, so where was it hard? Let's talk about where we kind of struggled a little bit. The first thing was just a complete culture change.

A

Right? Again, we had this group from SQL Server, these Microsoft devs who had written SQL Server and were PhDs, really good at query optimization and database theory and this kind of thing, and they're now working with the open source world, which some of them didn't have that much knowledge of. So things like just sparring with random people in a listserv-type setting was a new, different experience.

A

The other thing was just around timing of work and workflow and how that went. We, as, you know, a corporation, set deadlines, right? We have milestones or sprints or that sort of thing, kind of dates where we want to try to get stuff done by. And when you're working with the open source world, a lot of people are volunteers, all right, and you can't push your timelines on other people. So we had to do some adjustments there and figure out how to make that work

A

well, and then people had to kind of be smart and make sure they allotted time. I'm sure when we first started, we were probably submitting things into the tree like, okay, here's a sprint's worth of work, get that back to me, and it's like 50 different submissions, and all kinds of things. But just kind of learning and working at the right pace and trying to figure out how to make deadlines and stuff like that.

A

But it really kind of grew this new culture at Microsoft, so we were getting there. We were kind of fumbling a little bit, and I don't want to say that we have it perfect. I'm sure, no, positive, that we don't, but we're learning, and we've actually seen some things that have been kind of changing that make me really excited. So the first thing, I think, is a really key turning point here.

A

The fact that when we started, we were all about like making Hadoop run well on Windows and that's awesome, but that's kind of a self-serving goal right.

A

It's helping us to make it run well on Windows. But there's an initiative called the Stinger project, and that's another Apache thing for trying to optimize and make some of these things run faster. There's a whole kind of group of them; Tez and some of these other things are also in that camp. And so for this initiative, one of the goals was to make Apache, or these things, run faster. And Hive is a query language,

A

so sort of like SQL for working with Hadoop and HBase and others in the Hadoop ecosystem, and it wasn't performing as well as maybe it could. So we were looking for ways to make it better, and what happened is we took, you know, some of these, again, these PhDs in, like, query optimization, who had written SQL Server, and they were like: well, you know what, based on all the stuff that we know from writing SQL Server,

A

we know ways to make this faster. And they wrote a paper called the Hive 100x and basically put together ways to make Hive run, for some queries, 100x faster, all right? And then we took that and partnered with Hortonworks and Facebook and others and contributed all of that back into Hadoop. And that is awesome when you think about it, right? Because that was essentially, like, intellectual property, right?

A

This was some of the, I don't want to say trade secrets, but it was intellectual property that we used to help make SQL Server so great, and we were giving it back to the open source community. That's something I'm really proud of, and another thing I think that signals that culture change, that you wouldn't have seen maybe ten years ago. So in summary, a bunch of SQL Server developers are writing, you know, Java code now to improve and support open source. Like, that's awesome, that's really, really cool.

A

Another point is there's something that we introduced at Build in April called Azure Data Lake, and essentially what that is,

A

is it's like the underlying storage for a Hadoop cluster. When you spin it up, kind of the industry standard for the underlying file storage is something called HDFS, the Hadoop Distributed File System, and with the Data Lake, what we're doing is taking that standard and making it available in our infrastructure. And so again, I think what that really shows is that we were looking specifically at OSS, and what people are using now is shaping the investments that we're making today. And it's funny too, because I see the team, we track,

A

we look at what people are using right now. What are the most popular tools? You know, where is the open source community going? And we were making decisions based on that, and that's awesome. Like, open source really is helping drive and shape this product. So in summary, there's just been so many kind of cool things around the Hadoop story that I found these numbers. Don't quote me on these numbers.

A

I actually took kind of an old slide, so these numbers might actually be even bigger now, but at the time the slide was made, it was about ten thousand engineering hours that we were using, and over thirty thousand lines of code contributed back into Hadoop. So that's awesome. We were responsible for helping get, you know, Hadoop on Windows working. We had the Hive 100x query speedup. Some of the people at Microsoft advanced to the point where they're committers into Hadoop.

A

We offered that HDFS service in Azure Data Lake, so that we're, you know, using what the open source community wanted to use in our system. So all of these things together, I think, just have me so excited about how the battleship has turned and how Microsoft and open source are working together much better now. And actually, I think that the Big Data team, like the Hadoop team here, is actually hiring.

A

So if anybody wants to be part of this awesome culture change, you know, let me know. But I'm just so excited about where we're going and I just can't wait to see. Although we're not perfect, I think we're finally turning that battleship in the right direction. Thanks a lot.

A