Description
Delivering On Demand Analytics for Data Scientists at Discover, with Brandon Harris and Anirudh Pathe of Discover Financial Services.
Filmed October 28th, 2019 in San Francisco.
Brandon Harris: So I was interested to hear what Sherrard had to say. I think you'll hear some commonalities between the conversation today, what's happening in the open source space, and what we've had to do from a financial services perspective. So, real quick: the lawyers win again, right? What you're hearing today is the opinion of myself and Anirudh, and not necessarily Discover Financial Services.

And speaking of Discover: I know some of you are probably familiar with us from the Discover card, our credit card kind of originating the cash-back reward space, but we are also a full-service bank, so we provide bank and deposit products across the board. We've got online checking accounts, online savings accounts, CDs. We also have various loan options, from student loans to personal loans to home equity loans; pretty much a full suite of banking products to make sure that our customers save smarter and can spend more wisely.
We built this to solve a number of problems similar to what you've already heard about; in fact, it is kind of uncanny how similar the problems are. One of the things that really drove us to build out a data science workbench, and there are some good ones out on the market that teams and organizations could buy, was this:
We've had to be good at analytics and data science for a long time, even before analytics and data science were, you know, buzzwords. It's just something we had to do, like many other companies in other industries, to stay competitive. So the effort around analytics really grew up tied to specific business units. We had analytics and statistical teams focused on risk, obviously, and on our financial side of things, and we had teams that grew up siloed and focused on marketing.
We even have analytics happening in our internal audit team; they're actually doing some really cool things with neural networks. But all these teams grew up tied to these specific business units, and they have different tools, different needs, different capabilities. We didn't want to force anybody to use the same tool and just say: stop what you're doing, even though you've been doing it for a while and you're good at it; come use this one tool, this one platform, and forget everything you've done previously. We didn't want to do that.
Some of the problems we were solving: when we sat down and talked with data scientists and the data science teams about the problems they were running into, the big thing that we kept hearing over and over again was storage space. Obviously the cloud helps this out quite a bit, but on premise we had quotas and, you know, different capabilities for different teams. Some teams needed hundreds of gigabytes or even terabytes; other teams only needed a gigabyte here and there. So it was very difficult to forecast storage needs, and lots of teams were running out of space whenever they tried to do anything at scale. Another big one was just the overall time it would take to get a model from inception, through training and development, all the way to deployment; that just became very, very long.
This, again, kind of came about from being in a regulated industry. We had a shared, multi-tenant environment for some of the tools, mainly around Hadoop. So if a user wanted to use JupyterHub, it was on a shared edge node talking to our on-premise Hadoop clusters, and we had very strict controls around what could be installed on it. They couldn't just install a package or the latest version of H2O or something like that; it was very locked down.
A
So
it
would
take
weeks,
if
not
months,
to
get
tools,
upgraded
and
new
packages
installed.
So
that
was
what
we
heard
and
we
set
out
to
solve
at
least
the
majority
of
those
with
our
build
out
of
air
9
and
air
9
where
it
lives.
Today
is
really
this
intersection
of
code,
compute
and
data.
So
to
describe
that,
let
me
kind
of
walk
you
through
the
user
journey
here
of
an
air
9
user
when
a
user
logs
in
they
use
their
Windows
credentials
or
Active
Directory
credentials.
A
They're
then
presented
with
a
window
where
they
start
to
think
about
data
provisioning.
The
first
thing
they
want
to
do
is
create
a
data
set
to
use,
and
they
do
this
through
talking
to
our
cloud
data
warehouse
which
is
snowflake,
they're,
essentially
browsing
metadata
tables
or
our
discover
data
catalog,
with
the
end
result
of
coming
up
with
a
query
of
sequel
query
that
represents
the
data
set.
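To make that concrete, here is a minimal sketch of that provisioning step using the snowflake-connector-python library; the account, credentials, and table names are hypothetical placeholders, not Discover's actual catalog.

```python
# Hedged sketch of the dataset-provisioning step: query Snowflake and end
# up with a SQL query that represents the data set.
# Account, credentials, and table names below are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",   # hypothetical Snowflake account
    user="my_user",
    password="***",
)
cur = conn.cursor()

# The end result of browsing the metadata tables / data catalog is a
# SQL query that represents the data set:
cur.execute("SELECT * FROM analytics.transactions WHERE txn_date >= '2019-01-01'")
rows = cur.fetchall()
conn.close()
```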
So when we talk about code: what I just described was kind of the interactive model. When we talk about code and GitHub integration, that's really how we're doing our scheduled work. We have Airflow running in this environment as well, and when we want a user to schedule something, or want them to test a data transformation pipeline over a week or a month, they can use Airflow. Airflow will read the code from GitHub and talk to the Air9 platform at the right time, the scheduled time: it'll instantiate a container, inject the code and any relevant data sets into the container, the model or whatever the process is will run, and the pod will save the output to a log somewhere and then shut down.
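As a rough illustration of that scheduled flow, a DAG along these lines could run a containerized job at the scheduled time; the operator, image, namespace, and script name here are assumptions for the sketch, not Air9's actual configuration.

```python
# Minimal sketch of the scheduled pattern described above, using Airflow's
# KubernetesPodOperator (import path shown is the Airflow 1.10-era one).
# DAG id, namespace, image, and script name are all hypothetical.
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

with DAG(
    dag_id="weekly_data_transform",      # hypothetical pipeline
    start_date=datetime(2019, 10, 1),
    schedule_interval="@weekly",         # e.g. test the pipeline over a week
) as dag:
    run_transform = KubernetesPodOperator(
        task_id="run-transform",
        name="transform-pod",
        namespace="air9-tools",          # hypothetical tools namespace
        image="python:3.7",              # container instantiated at the scheduled time
        cmds=["python", "transform.py"], # code injected from the GitHub repo
        get_logs=True,                   # save the output to a log
        is_delete_operator_pod=True,     # pod shuts down after the run
    )
```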
So that's kind of how we've engineered Air9 and delivered this intersection of data, code, and compute. For the technical design, I'll talk a little bit about how we've iterated: we've gone from a very complex, messy system on top of OpenShift, leveraging some AWS-specific components, to a much simpler implementation. This is a slide from the chief architect of this product; he likes to quote Albert Einstein a lot: make everything as simple as possible, but not simpler, right? That's the approach we've taken. In March of this year we had a lot of services doing various things, all running in their own pod deployments.
We had a Jenkins server with multiple Jenkins jobs for deployments, multiple Lambda functions, tons of state machines; it was kind of a Rube Goldberg machine of things plugged into each other to get this to work. We've since scaled this down: we've got the entire Air9 application into a single pod-based deployment, all of the services, everything it needs to run, and that's how it scales. It's a completely stateless version of the application, and in the middle layer we have our operational SQL database, which is just Postgres on RDS.
So any time somebody takes an action in the Air9 UI application, the results are stored in the SQL database, and there's a watchdog process going on, watching for new records in the database to take action on.
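A minimal sketch of that watchdog loop might look like the following; the table, columns, and connection string are hypothetical, since the talk doesn't describe the actual schema.

```python
# Hedged sketch: poll the operational Postgres database for new records
# and take action on them. Schema and DSN below are hypothetical.
import time
import psycopg2

def handle(action):
    """Placeholder for the real work, e.g. creating or terminating a container."""
    print("handling", action)

conn = psycopg2.connect("dbname=air9 user=watchdog")  # hypothetical DSN
last_seen_id = 0

while True:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, action FROM ui_actions WHERE id > %s ORDER BY id",
            (last_seen_id,),
        )
        for row_id, action in cur.fetchall():
            handle(action)
            last_seen_id = row_id
    time.sleep(5)  # watch for new records every few seconds
```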
What that looks like, at a very high level, is what you're seeing here: the Air9 application box on the left is in its own namespace, a separate Kubernetes namespace for the Air9 application, and in the box on the right-hand side there's a separate namespace for the tools containers.
We're essentially mounting EFS and giving users home directories, as well as team and shared storage directories. So they're able to open up files to work with in, let's say, SAS Studio or RStudio, do something, save their output to their home directory or to a shared folder, and then open up H2O or JupyterHub and read those same files, or work with them simultaneously. Since it's EFS, all those tools are mounted with the same personal home directories and team folders, and all the files and data can flow between those environments.
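In Kubernetes terms, each tool pod mounts the same EFS-backed storage; a sketch with the kubernetes Python client might look like this, where the claim name, image, and paths are illustrative assumptions.

```python
# Hedged sketch: mount one EFS-backed home directory into a tool container
# so files flow between RStudio, JupyterHub, H2O, etc. Names are hypothetical.
from kubernetes import client

home_volume = client.V1Volume(
    name="user-home",
    persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
        claim_name="efs-home-pvc",       # hypothetical EFS-backed claim
    ),
)

tool_container = client.V1Container(
    name="rstudio",                      # same pattern for the other tools
    image="rocker/rstudio",              # illustrative image
    volume_mounts=[
        client.V1VolumeMount(name="user-home", mount_path="/home/user"),
    ],
)
```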
So, I mentioned the container lifecycle. This is a very, very simple slide and a very simple concept, but I mention it because it was one of the biggest lessons we learned from a self-service perspective: when you give users, data scientists, the opportunity to choose a large environment, they will take it. They will not provision the one core and four gigabytes that you give them and say start small and work your way up.
They start with a super-extra-large, and you just get a bunch of long-running environments taking up a lot of resources. So very quickly we focused on lifecycle management. Our users have the ability to request environments and then see them in a dashboard, at any given moment, with exactly what they're using and what their environments look like, and they can stop them or terminate them themselves.
That went a little ways toward solving this problem of out-of-control environments, but what we eventually had to implement was auto-expiration of these environments. Without auto-expiration we just saw the number of environments skyrocketing; they would stay out there forever and never get shut down. So we tied this auto-expiration value, a time-to-live value, to the amount of resources being used by the environment. If somebody had provisioned a one-CPU, four-gigabyte pod running RStudio, it might live for a couple of weeks before the system shuts it down.
If they happen to go to the other extreme and provision a 128-core environment with, you know, a terabyte of memory, it's going to live for about 24 hours before it gets shut down. Users have the ability to extend these environments once, in case they're doing something really important, but it very quickly built in this kind of reinforcement: if you want the resources you can have them, but you'd better give them back right away, because other people need to do their work as well.
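The talk gives only two data points for this time-to-live rule (one CPU and four gigabytes lives roughly two weeks; 128 cores and a terabyte lives roughly a day), so the following sketch of the sliding scale is purely illustrative.

```python
# Hedged sketch of auto-expiration: bigger environments get shorter lives.
# Thresholds and interpolation are illustrative, not Air9's actual rule.
def ttl_hours(cpu_cores: int, memory_gb: int) -> int:
    footprint = cpu_cores * memory_gb       # crude resource-footprint score
    small = 1 * 4                           # 1 CPU, 4 GB   -> ~2 weeks
    large = 128 * 1024                      # 128 cores, 1 TB -> ~24 hours
    if footprint <= small:
        return 14 * 24
    if footprint >= large:
        return 24
    # scale the TTL down as the footprint grows, floored at one day
    return max(24, int(14 * 24 * small / footprint))
```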
So it's kind of like the fractional-reserve banking of compute. If they request a 16-core machine with 64 gigabytes of memory, we're probably giving them four cores and maybe 16 gigs of memory; that's what they get out of the box. If their workloads scale up and need to hit that limit, they can certainly do so, and the Kubernetes resource allocation and scheduling helps with that. But this is how we're able to get a lot of density out of these pods and containers on the underlying instances.
Getting as many of these environments onto a single EC2 instance as possible is important, and limits and requests are what helped make that possible and helped keep the cost down.
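A sketch of that request/limit split with the kubernetes Python client is below; the one-quarter ratio matches the 16-core/4-core example above, but treat it as an illustration rather than the platform's exact policy.

```python
# Hedged sketch of the "fractional reserve" pattern: reserve a fraction of
# the ask via requests, allow bursting to the full ask via limits.
from kubernetes import client

def resources_for(asked_cpu: int, asked_mem_gb: int) -> client.V1ResourceRequirements:
    return client.V1ResourceRequirements(
        requests={  # what the scheduler reserves up front (~1/4 of the ask)
            "cpu": str(max(1, asked_cpu // 4)),
            "memory": f"{max(1, asked_mem_gb // 4)}Gi",
        },
        limits={    # the ceiling the workload can burst up to
            "cpu": str(asked_cpu),
            "memory": f"{asked_mem_gb}Gi",
        },
    )

# e.g. a 16-core / 64 GB ask actually reserves 4 cores and 16 Gi up front
```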
So I think that's the last slide from the technology perspective. Now Anirudh is going to talk a little bit about the softer challenges with Air9: what it looks like to onboard a bunch of these users and, now that we've built something, how we got them using this platform.
Anirudh Pathe: Thanks, Brandon. Right, so, touching on a few things here: essentially, when we build tools for internal use at companies, we often overlook a few things, right? User experience, user adoption, even business value; those kind of get overlooked when you build internal tools. But we work at Discover, which has won J.D. Power awards for customer satisfaction for the past six years, so we want to make sure we have the customer at the center of everything that we do internally as well. So we started asking the question again: what do data scientists need?
Their asks are not unreasonable; they're not irrational. Essentially, they ask for the most simple things. So we started setting out our guiding principles, essentially saying: how do we build a slick, neat, easy-to-use UI and UX so that they can get onto it without a problem, without having to learn it for a week or two? We want them to get running right away. So we built out a sleek little UI, and then we wanted to abstract the tech complexity away from the customer, the internal customer; in this case, our analysts and scientists.
They don't have to know about Docker; they don't have to know about a container either. All they ever do is provision the container that they want and get running with it. So we've abstracted the tech complexity completely from the application itself. The other two things, which I think are similar in other industries as well, are help and latest versions. Right, again: help is hard to find, especially at a large company; you don't know who to talk to, you don't know where to go.
We wanted to centralize all of that together in one single place where you can ask questions, get responses, and then go to a page to even do some self-learning as well. Latest versions, again, Brandon touched on a bit: the complexity in a larger company like ours, especially in the financial industry, is that it's difficult to get the latest and greatest version right away. You have to go through a series of approvals, go to procurement, and then get the latest tool. We wanted to abstract them from that as well, and that is actually driving value right now. So what we see today, and again this is one of the examples that Brandon gave you: essentially, you can pick and choose the tool that you want, be it SAS, H2O, or Python. Again, a financial-industry problem: we inherit a lot of SAS usage by default.
Essentially, if I'm a Python user, I can wrangle my data in Python, and if my teammate is an R user, he can come back in and build a model in R without needing to learn the same language. That actually brought a lot of skill sets together: we have SAS users, R users, and Python users collaborating in the same tool without having to change their ways. Again, our eventual goal is to change the entire company and how it operates on coding practices, but again, that's something we'll solve over time.
So, Teams is our internal communication tool, and I know especially this area of the country is divided between Teams and Slack; I have my own preference, but for now we had to make do with what we had. Essentially, we now centralize all of our help inside of Teams in one place, and what we started seeing, and again this is normal at any company too, is that we have more work than people to do it. So we outsourced, or crowdsourced, our help community, essentially, when we brought everyone together in one single tool.
Now we started seeing collaboration that we didn't really expect before. When people asked questions, we wanted to stand up teams to answer them, but now people are actually solving other people's problems; they are answering. We see different departments coming in and pitching in together with ideas on how to solve things. So people are helping people, essentially. We take all the chatter that we get in the Teams channel and put it together into our "Need Help" doc page, so we have all the FAQs sitting in one place. Now, if I have to get started on a project, I can do that right away: we have sample scripts and reference code so I can get started immediately. This essentially made it super easy for us to onboard users. Again, when you're talking about a twenty-thousand-person company, there are a lot of people that need to be trained and onboarded, and it's not simple
if we just do it one by one with a training class every single week. So we wanted to get everything, all the FAQs, all the help, together in one spot, and that kind of helped us get adoption really high. The product itself launched in about March, about six to seven months ago, and we've already got about 60% of all the analyst users in the company on this platform.
That includes more than ninety percent of all data scientists in the community. And when I say analysts: any person that touches data to provide something of value, I call an analyst. We've reached about 60% of analysts right now, and we hope to complete the adoption by mid next year. That means every single person, with their toolset, can provision an environment, run code, deploy it, and visualize it in a matter of seconds.
Again, we measure ourselves on this as well: essentially, how we can get to the maximum number of people in the company in the quickest way possible. We're getting there. Again, Brandon and I are extremely passionate about delivering products that actually create value and empower our internal customers, so they can do what they do best, which is make customers happy.