Description
Michael Hausenblas (Red Hat OpenShift) discusses the KAML-D (Kubernetes Advanced Machine Learning & Data Engineering Platform) open source project with the Machine Learning on OpenShift SIG of OpenShift Commons. Learn more at https://github.com/kaml-d/design
A: If you look at a concrete situation: someone might be using, I don't know, R or even Python, it doesn't matter, and, you know, the data scientists put together a wonderful model. They have all their parameters sorted, something that works on their machine. And then come the data engineers and developers who actually write the application, in PHP, in Rails, in JavaScript, in Go or whatever, and they very often need to reimplement the machine learning model. You know, the data scientist had that already working, and that causes friction.
A: That means that, if something changes, if they decide to iterate and have an even better model, with, you know, a better cost function or whatever, the data engineers and developers need to play catch-up with that, and KAML-D really focuses on that: bridging that gap, helping data scientists on the one hand and engineers and developers on the other to work together better. So, at a very high level, typical use cases would be:
A: Who has not done this: copying a data set, appending a version one or the current date or whatever, doing that again, adding something, removing something, sharing that with others. And there is a solution for that, right?
In code we use Git. Nowadays we typically use distributed version control systems like Git that take care of that automatically: whenever we say, take a snapshot of the current code base, an entire trail, an entire set of histories, is built up, and we can go back in time. The same is possible for data sets, and I'm gonna go into that in a moment.
If you're a data engineer or a developer, then you benefit from the unified way KAML-D handles the data sets and the models. And again, think of: you might be implementing version one while the data scientists are already working on version two, or we're refining the models, we're taking a different approach or whatever, and you want to be able to pick that up as quickly as possible; and also for bugs, to a lesser degree. And, you know, it provides the guidance to run, to operationalize, let's say, the machine learning features.
This is a very, very early phase; as you can see, I haven't got any running code yet. The UX is more or less as follows: you essentially have four tabs, and that might well change.
A: I might actually, after a discussion with Graham, who presented, I think two months ago, on JupyterHub on the Commons video channel, I might actually swap that out; I might actually make these tabs part of JupyterLab, so it's actually turning it around. But essentially you have these four tabs.
You have the Data tab, where the user, mainly a data scientist, would upload the data set; you can point it to somewhere on the web or just upload it from your local drive. And then you have two ways, essentially at the metadata level, to search for data sets.
A: And anyone who has been working in a real-world situation knows there tend to be many data sets you're dealing with, so you can very quickly find the relevant data set, and whenever you want to, you can, and that's this green checkmark here, essentially take a snapshot of the data set. This is a copy-on-write snapshot and essentially only captures the diff.
A: So if you, for example, have a CSV file and say, for whatever reasons, I want to only take half of it, or you have this typical training and test split, 70/30 or whatever, you can do that with one mouse click. And the second tab would essentially be, currently the idea is having the Hub there or, as I said, if I turn it around in terms of UI, it would be the other way around, but essentially having the Development tab, mainly for data scientists, to essentially put together the model.
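[Editor's note: the one-click 70/30 split mentioned above amounts to something like the following few lines of Python. This is purely an illustrative sketch, not KAML-D code; the function name and the fixed seed are made up for the example.]

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle the data rows and split them into train/test subsets."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = rows[:]          # copy, so the original order is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Example: 10 dummy rows (e.g., parsed from a CSV file), split 70/30.
rows = [[i, f"value-{i}"] for i in range(10)]
train, test = train_test_split(rows)
print(len(train), len(test))  # 7 3
```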
A: The third would be the Deployment tab, which is mainly for data engineers and developers and ops folks, and Observability, where essentially Grafana and Jaeger will be embedded. And in terms of components, as you can see, there is pretty much everything besides the actual workbench, which, as I said, I'm now pretty convinced I'm gonna do as a plug-in into the Lab.
Everything else is there, right? So we can run it on any platform where you're using Kubernetes or, in our case obviously, OpenShift, as the runtime environment.
A: Dotmesh is essentially the thing that takes care of the snapshots, essentially what Git does for code. And then you have two parts for the metadata hub, which is Presto DB, a distributed query engine that exposes a SQL interface against any kind of storage, block storage, whatever, and Elasticsearch, which captures the other bits of the metadata.
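[Editor's note: the metadata-level search described here boils down to SQL queries over data-set metadata. KAML-D uses Presto DB and Elasticsearch for this; the sketch below uses Python's built-in sqlite3 purely as a stand-in for the SQL interface, and the table and column names are invented for illustration.]

```python
import sqlite3

# In-memory stand-in for the metadata store (Presto in KAML-D's design).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datasets (name TEXT, fmt TEXT, row_count INTEGER, owner TEXT)"
)
conn.executemany(
    "INSERT INTO datasets VALUES (?, ?, ?, ?)",
    [
        ("finance-transactions", "csv", 120000, "alice"),
        ("clickstream-2018", "parquet", 5000000, "bob"),
        ("finance-budget", "csv", 800, "alice"),
    ],
)

# A metadata-level search: all CSV data sets whose name mentions "finance".
hits = conn.execute(
    "SELECT name FROM datasets "
    "WHERE fmt = 'csv' AND name LIKE '%finance%' ORDER BY name"
).fetchall()
print([n for (n,) in hits])  # ['finance-budget', 'finance-transactions']
```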
A: On the right-hand side you see, if you're familiar with Kubeflow, that's essentially Kubeflow, so no big surprise there. And just a final note here: depending on your role, you would probably not see all the tabs. A data scientist might only see Data and Development; a data engineer or developer might see Development and Deployment; ops folks might see all of them, or Observability and Deployment. So, depending on your role, you would see different tabs there, and that's pretty much it.
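[Editor's note: the role-based tab visibility just described is essentially a mapping from roles to tab sets. A minimal sketch follows; the tab and role names are taken from the talk, but the mapping itself, including which roles see which tabs, is hypothetical.]

```python
# Which of the four tabs each role sees; assumed from the description above.
TABS_BY_ROLE = {
    "data-scientist": {"data", "development"},
    "data-engineer": {"development", "deployment"},
    "developer": {"development", "deployment"},
    "ops": {"deployment", "observability"},
}

def visible_tabs(role):
    """Return the set of tabs a user with the given role should see."""
    return TABS_BY_ROLE.get(role, set())

print(sorted(visible_tabs("data-scientist")))  # ['data', 'development']
```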
A: Right, right, right, not everyone might know that; that's a really good point, thank you very much. Dotmesh is essentially, as far as I know, the first representative of a new kind of mesh, in this case a data mesh. You might have heard about service meshes like Istio or Conduit and others, but Dotmesh is a data mesh. And a data mesh essentially means the following.
A: In the same way you're externalizing functionality into the service mesh with respect to services and networking, you're externalizing functionality in terms of data, in terms of snapshotting data, into the mesh, and the mesh takes care of that. So technically, Dotmesh in Kubernetes is a FlexVolume that just transparently works there.
You don't really notice it; you're just using, you know, your normal volume there. And there is an API; working together with the Dotmesh folks there, with Luke, there's an API, a Python API, essentially.
A: It allows me to say: take a snapshot of that volume, or whatever, and then I can rewind back in time. So you can imagine that at some point in time you would have a tree-like structure: if I click on that finance transactions data set, I would see this tree-like structure of the snapshots that I've taken, or whoever in that context has taken. Does that make sense? Any questions regarding Dotmesh?
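[Editor's note: the actual Dotmesh Python API is not shown in the talk, so the class below is a hypothetical, in-memory illustration of the snapshot-and-rewind idea, not the real API: each commit freezes the current state, and rolling back restores an earlier snapshot, which is what builds up the tree-like history mentioned above.]

```python
class SnapshotVolume:
    """Toy illustration of snapshot/rollback; NOT the real Dotmesh API."""

    def __init__(self):
        self.state = {}        # current contents of the "volume"
        self.snapshots = []    # list of (message, frozen state)

    def write(self, key, value):
        self.state[key] = value

    def commit(self, message):
        """Take a snapshot of the current state and return its id."""
        snap_id = len(self.snapshots)
        self.snapshots.append((message, dict(self.state)))  # copy = freeze
        return snap_id

    def rollback(self, snap_id):
        """Rewind the volume to an earlier snapshot."""
        _, frozen = self.snapshots[snap_id]
        self.state = dict(frozen)

vol = SnapshotVolume()
vol.write("train.csv", "v1")
first = vol.commit("initial import")
vol.write("train.csv", "v2")
vol.commit("cleaned data")
vol.rollback(first)
print(vol.state["train.csv"])  # v1
```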
D: Michael, I love the way you framed the whole process into the three groups, because I think that was what I was struggling to articulate the other two times that I have sat in on the Kubeflow stuff. Because, looking at it as a data scientist, or somebody who's interested in scientific reproducibility: as a data scientist I want to be able to run whatever I want to run, in R, Python, Julia, and install the libraries that I need to get my work done and access...
B: I had a brief question about, I mean, people tend to draw this diagram, I've seen incarnations of this diagram before, and they tend to draw it waterfall-style, like you did. But, I mean, as a, you know, model jockey, if I'm playing that role, I mean, I'm also very driven by what, you know, data engineering can give me, and I'm just wondering if there's any consideration given to, kind of, the arrows that might be going in the other direction, where I can, you know, train on stuff that I don't know about.
A: Right, right, right, no, absolutely; that's an excellent point, and you're absolutely spot-on, yeah. The arrows obviously go both ways; it's not a, you know, one-way street from one side to the other. I think I hinted at that a little bit in the description: it's iteration, right? So you're going back and forwards and you get feedback, like, someone deploys that in production and goes, like, yeah, that seems to work in that region, but not really there; can we somehow, you know, update the data set or the model or whatever?
D: I just think, following up on what Eric just said: the ability, as a data scientist, to access different data stores, which might have different authentication barriers. You know, in terms of: I'm at this university, I may be able to, you know, access this data, but not, you know, the one over here at this other university, or whatever; HIPAA compliance, things like that. You know, I think the data scientist wants a view into that data engineering level; they just don't want to be the person to necessarily do it.
A: I'm not aware of any open source metadata project that actually does that; hence these bits with Elasticsearch and Presto DB. So if anyone has anything there... yeah, my goal is always to implement as little as possible, so if anyone has any project that I can use there... Yes, but I am...