From YouTube: CDF SIG MLOps Meeting 2020-05-21a

A: And a few of the people loitering, I guess. Should we kick off? Terry, over to you.

A: So: training, model, hyperparameter, endpoint, training pipeline, training set; there's probably more we can add. Parameter is probably an obvious one that I'm missing. You know, hyperparameters are things usually picked by a human, and the parameters are the things learned by training, although with newer algorithms even that's getting blurred, where machines pitch in and pick the other parameters too. So a model is the deployable unit, and a training pipeline is probably analogous to a normal software pipeline and may involve both software delivery and training, I guess.
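
For illustration, a minimal sketch of that distinction, assuming scikit-learn (the example and names are not from the meeting): the human picks the hyperparameters, and training learns the parameters.

```python
# Hyperparameters vs parameters, in miniature (illustrative example).
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Hyperparameters: chosen by a human (or, increasingly, by an automated search).
model = SGDClassifier(alpha=1e-4, max_iter=1000, random_state=0)

# Parameters: learned from the data during training.
model.fit(X, y)
print(model.coef_, model.intercept_)
```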

A: The difference that's striking me about machine learning pipelines is that they are much longer running than a typical software pipeline. Maybe in industry some software pipelines can be pretty long, but if you had an end-to-end ML pipeline that involved training, it's reasonable to think it could be running for hours or days, probably even longer. Terry, what do you see?

A: I just spent most of today re-running a training job, because I was assigning a variable in a map in Python and I had left a comma at the end, so it converted it into a tuple for me. I was staring at it for hours, like, what is wrong? And then it was one of those things where you spot it and you just want to close the laptop.
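
That pitfall is easy to reproduce; a stray trailing comma makes Python build a one-element tuple (the names below are hypothetical):

```python
# The trailing-comma pitfall described above.
params = {}
params["learning_rate"] = 0.01,        # accidental trailing comma
print(type(params["learning_rate"]))   # <class 'tuple'> -> (0.01,)

params["learning_rate"] = 0.01         # what was intended
print(type(params["learning_rate"]))   # <class 'float'>
```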

A: Okay, well, when I run my own training it costs about nineteen US dollars an hour, so I won't do that casually. That's kind of the edge case, though; there are a lot of resumable stages in pipelines, I guess. I'm trying to think of other pipelines I've used where you might start it again but skip over a bunch of stages to get up to where you were. Or, in those cases, if it breaks at one point, is it broken and are you going to start again? What's typical?

E: I think there are going to be some interesting challenges on the larger projects, because typically what you're talking about is a situation where you have a large number of GPUs or TPUs and you're having to distribute elements of the solution across multiple nodes, so the intermediate products may not be something you can reuse easily if something goes wrong with the pipeline run. In the more complex situations there are certainly some very interesting technical challenges that will have significant cost implications.

C: So, and I'm asking just so I understand: when you're talking about the distinctions between ML pipelines and traditional pipelines, are we talking about anything qualitatively different, or is it just quantitatively different? So, for example, Kubernetes is doing very well in managing large traditional workloads, you know, restarting pods and things like this, managing memory usage and all that kind of jazz.

A: Terry would have something to say, because he's been using Tekton, which is kind of a generic pipeline thing, and I know David Aronchick from Microsoft sort of said there's no reason why they should be different. But to me the thing that leaps out is more the length of time that things run for; that's the most problematic, having worked on a lot of different pipelines over the years.

A: That's the thing that leaps out: not necessarily the number of steps, because I've seen boring software pipelines in enterprises that might have 50 steps or something, but this thing might run for three hours, or in Terry's case run for ten days, and cost a significant amount of resources.

A: Maybe it's more quantitatively different. Or maybe the qualitative difference is that there's data flying through. A trigger for one of these pipelines could be... I'm working on something where the training will be updated for a given account once a month, if you like; there's just no reason to do it more often. Or maybe there's some other event, a significant change in circumstances. So there's some trigger for this pipeline that's less frequent than just upgrading some software, but in that case there are chunks of data that have to go through the system.

E: I think this is where it gets quite interesting, because for trivial projects there isn't much difference. But when you start to deal with significant real-world machine learning problems, where you're trying to replace a human with similar levels of processing capability, then you start to hit some fundamental scaling challenges in the approaches that we have at the moment. So, as a reference, if we look at something like autonomous vehicles...

E: You hit some fundamental problems in a number of areas. Typically, when you're looking at that scale of complexity, people are currently using dedicated architectures, where you effectively have a whole data center dedicated to training one model, where you have racks and racks of...

E: Then your biggest problem is actually making sure that you have sufficient data distributed across the memory caches on all the GPUs to actually train against. If you think about it, the GPUs only have a limited amount of RAM; you might be working with 24 gigs of RAM on an individual card, and the card may be able to process all of that data relatively quickly.

A: So that's the pipeline in the large. But if it's a tool, say Tekton or Jenkins X or something like that, orchestrating it, it's not moving those bits around; it's orchestrating. Is the stuff that you've been working on, things like Tekton, capable of scaling to that, or not yet, or maybe in the future?

E: Well, typically they're FPGA cards, so they're effectively very similar in architecture to the GPU approach. You've got a card in a slot in a physical machine in a rack, and you might have, as I said, four of those cards in one physical machine, and then there'll be limits on how many cards you can have per virtual machine within the Kubernetes node.

E: It's about the limits of the PCIe bus. You're talking about moving data to the point at which you're saturating the PCIe bus, and therefore you're dealing with the limits of the physical hardware. So when you look at the dedicated compute units, they're not structured like a conventional server would be; it's basically a rack full of GPUs.

B: So, thinking about what I've read in the main document, about how important it is to be agnostic regarding the actual details of the technology we are using, and what you've just said about having pretty much a data center dedicated to training a specific model: this very much sounds as though Kubernetes is perfectly fine, but we just need an awareness that we're going to have a tiny little node that essentially calls out and says, data center, please do your thing.

E: There are a number of constraints at the moment. Obviously Kubernetes is designed to run on conventional server hardware, so typically it doesn't have the data throughput that you might need to work apace on a large model. But you also have an elastic scaling problem today, because compute hardware is allocated at node level rather than container level, so to actually be able to use containers to build things you have to have nodes that have been provisioned with GPU or TPU hardware. Sorry...
B
I'm
not
trying
I'm
not
suggesting
that
kubernetes
should
be
used
to
do
the
building
it
sounds.
It
seems
very
clear
that
kubernetes
is
not
going
to
be
in
a
position
to
do
that,
but
it
seems
reasonable
that
kubernetes
as
part
of
the
ml
ops
infrastructure
remains
entirely
viable.
So
long
as
it
is
simply
calling
out
to
whatever
custom
hardware
we
might
need
where
custom
hardware
in
the
cases
you've
been
talking
about,
might
literally
be
a
whole
data
center
optimized
for
cycling
up
petabytes
of
I've
done
similar
items.

A: I've done similar things like this, not with Kubernetes but with Mesos, in the past. When you had a piece of work to do, you would label it in such a way that effectively isolated that resource, so no one else would use it. So I imagine it would be similar in this: the training isn't actually happening inside a Docker container, or whatever post-Docker thing Kubernetes uses these days; it's more just shelling out to something else to actually do the work and collecting the results. Or...?

E: Exactly. Yes, it's just using Docker containers, and that's what Jenkins X is doing at the moment. You have some constraints in that you need to elastically scale your nodes in order to control your costs, because you'll be incurring charges for as long as GPU units are connected to an active node. So you need to take certain precautions to make sure that you're cleaning down your pipelines successfully after they've finished, and not leaving dead pipelines lying around consuming resources that you're having to pay for.

E: But there is a cap to scaling at the moment. Some of that cap could be addressed by infrastructure-level changes to Kubernetes itself, to allow us to treat node resources elastically, and that may be something where we can extend Kubernetes to fit better with the ML model of the world.

D: One thing here I would like to hear your thoughts on. I fully agree there are scale problems when we are training models, but I believe there is also a qualitative difference in what putting into production, or deploying, means for models. It means something different from usual software, where it's just putting out the binaries and starting to use them. When we are talking about models and new features, we also have to deal with the already-processed data.
D
So,
for
example,
let's
sing
in
classical
float
detection
system
and
we
were
created
a
new
version
of
the
of
the
model
that
classifies
clients
into
good
or
bad
and
things
with
a
ploy
that
new
model
into
production.
But
after
that,
that
is
not
enough.
We
need
to
reevaluate
all
the
existing
clients
with
a
new
model
to
get
more
accurate
information
about
which
clients
are
on
fraud
and
which
clients
are
they
get,
and
that
is
an
extra
step
also
with
a
lot
of
scale,
problems
that
typically
I,
don't
believe
happens
in
Jews
walls
of
development.

E: Working backwards, if you like: typically a model also requires some associated components, so it may need specific pre-processing for the data that it's going to infer on. So typically there will be a set of components that need to be passed from the training stage into the service that's going to implement an instance of the model, and one of the challenges in that space is that there are currently limited abstraction layers that allow you to encapsulate the model and the associated data...

E: ...a mechanism by which you can specify the structure of a model and pass that from one system to another without actually having to pass serialized classes or chunks of Python. That allows you to decouple your implementation from an explicit versioned set of dependencies. That's one of the problems that Jenkins X actually addresses directly.
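
As a concrete illustration of passing the model together with its associated pre-processing, here is a minimal sketch assuming scikit-learn (the meeting names no specific tool; the file name is hypothetical):

```python
# Bundle the pre-processing and the model so both travel from the
# training stage to the serving stage as a single artifact.
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train, y_train = make_classification(n_samples=200, random_state=0)

bundle = Pipeline([
    ("scale", StandardScaler()),      # the pre-processing the model was trained with
    ("model", LogisticRegression()),  # the model itself
])
bundle.fit(X_train, y_train)

# The serving side loads one artifact and gets both components together.
joblib.dump(bundle, "model-bundle.joblib")
```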

E: So when you want to run a particular training, you need to specify a set of training data and a set of test data to operate from, and you want to be able to repeatedly go back to that set of data. So you need to be able to specify a versioned collection of data and then pass that versioned collection of data to the pipeline that's executing the training.

A: That versioned collection of data could be in something like S3, maybe with some content-addressable hash or immutable name, and then, if you were tracking everything else in the GitOps way, you could just say the data is over here, or use its fingerprint. It doesn't have to be in one monolithic thing, does it?
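
A minimal sketch of that fingerprint idea (the file path and the S3 layout are hypothetical): hash the dataset and track the hash, GitOps-style, instead of the data itself.

```python
# Content-addressable fingerprint for a dataset file.
import hashlib

def dataset_fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# A pipeline config could then reference s3://bucket/datasets/<fingerprint>.
print(dataset_fingerprint("training-data.csv"))
```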

A: I mean, things like Snowflake would let you do things like that; there are certain solutions. So, you mentioned test data and training data. Often you can take a set of data and slice off a bit for validation and a bit for test, and you do that randomly: you take a set, randomly pull out test, randomly pull out validation, whatever percentage ratio you want. You're saying, for it to be reproducible, that split...
A
Those
splits
should
be
kept
separate
as
well,
because
you
want
to
you
wanted
to
be
deterministic,
because
if
it
was,
if
you
were
just
randomly
picking
different
times
each
time,
even
though
the
whole
set
of
data
was
constant,
you
would
get
slightly
different
results
because
you'd
be
randomly
picking
a
different
test
subset
each
time.
So
would
you
have
to
keep
that
test
a
bit
like
stored
somewhere
as
well?
The.
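
One common answer, sketched minimally here assuming scikit-learn: fix and record the seed, so the same data always yields the same test subset and the split does not have to be stored separately to be reproduced.

```python
# Deterministic train/test split via a recorded seed.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # record random_state alongside the data hash
)
```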

E: The easiest way to understand this one is to think backwards from a real-world scenario. Typically your machine learning model is going to be a decision-making system operating in the real world. So if it's a control system for an autonomous vehicle, then if something has gone wrong, potentially it's killed someone.
E
So
you
need
very
clear
compliance
reasons.
You
will
always
need
to
be
able
to
have
an
audit
trail
that
goes
from
the
finished
model
backwards
to
the
source
data
and,
in
many
cases,
you'll
be
required
to
to
implement
some
some
level
of
visibility
on
on
the
why
certain
decisions
were
made
by
the
model
under
certain
circumstances.
E
But
you
also
have
the
scenario
that
you
know
in
the
event
of
a
a
serious
failing
in
the
system.
You,
you
will
need
to
retrain
the
model
and
then
do
regression
testing
to
demonstrate
that
the
model
behaves
the
new
model
behaves
differently
under
those
circumstances.
So
you
all
need
to
be
able
to
reproduce
a
set
of
conditions
and
and
check
back
against
potentially
earlier
training
sets
to
to
verify
the
behavior
and.

E: This is the big risk today: most of these things are effectively uncontrolled, because the training scripts themselves are being built in environments that you can't properly control, the data is completely uncontrolled, and in many cases there's not even a record of what training data was used to train which version of a model. So it's going to be very difficult, in an increasingly regulated environment, to meet the requirements of regulatory compliance unless we can provide pipeline systems that facilitate doing all of this automatically.
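
The missing record described here could be as simple as a manifest written by the pipeline at training time; a minimal sketch, with all field names hypothetical:

```python
# A manifest tying a model version back to its inputs, for the audit trail.
import json
import time

manifest = {
    "model_version": "fraud-classifier-v7",
    "trained_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "training_data_fingerprint": "sha256:<dataset hash>",  # from the versioned store
    "split_seed": 42,
    "hyperparameters": {"learning_rate": 0.01, "epochs": 20},
    "training_code_commit": "<git commit hash>",
}

with open("model-manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```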

D: ...serializing everything and sending it, and then we analyze the model and we deploy it. Well, it has to be deployed, so we need to run some checks, some of them beyond the testing already done during training, to make sure that the new environment is going to behave exactly like the testing one. What are your thoughts on that side?

A: To me, that's where things are fairly familiar. Imagine you had some third-party library you were using that updated once a week or once a month; there's some new feature for checking credit card numbers for fraud or something, and it's nothing to do with machine learning. Even though you're taking this binary and you trust its provenance and all that sort of stuff, you would still have your own suite of tests, regression tests or integration tests, whatever you want to call them, acceptance tests.

A: So at this point, the model that's being deployed is an artifact like anything else; it's not really special in that regard. It might be infinitely more complex in how it can misbehave, so yeah, there's more need to test and continuously monitor it. But in the simple case that sort of degrades to: it's just a piece of software that you're putting out there.

A: If it's, for example, doing anomaly detection or some sort of reinforcement learning, its behavior will change, and I suppose then you've got more: you want even deeper, scenario-based testing or regression testing, so that you know the bad things that once happened don't happen again.
A
Mean
that's.
That's
if
you
written
a
bunch
of
code-
and
you
know
handcraft
decision,
trees
or
use
the
rule
system,
you
can
come
across
the
same
thing.
It's
just
a
human
mystics
or
like
your
assumptions,
change
and
yeah
I!
Guess:
he'd!
Be
you
more?
What
you're
saying
is
you're
more
likely
to
encounter
this
stuff
with
sophisticated
models,
they're
more
sensitive
to
their
environment?
Maybe
yeah.

E: Well, this is the challenge: if you're explicitly coding something, you know what your assumptions are, and therefore you know when there's change. But when you're training a model, you don't actually know what features it's detecting; you just know that it's giving you a result that you were expecting from the data that you're measuring on the output side. So you don't actually know explicitly what is influencing the decision that the model is making. A classic example of that is: there were models trained to recognize...

A: You're getting a little leakage there, accidental leakage. I've been reading about some more interesting ensemble models that would help with things like that. Ensemble meaning there'd be that model trained, and there'd be other unrelated models, perhaps on different sets of data, maybe even off-the-shelf things, that would go: I recognize that as an x-ray, I recognize that as a ruler; and that itself becomes a feature that feeds into something else. So I think that's where the state of the art is moving next.
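
A minimal sketch of that kind of ensembling, assuming scikit-learn's stacking (the component detectors are stand-ins, not from the meeting): the unrelated models' outputs become features for a final model.

```python
# Stacking: auxiliary detectors feed a final model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("detector_a", RandomForestClassifier(random_state=0)),
        ("detector_b", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # consumes the detectors' outputs as features
)
stack.fit(X, y)
```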
A
There's
lots
of
amusing
examples
like
that
and
yeah
I'm
sure,
there's
more
or
maybe
apocryphal
stories
as
well?
Okay,
so
one
piece
of
feedback
from
last
week
that
someone
put
it
in
the
document
I'll
paste,
the
link
was
Google's.
Take
on
ml
ops
for
those
that
are
interested
I
had
a
bit
of
a
flick
through
of
it.
It
seemed
to
line
up
with
some
of
the
things
we've
been
talking
about.

A: The main thing I found of interest was the diagram near the top that shows how much work there is around data collection, feature engineering, monitoring and serving, while the actual machine learning code is a tiny little square in the middle. I thought that's a really nice diagram, and I showed everyone.

A: Anyway, that's Google's take on things, and they classify things from level zero, which is basically manual clicky-clicky notebooks, through to, I think, two, which is using CI/CD pipeline automation. So they have sort of three levels of maturity. I thought that was an interesting thing to note.
A
Another
thing,
a
minor
thing:
Jerry
was
there
was
a
ticket
on
the
chickens,
X
testing
I
just
wanted
to
ask
like
for
doing
like
acceptance,
tests
of
the
ml
quickstarts
that
would
result
in
a
web
app
running
right,
like
there'd,
be
a
web
app.
That
does
a
decision
like
if
you
start
a
given
email.
Quickstart,
you
end
it.
You
end
up
with
a
one
app
running,
but
to
two
pipelines.
Is
that
right.

A: There was an issue someone opened, which I was going to include in the document, with a bunch of interesting links to articles. I thought we could have a section in the roadmap, or the readme, of just interesting resources. And something else: I was asked by Tracy from the CDF whether we'd thought about doing some blogging on ML specifically for the CDF, which has a rapidly growing audience of users, sort of introducing the SIG and the concepts.

A: Of course, Saturday mornings are a shitshow, so... I was going to look through some of those other technology requirements, some of the things that I've come across. But one of the ones I thought that, if you had a chance, Terry, it would be good to have a crack at, is the implications of privacy and GDPR. I have no idea about any of that.
A
That's
if
you
had
anything,
looks
on
that'd
be
great
to
get
him
in
there.
Cuz
there's
a
whole
lot
as
I'm
still
learning
all
this
stuff
myself,
but
as
I
do
it
I
realized
that
I'm
getting
my
hands
on
all
sorts
of
data
and
flinging
them
here
and
storing
it
there
and
training
this
thing
here
and
and
like
how
does
that
relate
those
datasets
that
train
the
model?
How
do
they
relate
they've
got
PII
and
is
the
model
parameters
end
up?
Having
PII
and
them
I
don't
know
it's
theirs.
That's
a
big
open
question.
You.

A: There was an interesting talk, it might have been at a Linux Foundation event, about the IP of trained models. There's lots of work and prior art, and even legal test cases, on things like the GPL and the Apache License: what they mean, who owns things, copyright. But if a model is a binary that's compiled from data, who owns it?
A
How
does
that
copyright
leave
if
that
model
can
exist
without
that
data
input,
as
well
as
the
hyper
parameters
that
were
chosen
by
the
human
and
the
algorithms
in
all
of
the
feature?
Engineering
choose
a
human
thing,
partly,
but
there's
some
I
don't
know.
If
I
chose
in
the
technical
challenges
question
like
IP
management,
that's
probably
many
more
than
Linux
foundations.
They
I
agree.
That's
probably
thinking
about
that.
I
thought
that
was
an
amusing
idea
issues.

A: Then it's easy enough to copy, and if it has descriptors, that'd be good, yes. One other thing that just popped into my mind, with you mentioning before the sort of qualitative and quantitative differences stuff. One thing that's become apparent to me using this stuff: in a software development workflow you'll be working on an algorithm or tuning something.

A: Typically the feedback loop is pretty quick and you tend to work on one thing at a time. I'm finding, when I'm training models to do things, it can take hours to crunch through. So sometimes I might have a few different ideas to try, and I'll maybe try five of them at once, let them all run on different sets of clusters of machines, and then come back and sort of pick the winner: okay, that worked, that didn't, and that...

A: In theory, the sort of git-based approach that Terry worked on with Jenkins X lets you do that, where each one of the experiments could be a branch to go off on, or maybe even a pull request, and then you just come back, look at your pull requests, decide which one wins, pick the winner and delete the rest. That doesn't quite have a parallel elsewhere, but yeah.
A
It's
something
that's
so
this
is
more
for
developers
than
Amin
machine
learning
for
data
scientists
is
completely
natural
for
them,
so
they
don't
need
to
them.
But
for
me
this
was
like
a
it's
like
this
stuff
takes
a
long
time.
If
I
do
this,
serially
I'm
gonna
be
spending
weeks
working
on,
it's
like
I
need
to
man,
and
that's
something
I
still
struggle
with
is
like
managing
the
different
ones.
A
It's
like
well
I'll,
take
this
and
this
because
you
got
to
tweak
the
training
data,
tweak
the
parameters,
the
hub
parameters
and
and
leave
a
note
that
you
did
something
so
in
theory
that
sort
of
branch
wears
with
no
sort
of
pour
requested.
Notes-
or
you
know
something
that's
trafficked
as
a
version
thing
with
everything
together.

E: The vision in that space is to not just facilitate you being able to run parallel experiments with different feature sets, but also to actually enable you to create evolutionary systems, where you can have your models compete against each other and influence, or write, their own parameters.

C: That's a really interesting problem, because I'm thinking about this scenario where it takes a week to train this model, and you're three hours out from the end of the week and the system crashes. Okay, do we have to start this whole thing all over again, or is there some kind of break point that we can identify through that process?
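
A minimal, framework-agnostic sketch of such a break point, i.e. checkpointing (the names are hypothetical): persist training state periodically, so a crash resumes from the last checkpoint instead of restarting the whole run.

```python
# Periodic checkpointing so a crash loses at most one epoch of work.
import os
import pickle

CHECKPOINT = "train-state.pkl"
TOTAL_EPOCHS = 100

if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT, "rb") as f:
        state = pickle.load(f)             # resume from the last saved epoch
else:
    state = {"epoch": 0, "weights": None}  # fresh start

for epoch in range(state["epoch"], TOTAL_EPOCHS):
    # ... one epoch of training would update state["weights"] here ...
    state["epoch"] = epoch + 1
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)              # crash after this point resumes here
```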

C: Yeah, and it seems like that's what we're trying to inject: linear processes, or at least a linear overlay, onto a fundamentally nonlinear process, to enable us to kind of retroactively build up the state of the world at a particular point, to avoid building that state of the world from the raw data again. And then from that point, hopefully, we can just do the last day's worth of training rather than the last four days.

E: If what you're doing is hill climbing, then the point at which you start is actually going to influence the final outcome drastically, so you may get stuck in a local maximum that's different from the result you'd get from starting at a slightly different point. So there are many situations in the fundamental modelling where, depending on how you approach the problem, you may not actually have a repeatable solution: in many cases you can run the same training with the same training set...

B: As far as I can see, isn't this about capturing all the parameters and making it repeatable, rather than about being able to snapshot the process part way through and therefore resume from three days into the four-day training? I know they're both important and relevant problems, but I don't quite understand how they're linked.

E: ...and small differences get multiplied into a very different result across two runs. You have to consider that in a lot of cases you're doing parallel processing across multiple GPUs, then combining the results and doing more processing, so the sequence in which the combination happens matters.
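
A minimal sketch of why combination order matters: floating-point addition is not associative, so a parallel reduction that combines partial results in a different order can change the answer.

```python
# Floating-point addition depends on grouping/order.
import math

print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

values = [0.1] * 10
print(sum(values))        # 0.9999999999999999 (left-to-right)
print(math.fsum(values))  # 1.0 (exactly rounded)
```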

A: And then I've got some more things to fill out there, and there might be some more technical challenges to fill in around checkpointing. Having said that, I think there's plenty of ground to cover. Like Terry said at the start, there are a lot of companies and enterprises doing interesting stuff with machine learning that aren't at this extreme end; the data sets I'm dealing with are, you know, under a gigabyte, and it still takes hours to train a thing.
A
That's
not
big
data
by
any
stretch
of
the
imagination
and
there's
a
lot
of
value
in
that.
So
there's
still
a
lot
of
good
stuff
to
be
done
there,
but
it's
worthwhile
thinking
about
these
sort
of
extreme
cases,
cuz.
It's
fast
moving
field
and
fast
bang
and
yeah
yeah.
Well,
thanks!
Everyone.
Thank
you.