GitLab GitLab Advanced Software Engineering Course, 9 Dec 2021

From YouTube: Machine Learning Engineering in less than 30 minutes

Description

Machine Learning Engineer is a recent role that is getting more and more traction. Join us as we try to summarize what are the skills and responsibilities of a Machine Learning Engineer in less than 30 minutes.

A

So I heard you're curious about what machine learning engineering is, so today I'll try to cover this topic in 30 minutes or less. We'll give an overview of what machine learning is, what challenges you can face with machine learning, what machine learning engineering is, and what skills are necessary for a successful machine learning engineer. My name is Eduardo Mune, and I'm a full stack developer here at GitLab in the DevOps area.

A

So, first of all, what is machine learning? That is the main question we have to answer before understanding what machine learning engineering is in the first place.

A

Machine learning is a way to instruct a machine to learn how to make estimations, predictions, or recommendations based on past data. Data here can be images, text, events, clicks: anything that you can convert into a format that the machine can understand.

A

I like to say that machine learning is basically an automated informed guesser. It tries to guess something, and the better the data it was shown before, the more informed and the better its guesses will be.

A

So if you replace "machine learning" with "informed guesser" in the rest of this conversation, things will make a lot more sense. With this, let's give an example. Suppose that I have this dog and I want to create an algorithm, or a way, to detect, not only for this dog but for any image, whether the image is a cat or a dog.

A

So what am I going to do for this? I'm going to need a machine learning model, which will make this prediction for me. To create a machine learning model, the first thing I need is data, so I have to collect a lot of data on dogs and cats to show the machine, so that it can learn the difference between the two.

A

Once I have all this data, I can apply a machine learning algorithm on top of it. That will create a machine learning model that, given an image, gives a probability of whether it's a dog or a cat: for example, "this is 70 percent a dog". That was a very quick, brief explanation of machine learning, of course. There are also some terms that might get you confused, for example machine learning and artificial intelligence.
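
To make the cat-or-dog example concrete, here is a minimal sketch that was not shown in the talk. It assumes each image has already been reduced to two made-up features (snout length and ear pointiness); the data, the feature names, and the choice of logistic regression trained with stochastic gradient descent are all illustrative assumptions. The point is only the shape of the workflow: data goes in, an algorithm produces a model, and the model returns a probability.

```python
import math

# Hypothetical toy data: each "image" is reduced to two made-up features
# (snout length, ear pointiness), both scaled to 0..1. Label 1 = dog, 0 = cat.
TRAINING_DATA = [
    ((0.90, 0.2), 1), ((0.80, 0.3), 1), ((0.70, 0.1), 1), ((0.85, 0.4), 1),
    ((0.20, 0.9), 0), ((0.30, 0.8), 0), ((0.10, 0.7), 0), ((0.25, 0.6), 0),
]


def sigmoid(z):
    """Squash any real number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-z))


def train(data, epochs=2000, learning_rate=0.5):
    """Fit logistic regression weights with stochastic gradient descent."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for features, label in data:
            z = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(z) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


def predict_dog_probability(model, features):
    """Return the model's estimated probability that the input is a dog."""
    weights, bias = model
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


model = train(TRAINING_DATA)
print(f"P(dog) = {predict_dog_probability(model, (0.75, 0.25)):.2f}")
```

A real image model would learn from pixels rather than two hand-picked numbers, but the "informed guess" it returns is the same kind of probability.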

A

Artificial intelligence is a superset of machine learning, so machine learning is part of artificial intelligence. Artificial intelligence is the study of how to reproduce human intelligence on a machine, and there are many ways to do that. Machine learning, specifically, is how to do this by looking at past data or past experiences. The second one that you see a lot is the difference between machine learning and statistics, and the reality is that no one really knows the difference.

A

They are really just two names: they use the same tooling and solve very much the same problems. There are subtle differences that people sometimes point to: statistics is more academic and theoretical, whereas machine learning is more practical and empirical. But in the end they are just the same thing. Machine learning is even sometimes called statistical learning, so don't bother with these discussions.

A

So now that you have some idea of what machine learning is, very briefly, of course, what are some of the challenges you can face in the process once you are trying to solve a problem with machine learning? The first one is data.

A

A model, like any machine learning product, is only as good as the data that is fed into it, the data it is trained on. You cannot create a good model from bad data. The data must have high quality, meaning it must capture all possible events and all possible classes. It must also be fresh: if the data is too old, the machine learning model will not be able to make good predictions.

A

You also need to give access to this data somehow, somewhere, so there is the infrastructure around collecting the data, providing the data, and labeling the data. Getting the data right is one of the major challenges with machine learning.

A

The second one, after you have the data, is model creation. Here the biggest challenge in all machine learning development is to know what needs to be created in the first place. How can machine learning answer a need your business has? How can we create a profitable product that really impacts users, or that really solves a problem that users or businesses are having?

A

Then, once you know what problems need to be solved, you have, for example, data processing.

A

How do you pick up the data? How do you pre-process it? How can you maintain the data? How can you version the data? Then you have the problem of which machine learning algorithm you're going to apply. There are many, many algorithms out there, and each one solves a specific problem. Choosing the correct one might make a lot of difference, and it's not just about choosing the algorithm that brings the best results.

A

There are a lot of constraints around, for example, processing power. How much time does it take to train the model? How much time does it take to tune the hyperparameters? And then there are also metrics. A machine learning model is created to solve a problem.

A

You only know if you're actually solving that problem if you have metrics to measure it. For example, if I want to create a recommender system like Netflix's, the recommender system exists specifically to solve a user need, which is "find stuff I like". But it needs a metric that the model can translate and understand. For Netflix, it could be the number of hours watched per user, or the time spent browsing: you want to help users spend less time browsing the catalog and more time watching.
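
As a rough illustration of turning a user need into numbers, the sketch below computes the two metrics mentioned above from a made-up session log. The log format, the user names, and the function names are assumptions for illustration only, not how any real streaming service measures this.

```python
from collections import defaultdict

# Hypothetical session log: (user_id, minutes_browsing, minutes_watching)
SESSIONS = [
    ("ana", 10, 50),
    ("ana", 5, 115),
    ("bob", 20, 40),
]


def hours_watched_per_user(sessions):
    """Total hours watched, keyed by user."""
    totals = defaultdict(float)
    for user_id, _browsing, watching in sessions:
        totals[user_id] += watching / 60
    return dict(totals)


def browsing_share(sessions):
    """Fraction of all session time spent browsing instead of watching."""
    browsing = sum(s[1] for s in sessions)
    watching = sum(s[2] for s in sessions)
    return browsing / (browsing + watching)


print(hours_watched_per_user(SESSIONS))
print(f"{browsing_share(SESSIONS):.1%} of session time was spent browsing")
```

A better recommender should push the first number up and the second one down, which gives the team something concrete to compare before and after a model change.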

A

And then, once you have created the algorithm, the next challenge is model deployment. There's a huge difference between creating a model and creating a model that goes into production and is used by users. You have to scale that model. It's not only one person, the model creator, who is accessing the results, the predictions, the guesses of that model: it's thousands, millions, billions of users who might be using that output, so scaling is really complicated.

A

How do you update your models? How do you version your models? How do you monitor them? How do you know they are doing the right thing?

A

How can you make deployment easy, without a lot of errors along the way? So this is another entire set of challenges in the development of machine learning. Another one that we are getting more and more worried about is the privacy and ethics involved with machine learning. Privacy means: what are you going to do with the data? Sometimes you can extract information that was used to train the model from its results, and that is very complicated. How does privacy play into it?

A

Then there is ethics: how is this model used? Even models created with really good intentions might backfire and create societal changes. For example, machine learning models replicate the biases that exist within the data, so if you feed in biased data, you get biased predictions.

A

For example, in a police use case for facial recognition, if the model uses only pictures of people who were already arrested, it might create a bias against minorities. That is very complex, and there is no trivial solution. It's a discussion that has been getting a lot of attention recently, and it's a very important one to have, so this is another thing that we need to worry about.

A

So this is just a very quick overview of some of the challenges, just to give an idea of the complexities within developing a machine learning model.

A

Now that you know this, we can explore a bit more what a machine learning engineer is.

A

Here we have the challenges, and what usually happens, at least in this division, is that you're going to have one profession specialized in each of these areas.

A

So, for example, on the data side you're going to have a data engineer, who is specialized in how to move data around, how to collect data, and how to label data. For model creation you're going to have the data scientist, a person who knows how to create models, how to answer business needs, how to talk to the business, which machine learning algorithms are best to use, how to create metrics, and so on and so forth. And on the last one, model deployment, that's where the machine learning engineer comes in.

A

So it is a software engineer who is specialized in deploying machine learning models. That is what a machine learning engineer is. And note that privacy and ethics are equally part of everyone's job.

A

Note that this doesn't mean that a machine learning engineer will never work with model creation or data. Quite the contrary.

A

It's quite common for a machine learning engineer to go and create a simple model, and it's quite common for the machine learning engineer to have to create the data pipelines. It's just that that's not the focus. The machine learning engineer is about deploying those machine learning models. Even if they create the model, that's not where they will spend most of their time.

A

They will just create a model, and that will be 10% of the work, and then they will try to dedicate as much time as possible to deploying that model. Same thing with data. And that is also valid for the data scientist and the data engineer: sometimes the data scientist needs to work on deploying the model. But usually, since the machine learning engineer is specialized in this, their solutions will be more scalable, more maintainable, and so on and so forth.

A

So this is what I said before: a machine learning engineer is a software engineer specialized in putting machine learning models into production. I would like to emphasize here that it is a specialization of software engineering. The work of a machine learning engineer is the work of a software engineer; it's just that they work in this niche, which is machine learning.

A

They will deploy systems, create systems, scale systems, and design systems that are specialized in putting machine learning models into production. Now, to cover those areas, what are the skills necessary for a successful machine learning engineer? That's another important question.

A

Going back to the slide that we showed before, these are some of the main problems that a machine learning engineer will face. So what are the skills necessary to tackle those problems? First of all, the first part, software engineering: this is the core of machine learning engineering.

A

This is where a machine learning engineer needs to spend most of their time studying. Of course, these are not mandatory. First of all, Python. You can create models in any language, but Python became almost the lingua franca of machine learning.

A

You have R, and you have other languages that are used for deployment, but Python is the one that is used everywhere. So if there is one language to choose, choose Python. Second is how to write maintainable code: testing and code modularization. Something that happens a lot with machine learning engineers is having to move code from Jupyter notebooks into pure Python.

A

For example, it is very common for data scientists, or even for machine learning engineers, to create models in Jupyter notebooks, but Jupyter notebooks don't play well with production environments, so the machine learning engineer often has to move code from Jupyter notebooks into pure Python. Then there is system design: how to design scalable software. Like I said, creating a small website that runs on your laptop is different from creating a website that runs for millions of users. It's the same thing here: creating a machine learning model that runs for you, one user, is very different from creating a scalable pipeline where millions of users can use that model.
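
Coming back to the notebook point, the sketch below shows what moving code out of a notebook can look like in the smallest possible case. The notebook cell in the comment and the function names are hypothetical; the idea is only that top-to-bottom cell code becomes small named functions that can be imported and tested.

```python
# Notebook style (hypothetical cell): everything runs top to bottom on globals.
#   raw = [3.0, None, 5.0]
#   clean = [x for x in raw if x is not None]
#   average = sum(clean) / len(clean)

# Module style: the same logic as small functions with explicit inputs.


def drop_missing(values):
    """Return the values with None entries removed."""
    return [v for v in values if v is not None]


def mean(values):
    """Arithmetic mean; raises ValueError on empty input."""
    if not values:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)


def test_mean_ignores_missing_values():
    """A test that a runner such as pytest could pick up."""
    assert mean(drop_missing([3.0, None, 5.0])) == 4.0


if __name__ == "__main__":
    test_mean_ignores_missing_values()
    print("ok")
```

Once the logic lives in functions like these, the same module can be imported by a training script, a prediction service, and the test suite.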

A

It's also really important to understand cloud computing. This is part of system design: what are the concepts that are common across the three big vendors, how to set up a small system, how to put a website online, things like that. It's very important to be familiar with these. Now the second area. You are a machine learning engineer, so even though it is a specialization of software engineering, it is important to know machine learning.

A

First of all, it might sound weird, but it's really important to know linear algebra. Computers understand data through matrices, through numbers and vectors, and most machine learning implementations are matrix operations in one way or another: they are just multiplying large matrices. An image, for example, is just a matrix of numbers between 0 and 255.
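
A small sketch of that idea, using plain Python lists so that nothing is hidden: a made-up 3x3 grayscale image is flattened into a vector and multiplied by a weight matrix. The pixel values and the weights are invented for illustration; in practice a numerical library would do this on far larger matrices.

```python
# A hypothetical 3x3 grayscale "image": each pixel is an integer from 0 to 255.
image = [
    [0, 128, 255],
    [64, 128, 192],
    [255, 128, 0],
]


def flatten(matrix):
    """Turn a list of rows into one flat list of values."""
    return [value for row in matrix for value in row]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Scale pixels to the range 0..1, a common preprocessing step.
pixels = [p / 255 for p in flatten(image)]

# Made-up weights for a layer with 9 inputs and 2 outputs.
weights = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0],  # picks out the top-left pixel
    [1 / 9] * 9,                  # averages all nine pixels
]

outputs = matvec(weights, pixels)
print(outputs)
```

The second output is just the mean brightness of the image, which shows how a row of weights can encode a simple "feature" as a dot product.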

A

You use these representations to feed machine learning models, so understanding linear algebra and working well with it is quite important when you're implementing new machine learning models. Another one is data management. By data management I mean how to clean up the data, how to select the correct features for the model, and how to do feature enhancement. This is a bit more towards model creation.

A

Most of the things here in the machine learning area are a little bit more about model creation, but you need to understand them in order to implement them in your final system. You cannot just clean the data in a Jupyter notebook, leave the cleaning there, and send the model off. No, that data needs to be cleaned in production as well. So you need to understand the algorithms that are used and how to implement them.
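
One common way to make sure production data is cleaned exactly like training data is to put the cleaning in a single function that both code paths call. The sketch below assumes made-up fields (`age`, `country`), made-up defaults, and a toy model; none of it comes from the talk.

```python
def clean_record(record):
    """Single source of truth for cleaning, shared by training and serving."""
    age = record.get("age")
    country = record.get("country") or "unknown"
    return {
        "age": float(age) if age is not None else 0.0,  # hypothetical default
        "country": country.strip().lower(),
    }


def build_training_rows(raw_records):
    """Training path: clean a whole batch of historical records."""
    return [clean_record(r) for r in raw_records]


def handle_prediction_request(raw_record, model):
    """Serving path: clean one incoming record, then ask the model."""
    return model(clean_record(raw_record))


def toy_model(row):
    """Stand-in for a trained model: flags adults."""
    return 1 if row["age"] >= 18 else 0


print(build_training_rows([{"age": "42", "country": " BR "}, {}]))
print(handle_prediction_request({"age": "17", "country": "PT"}, toy_model))
```

If the cleaning rules change, they change in one place, and the model never sees data in production that was prepared differently from what it was trained on.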

A

It's also important to understand the different tools that are used, for example TensorFlow, scikit-learn, and PyTorch. Each one of them has its own peculiarities and its own use cases, and each business need might ask for a different implementation.

A

On top of understanding the tools, it's important to have knowledge of the different algorithms that can be used. For example, we have neural networks, decision trees, XGBoost, and many, many others. Each one of them has its pluses and minuses. Sometimes it's a very simple algorithm, but it runs really fast.

A

Sometimes it's a very, very good algorithm, but it depends on a lot of data. So choosing the right algorithm will depend on the use case and on the restrictions around it. Evaluating machine learning models is also very important. How do you know that your machine learning model is doing the right thing? You created a machine learning model, you deployed it, and then what? Is it working as expected? Is it bringing value, or is it decreasing value?
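
One simple, hedged way to approach these questions is to score each candidate model on labeled examples it did not see during training. The sketch below uses invented held-out data and two invented threshold "models", with accuracy as the metric; real evaluations usually need more examples and more than one metric.

```python
# Hypothetical held-out examples: (feature, true_label)
HOLDOUT = [(0.1, 0), (0.4, 0), (0.45, 1), (0.6, 1), (0.8, 1), (0.3, 0)]


def model_a(x):
    """Made-up model: predicts 1 when the feature is at least 0.5."""
    return 1 if x >= 0.5 else 0


def model_b(x):
    """Made-up model: same idea with a lower threshold."""
    return 1 if x >= 0.42 else 0


def accuracy(model, examples):
    """Fraction of examples where the prediction matches the label."""
    correct = sum(1 for x, label in examples if model(x) == label)
    return correct / len(examples)


for name, candidate in [("model_a", model_a), ("model_b", model_b)]:
    print(name, round(accuracy(candidate, HOLDOUT), 3))
```

On this toy data `model_a` gets 5 of 6 examples right and `model_b` gets all 6, so the comparison between the two becomes a number instead of an opinion.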

A

How can we know if one machine learning model is better than another? How can we understand the long-term impacts of that machine learning model? And then finally, ethics and privacy. I already mentioned this a bit, but it's really hard to find a book or something on this topic.

A

It's hard to study this, but it's really important to keep up with the news and to be aware of how models created with the best intentions can actually cause really bad changes that were unexpected beforehand.

A

The next area of skills is around MLOps. DevOps is about making it easier to deploy applications, and similarly MLOps is about making it easier to deploy machine learning models.

A

So: how to automate the deployment, how to create an architecture where we can quickly ship a model into production, and what patterns the community uses, for example pipelines, model registry, feature store, prediction service, and so on and so forth. I'm not going to explain any of them; it's not in the scope of this presentation.

A

My goal is actually just to drop the names so that, if you're interested, you can search online later on. I gave some examples here. Note that I don't endorse any of these tools, except, of course, GitLab. They're just examples that are out there; I don't know if they're good or not, and I have no interest in promoting them.

A

And yeah, so where do you get these skills?

A

I said a lot of words so far, but how can we go about acquiring this knowledge? I added here some books that I recommend. This Machine Learning Engineering book I really, really enjoyed. It's a short book.

A

It doesn't go into too much depth on each topic, but it looks at machine learning from the software engineering perspective. Most books that you're going to find are actually looking at machine learning from an academic perspective or from the model creation perspective. This one, not so much: it is about the engineering problems and the engineering issues around it.

A

And I wouldn't spend too much time studying algorithms or going too deep into them. As a machine learning engineer, of course you must know them, and knowing more and more about machine learning will always help.

A

But over time you don't gain as much from going too deep into the machine learning algorithms themselves. It's better to understand the applications and the downsides of each one than to understand each of them in real depth. So focus on the concepts.

A

So instead of trying to understand all of the different model registries, understand why a model registry is necessary. Why is it there? What problem does it solve? Focus on that rather than on the implementation. You have MLflow, you have Comet, you have Seldon, you have many model registries, but they all solve the same specific problem, and it's important to understand what problem they are solving.
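
To illustrate the concept rather than any particular product, here is a toy in-memory registry. It is not the API of MLflow, Comet, or Seldon; the class and method names are invented. It shows the problem a registry addresses: keeping every version of a model together with its metrics, and recording which version production should serve, so that promoting or rolling back is a single operation.

```python
class ModelRegistry:
    """Toy in-memory registry; real ones persist artifacts and metadata."""

    def __init__(self):
        self._versions = {}    # model name -> list of (model, metrics)
        self._production = {}  # model name -> version number in production

    def register(self, name, model, metrics):
        """Store a new version and return its number (starting at 1)."""
        versions = self._versions.setdefault(name, [])
        versions.append((model, metrics))
        return len(versions)

    def metrics(self, name, version):
        """Return the metrics recorded for one version."""
        return self._versions[name][version - 1][1]

    def promote(self, name, version):
        """Mark one existing version as the one production should serve."""
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise KeyError(f"unknown version {version} of {name!r}")
        self._production[name] = version

    def production_model(self, name):
        """Return the model currently marked for production."""
        version = self._production[name]
        return self._versions[name][version - 1][0]


registry = ModelRegistry()
v1 = registry.register("cat-dog", lambda image: "cat", {"accuracy": 0.70})
v2 = registry.register("cat-dog", lambda image: "dog", {"accuracy": 0.85})
registry.promote("cat-dog", v2)
print(registry.production_model("cat-dog")(None))
registry.promote("cat-dog", v1)  # rolling back is just another promote
```

The prediction service only ever asks for "the production model", so swapping versions does not require redeploying the service itself.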

A

And that's it. This was very short; I tried to give a quick overview, just a taste of what machine learning engineering is. My goal is just to make you curious. This was supposed to be one way of dropping the words so that, if you're interested, you have a starting point for what to look for when you are trying to increase your skills in this area.

A

So thank you for joining and have a great day.