From YouTube: Keynote: Overcoming Dataset Bias in Machine Learning, Kate Saenko (Boston University / MIT-IBM Watson AI Lab)
Description
Keynote: Overcoming Dataset Bias in Machine Learning
Kate Saenko, Boston University / MIT-IBM Watson AI Lab
OpenShift Commons Gathering on Data Science
January 28, 2021
https://commons.openshift.org/gatherings/OpenShift_Commons_Gathering_on_Data_Science.html
To find out more about OpenShift Commons, please visit: https://commons.openshift.org
So I'm going to start by talking about the success of AI and computer vision. Computer vision is an AI technology that can analyze visual scenes. You can see here an example of it applied to detecting cars, buses, and pedestrians in images, and it's quite good and getting better. Here's an example of computer vision for object detection in a different scene. We can also train computer vision models to classify other objects, maybe even cartoon characters, and we have quite accurate models for face recognition and emotion recognition. A lot of this is becoming a product; we're seeing computer vision being used as a product.
So what do I mean by dataset bias? Well, suppose you're training a model to recognize pedestrians and you collect a dataset that looks something like this. You train your neural network and it seems to work really well on held-out test data from the same kind of data that you collected. Now you deploy your model as a product on a car, but this car is in New England, whereas your training data was collected in California.
So immediately you see a very different visual domain, with different weather conditions, like snow, that you didn't have in your training data, because there isn't much snow in California. Pedestrians also look different, because they're wearing heavy coats and so on. All of a sudden, the model that worked really well on the source data you trained it on doesn't work so well anymore. We call this problem dataset bias.
We also call it domain shift. The problem of dataset bias is essentially that the training data looks different from the test data you're actually faced with; it's different in terms of the distribution of the data.
That's the more general way of putting it, but you might qualify it, for example, as the difference between the city you trained on and the new city you're testing in. Or the dataset bias could be that you trained on images collected from the web, whereas at test time you're getting images from a robot, which look different: different backgrounds, different lighting, and different poses.
This can also happen across cultures. Let's say you're classifying weddings and you trained on Western weddings, from Western cultures. If at test time you get an image of a wedding from a different culture, your classifier will not generalize very well; it won't be able to recognize it. So there are lots of different ways that dataset bias can arise.
That's my point. Now let's look at what this actually means in terms of the accuracy of a machine learning model. Here is a very simple example on the very famous dataset called MNIST. Everyone knows what MNIST is: it's just ten handwritten digits. If we train on this dataset, we know that with modern deep learning we can get very high accuracy, more than 99%.
However, suppose we train on the same ten digit classes but our training data looks like this; this is the Street View House Numbers (SVHN) dataset. Now this model, tested on the MNIST dataset, achieves much lower performance, 67% accuracy, which is really bad for this problem. And by the way, this happens even when the dataset bias is not as extreme.
For example, if we train on the USPS digits, which to the human eye look quite similar to MNIST, the bias in the data still leads to a similar drop in performance. And if you're curious, if we swap and train on MNIST and test on USPS, we get similarly poor performance. So that's just an example of how dataset bias can affect accuracy, even in a simple case like digit classification.
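To make that kind of experiment concrete, here is a minimal sketch of a cross-dataset evaluation in PyTorch. The small CNN, the single training epoch, and the preprocessing are illustrative assumptions, not the exact setup behind the numbers quoted above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Put both datasets into a common 32x32 grayscale format (an assumption for this sketch).
tf = transforms.Compose([transforms.Grayscale(), transforms.Resize((32, 32)), transforms.ToTensor()])
svhn_train = DataLoader(datasets.SVHN("data", split="train", download=True, transform=tf),
                        batch_size=128, shuffle=True)
mnist_test = DataLoader(datasets.MNIST("data", train=False, download=True, transform=tf),
                        batch_size=256)

# A small digit classifier (placeholder architecture).
model = nn.Sequential(
    nn.Conv2d(1, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(64 * 5 * 5, 10),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Train on the source domain (SVHN) only.
for x, y in svhn_train:
    opt.zero_grad()
    nn.functional.cross_entropy(model(x), y).backward()
    opt.step()

# Evaluate on the target domain (MNIST): accuracy falls well below the same-domain number.
correct = total = 0
with torch.no_grad():
    for x, y in mnist_test:
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
print(f"SVHN -> MNIST accuracy: {correct / total:.2%}")
```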
Okay, now what about real-world implications of dataset bias? Have we seen this in the real world? Well, yes, I believe we have. One example that's quite famous now is face recognition, or gender classification, where some researchers have actually evaluated existing commercial systems, from Amazon, from IBM, from other companies: how well do they work, and what accuracy do they achieve on different demographics?
If we don't have pedestrians outside of crosswalks, why not just collect more data like that? Well, there are a few problems with that. The first is that some types of events just might be rare, like jaywalking pedestrians; they might be very rare events, and we don't necessarily want to force people to jaywalk so that we can collect more data.
That's one problem, but another really big problem is the cost of data collection. Imagine that we wanted to label images from cars; that's the example you see here, from the Berkeley BDD dataset. Labeling 1,000 pedestrians with the per-pixel segmentation labels you see here, where the labeler has to identify each pixel that belongs to a pedestrian, is quite expensive: it costs roughly a thousand dollars per 1,000 pedestrians.
And now, if you imagine the sheer variety of visual data that we would want to cover in our dataset, we want multiple poses, multiple genders, ages, races, clothing styles, and so on, and somewhere in there we want people riding bicycles, or not riding bicycles, or maybe riding tricycles. If you think about how many different factors of variation we would have to cover, this very quickly becomes untenable; it's just too expensive to collect labeled data that is balanced across all of these factors of variation.
So what actually causes the poor performance? You might be wondering about that as well: can't my deep learning algorithm just get better? Maybe I just need a better algorithm that will generalize and do better on test data. Well, there are a couple of problems caused by dataset bias that current models cannot handle.
The blue points and the red points are from these two digit domains, and you can see what happens when we visualize this data. We do this by extracting features from these images using the deep learning model that we trained, and then plotting them in a t-SNE visualization. This is what we get: you can see that, clearly, the distribution of the training (blue) points is very different from the distribution of the test points, and this is a theoretical problem.
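Here is a minimal sketch of the kind of visualization being described. It reuses the `model`, `svhn_train`, and `mnist_test` names from the earlier sketch, so all of those (and the t-SNE settings) are assumptions for illustration.

```python
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Everything except the final linear layer serves as the feature extractor.
feature_extractor = nn.Sequential(*list(model.children())[:-1])

def extract_features(loader, max_batches=10):
    """Collect penultimate-layer features for a subset of a dataset."""
    feats = []
    feature_extractor.eval()
    with torch.no_grad():
        for i, (x, _) in enumerate(loader):
            if i >= max_batches:
                break
            feats.append(feature_extractor(x).numpy())
    return np.concatenate(feats)

src = extract_features(svhn_train)   # source features (the "blue points")
tgt = extract_features(mnist_test)   # target features (the "red points")

# Embed both sets jointly in 2-D and plot them to see the domain gap.
emb = TSNE(n_components=2, init="pca").fit_transform(np.vstack([src, tgt]))
plt.scatter(emb[:len(src), 0], emb[:len(src), 1], s=4, c="tab:blue", label="source (train)")
plt.scatter(emb[len(src):, 0], emb[len(src):, 1], s=4, c="tab:red", label="target (test)")
plt.legend()
plt.title("Feature distributions before adaptation")
plt.show()
```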
In fact, when these distributions are different, we can show that there is a theoretical bound on how well our model will generalize. Another problem is that a model trained on the blue points is not as discriminative: the features it learned are not as discriminative for the target (red) domain. You can see that because the blue points are much better clustered into different categories than the red points, so you may simply not be learning good features for these test points, the target domain. Fortunately, there are quite a few techniques that we can use to alleviate this.
I've listed a bunch here. What I want to talk about today is the technique of domain adaptation, but there's always data augmentation, and there's always using something like batch normalization; some of these techniques can help in the case of dataset bias. But let's talk about domain adaptation. In domain adaptation, we design a new machine learning approach that tries to adapt the knowledge from the labeled source data to the unlabeled target domain. So our goal here is to learn a classifier that achieves a low expected loss under the target distribution.
And importantly, here we assume that we have a lot of labeled data in the source domain, but we also get to see unlabeled data from our target domain. We just don't get to see the labels, because labels are expensive to collect. So we assume that we at least get to see some unlabeled data from the target domain. So what can we do?
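Written out, the unsupervised domain adaptation setup described here looks like the following (a standard formulation of the problem, not tied to any particular paper):

```latex
% Labeled source sample, unlabeled target sample, and the target risk we want to be low.
\[
  S = \{(x_i^{s},\, y_i^{s})\}_{i=1}^{n_s} \sim \mathcal{D}_S, \qquad
  T = \{x_j^{t}\}_{j=1}^{n_t} \sim \mathcal{D}_T \ \text{(no labels)},
\]
\[
  \min_{\theta}\; \epsilon_T(f_\theta)
  \;=\; \min_{\theta}\; \mathbb{E}_{(x,y)\sim\mathcal{D}_T}\big[\ell(f_\theta(x),\, y)\big].
\]
```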
We always use convolutional networks, and we have some training data with labels. If we train this using the regular classifier loss, we can generate features from our encoder CNN (here I'm just showing it for two classes, for clarity), and the last layer will be our classifier layer, so we can visualize the decision boundary that it learns between one class and the other class.
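As a sketch of that setup, with an assumed architecture and assumed names: the network is split into an encoder CNN that produces features and a final classifier layer, trained with the ordinary cross-entropy loss on labeled source batches.

```python
import torch
import torch.nn as nn

# Encoder CNN: maps an image to a feature vector (placeholder architecture for 32x32 input).
encoder = nn.Sequential(
    nn.Conv2d(1, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
)
# Last layer: the classifier whose decision boundary separates the classes.
classifier = nn.Linear(64 * 5 * 5, 10)

opt = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()), lr=1e-3)

def source_step(x_s, y_s):
    """One supervised training step on a labeled source batch."""
    opt.zero_grad()
    loss = nn.functional.cross_entropy(classifier(encoder(x_s)), y_s)
    loss.backward()
    opt.step()
    return loss.item()
```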
Now suppose we also get to see some unlabeled data from our target domain. Let's say we put a camera on the robot and it can explore its environment and snap some photos; now it has some data, it's just not labeled. If we apply the encoder CNN trained on the source directly to this data, we already know that we'll see a dataset shift like this: the distribution of the target points will be shifted with respect to the distribution of the source (blue) points. And so we turn to adversarial domain alignment.
Essentially, we then iterate between the domain discriminator trying to separate the two distributions and, in the next step, updating the encoder in such a way that it fools the discriminator, so the discriminator's accuracy goes down. In the process, the encoder learns to align the two distributions, so that if everything goes well, the discriminator can no longer tell the difference between the domains and the features have become essentially domain-invariant. So that's adversarial alignment, and here's an example of it working on the two digit domains that I showed you earlier.
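Here is a minimal sketch of that alternating game, roughly in the spirit of adversarial methods such as DANN or ADDA rather than any one specific paper. It reuses the `encoder`, `classifier`, and `opt` from the previous sketch, and the discriminator architecture and loss weights are assumptions.

```python
import torch
import torch.nn as nn

# Domain discriminator: predicts whether a feature vector came from the source (1) or target (0).
discriminator = nn.Sequential(nn.Linear(64 * 5 * 5, 256), nn.ReLU(), nn.Linear(256, 1))
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.functional.binary_cross_entropy_with_logits

def adaptation_step(x_s, y_s, x_t):
    # Step 1: train the discriminator to separate source features from target features.
    with torch.no_grad():
        f_s, f_t = encoder(x_s), encoder(x_t)
    d_loss = bce(discriminator(f_s), torch.ones(len(f_s), 1)) + \
             bce(discriminator(f_t), torch.zeros(len(f_t), 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Step 2: update the encoder (and classifier) to stay accurate on labeled source data
    # while fooling the discriminator on target features (labeling them as "source").
    # Only `opt` steps here, so the discriminator's own weights are not changed in this step.
    cls_loss = nn.functional.cross_entropy(classifier(encoder(x_s)), y_s)
    fool_loss = bce(discriminator(encoder(x_t)), torch.ones(len(x_t), 1))
    opt.zero_grad()
    (cls_loss + fool_loss).backward()
    opt.step()
    return d_loss.item(), cls_loss.item()
```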
You can see that, in fact, after adaptation with adversarial alignment, the two distributions of the red and the blue points have now been aligned almost perfectly. And classification accuracy also goes up considerably, so it's not just that the distributions are aligned; it actually does improve classification accuracy.
So the advantage is that once we've managed to train this generative adversarial network that can translate from the source to the target domain, we now have data that looks like it came from the target domain but has labels, because the original data is from the source; it's labeled with the categories that we need for training. And by the way, we can still add feature-space alignment to this overall architecture, and in fact we have experimented with that in our paper, which is cited at the bottom.
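As a small usage sketch of that idea: assuming a source-to-target image translator `G_s2t` has already been trained (the name is hypothetical, standing in for whatever pixel-level GAN is used), the translated images can be fed to the same task model from the earlier sketches with their original source labels.

```python
def translated_source_step(x_s, y_s):
    """Train on source images rendered in the target-domain style, keeping the source labels."""
    with torch.no_grad():
        x_fake_t = G_s2t(x_s)  # hypothetical generator: source image -> target-style image
    opt.zero_grad()
    loss = nn.functional.cross_entropy(classifier(encoder(x_fake_t)), y_s)
    loss.backward()
    opt.step()
    return loss.item()
```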
Okay, well, that's great; this pixel-space alignment seems pretty neat. But so far we've been assuming that we have unlabeled target data. In fact, what I didn't tell you is that, in order for that method to work, it needed to see quite a lot of unlabeled data from the target domain. But what if we only get one image, or a couple of images, from our target domain?
So instead, what we need to do is take a source-domain image, which is essentially our content, and translate it to a new visual domain for which we only have, let's say, one example. In this example, our content is a dog: we want to preserve the pose of the dog, but we want to change the style, or the domain, of the dog into this other breed. Unfortunately, I don't know what breed of dog it is.
We can take a look and see that we're able to do this using just a few, sometimes just one (we've tried one or a couple) images of the target domain, where here the domain is the breed of the animal. So we can change it such that the pose stays the same as in the content image, but the breed is taken from the style image.
So you can see that this is working quite well, and if you're curious, compared to the previous approach that we're building on, which is called UNIT, we're actually improving on it quite a bit. As you can see, UNIT is not able to translate images using just a single style image; it generates fairly poor results in this case. And on average, when we evaluate on a large dataset, we also see a significant gain using our COCO approach.
One other example that I want to show you really quickly is using this idea of adaptation in robotics. Here we have a robot that's trying to insert an object into another object, let's say a peg into a hole, or, more generally, we can apply this to other manipulation tasks. Our input data is coming from a depth sensor, so it looks like this: there's an RGB image, but what we're actually using is the depth image, which you can see in the middle here. But to train...
You can see here an example: a real depth image, then a similar simulated image, and then the last one, where we take the real image and translate it into the simulated domain. You can see that it now looks a lot more like the simulated data, so we're closing this domain gap. Okay, great, so I'm going to wrap up here. Just to recap what I talked about: dataset bias is a pretty major problem for machine learning in general, but specifically for computer vision, which is mostly what I work on.
I also think, and we could discuss this after if we have time, that there are even more general ethical issues related to datasets. For example, recently there was a paper, which is generating quite a lot of interest, that looks at the dangers of large language models and points out that language models are being trained on progressively larger and larger datasets.
So it's almost the opposite of the problem that I talked about, where we have a huge dataset that we're training on, and the problem they point out is that this dataset might contain all kinds of bad data, like offensive data or even private data, and by training the model on it, we don't know what kinds of biases or undesirable things it's learning.
So that's a related but different ethical issue. This paper, by the way: one of the co-authors is Timnit Gebru, whom you might have heard of; she was actually forced to leave Google over this specific paper.
So there are quite a few ethical issues, and I'm happy to discuss those, or anything related to what I talked about. Thank you very much for your attention.