Unless you’ve been living under a rock—an impossibly cozy, privileged, particle-filtering rock—2020 has probably opened your eyes to how far we still have to go to achieve a truly fair society. COVID-19 has revealed and reinforced existing oppression, inequality, and vulnerability, from systemic racism and a lack of affordable healthcare to a spike in domestic violence.

And if, like me, you’ve been lucky enough to continue working comfortably and non-essentially from home, safe in the knowledge that if you got sick your hospital bill would be covered and you’d never be denied treatment because of the color of your skin, this year has probably also made you painfully aware of how you’ve been part of the problem.

So, the bare minimum we can do right now is to get things right in our respective fields of work, using our little slices of expertise and influence as responsibly as we possibly can.

In the field of data, one way to do this is to dig into and understand the dark side of AI. We have to ensure the data our models are trained on isn’t bad, biased, or both, leading those models to make harmful decisions and judgments on our behalf; predictive policing is one particularly topical example.

Even as a data scientist who spends big chunks of her time scrubbing and prepping datasets, the idea of potentially harmful AI bias can feel a little abstract, like something that happens to other people’s models and accidentally gets embedded into other people’s data products.

That was the case for me—until a few weeks ago.

What’s the quickest, easiest way to identify masks on camera?

Imagine you’re looking for a side project to dig into, and you’re curious to rank the most socially responsible street corners in the United States. Can you code a tool that allows you to connect to thousands of street-cams around the country simultaneously and determine the proportion of pedestrians wearing masks at any given point in time?

When I found myself in this situation a few weeks ago, I turned to Google’s Cloud Vision API, an artificial intelligence tool that gives developers access to models trained on large datasets of images, to get my code up and running quickly. Pre-trained tools like this one are a bit of a black box for the end-user—you don’t know what kind of images taught it to see, but it’s Google. So you trust it has seen a ton of images and hope it will do a good job on yours.

Alongside features like text detection, color-scheme analysis, and a reverse image search called Web Detection, Cloud Vision also detects labels for the different elements of an image and exposes a confidence score attached to each label.

Having successfully used the Cloud Vision API in the past, I put on my mask, snapped a webcam picture, and pinged Google to tell me what was going on.
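For anyone who wants to follow along, this is roughly what that call looks like with the google-cloud-vision Python client. It is a minimal sketch: it assumes the library is installed and credentials are configured, and the file name is just a placeholder.

```python
# Minimal sketch: label detection with the google-cloud-vision Python client.
# Assumes `pip install google-cloud-vision` and GOOGLE_APPLICATION_CREDENTIALS are set.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("mask_selfie.jpg", "rb") as f:  # placeholder file name
    image = vision.Image(content=f.read())

response = client.label_detection(image=image)
for label in response.label_annotations:
    print(f"{label.description}: {label.score:.2%}")
```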

[Image A: webcam selfie in a KN95 mask]

We detected 10 labels: Face: 96.57%, Duct tape: 94.51%, Head: 88.43%, Neck: 86.85%, Nose: 84.2%, Mask: 73.92%, Mouth: 73.84%, Headgear: 67.48%, Costume: 59.9%, Face mask: 52.31%

There it was. Mask at 74% confidence. Not too bad, though I had expected better. Admittedly, some of the other labels were a little off: I wasn’t wearing any headgear or a costume, and my mouth was not visible. And then there was duct tape. That was a bit strange.

Perhaps the dataset used for training simply didn’t have enough examples of masks to confidently identify my KN95 as a mask. Google, along with the rest of the world, probably didn’t see the pandemic coming. And my image quality wasn’t the best.

Wait…duct tape? Why would Google think my mouth was covered by duct tape, with 95% confidence?

Masks are mistaken for duct tape—an emerging pattern

Thinking it must be a weird, one-off bug, I tried a different mask.

[Image B: selfie in a different mask]

We detected 10 labels: Face: 97.17%, Lip: 90.09%, Head: 89.02%, Shoulder: 87.33%, Duct tape: 87.01%, Nose: 84.2%, Neck: 83.05%, Mouth: 82.92%, Joint: 77.83%, Selfie: 72.36%

There it was again: Duct tape, with a confidence score of 87%. That’s pretty darn confident considering my face mask looked nothing like duct tape. On top of that, mask didn’t even show up in the labels this time.

Ever the idealist, I hoped this might still be a coincidence and tried again.

To make this a little easier for the AI, I fed it a classic model—the blue surgical mask. Health professionals have been donning these for a long time now.

[Image C: selfie in a blue surgical mask]

We detected 8 labels: Hair: 97.12%, Shoulder: 92.5%, Neck: 86.06%, Joint: 73.96%, Mouth: 67.95%, Duct tape: 66.16%, Black hair: 60.66%, Brown hair: 55.65%

I certainly expected the PPE or mask label to show up, but instead—though with a slightly lower score this time—there was duct tape again, with 66% confidence. No mask in sight.

Not to go all Carrie Bradshaw, but I couldn’t help but wonder: what does a training set look like that so confidently suggests to an AI model that half of my face is covered by duct tape?

And was this an issue for men? Or was the model more likely to suggest duct tape if it identified a woman in the image, revealing a worrying bias toward gendered violence?

Unfortunately, there’s no way to get a glimpse of the former: out-of-the-box tech tools like Cloud Vision aren’t exactly known for transparency when it comes to training data. So, I took a stab at investigating the latter.

In February, Google removed the labels “man” and “woman” from their Cloud Vision tool, stating that inferring gender from appearance isn’t possible and shouldn’t be attempted. But how much is a Band-Aid solution like this worth if gender bias drifts to the surface in more insidious ways?

And while Google inspired this investigation, I wanted to know whether this was an issue not only for this particular tech giant but also for other companies offering pre-trained image-recognition services. Would IBM Watson and Microsoft Cognitive Services throw up similarly twisted results?

Would you take a mask selfie, for research purposes?

To run the numbers on this, we needed a decent sample of people in masks to put through the computer vision models developed by Google, IBM, and Microsoft respectively.

Beyond images found via web search, we solicited mask images from colleagues and friends to get a balanced sample. The final dataset consisted of 265 images of men in masks and 265 images of women in masks, of varying picture quality, mask style, and context—from outdoor pictures to office snapshots, from stock images to iPhone selfies, from DIY cotton masks to N95 respirators.
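The tallying itself is straightforward. As a rough sketch of what that looks like for one of the services, the snippet below sends every image in a folder through Cloud Vision’s label detection and records how often each label appears per group; the folder layout, the JPEG-only glob, and the exact label strings checked are all assumptions for illustration.

```python
# Sketch of the tallying step: run each image in the sample through Cloud Vision
# and count how often each label shows up per group. Folder names, the label
# strings checked, and the JPEG-only glob are illustrative assumptions.
from collections import Counter
from pathlib import Path

from google.cloud import vision

client = vision.ImageAnnotatorClient()

def label_frequencies(folder):
    """Return how often each label appears across the images in `folder`."""
    counts, total = Counter(), 0
    for path in Path(folder).glob("*.jpg"):
        image = vision.Image(content=path.read_bytes())
        labels = client.label_detection(image=image).label_annotations
        counts.update({label.description for label in labels})  # once per image
        total += 1
    return counts, total

for group in ("masks/men", "masks/women"):  # hypothetical folder layout
    counts, total = label_frequencies(group)
    for label in ("Mask", "Personal protective equipment", "Duct tape", "Facial hair"):
        print(f"{group}: '{label}' in {counts[label] / max(total, 1):.0%} of images")
```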

Google Cloud Vision

Out of the 265 images of men in masks, Google correctly identified 36% as containing PPE. It also mislabeled 27% of the images as depicting facial hair. While inaccurate, this makes sense, as the model was likely trained on thousands and thousands of images of bearded men. Despite no longer returning the label man, the AI seemed to make the association that something covering the lower half of a man’s face was likely to be facial hair.

Beyond that, 15% of the images were misclassified as depicting duct tape. This suggested the problem might affect both men and women; what we needed to learn was whether the misidentification was more likely to happen to women.

[Chart: Google Cloud Vision label results]

Of the 265 images of women in masks, the model correctly identified 18% as containing PPE, half the rate for men. Cloud Vision also suggested “facial hair” as a label for 8% of the images.

Most interesting (and worrisome) of all, the tool mistakenly identified 28% as depicting duct tape. At almost twice the rate for men, it was the single most common “bad guess” for labeling masks. Also surprising was Cloud Vision’s confidence in this suggestion: with an average score of 0.93 for women and 0.92 for men, the model was certainly making an educated guess. Which raised the question: where exactly did it go to school?

IBM Watson Visual Recognition

Turns out, Watson doesn’t exactly mistake face masks for duct tape. Instead, it keeps things vague with tags like “restraint (chains)” and its subcategory “gag”, which carry similarly worrying connotations.

[Chart: IBM Watson Visual Recognition label results]

On top of that, these tags were more than twice as likely to show up for women: “restraint (chains)” and “gag” appeared for 23% of the images of women and only 10% of the images of men, echoing the Google results. Furthermore, Watson correctly identified 12% of the men as wearing masks, while it was right only 5% of the time for women.

As for the confidence levels, the average score for the “gag” label hovered around 0.79 for women, compared to 0.75 for men. So while the numbers look slightly more damning than Google’s, Watson was at least a bit more hesitant to assign those labels.

Microsoft Azure Cognitive Services Computer Vision

[Chart: Microsoft Azure Cognitive Services label results]

What first stood out was how genuinely bad Microsoft appeared to be at correctly tagging masks in pictures. Only 9% of images of men and 5% of images of women were correctly tagged as featuring a mask.

Apart from that, no labels such as duct tape or gag and restraint made their way into the results. Good news—though the model still displayed a more innocuous kind of gender bias.

Overall, Azure Cognitive Services identified the mask as a fashion accessory in 40% of the images of women, compared to only 13% of the images of men. Going one step further, the computer vision model suggested that 14% of the images of masked women featured lipstick, while in 12% of the images of men it mistook the mask for a beard. These labels seem harmless in comparison, but they’re still a sign of underlying bias and of the model’s expectations about what it will and won’t see when you feed it the image of a woman.

It’s not (just) them, it’s us

It’s fascinating, albeit not surprising, that for each of the three services tested we stumbled upon gender bias while trying to solve what seemed like a fairly simple machine learning problem. That this happened across all three competitors, despite vastly different tech stacks and missions, is precisely the point: the issue extends beyond any one company or any one computer vision model.

We are not suggesting that a data scientist at Google maliciously decided to train the computer vision model on twice as many images of women with their mouths covered by duct tape as of men. And we’re almost certain no one at IBM thought it would be a smart business decision to suggest a woman in PPE is being restrained and/or gagged. But do you remember Tay, the Microsoft Twitter chatbot that started posting racist slurs? Neural networks trained on biased textual data will, at best, sooner or later embarrass their makers, and at worst they will perpetuate harmful stereotypes. The same holds true for machines that learn to see through a skewed lens.

That lens, of course, is our very own culture, offline as well as online—a culture in which violence against women, be it fictional or real, is often normalized and exploited. Consider this seemingly silly example: simple image searches for “duct tape man” and “duct tape woman” revealed men mostly (though not exclusively) pictured in full-body duct tape partaking in funny pranks, while women predominantly appeared with their mouths duct-taped, many clearly in distress.

[Images: Google image search results for “duct tape man” and “duct tape woman”]

If this kind of bias was reflected in Google’s training set, too, it’s no surprise that we’re seeing masks mislabeled as duct tape at a much higher rate for women.

[Chart: duct tape, gag, and restraint labels for women vs. men]

Beyond duct tape and chains, the numbers revealed how unprepared global tech companies, just like governments and healthcare systems, were for the arrival of a global pandemic during which a large proportion of the population would be required to wear masks.

Across the board, all three computer vision models performed poorly at the task at hand. However, they were consistently better at identifying masked men than masked women. Since masks are usually seen in a medical context, perhaps the training data depicted more male medical professionals in masks than female ones.

[Chart: PPE and mask labels across the three services]

This doesn’t just raise questions of gender bias—racial and cultural bias are yet to be explored. After all, wearing masks in public places has been the custom in East Asia for years—why were the models so bad at recognizing them?

Holding AI models to higher standards

While it’s critical to address the underlying condition, we must still treat the symptoms. So, what can we, as end-users of technologies such as Cloud Vision and Azure Cognitive Services, do?

Do your research every time you ping a service you didn’t teach to think, and do what we’ve done here—unearth and remove labels and tags that reflect unfair and harmful bias. This is the absolute minimum we should always do. The less you know about your tools, the more you need to test.
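One blunt but practical way to do that minimum is to screen every API response against a list of labels you have flagged as reflecting harmful bias, and to log anything you drop so it can be reviewed rather than silently ignored. The sketch below assumes Cloud Vision-style label annotations, and the flagged terms are illustrative, not exhaustive.

```python
# Sketch: screen label annotations against a denylist of terms flagged as
# reflecting harmful bias. The list here is illustrative, not exhaustive.
FLAGGED_TERMS = {"duct tape", "gag", "restraint", "chains"}

def screen_labels(label_annotations, log=print):
    """Drop flagged labels and log them so they can be reviewed, not silently ignored."""
    kept = []
    for label in label_annotations:
        if any(term in label.description.lower() for term in FLAGGED_TERMS):
            log(f"Flagged label removed: {label.description} ({label.score:.0%})")
        else:
            kept.append(label)
    return kept
```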

The more diverse the team, the more likely it is that someone will notice things you would never spot on your own. I was baffled by the duct-tape label because I’m a woman, and therefore more likely to receive a duct-tape label back from Google in the first place. But gender is not even close to the only dimension we must consider here.

To go even further, train your own models. Take note: this path is not without danger. If your homegrown model comes up with harmful results, then something was wrong with the training data, and that’s going to be on you. On the upside, if you spot dodgy labels and questionable results, you can retrace your steps and change future outcomes.

In case you’re not ready to go fully solo yet—or don’t have the time and resources—you can also train custom models on top of the technology provided by companies such as Google, IBM, and Microsoft, course-correcting their mistakes with a smaller dataset than you’d need to train from scratch.
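To make that idea concrete, here is a rough sketch of the general pattern (transfer learning) in Keras, rather than any of the managed custom-model services themselves: a pretrained backbone is frozen and a small classification head is trained on your own, curated mask dataset. The folder name, image size, and training settings are placeholders.

```python
# Sketch of transfer learning: freeze a pretrained backbone and train a small head
# on your own curated mask dataset. "mask_dataset/" is a hypothetical folder with
# one subfolder per class (e.g. mask/ and no_mask/); sizes and epochs are placeholders.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    include_top=False, pooling="avg", input_shape=(224, 224, 3), weights="imagenet")
base.trainable = False  # keep the general-purpose features as-is

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # scale pixels to [-1, 1]
    base,
    tf.keras.layers.Dense(1, activation="sigmoid"),  # mask / no mask
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

train_ds = tf.keras.utils.image_dataset_from_directory(
    "mask_dataset/", image_size=(224, 224), batch_size=32)
model.fit(train_ds, epochs=5)
```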

Whether you’re working with partners such as Google, Microsoft, or IBM, or are simply a paying customer of the services these companies provide, hold them accountable when red flags appear.

A model is only ever as socially responsible as its dataset

Beyond damage control and Band-Aid solutions, we must work diligently to ensure that the artificial intelligences we build have the full benefit of our own natural intelligence. If our machines are to work accurately and reflect society responsibly, we must help them understand the social dynamics we live in, stop them from reinforcing existing inequalities through automation, and put them to work for good instead.

The good news is that if you’re ever feeling helpless, you can always start with your data, whether you’re working with petabytes or a 10 MB CSV file. Scrub it, clean it, label it, test it, browse it. Do the boring work. If you take one thing away from this study, keep in mind that more training data doesn’t necessarily mean you’ll get it right.

After all, we’d quite like our street-cam analyzer to suggest that 56% of people on the street are staying safe—not being gagged and restrained.
