Bias and Fairness in ML Systems for People with Disabilities

Jeffrey P. Bigham

@jeffbigham

Jeff on a Panel at CSUN with Merrie Morris, Matt Huenerfauth, and Shiri Azenkot

This blog post is based roughly on notes I wrote prior to a panel on AI and Ethics at CSUN 2019. CSUN is an amazing conference, which brings together accessibility professionals, assistive technology vendors, people with disabilities, and a handful of academics. This panel was headed by Merrie Morris and Megan Lawrence, and Matt Huenerfauth and Shiri Azenkot also participated.

Machine learning is starting to impact everything we do — it’s being used to screen job applicants, drive cars, inform medical decisions, recommend products, and set legal consequences.

We are naturally concerned with how machine learning bias may negatively impact people with disabilities. This concern is underscored by examples of machine learning bias popping up all over the place, and by the long history of people with disabilities being treated unfairly.

Machine learning learns patterns from data, and so the bias we observe in machine learning systems is deeply connected to the data on which they are trained. This has led to calls like “we need to remove bias from the data”.

All data is biased

We can’t remove all bias from data, because all data is biased.

Who:  Sometimes data is biased because people with disabilities aren’t included in the dataset at all, which is what people think of first in this context. That needs to be fixed. Often, though, data is biased because it is captured from people, and people have all sorts of biases, potentially including biases about people with disabilities, which then get included along with everything else.

Where:  Data might also be biased because it was only collected in the United States, or only from college students, or only from people who already happen to own an iPhone, or only from people who happen to use a particular app. Maybe a more representative sample of people was invited to participate, but the data ends up being unrepresentative because some people found it easier to get off work or to travel to the data collection site.

Maybe the website for collecting data itself was inaccessible in some way, making people affected by that inaccessibility unable to participate (Amazon Mechanical Turk is inaccessible, for instance).

What: Data never captures the whole true state of the world. In some way we end up “flattening” the world into a representation so that machine learning has a chance to find a pattern, and that representation has inherent bias. For a long time, we created explicit features and then learned statistical patterns over those, e.g., in categorical datasets, or even in older methods for computer vision. More recently we let the algorithms themselves do the flattening from whatever data they are given, but the flattening is still happening -- we just understand less about how it is happening or what consequences it might have. For instance, an algorithm might be creating features equivalent to “height”, which could very easily lead to biased results that we might have been more careful to screen out had we created those features by hand.
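
To make the flattening concrete, here is a minimal sketch in Python. The features and values are entirely invented for illustration, not drawn from any real system, but they show how a seemingly neutral attribute like height quietly becomes part of the only world the model ever sees.

# A hypothetical hand-crafted featurization step: each person is "flattened"
# into a short numeric vector before any learning happens.
def featurize(person):
    return [
        person["height_cm"],           # seemingly neutral, but correlated with
                                       # wheelchair use, age, gender, ...
        person["walking_speed_mps"],   # zero or atypical for many people with disabilities
        person["visible_limb_count"],  # an arbitrary modeling choice baked into the data
    ]

# Everything the model can ever learn about a person is whatever survives this
# flattening; anything not captured here simply does not exist for it.
print(featurize({"height_cm": 130, "walking_speed_mps": 0.0, "visible_limb_count": 4}))
# -> [130, 0.0, 4]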

Bias in datasets is incredibly important to understand, but it’s perhaps even more important to understand how the system using machine learning behaves.

Jutta Treviranus has found that at least one self-driving car vision system doesn’t recognize people in wheelchairs as people, because they move differently than, say, people who are walking or people riding bikes. That could be a result of bias, or it could be a result of the ruthlessly statistical way machine learning algorithms work.

Whatever data is being collected about people forms some sort of distribution. There’s nothing magical about machine learning: it simply finds patterns in this distribution and uses those patterns from past observations to make predictions about future ones. If data about you looks similar enough to data that was labeled as a person in the dataset, then your data (and thus you) gets labeled as a person. Maybe people in wheelchairs don’t get labeled as people because there isn’t any data that looks like their data in the dataset.

This is more likely to impact people with disabilities, who, for a variety of reasons, may look, move, or behave in ways that aren’t represented well by data collected naively.
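
As a toy sketch of that “similar enough to past observations” behavior, here is a tiny nearest-neighbor classifier in Python. The two features (apparent height and gait bounce) and all of the numbers are invented; the point is only that a body the training data never contained ends up matched to whatever it happens to sit closest to.

# Minimal, made-up sketch: a classifier only "knows" patterns present in its training data.
# Features are invented: [apparent height in meters, vertical "gait bounce" while moving].
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([
    [1.7, 1.0], [1.8, 1.1], [1.6, 0.9],   # walking pedestrians (they bounce as they walk)
    [1.1, 0.0], [0.8, 0.0], [1.0, 0.0],   # mailboxes, hydrants, bollards
])
y_train = np.array([1, 1, 1, 0, 0, 0])    # 1 = person, 0 = not a person

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# A wheelchair user: lower apparent height, no gait bounce. Nothing in the training
# data looks like this, so the nearest examples are all street furniture.
print(clf.predict([[1.3, 0.05]]))   # -> [0], i.e., "not a person"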


Representative Samples Aren’t Enough

Even if we manage to collect a “representative sample[1]”, our systems will still exhibit bias. The reason is that people with disabilities are quite diverse, and even a so-called representative sample may not have enough data to learn from for abilities and ways of being that are less common. Maybe that’s compounded because statistically fewer people who use wheelchairs responded to the call for data collection, or because fewer feel confident traveling on broken sidewalks and around dangerous human-driven automobiles. With less data to learn from, models will be less certain, causing them to fall back on what they’ve generally observed from the dataset.

When machine learning approaches don’t have a lot of data similar to you, they lean more heavily on whatever they would have concluded had they not seen your data at all (these are sometimes called priors: the probability of something before seeing a particular instance). Priors might also be thought of as the truth within stereotypes. For example, if I told you that I’m going to show you a picture of a person who identifies as a man, and give you $1000 to tell me whether he has short or long hair, you might reasonably say “short”. The data would probably back you up on this guess being the most likely to be correct, even though clearly many men have long hair.[2] Just as people are prone to assuming things about others based on their prior models of the world, so are machine learning models.
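
One way to see that fallback numerically is with a smoothed estimate that blends a group’s own observed rate with the dataset-wide prior. This is a toy Python sketch with invented numbers, not a description of any particular system:

# Toy sketch of "less data means leaning harder on the prior" (Laplace-style smoothing).
def smoothed_estimate(group_positives, group_total, prior_rate, prior_strength=50):
    # Blend the group's own observed rate with the dataset-wide prior;
    # the fewer observations the group has, the more the prior dominates.
    return (group_positives + prior_strength * prior_rate) / (group_total + prior_strength)

prior_rate = 0.75  # say, the rate of some property across the whole dataset

# A well-represented group: 10,000 observations with an observed rate of 0.30.
print(smoothed_estimate(3_000, 10_000, prior_rate))  # ~0.30, the group's own rate wins

# A sparsely represented group: 10 observations, same observed rate of 0.30.
print(smoothed_estimate(3, 10, prior_rate))          # ~0.68, pulled strongly toward the prior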

This is incredibly problematic when you have important cases for your application that don’t occur very often in your dataset, even if the sampling you use is “representative.” There are fascinating (and disturbing) results showing that machine learning models end up acting *more* biased than one would expect: if 60% of men have short hair, the model might predict that 80% of men have short hair. Because models so often gain benefit from heavily weighting whatever the most likely case is, they end up overweighting it.
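
One simple, mechanical version of that amplification: if a model is rewarded only for being right as often as possible and its features carry little signal, its best move is to always predict the majority, which turns a 60% base rate into 100% of its predictions. A toy sketch with invented numbers:

# Toy illustration of bias amplification: with uninformative features, an
# accuracy-maximizing model predicts the majority label every time, so a
# 60% base rate becomes 100% of its predictions.
import random

random.seed(0)
base_rate = 0.6  # say, 60% of men in the dataset have short hair
labels = ["short" if random.random() < base_rate else "long" for _ in range(10_000)]

predictions = ["short"] * len(labels)  # always guess the majority

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"accuracy: {accuracy:.2f}")            # ~0.60
print(f"true 'short' rate: {base_rate:.0%}")  # 60%
print(f"predicted 'short' rate: {predictions.count('short') / len(predictions):.0%}")  # 100%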

Striving for an unbiased dataset is a good start, but ultimately misses the point if our goal is an equitable and fair system. A self-driving car should recognize all people at an equal rate, and so if we can’t make self-driving cars that recognize people using wheelchairs as well as people walking, then those self-driving cars aren’t ready to be on the roads. It does not matter if only 1% of the people they observe while driving use wheelchairs (or 0.0001%); if they get that 1% wrong, they are 100% failures.

The Hard Road Toward Fair Systems

Building fair AI systems draws heavily on methods from HCI. We need to find stakeholders, we need to figure out what data to collect (and iterate on it), and we need to think carefully about when our method will fail and who those failures will affect.

How can we design for the diversity of use cases the ML system will encounter, and can we at all? This is incredibly difficult because we’re trying to predict in advance the unexpected situations the system will encounter, i.e., unknown unknowns. People aren’t necessarily perfect at this either, but deliberate human design processes do much better than what machines can statistically derive from data.

Humans can also decide that predicting all possible unknowns is untenable. We might end up deciding that we can’t adequately predict outliers in a way that makes us comfortable; we might need to go back to the drawing board and solve a different problem. For example, maybe we should augment human drivers, who are already pretty good at the things AI struggles with, to make them better at the things humans struggle with.

Counterintuitively, we might need to introduce new bias into our dataset if we want to create systems that are more fair. For instance, stratified sampling can be used to oversample the under-represented parts of the distribution. If people who use wheelchairs aren’t well represented in the dataset, we might need to go out of our way to collect lots and lots of data from people who use wheelchairs.
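
As a sketch of what that deliberate re-balancing can look like, here is a minimal Python example that oversamples an under-represented group by sampling with replacement. The group names and counts are invented, and duplicating existing rows is a weaker substitute for actually going out and collecting more data, but it shows the mechanics:

# Minimal sketch of oversampling an under-represented group before training.
import random

random.seed(0)
dataset = ([{"group": "walking"} for _ in range(9_500)]
           + [{"group": "wheelchair"} for _ in range(500)])

# Group the rows, then sample each group (with replacement) up to the same size.
by_group = {}
for row in dataset:
    by_group.setdefault(row["group"], []).append(row)

target_per_group = max(len(rows) for rows in by_group.values())
balanced = []
for rows in by_group.values():
    balanced.extend(random.choices(rows, k=target_per_group))

print({g: sum(r["group"] == g for r in balanced) for g in by_group})
# -> {'walking': 9500, 'wheelchair': 9500}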

When we think about metrics for our systems, we can’t simply use accuracy numbers that treat all errors the same, either. Forcing neutrality, treating every error as equivalent, is a very common and destructive bias. Missing all wheelchair users cannot be counted as being 99% correct. We need to include in our metrics, and thus our loss functions, the idea that the performance of our algorithm is only as good as its performance on the worst-performing group. We might need to develop new algorithms that specifically look to minimize errors on so-called outliers, or, more likely, just thoughtfully apply well-known approaches.
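
A simple version of “only as good as the worst-performing group” is to report accuracy per group and score the system by its minimum rather than by the overall average. A minimal Python sketch, with invented predictions and groups:

# Minimal sketch: evaluate by the worst-performing group, not by overall accuracy.
def worst_group_accuracy(y_true, y_pred, groups):
    per_group = {}
    for yt, yp, g in zip(y_true, y_pred, groups):
        correct, total = per_group.get(g, (0, 0))
        per_group[g] = (correct + (yt == yp), total + 1)
    accuracies = {g: c / t for g, (c, t) in per_group.items()}
    return accuracies, min(accuracies.values())

# 98 walking pedestrians, all detected; 2 wheelchair users, both missed.
y_true = [1] * 100
y_pred = [1] * 98 + [0] * 2
groups = ["walking"] * 98 + ["wheelchair"] * 2

per_group, worst = worst_group_accuracy(y_true, y_pred, groups)
print(per_group)  # {'walking': 1.0, 'wheelchair': 0.0}
print(worst)      # 0.0 -- "98% accurate" overall, but a complete failure for this group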

Ultimately, bias in datasets is an important component of fairness in ML, but what we really care about is the bias, fairness, and equity of our results. Fortunately, we know a lot about achieving this -- it’s just a hard road. We should recognize there is no easy answer. An algorithm isn’t going to solve this, and more data isn’t going to solve it.


To create ML systems that are fair, we need to do the hard design work to figure out what problems to solve, what it might mean to achieve acceptable performance on those problems, and how we can iteratively create fair outcomes. These need to be first-class design constraints from the beginning.

-- Jeffrey P. Bigham

http://www.jeffreybigham.com

@jeffbigham

Big thanks to Joseph Seering, Laura Dabbish, and Walter Lasecki, who helped several parts of this blog post sound better, include more perspectives, and not be outright wrong. The remaining controversial and/or incorrect parts were at my insistence.


[1] Figuring out what “representative” means is basically as impossible as removing bias from a dataset. Do we mean representative of the whole world, a region, of current pedestrians, etc.?

[2] Revisiting our discussion of all data having bias: this would likely be almost universally true if you collected your hair-length data from participants in the military, and much less likely if you collected it from members of ’80s hair bands.

