In 2019, Michigan lawmakers banned car insurance companies from using certain “non-driving” characteristics, like ZIP codes or credit scores, to calculate the price of their customers’ coverage. The reform was meant to prevent companies from charging customers more based on attributes unrelated to how well they drive, a change with particular stakes in the predominantly Black city of Detroit, whose drivers have long paid the highest insurance prices in the country.
However, an investigation by The Markup and Outlier Media found that after the reform, the system has remained stacked against marginalized communities. Across the state, insurance companies charged Michigan’s most expensive car insurance rates in neighborhoods with the most Black residents. In addition, insurance companies are able to charge more to drivers with low insurance scores, which aren’t technically credit scores but are also based on customers’ credit information; a 2007 Federal Trade Commission report states that this practice “likely leads to African-Americans and Hispanics paying relatively more for automobile insurance than non-Hispanic whites and Asians.”
To counteract the stark racial differences in their pricing systems, insurance companies should use statistical methods that explicitly account for race, according to a growing number of critics.
“That [discrimination] you’re finding in Michigan exists everywhere,” said University of Minnesota Law professor Daniel Schwarcz, who has spent much of his career studying how to rectify discrimination in insurance markets. “And that’s the inevitable result of the system we have.”
Where Current Methods Fail
Imagine a model that predicts the amount of cat food consumed by each member of a family, pets included. Although a kid or dog (or adult) may sneak a bite every now and then, in all likelihood cats are the ones primarily chowing down on the cat food. By adding a “species” variable to our model, we could use it to pretty accurately predict how much cat food every family member will eat.
The ability of a variable to help predict the outcome of interest is known as its predictive power. Since “species” is closely associated with cat food consumption, this variable likely has predictive power that the model will try to take advantage of.
But say we prohibit our model from taking species into account, so we delete the variable. Now, the model may start trying to recoup predictive power by inferring the species of each family member using other variables, like how much everyone naps. Let’s say our model has a “nap time” variable with data on how many hours a day a family member spends napping. If the model “knows” that one household member naps for 16 hours a day, it may essentially treat that nap duration as a proxy for being a cat and predict that this member will eat lots of cat food.
But this applies to more than just cat food. Other models could derive similar conclusions from a “nap time” variable, leading to potential unequal treatment of cats when predicting how much a family member sheds, how high they can jump, or who ate the fish left on the dining room table.
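That dynamic is easy to reproduce in a few lines of code. The sketch below, with invented numbers and a plain linear regression, is only meant to illustrate the mechanism: when a “species” variable is available, nap time adds almost nothing; remove it, and the nap-time coefficient balloons because napping now stands in for being a cat.

```python
# Toy illustration of proxy pickup: drop the "species" variable and a
# correlated variable ("nap hours") absorbs its predictive power.
# All numbers here are invented.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1_000
is_cat = rng.integers(0, 2, size=n)                              # 1 = cat, 0 = anyone else
nap_hours = np.where(is_cat == 1, 16, 8) + rng.normal(0, 1, n)   # cats nap far more
cat_food = 90 * is_cat + rng.normal(0, 5, n)                     # cats eat nearly all the cat food

full = LinearRegression().fit(np.column_stack([is_cat, nap_hours]), cat_food)
blind = LinearRegression().fit(nap_hours.reshape(-1, 1), cat_food)

print("with species    -> species coef:", round(full.coef_[0], 1),
      "| nap coef:", round(full.coef_[1], 2))
print("species removed -> nap coef:", round(blind.coef_[0], 1))
# With species available, the nap coefficient is near zero; without it,
# nap time becomes the model's stand-in for being a cat.
```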
This type of problem is called “omitted variable bias”: when a predictive variable is deleted, the model recovers its effect through correlated stand-ins. To avoid this discrimination by proxy, researchers like Schwarcz advocate for an alternative approach.
“If we don’t want you to take into account the predictive power of these suspect factors, then let’s actually estimate what the predictive power is and remove it from the model using statistical methods,” Schwarcz said.
When it comes to calculating car insurance prices, a “discrimination-free” method requires first building a model that predicts a customer’s estimated claim costs using all variables—including driving characteristics, like number of speeding tickets, and non-driving protected characteristics, like race—and then averaging those estimates across the protected attributes. For example, instead of ignoring race, for a Black policyholder the model would not only estimate the claim cost associated with that policyholder being Black but also estimate what their claim cost would be if they were White instead of Black, Asian instead of Black, Native American instead of Black, and so on.
The average of these values is the final estimated claim cost for that policyholder. Race and other protected attributes are likewise averaged out of predictions for all policyholders regardless of which demographic group they belong to.
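In code, that averaging step looks roughly like the sketch below, which uses made-up policy data and hypothetical column names: a model is trained with race included, each policyholder’s claim cost is re-predicted under every race category, and the counterfactual estimates are blended together. Whether the blend weights each group equally or by its share of the customer pool is a modeling choice; this sketch uses portfolio shares, a common choice in the academic literature.

```python
# Minimal sketch of a "discrimination-free" price calculation.
# The data, column names, and model choice here are all hypothetical.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

policies = pd.DataFrame({
    "speeding_tickets": [0, 2, 1, 0, 3, 1],
    "annual_mileage":   [8000, 15000, 12000, 6000, 20000, 9000],
    "race":             ["Black", "White", "Asian", "Black", "White", "Native American"],
    "claim_cost":       [400, 900, 500, 350, 1400, 600],
})

features = ["speeding_tickets", "annual_mileage", "race"]
X = pd.get_dummies(policies[features])
model = GradientBoostingRegressor(random_state=0).fit(X, policies["claim_cost"])

# Re-predict every policyholder's claim cost under each race category, then
# blend the counterfactuals, weighting each category by its portfolio share.
shares = policies["race"].value_counts(normalize=True)
blended = np.zeros(len(policies))
for race, weight in shares.items():
    swapped = policies[features].copy()
    swapped["race"] = race
    X_swapped = pd.get_dummies(swapped).reindex(columns=X.columns, fill_value=0)
    blended += weight * model.predict(X_swapped)

policies["discrimination_free_cost"] = blended
print(policies[["race", "claim_cost", "discrimination_free_cost"]].round(0))
```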
Methods That Ignore Race Can Result in Disparate Impact
John Yinger, trustee professor emeritus at Syracuse University, has done considerable work on disparate impact, the idea that policies can still be discriminatory if protected groups experience decidedly worse outcomes as a result. A company policy of promoting only people over 6 feet tall, for example, would likely produce an all-male management team. The flip side of disparate impact is disparate treatment, where policies explicitly treat protected groups differently—like a company saying that only men are allowed to get promotions, similarly leading to an all-male management team.
“You’ve got this problem in the data, which is if you put in race, you have disparate treatment. And if you leave race out, you likely have disparate impact,” said Yinger, who first proposed the idea of controlling for discrimination in a 2002 book about mortgages called “The Color of Credit.” “There’s no such thing as data that doesn’t reflect a racial component. And so you have to recognize it and then take it out. That’s the only way you can get fairness.”
But collecting and incorporating sensitive information like race into a model feels counterintuitive to many, which makes this practice a hard sell.
Robert Cheetham, creator of a predictive policing system called HunchLab, told The Markup for a 2021 investigation that his team looked into the idea. They considered requiring their model to make recommendations for officer patrols evenly across a jurisdiction to control for racial inequalities in the predictions.
“We didn’t include race, because we felt like it is so hard to make that case to the public. It would just be incendiary,” Cheetham said.
However, there may be ways to mitigate some of that discomfort. One method would be to collect race information for only a portion of the population and then use a neural network to infer how best to shape the model to control for the effect of race for everyone, as proposed in a 2022 study. The authors suggested insurers could obtain enough disclosures for the neural network to make its estimates by “offering special discounts to customers who are willing to disclose information on protected characteristics.”
If comfort with collecting and using such contentious personal information, even solely for the purpose of controlling for its effects, caught on, the uses could be widespread.
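The first step of that proposal, estimating how the protected attribute is distributed among customers who did not disclose it, can be sketched with simpler tools. The study itself uses a multi-task neural network; the hypothetical example below substitutes a plain logistic-regression classifier trained on the customers who took the discount, with invented data and column names.

```python
# Sketch of the partial-disclosure idea: a classifier trained on customers who
# revealed the protected attribute estimates its distribution for everyone else.
# The cited study uses a multi-task neural network; this stand-in is only meant
# to show the shape of the calculation.  All data and column names are invented.
import pandas as pd
from sklearn.linear_model import LogisticRegression

disclosed = pd.DataFrame({          # customers who accepted the discount
    "speeding_tickets": [0, 2, 1, 3, 0, 1],
    "annual_mileage":   [8000, 15000, 12000, 20000, 6000, 9000],
    "race":             ["Black", "White", "Asian", "White", "Black", "White"],
})
undisclosed = pd.DataFrame({        # everyone else
    "speeding_tickets": [1, 0, 2],
    "annual_mileage":   [11000, 7000, 16000],
})

features = ["speeding_tickets", "annual_mileage"]
clf = LogisticRegression(max_iter=1000).fit(disclosed[features], disclosed["race"])

# Estimated probability that each undisclosed customer belongs to each group;
# these estimates can then feed the claim model and the averaging step above.
probs = pd.DataFrame(clf.predict_proba(undisclosed[features]), columns=clf.classes_)
print(probs.round(2))
```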
How Fair Is “Fair Pricing”?
Eliminating proxy discrimination means that policyholders who are similar (apart from their protected characteristics) will pay similar prices. This is different from demographic parity, which would ensure pricing equality between demographic groups.
In practice, it can be nearly impossible to fully satisfy the requirements of both. Many methods that accomplish demographic parity produce claim estimates that explicitly depend on protected variables—the definition of discrimination—while avoiding proxy discrimination doesn’t guarantee demographic groups are treated equally.
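The difference is easy to measure. A demographic parity check simply compares average premiums across groups, as in the hypothetical sketch below; prices produced by the counterfactual averaging described earlier can still fail this check whenever groups differ on legitimate rating factors like mileage or tickets.

```python
# Checking demographic parity on a set of final premiums (invented numbers):
# parity asks only whether group averages match, regardless of how the
# prices were produced.
import pandas as pd

priced = pd.DataFrame({
    "race":    ["Black", "Black", "White", "White", "Asian", "Asian"],
    "premium": [620, 580, 540, 700, 610, 590],
})
group_means = priced.groupby("race")["premium"].mean()
print(group_means)
print("parity gap (max - min):", group_means.max() - group_means.min())
```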
In order to ensure demographic groups are treated equally, insurers would have to adjust policyholders’ premiums, raising some while lowering others. But since each insurer has a unique customer pool with varying demographic makeups, each insurer would have to make different adjustments. This may result in the same policyholder arbitrarily being charged very different prices by different insurers—a situation that is not only tricky to explain to customers but also extremely difficult to regulate.
Andreas Tsanakas, a co-author on the study “What Is Fair? Proxy Discrimination vs. Demographic Disparities in Insurance Pricing,” argued against enforcing demographic parity at the portfolio level but said in an email that a reasonable argument could be made for implementing it across an entire insured population, removing some of the arbitrariness of the adjustments. Tsanakas, however, still advocates for prioritizing elimination of proxy discrimination.
One potential consequence of these methods is an increased risk of adverse selection, a phenomenon in which a policyholder has more knowledge about their propensity to claim than the insurer, allowing them to obtain coverage at a lower price—one that is not reflective of their risk. This can have detrimental effects on the insurer, which may eventually result in premium increases for all policyholders.
A study examining the welfare cost of fair pricing in insurance found that fairness requirements can be costly for both the insurer and policyholders, depending on how they are enforced and how competitive the insurance market is. However, using data from the car insurance market in France, the same study found that when fair-pricing methods, such as the “discrimination-free” method, were imposed across the entire market, overall consumer welfare improved.
How Colorado Is Trying to Regulate Price Discrimination
As insurers feed more and more consumer data into machine learning pricing models, discrimination by proxy has become much more difficult to discern.
In response, Colorado’s legislature passed a bill in 2021 that protects consumers “from insurance practices that result in unfair discrimination on the basis of race, color, national or ethnic origin, religion, sex, sexual orientation, disability, gender identity, or gender expression.”
Regulators are still working with insurance companies to figure out how to implement these new rules, but a key element will be incorporating protected characteristics into insurers’ models and paying close attention to how the models use the data.
Jason Lapham, Colorado’s deputy commissioner for property and casualty insurance, explained that a crucial first step is determining which pricing variables function as proxies for protected characteristics.
The identification method should sound familiar: “You introduce race, or whatever protected class characteristic, explicitly into the model, and then look at the difference between the model without race, or whatever protected class characteristic that we’re talking about, and the model with that variable included,” Lapham explained. He noted that it’s not clear yet what new responsibilities companies will have once proxy issues are identified.
“There’s no insinuation that insurance companies are doing this intentionally,” he continued. “It’s that there can be unintended consequences because of the things that are baked into this data that can then lead to either perpetuating, amplifying, or reproducing the unfair discrimination that led to what we’re observing in the first place.”
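A stripped-down version of the comparison Lapham describes might look like the hypothetical sketch below: price the same customers twice, once with the protected characteristic in the model and once without, and watch how the other variables shift. The data are invented, including the built-in racial component that Yinger argues is present in real claims data.

```python
# Hypothetical sketch of the with-vs.-without comparison regulators describe:
# fit the pricing model twice and see how much the other variables' weights
# move once the protected characteristic is removed.  Large shifts flag
# likely proxies.  All numbers are invented, including a racial component
# in claim costs of the kind Yinger says real data reflects.
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.DataFrame({
    "speeding_tickets": [0, 2, 1, 0, 3, 1, 2, 0],
    "urban_territory":  [1, 0, 1, 1, 0, 1, 0, 0],   # candidate proxy variable
    "race_black":       [1, 0, 1, 1, 0, 1, 0, 1],   # protected characteristic
    "claim_cost":       [500, 500, 600, 500, 600, 600, 500, 500],
})

with_race = LinearRegression().fit(
    data[["speeding_tickets", "urban_territory", "race_black"]], data["claim_cost"])
without_race = LinearRegression().fit(
    data[["speeding_tickets", "urban_territory"]], data["claim_cost"])

print("territory weight, race included:", round(with_race.coef_[1], 1))
print("territory weight, race removed: ", round(without_race.coef_[1], 1))
# Once race is removed, the territory variable picks up part of its effect,
# which is exactly the signal regulators would look for.
```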
The Markup and Outlier Media asked Michigan’s Department of Insurance and Financial Services if insurance regulators in the state had looked into controlling for discriminatory elements in pricing models. “We are always monitoring industry trends and collaborating with regulators across the country to identify ways to strengthen our consumer protection efforts,” said department spokesperson Zachary Dillinger, but he did not highlight any specific examples.