Black Loans Matter: Fighting Bias for AI Fairness in Lending
Black Lives Matter. These three words have loomed large in the sociopolitical psyche of the United States since the founding of the movement in 2013. The #BLM movement, distinct from any specific organization, is principally focused on combatting police brutality, mass incarceration, and other forms of violence resulting from structural and cultural racism. Freedom from violence is among the most basic human needs articulated in Maslow’s Hierarchy. Another is financial security.
Today in the United States, African Americans continue to suffer from financial exclusion and predatory lending practices. Meanwhile the advent of machine learning in financial services offers both promise and peril as we strive to insulate artificial intelligence from our own biases baked into the historical data we need to train our algorithms. In our paper, Black Loans Matter: Distributionally Robust Fairness for Fighting Subgroup Discrimination, published in the Fair AI in Finance workshop of the 2020 NeurIPS conference, we evaluate this challenge for AI fairness in lending.
We focus on a critical vulnerability in the group fairness approach enforced in banking today. Insofar as we want similar individuals to be treated similarly, irrespective of the color of their skin, today’s group-level statistical parity measures fail because algorithms can learn to discriminate against subgroups. We recommend an alternative approach drawing from the literature on individual fairness, including state-of-the-art methods published in 2020 in top AI conferences. Specifically, we propose Distributionally Robust Fairness (DRF) and an efficient training algorithm called SenSR to enforce it, where SenSR also relies on an algorithm called EXPLORE to learn a fair metric from the data. These methods have been validated in the AI peer community; now they need to be validated in the real world. We hope this paper will serve as but the first step in the right direction.
Racial discrimination in lending
Lending is a leading opportunity space for AI technologies, but it is also a domain fraught with structural and cultural racism, past and present. Lending itself is also a historically controversial subject because it can be a double-edged sword. On the one hand, access to affordable credit is an important tool for financial freedom and prosperity, and for extending equal opportunity to minority households and minority-owned businesses. The Consumer Financial Protection Bureau (CFPB) estimates 20% of the U.S. adult population is under-served for credit and people in this population are more likely to be from minority groups. On the other hand, lending can be an aggressive and predatory practice that lures people in and traps them in cycles of debt. The policies and practices leading up to the 2008 financial crisis illustrate this tension between lending as helpful versus harmful. One of the core mandates of the Consumer Financial Protection Bureau is to help us achieve the right balance.
The term redlining was coined in the 1960s by the sociologist John McKnight to describe the government-sponsored practice of discriminating against black neighborhoods, as shown in this “residential security” map of Philadelphia from 1936.
A 1936 Home Owners’ Loan Corporation (HOLC) “residential security” map of Philadelphia, classifying various neighborhoods by estimated riskiness of mortgage loans. The HOLC maps are part of the records of the FHLBB (RG195) at the National Archives II.
In the 1970s, to fight discrimination, Congress passed the Fair Credit Reporting Act and the Equal Credit Opportunity Act. Unfortunately, by 1988 little had changed: the investigative journalist Bill Dedman earned a Pulitzer Prize for a series of essays entitled The Color of Money, in which he showed how mortgage lending practices discriminated against African Americans in Atlanta and beyond through redlining and reverse-redlining, the predatory practice of targeting minority groups with credit they can’t afford.
That was over 30 years ago. Where are we today?
Author’s Note: The topic of algorithmic fairness is at once technical, sociological, and political. Such compound complexity, coupled with the intractable challenges of racial justice, makes for a difficult landscape with no simple solutions. Moreover, we as the authors of this paper, not being from the African diaspora ourselves, are limited in our ability to truly understand the experiences of African Americans. Nevertheless, we are here to do our small part for the cause of racial justice in lending and algorithmic decision-making. This paper being a collaboration between Wells Fargo and IBM Research, we hope this intersectional analysis can be informative for banking practitioners, public policy makers, and technology companies alike.
Consumer-Lending Discrimination in the Fintech Era
Unfortunately, as the Black Lives Matter movement reminds us, we do not yet live in a post-racial society. In a landmark 2019 study of 3.2 million mortgage applications and 10.0 million refinance applications from Fannie Mae and Freddie Mac, Bartlett et al. found evidence of racial discrimination in face-to-face lending as well as in algorithmic lending.
Not only do Black and Latino applicants face higher rejection rates, 61% compared to 48% for everyone else, they also suffer from race premiums, paying as much as 7.9 basis points more on mortgage interest rates. That difference adds up to an annual race premium of over $756 million in the United States.
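To put a basis point in concrete terms, here is a small worked calculation. The loan balance is a hypothetical figure chosen for illustration, not a number from the study, and the calculation ignores amortization.

```python
# One basis point is one hundredth of a percentage point (0.01%).
BASIS_POINTS_PER_UNIT = 10_000

def extra_annual_interest(balance, premium_bps):
    """Approximate extra interest paid per year on a balance for a given rate premium."""
    return balance * premium_bps / BASIS_POINTS_PER_UNIT

# Hypothetical outstanding mortgage balance (an assumption, not from the study).
balance = 250_000
extra_interest = extra_annual_interest(balance, premium_bps=7.9)
print(f"Extra interest per year: ${extra_interest:,.2f}")
```

On this assumed balance the premium is a little under $200 a year for one household. The size of the aggregate figure comes from summing small per-loan premiums over a very large number of loans.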
This is an extraordinary injustice and it is one being raised at the highest levels of government and society. U.S. senators Warren and Jones cited Bartlett et al. along with other studies in a letter to the Board of Governors of the Federal Reserve System, the Office of the Comptroller of the Currency, the Federal Deposit Insurance Corporation, and the Consumer Financial Protection Bureau, with a detailed admonishment to do more to combat algorithmic lending discrimination. Amidst a resurgence of the Black Lives Matter movement in 2020, we are especially mindful of lending outcomes for black Americans, whose financial prospects have been most undermined by structural and cultural prejudice, past and present. As modern decision-making processes have become increasingly dependent on machine learning, public backlash has spiked in numerous sectors as people grapple with the incongruence of modern values and the historical data skeletons unearthed by our amoral learning algorithms.
The good news is, there are things we can do about this. We share the view that biased algorithms are easier to fix than biased people. In the study, algorithmic lending improved acceptance rate parity, but it was still guilty of imposing racial premiums of 5.3 basis points for purchase mortgages and 2.0 basis points for refinance mortgages (recall “reverse-redlining” above). The researchers speculate this may be due to the ability of an algorithm to discriminate on the basis of intragroup variation. The fact that such discrimination can pass the compliance protocols of lenders is the subject of our paper.
How AI fairness in lending works today
At a high level, there are three components to enforcing fairness in lending today. First is separation. Model developers are excluded from the fairness process to make sure they can’t game the system. Second is blindness. Model developers can’t see sensitive attributes, and in many cases they aren’t even collected except by proxies provided by the government. Third is group fairness, or statistical parity measures for known minority groups relative to the majority. All three of these have problems and pitfalls, but in this paper we focus on group fairness.
Group fairness says we want to ensure relative statistical parity between defined groups. Some of the most common group fairness metrics are demographic parity, equalized odds, equal opportunity, and predictive rate parity. For example, if we consider men and women as two groups of loan applicants, we want to see the same acceptance and rejection rates for each. Pretty straightforward? Maybe not.
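As a rough illustration of the simplest of these metrics, the sketch below computes a demographic parity gap, meaning the difference in approval rates between two groups. The function names and the toy decisions are hypothetical and are not taken from any lender's compliance process.

```python
def selection_rate(decisions):
    """Share of applicants approved (1 = approved, 0 = denied)."""
    return sum(decisions) / len(decisions)

def demographic_parity_gap(decisions, groups, a, b):
    """Absolute difference in approval rates between groups a and b."""
    rate = {}
    for g in (a, b):
        rate[g] = selection_rate([d for d, gg in zip(decisions, groups) if gg == g])
    return abs(rate[a] - rate[b])

# Hypothetical toy data: four men followed by four women.
decisions = [1, 0, 1, 1, 0, 1, 1, 0]
groups = ["m", "m", "m", "m", "w", "w", "w", "w"]

gap = demographic_parity_gap(decisions, groups, "m", "w")
print(f"Approval-rate gap between men and women: {gap:.2f}")
```

A gap of zero means both groups are approved at the same rate. It says nothing about whether any two similar individuals inside those groups were treated alike.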
What happened with the Goldman Sachs Apple Card?
On November 7th, 2019, the creator of Ruby on Rails tweeted that the Goldman Sachs Apple Card was sexist because his wife received 1/20th of the credit limit he received. The tweet went viral, with many others saying the same thing, including Apple co-founder Steve Wozniak. Just three days later, the New York State Department of Financial Services announced a legal investigation. One year later, if you search Goldman Sachs women credit, or Apple women and credit, the searches are full of allegations of discrimination. Thus the incident brought a significant reputational, legal, and financial cost for both companies.
Surely, Goldman Sachs conducted group fairness analyses as part of its compliance process, so what happened? We can only speculate, but there are clues in the reporting that point to a potential subgroup discrimination problem. Many of the complaints were from husband-and-wife cases with shared assets and the model developers were blinded-by-design to the gender of the applicants.
Here’s a fictional example to illustrate one possible explanation of what may have happened.
Suppose we have an algorithm based largely on personal income. We have high, medium, and low credit limit groups. We constrain the algorithm to successfully ensure perfect parity between men and women. Now we consider fictional individuals David and Jamie. David has a high personal income. Jamie has none. Therefore our algorithm gives them very different credit limits.
But it turns out, David and Jamie are married. David works a paid job while Jamie is a homemaker. They share all of the household income and assets, so they actually have very similar creditworthiness. Now instead of considering just men and women, consider a subgroup called homemakers. Today in the United States, most homemakers still happen to be women. Given joint tax filing, the fact that “homemaker” is a recognized occupation, and that household income and shared assets are key variables for predicting creditworthiness, we can see how individuals in this subgroup could form a reasonable expectation of equitable treatment relative to their partners. And in the face of inequitable treatment, we can also see how this group might be upset and take to Twitter with claims of gender discrimination.
This illustrates the vulnerability of group parity approaches to fairness: it is difficult, if not impossible, to define and test for all possible subgroups and combinations of subgroups.
The aforementioned group fairness metrics have a limitation: they require a definition of the protected group (e.g. African Americans). In the real world, identity is much more complex and heterogeneous, with numerous subgroups and subgroup combinations within and across protected groups. This has been observed by scholars of behavioral medicine, for example, who argue that inattention to subgroup diversity can result in disparate outcomes for significant minority populations (Birnbaum-Weitzman). Group-level statistical metrics can easily be manipulated by humans and (especially) by machine learning algorithms enforcing fairness as a constrained optimization problem. A given subgroup may be large enough to pose a legal and reputational risk if treated unfairly while remaining small enough to pass the meta-group fairness metrics. We call this subgroup discrimination.
Let’s return to the topic of race, and let’s consider loan applications for Tyree and Ryan, which are names from a dataset of common African American and Caucasian names.
Let’s say Tyree and Ryan are very similar in every way except for race. Tyree is African American. Ryan is Caucasian. We run a fictional profit-maximizing algorithm that satisfies group fairness. Three applicants from each group are rejected, the others are approved. But look at Tyree and Ryan.
We’ve got a problem of counterfactual robustness here in which the model is not invariant to race, given that Tyree and Ryan are similar but experience different outcomes. Tyree was denied while Ryan was approved. That doesn’t seem fair at all.
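One simple way to probe for this kind of failure is to hold everything about an applicant fixed, change only the protected attribute, and see whether the decision moves. The sketch below does that against a deliberately biased toy rule. Both `toy_model` and its income thresholds are hypothetical, invented for illustration.

```python
def toy_model(applicant):
    """Hypothetical approval rule that improperly penalizes one race at middle incomes."""
    approve = applicant["income"] >= 50_000
    if applicant["race"] == "black" and 50_000 <= applicant["income"] < 80_000:
        approve = False
    return approve

def is_counterfactually_invariant(model, applicant, attribute, values):
    """True if the decision is identical for every value of the given attribute."""
    outcomes = {model({**applicant, attribute: v}) for v in values}
    return len(outcomes) == 1

tyree = {"name": "Tyree", "income": 60_000, "race": "black"}
races = ["black", "white"]

print(is_counterfactually_invariant(toy_model, tyree, "race", races))
print(is_counterfactually_invariant(toy_model, {**tyree, "income": 90_000}, "race", races))
```

The first check fails and the second passes, so the bias in this toy rule is confined to one income band. A real model that never sees race directly can still fail in the same way through correlated features, which an explicit attribute flip like this one would not catch.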
Maybe this subgroup represents middle-income applicants and the algorithm has decided that for some reason black middle-income applicants aren’t worth the risk while white middle-income applicants are. To understand why this might be, let’s go deeper and look at interest rates. Here we’re assigning annual percentage rates to the upper, middle, and lower income groups. We’ve achieved group fairness between black and white borrowers with a shared average of 20%, but notice the discrepancy for low-income applicants at the bottom. Black borrowers are paying significantly more than white borrowers.
This is a simplistic but reasonable strategy for maximizing profit. The algorithm can give especially favorable terms to high income African Americans in order to offset more predatory terms for low-income African Americans. Thus middle income black borrowers are subject to a form of redlining or exclusion, while low-income borrowers are subject to reverse-redlining with racial premiums on their interest rates. And yet our group fairness approach says everything is okay. This might be a reason why that 2019 study shows persistent racism in lending in spite of stringent fairness rules.
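The pattern is easy to reproduce with made-up numbers. In the sketch below, the interest rates are hypothetical values chosen so that both racial groups average 20%. They are not data from the paper or from any lender.

```python
from statistics import mean

# Hypothetical APRs in percent, one borrower per income tier for brevity.
apr = {
    ("black", "high"): 10.0, ("black", "middle"): 20.0, ("black", "low"): 30.0,
    ("white", "high"): 15.0, ("white", "middle"): 20.0, ("white", "low"): 25.0,
}

def group_mean(race):
    """Average APR across all income tiers for one racial group."""
    return mean(rate for (r, _), rate in apr.items() if r == race)

# Group-level check: difference in average APR between the two groups.
group_gap = group_mean("black") - group_mean("white")

# Subgroup-level check: the same difference, computed within each income tier.
tier_gaps = {
    tier: apr[("black", tier)] - apr[("white", tier)]
    for tier in ("high", "middle", "low")
}

print(f"Group-level gap: {group_gap:+.1f} points")
for tier, gap in tier_gaps.items():
    print(f"{tier:>6}-income gap: {gap:+.1f} points")
```

The group-level gap is zero, so a parity audit passes, yet low-income black borrowers in this toy table pay five points more than their white counterparts. The favorable rate at the top is what hides the premium at the bottom.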
Introducing individual fairness
Because we can’t anticipate all possible subgroups, an approach called individual fairness emerges as the way we can deal with subgroup complexity to ensure that similar individuals like Tyree and Ryan are in fact treated similarly. In a 2011 paper, Cynthia Dwork and her co-authors proposed individual fairness as follows.
Let a machine learning model be a map $h: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are the input and output spaces. Individual fairness is defined as the satisfaction of this inequality:
$$d_y(h(x_1), h(x_2)) \le L \, d_x(x_1, x_2) \quad \text{for all } x_1, x_2 \in \mathcal{X}$$
where $d_x$ and $d_y$ are metrics on the input and output spaces and $L \ge 0$ is a Lipschitz constant. The fair metric $d_x$ encodes our intuition of which samples should be treated similarly by the machine learning model. In other words, assuming a distance metric to define similarity (for example, between two people of similar creditworthiness), we’re going to enforce upon the model a Lipschitz condition, such that the distributions over outcomes are indistinguishable up to their distance. For example, if Tyree is $x_1$ and Ryan is $x_2$, their outcomes should be similar because their distance from one another is very small.
Theoretically, this solves the problem, but practically there are two challenges here:
- In this approach it’s not possible to provide any practical generalization guarantees due to the requirement that the inequality holds for all possible pairs of individuals.
- Defining an appropriate fair metric is non-trivial.
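Both challenges show up as soon as one tries to check the condition directly. The sketch below tests every pair of applicants against a Lipschitz bound. The metrics `d_x` and `d_y`, the constant `L`, and the applicant data are all hypothetical stand-ins, and the fair metric here is simply asserted rather than learned.

```python
from itertools import combinations

def lipschitz_violations(xs, outputs, d_x, d_y, L):
    """Return index pairs whose outputs differ by more than L times their input distance."""
    violations = []
    for i, j in combinations(range(len(xs)), 2):
        if d_y(outputs[i], outputs[j]) > L * d_x(xs[i], xs[j]):
            violations.append((i, j))
    return violations

def d_x(a, b):
    """Hypothetical fair metric: compares scaled income and credit score, nothing else."""
    return abs(a["income"] - b["income"]) / 100_000 + abs(a["score"] - b["score"]) / 850

def d_y(p, q):
    """Distance between two approval probabilities."""
    return abs(p - q)

applicants = [
    {"name": "Tyree", "income": 60_000, "score": 700},
    {"name": "Ryan", "income": 61_000, "score": 700},
    {"name": "Alex", "income": 120_000, "score": 800},
]
approval_prob = [0.30, 0.80, 0.95]  # hypothetical model outputs

violations = lipschitz_violations(applicants, approval_prob, d_x, d_y, L=2.0)
for i, j in violations:
    print(f"{applicants[i]['name']} and {applicants[j]['name']} are treated too differently")
```

Only the Tyree and Ryan pair is flagged. The loop visits every pair, so its cost grows quadratically with the number of applicants, and it can only speak to the individuals it has actually seen. The answer also depends entirely on how `d_x` was chosen.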
Distributionally Robust Fairness
At ICLR 2020, Mikhail Yurochkin and his co-authors proposed an approach to individual fairness that draws from ideas in adversarial robustness. It’s called Distributionally Robust Fairness (DRF) and it represents the first practical approach to individual fairness.
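Again we let a machine learning model be a map $h: \mathcal{X} \to \mathcal{Y}$, and with respect to the fair metric $d_x$ we say it is $(\epsilon, \delta)$-DRF-fair if it satisfies the following condition:
$$\sup_{P:\, W(P, P_n) \le \epsilon} \int_{\mathcal{Z}} \ell(z, h) \, dP(z) - \int_{\mathcal{Z}} \ell(z, h) \, dP_n(z) \le \delta$$
where $\ell$ is a loss function evaluated on examples $z = (x, y) \in \mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, $h$ is the machine learning model, $P_n$ is the empirical distribution of the data, $\epsilon > 0$ is a small tolerance parameter, and $W(P, P_n)$ is the Wasserstein distance between distributions $P$ and $P_n$ with the cost $d((x_1, y_1), (x_2, y_2)) = d_x(x_1, x_2) + \infty \cdot \mathbb{1}(y_1 \neq y_2)$. Comparing this definition to the Dwork definition of individual fairness, here the loss function $\ell$ replaces the metric on the output space $d_y$ and the tolerance parameter $\epsilon$ is analogous to the Lipschitz constant $L$. The crucial difference is that DRF considers average violations of individual fairness, making it amenable to statistical analysis that establishes generalization guarantees. DRF measures how much the loss can increase when considering the fair neighborhood of the observed data.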
We can use an algorithm called SenSR to train a machine learning model to satisfy the DRF condition. SenSR can obtain its fair metric from the data in one of two ways: from a sensitive subspace, or from EXPLORE, an algorithm we propose.
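To give a feel for the sensitive-subspace option, here is a minimal sketch of one way such a metric can be built: project out a direction in feature space that is assumed to encode the protected attribute, then measure ordinary distance in what remains. The single sensitive direction, the feature layout, and the function names are assumptions made for illustration. This is not the EXPLORE algorithm and not necessarily the exact construction used in SenSR.

```python
import math

def project_out(diff, direction):
    """Remove from diff its component along the given direction."""
    scale = sum(d * v for d, v in zip(diff, direction)) / sum(v * v for v in direction)
    return [d - scale * v for d, v in zip(diff, direction)]

def fair_distance(x1, x2, sensitive_direction):
    """Euclidean distance after ignoring movement along the sensitive direction."""
    diff = [a - b for a, b in zip(x1, x2)]
    residual = project_out(diff, sensitive_direction)
    return math.sqrt(sum(r * r for r in residual))

# Hypothetical features: [scaled income, scaled credit score, race proxy].
sensitive_direction = [0.0, 0.0, 1.0]
tyree = [0.60, 0.70, 1.0]
ryan = [0.60, 0.70, 0.0]
alex = [0.90, 0.70, 0.0]

print(fair_distance(tyree, ryan, sensitive_direction))
print(fair_distance(ryan, alex, sensitive_direction))
```

Under this metric Tyree and Ryan are at distance zero, because they differ only along the direction that is ignored, while Ryan and Alex remain apart because their incomes differ. SenSR then trains against whichever fair metric it is given.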
Here’s how it works in higher-level terms:
- Observe the data and learn a fair metric.
- Generate a fictional set of all possible similar individuals to see which ones have different outcomes, as measured by the loss.
- Take the single worst counterfactual, i.e. the most unfair example, then update the parameters to account for it. See the image below for a rough representation, in which the largest X is the single worst counterfactual.
- Repeat until the DRF condition is met.
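The loop above can be sketched in a few dozen lines. What follows is a deliberately simplified stand-in and not SenSR itself: it assumes a logistic regression model, a single known sensitive axis instead of a learned fair metric, and a fixed handful of counterfactual shifts instead of a search over a Wasserstein ball. The data set is hypothetical and constructed so that a race proxy predicts historical approvals.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    """Logistic loss of weights w = [w_legit, w_proxy, bias] on one example."""
    p = sigmoid(w[0] * x[0] + w[1] * x[1] + w[2])
    p = min(max(p, 1e-12), 1 - 1e-12)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def worst_counterfactual(w, x, y, shifts):
    """Among copies of x moved along the proxy axis, return the one with the highest loss."""
    candidates = [(x[0], x[1] + s) for s in shifts]
    return max(candidates, key=lambda c: loss(w, c, y))

def train(data, robust, steps=3000, lr=0.1, shifts=(-1.0, 0.0, 1.0)):
    """Gradient descent; if robust, each step trains on the worst counterfactuals."""
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for x, y in data:
            if robust:
                x = worst_counterfactual(w, x, y, shifts)
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) - y
            grad[0] += err * x[0]
            grad[1] += err * x[1]
            grad[2] += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w

# Hypothetical biased history: ((legitimate feature, race proxy), approved).
data = (
    [((0.0, 0.0), 1)] * 1 + [((0.0, 0.0), 0)] * 3
    + [((1.0, 0.0), 1)] * 3 + [((1.0, 0.0), 0)] * 1
    + [((0.0, 1.0), 1)] * 1 + [((0.0, 1.0), 0)] * 1
    + [((1.0, 1.0), 1)] * 9 + [((1.0, 1.0), 0)] * 1
)

def approval_gap(w):
    """Gap in approval probability between two applicants who differ only in the proxy."""
    return abs(sigmoid(w[0] + w[1] + w[2]) - sigmoid(w[0] + w[2]))

w_plain = train(data, robust=False)
w_robust = train(data, robust=True)
print(f"proxy weight: plain={w_plain[1]:.2f}, robust={w_robust[1]:.2f}")
print(f"approval gap: plain={approval_gap(w_plain):.2f}, robust={approval_gap(w_robust):.2f}")
```

In this toy setting the plain model leans on the proxy, while the robustly trained model is pushed to keep the proxy weight near zero, because any reliance on it is punished by some counterfactual. The published method generalizes this idea to a learned metric and a principled perturbation budget.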
In this way, we guard against the unfair outcome in which Tyree is denied while Ryan is approved, despite them being of similar creditworthiness.
Experiment 1: Fighting bias against African American names
The experiments run by Yurochkin et al. demonstrated this principle empirically using public datasets since actual lending datasets are too sensitive to access in this forum.
For example, they trained an NLP sentiment analysis model using pre-trained word embeddings for positive and negative words, then ran the model on the list of common Caucasian and African American names. Unlike words such as trustworthiness versus corruption, which connote clear positive and negative sentiments, names should be relatively neutral, and certainly so across race. Unfortunately, because we do have racial biases in our NLP training data, names do in fact produce positive and negative sentiments according to these models. In a post-racial world, there wouldn’t be a discrepancy between common white names and common black names. In reality, you can see African Americans in the dark blue suffer from a significant negative bias.
If we were to use group fairness to correct for this, we could effectively bring the means together to zero, but this is pretty superficial because the range still allows for massive disparity in the treatment of similar individuals like Tyree and Ryan. With SenSR, however, the gap is closed effectively. Tyree and Ryan, and virtually everyone else, have much stronger guarantees of equitable treatment without regard to race. And critically, in this experiment and others, accuracy is not compromised.