Machine Bias: There’s Software Used Across the Country to Predict Future Criminals. And it’s Biased Against Blacks. – ProPublica
When a full range of crimes were taken into account — including misdemeanors such as driving with an expired license — the algorithm was somewhat more accurate than a coin flip. Of those deemed likely to re-offend, 61 percent were arrested for any subsequent crimes within two years.
We also turned up significant racial disparities, just as Holder feared. In forecasting who would re-offend, the algorithm made mistakes with black and white defendants at roughly the same rate but in very different ways.
- The formula was particularly likely to falsely flag black defendants as future criminals, wrongly labeling them this way at almost twice the rate as white defendants.
- White defendants were mislabeled as low risk more often than black defendants.
Could this disparity be explained by defendants’ prior crimes or the type of crimes they were arrested for? No. We ran a statistical test that isolated the effect of race from criminal history and recidivism, as well as from defendants’ age and gender. Black defendants were still 77 percent more likely to be pegged as at higher risk of committing a future violent crime and 45 percent more likely to be predicted to commit a future crime of any kind.
So the issue in using the sw is what?
So computer programmers are racist?
SW code is innately racist?
The code is just crap at forecasting?Report
Code is crap and in a way we can’t ignore.Report
Oh I agree. It’s bad code. It doesn’t predict reality. I just wasn’t sure whether or not anyone was trying to assert that the code was racist ’cause white guys likely wrote it.Report
When a bot makes another bot racist, I don’t think you can say it’s the programmers’ fault. Microsoft’s stupidity aside…Report
@damon
I doubt anyone intentionally inserted things into the code to produce the outcomes we are seeing. I think it is possible that racialized attitudes and associations among those who developed the code might have contributed to the biased output.
Regardless (and I attempted to edit my initial comment to include this but must not have submitted it), bad code is bad code and should be fixed or abandoned. That this particular code is bad in such a way that exacerbates existing biases in the system only makes the need for action more pressing. But, yea, any bias in the code needs to be addressed.
Here’s a question: Imagine that the code is biased but less biased than the previous, human-based system… would we still use it?Report
” I think it is possible that racialized attitudes and associations among those who developed the code might have contributed to the biased output.”
How could that be the case? I have to assume that the code was tested for predictive accuracy. I mean, really, no testing at all? Then again, something that generates results so different from the facts may not have been tested ENOUGH.
“Imagine that the code is biased but less biased than the previous, human-based system… would we still use it?” We could. If it’s demonstrably better, shouldn’t we, even knowing its limitations? What should we use instead if not this?Report
If it’s demonstrably better, you use it, and you put your thumb on the scales where you know that it’s weak.
We do this in medicine too — you get a ballpark, and then you go “well, here’s where you’re different from ‘generic patient.’”Report
@damon
It is hard to say much about the development of the code because of the “black box” approach of its maker. But if a coder was empowered to say, “Hey, I think Variable Y correlates with recidivism so I’m going to make that a high-value marker,” and his belief is based on some sort of biased understanding/assumption, then that bias is leaking in. It’d seem odd to develop the code that way but, again, we don’t know. I’m not saying that is what happened. Just that it isn’t impossible.
It’d seem to me that a less-biased code is better than a more-biased person. Then again, we see lots of opposition to driverless cars because of the non-zero risks they carry… even though the risks are MUCH less than that of human drivers. For whatever reason, we are uncomfortable with imperfect machines.Report
Also, are the results subsequently checked against reality?
If Amazon shows you two ads you don’t click on and one you do, that improves the accuracy of their ad targeting algorithm that very night.
It sounds like in this case, if the software predicts recidivism in two people who don’t reoffend and one who does, that fact is fed back into the system to improve its accuracy exactly never.Report
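A minimal sketch of the feedback loop described above, using scikit-learn’s incremental `partial_fit` on made-up data (feature names and numbers are purely hypothetical, not the vendor’s actual pipeline):

```python
# Sketch: fold observed outcomes back into a risk model as they arrive.
# An ad system does something like this nightly; the commenter's point is
# that nothing analogous appears to happen with these risk scores.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss", random_state=0)

# Initial training batch: questionnaire-style features -> reoffended within 2 years?
X_initial = rng.normal(size=(500, 6))
y_initial = rng.integers(0, 2, size=500)
model.partial_fit(X_initial, y_initial, classes=[0, 1])

# Later: real outcomes for three previously scored defendants come in.
X_new = rng.normal(size=(3, 6))
y_new = np.array([0, 0, 1])        # two did not reoffend, one did
model.partial_fit(X_new, y_new)    # weights updated against reality
```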
I have to assume that the code was tested for predictive accuracy.
Looks like that might not be a sound assumption:
Also this really stuck out to me:
In a study of the accuracy of the software relative to race, they checked for rates of error and found them comparable – but it never occurred to them to compare separately the rates of false positives vs. false negatives?Report
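For what it’s worth, the comparison the commenter has in mind is easy to run once you have predictions and outcomes; a small sketch on invented data (group labels and numbers are hypothetical):

```python
# Overall error rates can match while false-positive and false-negative
# rates differ sharply by group; this just shows the computation.
import numpy as np

def fp_fn_rates(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fpr = np.mean(y_pred[y_true == 0] == 1)  # flagged high risk, did not reoffend
    fnr = np.mean(y_pred[y_true == 1] == 0)  # flagged low risk, did reoffend
    return fpr, fnr

# y_true: reoffended?  y_pred: labeled high risk?  group: invented labels
y_true = np.array([0, 0, 1, 1, 0, 0, 1, 1])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 1])
group  = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

for g in ("A", "B"):
    m = group == g
    print(g, fp_fn_rates(y_true[m], y_pred[m]))
```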
Hah…
Sad to say I’m not surprised. Can’t say how many times I’ve seen folks make decisions on something other than data only to have it come back on them.Report
https://en.wikipedia.org/wiki/Inductive_biasReport
In one sense, it is just bad statistics. On the other hand, because (for obvious reasons) they cannot directly ask about race, they cannot factor that into the analysis.
The sad fact is, the analysis might be better if they could include race, if race were negatively correlated with factors that, in whites, are positively correlated with recidivism. If those factors are common enough, and if they are fairly predictive across the broad population, then blacks lose.
On the other hand, if you include race, and if race then ends up with a positive coefficient, well that’s a fucking political nightmare. In that case, the software would then punish blacks merely for being black.
Blah.
It would be possible to set the model up so that race could be included only as a negative cofactor for other variables. Then you avoid direct correlations. But still, just asking race is going to set off alarm bells. Would we ever trust some whitebread big-company that’s comfy with the folks in Tallahassee?
(A big part of Florida politics involves shady deals with sketchy companies whose CEOs play golf with the “not-so-good old boys” in Tallahassee. They are really good at setting up contracts with shitty oversight. It’s manifestly ugly.)
In any event, the “racists” that Damon is asking about are everyone in the system who kinda doesn’t care that black individuals get a raw deal under this system. That is how structural racism works. All of us who find the system basically comfortable, cuz we’re seldom at the wrong end of the imbalance. Why change? It feels okay.
Double blah.
Does anyone expect black people to be okay with this? Perhaps they should be incredibly furious.Report
This was my question: is race a variable in the data set the software is analyzing?Report
From what I read, no. I read a piece on this the other day (not sure if it is the one quoted here or not). One thing it mentioned is that the companies (yes, most of these are made by third-party private companies) consider their algorithms proprietary and therefore refuse to allow journalists to see the whole kit-and-caboodle. They say they don’t use race.
And I’d be really surprised if they said, “Enter Race: If white, subtract 5 violence points; if black, add 5 violence points.” It doesn’t seem like we know exactly how or why the code is spitting out what it is spitting out and probably can’t know unless/until they open up the black box.Report
@kazzy — The article actually links to some of the questions they ask, which gives me a good guess what is happening. Basically, this looks like uneven correlations. In other words, they ask questions such as “if your parents separated, how old were you?” and “How many of your friends have been arrested?” and such.
The thing is, these probably track differently between black subjects and white subjects. But more, they perhaps correlate differently between blacks and whites. Which is to say, black culture has more of this stuff (probably as a whole), and thus it tells us less about the subject. However, because the correlations are built from the entire data set, the variables are thus weighted according to how they affect a mix of white and black subjects. In other words, they are too objective. They fail to account for how the variables have (in a sense) a different meaning in the life of a typical black person versus a typical white person.
Does that make sense?
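A toy simulation of that idea, with all numbers invented: a feature (say, “friends arrested”) that is common but weakly predictive in group B and rarer but strongly predictive in group A. A single pooled, race-blind weight splits the difference, and non-reoffenders in group B end up flagged more often:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 20000
group = rng.integers(0, 2, size=n)                           # 0 = A, 1 = B
x = rng.normal(loc=np.where(group == 1, 1.0, 0.0), size=n)   # feature more common in B
true_coef = np.where(group == 1, 0.3, 1.5)                   # but less predictive in B
y = rng.binomial(1, 1 / (1 + np.exp(-(true_coef * x - 0.5))))  # 1 = reoffended

pooled = LogisticRegression().fit(x.reshape(-1, 1), y)       # single pooled weight
flag = pooled.predict_proba(x.reshape(-1, 1))[:, 1] > 0.5    # "high risk" label

for g, name in [(0, "A"), (1, "B")]:
    m = (group == g) & (y == 0)                              # people who did NOT reoffend
    print(name, "false positive rate:", round(flag[m].mean(), 3))
```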
#####
Note that they also ask if the subject is a “gang member or suspected gang member?” (emphasis mine) This is an obvious place where police bias will slip in. I bet the question correlates strongly, and if the cops are unfair in how they respond, then black people get a raw deal.
Anyhow, without seeing the fitted model the company uses, we cannot know.
#####
It is DEEPLY FUCKING UNJUST for the police to use tools like this without oversight and transparency. Florida has long had a particular problem with this kind of bullshit. I cannot imagine how this is not rejected in equal measure by both liberals and conservatives.Report
Ding! Ding! Ding! Give the lady a cigar, or kewpie doll, or whatever they hand out these days (128 GB memory stick?). If you don’t know the race/ethnic background, you can’t possibly set your variable weights correctly.Report
Thanks @veronica-d and @oscar-gordon . That makes sense. I have only the simplest of backgrounds in stats and none in tech/coding.
Above when I discussed with @damon the potential for coder bias filtering in, I think I implied more intent than I meant to. I think calling “the code” racist is silly; it is code, inanimate and simply executing what its designers told it to do. And I wouldn’t even call the coders themselves racist. If anything, it is ignorance — of both stats and cultural stuff.
This reminds me of that computer company a few years back that developed facial recognition software for their cameras that didn’t detect Black folk. In reality, the software was poorly designed and could not pick up darker skin tones. I don’t know what went wrong on the software end, but as there were apparently no dark-skinned people involved in the creation, the problem wasn’t noticed. A clear failure, and one that is related to systemic and institutional bias, but the idea that the company (was it HP?) was racist and made racist computers was just silliness.
However, given the gravity of the decisions this code is being used to make, holy fuck how did they let this happen?Report
Because software development requires customer feedback. If all your customers are telling you the code is doing great and giving them exactly the data they want/like/expect, how are they supposed to know they need to tweak things? Developers are not social scientists.
The better question is, did the development company have social scientists on staff or retainer helping them develop the tool, or did they just grab a bunch of research papers and crib something together?Report
It seems this is a situation where we shouldn’t apply the typical business model. Before the government employs a system like this, it needs to be vetted and vetted again.Report
Absolutely, but that isn’t a failure of the business model, it’s a failure of government to practice due diligence when choosing their tool.
I find this typical, unfortunately, given how the software I help develop is used. Government requires that we are able to create extremely accurate & consistent results before they will accept our simulation results during given stages of regulatory approval for something, but the software they use is approved with only a cursory vetting*.
*Granted, we are dealing with vastly different levels & departments of government.Report
I wonder, would there be some kind of requirement that the code not include feedback to observe its success and adjust its model accordingly?
As I alluded to above, if I click on an ad in a Google website but not another one, those data points immediately go to improve Google’s ability to target ads.
Which in a way is maybe possible because the stakes are so low – each misjudged advert costs the tiniest fraction of a penny, so Google is free to tinker with their weightings until they get the very best accuracy.
But in this case, a misjudged risk score costs months or years of someone’s life, and/or leads to someone being victimized by a parolee who probably shouldn’t have gotten parole. So the stakes are so high there’s a reluctance to tinker with the weightings, leaving the mediocre-to-coin-toss-random accuracy impervious to improvement.Report
That’s what I meant. The government should employ a different system than the company’s typical clients.Report
Or, even worse, did they use values provided by their customers, thus enshrining all their biases into the system since of course they’re going to get back the results they expected?Report
Heh, yeah, that’ll skew the results.Report
At a minimum, yes, in that their clients are the ones doing the arresting and re-arresting – your likelihood of being arrested for something like driving with an expired license depends on
– your likelihood of doing so
– the likelihood a cop is in your neighbourhood
– the likelihood the cop pulls you over and checks your license
– the likelihood the cop arrests you rather than ticketing you
Only the first depends on the individual – the rest is all largely an expression of the biases of the client.Report
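A rough numeric illustration of that decomposition, with all probabilities invented for the sake of the example:

```python
# Same underlying behavior, very different arrest rates once patrol density,
# stop rates, and arrest-vs-ticket discretion differ by neighborhood.
p_offense = 0.10                    # chance of driving on an expired license (same for both)

neighborhood_a = 0.2 * 0.3 * 0.2    # patrol presence * stop rate * arrest-not-ticket rate
neighborhood_b = 0.6 * 0.5 * 0.5

print("A:", p_offense * neighborhood_a)   # 0.0012
print("B:", p_offense * neighborhood_b)   # 0.015, ~12x higher for identical behavior
```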
@kazzy — Honestly, tho, if this is the sort of company that gets hired by police departments in Florida, there is near certainty that the management is racist-as-fuck. The problem is not that they deliberately design the algorithms to fuck over black people. This is not that kind of naked malevolence. The problem is, they probably don’t really care if it fucks over black people. In fact, fucking over black people is likely seen as a not-really-bad side effect of an otherwise lovely algorithm.
The reality is, to not fuck over black people sometimes requires a bit of extra effort. It requires research, insight, exploration. It also requires a kind of humility, to see that your own worldview is limited and that your whitebread life experience is not the default.
Which, data is data. A classification model is a classification model. An error is an error. But which errors? Is there a bias? Against whom?
There is always a bias, at least in any good algorithm. (Which, actually, is a technical statement.)
In any case, a socially responsible company would predict that the algorithm is likely going to be not-race-neutral, and they would invest in research to mitigate this. It can be done. Additional statistical methods would be needed. (Myself, I’m imagining a hierarchical Bayesian model that tries to capture the effect of race/culture/etc., without directly measuring those variables. But anyway, there are many ways.)
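One possible shape of that kind of partial-pooling model, sketched in PyMC on placeholder data. To be clear, this is a guess at what such a model could look like, not anything the vendor is known to do, and it assumes some observed grouping proxy `g` (e.g., neighborhood) rather than race itself:

```python
# Hierarchical (partially pooled) logistic regression: each group gets its own
# weights, shrunk toward population-level weights, so sparse groups borrow
# strength from the whole data set. All data here is synthetic placeholder.
import numpy as np
import pymc as pm

rng = np.random.default_rng(2)
n, k, n_groups = 2000, 4, 8
X = rng.normal(size=(n, k))              # questionnaire-style features (hypothetical)
g = rng.integers(0, n_groups, size=n)    # grouping proxy (assumption of this sketch)
y = rng.integers(0, 2, size=n)           # placeholder outcomes

with pm.Model():
    mu = pm.Normal("mu", 0.0, 1.0, shape=k)                   # population-level weights
    sigma = pm.HalfNormal("sigma", 1.0)                       # how much groups may differ
    beta = pm.Normal("beta", mu, sigma, shape=(n_groups, k))  # per-group weights
    intercept = pm.Normal("intercept", 0.0, 1.0)
    logits = intercept + (X * beta[g]).sum(axis=1)            # score each person with their group's weights
    pm.Bernoulli("reoffend", logit_p=logits, observed=y)
    idata = pm.sample(1000, tune=1000)
```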
The point is, however, they have to want to do it. They surely do not. Nor do the cops. Fuckers.
#####
Yeah the facial recognition error was HP, and indeed you are correct, the whitebread software engineers simply didn’t think about black people.
My own employer has made similar mistakes. It’s endemic in Silicon Valley culture, which is at least 3940239499943030300304 times more motivated to be socially responsible than your average Florida not-so-good-ol-boy.
(I used to live in Florida, by the way. It’s a cesspit of terrible.)Report
If they admitted to being a gang member that is clear cut. If they have gang tattoos that is pretty good evidence of gang involvement as well. Do you think police check the box just cuz?Report
If they have a tribal tat then they must belong to an actual tribe.Report
Sure some made up hipster tribe?Report
Maybe the Bad Boy Club or whatever a barbed wire tat is supposed to mean.Report
There are NCIC codes for tattoo positions and classifications. Having two classifiers who have both been properly trained to agree on the position and classification of a given tattoo is not a certainty. You can imagine how useful the input of Officer Random Joe, or, worse, a third-party identification, is to an automated system.Report
Yeah, I know. There are plenty of white power tats that are meant to be clear signals. However, I wasn’t making a serious comment; I was just messing around regarding the dorky tats that mall-rat tough guys think look mean.Report
Oh, I know – I just used your comment as a springboard since this is something I have actual professional experience with, so I can contribute to the thread with more than just snark, and run-on sentences with Oxford commas that would make the Texas GOP blush(*).
Some pie-in-the-sky idealists would like to do automatic tattoo recognition. You can imagine the roadblocks you have to hurdle before you can even consider that you’ve properly started that task.
(*) Yes, that was one sentence. I take pride in my writing.Report
I can accept many things…but don’t drag the Devil’s Own Punctuation mark, the Oxford Comma, into this.Report
You can have my Oxford Comma when you pry it from my mouldering, cold, dead fingers. The comma makes a difference.Report
It makes a difference, yes – but I think the second party would probably be more fun.Report
There’s the semicolon tattoo, now I’m trying to think how one would go about making an Oxford comma tattoo (that would be distinct from a regular old comma).
Maybe commas between the first knuckles of one hand – the comma between index and thumb, if present, would be the Oxford comma…Report
I have no doubt that the police are usually correct about gang affiliations. However, I would not trust that they are always correct. Nor would I trust that their errors are “fair and neutral.” In other words, they would almost certainly not mark me, veronica d, as being a likely gang member. They would be correct. I am not in a gang. But a young black dude in a hoodie, who is sullen, talks back? Maybe they’ve seen him talking to some “known gang members,” and thus conclude that he is a member. But perhaps he is not.
If you ask them, no doubt they will express certainty. The public will believe them. After all, we’ve all seen cop shows. We know that the cops know all the gang members. Cop shows couldn’t possibly be false, could they?
Honestly, I have no idea. I doubt any of you do either.
(I know that TV shows give a preposterously inaccurate view of transgender life. I think they give a pretty bogus view of the lives of sex workers. I assume “gangsta life” and “cop life” and “mobster life” and so on are similarly dramatized and bogus.)Report
Also: what makes a group of people who spend time together a “gang” vs. “not a gang”? The police identifying them as such, presumably.
I mean, yes, there are some objective-sounding criteria one could apply. But the more you look closely at it, the more it seems like “I know it when I see it”
(group being convened specifically for the purpose of crime = gang)
(no wait, a white collar extortion racket isn’t it – OK, crimes other than ones of insider trading and fraud type stuff)
(no wait, those folks who always smoke up and go for a walk together after work don’t count – non-white-collar and non-victimless crimes)
(no wait, those hick white boys who go playing mailbox baseball after school don’t count – non-white-collar, non-victimless, profit-oriented crimes)
(no wait, those guys who write graffiti together count even in the absence of profit motive)
And gosh, the more you try and fail to get to a good tight definition, the more “I know it when I see it” starts to asymptotically approach “I know it when I see its zip code and skin colour”Report
I get that you really want the definition to be racist, but…violent crime?Report
Off the top of my head, the Hell’s Angels and Aryan Brotherhood are typically described as gangs.Report
The existence of things that are not about race doesn’t mean there aren’t things that are about race. Of course the really clear cases remain really clear.
The Aryan Brotherhood, Hell’s Angels, Cosa Nostra, Triads, Indian Posse, Medellin Cartel, Crips, etc., remain clearly gangs. The Shriners, church choir (yes, even at a black church!), Meals on Wheels, etc., remain clearly not gangs.
It’s the edge cases I’m talking about – the group of friends who would generally be considered nogoodniks, who sometimes get into fights and generally cause trouble – the determinants of whether they’re a “gang” or “just a bunch of troublemakers” are not race, but they are racially coded. Hence my example of “black coded” graffiti vandalism raising the likelihood of a group being labeled a “gang” but “white coded” mailbox baseball vandalism raising the likelihood of a group being labeled a “bunch of troublemakers.”Report
Also, as even P.J. O’Rourke noted, in a perfect world the process would adjust for “Of course I was talking to known gang members. Half the dudes on my block are known gang members.”Report
It would seem to me that this would open up the state and/or code companies to pretty big lawsuits. Proving bias in someone’s mind is really difficult; you need some sort of smoking gun. But if you can point to the code and say, “See? It unfairly gave me a higher risk score and I spent a year longer in prison than I should have,” that seems like a pretty ironclad case. But what do the legal beagles say?Report
This is a long piece and there’s a lot here, but IMHO the most striking result was:
[EDIT] I read the ORs wrong, this is a huge disparity but not completely flipped.Report
All that tells me is that the weight they give the variables is off, nothing more.Report
Dude, their “algorithm” is just a score, the only thing they can do is give the wrong weights to the variables.Report
I don’t know the specifics of how the software works, of course, but generally the weights in machine learning algorithms are determined by finding the best fit for a set of training data. Hence “machine learning”—the machine learns by studying the data. Which is to say, it’s unlikely that any human actually chose the weights.
If I had to guess, I would say that the issue is likely that for various reasons some variables correlate with outcomes differently in blacks and whites. If that’s the case, it seems likely that the accuracy could be improved by using race as a variable.Report
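A quick, hedged check of that conjecture on synthetic data (all numbers invented): when a feature’s effect differs by group, letting the model see the group, here via an interaction term, improves held-out log loss:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 20000
grp = rng.integers(0, 2, size=n)
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-np.where(grp == 1, 0.3, 1.5) * x)))  # effect differs by group

X_blind = x.reshape(-1, 1)                    # model never sees the group
X_aware = np.column_stack([x, grp, x * grp])  # group indicator + interaction

for name, X in [("group-blind", X_blind), ("group-aware", X_aware)]:
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = LogisticRegression().fit(Xtr, ytr)
    print(name, "held-out log loss:", round(log_loss(yte, clf.predict_proba(Xte)[:, 1]), 4))
```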
Adding race to the process can be beneficial – for example, blacks had worse outcomes in heart disease treatment for years before studies were re-run with more granularity and discovered that there is a statistically significant racial component to what treatments do and don’t work for a given individual.
The problem, as @veronica-d has stated, and as I can attest from a decade of experience – the criminal justice system is basically as racist as fuck. So introducing race into a discussion in that environment can be … fraught.Report
Well, we can’t know the specifics of the software because it’s proprietary (!), but every machine learning algorithm – whether it’s OLS or SVMs or boosted regression trees or neural networks – is, fundamentally, trying to fit weights to features/variables in the data. Saying “well they just got the weights wrong, nothing more” is like saying “that bridge builder just didn’t include enough structural supports, nothing more”. Getting the weights right is the whole task here.Report
If your model is so open that just getting a few variables wrong provides that much of a discrepancy in outcomes, you don’t really have a model of reality, you just have a spreadsheet.Report
Taking a closer look at the details of Pro Publica‘s analysis here, I think I see what’s going on. The software actually looks like it’s doing a lot better than the main story implies, and with less racial bias.
First, a clarification of what this software actually does. It doesn’t claim to accurately predict which defendants will reoffend. That’s science-fiction stuff, if not outright fantasy. What it does is classify defendants into high-risk and low-risk groups (actually, there are risk scores, not just two baskets, but Pro Publica’s analysis breaks it into high- and low-risk groups). To the extent that defendants in the high-risk group reoffend at higher rates than defendants in the low-risk group, the model is working. That’s what “risk” means. It’s probabilistic.
So how does it stand up under that criterion? For defendants in the high-risk category, 63% of black defendants and 59% of white defendants reoffended. In the low-risk category, 29% of white defendants and 35% of black defendants reoffended. For violent crimes, 21% of black and 17% of white high-risk defendants reoffended, vs. 9% of black and 7% of white low-risk defendants.
If we don’t come in with science-fiction expectations, the algorithm actually isn’t doing such a bad job. High-risk defendants really do reoffend at a much higher rate than low-risk defendants, and there aren’t major differences in the predictive value of these categorizations between races. In fact, from this perspective, it seems to be biased slightly against white defendants. And this is just chunking it up into “high risk” and “low risk.” I suspect that it does better when you break it down into smaller risk score brackets, rather than two categories subsuming broad risk score ranges.
So how do we resolve these conflicting views of the data? Why does it look so much worse the way they tell it? First, let’s clarify what their statistics mean, since they’re phrased ambiguously. When they say that 23.5% of white and 44.9% of black defendants were classified as high-risk but didn’t reoffend, that doesn’t mean that 23.5% and 44.9% of white and black defendants classified as high-risk didn’t reoffend. It means that of the white and black defendants who didn’t reoffend, 23.5% and 44.9% were labeled as high-risk. Likewise, of those who did reoffend, 47.7% of white defendants and 28% of black defendants were classified as low risk.
This apparent paradox is a result of two facts: a) black defendants really were more likely to reoffend (51% vs. 39% overall) and thus appropriately classified as high-risk more often, and b) the classifications are probabilistic, and not all high-risk defendants reoffend.
Let’s suppose, to illustrate how this works, that a high-risk defendant has a 60% chance of reoffending, and a low-risk defendant has a 30% chance of reoffending. Furthermore, let’s suppose that 20% of male defendants are low-risk and 80% are high-risk, and that 80% of female defendants are low-risk and 20% are high-risk. For the sake of this example, all of this is correct by postulate*.
So what happens? For men, we get 6% false negative, 14% true negative, 32% false positive, and 48% true positive. For women, we get 24% FN, 56% TN, 8% FP, and 12% TP.
Note that the false positive rate for men is four times the false positive rate for women, and the false negative rate for women is four times the false negative rate for men. Of men who didn’t reoffend (TN + FP = 46% of total), 70% (32 / (14 + 32)) were classified as high-risk, compared to 13% (8 / (56 + 8)) for women. The real kicker is that we get this huge skew in these numbers despite the fact that women are fully 2/3 as likely to reoffend as men, with 54% of men (6% FN + 48% TP) and 36% of women (24% FN + 12% TP) reoffending.
These numbers are even more skewed than the ones Pro Publica published, and yet, by assumption, the model is correct and not biased by sex in any way. The skew in these numbers is the product of the fact that high-risk defendants don’t always reoffend and low risk defendants sometimes do, and that a greater percentage of men than women are high-risk.
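For anyone who wants to check the arithmetic, here is a short script that reproduces the hypothetical example above (the 60%/30% reoffense rates and the 80%/20% high-risk shares are the postulated values, not real data):

```python
# Confusion-matrix cells implied by the postulated rates.
def cells(share_high, p_high=0.60, p_low=0.30):
    share_low = 1 - share_high
    tp = share_high * p_high          # high-risk and reoffended
    fp = share_high * (1 - p_high)    # high-risk, did not reoffend
    fn = share_low * p_low            # low-risk, reoffended
    tn = share_low * (1 - p_low)      # low-risk, did not reoffend
    return tp, fp, fn, tn

for label, share_high in [("men", 0.80), ("women", 0.20)]:
    tp, fp, fn, tn = cells(share_high)
    print(label,
          f"TP={tp:.0%} FP={fp:.0%} FN={fn:.0%} TN={tn:.0%}",
          f"| high-risk share of non-reoffenders={fp / (fp + tn):.1%}",
          f"| reoffense rate={tp + fn:.0%}")
```

Running it gives 48/32/6/14 for men and 12/8/24/56 for women, with 69.6% vs. 12.5% of non-reoffenders labeled high-risk, matching the figures above.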
By the way, I can’t find their definition of recidivism, but at the very least it must include being arrested, and probably also being convicted or pleading guilty. I’m pretty confident that there are more unsolved crimes than false convictions, so defendants in both low- and high-risk categories are actually at higher risk of reoffending than is suggested by these statistics. I won’t venture a guess as to whether this introduces any racial bias.
All of which is to say that the model actually seems to work reasonably well, and is not obviously biased against black defendants. If there’s a problem, it’s with the results being used improperly, not with the model itself.
*Apparently women are actually more likely to be classified as high risk. This is surprising, since recidivism rates are generally lower for women. This might warrant further follow-up; maybe they missed the real story.Report
You should tell ProPublica and their propaganda machine.Report
>>All of which is to say that the model actually seems to work reasonably well, and is not obviously biased against black defendants. If there’s a problem, it’s with the results being used improperly, not with the model itself.
Again, I don’t understand the purpose of separating the model from its intended use. No one here is arguing that this model gets a suboptimal AUC in the general population. The problem comes from using a model that has uncertainty correlated with race to make sentencing decisions without accounting for that uncertainty.Report