> But the model designers were aware that features could be correlated with demographic groups in a way that would make them proxies.
There's a huge problem with people trying to use umbrella usage to predict flooding. Some people are trying to develop a computer model that uses rainfall instead, but watchdog groups have raised concerns that rainfall may be used as a proxy for umbrella usage.
(It seems rather strange to expect a statistical model trained for accuracy to infer a shadow variable and route its predictions through it, making itself less accurate, simply because that variable is easy for humans to observe directly and use as a lossy shortcut, or to promote alternate goals that aren't part of the labels being trained on.)
> These are two sets of unavoidable tradeoffs: focusing on one fairness definition can lead to worse outcomes on others. Similarly, focusing on one group can lead to worse performance for other groups. In evaluating its model, the city made a choice to focus on false positives and on reducing ethnicity/nationality based disparities. Precisely because the reweighting procedure made some gains in this direction, the model did worse on other dimensions.
Nice to see an investigation that's serious enough to acknowledge this.
They correctly note the existence of a tradeoff, but I don't find their statement of it very clear. Ideally, a model would be fair in the senses that:
1. In aggregate over any nationality, people face the same probability of a false positive.
2. Two people who are identical except for their nationality face the same probability of a false positive.
In general, it's impossible to achieve both properties. If the output and at least one other input correlate with nationality, then a model that ignores nationality fails (1). We can add back nationality and reweight to fix that, but then it fails (2).
This tradeoff is most frequently discussed in the context of statistical models, since those make it explicit. It applies to any decision process though, including human decisions.
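A minimal simulation of that impossibility, sketched in Python (all distributions and thresholds here are invented for illustration): a nationality-blind threshold gives unequal group-level false positive rates, and the per-group thresholds needed to equalize them necessarily treat two otherwise-identical people differently.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
nat = rng.integers(0, 2, n)                     # 0 = group A, 1 = group B
x = rng.normal(loc=nat * 1.0, scale=1.0)        # feature correlated with nationality
y = rng.random(n) < 1 / (1 + np.exp(-(x - 2)))  # fraud more likely at high x

def group_fpr(flag, g):
    honest = (~y) & (nat == g)
    return flag[honest].mean()

# Nationality-blind: one threshold for everyone -> unequal FPRs, fails (1).
blind = x > 1.5
print("blind FPR:", group_fpr(blind, 0), group_fpr(blind, 1))

# Nationality-aware: per-group thresholds chosen to equalize FPR at 5%.
t = [np.quantile(x[(~y) & (nat == g)], 0.95) for g in (0, 1)]
aware = x > np.where(nat == 0, t[0], t[1])
print("aware FPR:", group_fpr(aware, 0), group_fpr(aware, 1))

# But now two applicants with the same feature value x0 get different
# decisions depending only on nationality -> fails (2).
x0 = (t[0] + t[1]) / 2
print("flagged if A:", x0 > t[0], "| flagged if B:", x0 > t[1])
```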
Congrats Amsterdam: they funded a worthy and feasible project; put appropriate ethical guardrails in place; iterated scientifically; then didn’t deploy when they couldn’t achieve a result that satisfied their guardrails. We need more of this in the world.
What were the error rates for the various groups with the old process? Was the new process that included the model actually worse for any group, or was it just uneven in how much better it was?
> None of these features explicitly referred to an applicant’s gender or racial background, as well as other demographic characteristics protected by anti-discrimination law. But the model designers were aware that features could be correlated with demographic groups in a way that would make them proxies.
What's the problem with this? It isn't racism, it's literally just Bayes' Law.
Let's say you are making a model to judge job applicants. You are aware that the training data is biased in favor of men, so you remove all explicit mentions of gender from their CVs and cover letters.
Upon evaluation, your model seems to accept everyone who mentions a "fraternity" and reject anyone who mentions a "sorority". Swapping out the words turns a strong reject into a strong accept, and vice versa.
But you removed any explicit mention of gender, so surely your model couldn't possibly be showing an anti-women bias, right?
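A hedged sketch of that failure mode in scikit-learn (everything here is synthetic; none of it is from the article): gender is dropped from the features, but a correlated word survives, and biased historical labels teach the model to use it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
gender = rng.integers(0, 2, n)                    # 0 = man, 1 = woman (hidden from the model)
skill = rng.normal(size=n)                        # genuinely job-relevant
mentions_fraternity = (gender == 0) & (rng.random(n) < 0.3)
mentions_sorority = (gender == 1) & (rng.random(n) < 0.3)

# Historical (biased) hiring decisions: skill matters, but women are penalized.
hired = (skill + rng.normal(size=n) - 1.0 * gender) > 0

# Train with gender removed -- only skill and the two "club" words remain.
X = np.column_stack([skill, mentions_fraternity, mentions_sorority])
model = LogisticRegression().fit(X, hired)
print(dict(zip(["skill", "fraternity", "sorority"], model.coef_[0].round(2))))
# Typical output: positive weight on "fraternity", negative on "sorority",
# even though neither word has anything to do with the job.
```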
> A more concerning limitation is that when the city re-ran parts of its analysis, it did not fully replicate its own data and results. For example, the city was unable to replicate its train and test split. Furthermore, the data related to the model after reweighting is not identical to what the city published in its bias report and although the results are substantively the same, the differences cannot be explained by mere rounding errors.
Very well written, but that last part is concerning and points to one question: did they hire interns? How come they don't have systems for this? It casts serious doubt on the whole experiment.
What nobody seems to talk about is that their resulting models are basically garbage. If you look at the last provided confusion matrix, their model is right in about 2/3 of cases when it makes a positive prediction. The actual positives are about 60%. So any improvement is marginal at best, and a far cry from the ~90% accuracy you would expect from a model in such a high-stakes scenario. They could have thrown half of the cases out at random and had about the same reduction in case load without introducing any bias into the process.
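To make that arithmetic explicit (using the comment's rough figures, not the report's exact confusion matrix):

```python
precision = 2 / 3   # ~share of flagged applications that were actually investigation-worthy
base_rate = 0.60    # ~share of all applications that were investigation-worthy

print(f"lift over random selection: {precision / base_rate:.2f}x")  # ~1.11x
```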
> What nobody seems to talk about is that their resulting models are basically garbage.
The post does talk about it when it briefly mentions that the goal of building the model (to decrease the number of cases investigated while increasing the rate of finding fraud) wasn't achieved. They don't say any more than that because that's not the point they are making.
Anyway, the project was shelved after a pilot. So your point is entirely false.
The model is considered fair if its performance is equal across demographic groups.
One can immediately see why this is problematic by considering an equivalent example in a less controversial (i.e. less emotionally charged) setting.
Should basketball performance be equal across racial or sex groups? How about marathon performance?
It’s not unusual that relevant features are correlated with protected features. In the specific example above, being an immigrant is likely correlated with not knowing the local language, therefore being underemployed and hence more likely to apply for benefits.
In your basketball analogy, it's more like they have a model that predicts basketball performance, and they're saying that model should predict performance equally well across groups, not that the groups should themselves perform equally well.
A big part of the difficulty of such an attempt is that we don't know the ground truth. A model is fair or unbiased if its performance is equally good for all groups. Meaning, e.g., that if 90% of fraud cases committed by Arabs are flagged as fraud, then 90% of fraud cases committed by Danes should be flagged as fraud. The paper agrees on this.
The issue is that we don't know how many Danes commit fraud, and we don't know how many Arabs commit fraud, because we don't trust the old process to be unbiased. So how are we supposed to judge whether the new model is unbiased? This seems fundamentally impossible without improving our ground truth in some way.
The project presented here instead tries to do some mental gymnastics to define a version of "fair" that doesn't require that better ground truth. They were able to evaluate their results on the false-positive rate by investigating the flagged cases, but they were completely in the dark about the false-negative rate.
In the end, the new model was just as biased, but in the other direction, and performance was simply worse:
> In addition to the reappearance of biases, the model’s performance in the pilot also deteriorated. Crucially, the model was meant to lead to fewer investigations and more rejections. What happened instead was mostly an increase in investigations, while the likelihood to find investigation worthy applications barely changed in comparison to the analogue process. In late November 2023, the city announced that it would shelve the pilot.
Does anyone know what they mean by reweighting demographics? Are they penalizing incorrect classifications more heavily for those demographics, or making sure that each demographic is equally represented, or something else? Putting aside the model's degraded performance, I think it's fair to try to make sure the model performs well for all demographics.
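One common technique that goes by this name is Kamiran & Calders-style reweighing, where each training sample is weighted so that group membership and label become statistically independent in the weighted data. Whether that's what the city actually did is an assumption on my part; their bias report may describe something else. A minimal sketch:

```python
import numpy as np

def reweighing_weights(group: np.ndarray, label: np.ndarray) -> np.ndarray:
    """w(g, y) = P(g) * P(y) / P(g, y), from empirical frequencies."""
    w = np.empty(len(group))
    for g in np.unique(group):
        for y in np.unique(label):
            mask = (group == g) & (label == y)
            p_joint = mask.mean()
            if p_joint > 0:
                # Over-represented (group, label) cells get weight < 1,
                # under-represented cells get weight > 1.
                w[mask] = (group == g).mean() * (label == y).mean() / p_joint
    return w

# These weights would then be passed to the learner, e.g.:
# model.fit(X, label, sample_weight=reweighing_weights(group, label))
```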
Can someone explain to me - assuming they have enough data - why not train different models explicitly for each group / subgroup you want to model? You could even then just take the top N% of each group by score, effectively guaranteeing equal treatment for each group. Why would this not work?
Amsterdam reduced bias by one measure (False Positive Share) and bias increased by another measure (False Discovery Rate). This isn’t a failure of implementation; it’s a mathematical reality that you often can’t satisfy multiple fairness criteria simultaneously.
Training on past human decisions inevitably bakes in existing biases.
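To see why the FPS/FDR tension above is mathematical rather than an implementation failure, here's a tiny numeric illustration (all numbers invented, and using plain false positive rate as a stand-in for the share-based metric): hold the per-group error rates fixed, vary only the base rate, and the false discovery rate diverges on its own.

```python
# Two groups with identical FPR and recall but different base rates of
# actual fraud -> the false discovery rate must differ between them.
for name, n, base_rate in [("group A", 1000, 0.05), ("group B", 1000, 0.20)]:
    fpr, recall = 0.10, 0.80          # identical error rates for both groups
    positives = n * base_rate
    negatives = n - positives
    tp = recall * positives
    fp = fpr * negatives
    fdr = fp / (tp + fp)              # fraction of flags that are wrong
    print(f"{name}: FDR = {fdr:.0%}")
# group A: FDR = 70%, group B: FDR = 33% -- equal FPR, unequal FDR.
```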
Why is there so much focus on "fair" even when reality isn't?
Not all misdeeds are equally likely to be detected. What matters is minimizing the false positives and false negatives. But it sounds like they don't even have a ground truth to compare against, making the whole thing an exercise in bureaucracy.
Fraud detection models will never be fair. Their job is to find fraud. They will never be perfect, and the mistaken cases will cause a perfectly honest citizen to be disadvantaged in some way.
It does not matter if that group is predominantly 'people with skin colour X' or 'people born on a Tuesday'.
What matters is that the disadvantage those people face is so small as to be irrelevant.
I propose a good starting point would be for each person investigated to be paid money to compensate them for the effort involved - whether or not they committed fraud.
Some groups will be more disadvantaged than others by being investigated. For example, for welfare, I expect fraudsters to have more money to support themselves or fewer people to support (unless the criteria for welfare are something unexpected).
So I'd say there need to be more protections than just providing money.
Nevertheless the idea of giving money is still good imo, because it also incentivizes the fraud detection becoming more efficient, since mistakes now cost more. Unfortunately I have a feeling people might game that to get more money by triggering false investigations.
The goal is to avoid penalizing people for their skin color, or for gender/sex/ethnicity/whatever. If some group has a higher rate of welfare fraud, a fair/unbiased system must keep false positives for that group at the same level as for the general population. Ideally there would be no false positives at all, because they are costly for the people who are wrongly flagged, but sadly real systems are not like that. So these false positives have to be spread over all groups proportionally to the sizes of the groups.
Though the situation is more complex than that. What I described is named "False Positive Share" in the article (or at least I think so), but the article discusses other metrics too.
The problem is that the policy should make the world better, but if the policy penalizes some groups for law breaking, then it can push these groups to break the law even more. It is possible to create biases this way, and it is possible to do it accidentally. Or, rather, it is hard not to do it accidentally.
I'd recommend reading "Against Prediction"; it has a lot of examples of how this works. For example, biased False Negatives are also bad: they make it easier for some groups to break the law.
The better definition of equal performance would obviously be that the metrics for the detector (accuracy, false positive rate, etc.) would be the same for all groups.
I won't comment on why it's defined the way that it is.
Edit: it looks like they define several metrics, including ones like those I mention above that consider performance, and at least one based on the number or percentage flagged in each group.
There are multiple different ways to measure performance. If different groups have different rates of whatever you're predicting, it is not possible to have all of the different ways of measuring performance agree on whether your model is fair or not.
> Why would you assume that all groupings of people commit welfare fraud at the same rate?
Because the goal is NOT just wiping out fraud, but, instead, minimizing harm or possibly maximizing positive results.
Minimizing fraud is super easy--just don't give out any benefits. No fraud--problem solved.
That's not the final goal, though. As such, the ideal amount of fraud is somewhere above zero. We want to avoid falsely penalizing people who, practically by definition, probably don't have the resources to fight the false classification. And we want to minimize the amount of aid resources we spend policing said aid.
The goal is to find a balance. Is helping 100 people but carrying 1 fraudster a good tradeoff? Should it be 1000? Should it be 10? Well, that's a political discussion.
Yes it is. This is ideal-world thinking that has nothing to do with reality and is easily falsified, but only if you want to see the real world.
> Why would you assume that all groupings of people commit welfare fraud at the same rate?
What's the alternative? It's an unattainable statistic: you can't count the people who get away with crime. Instead, what ends up getting used is the fraud rates under the old system, or ad hoc rules of thumb based in bigoted anecdotes.
So instead you declare that you don't think ethnicity is in and of itself a cause of fraud, even if there may be any number of characteristics that tend to indicate or motivate fraud that are seen more in one ethnicity than another (poverty, etc.), and even though we should expect that to lead to more fraud. We can choose to say that those characteristics lead to fraud, rather than the ethnicity, and put that out of scope.
Then we can say that this algorithm isn't meant to solve multiculturalism, it's meant hopefully not to exacerbate the problems with it. If one wants to get rid of weird immigrants, non-whites, or non-Christians, just do it, instead of automating a system to be bigoted.
Also, going after the marginal increase of rates of fraud through defining groups that represent a small portion of the whole is likely to be a waste of money. If 90% of people commit fraud at a 5% rate and 10% commit it at a 10% rate, where should you be spending your time?
"Unbiased," and "fair" models are generally somewhat ironic.
It's generally straightforward to develop one if we don't care much about the performance metric:
If we want the output to match a population distribution, we just force it by taking the top predicted for each class and then filling up the class buckets.
For example, if we have 75% squares and 25% circles, but circles are predicted at a 10-1 rate, who cares, just take the top 3 squares predicted and the top 1 circle predicted until we fill the quota.
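A sketch of that quota procedure (illustrative only; the function name, groups, and numbers are made up):

```python
from typing import Sequence

def quota_select(scores: Sequence[float], groups: Sequence[str],
                 proportions: dict[str, float], k: int) -> list[int]:
    """Return indices of k selected items, allocated per group by quota."""
    selected = []
    for g, share in proportions.items():
        quota = round(k * share)
        members = [i for i, grp in enumerate(groups) if grp == g]
        members.sort(key=lambda i: scores[i], reverse=True)  # best first
        selected.extend(members[:quota])
    return selected

# e.g. force 75% squares / 25% circles, regardless of predicted rates:
idx = quota_select([0.9, 0.8, 0.7, 0.6, 0.3],
                   ["sq", "sq", "circle", "sq", "circle"],
                   {"sq": 0.75, "circle": 0.25}, k=4)
print(idx)  # three squares, one circle
```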
So if I want to make a model to recommend inkjet printers then a quarter of all recommendations should be for HP printers? After all, a quarter of all sold printers are HP.
As you say, that would be a crappy model. But in my opinion it would also hardly be a fair or unbiased model. It would be a model unfairly biased in favor of HP, who barely sell anything worth recommending.
Is this crazy or what? My takeaway is that the factors the city of Amsterdam is using to predict fraud are probably not actually predictors. For example, if you use the last digit of someone's phone number as a fraud predictor, you might discover there is a bias against low numbers. So you adjust your model to make it less likely that low numbers generate investigations. It is unlikely that your model will be any more fair after your adjustment.
One has to wonder if the study is more valid a predictor of the implementers' biases than that of the subjects.
> It isn't racism, it's literally just Bayes' Law.

That may be logically correct, but the law is above logic. Sometimes applying Bayes' Law is legally considered racism.

https://en.wikipedia.org/wiki/Disparate_impact
Without figures for true positives, recall, or financial recoveries, the system's effectiveness remains completely in the dark.
In short: great for moral grandstanding in the comments section, but zero evidence that taxpayer money or investigative time was ever saved.
> their resulting models are basically garbage

Amsterdam didn't deploy their models when they found the outcome was not satisfactory. I find that a perfectly fine result.