This month, Amnesty International published an unsettling report that summarizes the Dutch Childcare Benefits Scandal, in which tens of thousands of parents and caregivers were falsely accused of fraud by the Dutch tax authorities, with disastrous results. These parents and caregivers had to repay large sums of money immediately and in full, leading to severe debt problems, unemployment, and broken families.
The AI system that assigned a risk score to applicants used nationality as one of the variables to determine the risk of fraud. This led to racial profiling. The Amnesty International report provides extensive recommendations on how to avoid this in the future. One of them is to ban the use of data about nationality or ethnicity in training sets.
That won’t work. Removing this data may still leave a lot of other data from which nationality or ethnicity can be derived, for example the “new citizen” field in the form to request childcare benefits. And removing that field may still leave other traces of nationality or ethnicity in the data, such as name, zip code, or other fields that we do not yet know are related to nationality.
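To make this proxy problem concrete, here is a minimal sketch with entirely synthetic, made-up data (nothing here is taken from the real system). The nationality field is never shown to the “model”, yet a trivial majority-vote rule recovers it from zip code alone, simply because the two are correlated:

```python
import random
from collections import Counter, defaultdict

random.seed(0)

# Synthetic, illustrative data: 90% of group "A" lives in zip 1000,
# 90% of group "B" in zip 2000 (e.g. due to residential segregation).
def make_applicant():
    nationality = random.choice(["A", "B"])
    likely, other = ("1000", "2000") if nationality == "A" else ("2000", "1000")
    zipcode = likely if random.random() < 0.9 else other
    return nationality, zipcode

data = [make_applicant() for _ in range(10_000)]

# A trivial "model" that never sees the nationality field: for each
# zip code, predict the nationality that occurs most often there.
by_zip = defaultdict(Counter)
for nationality, zipcode in data:
    by_zip[zipcode][nationality] += 1
predict = {z: c.most_common(1)[0][0] for z, c in by_zip.items()}

accuracy = sum(predict[z] == n for n, z in data) / len(data)
print(f"nationality recovered from zip code alone: {accuracy:.0%}")
```

Even with nationality deleted from the training set, any model that uses zip code can reconstruct it with roughly 90% accuracy in this toy setup; the same holds for names and other correlated fields.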
More importantly, the recommendation suggests that it is possible to remove national or ethnic bias from a system. But this is impossible. As I will explain below, AI systems are never unbiased.
We can still use these systems to improve our decision making, but this requires critical reflection on the recommendations AI systems give. To explain how this can be done, let’s first look at what happened.
The AI system was trained on past examples of correct and incorrect applications for childcare benefits. Applications received a risk score between 0 and 1, and applications with a high risk score were reviewed by a civil servant, who had the power to label an application as fraudulent. The civil servant had no information about why an application was classified as high-risk.
If an application had a high risk score, the civil servant requested more information from the applicants. But when applicants asked which part of their application was considered incorrect, they often received no answer.
After an application was labeled as fraudulent, the applicants had to repay large sums of money immediately and in full. This caused severe financial distress. Some families could not pay their mortgage or rent and were evicted from their homes. Under the stress of these financial problems, marriages ended in divorce, and there are indications that some children were placed out of their homes.
Nationality was one variable used in the classification. When the tax authorities suspected that applicants with a certain nationality submitted fraudulent applications more often than applicants with another nationality, the entire group came under suspicion. The tax authority then screened all applications from people of that nationality in search of irregularities. Civil servants spoke about suspected fraudsters in abusive language, such as “an Antillian nest” to refer to families with Caribbean roots.
The tax office went into overdrive trying to identify fraud because a perverse incentive was in force: the AI system, and the part of the organization using it to identify fraud, had to finance itself from the childcare benefits repaid by those it identified as fraudsters.
What should have happened?
In a previous blog I listed the sources of bias in AI systems and listed a series of nine questions that should be asked of any AI system to manage this bias. In the following diagram, the sources of bias are listed in red, and the questions are listed in green. The first five questions, in the upper half of the diagram, should be asked once, before the system is used. The remaining four questions, in the lower half, should be asked of every case to which the system is applied.
In the case of the childcare benefits AI system, let’s assume that the sample of cases on which the system has been trained consists of all previous applications. (1) If the tax office behaved in the past as it behaved during the build-up to the scandal, they may have labeled applications from ethnic minorities fraudulent relatively more often than other applications. We don’t know, so this introduces unknown uncertainty in the risk score computed by the system.
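This label-bias mechanism can be illustrated with a small, entirely hypothetical simulation: both groups make mistakes at exactly the same rate, but past reviewers labeled one group fraudulent more often, and any model fit to those labels inherits the skew.

```python
import random
from collections import Counter

random.seed(2)

# Hypothetical training labels: the true error rate is 5% in both groups,
# but past reviewers caught and labeled group "B" errors twice as often.
train = []
for _ in range(10_000):
    group = random.choice(["A", "B"])
    truly_incorrect = random.random() < 0.05        # identical in both groups
    review_rate = 0.9 if group == "B" else 0.45     # biased past reviewing
    labeled_fraud = truly_incorrect and random.random() < review_rate
    train.append((group, labeled_fraud))

# A model trained on these labels estimates the *labeled* fraud rate per
# group, not the true error rate.
flags = Counter(g for g, labeled in train if labeled)
totals = Counter(g for g, _ in train)
learned_rate = {g: flags[g] / totals[g] for g in ("A", "B")}
print(learned_rate)  # group B appears roughly twice as risky
```

The model faithfully reproduces the reviewing bias: group B looks about twice as risky even though the underlying behavior is identical. Since we do not know how biased the historical labels were, we cannot quantify this effect, only acknowledge the uncertainty it adds.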
(2) The tax office reported that various information campaigns and software improvements caused shifts in how applications were filled in, which means that past cases are not necessarily representative of future ones. This introduces additional uncertainty in the interpretation of the risk score.
(3) For people whose first language is not Dutch, the forms may be hard or impossible to fill in. I once tried to fill in a U.S. tax form and did not understand most of the fields. And the zillions of Dutch-language forms that I have filled in often contained questions that I could not understand either. So the forms are likely to contain mistakes, and this is more likely for people with a foreign background.
(4) This leads to the core aspect to be clarified: What does the risk score measure? Even though it is a number between 0 and 1, it is not a probability. Does it measure the similarity between the current form and past applications in which something was wrong? Organizational tendencies to simplification may have led users to believe that the score indicates the likelihood of fraud.
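A number between 0 and 1 that measures similarity is not automatically a probability. A hypothetical sketch (all distributions and numbers invented purely for illustration) shows how far apart the two can be when incorrect applications are rare:

```python
import random

random.seed(1)

# Synthetic setup: only 5% of applications are actually incorrect, and
# the scorer outputs an uncalibrated similarity score in [0, 1].
cases = []
for _ in range(100_000):
    incorrect = random.random() < 0.05
    # Incorrect applications tend to score higher, but the scores are
    # skewed similarity values, not calibrated probabilities.
    score = random.betavariate(6, 2) if incorrect else random.betavariate(3, 3)
    cases.append((score, incorrect))

# Among high-scoring cases, what fraction is actually incorrect?
high = [incorrect for score, incorrect in cases if score > 0.8]
frac_incorrect = sum(high) / len(high)
print(f"of cases scoring above 0.8, {frac_incorrect:.0%} are actually incorrect")
```

In this toy setup, a score above 0.8 emphatically does not mean “more than 80% chance of fraud”: most high-scoring applications are fine. Without calibration data, the number cannot be read as a probability at all.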
(5) Another aspect to be clarified is the error rates. How often will the risk score be high for an application in which there is nothing wrong? How often will it be low for an incorrect application? There will always be false positives and false negatives, and users must understand these rates.
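With hypothetical, made-up counts for 1,000 reviewed applications, the two error rates (and the often-overlooked precision) work out as follows:

```python
# Hypothetical review of 1,000 applications (all counts invented for illustration).
flagged_incorrect = 40     # true positives: flagged and actually incorrect
flagged_ok = 60            # false positives: flagged, but nothing was wrong
unflagged_incorrect = 10   # false negatives: incorrect, but not flagged
unflagged_ok = 890         # true negatives

false_positive_rate = flagged_ok / (flagged_ok + unflagged_ok)
false_negative_rate = unflagged_incorrect / (unflagged_incorrect + flagged_incorrect)
precision = flagged_incorrect / (flagged_incorrect + flagged_ok)

print(f"false positive rate: {false_positive_rate:.1%}")  # share of correct applications flagged
print(f"false negative rate: {false_negative_rate:.1%}")  # share of incorrect applications missed
print(f"precision: {precision:.0%}")  # of flagged applications, share actually incorrect
```

Note that in this invented example 60 of the 100 flagged families have done nothing wrong. Users who do not know these rates will treat every flag as proof of fraud.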
All these system-level questions make implementing an AI system less routine than implementing a conventional information system. Users must spend more effort on understanding what the output of an AI system means.
In the application of the system to a particular case, we have already seen that mistakes are possible (8) and that future cases tend to become different from past cases (9). So let’s look at the two conceptually hard questions: (6) Which variables contributed to the risk score and (7) what is the story behind the score?
The fact that the civil servant who uses the risk score did not receive information about which variables contributed to the score makes the score unusable. It makes the score a case of “computer says so”. For this reason alone, the system should not have been used.
Which leads to the core question about high scores: why is the score high? Did the applicant make a mistake in good faith, or did they try to commit fraud? Did the applicant fill in the form herself, or did she give it to a third party? Did the third party make a mistake or try to commit fraud? What is the story behind the score? It seems that the default story was that a high score means a high probability of fraud.
Traditionally, automation saves us work. Computations can be done faster, in bulk and more reliably, and the administrative worker knows exactly how to interpret the output of a system.
It is different for AI systems. These systems do save time by selecting a small number of cases for further manual consideration, but to handle these cases meaningfully the nine questions listed above must be answered. This is not just work, it is conceptual work. It makes things harder. It takes time. We can speed up the process by skipping these questions. Unfortunately, this reduces decision quality.
Due to imperfections in any AI system (the red notes in the diagram above), any AI system is biased. This must be countered by asking the nine questions listed above. The suggestion by Amnesty International that bias can be removed is wrong. It may have the unintended consequence that users believe that, after “removal”, recommendations of the system are unbiased, which motivates them to skip any or all of the nine questions. All AI systems must be used critically.
Using nationality as a variable to compute a risk score was a minor problem compared to the problems raised by skipping the nine questions. Nationality may even have been a useful variable because it would explain the difficulties that some people may have filling in a form. It was not the machine that was xenophobic. It was the organization that used the machine.