6. Concrete Case Study: The Toeslagenaffaire
This piece aims to dive deeper into one example of the use of machine learning and the reactions it provoked. In my opinion, the Toeslagenaffaire is a current topic that teaches us what there still is to learn about artificial intelligence and its applications. A short summary of the whole debacle is that many people eligible to receive allowances / benefits (toeslagen in Dutch) were flagged as fraudsters. This led to enormous debts to pay back for thousands of innocent people.
Now, I understand that most case studies are actually quite concrete works and that this title might feel like a bit of a tautology. However, I am also aware that I have a tendency to dive far into the abstract and so a bit of underlining might not be the worst idea. I will try to stay as close to the point as possible, veering off into the abstract only when necessary (or judged to be interesting enough). Without further ado, let’s get concrete.
Background
What ended in a scandal - and the resignation of the Cabinet - started with a completely different scandal. In 2013, it became widely known that various Bulgarian citizens were managing to profit illegally from the Dutch allowance system. By pretending to reside in the Netherlands, they qualified for a whole variety of allowances, from health care benefits to child care support. Of course, these allowances are meant to be awarded to people who have to pay for the connected services, like insurance and day care. By faking whole lives, the Bulgarian fraudsters got the money and never paid the costs.
Politics is far from my area of expertise, but I understand enough to know that events like this - especially when widely known - require reactions. This was true for the fraud scandal as well. The Cabinet of that time opened Pandora’s box, filled with questions and doubts rather than evil and hope. The conclusion was obvious: stronger measures to combat fraud were needed, and so they would be implemented.
If you have ever read a science fiction story, you can see where this is going. The problem arises, the solution is proposed and eagerly accepted, and in the end the solution turns out to be a bigger problem than the original one. A tale as old as time, and this affair could be summarised in the same way. Many more people were labelled as fraudsters by the new system, as it was far stricter. This could be seen as a sign of a well-working system, but many of these newly labelled criminals did not do anything wrong at all, or at least had no bad intent. In other words, false negative cases lessened but false positive cases boomed. This happened to the point of thousands of people being buried in debt while they never once tried to commit fraud: the scandal now known as the Toeslagenaffaire.
Now, a few years later, the saga has gained yet another chapter. Compensation regulations have gone so far that - again - the wrong people are affected by good intent. The people actually guilty of committing fraud now apply for compensation and receive it, opening a whole other can of worms. Considering the whole affair, the Cabinet reminds me of someone trying to steer a ship for the first time: wild movements that throw the whole vessel around, rather than the more appropriate small corrections and the required patience. However, the goal of this text is not to criticise all the measures taken throughout the years by the tax authorities. So let us make sure to stay close to the case we are trying to study:
In what way did the original stricter measures go wrong, and what role did machine learning play?
When first talking about machine learning, we talked about the appealing characteristics that self-learning models have. Automating tasks saves a lot of energy, and some jobs simply require more manpower than an organisation can muster without using models. This is exactly the kind of problem that an organisation like the Dutch Tax and Customs Administration (or Belastingdienst in Dutch) runs into. Instead of considering all individual cases of possible fraud - which would entail checking every case in which an allowance has been given out - it would be much less work to only consider the cases that are “more likely” to be fraud. There are multiple ways to achieve this goal, and our main protagonist of today’s story is one of them: the model that became known as the Risicoclassificatiemodel (yes, that is actually one word in Dutch).
The name translates literally to “risk classification model”, and I will refer to it as the RC model from now on. The RC model did - and the past tense is appropriate, as it has since been discontinued - exactly what its name suggests: it classified allowance cases according to their so-called risk, the likelihood of being the result of fraudulent behaviour. In this, it was one of multiple automated systems and instruments that allowed the manual labour of the tax administration to be limited to judging only those cases that were highly likely to involve actual swindlers. We specifically consider the use of this model for the assignment of child care allowance, which is meant to support parents whose child stays at day care when there is no one at home to take care of them.
Due to the stricter approach taken, many parents were mistakenly identified as fraudsters. This has resulted in a conflict that is still unresolved and a very current topic in Dutch politics. Earlier this month, a political picture award (the Prinsjesfotoprijs) was given to a photo of victims of the affair pleading for help in The Hague. While the RC model never had the final say in a judgement, supplying a risk score and determining the preselection of possible fraudsters were large factors in determining the verdicts reached by human case workers. The system as a whole was too strict, but special emphasis has been put on the model in reactions to the affair. What was so wrong with the model that it should be singled out? Let’s dive into the analysis of, and reactions to, the use of a self-learning model in such an important case.
The Analysis
The RC model was considered a black box model by all involved parties. As we know from our previous research, this means that all agents involved were acting as if there was no knowledge whatsoever about what goes on between the input and output. We know there must have been some weighing of variables for example, but for the sake of the analysis this is not considered anywhere. Therefore there has been no analysis or research on the “inside” of the model, and the only approach taken has been completely model-agnostic. Let’s have a look at the analysis performed by “Het College voor de Rechten van de Mens”, which can be found here.
The method used is very simple. It is aimed more at considering the complete outcomes over the data set than at trying to understand the inner workings of the model. What do we mean by this? The analysis went as follows: researchers considered the whole data set and looked at the distribution of persons according to the variables used. Afterwards, they did the same thing for only the group that was classified as likely to have committed fraud: the risk group. For example, consider the variable Heeftpartner, which indicates whether someone had a registered partner at the time of application. Relatively, there were 8.39 times more people with no partner in the risk group than in the complete data set. If, say, 100 out of 1,000 people in the whole data set had no partner, this would be 839 out of 1,000 in the risk group. This is a big difference, and clearly the variable Heeftpartner carried quite some weight.
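To make that comparison concrete, here is a minimal sketch of the computation, assuming a hypothetical pandas DataFrame with a boolean no_partner column and a boolean risk-group flag. The column names and toy numbers are my own illustration, not the actual data set; they are chosen so that the ratio lands near the reported 8.39.

```python
import pandas as pd

def representation_ratio(df: pd.DataFrame, feature: str, risk_col: str = "in_risk_group") -> float:
    """Share of `feature` in the risk group divided by its share in the full data set."""
    share_overall = df[feature].mean()                  # fraction with the property, all cases
    share_risk = df.loc[df[risk_col], feature].mean()   # same fraction, risk group only
    return share_risk / share_overall

# Hypothetical toy data: 1,000 cases, 10% without a partner overall,
# a risk group of 100 cases of which 84 have no partner.
toy = pd.DataFrame({
    "no_partner":    [True] * 100 + [False] * 900,
    "in_risk_group": [True] * 84 + [False] * 16 + [True] * 16 + [False] * 884,
})

print(representation_ratio(toy, "no_partner"))  # ≈ 8.4, close to the reported 8.39
```

The same function applied per variable is essentially all the published analysis does: it compares how strongly each group of people is represented in the risk group relative to the population of applicants.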
Does this mean that the model is a discriminatory model against people with no partner?
This is an important question, but it depends heavily on the interpretation of discrimination. In the political discussions and texts surrounding the affair, discrimination is used in a way similar to unjust bias: as something that should not happen. Of course a model distinguishes between cases and uses the differences between persons to determine a result, but discrimination occurs when this is done to a fault. In machine learning the term “bias” is used frequently. Consider a bias towards people with white skin when a model is trained on a data set that contains mostly white faces. There are more ways in which bias and discrimination can - and will - originate, and it is important to determine when something is discrimination rather than an intended feature of the model.
It is therefore a difficult question to answer whether the model discriminates against people with no partner. The RC model obviously leans towards suspecting people with no partner, but this could also be explained in a non-discriminatory way. To understand this, think of an algorithm that estimates a person’s chance of developing diabetes: such a model obviously leans towards assigning a higher chance to people with a higher BMI. I do not mean to downplay the effects of discrimination in machine learning models, as it is a large and persistent problem. It is merely important to understand that this too is a matter of discussion, and so we should look at the cases that are actually being debated. One variable that has been the topic of much debate is av_nationaliteit.
A large part of the reason for taking a stricter approach to checking for fraud was the scandal caused by Bulgarian citizens with no legitimate residency in the Netherlands. Similar schemes by citizens of various other countries were known as well. From a completely alien perspective, it thus seems obvious to include a variable representing the nationality of the person applying for allowances. From a human-in-2022 perspective, this seems a very slippery slope.
The variable av_nationaliteit represented whether or not the case in question concerned a person with the Dutch nationality. Let us discuss two possible problems with the inclusion of this variable in the RC model: one legal, one ethical. The direct legal mistake that the programmers made had to do with the scope of the variable being too wide. The ethical problem that comes with considering nationality as a variable stems from the unavoidable line this draws between the ways in which different groups are treated. We will first make a small detour into the legal, and then move on with the main text to discuss the ethical concerns and the debate this has provoked. I will also give some more background from a more theoretical point of view and conclude with my own views on the spectacle.
A small detour into the legal
Often in the creation of larger machine learning models, developers can afford to be slightly careless. At least for accuracy’s sake, similar results can be reached by including different variations of a variable. However, this can lead to poor design, as there are guidelines for the use of certain information. As we can read in a report (link downloads immediately) by the “Autoriteit Persoonsgegevens” (AP), using nationality in the RC model was one such mistake. The inclusion was judged illegal because more information was accessed than necessary while an alternative was available. Instead of considering nationality on its own, the developers should have considered a combination of multiple factors, of which nationality is one. This would have corresponded directly to the actual eligibility to receive allowances rather than an indirect factor, and it would have protected the privacy of the persons considered to a larger extent.
This is quite an interesting point by this particular authority, in my opinion. What does this judgement actually mean? It seems to imply that the task of the RC model was to identify those cases that were clearly not eligible to receive allowances. However, it is unlikely that a person actively attempting to commit fraud would be picked out by this. The kick-off scandal consisted of Bulgarian citizens who were able to register at a home address in the Netherlands, making their application to receive allowances completely valid according to this indicator. It is indeed true that this direct factor would be more respectful of privacy and more accurate for certain cases, but the judgement depends heavily on the perceived purpose of the actual model: a term that was woven into one of our earlier discussions as well.
The Discussion
Last year, Amnesty International published a report under the (translated) title Xenophobic Machines. This was a reaction to the Toeslagenaffaire, with the topics discussed in this post as its main object of investigation. The English version of the report reads:
“The scandal also included racial profiling by the risk classification model, which is the focus of this report.”
The combination of the phrases xenophobic in the title and racial profiling in the introduction already leaves no room for doubt about the organization’s position on the matter. The main thesis of the report can be separated into two topics: Using nationality as an indicating factor of criminality and the usage of self learning systems, specifically those of the “black box” kind. As we are speaking about the usage of the variable av_nationaliteit here, we should discuss the former topic of the report. However, we will see that these two parts are closely intertwined in the final conclusion that Amnesty gives and so we will talk about self learning systems shortly after.
Using nationality as an indicating factor of criminality goes against all our moral and ethical ideas of fair treatment by a state. Amnesty is of the opinion that this is what happened in the Toeslagenaffaire, and they heavily condemn the authorities for it. Looking at the numbers, there is a strong case to be made. In the report by “Het College voor de Rechten van de Mens” that we discussed before, it was shown that in 2018 persons without the Dutch nationality were selected as part of the risk group 5.74 times more often than persons with the Dutch nationality. Nationality thus makes a big difference, which leads to a simple consequence: when more people from a group get selected as part of the risk group, more people from that group will also be falsely accused of fraud. The group of people that do not hold the Dutch nationality is over-represented in the risk group relative to the others, and the same holds for the false positive cases.
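A small numerical sketch makes that consequence explicit. The numbers below are invented for illustration (only the 5.74 factor comes from the report), and I assume - purely for the sake of argument - that flagged cases in both groups contain the same share of genuine fraud:

```python
# Illustrative numbers only; not taken from the actual case files.
population = 10_000                   # applicants per group (assumed equal for simplicity)
selection_rate_dutch = 0.01           # assumed baseline selection rate
selection_rate_non_dutch = 5.74 * selection_rate_dutch  # the reported relative factor
precision = 0.3                       # assumed share of flagged cases that are genuine fraud, same in both groups

false_accusations_dutch = population * selection_rate_dutch * (1 - precision)
false_accusations_non_dutch = population * selection_rate_non_dutch * (1 - precision)

print(false_accusations_dutch)        # 70.0 innocent people flagged
print(false_accusations_non_dutch)    # 401.8 innocent people flagged: the 5.74 factor carries over directly
```

Even under the generous assumption of equal precision across groups, the over-selected group absorbs proportionally more of the false accusations.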
Then why would the developers of the RC model even include this variable in the first place? The simple answer is that it worked. This immediately points at how difficult it is to build a fair model. In cases that do not touch any sensitive topics, say the recognition of a tree in a picture, a developer can just go about their job adding variables to a model until it becomes very “intelligent”. Of course, there is still the danger of over-fitting a model - such that it only works for trees in the test set and not for new trees - but the variables can be very abstract and still add to the accuracy of the machine. The machine does not think the same way that we do, and so the variables that are added do not need to make sense to us. Being used to such abstraction between human developer and machine model can easily lead to externalising what is actually happening. Even though we would maybe not use nationality ourselves when considering fraudulent activity - or maybe we would, even though we agree we should not - when a machine does it, it feels different. It becomes a variable that adds to the accuracy of the model rather than discriminatory activity, or at least it could in the eyes of an unknowing developer.
There was no programming mistake made by the developers; the error - which I do not deny - resulted from a lapse in moral judgement. When real-life implications were weighed against abstract accuracy, the scale tipped towards abstract accuracy in the minds of the engineers. This leads me to my next point: the machine did not make mistakes, humans did.
What is important to understand is that the decision to include nationality makes sense if considered purely from the perspective of accurately catching fraudulent behaviour. For the machine, this was the only given goal. There was no side clause that included the goal of not discriminating; that was never the machine’s job. On the flip side of that same coin, the model would not have used the variable if it had not helped to achieve this one goal. In 2019, the variable was actually no longer used by the model, as it had learned that there was no correlation strong enough to warrant using the nationality of a person. First, the factor helped and so the model used it. Afterwards, the factor stopped being of use and so the model stopped using it. That is all there was to it, just as it would be with a variable that looked at the income of a person or the distance to the nearest child care unit.
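This “dropping” of a factor is common behaviour for learned models. I do not know the internals of the RC model, so purely as a hedged illustration, here is a minimal scikit-learn sketch in which an L1-regularised logistic regression assigns a weight of (almost) zero to a feature that carries no signal about the label - the same pattern of a factor simply falling out of use:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000

informative = rng.normal(size=n)              # a feature that genuinely predicts the label
uninformative = rng.integers(0, 2, size=n)    # stand-in for a feature with no remaining correlation
y = (informative + 0.5 * rng.normal(size=n) > 0).astype(int)

X = np.column_stack([informative, uninformative])
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

# The weight for the uninformative column ends up at (or very near) zero:
# the model keeps what helps its one goal and discards what does not.
print(model.coef_)
```

Whether the RC model used regression, trees, or something else entirely, the underlying logic is the same: a factor is only retained for as long as it improves the single objective it was given.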
Most of us have seen one article or another on bias in machine learning. From recognition software that is more accurate for white people to speech recognition that only works well for white males, bias has become widely known to be problematic. This, in combination with the use of nationality and the results that became apparent, led to an easy conclusion for the researchers at Amnesty: governments should no longer use black box systems - or even the wider set of self learning algorithms - in areas of important societal impact. But what problem are we really fighting here? The machine is given data by researchers, who also decide on the variables that are accessible to the machine, and it only decides on the best course of action to achieve the one goal that is given to it by the same researchers. Then where does the xenophobic nature of the machine come from?
A different point of view
A few weeks ago I attended the Brave New World conference in Leiden, where Joanna Bryson gave a talk. The perspective on bias that she advocated is that AI does not amplify it, it just shows it. This conclusion is supported by her own research on semantic biases in texts. Machine learning models trained on texts from the internet, even news pages, revealed problematic biases. These include a more negative connotation towards people with a non-white skin colour and a tendency to associate programmers with men rather than women. Even though these are stereotypes that we do not wish to associate ourselves with, Bryson’s research showed that the biases in machine learning are nearly perfectly correlated with our own. In the words of the researchers (with emphasis added by me):
“Our findings suggest that if we build an intelligent system that learns enough about the properties of language to be able to understand and produce it, in the process it will also acquire historical cultural associations, some of which can be objectionable.”
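The measurements behind this finding compare how close target words (such as occupations) sit to different sets of attribute words (such as male and female pronouns) in a word-embedding space. Below is a minimal sketch of that idea with tiny, made-up vectors; a real test, such as the Word Embedding Association Test used in that line of research, would load trained embeddings and carefully chosen word sets rather than my toy numbers.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(word_vec: np.ndarray, attr_a: list, attr_b: list) -> float:
    """Mean similarity to attribute set A minus mean similarity to attribute set B."""
    return np.mean([cosine(word_vec, a) for a in attr_a]) - np.mean([cosine(word_vec, b) for b in attr_b])

# Toy 3-dimensional "embeddings", invented for illustration only.
vectors = {
    "programmer": np.array([0.9, 0.1, 0.0]),
    "he":         np.array([1.0, 0.0, 0.1]),
    "him":        np.array([0.95, 0.05, 0.0]),
    "she":        np.array([0.1, 1.0, 0.0]),
    "her":        np.array([0.05, 0.95, 0.1]),
}

male = [vectors["he"], vectors["him"]]
female = [vectors["she"], vectors["her"]]

# Positive value: "programmer" sits closer to the male terms in this toy space.
print(association(vectors["programmer"], male, female))
```

The point is that the association is measured, not invented, by the model: it is read off from the way we already use language.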
When the RC model picks out a disproportionate number of people that do not hold the Dutch nationality to be checked as part of the risk group, this is discriminatory behaviour. But this is not discrimination that is only present in the machine. Following Bryson’s narrative, the model merely shows the presence of an unwanted bias rather than amplifying it. The problem cannot be traced back to the inner workings of the RC model, as it was just doing its job. Rather, we will have to accept the conclusion that the bias originated at the human level before being shown by the model. There could - and probably should - still have been human interference to prevent this bias from affecting the outcome of the automated process, but that was never part of the task of the RC model as designed by humans. We could even say that the AI helped overcome this negative bias, as the variable av_nationaliteit was no longer used in the last year of the model’s deployment, showing that the presupposed correlation was no longer present across recent cases.
The conclusion about black boxes drawn by the researchers at Amnesty is well meant, but I cannot help feeling that they are missing a certain point. Yes, it is dangerous to use technology - especially smart technology - when working in sensitive fields. Before starting my own research this year, I would probably have wholeheartedly agreed with the conclusion to ban self learning algorithms in these fields, as the current state of our interaction with machine learning can be problematic. However, I am now of the opinion that this should not be the end of the road. By advocating a stance so directly opposed to AI, we miss out on the opportunity to learn, and we are even at risk of concluding that the AI was the only thing wrong here. What we have is proof in numbers that mistakes were made in classifying allowance cases on the basis of nationality. Of course it is attractive to attribute this solely to a machine that we feel so distant from, but the reality to face is that this points out unwanted bias in us as humans. If this is how the model turned out, that might mean that profiling based on nationality is a present problem in the way allowance cases are handled. This was not caused by an algorithm; it merely manifested in the algorithm, and that could have helped us improve the procedure instead. Even now, it can help us, as we can start working on the unwanted biases that the AI has pointed out.
Concluding
Distributing allowances is a difficult task to do well, and there will always be negative consequences of the inaccuracy of this procedure. The balance between calling out all fraudulent cases on the one hand and helping all honest applicants on the other is precarious, and steering the ship of the tax authorities comes with overcompensation in both directions. While this is partly unavoidable, the process of distribution should still never result in a scandal like it did here. Part of this problem could be found in the way that machine learning was used. This was the cause of much debate in legal as well as ethical corners. The legal side sets up an interesting discussion; the ethical side attempts to find an answer.
There is a big problem with the use of a nationality variable that leads to ethnic profiling. However, completely abolishing the use of self learning algorithms does not seem to be a proper solution to me. The largest reason for this is that we should not attribute the negative results of training and oversight to the inner workings of the RC model. Doing so can lead to us missing the point: bias does not originate in a model, it merely shows through in the workings of a model. If the model worked in a problematic manner, this means that there are questions to be asked at the level of the authorities, even at the level of the society in which this worked.
Machine learning models do not have implicit values. There is no right and wrong programmed into a model, besides the actual goals given by the developers. So how can we blame a model for being only “intelligent”, when it is designed to be Artificial Intelligence? And why would we ban self learning models, instead of asking questions about the system which gave these models the values they used? These are questions that invite further discussion, on this level and on a more theoretical level as well. I do not yet have conclusions to those discussions, so let me conclude by summing up my take on the use of the RC model and its results in the Toeslagenaffaire:
The error of the tax authorities was not in deciding to use machine learning, it was in failing to learn from the machine.