In my previous blog, I reviewed five sources of bias in machine decision-making. Here they are:
- The population. Many people have prejudices, and these prejudices end up in the data sampled from the population. Statistical learning algorithms summarize this bias in the models that they construct from these samples. Trying to eliminate this bias requires the formalization of moral values, which cannot be fully formalized, change over time, and depend on local culture. This may cause more problems than it solves.
- The sample. Samples with which algorithms are trained (and tested) are biased in the sense that they are not representative of the population. Random samples are almost never representative. Stratified samples are to some extent representative, but the members of each class are treated as identical, which they are not. This kind of bias cannot be eliminated.
- Sensors and observers. Phenomena in the real world are translated into data by sensors and observers. These may be biased and not all of this bias can be eliminated.
- The algorithm. Learning algorithms introduce bias in the form of simplifying assumptions. Some models make very few assumptions, but this increases another kind of error: variance across training samples. Such models are shaped by the idiosyncrasies of the particular training set they happen to be trained on (a small sketch after this list illustrates the trade-off).
- The decision maker. To apply the statistical model to a case, the decision maker judges that the case is sufficiently similar to the training set. If the case is actually different in ways relevant to the decision, then this assumption is a bias.
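To make the bias-variance trade-off in the algorithm bullet concrete, here is a minimal sketch in Python (my own illustration, not part of the original argument): a rigid model misses the true pattern in roughly the same way on every training sample, while a flexible model chases the noise of each particular sample and therefore varies a lot across samples.

```python
# A minimal sketch of the bias-variance trade-off (illustration only, not from the blog).
# A rigid model (degree-1 polynomial) makes strong simplifying assumptions: high bias,
# low variance. A flexible model (degree-9 polynomial) follows the noise in each
# training sample: low bias, high variance across training sets.
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(0.1, 0.9, 50)

def true_fn(x):
    return np.sin(2 * np.pi * x)

def fit_on_many_samples(degree, n_samples=200, n_points=30, noise=0.3):
    """Fit a polynomial of the given degree to many resampled training sets
    and return its predictions on x_test for each of them."""
    preds = []
    for _ in range(n_samples):
        x = rng.uniform(0, 1, n_points)
        y = true_fn(x) + rng.normal(0, noise, n_points)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_test))
    return np.array(preds)

for degree in (1, 9):
    preds = fit_on_many_samples(degree)
    bias_sq = np.mean((preds.mean(axis=0) - true_fn(x_test)) ** 2)  # systematic error
    variance = np.mean(preds.var(axis=0))                           # spread across training sets
    print(f"degree {degree}: bias^2 = {bias_sq:.3f}, variance = {variance:.3f}")
```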
The following diagram summarizes these sources. The prediction machine is a piece of software that contains the statistical model constructed by the learning algorithm.[i]
All these biases can be reduced, but none can be eliminated in all decisions. Yet the tech press suggests that we can fix this, and that where we cannot, we should stop using prediction machines. In some cases, the use of prediction machines has indeed been abandoned.
Proponents of machine-supported decision-making point out the benefits of using prediction machines. Algorithms can be made to have less bias than people and, unlike people, they treat cases with the same data in the same way.[ii] Some of them argue for replacing human decisions by machine decisions. For example, Rayid Ghani of Carnegie Mellon University says that “We are still using these algorithms called humans that are really biased. We’ve tested them and known that they’re horrible, but we still use them to make really important decisions every day.”
This discussion frames the problem as a choice between letting machines decide without the help of people or letting people decide without the help of machines. This is a false choice. We live in an imperfect world and the goal of total perfection, in this case the goal of total lack of bias, is in the end destructive.
Scientists have learned how to produce reliable results even though researchers have limited cognitive abilities. Engineers have learned to construct reliable systems from unreliable components. We should use their techniques to improve decision-making too.
In what follows I review nine questions that decision makers should ask when using a prediction machine to improve their decisions. These questions are borrowed from the methodologies of science and engineering.[iii]
(1) Assessing prejudices in the population
Even if the decision maker aims to reduce bias in the population, he or she cannot change this at the time of constructing the training set. Moreover, we may need to construct a statistical model of the population in order to become aware of bias at all. Data scientists and decision makers should keep asking whether the population is biased in a way that affects the statistical model.
(2) Assessing the representativeness of the sample
No sample is representative of the population in all possible ways. The data scientist must assess in which ways the composition of the sample covers the composition of the population, and in which ways it does not cover it.
If the data scientist wants to know about prejudices in the population then their aim is to include these in the sample. But if the decision maker wants to avoid common prejudices then they want to exclude these from the sample. In any case, both need to understand the sampling method and what it means for the relation between sample and population.
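As a hedged sketch of what such an assessment could look like in practice (my own example; it assumes the population and sample are available as pandas DataFrames and that the relevant attributes are chosen by hand, neither of which the argument above requires):

```python
# A minimal sketch comparing the composition of a training sample to that of the
# population on a few chosen attributes. The comparison only shows over- and
# under-representation on attributes that were recorded and chosen; everything
# else remains invisible.
import pandas as pd

def composition_report(population: pd.DataFrame, sample: pd.DataFrame,
                       columns: list[str]) -> pd.DataFrame:
    """Compare the share of each category in the sample with its share in the population."""
    rows = []
    for col in columns:
        pop_share = population[col].value_counts(normalize=True)
        sam_share = sample[col].value_counts(normalize=True)
        for category in pop_share.index.union(sam_share.index):
            rows.append({
                "attribute": col,
                "category": category,
                "population_share": pop_share.get(category, 0.0),
                "sample_share": sam_share.get(category, 0.0),
            })
    report = pd.DataFrame(rows)
    report["difference"] = report["sample_share"] - report["population_share"]
    return report

# Hypothetical usage; "age_group" and "region" are placeholder column names:
# print(composition_report(population_df, sample_df, ["age_group", "region"]))
```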
(3) Reliability of sensors and observers
Getting reliable data about the sample in the training set may take up 80% of the effort of constructing a prediction machine. Data may have been collected for different purposes, sensors may have been biased and people may have made mistakes and omissions when entering data. Data scientists must summarize data quality issues and decision makers must be aware of this to assess the trust they can put in the prediction machine.
Decision makers must also ask how reliable the data is about the case to be decided.
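A minimal data-quality audit might look like the sketch below (my own example, assuming tabular data in pandas; the blog does not prescribe any particular tooling). It only surfaces the mechanical problems, such as missing values, duplicates and out-of-range values; bias introduced by sensors and observers still has to be judged by people.

```python
# A minimal data-quality summary: missing values, out-of-range values and duplicates
# per training-set column, to be reported by data scientists to decision makers.
import pandas as pd

def quality_summary(df: pd.DataFrame,
                    valid_ranges: dict[str, tuple[float, float]]) -> pd.DataFrame:
    """Report missing and out-of-range counts per column, plus duplicate rows."""
    rows = []
    for col in df.columns:
        missing = int(df[col].isna().sum())
        out_of_range = 0
        if col in valid_ranges:
            lo, hi = valid_ranges[col]
            out_of_range = int(((df[col] < lo) | (df[col] > hi)).sum())
        rows.append({"column": col, "missing": missing, "out_of_range": out_of_range})
    print(f"duplicate rows: {df.duplicated().sum()}")
    return pd.DataFrame(rows)

# Hypothetical usage; 'age' and its valid range are placeholders:
# print(quality_summary(training_df, {"age": (0, 120)}))
```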
(4) Assessing construct validity
Data sets do not always contain the data about the phenomena that the decision maker is interested in. For example, in a system that assigned patients to risk categories, the risk category was based on health-care cost, whereas what the decision maker was interested in was health risk.
Some people have a larger budget to spend on health care than others. People who will or can spend less on health will then systematically be assigned to a lower risk category. Machine predictions based on this measurement are biased against patients with a lower health care budget.
The technical term for the match between what decision makers want to know and what the data represents is construct validity. Health care cost is not a valid indicator of the concept (construct) of health risk. Its construct validity is low.
The company that developed the system worked with researchers to find other variables that are better indicators of health risk. According to one report, it was able to reduce bias by 84%. The company also stated that doctors should use their own expertise to make a decision, implying that it is not responsible for any bias in the final decision. This is an exercise in shifting responsibility to others that I will return to in my next blog, about how to embed prediction machines in the ecosystem for decision-making.
Sometimes, the construct validity of data cannot be improved. In the COMPAS system to predict recidivism of suspects, the decision makers were interested in the likelihood of future crimes. But the data represented arrests, not crimes. Arrest rates depend in part on how heavily a neighborhood is policed and on how low the threshold for arrest is. Not every crime leads to an arrest. And not every arrest leads to a conviction. Arrest data is used because it is what police departments record. But it has limited validity if you want to know about crime rates.
Assessing the limits to construct validity of the data helps decision makers interpret the meaning of machine predictions.
(5) Explanation of the kinds of classification errors made
Any statistical learning algorithm is designed with error rates in mind. Decision makers must understand these error rates when they use prediction machines to make their own decisions. However, explaining these error rates to decision-makers is a daunting task.
For example, in the COMPAS system to predict recidivism of suspects, the percentage of false positives (people predicted to commit crimes again who did not do so) was much higher for Blacks than for Whites. For false negatives (people predicted to commit no more crimes who in fact did commit them), the bias was reversed.
Researchers in machine prediction and the developer of the system pointed out that the percentage of true predictions was the same for Blacks and for Whites. However, regardless of what was predicted, the numbers show a higher rate of recidivism for Blacks than for Whites. And that means that there will be a higher percentage of false positives for Blacks.
The COMPAS critics demanded fairness on the study population, on which the system was trained. The COMPAS designers provided fairness on the target population, for which the system was used to make predictions. Unfortunately, when the base rates of the two groups differ, it is not possible to satisfy both fairness criteria (the one used by COMPAS and the one used by its critics) at the same time.[iv]
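To make this tension concrete, here is a small simulation of my own with made-up numbers (not the actual COMPAS data): one and the same scoring rule is applied to two groups with different base rates, and the two fairness criteria come apart.

```python
# A simplified illustration of the two fairness notions. The same noisy scoring rule
# is applied to two groups with different base rates of reoffending. For each group
# we report the false positive rate (the critics' criterion) and the positive
# predictive value (the designers' criterion).
import numpy as np

def group_rates(y_true: np.ndarray, y_pred: np.ndarray) -> dict[str, float]:
    """False positive rate and positive predictive value (1 = predicted/actual reoffence)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return {"fpr": fp / (fp + tn), "ppv": tp / (tp + fp)}

rng = np.random.default_rng(1)
for group, base_rate in [("group A", 0.5), ("group B", 0.3)]:
    y_true = (rng.uniform(size=100_000) < base_rate).astype(int)
    # One and the same imperfect risk score for both groups:
    score = 0.4 * y_true + 0.6 * rng.uniform(size=100_000)
    y_pred = (score > 0.5).astype(int)
    print(group, {k: round(v, 3) for k, v in group_rates(y_true, y_pred).items()})
```

In this sketch the false positive rates come out roughly equal while the predictive values differ between the groups; equalizing the predictive values instead, as COMPAS did, unbalances the false positive rates.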
This episode illustrates three things.
First, understanding error rates is hard. Understanding the fine distinctions usually requires some mathematics or some visual explanations.
Second, defining error rates that are non-discriminatory involves making uncomfortable choices. The prediction machine cannot be fair in all possible ways at the same time.
Third, explaining the choice of error rates to decision makers is daunting. The discussions about COMPAS fairness are abstruse. They cannot be understood by judges, the people who use the system. Yet, they must understand the issues if they are to take responsibility for their decisions.
The world of big data and machine predictions is statistical and we need to do more work to make error rates and risks comprehensible for decision makers.[v]
(6) Explanation of the correlation
Machine learning is the identification of correlations in a data set. For example, supervised learning of a classifier identifies correlations between features of a case and possible outcomes of the case. To understand why a machine predicted a particular outcome, you need to understand which features of the case contributed most to this prediction. What also helps is to understand what the system considered to be alternative, but less likely, outcomes.
For example, the LIME system generates explanations of a prediction by identifying the most important features that contributed to a classification. It does this in a smart way that works for any learning algorithm. In an example, the classification of an image of a tree frog turns out to hinge on the two red eyes of the frog. Alternative, but less likely, classifications given by the system are a pool table with two red billiard balls and two red balloons. This information makes the prediction of the system easier to understand for the user, and it also removes the magic from machine intelligence.
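As a hedged sketch of what this looks like in code (my own example on a standard tabular dataset rather than the image example above; it assumes the lime package and a scikit-learn classifier):

```python
# A minimal sketch: LIME is asked which features contributed most to one prediction
# and which alternative class it considered less likely. The dataset is illustrative.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

data = load_iris()
clf = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=data.feature_names,
    class_names=list(data.target_names),
    mode="classification",
)
explanation = explainer.explain_instance(
    data.data[0],          # the single case to be explained
    clf.predict_proba,     # any model that outputs class probabilities will do
    num_features=3,        # the most important features for this prediction
    top_labels=2,          # also show the runner-up class
)
for label in explanation.available_labels():
    print(data.target_names[label], explanation.as_list(label=label))
```

The output lists, for the predicted class and the runner-up class, the features that weighed most heavily in this particular prediction.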
To see how useful this is, consider a system that predicted that people without fixed telephone lines are less likely to turn up at court. This was a relic from the days when there was no mobile telephony, but decision makers still used this prediction. Had they known which factor explained this prediction, they would have made different decisions.
(7) Explanation of the underlying mechanism
A correlation found among variables in a data set may indicate an underlying mechanism. Or it may be a meaningless coincidence present in the training set.
For example, if a prediction machine diagnoses a likely flu when a patient has headache, fever and sneezes, a physician who uses this machine knows that there is a biological mechanism that leads from the flu to these symptoms. This increases trust in the prediction.
The mechanism that produces a correlation need not be a causal relation among the correlated variables. It may be a third phenomenon that produces the correlation. For example, ice cream sales and crime rates are correlated, but this is not a causal relationship. The explanation of the correlation is that in warm weather, people are more likely to be on holiday (enabling burglaries) and buy more ice cream.
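A tiny simulation of my own (made-up numbers) shows how a confounder produces such a correlation: temperature drives both variables, and once we compare only days with similar temperatures, the correlation largely disappears.

```python
# A confounder at work: temperature drives both ice cream sales and opportunities
# for crime, which makes the two correlate even though neither causes the other.
import numpy as np

rng = np.random.default_rng(0)
temperature = rng.normal(18, 8, 5_000)                       # daily temperature
ice_cream = 2.0 * temperature + rng.normal(0, 5, 5_000)      # sales driven by temperature
crime = 1.5 * temperature + rng.normal(0, 5, 5_000)          # incidents driven by temperature

print("overall correlation:", round(np.corrcoef(ice_cream, crime)[0, 1], 2))

# Condition on the confounder: look only at days in a narrow temperature band.
band = (temperature > 17) & (temperature < 19)
print("within one temperature band:", round(np.corrcoef(ice_cream[band], crime[band])[0, 1], 2))
```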
To trust that a correlation is meaningful, we must believe that there is an underlying mechanism that produces it. Mechanisms consist of physical, biological, psychological, social or digital entities that interact with each other, and this interaction produces an effect.[vi] They may be nondeterministic and we may not fully understand them. But at least we must believe that some mechanism exists that produces the correlation.
For some applications of machine learning, it is hard to imagine any mechanism that could be responsible for the correlation. Predicting whether someone is a criminal, gay, or autistic purely on the basis of face images belongs in this category.
Knowing which features contributed to a prediction (question 6) is not enough. The decision maker must know that there is an underlying mechanism that explains this prediction. If there is no credible mechanism, then the prediction is not credible either.
(8) Judging similarity
When a statistical model is applied to a case, the human decision maker assesses similarity between the case to be decided and the cases in the training sample. If this similarity judgment is based only on the observable features of the case, then conclusions drawn from it are shaky.
For example, the following similarity judgment is based on features only:
- Brains can think.
- This walnut has the same outward features as a brain.
- This walnut probably can think.
What is missing is structural similarity. The mechanisms of thinking present in the brain are absent from walnuts. We are sure of this even if we do not fully understand these mechanisms.
To assess the similarity of the case to be decided to the cases in the training set, the decision maker must assess whether the underlying mechanism that explains the correlation in the training set is also present in the case at hand.
For example, the mechanism that explains the correlation between flu and symptoms like headache, fever and sneezing is not only present in the training sample, it is also present in the patient currently diagnosed by the physician. This justifies the use of the machine prediction in the diagnosis of this patient.
As a counterexample, the algorithm used in the UK to predict school exam results incorporated zip code in its model. Students from a zip code where most students had lower results in the past were downgraded. However, there is no mechanism that produces a student’s exam grade from the average exam grades in her zip code in the past. As soon as the Office of Qualifications and Examinations Regulation realized this, it abandoned the predictions. This illustrates the importance of understanding which factors contribute to a decision (question 6), identifying whether this is a coincidence or based on a mechanism (question 7), and assessing whether this mechanism is present in the case to be decided (question 9).
Nine questions to ask when using a prediction machine
The following diagram shows how the questions are related to the different sources of bias.
I summarize each question in a number of variants.
- Population prejudices. Are there any prejudices in the population that could affect the statistical model and influence the decision? How do we know that these prejudices exist?
- Representativeness. In which ways does the training sample represent the population? In which ways is it not representative of the population? How does this affect the quality of the decision?
- Reliability of the training data. How reliable are the sensors and people that entered the data in the training set? What biases can be present (in addition to biases present in the population)? What data has been omitted?
- Construct validity. How valid are the measurements? How is the data measured? How reliable are the sensors? What questionnaires are used and are these filled out truthfully? What is the meaning of the variables? What do we want to know about the case? Is this adequately represented by the data?
- Classification errors. What classification errors are made? What are the true positive and true negative rates on the sample? What percentage of predictions is correct? In which way is the model fair, and in which way is it unfair?
- Contributing variables. Which variables contributed most to the prediction? What are the alternative (but less likely) predictions?
- Underlying mechanism. What is the theory behind the prediction? Is there a mechanism that can explain the correlation in the sample? Do we know which mechanism? Or do we simply assume there is one?
- Reliability of the case data. How reliable are the sensors and people that entered the case data? What biases can be present? What data has been omitted?
- Similarity. In which ways is the case similar to the sample cases? Is the similarity feature-based? Is the case structurally similar to the sample cases? And in which ways is it different? Is this relevant for the decision to be made? Is the mechanism that explains the prediction present in this case?
An undesirable conclusion
Trying to find answers to these questions will inevitably slow down decision-making. Answering them will improve decisions, but it requires effort and takes time.
This leads to the undesirable conclusion that automation makes decision-making harder. This is annoying. Machines should make life easier. They can do some things much better and faster than people, or they can do things that we could not do at all.
This is true for prediction machines too: statistical learning algorithms can recognize patterns in data where we cannot. They can make correlations visible that we cannot see. But they can also make correlations visible that are meaningless or irrelevant. And this means that we have to think about the meaning and relevance of correlations, which means we have to go through the nine decision-making questions listed in this blog.
Using prediction machines to make decisions can make decisions better or worse, depending on the quality of the population, sample, sensors, observers, data set, learning algorithm, and decision maker. To make them better and not worse, the decision maker must answer the nine questions listed above.
But which decision maker has the time and capability to do so? A judge who takes 30 seconds for a decision about detainment does not take the time to answer these questions. And who is the decision maker anyway? The art of slow decision-making may be noble but it is impractical.
This leads us to the design of the decision-making ecosystem. Who is involved in a machine-supported decision? How are the tasks divided over the participants? Who bears which responsibility? This is the topic of my next blog.
[i] A. Agrawal, J. Gans, A. Goldfarb. Prediction Machines: The Simple Economics of Artificial Intelligence. Harvard Business Review Press, 2018.
[ii] S. Goel, R. Shroff, J.L. Skeem and C. Slobogin. “The Accuracy, Equity, and Jurisprudence of Criminal Risk Assessment” (December 26, 2018). Available at SSRN: https://ssrn.com/abstract=3306723 and http://dx.doi.org/10.2139/ssrn.3306723
[iii] R.J. Wieringa. Design Science Methodology for Information Systems and Software Engineering. Springer 2014.
[iv] S. Linn. “A new conceptual approach to teaching the interpretation of clinical tests.” Journal of Statistics Education, 12(3), 2004. https://doi.org/10.1080/10691898.2004.11910632
[v] G. Gigerenzer. Risk Savvy: How to Make Good Decisions. Penguin, 2015.
[vi] R.J. Wieringa. Design Science Methodology for Information Systems and Software Engineering. Springer 2014. Chapter 14, “Abductive Inference Design”.