A decision support system to allocate health care to patients discriminated against Black people. A recruiting tool, now withdrawn, was biased against women. Online advertising algorithms have been shown to discriminate against Blacks. An algorithm that predicted school exam results was biased against students from disadvantaged communities and in favor of students from private schools and was withdrawn.
Critics of machine-supported decisions respond to these and other examples of bias by demanding that these systems be abandoned. Unless machines can be totally debiased, they say, decisions about people should be made by people.
Proponents of machine learning point out that people are biased too. And since machines are easier to debias than people, we should debias machines and delegate decision-making to them. As a bonus, they are guaranteed to treat similar cases in a similar way.
I agree with the observations of both sides but disagree with their conclusions.
First, I disagree with the critics that biased machines should always be abandoned. Complete absence of bias is impossible. And biased machine predictions can still be useful.
Second, I agree that bias in machine decisions can be made smaller than human bias, at least in some respects. But this does not mean that we should leave decision-making to machines because they do it better than we do. Rather, decision makers should work to understand machine bias and use it to reduce their own bias and improve their decisions.
Understanding bias in machines and using this understanding to improve your decision takes effort that will slow down decision-making. This has implications for the design of the decision-making ecosystem. Who is responsible for a decision? Who has the time and obligation to understand machine predictions and take responsibility for the decision?
The motivation for this blog is that we should not move to a world in which people don’t care about their own bias and think the problem is solved if they delegate their decision to a machine.
This blog has two parts. In the first part I review the kinds of bias in machine decision making and show they cannot be eliminated. In part 2 I show how decision makers can turn this limitation into a positive force to improve themselves and their decision-making.
Machines support decision-making by making predictions based on a statistical model constructed by statistical learning algorithms from large data sets. This involves four major tasks.
First, a sample from a population is selected. For example, a sample of personnel hiring decisions is selected from the population of all personnel hiring decisions.
Next, data about the sample is collected: job role, demographics of the candidate, accept and reject decisions, performance of accepted candidates with similar characteristics, etc. Part of the sample will be used as the training set, the other part as the test set.
Third, a learning algorithm constructs a model from the training set. This may be a simple linear regression, a deep neural network, a support vector machine, etc.[i] The model is tested on the test set, to estimate error rates. This may lead to improvements of the model.
Fourth, the final model is used to make a prediction about a new case. This means that the variables that characterize the new case are fed into prediction software, which uses the model to predict likely values of a missing variable. For example, based on data about a new candidate, the system will estimate a likely value of an unknown variable, such as the fitness of the candidate for the job.
This application of the model to a case is called prediction, although the mathematics has nothing to do with predicting the future. “Prediction” in this context is estimating an unknown value of a variable from known values of other variables. The model is implemented in a software system called a prediction machine.[ii]
These four tasks may overlap in time and may even be all performed in parallel. For example, feedback about the outcome of new decisions may be added to the data set on which the algorithm is retrained periodically. For our analysis this makes no difference.
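The four tasks can be sketched in a few lines of Python. This is only an illustration under invented assumptions: the population is synthetic, the variables (years of experience and a performance score) are hypothetical, and a least-squares line stands in for whatever learning algorithm would be used in practice.

```python
import random


def make_population(n, seed=0):
    """Hypothetical population of (years_experience, performance_score) pairs."""
    rng = random.Random(seed)
    return [(x, 50 + 2.0 * x + rng.gauss(0, 5))
            for x in (rng.uniform(0, 20) for _ in range(n))]


def fit_line(train):
    """Ordinary least squares for the model y = a + b * x."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def mean_abs_error(model, cases):
    """Average absolute difference between predicted and observed y."""
    a, b = model
    return sum(abs(y - (a + b * x)) for x, y in cases) / len(cases)


def run_pipeline(seed=0):
    rng = random.Random(seed)
    population = make_population(10_000, seed)
    sample = rng.sample(population, 500)       # task 1: select a sample
    train, test = sample[:400], sample[400:]   # task 2: data collection (simulated), then split
    model = fit_line(train)                    # task 3: learn a model ...
    test_error = mean_abs_error(model, test)   # ... and estimate its error on the test set
    a, b = model
    prediction = a + b * 7.5                   # task 4: estimate the unknown value for a new case
    return model, test_error, prediction


if __name__ == "__main__":
    print(run_pipeline())
```

Note that task 4 is "prediction" only in the sense described above: the model fills in a missing value for a case with 7.5 years of experience. Nothing in the code refers to the future.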
The ecosystem of a decision
Machine-supported decision making takes place in an ecosystem of people, organizations, and devices. The following diagram shows a network of actors that are part of this ecosystem. The arrows denote information flow. The numbers indicate possible sources of bias, explained below. For simplicity, the testing process has been omitted from the diagram.
Now let’s look at the sources of bias in more detail.
(1) Bias in the population
In the real world, people have prejudices, which lead to unfair, biased decisions. People who make hiring decisions may have biases against women or Blacks that they may not be aware of. People who label pictures for machine learning have biases that may lead to derogatory labeling. If their decisions are used to train a machine, then the machine will exhibit the same biases.
This kind of bias is not a mathematical property of the data set but a moral property of the population being sampled. Because it is present in the population, it may be present in samples of the population.
Some researchers are trying to transform data sets collected from morally unacceptable samples into data sets that could have been collected from morally acceptable ones. This effort is well-intentioned but rests on a questionable assumption, namely that we can formalize morality in mathematics. Not all of morality can be formalized.
Moreover, morality may depend on culture and it evolves over time. Even if we could formalize some of morality, this formalization would have to be parametrized by culture and would have to change as our moral insights change.
Finally, transforming biased data sets into morally ideal data sets may introduce more problems than it attempts to solve. What if the decision maker does not agree with these ideals? What if the decision subjects don’t agree with them? What if the transformation introduces new moral imperfections?
The bottom line is that bias in the population will not go away and some of it will end up in the training data set.
(2) Bias in the sample
The sample used for training the algorithms may be biased because it is not representative of the population. This is another bias concept, namely a lack of representativeness.
But what does it mean to be “representative”?
Random samples are not representative
Some authors consider a randomly selected sample to be representative of the population. This is a mistake.
A sample is selected randomly from a population if every element is selected independently of the others and with the same probability as the others.[iii] Unfortunately, random samples are rarely representative of the population from which they are drawn.
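For example, suppose an urn contains an equal number of black and white pebbles. A random sample from the urn might contain only black or only white pebbles. Improbable, but possible. In fact, when pebbles are drawn with replacement, any particular sequence of draws that yields only black pebbles is exactly as probable as any particular sequence that yields an equal number of black and white pebbles, the kind of sample that most of us would consider representative of the population. Balanced samples turn up more often only because more sequences produce them, and for samples of more than a few pebbles an exactly balanced result is still the exception.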
Random samples are almost never representative of the population they have been drawn from. The requirement of randomness stems from the fact that repeated random sampling from the same population allows one to estimate population characteristics.[iv] This is not a property used in statistical learning.
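To put numbers on the urn example, here is a small calculation. It assumes a fair urn (half black, half white) and draws with replacement, so that every draw is an independent coin flip; the function names are my own.

```python
from math import comb


def prob_one_sequence(n):
    """Probability of any one specific ordered sequence of n draws from a fair urn."""
    return 1 / 2 ** n


def prob_exact_black(n, k):
    """Probability that n draws contain exactly k black pebbles, in any order."""
    return comb(n, k) / 2 ** n


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n,
              prob_exact_black(n, n // 2),  # exactly balanced sample
              prob_exact_black(n, n))       # all-black sample
```

For ten draws, an exactly balanced sample has a probability of about 0.25 and an all-black sample about 0.001. For a hundred draws, the chance of an exactly balanced sample drops to about 0.08. The balanced composition is the single most likely one, yet most random samples are not balanced.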
But perhaps we can construct representative samples systematically?
Covering samples are not representative
In a series of papers, Kruskal and Mosteller identify nine different meanings of the term “representative sample”.[v] One of these is the concept of covering. Suppose we take a variable, e.g. color, and select sufficiently many cases for each value of the variable?
In the urn example, we would put all black pebbles in one urn and all white pebbles in another and select a random sample of the same size from each. The combined sample would be more representative of the population of black and white pebbles than a random sample taken from the original urn. Procedures like this are followed in stratified sampling.[vi]
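The two-urn procedure can be sketched as follows. This is a minimal illustration, not a full stratified sampling design: the pebble records and the `stratified_sample` helper are hypothetical, and every stratum is assumed to contain at least `per_stratum` cases.

```python
import random


def stratified_sample(population, key, per_stratum, seed=0):
    """Split the population by key(item), then draw the same number from each part."""
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)  # one "urn" per value
    sample = []
    for value in sorted(strata):
        # rng.sample raises ValueError if a stratum has fewer than per_stratum items
        sample.extend(rng.sample(strata[value], per_stratum))
    return sample


if __name__ == "__main__":
    # Hypothetical urn: 150 black and 50 white pebbles.
    pebbles = [{"id": i, "color": "white" if i % 4 == 0 else "black"}
               for i in range(200)]
    sample = stratified_sample(pebbles, key=lambda p: p["color"], per_stratum=10)
    print(sum(p["color"] == "black" for p in sample), "black of", len(sample))
```

The sample covers both colors equally by construction. Within each color, however, the code treats all pebbles as interchangeable.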
The problem here is that the pebbles in each urn are treated as identical, which obviously they aren’t. Different pebbles have different shapes, texture, weight, age, size, etc. And there are many shades of black and white.
Perhaps we should distinguish more colors? And perhaps we should also partition according to marital status, sex, family size, income level, occupation, geographic area and education level? This will give us close to 60 000 classes of pebbles, even if we disregard color.
And still we left out many potentially relevant variables. Still we treat different cases in the same class as identical. Which of the potentially infinite number of variables should we choose? Who determines which variables to add? Who determines that we now have a level of granularity that makes decisions by the system fair? When is our data set large enough to have a sufficient size for each class? A stratified sample represents the population in a very limited sense and does not represent it in all other senses.
But perhaps one of the other concepts of representativeness reviewed by Kruskal and Mosteller is useful for our purpose? To cut a long story short: No, none of them is.[vii] The phrase “representative data set” is a feel-good term. Data sets do not represent the population in many ways, some of them unknown.
(3) Bias in data collection
What is euphemistically called “data collection” is in reality data construction. All data originates from physical events. Thermometers, accelerometers, gyroscopes and other sensors are triggered by physical events and produce signals, which are then stored as data in databases, log files etc. From there, they find their way into the training set used by a statistical learning algorithm. No sensor is perfect, all of them have some margin of error, and some of them will be biased in some way.
Or data is produced by observers, for example people who fill out a questionnaire about subjects. The observers use a physical keyboard, click on fields of an online form, etc. But let’s assume that the physical input devices are reliable. Then the observer is still a major source of bias.
For example, the data used by COMPAS, a system that predicts the likelihood of recidivism of criminal defendants, consists of the answers to 126 questions about the defendant and 11 questions about the criminal attitudes of the observer. Some of these questions assume total correctness of police records, others rely on memory of the defendant, and yet others require self-interpretation. There will be some bias in some of the answers.
And sometimes the observer is the subject itself. We enter a great deal of data by navigating through web pages and apps and provide a lot of data in response to questions. Not all of that is equally reliable.
The designers and engineers of learning algorithms may try to restrict bias, try to measure it, and try to compensate for it. But there are limits to this and there will be some bias in the training set caused by unreliable sensors and observers.
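For sensors, the difference between random error and systematic bias can be made concrete with a small simulation. Everything in it is invented for illustration: a hypothetical thermometer that reads 0.5 degrees too high, with random noise on top, checked against a reference value that is assumed to be known exactly.

```python
import random
import statistics


def simulate_readings(true_value, offset, noise_sd, n, seed=0):
    """Readings of a hypothetical sensor: true value + fixed offset + random noise."""
    rng = random.Random(seed)
    return [true_value + offset + rng.gauss(0, noise_sd) for _ in range(n)]


def estimate_bias(readings, reference_value):
    """Average deviation from a trusted reference; the noise averages out, the offset does not."""
    return statistics.mean(readings) - reference_value


if __name__ == "__main__":
    readings = simulate_readings(true_value=20.0, offset=0.5, noise_sd=0.3, n=10_000)
    print("estimated bias:", estimate_bias(readings, reference_value=20.0))
    print("spread of readings:", statistics.pstdev(readings))
```

Collecting more readings shrinks the effect of the noise but leaves the offset untouched. Measuring the offset requires a trusted reference, and outside the laboratory such a reference is often not available, which is one of the limits mentioned above.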
(4) Bias in the algorithm
Statistical learning algorithms construct a statistical model of a population by making simplifying assumptions. These assumptions introduce bias, but now in a fourth meaning of the term.
For example, in linear regression, the relation between two variables is assumed to be linear, an assumption that is almost always false but makes the model easy to understand. The algorithm tries to fit the best straight line (or n-dimensional plane) through a set of data points. This leads to a systematic error in the model.
We can reduce the number of assumptions, which will reduce model bias. But this will give a model that is overfitted to the data set. For example, a deep neural network will construct a model of a data set with little bias but it will be sensitive to even very small changes in the data set.[i]
The price we pay for dropping simplifying modeling assumptions is high variance of the model across data sets. We may view this as yet another kind of bias, not imposed by modeling assumptions but induced by the training sample. When bias caused by simplifying assumptions is low, variance across training sets is large, and the other way around.
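The trade-off can be observed in a small simulation. The setup is my own and purely illustrative: the "true" relation is a sine curve, a straight line plays the role of the model with strong assumptions, and a nearest-neighbor rule (predict the value of the closest training case) stands in for a flexible model with few assumptions, in place of a deep neural network. Both are trained on many different training sets and asked for a prediction at the same point.

```python
import math
import random
import statistics


def true_f(x):
    """The relation we pretend not to know."""
    return math.sin(x)


def draw_training_set(rng, n=30, noise_sd=0.3):
    """One noisy training set of n cases with x uniform on [0, pi]."""
    xs = [rng.uniform(0, math.pi) for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0, noise_sd)) for x in xs]


def predict_linear(train, x0):
    """Least-squares straight line: strong assumption, hence systematic error."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    return my + (sxy / sxx) * (x0 - mx)


def predict_nearest(train, x0):
    """Nearest neighbor: almost no assumptions, follows the training data closely."""
    return min(train, key=lambda case: abs(case[0] - x0))[1]


def bias_and_variance(predict, x0, n_sets=500, seed=0):
    """Squared bias and variance of the prediction at x0 across many training sets."""
    rng = random.Random(seed)
    preds = [predict(draw_training_set(rng), x0) for _ in range(n_sets)]
    bias_sq = (statistics.mean(preds) - true_f(x0)) ** 2
    return bias_sq, statistics.pvariance(preds)


if __name__ == "__main__":
    x0 = math.pi / 2
    print("linear  (bias^2, variance):", bias_and_variance(predict_linear, x0))
    print("nearest (bias^2, variance):", bias_and_variance(predict_nearest, x0))
```

In this setup the straight line should show the larger squared bias and the smaller variance, and the nearest-neighbor rule the reverse. The exact numbers depend on the invented noise level and sample size.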
(5) Bias in the application to a case
When a statistical model is applied to a case, the model estimates the value of some unknown variable based on what is known about the case. For example, if it has found a correlation between color and size of the pebbles in a sample, it uses this correlation to predict a size when given the color of a new pebble, or to predict the color given its size.
The application of the model to a case is based on the assumption that the case is similar to the cases in the training sample. If the case is in fact different, then using the model to predict a property of the case is an example of bias.
For example, if the model has been trained on pebbles but is applied to a seashell, then the prediction will have a pebble-bias towards the seashell. The machine will still make a prediction based on case data, but the prediction will be meaningless.
In clinical decision-making this is called distributional shift.[viii] New cases may have relevant differences for which the information in the training set is not sufficient. Clinicians who recognize this will proceed with caution.
In contrast, if decision makers use a prediction machine as if the new case is similar to the cases in the training sample, even though it is not, this is a form of bias. This fifth kind of bias is a property of the decision maker.
Decision makers must do their best to judge the (dis)similarity of the case to the training sample. But people are imperfect and we cannot expect all similarity judgments to be flawless.
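A very crude aid for this judgment is to check whether the measured variables of the new case fall inside the range seen in the training sample. The sketch below does only that. The pebble and seashell measurements, the variable names and both helper functions are hypothetical.

```python
def feature_ranges(training_cases):
    """Smallest and largest value seen in the training sample, per variable."""
    names = list(training_cases[0])
    return {name: (min(case[name] for case in training_cases),
                   max(case[name] for case in training_cases))
            for name in names}


def out_of_range_features(case, ranges):
    """Names of the variables whose value lies outside the training range."""
    return [name for name, (low, high) in ranges.items()
            if not low <= case[name] <= high]


if __name__ == "__main__":
    # Hypothetical training sample of pebbles.
    pebbles = [
        {"weight_g": 12.0, "diameter_mm": 18.0},
        {"weight_g": 30.0, "diameter_mm": 27.0},
        {"weight_g": 48.0, "diameter_mm": 39.0},
    ]
    ranges = feature_ranges(pebbles)
    new_pebble = {"weight_g": 20.0, "diameter_mm": 25.0}
    seashell = {"weight_g": 3.0, "diameter_mm": 60.0}
    print(out_of_range_features(new_pebble, ranges))
    print(out_of_range_features(seashell, ranges))
```

A check like this can only raise a warning. A seashell that happens to weigh and measure the same as a pebble would pass it, because the relevant difference lies in variables that were never recorded. The judgment of similarity remains with the decision maker.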
Machine bias should be used to reduce human bias
To sum up, there are five sources of bias in machine decision making.
- The population. Many people have prejudices that may end up being sampled. Trying to eliminate this requires formalization of moral values that cannot be fully formalized, change with time and depend on local culture.
- The sample. Samples with which algorithms are trained (and tested) are biased in the sense that they are not representative of the population. Random samples are usually not representative. Stratified samples are to some extent representative, but the members of each class are treated as identical. This kind of bias cannot be eliminated.
- Sensors and observers. Phenomena in the real world are translated into data by sensors and observers. These may be biased and not all of this bias can be eliminated.
- The algorithm. Learning algorithms introduce bias in the form of simplifying assumptions. Some algorithms make very few assumptions but this increases another kind of error, variance across training samples. Those models are biased by the training set.
- The decision maker. To apply the statistical model to a case, the decision maker judges that the case is sufficiently similar to the cases in the training set. If the case is actually different in ways relevant to the decision, then this assumption is a bias. Decision makers should do their best to avoid this, but we cannot expect all similarity judgments to be flawless.
Does this mean we should abandon machine predictions altogether? No, we live in an imperfect world and should not aim for total perfection. We should work to reduce bias where possible, but this will not lead to unbiased prediction machines.
Instead of asking for completely unbiased decision-making, we should design the decision ecosystem in such a way that a decision maker makes better decisions with a prediction machine than without it. We can do this using techniques from science and engineering. In the next blog I will discuss how to do this.
[i] G. James, D. Witten, T. Hastie, R. Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer, 2013.
[ii] A. Agrawal, J. Gans, A. Goldfarb. Prediction Machines: The Simple Economics of Artificial Intelligence. Harvard Business Review Press, 2018.
[iii] I ignore the difference between selection with and without replacement here. This does not affect my argument.
[iv] R.J. Wieringa. Design Science Methodology for Information Systems and Software Engineering. Springer, 2014.
[v] W. Kruskal and F. Mosteller. “Representative sampling, I: Non-scientific literature”. International Statistical Review, 47(1) (April 1979), 13-24. “Representative sampling, II: Scientific literature, excluding statistics”. International Statistical Review, 47(2) (August 1979), 111-127. “Representative sampling, III: The current statistical literature”. International Statistical Review, 47(3) (December 1979), 245-265. “Representative sampling, IV: The history of the concept in statistics, 1895-1939”. International Statistical Review, 48(2) (August 1980), 169-195.
[vi] D. Freedman, R. Pisani, R. Purves. Statistics. Fourth Edition. Norton & Company, 2007.
[vii] One meaning that seems useful is that of representativeness as ideal type. But reconstructing a sample so that it only contains ideal cases results in a sample that is not representative of the population. The other meanings discussed by Kruskal and Mosteller are, using their numbering: (1) general, usually unjustified, acclaim for the data (the emperor’s new clothes); (2) absence of selective forces (impossible to prove); (3) miniature of the population (a model train set, which contains only a few of the features of the original); (6) a vague term to be made precise later in the same work; (7) a specific sampling method (e.g. random sampling or stratified sampling); (8) permitting good estimation of a population characteristic; (9) good enough for a particular purpose (e.g. for an order-of-magnitude estimation of a population characteristic).
[viii] R. Challen, J. Denny, M. Pitt, L. Gompels, T. Edwards, K. Tsaneva-Atanasova. “Artificial intelligence, bias and clinical safety.” BMJ Quality and Safety, vol 28 (2019), pages 231-237.