
Augmenting the Capabilities of Human Decision-makers with Human and Artificial Intelligence Complementarity

By Gavin Kerrigan

Machine learning (ML) algorithms are becoming increasingly capable of performing numerous real-world tasks, ranging from image classification and text generation to autonomous driving and game playing. In some cases, the performance of ML models can even exceed that of human experts [2]. Yet despite their impressive capabilities, such models can make errors that yield negative real-world consequences. This reality has led researchers to consider using these models in conjunction with a human decision-maker, rather than on their own. In an ideal setting, the strengths of ML models would complement those of humans — i.e., the model should make correct predictions when the human is incorrect and vice versa. This notion of complementarity is key to building highly performant human/artificial intelligence teams.

The approach raises several questions: 

  1. How can we algorithmically combine a model’s predictions with those of a human? A simple approach like majority voting will not suffice, since there are only two predictions. We instead use the confidence of the model and the human as a signal that determines which prediction to follow (see the sketch after this list). However, confidence can be misleading; for example, ML models tend to be overconfident in their predictions [1]. 
  2. Under what circumstances will a human-machine team outperform a human or model that is working alone? Understanding the scenarios for which we can achieve complementarity is an important step towards knowing when and how to deploy human-machine teams in the real world. For instance, humans and machines typically have different accuracy levels for any given task, so it is crucial for us to learn how this difference in accuracy affects the human-machine team performance. 
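
To make the first question concrete, here is a minimal sketch of one plausible arbitration rule: calibrate the model’s probabilities with temperature scaling, in the spirit of [1], and then follow whichever predictor reports higher confidence. The temperature value, the numeric mapping of the human’s categorical rating, and all function names are illustrative assumptions, not the method of [3].

    import numpy as np

    # Illustrative mapping from a categorical human rating to a numeric
    # confidence; these values are assumptions for this sketch.
    HUMAN_CONFIDENCE = {"low": 0.5, "medium": 0.7, "high": 0.9}

    def temperature_scale(logits, T=2.0):
        """Soften an overconfident softmax with a temperature T > 1,
        in the spirit of Guo et al. [1]."""
        z = logits / T
        z = z - z.max()          # subtract max for numerical stability
        p = np.exp(z)
        return p / p.sum()

    def arbitrate(logits, human_label, human_rating, T=2.0):
        """Follow whichever predictor (model or human) is more confident."""
        probs = temperature_scale(logits, T)
        model_label = int(np.argmax(probs))
        model_conf = probs[model_label]
        human_conf = HUMAN_CONFIDENCE[human_rating]
        return model_label if model_conf > human_conf else human_label

    # Example: with T = 2 the model's top probability drops from about
    # 0.93 to about 0.72, so the high-confidence human prediction wins;
    # with the raw (T = 1) probabilities the overconfident model would win.
    logits = np.array([4.0, 1.0, 0.5])
    print(arbitrate(logits, human_label=1, human_rating="high"))  # -> 1
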

Our recent work investigates these questions by proposing a Bayesian statistical model that combines human and machine predictions [3]. By fitting this model to data, we can assess the factors that influence complementarity and derive theoretical limits on the performance of human-machine teams. Our experiments focus on image classification, but we believe that our findings will likely generalize to other domains.

Humans and machines express confidence in very different ways, which presents a significant challenge. An ML model usually outputs a probability distribution over all possible predictions; we can interpret the probability assigned to a given prediction as the model’s confidence that said prediction is correct. In contrast, human decision-makers often produce only a single prediction, and their associated confidence is typically a categorical score such as “low,” “medium,” or “high.” Below, we explore our combination model’s ability to handle these different confidence notions.
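
For instance, a single image might yield the following two records; the field names and numbers here are illustrative assumptions rather than the dataset’s actual format.

    # Machine: a predicted label plus a probability over all L labels (here L = 3).
    machine_prediction = {"label": 0, "confidence": [0.85, 0.10, 0.05]}

    # Human: a single predicted label plus a coarse categorical rating.
    human_prediction = {"label": 1, "confidence": "high"}
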

Figure 1. Directed graphical model that depicts the combination model. Shaded nodes represent observed variables, and unshaded nodes represent latent (unobserved) variables. Figure courtesy of Mark Steyvers [3].

Technical Aspects of the Combination Model

Let’s dive deeper into our model’s technical aspects. We assume that our dataset consists of images and that each image \(i\) has a ground-truth label \(z_i\in\{1, 2, \ldots, L\}\). The human labeler predicts a label \(y_{h,i}\) for every image and provides a confidence rating \(r_{h,i}\), where \(r_{h,i}\in\{\mathrm{'low', \ 'medium', \ 'high'}\}\). Similarly, the ML model predicts a label \(y_{m,i}\). However, the machine confidence \(\gamma_{m,i}=[\gamma_{m,i,1},\ldots,\gamma_{m,i,L}]^T\) is a probability distribution over all possible labels — i.e., \(\gamma_{m,i,j}\) is the probability that the model assigns to label \(j\) for input \(i\). To put these two types of confidence data on equal footing, we assume that latent score vectors \(\lambda_{h,i}\) and \(\lambda_{m,i}\) both have components that correspond to each label:

\[\lambda_{h,i}=\left(\begin{matrix}\lambda_{h,i,1}\\\vdots\\\lambda_{h,i,L}\\\end{matrix}\right) \qquad \qquad \lambda_{m,i}=\left(\begin{matrix}\lambda_{m,i,1}\\\vdots\\\lambda_{m,i,L}\\\end{matrix}\right).\]

To recover the human-produced label from \(\lambda_{h,i}\), we normalize \(\lambda_{h,i}\) and sample from the resulting categorical distribution. Similarly, we model the human’s categorical confidence rating with an ordered probit distribution whose parameters depend on \(\lambda_{h,i}\). To recover the machine’s output, we simply use the normalized version of \(\lambda_{m,i}\). And to model the correlations between human and machine predictions, we assume that the latent human and machine scores for each label are jointly drawn from a Gaussian distribution, wherein the latent parameter \(\rho_{hm}\) controls the correlation between human and machine scores:

\[\left(\begin{matrix}\lambda_{h,i,j}\\\lambda_{m,i,j}\\\end{matrix}\right) \sim N\left(\mu(i, j),\ \left(\begin{matrix}\sigma_h^2&\rho_{hm}\sigma_h\sigma_m\\\rho_{hm}\sigma_h\sigma_m&\sigma_m^2\\\end{matrix}\right)\right).\]
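
As a concrete illustration, the following sketch simulates this generative process for a single image: correlated latent score pairs are drawn for each label, the machine’s output is a normalized score vector, and the human’s label is sampled from a categorical distribution. The mean structure (a boost for the true label), the softmax normalization, the rating thresholds, and the parameter values are simplifying assumptions made for this sketch; see [3] for the exact link functions.

    import numpy as np

    rng = np.random.default_rng(0)

    L = 16                   # number of labels, as in ImageNet-16H
    true_label = 3
    mu_h, mu_m = 1.5, 2.0    # assumed score boosts for the true label
    sigma_h, sigma_m = 1.0, 1.0
    rho_hm = 0.3             # latent human-machine correlation

    cov = np.array([[sigma_h**2, rho_hm * sigma_h * sigma_m],
                    [rho_hm * sigma_h * sigma_m, sigma_m**2]])

    # Draw the (human, machine) latent score pair jointly for each label j;
    # the mean is shifted upward for the true label, so a more accurate
    # labeler corresponds to a larger boost.
    lam = np.empty((L, 2))
    for j in range(L):
        mean = np.array([mu_h, mu_m]) if j == true_label else np.zeros(2)
        lam[j] = rng.multivariate_normal(mean, cov)

    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    # Machine output: the normalized latent score vector and its argmax.
    gamma_m = softmax(lam[:, 1])
    y_m = int(np.argmax(gamma_m))

    # Human output: a label sampled from the normalized human scores, plus
    # a rating obtained by thresholding the top score (a crude stand-in for
    # the ordered probit model of [3]).
    p_h = softmax(lam[:, 0])
    y_h = int(rng.choice(L, p=p_h))
    rating = "low" if p_h.max() < 0.2 else "medium" if p_h.max() < 0.5 else "high"

    print(y_m, y_h, rating, true_label)
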

After defining our combination model, we can infer the unobserved parameters by fitting the model to a dataset. In particular, we use a variant of Markov chain Monte Carlo methods to draw samples from the posterior distribution of our model parameters. Note that an analogous combination model can combine the predictions of two humans or two machines. The directed graphical model in Figure 1 visualizes our combination model; we encourage readers to check out our published work for further details [3]. 
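
To give a flavor of this inference step, here is a minimal random-walk Metropolis sampler for the single correlation parameter \(\rho_{hm}\), under the simplifying assumptions that the latent score pairs are observed directly and the variances are known. The full model in [3] infers all latent variables jointly, so this is a deliberately stripped-down sketch.

    import numpy as np

    rng = np.random.default_rng(1)

    def log_likelihood(rho, pairs, sigma_h=1.0, sigma_m=1.0):
        """Log-likelihood (up to a constant) of paired latent scores
        under the bivariate Gaussian of the combination model."""
        cov = np.array([[sigma_h**2, rho * sigma_h * sigma_m],
                        [rho * sigma_h * sigma_m, sigma_m**2]])
        inv, logdet = np.linalg.inv(cov), np.log(np.linalg.det(cov))
        return sum(-0.5 * (x @ inv @ x + logdet) for x in pairs)

    def metropolis(pairs, n_steps=5000, step=0.05):
        """Random-walk Metropolis over rho with a uniform prior on (-1, 1)."""
        rho, samples = 0.0, []
        ll = log_likelihood(rho, pairs)
        for _ in range(n_steps):
            prop = rho + step * rng.standard_normal()
            if -1 < prop < 1:  # proposals outside the prior are rejected
                ll_prop = log_likelihood(prop, pairs)
                if np.log(rng.uniform()) < ll_prop - ll:
                    rho, ll = prop, ll_prop
            samples.append(rho)
        return np.array(samples)

    # Synthetic score pairs with a known correlation of 0.3; the posterior
    # mean recovered below should land near that value.
    true_cov = np.array([[1.0, 0.3], [0.3, 1.0]])
    pairs = rng.multivariate_normal(np.zeros(2), true_cov, size=500)
    samples = metropolis(pairs)
    print(samples[1000:].mean())   # discard burn-in; roughly 0.3
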

The ImageNet-16H Dataset

To assess the complementarity of humans and machines, we collected a new image classification dataset that we dubbed “ImageNet-16H.” This dataset is derived from the well-known ImageNet dataset and contains 4,800 images from 16 classes; each image has been distorted with noise to make classification more challenging. In addition to the ground-truth labels from the original ImageNet dataset, we collected labels from several human labelers for each noisy image. Our dataset is freely available to download.

Figure 2. Sample images from our image classification dataset, which we call ImageNet-16H. 2a. Examples of images that machine classifiers are able to correctly classify with high confidence, but that are difficult for humans to classify. Correct answers, left to right: bird, boat, bear, bear, oven, and oven. 2b. Examples of images that are relatively easy for humans but difficult for machine classifiers to classify. Correct answers, left to right: car, car, cat, cat, bear, bear. Figure courtesy of Mark Steyvers [3].

Findings and Results

We investigate complementarity by inferring the latent parameters in our statistical model with our ImageNet-16H dataset. First, we train several ML classifiers (in this case, convolutional neural networks) to classify the images in the dataset. Using our Bayesian combination model, we then combine the predictions from these ML models with the human-generated labels and confidences in our dataset. Doing so allows us to observe both the combined human-machine accuracy and the latent correlation between their respective predictions. Additionally, we generate several human-human and machine-machine combinations to serve as baselines for comparison.

Figure 3. Results of comparison between human-human \((\textrm{H, HH})\), machine-machine \((\textrm{M, MM})\), and hybrid human-machine \((\textrm{HM})\) teams. 3a. The accuracy of various teams with error bars corresponding to 95% confidence intervals. 3b. The posterior distributions over the latent correlations of various teams. Figure courtesy of Mark Steyvers [3].

We summarize our main findings as follows:

  1. Human-machine teams often yield higher accuracy than human-human teams or machine-machine teams. In Figure 3a, we plot the accuracy of various teams that comprise one or two humans \((\textrm{H, HH})\), one or two ML models \((\textrm{M, MM})\), or a single human and single ML model \((\textrm{HM})\). The hybrid human-machine teams consistently outperform their non-hybrid counterparts. 
  2. Humans and machines make different types of errors, so the latent correlation between a human and a machine is lower than the correlation within human-human or machine-machine pairs (see Figure 3b). Hybrid teams thus exhibit more complementarity than non-hybrid teams, which explains why they achieve higher accuracy than their non-hybrid counterparts.
  3. Human accuracy, machine accuracy, and the correlation between the predictors all affect the level of complementarity. Figure 4 compares the accuracy of human labelers (on the horizontal axis) with the accuracy of machine classifiers (on the vertical axis). Each data point represents a unique human-machine combination and is colored red if the combination exhibits complementarity, in the sense that it obtains higher accuracy than the corresponding non-hybrid combinations. We observe a fairly narrow band of complementarity, meaning that complementarity can be difficult to achieve if the human and machine have vastly different levels of accuracy. In [3], we derive additional theoretical results that allow us to compute the region of complementarity from human accuracy, machine accuracy, and their latent correlation; this theoretical region, shaded in red in Figure 4, closely matches our empirical results. A simulation sketch after the figure caption illustrates the narrow band.

Figure 4. Observed and theoretically predicted complementarity of various human-machine teams as a function of the human and machine accuracies. Circles filled in red indicate that the hybrid human-machine team outperforms non-hybrid teams on held-out test data. The shaded region indicates the region of complementarity that was predicted by our theoretical analysis based on a moderate level of latent correlations. The dashed line indicates the best-case scenario, in which the latent human-machine correlation is zero. Figure courtesy of Mark Steyvers [3].
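
To build intuition for this finding, the following sketch uses Monte Carlo simulation under the latent-score model to check where a hybrid team beats both of its members. The score-summing combination rule, the mean boosts that set each member’s accuracy, and all parameter values are assumptions for illustration; the actual region in Figure 4 comes from the theoretical analysis in [3].

    import numpy as np

    rng = np.random.default_rng(2)
    L = 16

    def simulate_accuracies(mu_h, mu_m, rho, n_trials=2000):
        """Monte Carlo estimate of human, machine, and team accuracy under
        the latent-score model with a simple score-summing combination."""
        cov = np.array([[1.0, rho], [rho, 1.0]])
        correct = np.zeros(3)  # human, machine, team
        for _ in range(n_trials):
            scores = rng.multivariate_normal(np.zeros(2), cov, size=L)
            scores[0] += np.array([mu_h, mu_m])  # boost the true label (j = 0)
            y_h = np.argmax(scores[:, 0])
            y_m = np.argmax(scores[:, 1])
            y_team = np.argmax(scores.sum(axis=1))
            correct += (np.array([y_h, y_m, y_team]) == 0)
        return correct / n_trials

    # Sweep human ability while holding the machine fixed: complementarity
    # (the team beating both members) holds only in a band where the two
    # members have comparable accuracy.
    for mu_h in [0.5, 1.5, 2.5, 3.5]:
        acc_h, acc_m, acc_team = simulate_accuracies(mu_h, mu_m=2.0, rho=0.3)
        print(f"human={acc_h:.2f} machine={acc_m:.2f} team={acc_team:.2f} "
              f"complementary={acc_team > max(acc_h, acc_m)}")
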

In summary, our research demonstrates that humans and machines often exhibit complementary strengths, and we can leverage this complementarity to obtain better performance on downstream applications. However, the achievable gains are limited by human accuracy, machine accuracy, and the correlation between their predictions. In the future, we hope that practitioners will design ML models with complementarity in mind to ultimately develop algorithms that are safer, more accurate, and more reliable. 


Acknowledgments: This research was supported by the National Science Foundation under awards 1900644 and 1927245, as well as by the Irvine Initiative in AI, Law, and Society. In addition, the author received support from the Hasso Plattner Institute Research Center in Machine Learning and Data Science at the University of California, Irvine.

References
[1] Guo, C., Pleiss, G., Sun, Y., & Weinberger, K.Q. (2017). On calibration of modern neural networks. In Proceedings of the 34th international conference on machine learning (pp. 1321-1330). Sydney, Australia: Proceedings of Machine Learning Research.
[2] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., … Hassabis, D. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), 354-359.
[3] Steyvers, M., Tejeda, H., Kerrigan, G., & Smyth, P. (2022). Bayesian modeling of human–AI complementarity. Proc. Natl. Acad. Sci., 119(11), e2111547119.

Gavin Kerrigan is a Ph.D. candidate in computer science and Hasso Plattner Institute Fellow at the University of California, Irvine. His research interests are in the development and application of machine learning algorithms and probabilistic modeling. 