SIAM News Blog

The Mathematics of Mass Testing for COVID-19

By David Donoho, Mahsa Lotfi, and Batu Ozturkler

Nobel Prize-winning economist Paul Romer views the global COVID-19 shutdown as an economic calamity. Trillions of dollars of economic losses fell on individuals, businesses, and governments; these losses will become permanent if society does not soon resume pre-virus activities. Romer, who previously served as Chief Economist of the World Bank, proposes a confidence-inspiring path out of the shutdown based on dramatically expanded COVID-19 testing. He argues that everyone in the U.S. should get tested every two weeks. Those who test positive would self-isolate while the rest of the economy reopens, enabling new jobs and investments. Romer’s simulations suggest that this strategy would keep the population’s active infection rate below five percent, ensure that most people do not get infected, and spur rapid economic recovery.

Unfortunately, we cannot currently test at anywhere near the levels Romer envisions, which would require screening seven percent of the population every day. For context, only about four percent of U.S. residents were tested for COVID-19 from March through May.

Inspired by the call for a dramatic scaleup of testing, statistician David Donoho reviewed a recently emergent research front in which mathematical and statistical ideas—implemented through data science—can enable a rapid expansion of testing capabilities worldwide. He surveyed this new front during an invited presentation at the inaugural 2020 SIAM Conference on Mathematics of Data Science (MDS20), which occurred virtually in May and June, in hopes that the SIAM community would contribute to novel research trends related to COVID-19.

Such trends continue to come together in part due to the new medRxiv preprint server for the health sciences. Mathematical scientists have enjoyed the arXiv preprint service for nearly 30 years and are accustomed to sharing information universally, rapidly, and freely. The medical literature, however, has traditionally been restricted and paywalled. In 2019, medRxiv was founded as an offshoot of bioRxiv to give medical research a globally visible preprint outlet.

The COVID-19 crisis brought medRxiv to life. A stream of COVID-19 postings began in January 2020; by March, hundreds of items were flooding in daily. Submissions covered everything from individual case reports and documentation of care protocols to fully-developed articles about therapeutic interventions intended for major journals. Contributors included doctors sharing patient data, medical teams conducting clinical trials, public health officials analyzing national databanks, medical device engineers discussing new technologies, and citizen scientists focusing on COVID-19 during lockdown.

Figure 1. An overview of the pooling scheme proposed in [4]. A total of \(n\) patient samples are pooled into \(m\) tests in a combinatorial manner. Each patient sample is included in multiple tests and each test contains multiple patient samples. Figure courtesy of [4].

Several of these articles addressed the need to scale up testing efforts. Many papers sought to show that one could repurpose existing COVID-19 test kits in a multiplexed fashion, resulting in a substantial expansion of the total patient caseload under screening without increasing the number of testing stations or utilizing extra test kits.

During his talk, Donoho spotlighted two early papers in this burgeoning literature. One paper multiplexed patient samples up to five at a time in an organized protocol that immediately expanded effective testing capacity—the number of patients whose disease status can be determined—by more than a factor of two [1, 2]. Another submission suggested that much more was possible [7]. This paper documented the ability to multiplex up to 64 patient samples at a time and still detect the presence of COVID-19 in one individual patient. In principle, a single quantitative reverse transcription polymerase chain reaction (RT-qPCR) run on a pooled sample thus determines whether anyone among a group of 64 patients has the disease.
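To see the arithmetic behind such gains, consider the simplest two-stage scheme (Dorfman pooling): pools of \(k\) samples are tested first, and only the members of positive pools are retested individually. The following Python sketch computes the expected number of tests per patient under this scheme; the pool sizes and prevalence values are illustrative assumptions, not figures taken from [1, 2].

```python
# Expected tests per patient under two-stage (Dorfman-style) pooling:
# each pool of k samples costs one test, and every member of a positive
# pool is retested individually. Assumes independent infections at
# prevalence p (illustrative values, not data from the cited studies).

def tests_per_patient(k: int, p: float) -> float:
    prob_pool_positive = 1.0 - (1.0 - p) ** k  # at least one infected member
    return 1.0 / k + prob_pool_positive        # pooled-test share + expected retests

for p in (0.01, 0.02, 0.05):
    for k in (5, 8, 16):
        e = tests_per_patient(k, p)
        print(f"prevalence {p:4.0%}, pool size {k:2d}: "
              f"{e:.3f} tests/patient, ~{1 / e:.1f}x capacity expansion")
```

At five percent prevalence and pools of five, for instance, the expected cost is roughly 0.43 tests per patient, consistent with the more-than-twofold expansion reported in [1, 2].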

One principle is common to papers in this emergent research front: in population-level screening, most people are not actually infected, so they will not have active virus in their test samples. Intuitively, we do not really need to consume one test kit per patient if so few patients are actually infected. Instead, we need both multiplexing and math (see Figure 1).

Formally speaking, a test measures a sample’s viral load. If we consider a group of \(N\) patients, the vector \(\bf{x}\) of \(N\) viral loads will be sparse (mostly zeros) because the majority of people are not infected. Mathematically, the problem is thus to determine which entries of a large, sparse vector of viral loads are nonzero as efficiently as possible, i.e., with the fewest test kits and the least possible wall-clock delay.

Multiplexing involves pooling samples from several patients, wherein each patient’s sample appears in multiple pools and each pool contains samples from multiple patients (see Figure 2). The total viral load in each test sample is roughly the sum of the viral loads contributed by every sample in that pool. One can collect the viral loads that underlie the \(T\) tests into a \(T\) by \(1\) vector \(\bf{y}\). The tested viral loads \(\bf{y}\) are related to the original viral loads \(\bf{x}\) via matrix multiplication \(\bf{y} = A\bf{x}\), where \(A\) is a \(T\) by \(N\) binary matrix that indicates which patients’ samples contribute to which test pools. Now the problem involves inferring the sparse vector \(\bf{x}\) from (noisy, partial, binary, and linear) information about \(\bf{y}\). Researchers sometimes employ multiple rounds of measurement \(\bf{y}_r = A_r\bf{x}\), \(r = 1, \dots, R\), with round \(r\)'s pooling matrix \(A_r\) dependent on the previous round’s test results \(\bf{y}_{r-1}\). One can then infer \(\bf{x}\) from \(\bf{y} = [\bf{y}_1, \bf{y}_2, \ldots, \bf{y}_R]\), once again obeying \(\bf{y} = A\bf{x}\), now with the matrix \(A\) formed by stacking \(A_1, \ldots, A_R\) on top of one another. Scientists can build matrix \(A\) with appropriate use of randomness or delicate constructions that involve special bipartite graphs such as the expander graphs used in [8], special structured matrices, or even information-theoretic codes.
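To make the linear model concrete, here is a minimal Python sketch that builds a random binary pooling matrix \(A\) and forms the pooled measurements \(\bf{y} = A\bf{x}\) from a sparse vector of viral loads. The dimensions match Figure 2, but the random design (each sample assigned to a fixed number of pools) and the load values are assumptions for illustration, not the specific constructions of [4] or [8].

```python
import numpy as np

rng = np.random.default_rng(0)

N, T = 40, 16          # patients and tests, as in Figure 2
pools_per_sample = 4   # each patient sample is split across this many pools (assumption)

# Binary pooling matrix: A[t, i] = 1 if patient i's sample goes into pool t.
A = np.zeros((T, N), dtype=int)
for i in range(N):
    A[rng.choice(T, size=pools_per_sample, replace=False), i] = 1

# Sparse vector of viral loads: most patients are uninfected (zero load).
x = np.zeros(N)
infected = rng.choice(N, size=2, replace=False)
x[infected] = rng.uniform(1e3, 1e5, size=2)  # arbitrary positive loads

# Each pooled test roughly measures the sum of the loads it contains.
y = A @ x
print("infected patients:", np.sort(infected))
print("positive pools:   ", np.flatnonzero(y > 0))
```

Each column of \(A\) has exactly four ones, so every patient’s sample contributes to four pools and every pool receives roughly ten samples on average, exactly the combinatorial structure described above.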

Eventually, one deciphers the test results \(\bf{y}\) to determine which samples must contain the COVID-19 virus. The key point is that the total number of tests \(T\) is smaller than \(N\); practitioners thus use \(T\) test kits to evaluate the disease status of \(N\) patients.

Figure 2. A binary matrix of size \(16\) \(\times\) \(40\) that tests \(40\) patient samples in \(16\) tests. Each row indicates which samples must be pooled together in each test. Figure courtesy of [4].

The decoding step in these medRxiv proposals exploits knowledge of the instance data \((A, \bf{y})\) and the assumed sparsity of \(\bf{x}\). The papers draw on many techniques with which the SIAM community is familiar, including sequential approaches such as group testing and combinatorial group testing, and one-round approaches inspired by compressed sensing and one-bit compressed sensing.
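As a toy illustration of decoding, the sketch below (using the same kind of random setup as before) applies the simplest combinatorial group-testing rule: any patient who appears in at least one negative pool is declared uninfected. The cited papers go further, e.g., by exploiting the quantitative viral loads with compressed-sensing recovery, but even this crude rule shows how the known matrix \(A\) and the sparsity of \(\bf{x}\) let \(T < N\) tests resolve most patients.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, pools_per_sample = 40, 16, 4  # illustrative sizes, as in Figure 2

# Toy setup: random binary pooling matrix, sparse infection vector, pooled results.
A = np.zeros((T, N), dtype=int)
for i in range(N):
    A[rng.choice(T, size=pools_per_sample, replace=False), i] = 1
x = np.zeros(N)
x[rng.choice(N, size=2, replace=False)] = 1.0  # two infected patients (assumption)
y = A @ x

# Decoding rule: a patient who appears in any negative pool cannot be infected.
negative_pools = np.flatnonzero(y == 0)
cleared = np.any(A[negative_pools, :] == 1, axis=0)
suspected = np.flatnonzero(~cleared)

print("truly infected: ", np.flatnonzero(x > 0))
print("still suspected:", suspected)
```

Because the rule never clears an infected patient, it produces no false negatives; the few patients who happen to appear only in positive pools remain ambiguous and can be resolved with individual follow-up tests or with a quantitative decoder.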

Donoho gave an overview of the basic RT-qPCR test—the gold standard of COVID-19 testing efforts—and explained why this technology could pair well with multiplexed samples. He also highlighted several additional research efforts that explicitly made the connection to the mathematical sciences. For example, one team utilized combinatorial group testing [3, 5] and others employed compressed sensing [4, 8]. Both approaches are one-round-only methods, and their authors show—as does [6]—that they can successfully infer the disease status of \(N\) patients from \(T\) tests at low levels of population prevalence, where

\[N \approx 10 T. \tag1 \]

When successful, this marks a tenfold expansion in the number \(N\) of patients whose disease states can be determined for the same number \(T\) of units of RT-qPCR machine time and test kits (\(N\) patient samples and the associated per-patient processing are of course still required). This approach would naturally require a very low prevalence of infection; however, \(N\) can be significantly larger than \(T\) even at a higher prevalence. A nice benefit of these two methods is that they are one-round-only \((R=1)\), so they experience less processing delay than multi-round procedures — a fact that patients will certainly appreciate.
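A back-of-the-envelope counting argument makes the tenfold figure in \((1)\) plausible. Even with purely binary outcomes, \(T\) tests can distinguish at most \(2^T\) result patterns, so identifying \(k\) infected patients among \(N\) requires roughly \(T \gtrsim \log_2 \binom{N}{k}\) tests. The short sketch below evaluates this bound at an assumed prevalence of about one percent; the specific numbers are illustrative and not taken from [3-6].

```python
from math import comb, log2

# Counting bound: T binary tests distinguish at most 2**T outcome patterns,
# so pinpointing k infected patients among N needs roughly T >= log2(C(N, k)).
for N, k in [(100, 1), (400, 4), (1000, 10)]:  # assumed ~1% prevalence
    T_min = log2(comb(N, k))
    print(f"N = {N:4d}, k = {k:2d}: T >= {T_min:5.1f} tests "
          f"(N/T at the bound ~ {N / T_min:.1f})")
```

At this prevalence the bound permits \(N/T\) ratios in the low teens, and the quantitative RT-qPCR readout delivers more than one bit per test, which helps explain why a practical factor of ten is attainable.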

Donoho was floored by the rapidity with which this research front developed, producing in mere weeks not only fascinating proposals and ideas but actual protocols for daily use. To close his MDS20 talk, he looked beyond today’s RT-qPCR standard for COVID-19 testing and discussed some new technologies that may soon arrive. The need to dramatically increase global testing is clear, and the exuberant growth of new research fronts—combined with key mathematics-based enabling technology—inspires hope for future testing endeavors.


This article is based on David Donoho’s invited presentation as part of the 2020 SIAM Conference on Mathematics of Data Science (MDS20), which occurred virtually in May and June. Donoho’s presentation is available on SIAM’s YouTube channel.

References
[1] Abdalhamid, B., Bilder, C.R., McCutchen, E.L., Hinrichs, S.H., Koepsell, S.A., & Iwen, P.C. (2020). Assessment of specimen pooling to conserve SARS CoV-2 testing resources. Preprint, medRxiv.
[2] Bilder, C.R., & Tebbs, J.M. (2012). Pooled-testing procedures for screening high volume clinical specimens in heterogeneous populations. Stat. Med., 31(27), 3261-3268.
[3] Du, D.-Z., & Hwang, F.K. (2000). Combinatorial Group Testing and its Applications (2nd ed.). In Series on Applied Mathematics (Vol. 12). Singapore: World Scientific.
[4] Ghosh, S., Rajwade, A., Krishna, S., Gopalkrishnan, N., Schaus, T.E., Chakravarthy, A., …, Gopalkrishnan, M. (2020). Tapestry: A single-round smart pooling technique for COVID-19 testing. Preprint, medRxiv.
[5] Shental, N., Levy, S., Skorniakov, S., Wuvshet, V., Shemer-Avni, Y., Porgador, A., & Hertz, T. (2020). Efficient high throughput SARS-CoV-2 testing to detect asymptomatic carriers. Preprint, medRxiv.
[6] Verdun, C.M., Fuchs, T., Harar, P., Elbrächter, D., Fischer, D.S., Berner, J., …, Krahmer, F. (2020). Group testing for SARS-CoV-2 allows for up to 10-fold efficiency increase across realistic scenarios and testing strategies. Preprint, medRxiv.
[7] Yelin, I., Aharony, N., Tamar, E.S., Argoetti, A., Messer, E., Berenbaum, D., …, Kishony, R. (2020). Evaluation of COVID-19 RT-qPCR test in multi-sample pools. Clin. Infect. Dis.
[8] Yi, J., Mudumbai, R., & Xu, W. (2020). Low-cost and high-throughput testing of COVID-19 viruses and antibodies via compressed sensing: system concepts and computational experiments. Preprint, arXiv:2004.05759.

David Donoho is a professor of statistics and the Anne T. and Robert M. Bass Professor of Humanities and Sciences at Stanford University. He has delivered SIAM’s John von Neumann Lecture and is a recipient of the AMS-SIAM Norbert Wiener Prize in Applied Mathematics. Mahsa Lotfi is a postdoctoral researcher in the Department of Statistics at Stanford University. She received her Ph.D. in electrical engineering and signal processing from the University of Texas at Dallas. Batu Ozturkler is a first-year Ph.D. student in the Department of Electrical Engineering at Stanford University.
