Optimal Time-dependent Prevalence Estimation and Classification

By Prajakta Bedekar, Anthony J. Kearsley, and Paul N. Patrone

The SARS-CoV-2 pandemic demonstrated that the ability to track a disease can affect our lives in numerous ways. For example, forecasts of the rise and fall of COVID-19 cases influence public health decisions. Inherent epidemiological assumptions underlie these predictions, and mathematicians and scientists are increasingly scrutinizing these assumptions in search of more accurate representations of reality. Many types of questions drive their efforts. What is a case? How can we best interpret a sample from an individual? How can we use individual samples to reasonably extrapolate information about an entire population?

To investigate such queries, we must understand the fundamentals of testing. In practice, laboratory technicians test a person’s blood, saliva, or nasal swab sample for markers that indicate an ongoing infection (via an antigen or PCR test) or a history of infection (via a serology test). In particular, serology testing can quantify antibodies that are present in a sample. Scientists then classify these measurement outputs to assess whether the patient was previously infected. When randomly administered to individuals within a population, antibody tests help estimate the fraction of the population with a history of infection.

Figure 1. The blue histogram corresponds to transformed training data for negative samples [1]. The red curve is the probability density function that we modeled from this data. Figure courtesy of [2].

To that end, let us explore the traditional analysis of classification and prevalence estimation. Qualitatively, we know that a higher antibody response indicates a higher chance of previous infection. But what value determines this important cutoff? In other words, how high must a value be to confidently confirm a history of infection?¹ Researchers traditionally perform the assay on pre-pandemic samples, which are classified as “negative.” They calculate the mean and standard deviation of these sample values, then classify as negative any value that lies within three standard deviations of the mean; every other value is classified as positive. However, problems with this approach are apparent. How can researchers determine information about positive populations without any knowledge of their distributions? What is the error due to misclassification? How is the classification affected by prevalence?
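The traditional rule can be sketched in a few lines of Python. This is a minimal illustration only: the synthetic normal "pre-pandemic" data and the function names `three_sigma_interval` and `classify` are assumptions made for the example, not part of any assay protocol.

```python
import numpy as np

def three_sigma_interval(negative_values):
    """Return (low, high) = mean -/+ 3 sample standard deviations."""
    values = np.asarray(negative_values, dtype=float)
    mean = values.mean()
    std = values.std(ddof=1)  # sample standard deviation
    return mean - 3.0 * std, mean + 3.0 * std

def classify(value, interval):
    """Label a measurement negative inside the interval, positive otherwise."""
    low, high = interval
    return "negative" if low <= value <= high else "positive"

# Synthetic stand-in for pre-pandemic (known negative) measurements.
rng = np.random.default_rng(0)
pre_pandemic = rng.normal(loc=0.0, scale=1.0, size=500)

interval = three_sigma_interval(pre_pandemic)
labels = [classify(v, interval) for v in (0.2, 4.5)]
```

Note that the rule uses no information about positive samples or about prevalence, which is exactly the limitation that the questions above point to.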

One approach attempts to unravel the relationship between prevalence and classification by demonstrating that coupling probabilistic modeling of the underlying data with optimization machinery can help formulate and answer epidemiological questions [4]. In this approach, we use available training data to model the conditional probability of obtaining a particular measurement value, given that the sample is known to be positive or negative. Any sample value is a realization of these underlying conditional probabilities, weighted by the probability of picking a positive or negative sample from the population, respectively. Because the classification boundary therefore depends on the prevalence, we must estimate the prevalence before we classify samples. Over the last year, we have further refined these results by accounting for the variations that arise from time-dependent effects.
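The following is a sketch of this idea in illustrative notation; the symbols \(q\), \(P\), \(N\), and \(Q\) are assumptions introduced for this example and may differ from those in [4]. Let \(q\) denote the prevalence, and let \(P(r)\) and \(N(r)\) denote the conditional probability densities of a measurement \(r\) for positive and negative samples. The density of a measurement from a randomly selected individual is then

\[ Q(r) = q\,P(r) + (1-q)\,N(r). \]

If values in a domain \(D_P\) are classified as positive and values in its complement \(D_N\) as negative, the total error is

\[ \mathcal{E} = (1-q)\int_{D_P} N(r)\,dr + q\int_{D_N} P(r)\,dr. \]

Each value \(r\) contributes either \((1-q)\,N(r)\) or \(q\,P(r)\) to this error, so choosing the smaller contribution pointwise minimizes it:

\[ D_P^\star = \{r : q\,P(r) > (1-q)\,N(r)\}, \qquad D_N^\star = \{r : q\,P(r) < (1-q)\,N(r)\}. \]

The boundary between the two domains shifts as \(q\) changes, which is why an estimate of the prevalence has to come first.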

Figure 2. Training data for positive samples [1]. The blue dots represent the positive sample values at day \(t\) in their personal timelines. The contours show the probability density function that we model from this data. Figure courtesy of [2].

In 2021, we explored and explicitly accounted for time’s critical effect on antibody levels [2]. The antibody levels of an infected individual increase after a delay, and then decay. This process occurs on people’s personal timelines, which begin on different days based on when they became infected. The number of personal timelines that start at a particular time is in turn governed by the progression of a pandemic through a population on an absolute timeline, i.e., the fraction of the population that is infected each day. The total probability of an antibody measurement on a given day during the pandemic is thus a convolution of the effects of the personal and absolute timelines.
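One way to write this convolution is the following discrete-day sketch. The notation is illustrative and not necessarily that of [2], and it assumes that each person is infected at most once. Let \(f(s)\) be the fraction of the population that becomes infected on day \(s\) of the absolute timeline, let \(N(r)\) be the density of a measurement \(r\) for a negative individual, and let \(P(r \mid t)\) be the density for an individual who is \(t\) days into their personal timeline. On day \(T\),

\[ q(T) = \sum_{s=0}^{T} f(s), \qquad Q(r, T) = \bigl(1 - q(T)\bigr)\,N(r) + \sum_{s=0}^{T} f(s)\,P(r \mid T - s). \]

Here \(q(T)\) is the prevalence on day \(T\), and the sum weights each personal-timeline density by the number of people who started that timeline on day \(s\).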

Our approach utilizes the underlying properties of antibody kinetics to build up conditional probability models for personal timelines. Because one’s immune system does not start generating an antibody response immediately after infection, the conditional probability distribution for a recently infected individual is the same as that of a negative individual. As was previously mentioned, the antibody response to infection peaks after a few days and then decays; it asymptotically approaches a level that is identical to that of someone with no history of infection. Figures 1 and 2 plot the training data [1] with our models for personal antibody response for negative and positive individuals.

Figure 3. Mean values across 1,000 synthetic sets of prevalence estimates. The inset shows the underlying true sinusoidal \(f\), which is the fraction of the population that is infected in a given time period. The mean prevalence estimates for various numbers of samples \((N_s)\) per time period are plotted with variance error bars over time: \(N_s = 100\) (green dotted line), \(N_s = 1000\) (black dotted-dashed line), \(N_s = 10^4\) (blue dashed line), \(N_s = 10^5\) (red solid line), and true prevalence (magenta line). Figure courtesy of [2].

We can estimate the prevalence on a given day in the absolute timeline via a Monte Carlo estimate that incorporates antibody values for a subset of the population. This estimate is unbiased, and a larger sampling of the population yields a more accurate approximation of the prevalence. Figure 3 illustrates this effect for synthetic data. We can use these estimates to obtain optimal classification domains on a given day by minimizing the total rate of false positive and false negative classifications. Such optimal domains necessarily change with time, regardless of whether the prevalence is changing or constant (see Figure 4).
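The effect of sample size can be illustrated with a simplified, time-independent Python sketch. It is not the time-dependent estimator of [2]: it assumes unit-variance normal densities for negative and positive samples, and it estimates prevalence from the fraction of measurements above a fixed threshold. The names `normal_tail`, `estimate_prevalence`, and `simulate`, along with all parameter values, are hypothetical.

```python
import math
import numpy as np

def normal_tail(threshold, mean):
    """P(X > threshold) for X ~ Normal(mean, 1)."""
    return 0.5 * math.erfc((threshold - mean) / math.sqrt(2.0))

def estimate_prevalence(samples, threshold, neg_mean, pos_mean):
    """Solve Q_D = q * P_D + (1 - q) * N_D for q, with Q_D measured empirically.

    The estimate is linear in the empirical fraction Q_D, so it is unbiased.
    """
    n_d = normal_tail(threshold, neg_mean)  # negative mass above the threshold
    p_d = normal_tail(threshold, pos_mean)  # positive mass above the threshold
    q_d = np.mean(np.asarray(samples) > threshold, axis=-1)
    return (q_d - n_d) / (p_d - n_d)

def simulate(true_q, n_samples, n_trials, rng,
             neg_mean=0.0, pos_mean=3.0, threshold=1.5):
    """Return one prevalence estimate per synthetic trial."""
    is_positive = rng.random((n_trials, n_samples)) < true_q
    noise = rng.normal(size=(n_trials, n_samples))
    samples = noise + np.where(is_positive, pos_mean, neg_mean)
    return estimate_prevalence(samples, threshold, neg_mean, pos_mean)

rng = np.random.default_rng(0)
true_q = 0.2
results = {n: simulate(true_q, n, 200, rng) for n in (100, 10_000)}
for n, estimates in results.items():
    print(f"N_s={n}: mean={estimates.mean():.3f}, std={estimates.std():.4f}")
```

Under these assumptions, the mean of the estimates stays close to the true prevalence for both sample sizes, while the spread shrinks as the number of samples grows.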

Figure 4. Optimal classification partition of the domain for the first 100 days of a pandemic with constant prevalence. Yellow represents positively classified values and blue represents negatively classified values. Figure courtesy of [2].

We now arrive at the crux of the matter: time is an important variable for classification. An antibody value that we classify one way today could very well be classified differently tomorrow. This mutability is necessary because the relative likelihood of sampling a person with one particular antibody value and a history of infection changes with time. While a static approximation of this process fails to leverage these underlying patterns, the current framework offers many tools to explore phenomena such as changes in antibody level after reinfection and vaccination — both scenarios crucially depend on time.
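A toy Python calculation shows how a classification boundary can move even when the same fraction of the population is infected every day. Everything in it is an assumption made for illustration: the unit-variance normal densities, the hypothetical `mean_response` curve that rises and then decays, and the parameter values. It does not reproduce the models that were fit to the training data in [2].

```python
import numpy as np

def normal_pdf(r, mean):
    """Density of Normal(mean, 1) evaluated at r."""
    return np.exp(-0.5 * (r - mean) ** 2) / np.sqrt(2.0 * np.pi)

def mean_response(t, peak=4.0, peak_day=20.0):
    """Hypothetical mean antibody shift t days after infection.

    Zero at t = 0, maximal at t = peak_day, and decaying back toward zero.
    """
    x = np.asarray(t, dtype=float) / peak_day
    return peak * x * np.exp(1.0 - x)

def optimal_threshold(day, f_daily, r_grid):
    """Smallest grid value classified positive on `day` (inf if none)."""
    t = np.arange(day + 1)            # days since infection, one per cohort
    prevalence = f_daily * (day + 1)  # everyone infected up to `day`
    cohort_pdfs = normal_pdf(r_grid[:, None], mean_response(t)[None, :])
    weighted_pos = f_daily * cohort_pdfs.sum(axis=1)
    weighted_neg = (1.0 - prevalence) * normal_pdf(r_grid, 0.0)
    is_positive = weighted_pos > weighted_neg
    return r_grid[np.argmax(is_positive)] if is_positive.any() else np.inf

r_grid = np.linspace(-2.0, 10.0, 1201)
thresholds = {day: optimal_threshold(day, 0.002, r_grid) for day in (10, 50, 90)}
for day, value in thresholds.items():
    print(f"day {day}: classify as positive above r = {value:.2f}")
```

In this toy setting the threshold on day 90 is lower than on day 10, so a measurement that is classified as negative early in the outbreak can be classified as positive later.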

Public health decision-makers rely on an accurate analysis of measurements to arrive at useful conclusions. As we demonstrate here, the language of probability and optimization theory helps researchers concretely define questions that pertain to epidemiology while also providing implementable solutions. Nevertheless, a myriad of relevant questions—which pertain to topics like time-dependent multi-class classification, optimal indeterminate class with given bounds on sensitivity and specificity, and the effects of model form error—still remain.

¹ Though we consider one-dimensional antibody data here, multidimensional data is generally available and provides valuable information, even if the actual modeling might be nontrivial [3].

Prajakta Bedekar presented this research during a contributed presentation at the 2022 SIAM Annual Meeting, which took place in Pittsburgh, Pa., last summer.

References
[1] Abela, I.A., Pasin, C., Schwarzmüller, M., Epp, S., Sickmann, M.E., Schanz, M.M., … Trkola, A. (2021). Multifactorial seroprofiling dissects the contribution of pre-existing human coronaviruses responses to SARS-CoV-2 immunity. Nat. Commun., 12, 6703.
[2] Bedekar, P., Kearsley, A.J., & Patrone, P.N. (2023). Prevalence estimation and optimal classification methods to account for time dependence in antibody levels. J. Theor. Biol., 559, 111375.
[3] Luke, R.A., Kearsley, A.J., Pisanic, N., Manabe, Y.C., Thomas, D.L., Heaney, C.D., & Patrone, P.N. (2022). Modeling in higher dimensions to improve diagnostic testing accuracy: Theory and examples for multiplex saliva-based SARS-CoV-2 antibody assays. Preprint, arXiv:2206.14316v2.
[4] Patrone, P.N., & Kearsley, A.J. (2021). Classification under uncertainty: Data analysis for diagnostic antibody testing. Math. Med. Biol., 38(3), 396-416.

Prajakta Bedekar is a postdoctoral fellow with a joint appointment between the Department of Applied Mathematics and Statistics at Johns Hopkins University and the Applied and Computational Mathematics Division at the National Institute of Standards and Technology.
Anthony J. Kearsley is a staff research mathematician in the Applied and Computational Mathematics Division at the National Institute of Standards and Technology.
Paul N. Patrone is a staff research physicist in the Applied and Computational Mathematics Division at the National Institute of Standards and Technology.