SIAM News Blog

Key Epidemiological Parameters for SARS-CoV-2 Outbreaks and Variant Selection From Noisy Data

By Ruian Ke and Ethan Romero-Severson

The COVID-19 pandemic changed the work strategies of epidemiologists. Instead of iteratively refining models for a fixed data set over the course of months or years, researchers experienced an insatiable demand for definitive knowledge about a new pathogen. Suddenly, preliminary predictions based on noisy and often incomplete data became elements of a real-time, public discussion in a politically charged atmosphere. Conducting scientific studies that influence public decision-making is difficult, but it is not impossible. Here we discuss some of the key takeaways from our work with both noisy data and a high demand for certainty during an ongoing pandemic.

\(R_0\) and \(r\) for Initial Outbreaks

At the beginning of the SARS-CoV-2 outbreak in Wuhan, China, in late 2019, researchers around the world wanted to estimate two fundamental epidemiological parameters: the early exponential growth rate \((r)\) and the basic reproductive number \((R_0)\). The former describes the rate at which an outbreak grows in size and provides an epidemic doubling time (expressed as \(\log_e (2)/r\)), while the latter is defined as the average number of secondary infections that result from an index case in a fully susceptible population. One typically derives \(R_0\) from estimates of \(r\) and the estimated distribution of the generation interval, i.e., the time interval from the moment the virus infects the donor to when it infects the recipient in a transmission pair [5]. Broadly speaking, larger values of \(r\) indicate a more rapidly growing epidemic, while larger values of \(R_0\) indicate an epidemic that is harder to control.
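To make these relationships concrete, the following minimal sketch computes the doubling time from \(r\) and recovers \(R_0\) from \(r\) via the Euler-Lotka relation, under the common simplifying assumption of a gamma-distributed generation interval; the growth rate and generation interval mean and standard deviation below are illustrative placeholders, not values fitted in any of the cited studies.

```python
import numpy as np

def doubling_time(r):
    """Epidemic doubling time (days) given exponential growth rate r (per day)."""
    return np.log(2) / r

def r0_from_r(r, gi_mean, gi_sd):
    """R0 from growth rate r for a gamma-distributed generation interval.
    The Euler-Lotka relation gives R0 = 1 / M(-r), where M is the moment
    generating function of the generation interval; for a gamma distribution
    with shape k and scale theta, this reduces to (1 + r * theta)**k."""
    k = (gi_mean / gi_sd) ** 2      # gamma shape
    theta = gi_sd ** 2 / gi_mean    # gamma scale
    return (1.0 + r * theta) ** k

# Placeholder inputs: r = 0.12/day and a generation interval with
# mean 6 days and standard deviation 3 days (assumptions).
r = 0.12
print(doubling_time(r))                      # about 5.8 days
print(r0_from_r(r, gi_mean=6.0, gi_sd=3.0))  # about 1.9 under these assumptions
```

Note how strongly the implied \(R_0\) depends on the assumed generation interval: the same \(r\) yields a larger \(R_0\) when the generation interval is longer, which is one reason early \(R_0\) estimates varied so widely.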

The earliest estimates of \(r\) and \(R_0\) came from case report data in Wuhan [4]. By fitting an exponential growth model to initial case count data, researchers estimated \(r\) to be between 0.1 and 0.15 per day, which translates to an epidemic doubling time of five to seven days, and \(R_0\) to be between 2.2 and 3. These values were very similar to estimates for SARS-CoV-1 in 2003, leading to initial optimism that SARS-CoV-2, much like SARS-CoV-1, would not pose a global threat and would ultimately succumb to regional interventions. As we now know, this view was far too optimistic. The problem was that the rapidly changing technological and clinical landscape made the data from Wuhan unreliable; any method that assumed those early data were robust generated misleading results.
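As a minimal illustration of this type of fit, the snippet below estimates \(r\) by log-linear least squares on an invented daily case count series; the actual analyses (e.g., [4]) worked from the Wuhan case report data.

```python
import numpy as np

# Invented daily case counts growing at roughly 12% per day,
# standing in for an early outbreak time series.
days = np.arange(14)
cases = np.array([10, 11, 13, 14, 16, 18, 21, 23, 26, 29, 33, 37, 42, 48])

# Exponential growth model: cases(t) ~ c0 * exp(r * t), so log(cases)
# is linear in t with slope r.
r, log_c0 = np.polyfit(days, np.log(cases), 1)
print(f"r = {r:.3f}/day, doubling time = {np.log(2) / r:.1f} days")
```

The fit itself is trivial; the hard part, as the rest of this article emphasizes, is whether the counts actually reflect infections rather than shifting surveillance intensity.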

Our team approached this problem in January 2020 not by trying to “fix” the noisy data from Wuhan, but by collecting extensive case reports and travel data for people who moved from Wuhan to other provinces. Focusing on individuals who were infected in Wuhan but detected outside of Hubei province (where Wuhan is located) sidestepped the issue of unreliable data that came directly from Wuhan. Unlike those at the epicenter of the pandemic, provincial health systems were prepared for incoming cases and began to rigorously test everyone who entered each province.

To further isolate potential sources of bias in different data collection systems, we designed two inference approaches to reconstruct the preliminary dynamics of SARS-CoV-2 in Wuhan [6]. We found that the early epidemic in Wuhan doubled every 2.4 days, suggesting an extremely rapid spread that progressed much more quickly than previously thought. We further estimated \(R_0\) to be between 4.7 and 6.6 — also significantly higher than previous approximations, even when we incorporated the substantial uncertainty in other epidemiological parameters that influence \(R_0\). When we reached this conclusion in early 2020, it was initially met with extensive criticism and disbelief, especially since no major outbreaks outside of China were occurring at the time. However, subsequent outbreaks in Europe and the U.S. suggested that our estimates of the two fundamental parameters were accurate [1, 2].

Using simulations to connect our results to real-world implications, we found that incorporating the possibility of asymptomatic transmission—which was not yet evident at the time—meant that even extensive quarantine and contact tracing of symptomatic individuals would not control the epidemic locally. Instead, early and strong control measures like social distancing were required to stop the virus’ spread [6].
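A back-of-envelope calculation, with assumed numbers rather than the paper's fitted values, illustrates the intuition behind this simulation result: if some fraction of transmission occurs before or without symptoms, even perfect and instantaneous isolation of every symptomatic case leaves that fraction of \(R_0\) untouched.

```python
# Illustrative arithmetic only. R0 is taken from within the 4.7-6.6
# range estimated in [6]; the share of transmission occurring before
# or without symptoms is an assumed placeholder value.
R0 = 5.7
presymptomatic_share = 0.3

# Perfect isolation of symptomatic cases removes only the remaining
# symptomatic share of transmission.
R_eff = R0 * presymptomatic_share
print(R_eff)  # 1.71 > 1, so the epidemic keeps growing
```

Under these assumptions the effective reproductive number stays well above one, which is why measures like social distancing, which act on all transmission regardless of symptoms, were needed.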

Retrospectively, we realized that inaccuracies in the earliest estimates of \(R_0\) and \(r\) arose from large uncertainties in the case count data that were collected in Wuhan in late 2019. Low surveillance intensity, a lack of validated diagnostic tools for SARS-CoV-2, heterogeneity in symptoms, and other issues that often accompany a novel pathogen outbreak were responsible for these uncertainties. Unfortunately, many forecasting models and public health policy decisions throughout 2020 incorporated the initial conservative estimates of \(R_0\) and epidemic doubling time, largely ignoring the later (and often more accurate) estimates that were based on better data. It is impossible to know exactly what we could have done to avoid this scenario; however, a critical examination of whether standard methods were appropriate for such an unusual situation would certainly have helped frame the discussion in terms of how noise and uncertainty shaped the early data.

Selection Coefficient for SARS-CoV-2 Variants

As COVID-19 spread globally, researchers began to worry about viral evolution and selection. Bette Korber of Los Alamos National Laboratory was the first to identify positive selection in SARS-CoV-2, at the D614G mutation, by tracking its change in frequency over time [3]. At first, her discovery faced widespread incredulity; critics wondered how a single amino acid change could alter the phenotype so drastically. Other studies around that time employed phylogenetic methods and found limited evidence of selection at D614G [9], thus framing an apparent controversy surrounding SARS-CoV-2’s possible evolution towards higher contagiousness.

We approached this issue from a different perspective. Rather than attempting to track the places at which D614G emerged globally, we considered its first entry into a country as a unique trial and modeled the time to extinction or fixation (when all sampled viruses carry the D614G mutation) [7]. By treating countries as pseudo-independent units, we explicitly modeled the heterogeneity in selection effects that arises from differences in how various countries collected SARS-CoV-2 data. This method ultimately led to a more stable estimate of the danger of new variants.
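A toy version of this trial framing, assuming a simple Wright-Fisher-style resampling scheme rather than the full stochastic model of [7], might look like the sketch below; the population size, initial frequency, and selection coefficient are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(1)

def trial(s, n=1000, p0=0.01, max_gen=500):
    """One country-level 'trial': a variant with selection coefficient s,
    tracked in a population of n sampled genomes per generation, runs
    until it is lost or fixed (all sampled viruses carry the mutation)."""
    p = p0
    for t in range(max_gen):
        # Selection shifts the expected frequency; binomial sampling
        # then adds drift (and, loosely, surveillance noise).
        p = p * (1 + s) / (p * (1 + s) + (1 - p))
        p = rng.binomial(n, p) / n
        if p == 0.0:
            return "lost", t
        if p == 1.0:
            return "fixed", t
    return "segregating", max_gen

# Each country contributes one pseudo-independent trial.
outcomes = [trial(s=0.1)[0] for _ in range(100)]
print(outcomes.count("fixed"), "of 100 trials ended in fixation")
```

Even with a real selective advantage, many introductions are lost to chance while still rare, which is why single-country frequency trajectories can be misleading on their own.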

As with our attempts to estimate \(r\), we tried several different types of models, from a complex stochastic model to a simple statistical model, under the assumption that knowing the validity of different model structures was as important as the inferences themselves. We found unambiguous evidence for selection, which suggested an increase in contagiousness for D614G and several other variants, even in the presence of very high levels of migration [7]. We later extended this work to show that increases in recent infections by a previous variant (e.g., Delta) heighten the measured selection effect for a new incoming variant (e.g., Omicron) [8]. However, heterogeneity in measured selection effects at the country level was very large. If we had used noisy data from only one country or ignored the data’s natural hierarchical structure, we easily could have significantly under- or overestimated the selection effects for new variants.
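One simple way to respect that hierarchical structure, sketched here with invented per-country numbers and a deliberately simpler technique (DerSimonian-Laird random-effects pooling) than the joint model in [7], is to let a between-country variance term absorb the surveillance-driven heterogeneity instead of allowing any single country to dominate the estimate.

```python
import numpy as np

# Hypothetical per-country selection-coefficient estimates and
# standard errors; the real analysis [7] fit all countries jointly.
s_hat = np.array([0.08, 0.14, 0.05, 0.11, 0.20, 0.09])
se = np.array([0.02, 0.03, 0.04, 0.02, 0.05, 0.03])

# DerSimonian-Laird: estimate the between-country variance tau^2 from
# the heterogeneity statistic Q, then reweight each country by the
# inverse of its total (within + between) variance.
w = 1.0 / se**2
s_fixed = np.sum(w * s_hat) / np.sum(w)
Q = np.sum(w * (s_hat - s_fixed) ** 2)
k = len(s_hat)
tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
w_re = 1.0 / (se**2 + tau2)
s_pooled = np.sum(w_re * s_hat) / np.sum(w_re)
print(f"pooled s = {s_pooled:.3f}, between-country sd = {np.sqrt(tau2):.3f}")
```

When \(\tau^2\) is large, the pooled estimate moves toward an unweighted average across countries, which is exactly the behavior one wants when per-country noise levels differ for reasons unrelated to the virus itself.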

Lessons Learned

Ultimately, we concluded that applied epidemiology during an ongoing pandemic is severely complicated by the noisy and incomplete data that stem from a novel pathogen outbreak. Even simple tasks, such as estimating an exponential growth rate, become challenging if auxiliary assumptions are left unexamined and subsequently unmet. While we do not have an immediate solution to these problems, we attempted to implement workflows that acknowledge the potential inaccuracy of early data and employ analytical methods, such as novel data cleaning techniques and explicit modeling of random effects, to account for data uncertainty. We can therefore collectively prepare for the next pandemic by developing a body of knowledge that encompasses strategies for handling potentially unreliable data, realistic standards for the integration of uncertainty in epidemiological parameters, and statistical software that incorporates those concerns.

More broadly, our experiences during the beginning stages of the COVID-19 pandemic clearly demonstrate that uncertainties in early data collection for a novel outbreak can yield divergent or even opposing conclusions from different research groups. As scientists, we should resist the urge to settle scientific issues quickly and form a consensus prematurely; rigorous discussion and evaluation of different findings will presumably lead to more accurate knowledge and better public health policies. As theoretical physicist Richard Feynman put it, “The first principle is that you must not fool yourself, and you are the easiest person to fool.” This is a good reminder, especially when the stakes are high.


References
[1] Ke, R., Romero-Severson, E., Sanche, S., & Hengartner, N. (2021). Estimating the reproductive number R0 of SARS-CoV-2 in the United States and eight European countries and implications for vaccination. J. Theor. Biol., 517, 110621. 
[2] Ke, R., Sanche, S., Romero-Severson, E., & Hengartner, N. (2020). Fast spread of COVID-19 in Europe and the US suggests the necessity of early, strong and comprehensive interventions. Preprint, medRxiv.
[3] Korber, B., Fischer, W.M., Gnanakaran, S., Yoon, H., Theiler, J., Abfalterer, W., … Montefiori, D.C. (2020). Tracking changes in SARS-CoV-2 spike: Evidence that D614G increases infectivity of the COVID-19 virus. Cell, 182(4), 812-827.e19. 
[4] Li, Q., Guan, X., Wu, P., Wang, X., Zhou, L., Tong, Y., … Feng, Z. (2020). Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia. New Engl. J. Med., 382(13), 1199-1207. 
[5] Park, S.W., Bolker, B.M., Champredon, D., Earn, D.J.D., Li, M., Weitz, J.S., … Dushoff, J. (2020). Reconciling early-outbreak estimates of the basic reproductive number and its uncertainty: Framework and applications to the novel coronavirus (SARS-CoV-2) outbreak. J. R. Soc. Interface, 17(168). 
[6] Sanche, S., Lin, Y.T., Xu, C., Romero-Severson, E., Hengartner, N., & Ke, R. (2020). High contagiousness and rapid spread of severe acute respiratory syndrome coronavirus 2. Emerg. Infect. Dis., 26(7), 1470-1477. 
[7] Van Dorp, C.H., Goldberg, E.E., Hengartner, N., Ke, R., & Romero-Severson, E.O. (2021). Estimating the strength of selection for new SARS-CoV-2 variants. Nat. Commun., 12, 7239. 
[8] Van Dorp, C., Goldberg, E., Ke, R., Hengartner, N., & Romero-Severson, E. (2022). Global estimates of the fitness advantage of SARS-CoV-2 variant Omicron. Virus Evol., 8(2), veac089. 
[9] Volz, E., Hill, V., McCrone, J.T., Price, A., Jorgensen, D., O’Toole, Á., … Connor, T.R. (2021). Evaluating the effects of SARS-CoV-2 spike mutation D614G on transmissibility and pathogenicity. Cell, 184(1), 64-75.

Ruian Ke is a staff scientist at Los Alamos National Laboratory whose research centers on modeling the dynamics and evolution of viral pathogens, including HIV, influenza, and HCV. Since late 2019, his research has focused heavily on the use of mathematical modeling and machine learning approaches to understand the transmission, infection, and evolution dynamics of SARS-CoV-2. Ethan Romero-Severson is a computational epidemiologist in the Theoretical Biology and Biophysics group at Los Alamos National Laboratory. His work on infectious disease epidemiology bridges the evolutionary biology and mathematical modeling of viral pathogens like HIV, HCV, and bunyaviruses.
