About the Author

Topological Artificial Intelligence Forecasting of Future Dominant Viral Variants

By Guo-Wei Wei

SARS-CoV-2 is an extremely sophisticated virus with 29 different proteins. This number includes the spike protein, which enables viral cell entry through its interaction with human angiotensin-converting enzyme 2 (hACE2) at the virus spike receptor-binding domain (RBD). The strength of this interaction is proportional to virus infectivity [8]. Mutations may occur randomly, but natural selection favors RBD mutations that strengthen viral infectivity and evolutionary fitness [4]. For example, approximately 32 of the 50 mutations of COVID-19’s Omicron variant are located on the spike protein; non-proportionally, 15 are on RBD to optimize viral evolutionary advantages.

The spike protein is also the main antigenic target of COVID-19 antibodies that are generated by either infection or vaccination. Spike protein-bound antibodies prevent the SARS-CoV-2 virus from interacting with hACE2 and subsequently block viral cell entry, while antibodies that compete with hACE2 on the spike RBD can directly neutralize the virus. The non-covalent binding between a viral spike protein and an antibody works like a zipper, with the virus spike RBD acting as the zipper’s upper teeth and the antibody serving as the lower teeth. The RBD mutations cause the zipper to misalign or break off, which leads to the (partial) loss of antibody protection and potential reinfection. In some cases, one or several of the vital RBD mutations can significantly enhance RBD-hACE2 binding and dramatically disrupt the binding between the spike RBD and protective antibodies.

Figure 1. Topology deciphers the virus code. Image courtesy of Rui Wang.
Forecasting of the emerging dominant variants helps policymakers plan preventive measures and allows biopharmaceutical companies additional time to develop future vaccines and antibody drugs. However, such forecasting is one of the most challenging scientific tasks of our time. Identifying the mutations that are vital for virus evolution involves striking complexity; a spike protein consists of more than 1,700 amino acid residues, and its dynamical degree of freedom exceeds 5,100. In contrast, the million-dollar Navier-Stokes existence and smoothness problem concerns only three-dimensional dynamics. Additionally, each residue can mutate into one of 19 alternative amino acids with a wide range of chemical, physical, and biological disparities. This possibility creates an astronomically large mutational space that is scaled as \(20^N\) (where \(N\) is the number of involved amino acid residues), thus making full experimental deep mutational screening unfeasible. Moreover, each set of mutations may potentially contribute to a new viral variant. The genotype-phenotype mapping between a variant and its infectivity and/or antibody resistance is highly nonlinear and involves intricate geometric and combinatorial complexities. Innovative strategies are therefore necessary.

Topology offers a solution to this intriguing problem (see Figure 1). Traditional topology addresses the invariants of a geometric object under continuous deformation; the homeomorphisms and homotopies may not refer to metrics or coordinates and are too abstract to find use in biological analysis. However, persistent homology—a new branch of algebraic topology that employs multiscale analysis to bridge the gap between complex geometry and abstract topology [1, 6]—effectively simplifies biomolecular complexity. Persistent homology analyzes biomolecular data in terms of a simplicial complex. The underlying topological space is equipped with filtration to create a family of simplicial complexes, which is a nested sequence of multiscale subsets. But biomolecular systems involve a wide range of interactions—like covalent bonds, hydrogen bonds, van der Waals, electrostatics, hydrophilicity, hydrophobicity, and so forth—that the whole molecular persistent homology analysis would miss. Element-specific persistent homology (ESPH) overcomes this obstacle by embedding physical, chemical, and biological information in topological invariants. The power of ESPH was exemplified through its dominating victories in worldwide annual competitions about computer-aided drug design, which is one of the most competitive fields in modern science [7]. It remains to be seen whether this topological tool can withstand the outburst challenges that are associated with the ongoing COVID-19 pandemic.

Right before the pandemic began, researchers developed an ESPH-based deep learning method that offers state-of-the-art predictions of mutation-induced binding affinity changes of protein-protein interactions — including antibody-antigen interactions [9]. Scientists integrated this method with genotyping and sequence alignment to create an artificial intelligence (AI) platform, which revealed that SARS-CoV-2 evolution and transmission follows Darwin’s natural selection [4]. The study singled out two vital spike protein residues at positions 452 and 501 that “have very high chances to mutate into significantly more infectious COVID-19 strains” long before they occurred in prevailing SARS-CoV-2 variants, such as Alpha, Beta, Gamma, Delta, Epsilon, Theta, Kappa, Lambda, Mu, and Omicron [4]. During this process, ESPH delineates the crucial geometric and biophysical characteristics of mutants in the astronomically large topological space that contributes to virus infectivity, vaccine breakthrough, and antibody resistance.

Figure 2. The receptor-binding domain (RBD) mutations of Omicron subvariants at the RBD-hACE2 interface, as well as their mutation-induced changes in binding free energy (BFE). 2a. RBD mutations of Omicron subvariants at the RBD-hACE2 interface. The 12 shared mutations are shown in cyan, BA.1 mutations are plotted with magenta, BA.2 mutations are marked in yellow, and BA.4 and BA.5 mutations are labeled in orange. One can match the colors to the chart in Figure 2b. 2b. A comparison of predicted mutation-induced BFE changes for various SARS-CoV-2 variants and subvariants. According to the Boltzmann distribution, a variant with higher BFE change has an exponential advantage in infectivity. Image courtesy of Jiahui Chen.

When reports of the new Omicron variant first emerged in late November 2021, no relevant data was available because experiments had not yet been performed. But within a few days, a topology-based AI platform forecasted Omicron to be nearly three times more infectious than Delta, capable of escaping nearly 90 percent of vaccines, and resistant to essentially all U.S. Food and Drug Administration-approved monoclonal antibodies. Experiments confirmed these predictions in the following weeks [3]. In early 2022, a new subvariant of Omicron called BA.2 started spreading. On February 11, the same topology-based AI platform forecasted BA.2 as the next dominant variant [5]. Six weeks later on March 26, the World Health Organization announced BA.2’s global dominance. 

Many Omicron subvariants were circulating around the world by late April, including BA.1, BA.1.1, BA.2, BA.2.11, BA.2.12.1, BA.3, BA.4, and BA.5 (see Figure 2). These subvariants involve in numerous spike protein RBD mutations with very subtle differences, therefore demanding more discriminative mathematical tools. Although persistent homology is an outstanding tool for the characterization of topological invariants, it is insensitive to the homotopic shape variations in protein-protein interactions that are crucial to viral evolution and transmission. A recent study tackled this challenge with the persistent Laplacian (also known as the persistent spectral graph): a topological Laplacian that is designed to capture both the topological persistence and homotopical shape evolution of data [10]. Its harmonic spectra fully recover the topological invariants of persistent homology, while its nonharmonic spectra unveil homotopical shape evolution. On May 1, 2022, persistent Laplacian-based AI projected Omicron BA.4 and BA.5 to become the new dominating COVID-19 variants [2]. This prediction became reality in late June.

[1] Carlsson, G. (2009). Topology and data. Bull. Am. Math. Soc., 46(2), 255-308.
[2] Chen, J., Qiu, Y., Wang, R., & Wei, G.-W. (2022). Persistent Laplacian projected Omicron BA.4 and BA.5 to become new dominating variants. Preprint, arXiv:2205.00532.
[3] Chen, J., Wang, R., Gilby, N.B., & Wei, G.-W. (2022). Omicron variant (B.1.1.529): Infectivity, vaccine breakthrough, and antibody resistance. J. Chem. Inf. Model., 62(2), 412-422.
[4] Chen, J., Wang, R., Wang, M., & Wei, G.-W. (2020). Mutations strengthened SARS-CoV-2 infectivity. J. Mol. Biol., 432(19), 5212-5226.
[5] Chen, J., & Wei, G.-W. (2022). Omicron BA.2 (B.1.1.529.2): High potential for becoming the next dominant variant. J. Phys. Chem. Lett., 13(17), 3840-3849.
[6] Edelsbrunner, H., & Harer, J. (2008). Persistent homology — a survey. Contemp. Math., 453, 257-282.
[7] Nguyen, D.D., Cang, Z., Wu, K., Wang, M., Cao, Y., & Wei, G.-W. (2019). Mathematical deep learning for pose and binding affinity prediction and ranking in D3R Grand Challenges. J. Comput. Aided Mol. Des., 33(1), 71-82.
[8] Walls, A.C., Park, Y.-J., Tortorici, M.A., Wall, A., McGuire, A.T., & Veesler, D. (2020). Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein. Cell, 181(2), 281-292.
[9] Wang, M., Cang, Z., & Wei, G.-W. (2020). A topology-based network tree for the prediction of protein-protein binding affinity changes following mutation. Nat. Mach. Intell., 2(2), 116-123.
[10] Wang, R., Nguyen, D.D., & Wei, G.-W. (2020). Persistent spectral graph. Int. J. Numer. Method Biomed. Eng., 36(9), e3376.

Guo-Wei Wei is an MSU Foundation Professor at Michigan State University.