SIAM News Blog

High-Performance Computing and Artificial Intelligence Approaches to COVID-19 Therapeutics

By Rick Stevens

The COVID-19 pandemic uprooted nearly everyone’s lives, changing the way people live, work, think about the future, and even interact with each other. It has also had a profound impact on science. Researchers around the world pivoted from their scientific and technological projects to use their critical experience and expertise to tackle COVID-19. My group is well-versed in high-performance computing (HPC) and artificial intelligence (AI), and we immediately found a way to utilize our research strengths and supercomputers to accelerate drug discovery for COVID-19 therapeutics.

Figure 1. The Argonne Leadership Computing Facility’s computing resources support large-scale, computationally intensive projects that aim to solve some of the world’s most complex and challenging scientific problems. Figure courtesy of Argonne National Laboratory.
While the scientific community initially identified several promising leads for the treatment of COVID-19 via high-throughput screening of existing and approved drugs, none of them have yielded effective drugs that can inhibit the disease’s causative virus (namely the severe acute respiratory syndrome coronavirus 2, or SARS-CoV-2). The virus codes for approximately 30 proteins; about half of these proteins are plausible drug targets and even fewer are priority targets. Various drugs that inhibit individual protein functions can target different elements of the virus’ life cycle. Unlike vaccines, therapeutics aim to simultaneously hit multiple stages of the life cycle to maximize impact.

We wished to apply virtual screening to this problem by using computers to identify small molecules that can bind to and potentially inhibit the viral proteins, then validate with laboratory methods that they indeed inhibit the virus. However, given that the chemical space of small drug-like molecules is estimated to contain on the order of \(10^{60}\) compounds, this is an arduous task even if one employs all of the world’s supercomputing infrastructure. We thus devised a novel approach that tightly integrates HPC, AI, and machine learning (ML) to identify small molecules that could serve as potential leads for the development of effective antiviral therapeutics.
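A quick back-of-envelope calculation shows why exhaustive screening of this space is hopeless. The rate below is a deliberately generous hypothetical, far beyond any real docking throughput:

```python
# Illustrative arithmetic only: even at a hypothetical rate of 10^18
# evaluations per second, an exhaustive sweep of a 10^60-compound
# chemical space would take on the order of 10^34 years.
space = 10**60                 # estimated size of drug-like chemical space
rate = 10**18                  # hypothetical evaluations per second
seconds_per_year = 3600 * 24 * 365
years = space / rate / seconds_per_year
print(f"{years:.2e} years")
```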

Building HPC-computed Features from Open-source Chemical Libraries 

To fuel ML surrogates for virtual screening, one must find a way to represent the molecules. The scientific community has developed many such approaches over the years, including descriptors, fingerprints, graphs, and two-dimensional (2D) and three-dimensional (3D) images. Our prior work in cancer research and antimicrobial resistance offered a wealth of information as we attempted to generate representations for all compounds. We settled on the simplified molecular input line entry system (SMILES), molecular descriptors, fingerprints, and 2D images as possible representations.

Generating the compound representations for ML required significant computing. We gathered data across 23 public sources for a total of approximately 4.2 billion molecules. We then converted each molecule to a canonical SMILES and computed around 1,800 2D and 3D molecular descriptors, molecular fingerprints that encode structure, and 2D images of the molecular structure: roughly 60 terabytes of data in all. The computed data provides crucial input features to AI models for the prediction of molecular properties such as docking scores and toxicity. The data processing pipeline used about two million core hours on several current HPC systems: the Argonne Leadership Computing Facility’s Theta (see Figure 1), the Texas Advanced Computing Center’s Frontera, and the Oak Ridge Leadership Computing Facility’s Summit. All of this information is publicly available online and in our corresponding paper [1].
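To make the featurization step concrete, the toy sketch below mimics its shape: a SMILES string goes in, a fixed-length bit-vector fingerprint comes out. In the real pipeline, canonicalization, descriptors, and structural fingerprints come from a cheminformatics toolkit such as RDKit; here, hashed character n-grams of the SMILES string stand in for a true structural fingerprint, so the code is illustrative only:

```python
# Toy stand-in for molecular featurization: hash character n-grams of a
# SMILES string into a fixed-length bit vector. A real pipeline would use
# structural fingerprints (e.g., Morgan fingerprints via RDKit) instead.
import hashlib

def smiles_fingerprint(smiles, n_bits=64, n=3):
    """Hash character n-grams of a SMILES string into a bit vector."""
    bits = [0] * n_bits
    for i in range(max(len(smiles) - n + 1, 1)):
        gram = smiles[i:i + n]
        h = int(hashlib.md5(gram.encode()).hexdigest(), 16)
        bits[h % n_bits] = 1          # set the bit this n-gram hashes to
    return bits

fp = smiles_fingerprint("CC(=O)Oc1ccccc1C(=O)O")  # SMILES for aspirin
```

The same map-one-molecule-to-one-feature-vector structure is what made the real computation embarrassingly parallel across billions of molecules.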

AI-based Filtering of Drug Candidates

AI/ML methods must also learn from extensive training datasets. In this case, our objective was to learn whether a small molecule could bind to a protein target, as quantified by its docking score. We first generated a virtual screening dataset of approximately seven million compounds that we selected for diversity against each of the 13 viral protein targets. As when generating representations of compounds, we again needed to extensively optimize our workflow to scale on our supercomputers. We used this data to train deep neural networks to predict the docking score, then re-docked the predicted top-scoring small molecules for each protein target.
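The surrogate idea can be sketched in a few lines. The actual work trained deep neural networks on millions of docked compounds; the minimal sketch below substitutes closed-form ridge regression on synthetic fingerprints and docking scores so it stays self-contained, but the workflow is the same: fit a fast model on docked examples, then rank a large candidate pool and keep only the best-predicted compounds for re-docking:

```python
# Minimal docking-score surrogate sketch on synthetic data (the real work
# used deep neural networks; ridge regression keeps this self-contained).
import numpy as np

rng = np.random.default_rng(0)
n_train, n_bits = 500, 64
X = rng.integers(0, 2, size=(n_train, n_bits)).astype(float)  # fingerprints
w_true = rng.normal(size=n_bits)
y = X @ w_true + 0.1 * rng.normal(size=n_train)  # synthetic docking scores

# Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(n_bits), X.T @ y)

# Score a large candidate pool and keep the best-predicted compounds
# (more negative docking score = stronger predicted binding).
pool = rng.integers(0, 2, size=(10_000, n_bits)).astype(float)
scores = pool @ w
top_idx = np.argsort(scores)[:100]   # 100 compounds to re-dock for real
```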

Our work continues, as we are part of a larger collaboration that is building an integrated pipeline to select promising drug candidates. This multi-stage campaign includes high-throughput ensemble docking to identify small molecules, AI-driven molecular dynamics to model specific binding regions and understand mechanistic changes that involve drugs, and binding free energy calculations for promising leads. We are currently developing methods to train ML models and use them as an upfront enrichment stage that selects the subset of compounds to which docking is applied.
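The enrichment logic described above can be sketched as a simple funnel: a cheap learned score ranks the whole pool, and only a small shortlist ever reaches the expensive docking stage. The function bodies below are placeholders (deterministic stand-ins, not real models), but the control flow mirrors the pipeline:

```python
# Sketch of surrogate-driven enrichment: score everything cheaply, dock
# only the top fraction. Both scoring functions are placeholder stand-ins.
def surrogate_score(compound):
    # ML surrogate stand-in: fast, cheap estimate (lower = better binding)
    return sum(ord(c) for c in compound) % 997 / 997.0

def dock(compound):
    # Physics-based docking stand-in: far more expensive in reality
    return surrogate_score(compound) - 0.05

def enrich_and_dock(pool, keep_frac=0.01):
    ranked = sorted(pool, key=surrogate_score)           # cheap pass on all
    shortlist = ranked[: max(1, int(len(pool) * keep_frac))]
    return {c: dock(c) for c in shortlist}               # dock only the top

hits = enrich_and_dock([f"cmpd_{i}" for i in range(10_000)])
```

With a 1% keep fraction, docking cost drops by two orders of magnitude before the surrogate's own speed advantage is even counted.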

This approach is promising because using the ML model to infer an initial docking score is roughly 50,000 times faster than evaluating the docking score directly. This incredible speed advantage, combined with good accuracy, can significantly support the enrichment of small molecules that can potentially target SARS-CoV-2. More information on this work is available in our resulting paper [2] (see Figure 2).
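To get a feel for what a 50,000x speedup means at scale, consider the illustrative arithmetic below. The per-compound docking time is an assumed round number, not a measured figure from the paper; only the speedup ratio comes from the text:

```python
# Back-of-envelope cost comparison, assuming ~10 s per docking evaluation
# (an illustrative figure) and the reported ~50,000x surrogate speedup.
dock_s = 10.0                    # assumed seconds per docking evaluation
surrogate_s = dock_s / 50_000    # ML inference, per the reported speedup
n = 1_000_000_000                # a billion-compound library

dock_core_years = n * dock_s / (3600 * 24 * 365)
surrogate_core_hours = n * surrogate_s / 3600
print(f"docking:   {dock_core_years:.0f} core-years")
print(f"surrogate: {surrogate_core_hours:.1f} core-hours")
```

Under these assumptions, a billion-compound screen shifts from hundreds of core-years of docking to tens of core-hours of surrogate inference.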

Figure 2. The Integrated Modeling Pipeline for COVID Cure by Assessing Better Leads (IMPECCABLE) solution represents an entire virtual drug discovery pipeline, from hit to lead through to lead optimization. The constituent components are a deep learning-based surrogate model for docking (ML1), Autodock-GPU (S1), AI-driven molecular dynamics (S2, DeepDriveMD), and coarse- and fine-grained binding free energy calculations (S3-CG and S3-FG). Figure courtesy of [2].

SARS-CoV-2 Medical Therapeutics and More

By running this complex workflow, we experimentally screened more than 2,000 compounds against SARS-CoV-2 viral protein targets; our efforts led to 63 validated hits in the downstream antiviral assays. We have further derived about 600 additional molecules to bracket the initial hits and are characterizing them using experimental techniques.

This work has also boosted other scientific endeavors. We continue to advance HPC and AI-driven methods to accelerate drug discovery not only for COVID-19, but for a diverse set of diseases as well. Researchers are already employing the large compound database to support cancer studies and a variety of other projects that utilize ML. 


Rick Stevens presented this research during a minisymposium presentation at the 2021 SIAM Conference on Computational Science and Engineering, which took place virtually in March.

Acknowledgments: Research was supported by the Department of Energy’s (DOE) Office of Science through the National Virtual Biotechnology Laboratory, a consortium of DOE national laboratories that is focused on response to COVID-19. Funding was provided by the Coronavirus CARES Act.

References
[1] Babuji, Y., Blaiszik, B., Brettin, T., Chard, K., Chard, R., Clyde, A., …, Wagner, R. (2020). Targeting SARS-CoV-2 with AI- and HPC-enabled lead generation: A first data release. Preprint, arXiv:2006.02431.
[2] Saadi, A.A., Alfe, D., Babuji, Y., Bhati, A., Blaiszik, B., Brettin, T., …, Wifling, D. (2020). IMPECCABLE: Integrated Modeling PipelinE for COVID Cure by Assessing Better LEads. Preprint, arXiv:2010.06574.

Rick Stevens is an associate laboratory director at Argonne National Laboratory and a professor of computer science at the University of Chicago.
