| August 07, 2023

Predicting Biophysical Properties of Proteins

An important biological application for machine learning (ML) models is the prediction of electrostatic solvation free energy and the binding energy of proteins and ligands. These quantities help researchers better understand proteins for the purposes of disease prevention, immunization, and drug design.

Our project aims to design mathematical models that can be solved with efficient and accurate numerical algorithms to ultimately discover the useful information that is hidden in protein structural data. Specifically, we incorporate physics-informed features in algebra, topology, geometry, and electrostatics within a ML model. This work has two specific goals:

Utilize the aforementioned features to form the framework of a neural network (NN); the algebraic features stem from element-specific interactions [6], the geometric features are statistics of protein geometry and the dielectric interface between protein and water, and topological information is obtained via computational homology and software such as GUDHI.
Incorporate electrostatic features with message passing interface (MPI)-based parallelization of the boundary integral Poisson-Boltzmann (PB) solver, which can rapidly provide the electrostatic profile of solvated proteins.

The input data for the model is drawn from the Protein Data Bank (PDB). Each protein has a corresponding PDB file that provides the sequences of amino acids in the peptide chains, coordinates of the atoms in three dimensions, atom types, and so forth. Given the protein structure, the force field algorithm or software [4] can assign restrictions (bounded and non-bounded) on atoms and govern energy, potential, and forces.

We begin with a well-established model called TopologyNet, which integrates element-specific persistent homology and convolutional NNs to predict the affinity of protein ligand binding [2]. With help from Guo-Wei Wei’s group at Michigan State University, we were able to reproduce the results from [2] with a Pearson correlation coefficient of \(R_p=0.78\) between the predicted and true values for binding affinity. We used topological features to generate this result, and hypothesize that the incorporation of electrostatic components will increase the ML model’s prediction performance — thus motivating the need for fast and accurate methods to compute electrostatic features.

To calculate these features, we consider both the pairwise Coulomb interactions between partial charges and the electrostatic reaction potential \(\phi_\textrm{reac}\) that is output from the PB model. This model governs the electrostatic interactions between biomolecules and their solvent environments under the implicit solvent framework. Quantities that are governed by the PB model—such as potential, field, energy, and force—provide useful information about protein properties, protein interactions, protein folding pathways, and more.

Figure 1. The Poisson-Boltzmann model and its molecular surface. 1a. A two-dimensional schematic of a solvated biomolecule. 1b. A triangulated molecular surface of protein 1A63. Figure 1a courtesy of Weihua Geng, Figure 1b courtesy of [5].

Modeling and Methods

The PB model is given by

\[-\epsilon_1\nabla^2\phi(\mathbf{x})=\sum^{N_c}_{k=1}q_k\delta(\mathbf{x}-\mathbf{y}_k), \enspace \mathbf{x} \in \Omega_1 \tag1\]

\[-\epsilon_2\nabla^2\phi(\mathbf{x})=-\kappa^2\phi(\mathbf{x}), \enspace \mathbf{x} \in \Omega_2, \tag2\]

where \(\Omega_1\) is the molecule domain and \(\Omega_2\) is the solvent domain, separated by the molecular surface \(\Gamma\) (see Figure 1a). Moreover, \(\phi\) is the electrostatic potential; \(q_k\) is the partial atomic charge; \(\epsilon_1\) and \(\epsilon_2\) are respectively the permittivities in \(\Omega_1\) and \(\Omega_2\); and \(\kappa=\big(\frac{2Me^2_c}{k_BT}\big)^{1/2}\) denotes the Debye-Hückel parameter with \(M\) as the bulk concentration of ions per cubic centimeter, \(e_c\) as the charge of an electron, \(k_B\) as the Boltzmann constant, and \(T\) as the temperature.

The interface and boundary conditions are given by

\[\phi_1(\mathbf{x})=\phi_2(\mathbf{x}) \enspace \textrm{on} \enspace \Gamma, \tag3\]

\[\epsilon_1\frac{\partial\phi_1(\mathbf{x})}{\partial\mathbf{n}}=\epsilon_2\frac{\partial\phi_2(\mathbf{x})}{\partial\mathbf{n}} \enspace \textrm{on} \enspace \Gamma, \tag4\]

\[\underset{|\mathbf{x}|\rightarrow\infty}\lim\phi_2(\mathbf{x})=0, \tag5\]

where \(\mathbf{n}\) is the outward normal vector on \(\Gamma\).

We solve the PB model via the treecode-accelerated boundary integral solver [5]. The main computational steps are as follows:

Triangulate the molecular surface using MSMS software [7] (see Figure 1b)
Use the boundary integral method to obtain a set of coupled integral equations
Discretize the integral equations with centroid collocation to yield a linear system
Solve the linear system with the generalized minimal residual method via the treecode method to accelerate the matrix vector product in each iterative step
Compute the electrostatic solvation free energy given the resulting reaction potential:

\[E_\textrm{sol}=\frac{1}{2} \sum_{i} q_i\phi_\textrm{reac}(\mathbf{x_i}). \tag6\]

The use of the treecode algorithm in step 4 reduces the computational cost of the matrix vector product from \(\textrm{O}(N^2)\) to \(\textrm{O}(N\log N)\), where \(N\) is the number of faces in the triangularization. The treecode algorithm is able to reduce the complexity by replacing more expensive particle-particle interactions with less expensive particle-cluster interactions, which we evaluate with a far-field multipole expansion [1]. Since the system in question is usually large and the PB model needs to be solved for many cases, we must leverage parallelization techniques to develop efficient algorithms that compute the electrostatic features. We investigate two methods—sequential and cyclic—to parallelize our algorithm for computing the electrostatic potential (see Figure 2) [3]. We employ the MPI standard to implement the parallelization methods and use the MPI_Allreduce and MPI_SUM reduction operators to handle all necessary communications.

Figure 2. Sequential versus cyclic parallelization schemes. Figure courtesy of [3].

The sequential implementation organizes work to a number \(n_p\) of central processing units (CPUs) by assigning the first \(N/n_p\) particles to the first CPU, the second \(N/n_p\) particles to the second CPU, and so on. Doing so results in varying times on each CPU, which indicates a severe load imbalance. In the cyclic implementation, however, each CPU calculates the particle-tree interaction for particles that are uniformly selected from each leaf. This approach offers improved load balance and parallel efficiency with ease of implementation, since particles at different locations have different interactions (particle-particle versus particle-cluster) with other particles throughout the tree.

Results and Concluding Thoughts

To study the model’s performance, we focused on one protein with PDB ID 1A63 that has 2,069 atoms and 69,969 triangles. Figures 3a to 3c demonstrate the load balancing impacts of the sequential versus cyclic schemes for 32, 64, and 128 MPI tasks. The cyclic scheme experiences a greatly improved load balance over the sequential scheme; this result justifies the cyclic scheme’s use.

Figure 3. Load balance comparison between sequential (blue) and cyclic (orange) parallelization schemes for 32 (3a), 64 (3b), and 128 (3c) message passing interface (MPI) tasks. Figure courtesy of Elyssa Sliheet.

In conclusion, this work seeks to incorporate electrostatic features into existing ML frameworks for the prediction of biophysical properties of proteins in order to optimize prediction performance. We must parallelize algorithms for the calculation of these electrostatic features when handling large numbers of sizeable biomolecules. The MPI-based parallelization of the boundary integral PB solver can rapidly provide the electrostatic profile of solvated proteins.

We are currently in the process of incorporating a preconditioner into the model to calculate the electrostatic potential and further speed up the computation. We are also applying this project to COVID-19-related proteins and aim to submit a manuscript for publication in the near future [8]. After completing this work, we will have a PB solver that governs electrostatics and utilizes both fast algorithms and parallelization, ultimately allowing us to accurately and efficiently generate electrostatic features for the proteins; this PB solver will complement existing ML models [2] and enhance overall performance.

Elyssa Sliheet delivered a minisymposium presentation on this research at the 2023 SIAM Conference on Computational Science and Engineering (CSE23), which took place in Amsterdam, the Netherlands, earlier this year. She received funding to attend CSE23 through a SIAM Student Travel Award. To learn more about Student Travel Awards and submit an application, visit the online page.

SIAM Student Travel Awards are made possible in part by the generous support of our community. To make a gift to the Student Travel Fund, visit the SIAM website.

References
[1] Barnes, J., & Hut, P. (1986). A hierarchical \(\textrm{O}(N\log N)\) force-calculation algorithm. Nature, 324, 446-449.
[2] Cang, Z., & Wei, G.-W. (2017). Topologynet: Topology based deep convolutional and multi-task neural networks for biomolecular property predictions. PLOS Comput. Biol., 13(7), e1005690.
[3] Chen, J., Geng, W., & Reynolds, D. (2021). Cyclically paralleled treecode for fast computing electrostatic interactions on molecular surfaces. Comput. Phys. Commun., 260(2), 107742.
[4] Dolinsky, T.J., Czodrowski, P., Li, H., Nielsen, J.E., Jensen, J.H., Klebe, G., & Baker, N.A. (2007). PDB2PQR: Expanding and upgrading automated preparation of biomolecular structures for molecular simulations. Nucleic Acids Res., 35, W522-5.
[5] Geng, W., & Krasny, R. (2013). A treecode-accelerated boundary integral Poisson-Boltzmann solver for electrostatics of solvated biomolecules. J. Comput. Phys., 247, 62-78.
[6] Nguyen, D.D., & Wei, G.-W. (2019). AGL-score: Algebraic graph learning score for protein-ligand binding scoring, ranking, docking, and screening. J. Chem. Inf. Model., 59(7), 3291-3304.
[7] Sanner, M.F., Olson, A.J., & Spehner, J.C. (1996). Reduced Surface: An efficient way to compute molecular surfaces. Biopolymers, 38(3), 305-320.
[8] Yang, X., Sliheet, E., Iriye, R., Reynolds, D., & Geng, W. (2023). Computing electrostatics of COVID-19 proteins with parallelized boundary integral Poisson-Boltzmann solvers. In preparation.

Elyssa Sliheet is in the fifth year of her Ph.D. program in computational and applied mathematics at Southern Methodist University, where she works under the guidance of Weihua Geng. Her research focuses on numerical algorithms and machine learning models for the study of biophysical properties of proteins, with applications to cancer and COVID-19.