April 10, 2019

Working at the Intersection of Deep Learning with Protein Structure Classification and Prediction

Proteins are responsible for most functions in our body. They form as an extended chain of amino acids and fold into a three-dimensional (3D) structure that governs their function [2]. Determining proteins’ 3D structure is key to understanding how they work, why they cause diseases, and how researchers can design drugs to block or activate their functions [5]. Predicting a protein’s atomic structure from its amino acid sequence has been a grand challenge in biology for 50 years. Current experimental approaches to determining the 3D structure of proteins are too costly to keep pace with the wealth of protein sequences that genome-sequencing projects generate. In fact, the rate at which new protein sequences are discovered outpaces the rate of experimental structure determination by orders of magnitude. Thus, establishing a computational approach that bridges the sequence-structure knowledge gap will significantly impact bioinformatics, biology, and medicine.

Figure 1. Cartoon representation of the GDP-bound human KRas crystal structure.

Two major developments are fueling hopes that a computational solution may be within reach: (1) the Protein Data Bank, which stores all experimentally solved structures, has grown large enough to train deep learning algorithms, and (2) the combination of deep learning algorithms and computing power can now extract information from very large datasets. The impressive performance of DeepMind (the artificial intelligence company associated with Google) in the Critical Assessment of Protein Structure Prediction, an independent community-wide experiment established to assess and advance structure prediction, offered a strong indication that a solution is near.

As part of our explorations at the intersection of deep learning and protein structure research, we have been using convolutional neural networks (CNNs) to classify protein structures. CNNs have made substantial breakthroughs in feature extraction and image recognition tasks by modeling how the human brain processes inputs of information through different layers of representation. However, applying them to extract complex features from scientific data such as protein structures is not straightforward. These systems learn hierarchies of features by examining small areas of an image through “windows” called kernels. As a result, CNNs preserve only local spatial relationships between image components, which is a limitation. In addition, the learned features depend on the specific positions and viewpoint from which they appear in the data. CNN users compensate for this deficiency by adding rotated copies of the training data, a process known as data augmentation. Researchers have proposed capsule networks [4] as a solution to these problems. A capsule is a group of neurons that collectively compute an activation vector that represents both the existence and the properties of a specific entity, such as a feature in an image.
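The rotation-based data augmentation mentioned above can be illustrated with a minimal sketch. It assumes the input is a small 2D grid stored as a list of rows and uses only 90-degree rotations; the function names `rotate90` and `augment_with_rotations` are hypothetical, and real pipelines for 2D or 3D structural representations would likely use more rotations and larger inputs.

```python
def rotate90(grid):
    """Rotate a 2D grid (a list of rows) 90 degrees clockwise."""
    # Reversing the row order and then transposing gives a clockwise turn.
    return [list(row) for row in zip(*grid[::-1])]


def augment_with_rotations(grid):
    """Return the original grid plus its 90, 180, and 270 degree rotations."""
    views = [[list(row) for row in grid]]  # copy of the original
    for _ in range(3):
        views.append(rotate90(views[-1]))
    return views


# Toy 2x2 "image" used only for illustration.
image = [[1, 2],
         [3, 4]]
augmented = augment_with_rotations(image)
for view in augmented:
    print(view)
```

Each rotated copy keeps the label of the original sample, so the network sees the same entity from several viewpoints at the cost of a training set four times larger.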

CNNs also learn nonlinear interactions between input variables to achieve more accurate predictions, which makes their models difficult to interpret. In other words, it is hard to explain why a CNN produces a particular result. Capsule networks are more intrinsically interpretable: one can use the components of the activation vectors to explain why the network detects certain features.
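To make the idea of an activation vector concrete, the sketch below implements the "squash" nonlinearity described by Sabour, Frosst, and Hinton in the dynamic routing paper listed in the references. It rescales a capsule's raw output so that its length lies between 0 and 1 and can be read as the probability that an entity is present, while its direction is left unchanged to encode the entity's properties. Whether our implementation uses exactly this form is not stated here, so treat it as an illustrative assumption; the helper name `length` and the input values are hypothetical.

```python
import math


def squash(s):
    """Squash nonlinearity: keep the direction of s, map its length into [0, 1)."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0 for _ in s]  # avoid division by zero for an all-zero input
    norm = math.sqrt(norm_sq)
    # v = (|s|^2 / (1 + |s|^2)) * (s / |s|)
    scale = norm_sq / (1.0 + norm_sq) / norm
    return [scale * x for x in s]


def length(v):
    """Euclidean length of a vector, read here as 'probability of existence'."""
    return math.sqrt(sum(x * x for x in v))


# Two made-up raw capsule outputs pointing in similar directions.
weak = squash([0.1, 0.0, 0.1])    # short input: entity probably absent
strong = squash([3.0, 0.0, 4.0])  # long input: entity probably present

print(round(length(weak), 4), round(length(strong), 4))
```

Because the direction survives the squashing, one can vary individual components of an activation vector and observe which property of the detected entity changes, which is the basis of the interpretability argument above.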

Figure 2. Capsule network architecture.

We have successfully implemented a capsule network architecture and applied it to the classification of Ras protein family structures (see Figure 1) on GPU-based computational resources [1]. The Ras family of proteins is of great interest in cancer research because its members are considered undruggable due to the lack of obvious cavities on their lobular surfaces [3]. Ras proteins are implicated in 95 percent of pancreatic cancers and 45 percent of colorectal cancers. The proposed capsule network (see Figure 2), trained on two-dimensional and 3D representations of protein structures, can successfully classify HRas and KRas proteins, which are subclasses within the Ras family. Our results indicate that capsule networks achieve an accuracy improvement over traditional convolutional networks while simultaneously improving interpretability through visualization of activation vectors. These promising results open a unique opportunity to explore issues related to the interpretability of deep learning in computational structural biology.

Building on our previous experience using capsule networks to classify proteins, we plan to address the more complex problems of protein structure scoring and quality assessment with our novel implementation of capsule networks, and then explore combinations with other deep learning architectures. A typical computational approach to protein structure prediction is to sample the protein conformational space by generating a large number of 3D structures. One then evaluates the quality of these structures and chooses the best ones.

Proper selection of protein structures is therefore an important factor in accurate protein structure prediction. The selection relies on scoring functions that combine certain features to provide indicators of protein structure quality. However, current scoring functions do not consistently select the best structures. Deep learning offers great potential to improve protein scoring by learning the relations between features and structure quality from sets of annotated structures. We expect that this approach will help determine other features that may produce better scoring functions.
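The sample-then-select workflow can be sketched as follows. This is a toy illustration, not an actual scoring function: the feature names (`contact_agreement`, `clash_penalty`), the weights, and the candidate values are all hypothetical, and the hand-set weighted sum merely stands in for whatever combination of features a real scoring function, or a trained deep learning model, would use.

```python
def score(features, weights):
    """Combine per-structure features into one quality indicator (higher is better)."""
    return sum(weights[name] * value for name, value in features.items())


def select_top(candidates, weights, k=1):
    """Rank candidate structures by score and return the k best."""
    ranked = sorted(candidates,
                    key=lambda c: score(c["features"], weights),
                    reverse=True)
    return ranked[:k]


# Hypothetical weights: reward agreement with predicted contacts, punish clashes.
weights = {"contact_agreement": 1.0, "clash_penalty": -2.0}

# Hypothetical candidate structures generated by conformational sampling.
candidates = [
    {"name": "model_a", "features": {"contact_agreement": 0.80, "clash_penalty": 0.30}},
    {"name": "model_b", "features": {"contact_agreement": 0.70, "clash_penalty": 0.05}},
    {"name": "model_c", "features": {"contact_agreement": 0.40, "clash_penalty": 0.00}},
]

best = select_top(candidates, weights, k=1)[0]
print(best["name"])
```

In this toy example the candidate with the highest single feature value is not the one selected, which shows how the outcome depends on the way features are combined. A learned scoring model would replace the fixed weights with relations inferred from annotated structures.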

This article is based on a minisymposium talk entitled "A Deep Learning Approach to Protein Structure Classification and Prediction," which took place at the 2019 SIAM Conference on Computational Science and Engineering, held in Spokane, Wash., earlier this year.

References
[1] de Jesus, D.R., Cuevas, J., Rivera, W., & Crivelli, S. (2018). Capsule Networks for Protein Structure Classification. In 2018 ACM/IEEE Supercomputing Conference. Dallas, TX.
[2] Dill, K.A., & MacCallum, J.L. (2012). The protein folding problem, 50 years on. Science, 338(6110), 1042-1046.
[3] McCormick, F. (2015). KRAS as a Therapeutic Target. Clin. Cancer Res., 21(8), 1797-1801.
[4] Sabour, S., Frosst, N., & Hinton, G.E. (2017). Dynamic Routing Between Capsules. In Advances in Neural Information Processing Systems 30 (NIPS2017). Long Beach, CA.
[5] Schaffhausen, J. (2012). Advances in structure-based drug design. Trends Pharmacol. Sci., 33(5), 223.

Wilson Rivera is a professor of computer science and engineering at the University of Puerto Rico, Mayagüez Campus. His research interests include parallel computing, cloud computing, and big data analytics. Rivera has authored more than 30 papers in a broad range of fields, including parallel computing, genetic algorithms, optimization, and machine learning.

Silvia Crivelli is a computational biology scientist at Lawrence Berkeley National Laboratory where she has conducted research in protein structure prediction and protein scoring. She created and led the WeFold collaborative project to tackle roadblocks in the field. Crivelli is a long-standing member of the Critical Assessment of Protein Structure Prediction community and a pioneer in building and running open collaborative efforts within this group. She led efforts to develop two highly innovative molecular modeling tools: ProteinShop and DockingShop. Crivelli has authored more than 30 papers in a broad range of STEM fields including parallel computing, molecular visualization, optimization, mathematical modeling, and machine learning.