August 16, 2023

The Same, but Different, but Still the Same: Discovering Differences Between Neutrophil Elastase and Cathepsin G

Neutrophil elastase (NE) and cathepsin G (CG) are two remarkably similar proteins that differ significantly in their respective functions. They are both part of the so-called trypsin/chymotrypsin serine protease family and are both found in white blood cells, essentially serving as the immune system’s first responders to fight inflammation [3]. All proteins are composed of amino acid residues that are linked in a specific sequence. The folding pattern of this amino acid chain (i.e., the protein structure) determines each protein’s function. Although NE and CG only have 32 percent of their amino acid sequence in common, their structures are almost identical (see Figure 1). Because they are from the same serine protease family, they also share the same active triad, which can cleave and process attackers of the immune system. Interestingly, NE can cleave the virulence factors of Shigella, a bacterium that is among the leading causes of diarrheal deaths; CG, however, cannot [6]. Nevertheless, a single mutation in CG can recover the ability to cleave the Shigella virulence factors [1].

Figure 1. The structures of neutrophil elastase (NE) and cathepsin G (CG). The active triad, which is responsible for protein functionality, is located in the center of each protein. Figure adapted from [6].

What variations between NE and CG cause their difference in specificity? How do their structures compare in solvent, as opposed to the crystal structures that exhibit a great deal of structural similarity? Do they undergo distinct conformational changes that then give rise to the different functions?

We conducted all-atom molecular dynamics simulations to probe the motion of the proteins in solvent. These simulations mainly utilize Newton’s equations of motion to generate stop-motion movies; the frames are each 5,000 femtoseconds apart and contain the atomic coordinates of the protein and the surrounding water molecules and ions. We then compared the movies to identify differences between the structures. If the structures had the exact same number of amino acid residues, we could easily attempt to compare the trajectories with the SiMBols package [5]. However, NE has 240 amino acids but CG has only 224. The scheme that compares the two protein structure simulations must therefore answer several important questions. How can we superimpose the two protein structures so that we can focus on the proteins’ internal motions? And how can we create matches between the amino acid residues to know which residues to compare with each other? To investigate these questions, we suggest a three-step process.
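
As a rough illustration of how such a movie of frames arises from Newton’s equations, the following toy sketch integrates a single particle on a harmonic spring with the velocity Verlet scheme and stores every 100th step as a frame. The integrator choice, the function name `velocity_verlet`, and all parameter values are assumptions made for illustration; they are not the simulation setup used in this work.

```python
import numpy as np

def velocity_verlet(x, v, force, mass, dt, n_steps, save_every):
    """Integrate Newton's equations of motion and return the saved frames."""
    frames = [x.copy()]                  # frame 0 is the starting structure
    f = force(x)
    for step in range(1, n_steps + 1):
        v = v + 0.5 * dt * f / mass      # first half-step velocity update
        x = x + dt * v                   # full-step position update
        f = force(x)                     # forces at the new positions
        v = v + 0.5 * dt * f / mass      # second half-step velocity update
        if step % save_every == 0:       # keep only every save_every-th step
            frames.append(x.copy())
    return np.array(frames)

def harmonic_force(x):
    """Toy force field: a spring with unit force constant (hypothetical)."""
    return -x

frames = velocity_verlet(np.array([1.0]), np.array([0.0]), harmonic_force,
                         mass=1.0, dt=0.01, n_steps=1000, save_every=100)
```

In a real simulation, `x` would hold the coordinates of every protein, water, and ion atom, and the force function would come from a molecular force field.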

Figure 2. Overlaid structures of neutrophil elastase (NE, in blue) and cathepsin G (CG, in red). The first and last amino acid residues of each protein are marked as such. To initialize the alignment procedure, we translate both structures so that their geometric centers coincide with the origin. We then rotate NE such that the triangles of the two proteins are coplanar. Figure courtesy of [6].

The Alignment

If two protein structures share the same length of amino acid sequence, researchers can employ tried-and-tested approaches to superimpose the structures. However, these approaches (e.g., the Kabsch algorithm, which is based on the root mean square deviation) fail if the lengths differ. A 2007 study proposed an algorithm that solves the problem by utilizing the Fréchet distance, which compares two curves without requiring the same length [2]. The two protein structures are aligned as follows:

1. Use the Kabsch algorithm to individually align each structure to its first simulation snapshot.
2. Translate the structures so that their geometric center is the origin.
3. Repeat for every snapshot:
   (a) Rotate one structure such that coplanar triangles are created from the origin and the first and last amino acid residues (see Figure 2).
   (b) Calculate the Fréchet distance.
   (c) Generate a rotation axis from two random residue positions and rotate the whole structure around the axis at a small angle.
   (d) Calculate the Fréchet distance and keep the new configuration if the distance is smaller than before; otherwise, revert the rotation.
   (e) Repeat steps (c) and (d) for a finite number of iterations.
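
The Kabsch alignment and the centering at the start of the list above can be sketched with NumPy as follows. This is a minimal sketch of the standard SVD-based Kabsch algorithm, not the code used in this work; the helper names `center`, `kabsch_rotation`, and `align_to_reference` are hypothetical, and each structure is assumed to be an array of shape (N, 3) with one representative position per residue.

```python
import numpy as np

def center(coords):
    """Translate coordinates so that their geometric center is the origin."""
    return coords - coords.mean(axis=0)

def kabsch_rotation(mobile, reference):
    """Rotation matrix that minimizes the RMSD between two centered
    (N, 3) coordinate sets of equal length N."""
    h = mobile.T @ reference                      # 3x3 covariance matrix
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))        # guard against reflections
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T

def align_to_reference(frame, reference):
    """Center both structures and rotate the frame onto the reference."""
    mobile = center(frame)
    rot = kabsch_rotation(mobile, center(reference))
    return mobile @ rot.T                         # rows are residue positions
```

Because the covariance matrix pairs residue `i` of one structure with residue `i` of the other, this approach requires equal lengths, which is why it is only applied within each protein’s own trajectory.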

Because this random search only accepts improvements, it inevitably settles into a local minimum of the Fréchet distance, and it is impossible to tell whether that minimum corresponds to the best alignment. To reduce the uncertainty, we repeat the process numerous times from the beginning in order to explore multiple local minima and ultimately select the best alignment.
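
Steps (b) through (e), together with the restarts, can be sketched as a simple accept-if-better random search. The sketch below uses the discrete Fréchet distance computed by dynamic programming and assumes that the rotation axis passes through the origin in the direction that connects the two randomly chosen residues, so that the geometric center stays fixed. The function names, the iteration count, and the maximum angle are illustrative assumptions, and the coplanar pre-rotation of step (a) is omitted.

```python
import numpy as np

def discrete_frechet(p, q):
    """Discrete Fréchet distance between curves p (n, 3) and q (m, 3)."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=2)
    n, m = d.shape
    ca = np.empty((n, m))
    ca[0, 0] = d[0, 0]
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]),
                           d[i, j])
    return ca[-1, -1]

def rotation_about_axis(axis, angle):
    """Rotation matrix from Rodrigues' formula (axis through the origin)."""
    axis = axis / np.linalg.norm(axis)
    k = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * k + (1.0 - np.cos(angle)) * (k @ k)

def refine(mobile, fixed, rng, n_iter=200, max_angle=0.05):
    """Random small rotations; keep one only if the distance decreases."""
    best = discrete_frechet(mobile, fixed)
    for _ in range(n_iter):
        i, j = rng.choice(len(mobile), size=2, replace=False)
        axis = mobile[i] - mobile[j]
        if np.linalg.norm(axis) < 1e-12:          # degenerate axis, skip
            continue
        rot = rotation_about_axis(axis, rng.uniform(-max_angle, max_angle))
        trial = mobile @ rot.T
        dist = discrete_frechet(trial, fixed)
        if dist < best:                           # otherwise: revert (discard)
            mobile, best = trial, dist
    return mobile, best

def best_of_restarts(mobile, fixed, n_restarts=5, seed=0):
    """Run several independent searches and keep the smallest distance."""
    rng = np.random.default_rng(seed)
    runs = [refine(mobile, fixed, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[1])
```

Each restart begins from the same starting orientation but follows a different random sequence of rotations, so the runs can end in different local minima.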

Figure 3. An excerpt of the penalty matrix for the dynamic time warping algorithm. 3a. Each entry shows the calculated penalty value; the tuple indicates the corresponding residue pair. Whenever the path does not backtrack diagonally, residues are not matched one-to-one. 3b. The yellow connections show the determined matches. Figure courtesy of [6].

The Matching

To compare the differences between the two protein structures, we must match residues from NE to residues from CG and vice versa. Since the protein lengths are different, one residue can match with multiple residues in the other structure. The dynamic time warping algorithm creates the matches by generating a penalty matrix \(P\) based on the Euclidean distance between two residues [4]. The \(i\)th residue’s location for NE and CG is respectively given by \(a_i\) and \(b_i\). We obtain the entries of the matrix \(p_{i,j}\) from

\[\begin{eqnarray} p_{1,1} &=& \lVert a_1-b_1\rVert\\
p_{1,i} &=& p_{1,i-1}+\lVert a_1-b_i\rVert\\
p_{i,1} &=& p_{i-1,1}+\lVert a_i-b_1\rVert\\
p_{i,j} &=& \min(p_{i,j-1},p_{i-1,j},p_{i-1,j-1})+\lVert a_i-b_j\rVert. \end{eqnarray}\tag{1}\]

We choose a minimal path through the matrix by back-tracing the lowest penalty values; Figure 3 visualizes this process, and each marked entry in the matrix is a match. Because it is distance-dependent, the matching can change during the course of the simulation and is therefore repeated for every snapshot.
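
The penalty matrix in (1) and the back-tracing of the minimal path can be sketched as follows. This is a minimal illustration for a single snapshot rather than the implementation used in this work; the function name `dtw_matches` is hypothetical, indices are 0-based (the equations are 1-based), and ties during back-tracing are broken in favor of the diagonal step, which is an assumption.

```python
import numpy as np

def dtw_matches(a, b):
    """Dynamic time warping between residue positions a (n, 3) and b (m, 3).

    Returns the penalty matrix P and the list of matched (i, j) index pairs.
    """
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    n, m = d.shape
    p = np.empty((n, m))
    p[0, :] = np.cumsum(d[0, :])          # first row: p_{1,i}
    p[:, 0] = np.cumsum(d[:, 0])          # first column: p_{i,1}
    for i in range(1, n):
        for j in range(1, m):
            p[i, j] = min(p[i, j - 1], p[i - 1, j], p[i - 1, j - 1]) + d[i, j]

    # Back-trace from the last entry to the first along the lowest penalties.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        if i == 0:
            j -= 1
        elif j == 0:
            i -= 1
        else:
            steps = [(p[i - 1, j - 1], i - 1, j - 1),   # diagonal: one-to-one
                     (p[i - 1, j], i - 1, j),           # b_j matched again
                     (p[i, j - 1], i, j - 1)]           # a_i matched again
            _, i, j = min(steps, key=lambda s: s[0])
        path.append((i, j))
    return p, path[::-1]

# Usage: three residues matched against two.
a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
b = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
penalties, matches = dtw_matches(a, b)
```

Every residue of both structures appears in at least one pair, and any non-diagonal step in the path means that one residue is matched to several partners.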

Comparison and Conclusion

We utilized the similarity measures implemented in SiMBols [5] to compare the matched pairs. We opted for the fast-computing Kullback-Leibler divergence, since the comparison occurs for each snapshot of the simulation trajectory. Figure 4 depicts the mean comparison and its standard deviation.
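
For reference, the Kullback-Leibler divergence of a discrete distribution \(P\) from a distribution \(Q\) is commonly defined as shown below. The symbols \(P\), \(Q\), and the index \(k\) are generic placeholders; how SiMBols turns the coordinates of a matched residue pair into such distributions is not described here.

\[D_{\mathrm{KL}}(P\,\|\,Q)=\sum_{k}P(k)\,\log\frac{P(k)}{Q(k)}\]

The divergence is non-negative and equals zero only when \(P\) and \(Q\) coincide, and it requires just a single pass over the data, which is consistent with its use for every snapshot.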

Figure 4. Because residues are not matched one-to-one, the difference between neutrophil elastase (NE) and cathepsin G (CG) is not symmetric and both directions must be considered. Both structures show little overall movement, as indicated by the root mean square fluctuation. The greatest differences occur in regions that are more versatile. Interestingly, the third residue of the active triad—Ser195—seems less conserved in its location than the other two active residues. This observation could explain the different functions of the two proteins. Figure courtesy of [6].

The gray bars in Figure 4 are associated with experimentally identified loops that coincide with more versatile regions in which the conformations of NE and CG differ. The active triad, which is directly involved in protein function, is also evident. The first two residues, His57 and Asp102, are located in troughs and their positions are conserved. But the third residue, Ser195, undergoes a small conformational change. As this serine is known to stabilize the protein’s reaction, the small variation might explain the differing behavior of NE and CG.

In conclusion, we demonstrate a workflow to compare conformational changes in similar proteins whose amino acid sequences have different lengths. We repeat the individual steps of the workflow multiple times to establish statistical validation and sample the complete space of possibilities that stem from a molecular dynamics simulation. We can then interpret the results in a biological context, which might allow for reasonable suggestions of different mutations or highlight the importance of certain regions in a protein’s structure.

Fabian Schuhmann delivered a minisymposium presentation on this research at the 2023 SIAM Conference on Computational Science and Engineering (CSE23), which took place in Amsterdam, the Netherlands, earlier this year. He received funding to attend CSE23 through a SIAM Student Travel Award. To learn more about Student Travel Awards and submit an application, visit the online page.

SIAM Student Travel Awards are made possible in part by the generous support of our community. To make a gift to the Student Travel Fund, visit the SIAM website.

Acknowledgments: The CARL Cluster at the Carl von Ossietzky University of Oldenburg and the North German Supercomputing Alliance provided computational resources for the simulations. The author gratefully acknowledges the computing time granted by the Resource Allocation Board and provided on supercomputers Lise and Emmy at the National High Performance Computing Alliance (NHR) Center at Zuse Institute Berlin and NHR at Göttingen as part of the NHR infrastructure. The calculations for this research were conducted with computing resources under project NIP00058.

References
[1] Averhoff, P., Kolbe, M., Zychlinsky, A., & Weinrauch, Y. (2008). Single residue determines the specificity of neutrophil elastase for Shigella virulence factors. J. Mol. Biol., 377(4), 1053-1066.
[2] Jiang, M., Xu, Y., & Zhu, B. (2007). Protein structure-structure alignment with discrete Fréchet distance. J. Bioinform. Comput. Biol., 6(1), 51-64.
[3] Korkmaz, B., Horwitz, M.S., Jenne, D.E., & Gauthier, F. (2010). Neutrophil elastase, proteinase 3, and cathepsin G as therapeutic targets in human diseases. Pharmacol. Rev., 62(4), 726-759.
[4] Sakoe, H., & Chiba, S. (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process., 26(1), 43-49.
[5] Schuhmann, F., Ryvkin, L., McLaren, J.D., Gerhards, L., & Solov’yov, I.A. (2023). Across atoms to crossing continents: Application of similarity measures to biological location data. PLoS One, 18(5), e0284736.
[6] Schuhmann, F., Tan, X., Gerhards, L., Bordallo, H.N., & Solov’yov, I.A. (2022). The same, but different, but still the same: Structural and dynamical differences of neutrophil elastase and cathepsin G. Eur. Phys. J. D, 76, 126.

Fabian Schuhmann is a postdoctoral researcher at QuantBioLab in the Institute of Physics at the Carl von Ossietzky University of Oldenburg in Germany. His research interests revolve around the tailored development of analysis tools to understand biological and biophysical simulation data. More information about his work is available on his website.