| May 02, 2022

Building a Mathematical Toolbox for Biological Network Analysis

Large public databases that aggregate the many results of experiments in cell biology are increasingly expanding our knowledge of the relationships between human genes. Graph or network representations are appropriate for several types of data; for example, the classical protein-protein interaction (PPI) network places an edge between two genes if experimental evidence indicates that their associated proteins physically bind within the cell. Analysis of biological networks like PPI networks can yield powerful insights into important problems in functional genomics and predict previously unstudied genes that are related to key biological pathways that affect human health and disease.

Much like social networks, biological networks tend to be low-diameter, “small-world” networks. Homophily is a guiding organizational principle in most types of biological network data [10], meaning that genes typically associate with other genes that are involved with similar functional processes, pathways, and diseases. Some of the issues that researchers study in social and biological networks are directly analogous, and variants of diffusion processes—also called network propagation [6]—can immediately uncover important biological relationships and insights. However, several new problems require the development of novel domain-specific mathematical techniques in order to take full advantage of unique aspects of the biological network domain.

Biological Networks Are Like Social Networks

The dense, low-diameter, small-world properties of both social and biological networks make it difficult for us to define a local neighborhood; notably, an ordinary shortest path distance is not sufficiently informative beyond direct neighbors. When we look to neighbors’ neighbors (with a shortest path distance of two), we see an explosion in neighborhood size that can quickly reach a large fraction of the network in question for many types of association networks. We can resolve these ties in proximity to a more expressive and fine-grained measure of similarity via network propagation, which involves running short or lazy random walks from a gene of interest and observing the frequency at which different nodes are reached. A comprehensive survey of the multiple variants of network propagation that the computational biology community used to analyze PPI networks prior to 2019 is available in the literature [6].

Next we review one specific variant: diffusion state distance (DSD) [3, 5]. DSD is distinguished from other related measures in that it is a true distance metric that satisfies triangle inequality [3]. In particular, it is a graph embedding method that associates each node or gene with a vector. We define \(H^k(u,v)\) as the expected number of times that the \(k\)-step symmetric random walk (starting with \(k=0\)) visits node \(v\), and we classify the diffusion state vector that is associated with node \(u\) as the vector of the \(H^k(u, v_i)\) values for all the nodes \(v_1,...v_n\) in the graph. The DSD between nodes \(a\) and \(b\) is then

\[\textrm{DSD}^k(a,b)=|H^k(a)-H^k(b)|_1,\]

where \(| \cdot |_1\) represents the \(L_1\) norm of the vectors. DSD satisfies triangle inequality [3]; these differences go to zero as the walk mixes for larger \(k\), meaning that this measure converges as \(k \rightarrow \infty\).

Figure 1. Experimental interactions between human genes in the neighborhood of PARK2 and PINK1—both of which are implicated in Parkinson’s disease—as shown by the STRING network [12]. This figure was generated from https://string-db.org with seed genes PARK2 (in red) and PINK1 (in light pink).

DSD also solves the ties in proximity problem. For a variety of different settings in the PPI network, we can show that the \(t\) closest DSD neighbors define coherent functional neighborhoods in biological networks. This realization leads to improved performance on classical inference problems like functional label prediction [5]. For instance, a “double-spectral” DSD approach followed by spectral clustering yielded the best performance in the 2016 Disease Module Identification DREAM Challenge [4], suggesting the presence of new genes that are involved in illnesses such as type 2 diabetes and Crohn’s disease.

DSD is just one of numerous network propagation methods, and we could instead substitute alternative diffusion-based graph embedding techniques that have been previously proposed in the context of social networks. All of these diffusion-based method variants provide increased power and insight for the core classical problems that arise in biological networks, including functional label prediction, link prediction (i.e., predicting new PPIs that have not yet been experimentally observed), and disease module identification, which is directly analogous to the community detection problem in social networks (see Figure 1). The best such method is domain- and problem-specific. Customizing the appropriate embedding for the domain at hand then becomes an important and compelling challenge.

Biological Networks Differ From Social Networks

Although we can directly map many important problems in biological networks to well-studied social network analogs, three categories of domain-specific differences are worth highlighting: (i) The organizational principles of the networks themselves can be different, (ii) the nature and availability of ground-truth data to test methods may differ, and (iii) we can leverage the power of evolution in biology. Here we provide an example of each of these aspects.

The organizational principles of the networks themselves can be different. It is well known that social networks obey the principle of “triadic closure” so that triangles are likely; i.e., a friend of many of my friends is also likely to be my friend [9]. When we perform link prediction on a social network, some of the highest confidence predicted links tend to occur between nodes with a lot of common neighbors. However, many protein-protein interactions are of a “lock and key” type. Specifically, if a set of keys exists for which a large overlapping subset opens two locks, we should predict that the rest of the keys that open lock 1 will also open lock 2, not that the two locks directly interact. This concept forms the basis of the recently proposed “length three” measure [8]; incorporating this insight into a more complicated embedding measure improves PPI link prediction methods [7].

The nature and availability of ground-truth data to test methods may differ. As remarked previously, the disease-gene module detection problem for biological networks is directly analogous to the so-called unsupervised community detection problem for social networks. However, method performance for community detection in social networks is usually measured in one of two settings:

A setting in which ground-truth communities are given (either from real data sets or via synthetic data wherein a clique or dense subgraph is planted [2]), and we measure how well the method recovers these communities.
A setting in which we assume that no ground-truth communities are known and instead measure performance based on the clusters’ mathematical coherence, according to standards such as modularity or conductance.

Neither setting perfectly captures the typical situation in computational biology domains, where ground truth availability is rarely all or nothing. Rather, we are most often in an in-between state with partial ground truth—not uniformly sampled—from which we can nonetheless still derive principled performance tests based on how well our method aligns with the available data. For example, the designers of the 2016 DREAM Challenge conceived a clever way to test the predicted communities’ alignment with known biological disease genes. They used curated sets of genome-wide association study sequencing data as an independent empirical source to produce sets of genes that were associated with human disease, then awarded points for each community in which more genes for a certain disease occurred together than otherwise expected [4].

We can leverage the power of evolution. Many experiments that attempt to understand human disease are performed on model organisms like yeast, fruit flies, or mice. Because some of this information is encoded in the form of a PPI network, a powerful idea involves the use of co-embedding to simultaneously embed both networks in a space where local neighborhoods indicate functional similarity, even across species. On the surface, this looks a lot like network alignment [11]: another well-studied problem in the context of social networks. But if the two species are as evolutionary distant as humans and mice, we cannot expect a one-to-one alignment. While we can match some human genes to unique mouse genes that evolved from a common ancestor, the mapping for other genes may be many to one or many to none if the gene in question was duplicated or lost in one or both of the species. We therefore examine methods that perform a bijective alignment of a subset of genes with unique matches, then utilize distance within each network to complete the embedding. We unlock improvements in function prediction for human genes as well as for less-studied species of interest with minimal experimental data [1].

Outlook

The rich and important set of problems within the space of biological networks yields a fruitful path to important collaborations between experts in computational biology and network science. As we customize and develop a toolbox for the mathematical analysis of biological networks, critical biological and biomedical applications will continue to emerge.

This article is based on Lenore Cowen’s invited presentation at the 2021 SIAM Conference on Applied and Computational Discrete Algorithms, which took place virtually last year in conjunction with the 2021 SIAM Annual Meeting.

References
[1] Arsenescu, V., Devkota, K., Erden, M., Shpilker, P., Werenski, M., & Cowen, L.J. (2022). MUNDO: Protein function prediction embedded in a multispecies world. Bioinform. Adv., 2(1), vbab025.
[2] Bickel, P.J., & Chen, A. (2009). A nonparametric view of network models and Newman-Girvan and other modularities. Proc. Nat. Acad. Sci., 106(50), 21068-21073.
[3] Cao, M., Zhang, H., Park, J., Daniels, N.M., Crovella, M.E., Cowen, L.J., & Hescott, B. (2013). Going the distance for protein function prediction: A new distance metric for protein interaction networks. PloS One, 8(10), e76339.
[4] Choobdar, S., Ahsen, M.E., Crawford, J., Tomasoni, M., Fang, T., Lamparter, D., … Marbach, D. (2019). Assessment of network module identification across complex diseases. Nat. Methods, 16(9), 843-852.
[5] Cowen, L., Devkota, K., Hu, X., Murphy, J.M., & Wu, K. (2021). Diffusion state distances: Multitemporal analysis, fast algorithms, and applications to biological networks. SIAM J. Math. Data Sci., 3(1), 142-170.
[6] Cowen, L., Ideker, T., Raphael, B.J., & Sharan, R. (2017). Network propagation: A universal amplifier of genetic associations. Nat. Rev. Genet., 18(9), 551-562.
[7] Devkota, K., Murphy, J.M., & Cowen, L.J. (2020). GLIDE: Combining local methods and diffusion state embeddings to predict missing interactions in biological networks. Bioinform., 36(Supplement 1), i464-i473.
[8] Kovács, I.A., Luck, K., Spirohn, K., Wang, Y., Pollis, C., Schlabach, S., … Barabási, A.-L. (2019). Network-based prediction of protein interactions. Nat. Comm., 10(1), 1240.
[9] Liben-Nowell, D., & Kleinberg, J. (2007). The link-prediction problem for social networks. J. Assoc. Inform. Sci. Tech., 58(7), 1019-1031.
[10] McPherson, M., Smith-Lovin, L., & Cook, J.M. (2001). Birds of a feather: Homophily in social networks. Ann. Rev. Sociol., 27(1), 415-444.
[11] Singh, R., Xu, J., & Berger, B. (2008). Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc. Nat. Acad. Sci., 105(35), 12763-12768.
[12] Szklarczyk, D., Gable, A.L., Nastou, K.C., Lyon, D., Kirsch, R., Pyysalo, S., ... von Mering, C. (2021). The STRING database in 2021: Customizable protein-protein networks and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res., 49(D1), D605-D612.

Lenore J. Cowen is a professor of computer science at Tufts University, with joint appointments in the Department of Mathematics and the Genetics Program. She currently serves as principal investigator and director of the Tufts T-TRIPODS Institute for foundational data science.