| July 21, 2021

Detecting Genomic Variations and Epistatic Interactions

Genetic variation exists within every species, including humans. Understanding a person’s unique DNA provides a lot of information about their phenotype—the characteristics that are based on their genetics—and can potentially lead to personalized medicine.

During a minisymposium presentation at the SIAM Annual Meeting, which is taking place virtually this week, Mario Banuelos of California State University, Fresno described several mathematical frameworks for making predictions about genomic variations and epistatic interactions. “The goal is to leverage some a priori information and computational or machine learning frameworks,” Banuelos said. He presented techniques that were developed in collaboration with Erica Sawyer and Marissa Hernandez (California State University, Fresno), Suzanne Sindi and Roummel Marcia (University of California, Merced), and Omar DeGuchy (Lawrence Livermore National Laboratory).

Figure 1. Steps in finding the variation in a genome (black arrow at top of image). Researchers first split the genome into many pieces, on which they perform genetic sequencing (represented in the middle of the image). They then match the sequences to an established ground truth (yellow arrow at the bottom of the image) to search for any variations.

Banuelos was particularly interested in structural variations — the differences between an individual’s sequenced genome and a refence genome for their species. But before delving into techniques for measuring this, he recognized the ongoing problems in this field of research. “There is still a persistent bias in this field, especially in terms of what is considered a genetic variation and compared to who,” Banuelos said. A 2016 study found that 81 percent of the participants in genome-wide association studies were of European ancestry at that time, which was actually a decrease from 96 percent in 2009.

Many kinds of structural variations can occur in genomes, but Banuelos focused on two particular types: deletions, where genetic information is lost, and single nucleotide variants, in which just one of the small building blocks of DNA changes. Researchers find these kinds of variations in real DNA by splitting an individual’s genome up into many pieces and performing a genomic sequencing of those segments. By matching these sequences to an established ground truth, researchers can find evidence for variation (see Figure 1).

Most previous models of structural variations have used post-processed inheritance information, in which the genomes of both a parent and child are sequenced and compared to determine what variants are shared between the two and were thus passed down from parent to child. Models also typically depend on high-quality data that deeply covers the genetic sequence. But Banuelos does not have the storage capacity for high quality genomes, which can be on the order of hundreds of gigabytes. “Our goal is to use lower quality data and leverage relatedness to improve genomic variant prediction,” Banuelos said.

In lower quality data, there may only be a few pieces of information available to support the existence of a variant. Banuelos described three different techniques for leveraging this lower quality data: optimization, neural networks, and gradient boosting. In the optimization approach, one seeks to maximize the probability of what was found in the data given the coverage, which is a proxy for the data’s quality. There is also a sparsity aspect to this analysis, because variations are so rare.

Figure 2. The different classes in the neural network for finding genomic variations. 0 represents no structural variant, while 1 represents the presence of a structural variant.

The optimization model makes the simplifying assumption about relatedness that there is one parent and one child and investigates what variants, if any, the child inherits from the parent. Banuelos and his collaborators have also considered a number of possible extensions to this type of model, such as including two parents and one child or spanning multiple generations to include grandparents. While this kind of approach is advantageous in that the computations are fairly speedy and the linear constraints are easy to define, it is constrained by the fact that the relatedness must be explicitly defined — one must know what the parent-child relationship is ahead of time. It is also nontrivial to add a general relatedness parameter.

Banuelos next described a simple neural network for detecting genomic variations, with the goal of modeling the nonlinear relationships that govern structural variants in genomes. As is often the case, it was necessary to simplify the real-world phenomenon in order to frame the biological problem as a mathematical one. The neural network uses three inputs—two parents one child—and involves two hidden layers. It trains based off of labeled classes that represent whether the child and one or both parents has a particular genomic variant (see Figure 2).

The third approach for finding structural variants in genomes that Banuelos described was gradient boosting. He provided the example shown in Figure 3, in which the goal is to separate the blue circles from the red squares. The first step is to introduce a weak classifier in the form of a straight line that separates most of the blue circles from most of the red squares. Then, increase the weight of anything that was mislabeled — for example, if a red square was in an area that was classified as being the blue circle region, that red square’s weight would increase. Based on these changes, make another straight line for a weak classifier, and repeat the process. After doing this several more times, use a combination of all of the weak classifiers to say where all of the red circles and blue squares are located.

Figure 3. An example to illustrate the gradient boosting technique. Here, the goal is to separate the blue circles from the red squares.

Of course, it is much more difficult to use this technique in a real-world scenario to separate structural variants from non-structural variants, especially because the variants are rare. But the effort is worth it — since there is much more information given in this approach than in the optimization model or the neural network, this technique performs better with a higher true positivity rate.

In recent months, Banuelos and his group have begun to expand their research to investigate epistasis — the effect of the interaction of genetic mutations, often single nucleotide variations. “Say we have these single-level changes,” Banuelos said. “How do they interact to affect the phenotype? It's usually an interaction of many mutations.” The goal is thus to detect higher-order interactions to find out which variants are statistically significant when they are together.

The researchers adapted mathematical techniques developed in the 2006 Netflix recommendation contest, fitting single nucleotide variations and genome-wide association studies data into that approach as users and movies. This technique involves using the recommender system along with collaborative filtering to identify groups of mutations. Banuelos and his collaborators were able to apply this technique to genetic data from mice and extract some groupings that were statistically significant.

In the future, the research group may try to expand these studies to predict relationships, look at larger pools of related individuals, and use different modeling architectures. They have already demonstrated that many different mathematical techniques are able shed light on the variations and interactions of genes. “There are a variety of optimization and machine learning tools to predict locations of structural variants in related individuals, as well as quantify certain coalitions of interacting mutations,” Banuelos said.

Jillian Kunze is the associate editor of SIAM News.