Book Review: Phylogeny: Discrete and Random Processes in Evolution

By Mathias Cronjager, David Emms, Luca Ferretti, and Jotun Hein

The authors offer an in-depth look into Mike Steel’s SIAM Book. The review is published in three parts. Read Part I here. Part III of this book review will follow soon. A more detailed review of this book can be found here.

Chapter 5: Phylogenies Based on (Discrete) Characters

We are given characters (i.e. traits) on selected nodes—typically on the leaves of a tree—and the first section discusses when you can have a perfect phylogeny for these characters. A perfect phylogeny is a tree explaining the character states for the observed taxa allowing each trait to have evolved only once and no reversals to have occurred, i.e. there are r-1 mutations for a r-state character. This is easy for full binary characters, but much harder for multi-state characters or partial characters. For multi-state characters, three alternative avenues are given: i) reduce the characters to quartets and find their span (trees that can display them), ii) construct their intersection graph and see if it has a ”restricted chordal completion”, or iii) create all binary splits and check if they are compatible. Approach ii) implies a restriction on the tree width of the intersection graph.

Four interesting extensions of the basic problem of having full binary characters are highlighted: i) Persistent Perfect Phylogeny (PPP) - where a single reversal is allowed in a phylogeny. ii) Partial binary characters - the character states for some leaves are unknown for some characters. iii) Perfect Haplotyping Problem (PPH) - an extension of PP to diploid individuals. iv) Incomplete Directed Perfect Phylogeny (IDPP) – a rooted tree must be found so that each particular state can evolve multiple times but no reversals are allowed, and the state is not known for all leaves.

There is a nice result presented showing an extremely clever way to reconstruct ANY tree from only four characters (each of which may however have a large number of states)! So why on earth do we sequence all these genomes? The answer is simple the moment you have seen the trick behind the surprising result. I find this section very important since many biologists have good intuition about binary characters, but not more general characters such as for four nucleotides! They should read this section.

Some simple combinatorial questions are addressed next: how many binary trees can explain a given character perfectly? And the reverse: given a tree how many characters can you put on the leaves without forcing homoplasy?

Parsimony is discussed and for binary characters and connections with Menger’s theorem are drawn—namely how many characters will give a particular parsimony score (i.e. the number of edges in the tree for which the character must mutate). Then some fun connections are made with edit operations on trees which are not so surprising when you have been told them.

A very interesting section follows on ancestral state reconstruction. Parsimony assignment to the root is a kind of voting system with special cases such as the star tree (majority rule), balanced tree and caterpillar. Again there are some interesting results here. Majority rule is trivial. In a caterpillar tree the two outliers can determine the state at the root. In balanced trees an extreme unfairness in outcome can also be obtained. Further interesting results are presented in Chapter 9. Then finding maximum parsimony trees on multiple characters instead of single characters is discussed. And finally a very brief section on super-trees and short encodings.

Chapter 6: Continuous Phylogenies and Distance-based Tree Reconstruction

The important concept of distances between the objects on the leaves is considered. A central question is if it is possible to reconstruct a tree with edge lengths assigned to all edges, given only the distances between leaves. Key to this is the four-point condition: a metric has a tree representation if and only if it satisfies the four-point condition. On rooted trees the corresponding object is the ultrametric. The Gromov-Farris transform takes a tree metric and gives an ultrametric for any arbitrary rooting of the tree. Symbolic ultrametrics are also discussed although it is not clear what they are for. An obvious but important observation is made: If there is a perfect tree, the distance between strings of characters will be a tree metric, but the reverse might not be true. This is obvious since distance is a major reduction in data so you can clearly arrange characters that conflict, but whose effect cancels out in the distance function.

There follows a slightly discursive section on distances on genomes that should be extended to a fuller description. Neighbour Joining (NJ) and Balanced Minimum Evolution (BME) are discussed as two distance-based tree-reconstruction methods. The generalisation to reconstruction from partial distances is also covered.

There is a short section on indexed pyramids and Kalmanson metrics; a very strange generalisation of the 4-point condition and a way of representing such a distance function using a weighted sum of splits. The section on the geometry of the space of tree-metrics was most exciting and way too short. Here trees are embedded in the positive quadrant of a high dimensional Euclidian space and a series of interesting questions are addressed such as the connectedness of the space and the distance between two trees. Mike Steel should put in some sentences in like “[...]T-X is a compact connected simplicical complex of dimension n-4” to give the book some appeal to the average rubber boot owning botanist.

Next follows an excellent review of phylogenetic diversity. Everyone interested in biodiversity should read this chapter. The first measure of phylogenetic diversity (PD) given is the total branch length of the tree relating the species set. Assume you only are allowed to keep k of the original n species. How do we find the subset of size k with largest PD? A simple greedy algorithms is presented. Then some alternative questions are presented: i) ΔPD(Y) – find Y where the PD-loss is the largest if you extinguish Y. ii) Maximum Minimum Distance (MMD) – find the set so that the longest shortest edge from the lost species to the surviving ones is minimised. Iii) A weighted version of PD iv) PD based on splits instead of trees. Optimisation of PD and diversity indices for rooted trees are considered.

Some differences between the rooted and unrooted case are discussed and it is proven that PD is unimodular – the sum of PD on two sets is larger than the sum of PD on their union and their intersection. An interesting problem discussed is when you have some nature reservations that cost something to maintain and you have a budget. Which reservations should you keep to maximise PD while staying within budget? Then there is a subsection on decomposing the PD for a set to contributions from single species. Fair proportion (FP) and Shapley value are discussed and it is then proven that the expected Shapley value is identical to FP for the complete tree.

The two extension of PD (PD over Abelian Groups & Abstract Diversity Theory) should have either been extended or turned into footnotes since it is hard to see their relevance. Section 9.2.3 is interesting. We have extinctions that kill off species randomly and the effect on PD is considered. This leads to two special problems: Heightened evolutionary distinctiveness and the Noah’s Ark Problem. In the last section extinctions are applied to trees generated by the Yule process or the Birth-Death process.

Chapter 7: Evolution on a Tree [I]

In Chapter 7 processes on a phylogeny are investigated. The chapter discusses different models for the evolution of character states along a phylogeny, how the models are related, how they can be combined and under what conditions the phylogeny can be recovered from the character distributions (i.e. from sufficiently many sampled characters). A good review of Markov chains is given and then the theory is extended to Markov processes on a tree. It starts out very general, with non-homogenous Markov chains on trees where each branch has its own transition probability. In some sense this still seems overly-restrictive since the internal nodes are arbitrary relative to the process so why let them define different evolutionary regimes? If the determinant of the transition matrix is non-zero then a distance measure can be defined – logDet – that obeys the four point condition that we know characterises the tree, so a tree is identifiable from its leaf distributions.

There are a number of natural questions about irreversible rate matrices (Q) that are not discussed. Reversible Q are easy to characterise: give each state an equilibrium rate and each edge a rate of flow. Can one have an intuitive decomposition of the non-reversibility? There is a very interesting discussion of commuting and non-commuting rate matrices. Then we get into the basic models starting with the equal input model that for nucleotides was introduced by Felsenstein in 1981 but similar models had been used for proteins by Margeret Dayhoff around 1970. Their purpose was to get the equilibrium distribution right. Nick Goldman and Bjarne Knudsen devised equal output models and you could make a weighted sum of equal input and equal output matrices to created intermediates, which might have been worth mentioning.

There follow some very short sections. First a theorem saying that in the Jukes-Cantor model on r states the Maximum Likelihood (ML) lengths are extreme: paths between different states are maximal [infinite] and between identical states minimal [zero]. This is interesting but is for a single state; for multiple states the ML estimates hopefully are more reasonable.

This is followed by a section on G-equivariant and group based models. G-equivariant is a new concept, there the model doesn’t change under group-permutation of the state space. The most important new models in this class are models that do not change if you swap strand. Group based means the state space is a group and the Kimura 3-parameter model is the most famous. There is a very nice overview figure relating different models. A fun observation is that the most general model here is ”General Time Reversible with variable Q” that is not time reversible if the Qs don’t commute. The section on Phylogenetic Mixture Models is very important and it is well described. Here the identifiability of the underlying tree can be a problem in contrast with models where all positions evolve by the same process. The chapter ends with the Hadamard story and Felsenstein Zone which were covered a bit too lightly given how frequently they are discussed.

Chapter 8: Evolution on a Tree [II]

The chapter continues the theory and application of Markov processes on trees. Statistical methods for estimating phylogenies and ancestral character states are discussed. This leads to questions about how much data is required to reconstruct a tree.

Some preliminaries on metrics on distributions are covered then a list of attractive properties of Maximum Likelihood (ML) estimation in general and a set of identifiability definitions and results. Now comes some fun stuff about convergence of logDet, ML and Maximum Parsimony (MP). LogDet can have large variance for finite data and MP will converge with great speed towards the right result (faster than ML) provided we have parameters where it converges towards the right result! Then there are some important sections on information-theoretic results on how much data is needed to infer the correct placement of a long branch and on the reconstruction of an ancestral state. Some of these results are very surprising at first, for example, the amount of data needed to infer a tree grows as the logarithm of the number of taxa.

The section on so-called “Phylogenetic Oranges” describes phylogenies as Euclidian flakes embedded in higher dimensional space that are glued together in even lower dimensional flakes (corresponding to multifurcating phylogenies) and these have certain exact mathematical properties. A fun bit of geometry, but its implications eluded us. Similarly, the phylogenetic invariants (Algebraic Analysis) are charming but the intuition behind them is hard to see.

The Infinite Cluster Random Model is interesting and an extension of what has been discussed before under the equal input model. Going to infinite states gives you the Crow and Kimura 1964 Infinite Allele model. There are two extensions of this: The Ladder Model (investigated by Kingman among others) used to describe protein electrophoresis data and infinite site model by Kimura and Otha in 1971 used for DNA data. These clearly needed to be incorporated into Steel’s algebraic invariant dual homotopic centralising framework!

Watch out for part III of the book review to be posted soon.

Mathias Cronjager is a graduate student in the Genome Analysis group in the statistics department at Oxford University, UK. David Emms is a postdoctoral researcher in the department of plant sciences at Oxford University, UK. Luca Ferretti is a research scientist in the Integrative Biology group at the Pirbright Institute, UK. Jotun Hein is a professorial fellow and Holder of the Chair in Bioinformatics at Oxford University, UK. He heads the Genome Analysis group in the statistics department at Oxford.