| September 19, 2016

Book Review: Phylogeny: Discrete and Random Processes in Evolution

Part I

By Mathias Cronjager, David Emms, Luca Ferretti, and Jotun Hein

The authors offer an in-depth look into Mike Steel’s SIAM Book. The review is published in three parts. Read Part II & III of this book review. A more detailed review of this book can also be found here.

Mike Steel has contributed enormously to mathematical aspects of phylogenetics and is doing the field a major favour in summarising key results in the field. Phylogeneticists would do themselves a major favour by becoming more familiar with these results.

This is a high quality book from an author by one of the leaders in the field. It covers combinatorial properties of trees, different representations, their enumeration and metrics on tree space. The possible composition and decomposition of trees into subtrees and the compatibility of trees. Tree inference from characters and distances, stochastic models on trees and models for evolution or generation of trees are discussed. It is a mathematical book but one with clear applications. The presentation throughout is generally succinct and as such covers a large amount in its 350 pages including many results since Phylogenetics published with Charles Semple 12 years ago. As an introductory book it is not an easy read, there can often be a number of new concepts on a single page. However, plentiful figures provide an additional, and often easier, way to understand the material. There are few numerical examples or examples taken from real applications. Such examples could have provided stronger insight into the applications as well as the concepts themselves.

There is a lot of new material on stochastic models which underlie many current biological applications. There is a final chapter on networks, but these are not covered in depth. The ancestral recombination graph is only mentioned in passing which is a significant omission since it describes the relationship between a set of sequences sampled from a population. Important algorithmic results are highlighted although the issues are rarely discussed. This seems a sensible choice for maintaining the focus of the book. The discussion of tree inference methods presents a number of important results which are of significance to current disagreements in the literature over conflicting, seemingly well-supported phylogenies. Overall, it is a very readable introduction to the subject while giving an overview of many important recent results and current research topics. Some parts of the book could be hard reading for the average biologist, but many in the field have a strong background in statistics or algorithms, and for them it would be well worth the effort.

The book is logically structured and the preface contains advice on different paths through the book and each chapter has a few exercises embedded in the text. The presentation style is very different from the earlier Phylogenetics in that it gives more space to intuitive explanations and less to theorems followed by rigorous proofs, but the level of mathematics is about the same.

Chapter 1: Phylogeny

Chapter 1 covers the prerequisites: graphs, unrooted and rooted trees as well as some structures that will be put to use later on in the book—the intersection graph and chordal graphs. A phylogenetic tree is defined, this is the central object of interest for the book in contrast to the X-tree which was used in his earlier book, Phylogenetics. Rooted and unrooted trees are treated in parallel throughout the book as properties of one frequently have direct equivalents in the other.

Chapter 2: Basic Combinatorics of Discrete Phylogenies

Despite being only 30 pages, chapter 2 covers a lot of ground--a feat repeated frequently throughout the book! It begins with the enumeration of trees. We see Cayley’s formula for the number of trees with all nodes labelled. The formula, n^(n-2), is simple, but deceptively so. Bifurcating trees and spanning trees are then considered in a similar manner. The results are obtained in a maddeningly compact page using formal power series that should be expanded into at least four pages.

Steel then presents two key equivalences that are central to much of the book and the subject itself: a rooted tree can be represented as a hierarchy of sets on the leaf-set and an unrooted tree as a system of splits (bipartitions) on the leaf-set. He then explores a series of alternative characterisations of hierarchies, and their equivalence to rooted trees. In a hierarchy of sets, any two sets must satisfy: one of the sets is a subset of the other, or they are disjoint. An unrooted tree is shown to be encoded by a set of quartet trees and a rooted tree by a set of rooted triples.

The set representation of phylogenetic trees leads naturally on to the concept of the refinement of trees. Loosely speaking, a phylogenetic tree may not be fully-resolved and a tree, T, is a refinement of a second tree, T', if it is equivalent to T' but resolves more of the edges. A set of trees is compatible if there is a tree that is a refinement of all the trees in the set.

The Buneman Graph makes an appearance here and is revisited later in the book when multistate characters are discussed. The Buneman Graph is a graph which may be constructed from an arbitrary set of splits, which is a phylogenetic tree precisely when the splits are pairwise compatible. It will also reappear when Steel considers data that cannot be represented by a tree, but we are getting ahead of ourselves.

Three more ways to encode an unrooted tree are discussed (besides as an unrooted tree itself or as a system of splits) where the most fun was circular orderings, which is natural when you don’t have a root and can let the leaves hang down and be placed on a line. Next is a fun section on the size of neighborhoods and the diameter of tree-space with respect to rearrangement operations (NNI, SPR, TBR) on trees. One could imagine it would be real fun to sit and puzzle with and we discussed which other operations are imaginable. Then more on the metric space of trees with edges (edit operations). Finally, consensus functions are discussed. Important here are the strict consensus, the majority consensus and the Adams consensus. The Adams consensus is only defined for rooted trees and satisfies some desirable properties not satisfied by the other consensus methods. What is more, an impossibility result is given showing that no method is possible that satisfies a more relaxed, seemingly desirable property for a consensus function!

Chapter 3: Tree Shape and Random [Discrete] Phylogenies

Chapter 3 discusses the shape of trees generated according to a number of models. First we see how to enumerate the number of phylogenetic trees using the orbiter-stabiliser theorem to handle symmetries that do not change the tree such as the symmetry of flipping the children of an inner node. The Yule-Harding model for evolving trees and uniformly sampled trees are discussed. The ’big picture’ is presented (with a wonderful figure) showing how you can link and count trees with different kinds of labelling. Understanding the details of this was very rewarding. There are 6 classes of trees and there are 7 arrows between them, each showing a simple combinatorial conversion.

Measures of tree imbalance are defined (Colless and Sackin). Yule-Harding trees are more balanced than uniformly sampled trees and observed trees tend to be somewhere between the two. The Aldous beta-splitting model for generating trees is presented. The beta parameter varies the amount of imbalance and the Yule-Harding and uniform model trees come out as special cases of this model. An interesting question is what biologically realistic models could lead to Aldous beta-splitting trees since naive assumptions of independence will lead to Yule-Harding. There is a very nice generalised urn model with cherry, non-cherry and pendant edges that can be used to calculate the distribution of cherries in a Yule-Harding tree showing how using a Polya urn model formulation can make certain theorems easier. We all loved this section.

Chapter 4 Pulling Trees Apart and Putting Trees Together

Chapter 4 considers how trees can be analysed and described in terms of trees restricted to subsets of labels (taxa). Conversely, how can trees be combined into a ‘super-tree’, for example, in the construction of a ‘tree of life’. The chapter discusses tree restriction to a subset of labels and its converse, the display of a tree by a tree on a superset of labels. Compatibility and consensus trees are discussed, as is the question of when a set of trees on subsets uniquely define a binary tree. Also, when are the restriction of trees to a collection of subsets decisive in distinguishing between the two trees.

Results on when a set of trees are compatible (can all be viewed as restrictions of some common larger tree) are explored. The problem is typically harder for unrooted trees, where you need to go via a display graph that has all labelled and unlabelled nodes of the subtrees with edges such that the subtrees can all be found in the display graph. After triangulation of this graph there are simple criteria using the tree width to determine if they are compatible. The somewhat confusing concept of tree width was only scantily treated earlier in the book, so the implications of these results are hard to interpret without a better explanation.

Quartets are a special class of unrooted trees that have received a lot of attention. Besides whether a collection of trees is compatible, it is of interest to know if a set of compatible trees uniquely (in some sense) defines a single tree of which all trees in the collection are a restriction. A natural concept pertaining to collections of trees – excess – is defined as the difference between the number of internal edges in the full tree on all vertices and the number of edges of the subtrees in the collection. Ideally you would want each subtree to give you one internal edge in the tree on all leaves. It is interesting to see differences and similarities to the previous Semple & Steel book. Even topics present in both books are treated from a different perspective and it is definitely worth reading both books.

The chapter continues with two pages on (rooted & unrooted) maximal agreement subtrees [MAST] and discusses some conjectures on how small a MAST can be and what their expectation is for randomly-picked binary trees. This is followed by analogous questions for the Quartet Metric. There is a practically very relevant section on the situation if you have trees on subsets of a full set. Then random coverage is discussed – the subsets of species for which you have trees are chosen randomly – when can you infer the total tree? The chapter ends with another interesting situation where you have a set of subtrees or quartets and want to reconstruct more than one tree, which very well could be the case if you have horizontal transfer or recombination. As is often the case the chapter is very compact often with several new definitions per 10 lines.

Read Part II & III of this book review. A more detailed review of this book can also be found here.

Mathias Cronjager is a graduate student in the Genome Analysis group in the statistics department at Oxford University, UK. David Emms is a postdoctoral researcher in the department of plant sciences at Oxford University, UK. Luca Ferretti is a research scientist in the Integrative Biology group at the Pirbright Institute, UK. Jotun Hein is a professorial fellow and Holder of the Chair in Bioinformatics at Oxford University, UK. He heads the Genome Analysis group in the statistics department at Oxford.