SIAM News Blog

Book Review: Phylogeny: Discrete and Random Processes in Evolution

Part III

By Mathias Cronjager, David Emms, Luca Ferretti, and Jotun Hein

The authors offer an in-depth look at Mike Steel’s SIAM book. The review is published in three parts. Read Part I here and Part II here. A more detailed review of this book can be found here.

Chapter 9 Evolution of Trees

The chapter returns to the topic of the evolution of trees themselves and their relation to the underlying evolutionary processes ("phylodynamics"). A section on the pure birth process, the simplest tree model, conveys much intuition and presents some interesting paradoxes. The number of descendants of a lineage after time t is geometrically distributed with parameter exp(-lambda*t), and hence has mean exp(lambda*t), where lambda is the speciation rate. This result could be explained by a simple argument, but the author instead derives it later in the chapter from a more general point process construction. The subtle issues of conditioning on the elapsed time or on the final number of species are then discussed. Again some interesting paradoxes appear, arising from the properties of the simple Poisson process. For example, the lengths of a random internal edge and of a random external edge have the same distribution, which is surprising since the latter will continue to grow in the future before becoming the former. The reconstruction of ancestral states is discussed for models with only two states and fixed mutation rates. (The extension to a large number of states is quite relevant for applications but is unfortunately not discussed.) Whether any ancestral-state reconstruction method can be correct for large trees depends on a new parameter, namely the ratio of the speciation rate to the substitution rate. In general, if the speciation rate is less than four times the mutation rate, then random guessing is as good as any method. Surprisingly, above this threshold Majority Rule is able to recover the ancestral state, while Maximum Parsimony loses information about the ancestral state faster than Majority Rule, and even above its own threshold of 6 it performs worse.
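The geometric law for the pure birth process can be checked numerically. The following is a minimal Python sketch, not taken from the book; the function names and the values lambda = 1 and t = 1 are illustrative assumptions. It simulates the number of lineages descended from a single lineage after time t, which should follow a geometric distribution on 1, 2, 3, ... with success probability exp(-lambda*t) and mean exp(lambda*t).

```python
import math
import random


def yule_descendants(lam, t, rng):
    """Number of lineages at time t in a pure birth process started from one lineage."""
    n = 1
    elapsed = 0.0
    while True:
        # With n lineages present, the wait until the next speciation is Exp(n * lam).
        elapsed += rng.expovariate(n * lam)
        if elapsed > t:
            return n
        n += 1


def geometric_pmf(k, p):
    """P(N = k) for a geometric distribution on k = 1, 2, 3, ... with success probability p."""
    return p * (1.0 - p) ** (k - 1)


if __name__ == "__main__":
    lam, t = 1.0, 1.0            # assumed speciation rate and elapsed time
    rng = random.Random(1)       # fixed seed so the run is repeatable
    samples = [yule_descendants(lam, t, rng) for _ in range(20000)]
    p = math.exp(-lam * t)
    print("sample mean:", sum(samples) / len(samples), "expected:", 1.0 / p)
    for k in range(1, 5):
        observed = samples.count(k) / len(samples)
        print(k, round(observed, 3), round(geometric_pmf(k, p), 3))
```

The observed frequencies should track the geometric probabilities up to sampling noise; the agreement is a numerical check, not a proof.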

The more interesting case of birth-death processes is discussed next. These models are widely applied, since most real histories include both speciation and extinction events. This leads to the interesting addition of the “reconstructed” tree or process, where only the lineages that have living descendants at time t are considered. Important effects deriving from this conditioning, like the "pull of the present" and the "push of the past", are discussed in a few pages. More discussion would have been welcome, since this is very relevant for phylodynamics and the behaviour of these processes is not always intuitive. The next section introduces the Coalescent Point Process. This is an important process, since it provides a simple characterisation of birth-death processes conditioned on time and extant lineages, allowing even for time and age dependence of the parameters. The construction is clear, but there are finer connections that eluded us. The classical Kingman coalescent is discussed only briefly, since it can be found in great detail in any book on population genetics. Finally, there is an interesting and important discussion of the loss of Phylogenetic Diversity and of the difference between the Kingman coalescent and birth-death models with respect to branch lengths.
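The conditioning behind the reconstructed tree can be made concrete with a small simulation. The sketch below is an illustration under assumed rates, not the book's construction. It estimates the probability that a single lineage in a constant-rate birth-death process still has living descendants at time t, and compares it with the classical closed form (lambda - mu) / (lambda - mu*exp(-(lambda - mu)*t)), valid when lambda differs from mu. The names survives and survival_probability are hypothetical.

```python
import math
import random


def survives(lam, mu, t, rng):
    """Simulate one lineage; return True if it has living descendants at time t."""
    n = 1
    elapsed = 0.0
    while n > 0:
        # Each of the n lineages speciates at rate lam and goes extinct at rate mu.
        elapsed += rng.expovariate(n * (lam + mu))
        if elapsed > t:
            return True
        if rng.random() < lam / (lam + mu):
            n += 1   # speciation
        else:
            n -= 1   # extinction
    return False


def survival_probability(lam, mu, t):
    """Closed-form survival probability for a linear birth-death process (lam != mu)."""
    r = lam - mu
    return r / (lam - mu * math.exp(-r * t))


if __name__ == "__main__":
    lam, mu, t = 1.0, 0.5, 2.0   # assumed rates and time horizon
    rng = random.Random(7)
    runs = 20000
    estimate = sum(survives(lam, mu, t, rng) for _ in range(runs)) / runs
    print("simulated:", round(estimate, 3))
    print("closed form:", round(survival_probability(lam, mu, t), 3))
```

Only the surviving runs would contribute lineages to a reconstructed tree, which is why statistics computed on reconstructed trees differ from those of the unconditioned process.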

The rest of the chapter is devoted to the relationship between gene trees and species trees. This includes some very interesting results on the Anomalous Gene Trees of Rosenberg and Degnan, where the most likely gene tree disagrees with the topology of the species tree. This is a challenging result reminiscent of the Felsenstein Zone. As in other places in the book, it seems that most of the combinatorial nitty-gritty has been left out. The Degnan-Rosenberg anomalies raise the question of whether methods can be developed that recover the right species tree. The author discusses these issues briefly, as well as other issues relevant to recent research, like concatenation and Maximum Likelihood on gene trees. Coalescent models that embed gene trees within species trees are then discussed. Lateral Gene Transfer is also considered, where the coalescent event is replaced by a jump between species followed by coalescence. These problems are extremely important, since one can observe genes but wants to make statements about the species tree, which is extremely hard to observe directly. Note that most of the chapter focuses on Incomplete Lineage Sorting, which dominates recent research but is not the only source of discordance between gene trees and species trees.

Chapter 10 Introduction to Phylogenetic Networks 

The book ends with a chapter on phylogenetic networks. These structures accommodate several deviations from tree-like inheritance and play an important role in prokaryotes, due to the high rates of Lateral Gene Transfer, but they also appear in other contexts (e.g. hybridisation in eukaryotes). The topic is clearly worth studying. In the literature on networks there is a trend towards a high ratio of concepts to real biological use; this trend is also apparent in the large number of definitions presented in this chapter, even if the author makes clear that they are needed because of the increased complexity of phylogenetic networks compared to trees. In this respect, the chapter is a very useful guide to this complexity. Most of the results apply to binary trees or networks.

Implicit (unrooted) networks are discussed first: networks with a single cycle (unicyclic networks), galled networks (which can have several cycles, but in which no node belongs to more than one cycle), split networks, including the widely used Neighbor-net method, and median networks, which are built from sequences and are related to Maximum Parsimony.

The rest of the chapter is devoted to the more interesting case of explicit (directed) networks, which represent evolutionary histories more closely. The important difference between tree vertices (with a single ancestor) and reticulation vertices (with multiple ancestors) is discussed. For mathematical purposes, it is useful to consider subclasses of networks in which the complexity of reticulation is bounded. This can be achieved in the different ways described here: level-k networks, tree-child networks, tree-sibling networks and reticulation-visible networks. For example, in tree-child networks, reticulation is limited by the requirement that every internal vertex has a child that is a tree vertex. Temporal networks can be defined by extra conditions on the order of splits and reticulations, which however make sense only if all species survived and were sampled. More classes of networks with nice mathematical properties are presented (networks without redundant arcs, normal networks, regular networks). The chapter discusses the relations between these classes, as well as the larger class of tree-based networks (networks obtained by adding links to a tree), whose characterization is less intuitive than expected.
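The tree-child condition is easy to test on a small example. The sketch below assumes a network stored as a dictionary mapping each vertex to its list of children (a representation chosen here purely for illustration), treats any vertex with at most one parent, leaves included, as a tree vertex, and checks that every internal vertex has at least one such child. The vertex names are hypothetical.

```python
from collections import Counter


def is_tree_child(children):
    """Return True if every internal vertex has a child with at most one parent."""
    # In-degree of each vertex: how many arcs point to it.
    indegree = Counter(v for kids in children.values() for v in kids)
    for vertex, kids in children.items():
        if not kids:
            continue  # leaves impose no condition
        # A child with in-degree <= 1 is a tree vertex (or a leaf).
        if not any(indegree[v] <= 1 for v in kids):
            return False
    return True


# One reticulation h with parents a and b: a and b each keep a leaf child.
one_reticulation = {"r": ["a", "b"], "a": ["x", "h"], "b": ["h", "y"], "h": ["z"]}

# Two reticulations h1 and h2 sharing both parents: a and b have no tree-vertex child.
two_reticulations = {
    "r": ["a", "b"],
    "a": ["h1", "h2"],
    "b": ["h1", "h2"],
    "h1": ["x"],
    "h2": ["y"],
}

if __name__ == "__main__":
    print(is_tree_child(one_reticulation))
    print(is_tree_child(two_reticulations))
```

The first network satisfies the condition and the second does not, because both children of a (and of b) are reticulation vertices.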

Finally, the relation between trees and networks is unveiled. How many trees can a network display once reticulations are removed? Can a specific tree be displayed? Conversely, is it possible to reconstruct a network from the trees it displays, or from subnetworks, distances or characters? Given a set of trees, which network displaying them has the fewest reticulations? These questions find their answers here.
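For the first of these questions, a brute-force sketch gives some intuition. Assuming a network stored as a dictionary from each vertex to its list of children (a hypothetical representation, for illustration only), the code below keeps exactly one incoming arc at each reticulation, records the resulting tree as its set of leaf clusters, and collects the distinct trees. In a binary network with r reticulations there are 2^r such choices, so at most 2^r trees are displayed, and the count can be smaller.

```python
from itertools import product


def leaf_clusters(children, root, kept_parent):
    """Leaf clusters of the tree obtained by keeping one parent arc per reticulation."""
    clusters = set()

    def visit(u):
        kids = children.get(u, [])
        if not kids:
            below = frozenset([u])  # a labelled leaf of the network
        else:
            # Follow the arc u -> v unless v is a reticulation that kept another parent.
            live = [v for v in kids if kept_parent.get(v, u) == u]
            below = frozenset().union(*(visit(v) for v in live))
        if below:
            clusters.add(below)  # dead ends (no leaves below) are ignored
        return below

    visit(root)
    return frozenset(clusters)


def displayed_trees(children, root):
    """Set of distinct trees (as cluster sets) over all choices of reticulation arcs."""
    parents = {}
    for u, kids in children.items():
        for v in kids:
            parents.setdefault(v, []).append(u)
    reticulations = [v for v, ps in parents.items() if len(ps) > 1]
    trees = set()
    # One choice of parent per reticulation; the product has 2^r entries if all have 2 parents.
    for choice in product(*(parents[v] for v in reticulations)):
        trees.add(leaf_clusters(children, root, dict(zip(reticulations, choice))))
    return trees


if __name__ == "__main__":
    network = {"r": ["a", "b"], "a": ["x", "h"], "b": ["h", "y"], "h": ["z"]}
    for tree in displayed_trees(network, "r"):
        print(sorted(sorted(cluster) for cluster in tree))
```

Representing a tree by its clusters means that vertices of out-degree one and leafless dead ends vanish automatically, which is what lets different arc choices be recognised as the same displayed tree. Enumeration is exponential in r, so this is only practical for toy examples.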

There are many good things in this chapter, but it still feels a pity to ignore the Ancestral Recombination Graph (ARG), which is the structure that describes how a set of genomes sampled from a population, or a set of viruses, are related. However, much of what is explained is extremely close to the ARG, which is already discussed in other books.

Comparison to other books

Although phylogenetics is an older field, for the last 50 years it has been strongly tied to sequence data and the field of molecular evolution. Prior to this era, Willi Hennig’s Basic Outline of a Theory of Phylogenetic Systematics (Hennig, 1950) was the major work, and it was motivating in its attempt to formulate rigorous principles for phylogenetic inference, despite containing no statistics or algorithmics. It was expanded in 1966 into Phylogenetic Systematics (Hennig, 1966). Sokal and Sneath’s Principles of Numerical Taxonomy (Sokal and Sneath, 1963) was both more exact and more statistical. Sheila Embleton’s Statistics in Historical Linguistics (Embleton, 1986) was clearly focused on language and distance methods but could also be applied to sequence data. The first textbook fully focused on molecular evolution and phylogenetics was the undergraduate text by Li and Graur, Fundamentals of Molecular Evolution (Li and Graur, 1991), which was later expanded into W-H Li’s Molecular Evolution (Li, 1997). In 2004 Felsenstein published his large Inferring Phylogenies (Felsenstein, 2004), which is highly readable and also covers the history of phylogenetics. The year before, Semple and Steel had published Phylogenetics (Semple and Steel, 2003), which is a very appealing read for mathematicians, statisticians and computer scientists. In 2006 Ziheng Yang published Computational Molecular Evolution (Yang, 2006), which in 2014 was expanded into Molecular Evolution: A Statistical Approach (Yang, 2014).

Besides these, there is a series of books with a very hands-on approach, instructing readers on how to navigate existing programs, such as Molecular Systematics (Hillis et al., 1996) by David M. Hillis, Craig Moritz and Barbara K. Mable, and The Phylogenetic Handbook: A Practical Approach to Phylogenetic Analysis and Hypothesis Testing (Salemi et al., 2009) by Marco Salemi, Anne-Mieke Vandamme and Philippe Lemey.

How should these books be prioritized by a researcher who wants to get into phylogenetics? Well, it depends on the interests and background of the researcher. If the researcher is a mathematician, statistician or computer scientist, Felsenstein (2004), Yang (2014), Steel (2016) and possibly Semple & Steel (2003) would provide a sound basis. If the researcher is a biologist, Felsenstein (2004), Yang (2014) and Lemey (Salemi et al., 2009) would be good choices. Felsenstein and Yang are on both lists, since Felsenstein provides an excellent background and Yang is closer to data analysis, which after all is the motivation for phylogenetics. The reason for leaving Steel’s books off the biologist’s list is that they are simply too mathematical. However, it would be useful if the insights from Steel’s books diffused as much as possible into the biological community.

Summary

Mike Steel covers most of what one could consider relevant, but there are some important omissions. Some of these topics could have been included by adding 40-60 pages, while others could not have been included without extensively altering the scope of the book. In the former category we find the bootstrap, statistical alignment, recombination, and phylogenetic regression. In the latter category we find the evolution of complex characters (structures, networks, shapes, phenotypes, ...), selection, annotation and MCMC.

Does the book fail on some accounts? Or does it have biases due to Mike Steel being a mathematician? It is easy to ask for the impossible, for example by listing a series of topics and computational experiments that would enlarge the book from 300 to 450 pages and would need an extra six months from Mike Steel and potentially some additional computational assistance. But it is a reviewer’s obligation to be critical and to ask an excellent book to be even more excellent. Since Steel does such an excellent job of extracting the essence of algorithms and mathematical results, it is a pity that certain topics have been ignored.

Getting into mathematical phylogenetics by reading this book is probably 10-20 times faster than tracking down the articles that Steel has digested for us. Thus the topics left out are seriously disadvantaged. However, the book is already 60% longer than it was supposed to be in that book series.

Further Reading

Embleton SM. 1986. Statistics in historical linguistics. Brockmeyer.
Felsenstein J. 2004. Inferring phylogenies. Sunderland, Mass.: Sinauer Associates.
Hennig W. 1950. Grundzüge einer Theorie der phylogenetischen Systematik. Berlin.
Hennig W. 1966. Phylogenetic systematics. Urbana: University of Illinois Press.
Hillis DM, Moritz C, Mable BK. 1996. Molecular systematics. Sunderland, Mass.: Sinauer Associates.
Li W-H. 1997. Molecular evolution. Sunderland, Mass.: Sinauer Associates.
Li W-H, Graur D. 1991. Fundamentals of molecular evolution. Sunderland, Mass.: Sinauer Associates.
Salemi M, Vandamme A-M, Lemey P. 2009. The phylogenetic handbook: a practical approach to phylogenetic analysis and hypothesis testing. Cambridge, UK; New York: Cambridge University Press.
Semple C, Steel MA. 2003. Phylogenetics. Oxford; New York: Oxford University Press.
Sokal RR, Sneath PHA. 1963. Principles of numerical taxonomy. San Francisco: W. H. Freeman.
Yang Z. 2006. Computational molecular evolution. Oxford: Oxford University Press.
Yang Z. 2014. Molecular evolution: a statistical approach. Oxford; New York: Oxford University Press.


Mathias Cronjager is a graduate student in the Genome Analysis group in the statistics department at Oxford University, UK. David Emms is a postdoctoral researcher in the department of plant sciences at Oxford University, UK. Luca Ferretti is a research scientist in the Integrative Biology group at the Pirbright Institute, UK. Jotun Hein is a professorial fellow and Holder of the Chair in Bioinformatics at Oxford University, UK. He heads the Genome Analysis group in the statistics department at Oxford.
