Phylogenetic reconstruction is of great interest to biologists. The importance of the application domain, the fact that the problems look simple but turn out to be surprisingly hard, and the wider links with other parts of computer science, are typical of bioinformatics. This block of lectures can be seen as a first taste of combinatorial optimisation.
A phylogeny is a tree showing ancestor/descendant relationships between "taxonomic units". Here is an example. The square boxes in this tree represent the Recent Caminalcules; the numbers show the correspondence. The circles represent hypothetical (unobserved) ancestral species. Edges represent descent. A phylogeny does not have to be correct, and this example is not: species 30 is the actual first ancestor for these imaginary creatures.
Why do we say "taxonomic units" (things to be placed in an arrangement) rather than plainly saying "species"? Because the things we want to arrange might not be species.
For example, many protein families are known which are thought to have arisen by duplication of genes. One well-known family is the haemoglobins. Human beings have at least four different kinds. When a gene is duplicated, it may occasionally be advantageous to have twice as much of its product produced, but usually it is not. There is often selection pressure either to "kill off" the new gene (resulting in a pseudo-gene, and we have plenty of those) or to differentiate the two copies so that they do different jobs, or the same job in a different context. For example, one kind of human haemoglobin is active in foetuses, which live in quite a different environment from adults. We might want to use the amino acid sequences of the members of a protein family to trace the evolutionary history of duplication events, perhaps to determine what the changes were and come up with ideas about why the changes took place and what functions the ancestral proteins performed.
A genealogy shows parent/child relationships between individual organisms, where it is common for an organism to have two parents. The resulting graph structure is not a tree.
A phylogeny shows ancestor/descendant relationships between taxonomic units where each taxon has a unique parent.
A particularly important distinction is between observed taxonomic units (OTUs) and hypothetical taxonomic units (HTUs). The observed taxonomic units are the ones that we have actually had access to and measured. They are placed at the leaves of the phylogeny. They are the only ones we can call real. We fill in the internal nodes of the phylogeny with hypothetical taxonomic units: imaginary ancestors whose properties we are free to invent in any way that will make a good explanation for the observed properties of the OTUs. HTUs are explanatory entities which we hope have some connection with reality, but it is only a hope. Your assignment will explore a synthetic data set which has been used to show that this hope is often unfounded.
What about fossils?
A fossil is a real thing that we can measure. Fossil species count as observed taxonomic units. They go into the tree as leaves, even though they may have died out long ago, and even though they might possibly have been ancestral to other species we are interested in. If a fossil species should turn out to be (or to be indistinguishable from) an actual ancestor, then we expect that there will be an internal node that the fossil's leaf is very very close to. The identification of a fossil species as an ancestor is a result of a phylogenetic analysis, not an input to it.
Because leaves correspond to real taxonomic units, they are labelled with their names.
Because internal nodes stand for imaginary ancestors that cannot be identified with anything in particular, even another internal node in another tree, they are not labelled with names.
How many branches should internal nodes have, and what should their order be?
The only justification we have for inventing an imaginary ancestor is a difference between two (or more) groups. An HTU with no children would have no reason to exist. An HTU with one child would have no evidence to distinguish it from that child. So every internal node must have at least two children. Note that this is an important difference between the real world and our model. In the real world, a species can go extinct without leaving any descendants. But a real species is not the same as an imaginary ancestor. If a real species leaves fossil evidence for its existence, it goes into our tree as a leaf, not as an imaginary ancestor. In the real world, a species can split off one new species, and then go extinct, or could perhaps mutate in place to the point where it should be regarded as transformed into a new species. That's a one-child relationship between real species. But again, the internal nodes of a phylogeny do not stand for real species, but for imaginary ancestors whose job is to account for differences. In this example, we'd have two OTUs (parent species, child species) and an HTU to account for the difference between them.
An internal node must have at least two children, then. Could it have more?
We can appeal either to computational modelling or to biology to answer that question, and they both give equivocal answers. The classic model of speciation is "allopatric speciation" where the geographic range of a species is split somehow. It might be rising sea level dividing a big island into two or three or more little islands. It might be a rift like the one at the eastern end of the Mediterranean. It might be a newly uplifted mountain range. It might be a group of people with chainsaws and bulldozers building a superhighway. When the geographic range was connected, gene flow (breeding) between subpopulations evened out the selection pressures due to environmental variation, and the population size damped out drift. With the population split into smaller groups, random genetic drift has a stronger effect and selection pressures from local environmental effects cannot be evened out by gene flow from elsewhere.
In the nature of things, we can expect two-way splits to be far more common than three-way splits, but the rising sea level example suggests that three-way splits may not be too rare. The synthetic data you will work on in your assignment involve a three-way split.
From a computational point of view, a three-way split (A,B,C) can be modelled by a pair of two-way splits (A,(B,C)) where the splits happen so close together that they cannot be resolved: if the real split was (A,B,C), then (A,(B,C)), ((A,B),C), and ((A,C),B) will all look equally good. This can happen anyway.
When we look for the "best" phylogeny for some set of data, we often find that it is not unique. People often put together a set of "best" trees to make a "consensus tree". The whole point of such trees is that the data are often not strong enough to resolve all two-way splits, so consensus trees very often have multiway splits.
If we have time to talk about the construction of consensus trees, I shall talk about multiway trees then. In the meantime, two-way splits give us enough expressive power, so that's what we'll stick with.
A phylogeny is a tree with two kinds of nodes:
- leaves, which stand for OTUs and are labelled with their names;
- internal nodes, which stand for HTUs, are unlabelled, and have at least two children.
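To make this concrete, here is a minimal sketch in Python of such a tree structure. The class names are invented for illustration and do not come from any particular library:

```python
# A minimal sketch of a phylogeny data structure: two kinds of nodes.
# (Class names are invented for illustration.)

class Leaf:
    """An OTU: labelled with its name, and with no children."""
    def __init__(self, name):
        self.name = name

class Internal:
    """An HTU: unlabelled, with at least two children."""
    def __init__(self, children):
        if len(children) < 2:
            raise ValueError("an HTU must have at least two children")
        self.children = list(children)

# e.g. an HTU invented to account for the difference between OTUs 2 and 3:
tree = Internal([Leaf("1"), Internal([Leaf("2"), Leaf("3")])])
```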
We can accurately represent the structure of a phylogeny by means of a string in Newick format: a leaf is written as its label, and an internal node is written as a parenthesised, comma-separated list of its children.
The order of the children of a node is not significant, so there are many strings that represent the same tree. One way to pick a unique string is to order the children of each node so that the child whose subtree contains the smallest leaf label comes first. Having unique representations makes it easier to see whether we have all found the same tree.
The Newick string we get for our example is
(((((((1,17),24),(16,27)),((((2,(3,22)),12),4),((5,18),23))), (((19,26),29),20)),(((((7,15),13),8),(14,(25,28))),30)), (((6,10),(11,21)),9))
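To show how little machinery Newick format needs, here is a sketch of a Python writer that also applies the smallest-leaf ordering rule described above. The nested-tuple representation and the function names are mine, for illustration only:

```python
# A sketch of canonical Newick output. A tree is either a leaf label
# or a tuple of subtrees; children are ordered so that the subtree
# containing the smallest leaf label comes first.

def smallest_leaf(tree):
    """The smallest leaf label anywhere in the (sub)tree."""
    if isinstance(tree, tuple):
        return min(smallest_leaf(child) for child in tree)
    return tree

def newick(tree):
    """Canonical Newick string for a tree of nested tuples."""
    if isinstance(tree, tuple):
        children = sorted(tree, key=smallest_leaf)
        return "(" + ",".join(newick(c) for c in children) + ")"
    return str(tree)

print(newick(((3, 1), (2, 4))))   # prints ((1,3),(2,4))
```

(A Newick string in a data file conventionally ends with a semicolon; the sketch omits it.)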
Why should we care about building phylogenies? There are at least two kinds of reason: computational and biological.
In previous weeks you have seen that forming multiple alignments can be extremely expensive: the straightforward dynamic programming algorithm for aligning k sequences each of length n is O(n^k), which is quite impractical for even fairly small k.
Suppose that we start by forming all pairwise alignments to compute string similarity distances between all pairs of sequences. That costs O(k^2 n^2) and gives us a k×k distance matrix.
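A minimal sketch of that step, with plain edit distance standing in for alignment-based distances (real programs would use substitution matrices and gap penalties; the function names here are mine):

```python
# Pairwise distances by dynamic programming: O(n^2) per pair,
# O(k^2 n^2) for all pairs.

def edit_distance(s, t):
    """Classic edit distance between two sequences."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # delete a
                           cur[j - 1] + 1,            # insert b
                           prev[j - 1] + (a != b)))   # match or substitute
        prev = cur
    return prev[-1]

def distance_matrix(seqs):
    """k x k matrix of pairwise distances."""
    k = len(seqs)
    d = [[0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            d[i][j] = d[j][i] = edit_distance(seqs[i], seqs[j])
    return d
```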
An approximate heuristic method for forming a phylogeny (strictly speaking, a phenogram rather than a cladogram) will build a tree in O(k^3) time.
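The classic heuristic of this kind is UPGMA-style agglomerative clustering: repeatedly join the closest pair of clusters, averaging their distances to everything else. A hedged sketch (a real implementation would also record branch lengths):

```python
# UPGMA-style clustering: k-1 merges, each needing an O(k^2) search
# for the closest pair, hence O(k^3) overall.

def upgma(d, labels):
    """d: symmetric list-of-lists distance matrix; labels: leaf names.
    Returns the tree topology as nested tuples (branch lengths omitted)."""
    clusters = list(labels)
    sizes = [1] * len(labels)
    d = [row[:] for row in d]                        # work on a copy
    while len(clusters) > 1:
        n = len(clusters)
        i, j = min(((a, b) for a in range(n) for b in range(a + 1, n)),
                   key=lambda ab: d[ab[0]][ab[1]])
        # size-weighted average distance from the merged cluster to the rest
        row = [(sizes[i] * d[i][m] + sizes[j] * d[j][m])
               / (sizes[i] + sizes[j]) for m in range(n)]
        clusters[i] = (clusters[i], clusters[j])     # join clusters i and j
        sizes[i] += sizes[j]
        for m in range(n):
            d[i][m] = d[m][i] = row[m]
        d[i][i] = 0.0
        del clusters[j], sizes[j], d[j]              # drop row j ...
        for r in d:
            del r[j]                                 # ... and column j
    return clusters[0]
```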
With the tree in hand, we work from the leaves to the root doing k-1 pairwise alignments at a cost of O(k n^2). What's slightly tricky here is that we are aligning alignments.
Because building a tree means that we are estimating a certain number of parameters, this can only work when n > k, so the total cost is O(k^2 n^2).
After doing this, we recompute the distances from the multiple alignment and rebuild the tree; if the tree has changed shape we redo the alignment. Convergence is usually quick.
So building a tree can help us to do multiple alignment very much faster, and the better the tree fits the actual evolutionary history of the sequences being aligned, the better the resulting alignment. The ClustalW program does this.
Of course biologists have been building classification systems for hundreds of years. Traditional classification systems were based on the "phenetic" idea of trying to maximise the predictive power of knowing what group an organism belongs to. If you know that something is a "fish", then you know that it likes to live in water, has a backbone, is probably bilaterally symmetric, and a whole bunch of stuff, even though "fish" is not, by today's standards, a "natural monophyletic group". It has to be admitted that some of the methods we shall be looking at are phenetically based. But that doesn't make them irrelevant to phylogeny: if the actual biological descent of organisms had nothing to do with their structure and behaviour, there would be no point in thinking about evolution at all.
The "cladistic" approach takes actual evolutionary descent as of primary importance. Cladists point out that physical resemblance may be misleading and stress the importance of choosing characters which are better able to uncover the true relationships. Molecular data seem to be particular good for this. If you have some idea of the actual evolutionary history of a group of organisms, you can pose and perhaps all manner of interesting biological questions: these organisms are not closely related but look alike, why? these organisms are more closely related to ones from a distant land, why? this lineage has changed a lot, this other lineage not much, why?
Why is a Computer Science department running a Bioinformatics paper? I'm sure you could come up with six cynical reasons before breakfast, but in fact our motives are of the purest. (There is said to be money in it, but we haven't seen a sniff of any.)
The important thing is that Bioinformatics is an applied area of computing. We encounter problems that are utterly unlike any we would have met in studying compilers or operating systems or computer architecture or networks or computer graphics or practically anything in our own back yard. These are problems that real people want real solutions to; this is the kind of thing that justifies us playing with our toys.
Many application areas might have done. If you want applications to business, look to Information Science. If you want applications to entertainment, look to our Computer Games paper (which has a lot of contact with subjects outside our department). However, Bioinformatics has the pleasant property that the kind of problems it presents us with are fundamentally algorithmic problems rather than social, psychological, ethical, human factors, &c, so they are the kind of problem we are supposed to be able to help with.
It isn't simply a matter of the biologists and biochemists coming to us with a problem and us saying with a superior smile "we have what you need". It's a matter of them coming to us with problems which are genuinely novel and challenging so that we have to say "ooh, that's interesting, I'm not even sure it's doable at all, that's going to take a lot of thought that is."
We have to start by understanding what the biological data and problem actually are. For this paper, you do not need a great deal of biochemical knowledge, and the knowledge you do need is symbolic rather than chemical.
We have to develop algorithmic answers to these problems.
We have to understand the limitations of our answers, especially how well they scale with problem size.
When our precise models turn out to scale badly, we cannot simply tell the other people "this problem is so hard it isn't interesting, please go away." They would rightly say "but I still need an answer." We have to be able to devise approximate solutions, and we need to be able to tell how the possibility of affordable answers depends on the specific question asked so that we can intelligently ask whether an answer to a related question might not do just as well.
Above all, we need to be aware of the second theme so that we can maintain a properly sceptical attitude to our own "success".
The distinction I am about to make is widely attributed to the statistician George Box, who says that it is quite true, but that he was not the first to say it.
When we develop a computational or statistical model of the real world, we go through (at least) two approximations:
- the model itself is an approximation: it captures only some aspects of the real world;
- the computation is an approximation: the answer we can afford to compute is often only approximately the best answer under the model.
This is particularly relevant to inferring phylogenies.
There are a number of BIG approximations in our model:
Putting these big approximations together gives us some wiggle room in our small approximations: there is no point in striving for a computational perfection that our model guarantees will not generalise to the real world. However, we should do the best we can before giving up!
There are three basic ways to make a phylogeny: parsimony methods, maximum likelihood methods, and distance (clustering) methods.
Parsimony. Step 1: define a cost measure for phylogenies.
Step 2: announce that the best tree(s) is (are) the one(s) having the least cost.
The cost reflects the amount or difficulty of the evolutionary change implied. Some people hold that this is just Occam's razor in action: if there is a tree with cost c which can account for the data, it would be unscientific to propose a tree with cost c' > c.
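To make "cost" concrete: for a single character on a given binary tree, Fitch's algorithm counts the minimum number of state changes needed, and the parsimony cost of the tree is that count summed over all characters. A minimal sketch, assuming trees are nested pairs with observed states at the leaves:

```python
# Fitch's small-parsimony count for one character on a binary tree.

def fitch(tree):
    """Return (possible ancestral states, minimum number of changes)."""
    if not isinstance(tree, tuple):
        return {tree}, 0                      # a leaf: its observed state
    left_states, left_cost = fitch(tree[0])
    right_states, right_cost = fitch(tree[1])
    common = left_states & right_states
    if common:                                # the subtrees can agree
        return common, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1

states, cost = fitch((('A', 'G'), ('A', 'A')))
print(cost)   # 1: a single change accounts for the data on this tree
```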
Of course, William of Occam was a Nominalist, who held that abstract categories were unreal. Imaginary ancestors are precisely the kinds of entities he held should not be multiplied.
Parsimony is related to a machine learning concept called "minimum description length".
We shall start our study by looking at parsimony.
It is important to understand, however, that there may be more than one most parsimonious phylogeny, and that the true phylogeny is not necessarily the most parsimonious.
We sometimes set an assignment on this topic, in which students explore a synthetic data set called the Caminalcules. The especially important thing about them is that the true phylogeny is known. We can even tell you where to find it. But you must not try to make your programs produce it, because it is not the most parsimonious tree. Another important thing about this data set is that it is very similar to real data sets, so we can confidently expect this to happen for real species too.
In real problems, we do not know the real phylogeny. If we did, we wouldn't be trying to find it! So we can never know that the phylogeny we compute matches what really happened; nor can we expect nature to follow our ideas of elegance. We do the best we can.
Maximum likelihood. Step 1: define a probability model for evolutionary change.
Step 2: announce that the best tree(s) is (are) the one(s) which ascribe(s) the highest probability to the observed outcome.
Maximum likelihood is an extremely important mathematical technique, and I would like very much to include it, but we just don't have time.
Distance methods. Step 1: compute a "distance" matrix showing how different each OTU is from each other OTU.
Step 2: use a clustering algorithm to fit a tree to the distances.
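As a toy illustration of the two steps, here is a run of the upgma() sketch from the multiple-alignment section on a made-up distance matrix for four OTUs:

```python
# Made-up distances: OTUs 1 and 2 are close, as are 3 and 4.
d = [[0, 2, 6, 6],
     [2, 0, 6, 6],
     [6, 6, 0, 3],
     [6, 6, 3, 0]]
print(upgma(d, [1, 2, 3, 4]))   # prints ((1, 2), (3, 4))
```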
Cluster analysis is a statistical area of great practical importance. You can cluster archaeological artefacts, documents, diseases, signals, societies, anything where you can define some kind of distance or similarity. Clustering is an important part of data mining. The classic book by Sneath and Sokal on Numerical Taxonomy introduced algorithms which are still used.
Unfortunately, clustering techniques don't have the mathematical purity or air of objectivity that parsimony and maximum likelihood have. The fact that they were initially designed for grouping things by perceived similarity (phenetic) rather than evolutionary descent (cladistic) to some extent counts against them.
There is one great argument for distance-based methods: we can actually afford to use them. They can cope with hundreds of OTUs.