
TreeFitter Version 1.0 Manual
(c) F. Ronquist 2001
Tree fitting has important applications in historical biogeography, coevolution and gene tree-species tree
fitting [see recent reviews by Page, 1998; Ronquist, 1998]. A general characteristic of these
problems is that two different kinds of trees are fitted to each other in order to infer how the lineages they
represent have been associated with each other during their evolution. In historical biogeography, the two
different kinds of trees are organism phylogenies and area cladograms, in coevolution they are parasite and host
phylogenies, and in gene tree-species tree problems they are gene trees and species phylogenies.
I and others have argued that parsimony methods for tree fitting should be based on models recognising
different types of events and associating each of these events with a cost inversely related to the likelihood of the
event [Ronquist, 1990; Page, 1995; Charleston, 1998]. Because many workers do not believe in
deriving parsimony methods from models with events, I have called this approach event-based parsimony. In
my view, however, event-based parsimony really represents the only logically defensible way in which
parsimony inference can be applied to the problem of tree fitting.
A number of different models can be used in parsimony-based tree fitting but most of the work has thus
far been concentrated on the four-event model first introduced by Page [1995; Ronquist, 1998] or
subsets and variants thereof [Ronquist, 1990; Goodman, 1979; Ronquist, 1996; Ronquist, in
press]. In this model, we recognize four different types of events: codivergence events, duplication events,
sorting events and switching events. Codivergence events correspond to geographical vicariance in historical
biogeography, simultaneous host and parasite speciation in coevolving species associations, and gene tree
divergence caused by speciation in gene tree-species tree studies. Duplication events correspond to sympatric or
allopatric speciation in response to a temporary barrier in biogeography, independent parasite speciation in
coevolution, and gene duplication in gene tree analysis. Sorting events correspond to (partial) extinction in
biogeography, and lineage sorting in coevolution and gene tree analysis. Finally, switches correspond to
dispersal between isolated areas in biogeography, host shifts in coevolving associations, and horizontal gene
transfer in gene tree analysis.
TreeFitter is a simple program for parsimony-based tree fitting. It can handle arbitrary cost assignments
fulfilling the requirements that duplication events, sorting events, and switches all have zero or positive cost
associated with them. Codivergence events can be associated with either positive or negative cost (or zero cost).
In TreeFitter terminology, one kind of tree is called a P-tree and the other an H-tree, referring to the analogy with
parasite and host trees in coevolutionary analysis. In historical biogeography, the H-trees are the area cladograms
and the P-trees the organism phylogenies. In coevolution, the interpretation is self-evident. In gene tree-species
tree fitting, the P-trees are the gene trees and the H-trees are the species trees.
TreeFitter has a limited number of commands but still allows a number of useful inferences to be drawn
from the data sets. It fits any number of P-trees to a given H-tree, and it can search for the best H-tree given a set
of P-trees. It can calculate the events implied by the minimum-cost solutions and reconstructions can be saved in
TreeMap format (not yet implemented). Inferences about historical constraints or the number of events of a
particular type can be tested against inferences drawn from random data sets. These random data sets are drawn
from the original data either by random permutation of the terminals in the P-tree, the H-tree, or both.
Alternatively, either the P-trees, the H-tree or both are replaced by trees drawn at random from a tree universe
generated by the Markov process (all labelled histories equally probable) or from the tree universe where all
labelled distinct cladograms are equally probable. Finally, TreeFitter can examine portions of parameter space to
find the combination of cost assignments giving the best chances of finding historically constrained patterns,
given a set of P-trees and an H-tree. By default, TreeFitter works with the following cost assignments:
codivergence and duplication events have zero cost, sorting events have a cost of 1.0, and switches a cost of 2.0.
This combination of cost assignments works well for a wide variety of problems but not for all cases where it is
possible to retrieve phylogenetically conserved association patterns [Ronquist, in press].
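The way event costs combine into the cost of a reconstruction can be illustrated with a minimal Python sketch. It assumes that the total cost is simply the number of events of each type multiplied by the cost of that type, summed over the four event types. The function name and the event counts are hypothetical and are not part of TreeFitter.

```python
# Default TreeFitter event costs as stated in the manual.
DEFAULT_COSTS = {
    "codivergence": 0.0,
    "duplication": 0.0,
    "sorting": 1.0,
    "switch": 2.0,
}


def reconstruction_cost(event_counts, costs=DEFAULT_COSTS):
    """Sum (number of events of each type) x (cost of that event type)."""
    return sum(costs[event] * count for event, count in event_counts.items())


# Hypothetical reconstruction: 3 codivergences, 1 duplication,
# 2 sorting events and 1 switch.
example = {"codivergence": 3, "duplication": 1, "sorting": 2, "switch": 1}
total = reconstruction_cost(example)
print(total)  # 2 * 1.0 + 1 * 2.0 = 4.0
```

With the default costs only sorting events and switches contribute, so a reconstruction is penalised for every extinction or dispersal it has to postulate.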

The commands available in TreeFitter are summarized below. Please remember that this software has
been developed mainly for my own research needs and is not being maintained as a commercial software
package. It is provided for free on the understanding that there are no guarantees that the software will not crash
your system, destroy your files or fail to perform as you expect. Always keep backup copies of your files. Any
suggestions for improvements or detailed bug reports are welcome and should be addressed to me
(fredrik.ronquist@ebc.uu.se).

TreeFitter commands
The TreeFitter commands are described with a syntax similar to that used in the PAUP manual. A line fed to
TreeFitter should contain a command, followed by some options with corresponding settings. In describing the
syntax, items that are optional are given within square brackets [ ]. The settings can either be a floating point
value (specified by floatval), an integer value (specified by intval), or any of a set of alternative keyword settings
(given within curly brackets and separated by vertical lines, as in {setting1|setting2|setting3}). The commands can
either be typed in from the keyboard or entered in a batch file. The batch file can then be processed by using the
execute command. The format of the batch files is similar to the NEXUS format with different blocks of
commands. TreeFitter commands can also be issued outside blocks. TreeFitter is case-insensitive except for the
labels of the H-tree and P-tree terminals.

Data file format


Data files should commence with the line #NEXUS. The commands should then be divided into blocks, each
block starting with Begin {HOSTS | AREAS | SPECIES | PARASITES | ORGANISMS | GENES |
TREEFITTER}; and ending with End; or Endblock;. Depending on the block, different commands are
available. The TreeFitter commands are also valid if issued out of block but all other commands issued out of
block will generate error messages. See Appendix 1 for examples of data files.

Commands used in an AREAS, HOSTS or SPECIES block


An AREAS, HOSTS or SPECIES block is used to feed TreeFitter with one or more H-trees, which are
added to those already in memory, if any. The syntax is
Begin {AREAS | HOSTS | SPECIES};
[tree [<tree-name1>] <tree-description1>;]
[tree [<tree-name2>] <tree-description2>;]
End;
The hosts (or areas or species) block may contain as many trees as desired. Each tree may be named by a
label. The label can be any combination of printing characters of any length. However, the label may not begin
with a number or a left parenthesis. Furthermore, the name may not, when converted to lower case, be all. If
no label is provided, the trees will be labelled noname1, noname2, etc. The tree description follows the
Newick-format. If branch lengths are provided, TreeFitter will use the branch lengths to order the nodes in the
tree (see below). Otherwise, the trees will be ordered arbitrarily (see above for discussion on ordered versus
non-ordered trees). The labels used for the terminal taxa in the H-tree are critical: the same labels must be used in the
range descriptions of the P-trees that you wish to fit to the H-tree (this match is case-sensitive). The labels of the
terminal taxa can contain any symbols except white space. Any polytomies in the tree will be broken according
to the settings of the polyresolve option (see the SET command).
Examples of H-tree blocks:
Begin AREAS;
tree hypothesis1 ((WN:2,EN:2):1,(WP:1,EP:1):2):1;
tree hypothesis2 ((WN:1,EN:1):2,(WP:2,EP:2):1):1;
End;
Begin HOSTS;
tree (Pap,(((Fab,(Fag,Ros)),(Sap,Ana)),(Lam,(Api,(Ast,Val)))));
End;
Begin SPECIES;
tree right (kangaroo,(dog,(human,chimp)));
tree wrong (chimp, (kangaroo, (human, dog)));
tree very_wrong (human,(dog,(kangaroo,chimp)));
End;
The first block defines two different H-trees describing the same area relationships but differing in the
order of the postulated vicariance events. The order is set by specifying the branch lengths of the two different
H-trees (these ordered trees may more appropriately be termed H-tree histories). TreeFitter attempts to interpret
the branch lengths as if they were ultrametric (measured in time units rather than in amounts of evolutionary
change) and sets the order of the splitting events accordingly. The length of each time segment separating two
consecutive splitting events can be set to an arbitrary number, such as 1, for the purpose of ordering the H-tree
(Fig. XXX).
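The ordering of splitting events from branch lengths can be sketched in Python as follows. This is only an illustration, not TreeFitter's own code: it assumes the branch lengths are ultrametric, takes the time of each split as the summed branch length from the base of the root branch, and uses hypothetical helper names (parse_newick, leaves, split_order). The two area trees of the first example are written here as balanced Newick strings.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts with name, length, children."""
    text = "".join(text.split()).rstrip(";")  # labels contain no white space
    pos = 0

    def read_node():
        nonlocal pos
        node = {"name": "", "length": 0.0, "children": []}
        if text[pos] == "(":
            pos += 1  # consume "("
            while True:
                node["children"].append(read_node())
                if text[pos] == ",":
                    pos += 1  # another child follows
                    continue
                pos += 1  # consume ")"
                break
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        node["name"] = text[start:pos]
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            node["length"] = float(text[start:pos])
        return node

    return read_node()


def leaves(node):
    """Return the terminal labels below a node."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaves(child)]


def split_order(tree):
    """List internal nodes (as sorted leaf sets) from oldest to youngest."""
    events = []

    def walk(node, start_time):
        time = start_time + node["length"]  # time at which this node splits
        if node["children"]:
            events.append((time, sorted(leaves(node))))
            for child in node["children"]:
                walk(child, time)

    walk(tree, 0.0)
    return [names for _, names in sorted(events, key=lambda event: event[0])]


h1 = parse_newick("((WN:2,EN:2):1,(WP:1,EP:1):2):1;")
h2 = parse_newick("((WN:1,EN:1):2,(WP:2,EP:2):1):1;")
print(split_order(h1))  # root split, then (EN, WN), then (EP, WP)
print(split_order(h2))  # root split, then (EP, WP), then (EN, WN)
```

Both trees have the same topology; only the relative order of the two later splits differs, which is the distinction between the two H-tree histories.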

Commands used in an ORGANISMS, PARASITES or GENES block


Either of these blocks describes a set of P-trees, which TreeFitter adds to the P-trees in memory. The
syntax is
Begin {ORGANISMS | PARASITES | GENES};
[tree [<tree-name1>] [weight = <floatval>] <tree-description1>;]
[tree [<tree-name2>] [weight = <floatval>] <tree-description2>;]
End;
The organisms (or parasites or genes) block may contain as many trees as desired. Each tree may be
named by a label. The label can be any combination of printing characters of any length. However, the label may
not begin with a number or a left parenthesis. Furthermore, the name may not, when converted to lower case, be
all or weight. If no label is provided, the trees will be labelled noname1, noname2, etc. Each tree may
be associated with a weight between 0 and 1. If no weight is given, the weight defaults to 1.0. The tree
description follows the Newick-format. Branch lengths are ignored. The labels used for the terminal taxa in the
P-tree are critical: they are used to match the P-tree terminals to the H-tree terminals. These labels are
case-sensitive.
Example of an Organisms block:
Begin ORGANISMS;
tree lithobiusA weight=0.5 ((lava,lcom),(lser,lbig));
tree lithobiusB weight=0.5 (lava,(lcom,(lser,lbig)));
tree Carabus (((C_aratus,C_hortensis),C_nemoralis),C_nitens);
End;

Commands used in an ASSOCIATIONS or DISTRIBUTIONS block


An ASSOCIATIONS or DISTRIBUTIONS block specifies the match between the terminals in a
specified P-tree with the terminals in an H-tree. The syntax is as follows:
Begin {ASSOCIATIONS | DISTRIBUTIONS};
[range <tree-name1> <range description1>;]
[range <tree-name2> <range description2>;]
End;
The following distributions block specifies the distribution areas of the terminals in the three P-trees
described in the ORGANISMS block given above:
Begin DISTRIBUTIONS;
range lithobiusA lava: WN EN, lser: WP EN, lcom:EP, lbig:EP;
range lithobiusB lava: WN EN, lser: WP EN, lcom:EP, lbig:EP;
range Carabus
C_nitens:A B C D,
C_nemoralis:C,
C_nitens:D,
C_hortensis:E,
C_aratus:D;
End;

Note that the range statement can be divided into several lines. TreeFitter uses the semicolon to find the
end of a statement in a datafile.
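To make the structure of a range statement concrete, the following Python sketch splits one statement into the P-tree name and a mapping from each P-tree terminal to its H-tree terminals. The function name parse_range is hypothetical, and the sketch assumes that terminal entries are separated by commas and that each entry has the form terminal: H-terminals, as in the examples above.

```python
def parse_range(statement):
    """Parse 'range <tree> <terminal>: <H-terminals>, ...;' into (tree, mapping)."""
    # Collapse line breaks and repeated spaces, then drop the final semicolon.
    body = " ".join(statement.split()).rstrip(";").strip()
    keyword, tree_name, description = body.split(" ", 2)
    if keyword.lower() != "range":
        raise ValueError("not a range statement")
    mapping = {}
    for item in description.split(","):
        terminal, hosts = item.split(":", 1)
        mapping[terminal.strip()] = hosts.split()  # labels stay case-sensitive
    return tree_name, mapping


name, areas = parse_range(
    "range lithobiusA lava: WN EN, lser: WP EN, lcom:EP, lbig:EP;"
)
print(name)   # lithobiusA
print(areas)  # {'lava': ['WN', 'EN'], 'lser': ['WP', 'EN'], 'lcom': ['EP'], 'lbig': ['EP']}
```

Because white space is normalised first, a statement spread over several lines is handled in the same way as a single-line statement.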

Commands used in a TREEFITTER block


Everything that can be done in the blocks discussed above can also be achieved using statements within a
TREEFITTER block. These commands can also be issued out of block. The syntax is as follows (listing all
currently implemented commands):
Begin TREEFITTER;
[ptree [<treename>] <tree-description>;]
[htree [<treename>] <tree-description>;]
[range <treename> <range-description>;]
[select {htrees | ptrees} <tree-list>;]
[deselect {htrees | ptrees} <tree-list>;]
[clear {all | htrees | ptrees};]
[show {htrees | ptrees} [<tree-list>];]
[list {htrees | ptrees} <options>;]
[set <options>;]
[estimate <options>;]
[fit <options>;]
[search <options>;]
[order <options>;]
[filter <options>;]
[log <file-name>;]
[execute <file-name>;]
[ihtest <options>;]
End;
Each of these commands is described in more detail below.

ptree [<treename>] <tree-description>;


This command is exactly equivalent to the tree command issued within a block describing P-trees (that is,
an ORGANISMS, PARASITES or GENES block). See above.

htree [<treename>] <tree-description>;


This command is exactly equivalent to the tree command issued within a block describing H-trees (an
AREAS, HOSTS or SPECIES block). See above.

range <treename> <range-description>;


This command has been described above under the ASSOCIATIONS or DISTRIBUTIONS block.

select {htrees | ptrees} {ALL | <tree-list>};


This command is used to select some H-trees or some P-trees from those currently in memory. The user
must specify whether H-trees or P-trees are being selected, and then choose whether all those trees (ALL) or
only a subset specified by a tree list should be selected. The tree list should give either the names or the numbers
of the trees that are to be selected. All trees that are not selected are automatically deselected. It is possible to
specify a range of trees by giving the name or the number of the first tree followed by a hyphen and the name or
number of the last tree. Note that the names of the trees are case-sensitive. Thus, tree1 is not the same as
Tree1. To obtain a list of the trees and their number and selection status, use the list command.
Examples:
select htrees tree1 tree2 tree3;
select htrees 1 2 3;
select htrees 1-3;
select htrees tree1 tree5-tree7;
select htrees all;
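The tree-list rules described above (names, numbers, hyphenated ranges and the keyword all) can be sketched in Python as follows. The function expand_tree_list is hypothetical. It assumes that trees are numbered from 1 in the order in which they are held in memory, and it relies on the rule that tree names never begin with a number.

```python
def expand_tree_list(tree_list, names):
    """Return the 1-based numbers of the trees named in a tree list."""

    def number_of(token):
        if token.isdigit():
            number = int(token)
            if not 1 <= number <= len(names):
                raise ValueError(f"no tree number {number}")
            return number
        return names.index(token) + 1  # names are matched case-sensitively

    tokens = tree_list.rstrip(";").split()
    if len(tokens) == 1 and tokens[0].lower() == "all":
        return list(range(1, len(names) + 1))
    selected = []
    for token in tokens:
        if token in names or token.isdigit():
            first = last = number_of(token)
        else:
            # A range such as 1-3 or tree5-tree7.
            left, _, right = token.partition("-")
            first, last = number_of(left), number_of(right)
        selected.extend(range(first, last + 1))
    return selected


in_memory = ["tree1", "tree2", "tree3", "tree4", "tree5", "tree6", "tree7"]
print(expand_tree_list("tree1 tree5-tree7;", in_memory))  # [1, 5, 6, 7]
print(expand_tree_list("1-3;", in_memory))                # [1, 2, 3]
```

A token is tried as a complete tree name before it is treated as a range, so a name that itself contains a hyphen would still be found.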

deselect {htrees | ptrees} {all | <tree-list>};


This command deselects H-trees or P-trees currently in memory but is otherwise equivalent to the
SELECT command.

clear {htrees | ptrees | all};


This command clears H-trees, P-trees, or both H-trees and P-trees from memory.

show {htrees | ptrees} [<tree-list>];


This command writes an ASCII representation of either the H-trees or the P-trees specified in the tree list.
If no tree list is given, all H-trees or P-trees are shown.

list {htrees | ptrees} <options>;


This command lists all H-trees or all P-trees, their number, name and whether they are selected or not. If
there are valid costs for the H-trees (calculated by a FIT or a SEARCH command), these costs are given.

set <options>;
This command is used to change the settings of a number of different parameters, as follows.
algorithm = {LB | UB}
This option determines whether TreeFitter will be using a lower-bound or an upper-bound
algorithm to fit H-trees and P-trees. The lower-bound algorithm is recommended for general usage but
can occasionally give reconstructions with incompatible switches [Ronquist, 1996; Ronquist, in
press]. The upper-bound algorithm is slower but gives exact solutions without incompatible
switches for ordered H-trees. Unless you have many P-terminals per H-terminal and many switches, there
is not likely to be much information in the P-trees about the order of the nodes in the H-tree, and all or
most of the orderings of the nodes of the H-tree will have the same cost and imply the same set of events.
Therefore, if you use the upper-bound algorithm in H-tree searches you are likely to obtain a large set of
equally optimal H-trees that are identical in topology but differ only in the order of the splitting events
(nodes).
To check whether you have problems with incompatible switches, you can compare the lengths of
the H-trees fitted with the upper-bound algorithm to the length of the same trees fitted with the
lower-bound algorithm.

treespace = {MARKOV | EQUAL}


This sets the tree universe used for drawing random trees. If MARKOV is chosen, random trees
are generated by a random speciation-extinction process (default setting). If EQUAL is chosen instead,
trees are picked from a universe with all distinct labelled trees equiprobable.
ccost = <floatval>
This sets the codivergence cost to the specified value. The value can be zero, negative or positive.
The default value is 0.0.
ucost = <floatval>
This sets the duplication cost to the specified value, which must be larger than or equal to zero.
The default value is 0.0.
scost = <floatval>
This sets the sorting cost to the specified value, which must be larger than or equal to zero. The
default value is 1.0.

icost = {<floatval> | HFUNCTION}


This sets the switch cost to the specified value, which must be larger than or equal to zero. The
default value is 2.0. If HFUNCTION is given instead of a value, the switching cost is determined by the
node distance between H-tree elements (as in modified Brooks Parsimony Analysis) (not yet
implemented).
cost = {DEFAULT | MC | BPA | FITCH}
This sets all the event-cost assignments at the same time. If DEFAULT is specified, the cost values
are set to the defaults (see above). If MC is specified, the cost values are set as appropriate for maximum
codivergence analysis (ccost = -1, ucost = scost = icost = 0). If BPA is specified, the cost values are set as
appropriate for modified Brooks Parsimony Analysis (ccost = INFINITY, ucost = 0, scost = 1, icost =
HFUNCTION) (not yet available). If FITCH is specified, the cost values are set up for Fitch optimisation
(ccost = INFINITY, ucost = 0, scost = INFINITY, icost = 1). INFINITY represents an arbitrarily large
number (in practice, 10 000 is used).
polyresolve = <intval>
When a polytomous P-tree is read, the polytomies are arbitrarily resolved to produce one or more
binary trees. The value of polyresolve determines how many arbitrarily resolved trees are produced. If
polyresolve is set to 1, only one tree of the same weight as the original tree is produced. If polyresolve is
set to a value larger than 1, each resolved tree receives a weight corresponding to the weight of the
original tree divided by the number of resolved trees produced (not yet available).
mstaxa = {RECENT | ANCIENT | FREE}
Determines whether a widespread P-tree terminal is treated using the recent, ancient or free option
[Ronquist, in press]. Default setting is RECENT.
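The cost presets and the constraints on the individual costs can be summarised in a short Python sketch. The dictionary and the function validate_costs are hypothetical and only restate the values and rules given above. The BPA preset is left out because its switch cost depends on HFUNCTION, which is not yet implemented.

```python
INFINITY = 10_000.0  # the manual states that 10 000 is used in practice

# Presets selected with the cost option of the set command.
COST_PRESETS = {
    "DEFAULT": {"ccost": 0.0, "ucost": 0.0, "scost": 1.0, "icost": 2.0},
    "MC": {"ccost": -1.0, "ucost": 0.0, "scost": 0.0, "icost": 0.0},
    "FITCH": {"ccost": INFINITY, "ucost": 0.0, "scost": INFINITY, "icost": 1.0},
}


def validate_costs(costs):
    """Check that duplication, sorting and switch costs are not negative."""
    for key in ("ucost", "scost", "icost"):
        if costs[key] < 0:
            raise ValueError(f"{key} must be larger than or equal to zero")
    return costs  # ccost may be negative, zero or positive


for preset_name, preset in COST_PRESETS.items():
    validate_costs(preset)
    print(preset_name, preset)
```

Only the codivergence cost may be negative, which is what allows maximum codivergence analysis to be expressed as a cost-minimisation problem.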

estimate <options>;
This command explores different cost event assignments and their effects on the possibilities of finding
phylogenetically conserved association patterns. The p values obtained with different cost-event assignments are
reported. It is then up to the user to evaluate the results and to set the cost-event assignments accordingly. (The
parameter space tested is currently hard-coded).
cmin = <floatval>
Determines the minimum codivergence cost. Default setting is 0.0.
cmax = <floatval>
Determines the maximum codivergence cost. Default setting is 0.0.
cstep = <floatval>
Determines the interval between successive codivergence costs tried. Default setting is 0.2.
umin = <floatval>
Determines the minimum duplication cost. Default setting is 0.0.
umax = <floatval>
Determines the maximum duplication cost. Default setting is 0.0.
ustep = <floatval>
Determines the interval between successive duplication costs tried. Default setting is 0.5.
smin = <floatval>
Determines the minimum sorting cost. Default setting is 1.0.
smax = <floatval>
Determines the maximum sorting cost. Default setting is 1.0.

sstep = <floatval>
Determines the interval between successive sorting costs tried. Default setting is 0.5.
imin = <floatval>
Determines the minimum switching cost. Default setting is 0.0.
imax = <floatval>
Determines the maximum switching cost. Default setting is 10.0.
istep = <floatval>
Determines the interval between successive switching costs tried. Default setting is 0.5.
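The parameter space defined by the min, max and step options can be pictured as a grid of cost combinations. The Python sketch below enumerates that grid for the default settings. It assumes that both end points of each interval are included, which the manual does not state explicitly, and the function name cost_range is hypothetical.

```python
from itertools import product


def cost_range(minimum, maximum, step):
    """Return the values minimum, minimum + step, ... up to maximum."""
    count = int(round((maximum - minimum) / step)) + 1
    return [round(minimum + k * step, 10) for k in range(count)]


# Default settings of the estimate command.
codivergence = cost_range(0.0, 0.0, 0.2)  # cmin, cmax, cstep
duplication = cost_range(0.0, 0.0, 0.5)   # umin, umax, ustep
sorting = cost_range(1.0, 1.0, 0.5)       # smin, smax, sstep
switching = cost_range(0.0, 10.0, 0.5)    # imin, imax, istep

grid = list(product(codivergence, duplication, sorting, switching))
print(len(grid))  # 21 combinations, differing only in the switching cost
print(grid[0], grid[-1])
```

With the defaults, only the switching cost varies, so the command in effect scans switch costs from 0.0 to 10.0 while the other three costs stay fixed.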

fit <options>;
This command will fit the selected H-trees onto the selected P-trees using the currently chosen event-cost
assignments (altered with the set command). Available options:
output = {SUMMARY | STANDARD | DETAILED}
The setting of this option determines the type of report produced by the fit command. If
SUMMARY is chosen, only the cost (and p value, if relevant) is printed for each H-tree. If STANDARD
is chosen, then a more detailed report is printed for each H-tree. If DETAILED is chosen, results are
printed separately for each P-tree.
perm = {HTERM | PTERM | HPTERM | HTREE | PTREE | HPTREE}
The setting of this option determines the type of permutation used to test the significance of
results. If HTERM is chosen, H-tree terminals are permuted; if PTERM is chosen, P-tree terminals are
permuted instead; and if HPTERM is selected, both H-tree and P-tree terminals are permuted. If HTREE
is chosen, then a random H-tree is drawn for each permutation; if PTREE is chosen, then a random P-tree
is drawn instead. Finally, if HPTREE is chosen, both the H-tree and the P-tree are replaced by random
trees. The tree universe used for the random trees is set by the treespace option.
nperm = <intval>
Sets the number of permutations used in permutation tests of the fit. If 0 is chosen, no
permutations will be performed.
calcevents = {YES | NO }
Determines whether the program will calculate the frequency of different types of events
(codivergences, duplications, sortings and switches) when fitting H-trees and P-trees. The reported frequency is
the range (minimum and maximum) over the equally optimal reconstructions.
showancstates = {YES | NO }
Determines whether the ancestral states (the ancestral hosts) are output for each P-tree. Ignored
unless output = DETAILED. (not yet available).
showreconstructions = {YES | NO }
Determines whether the optimal reconstructions are output for each P-tree. Ignored unless output =
DETAILED. (not yet available).
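The logic of a terminal-permutation test, as controlled by the perm and nperm options, can be sketched in Python. This is an illustration under stated assumptions, not TreeFitter's implementation: the p value is taken to be the proportion of permuted data sets whose cost is no higher than the observed cost, and cost_of is a hypothetical stand-in for the actual tree-fitting calculation.

```python
import random


def permutation_p_value(observed_cost, terminals, cost_of, nperm, seed=0):
    """Estimate a p value by randomly permuting terminal labels."""
    if nperm == 0:
        return None  # no permutations requested
    rng = random.Random(seed)
    labels = list(terminals)
    hits = 0
    for _ in range(nperm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        relabelling = dict(zip(labels, shuffled))  # original -> permuted label
        if cost_of(relabelling) <= observed_cost:
            hits += 1
    return hits / nperm


def toy_cost(relabelling):
    """Stand-in cost: the number of terminals moved by the permutation."""
    return float(sum(1 for old, new in relabelling.items() if old != new))


p = permutation_p_value(0.0, ["lava", "lcom", "lser", "lbig"], toy_cost, 100)
print(p)  # share of permutations that leave every terminal in place
```

A small p value means that few random data sets fit as well as the observed data, which is the sense in which the association pattern is judged to be historically constrained.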

search <options>;
Searches for the best H-tree given the selected P-trees. Available options:
type = {EXHAUSTIVE | HEURISTIC}
Determines whether an exhaustive or a heuristic search will be used.

start = {RANDOM | STEPWISE}


Determines whether a random tree or a stepwise built tree will be used as the starting point for
heuristic searches.
neighbourhood = <intval>
Determines the swapping neighbourhood of the TBR algorithm. Setting neighbourhood to 1 is
slightly more extensive than NNI.
keep = {ONE | MIN | BOUND}
Determines whether the search should keep only one tree of minimum cost, all trees of minimum
cost, or all trees with a maximum cost set by the bound option.
bound = <floatval>
Determines the maximum cost of the H-trees to be kept.
hterms = <list of H-tree terminals>
This option sets the H-tree terminals that should be included in the calculated H-trees. The list
of H-tree terminals should be put within quotation marks. For instance, the line hterms = "A B C";
would restrict the H-tree terminals to the areas named A, B and C. By default, all H-tree terminals
appearing in the range descriptions of the selected P-trees will be included in the calculated H-trees.

order <options>;
Determines the order of the nodes in the currently selected H-trees. (not yet available).
keep = {ONE | MIN | BOUND}
Determines whether the search should keep only one tree of minimum cost for each starting tree,
all trees of minimum cost, or all trees with a maximum cost set by the bound option.
bound = <floatval>
Determines the maximum cost of the H-trees to be kept.
Determines whether an exhaustive or a heuristic search will be used.

filter {htrees | ptrees} <options>;


This command will filter H-trees based on cost and P-trees based on whether they are informative about
the relationships among the terminal taxa in the selected H-trees. Unlike all other commands, the options used
for this command are not persistent; they have to be given each time the command is invoked. (not yet
available).
cost = <floatval>
This option sets the maximum cost value of the H-trees to be retained in memory. It will be
ignored if P-trees are being filtered. The default value is INFINITY.
compress = {YES | NO}
Determines whether H-trees that have the same cost but differ only in the order of the splitting
events should be compressed to a single tree.

log <file-name>;
This command is used to log the results to a file with the specified name. The file will be stored in the
same directory as the TreeFitter program.

execute <file-name>;
This command will execute the file with the specified name. The file must be in the same directory as the
TreeFitter program, unless the correct path is given as part of the file name.
