A scalable method for identifying frequent subtrees in sets of large phylogenetic trees.pdf (Phylogenetic trees)

Ramu et al. BMC Bioinformatics 2012, 13:256 http://www.biomedcentral.com/1471-2105/13/256 RESEARCH ARTICLE Open Access A scalable method for identifying frequent subtrees in sets of large phylogenetic trees Avinash Ramu1 , Tamer Kahveci2* and J Gordon Burleigh3 Abstract Background: We consider the problem of ﬁnding the maximum frequent agreement subtrees (MFASTs) in a collection of phylogenetic trees. Existing methods for this problem often do not scale beyond datasets with around 100 taxa. Our goal is to address this problem for datasets with over a thousand taxa and hundreds of trees. Results: We develop a heuristic solution that aims to ﬁnd MFASTs in sets of many, large phylogenetic trees. Our method works in multiple phases. In the ﬁrst phase, it identiﬁes small candidate subtrees from the set of input trees which serve as the seeds of larger subtrees. In the second phase, it combines these small seeds to build larger candidate MFASTs. In the ﬁnal phase, it performs a post-processing step that ensures that we ﬁnd a frequent agreement subtree that is not contained in a larger frequent agreement subtree. We demonstrate that this heuristic can easily handle data sets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods. Conclusions: Although this heuristic does not guarantee to ﬁnd all MFASTs or the largest MFAST, it found the MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on large empirical data sets. Its performance is robust to the number and size of the input trees. Overall, this method provides a simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses. Keywords: Phylogenetic trees, Frequent subtree Background Phylogenetic trees represent the evolutionary relationships of organisms. While recent advances in genomic sequencing technology and computational methods have enabled construction of extremely large phylogenetic trees (e.g., [1-3]), assessing the support for phylogenetic hypotheses, and ultimately identifying well-supported relationships, remains a major challenge in phylogenetics. Support for a tree often is determined by methods such as nonparametric bootstrapping [4], jackkniﬁng [5], or Bayesian MCMC sampling (e.g., [6]), which generate a collection of trees with identical taxa representing the range of possible phylogenetic relationships. These trees can be summarized in a consensus tree (see [7]). Consensus methods can highlight support for speciﬁc nodes in a tree, but they also may obscure highly supported subtrees. For example, in Figure 1, the subtree containing taxa *Correspondence: tamer@cise.uﬂ.edu 2 Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA Full list of author information is available at the end of the article A, B, C, and D is present in all ﬁve input trees. However, due to the uncertain placement of taxon E, the majority rule consensus tree implies that the clades in the tree have relatively low (60%) support. Alternate approaches have been proposed to reveal highly supported subtrees. The maximum agreement subtree (MAST) problem seeks the largest subtree that is present in all members of a given collection of trees [8]. For example, in Figure 1 the MAST includes taxa A, B, C, and D. Finding the MAST is an NP-hard problem [9], although eﬃcient algorithms exist to compute the MAST in some cases (e.g., [9-17]). In practice, since any diﬀerence in any single tree will reduce the size of the MAST, the MAST is often quite small, limiting it usefulness. A less restrictive problem is to ﬁnd frequent agreement subtrees (FAST), or subtrees that are found in many, but not necessarily all, of the input trees (see [18]). In this problem, a subtree is declared as frequent if it is in at least as many trees as a user supplied frequency threshold. Several algorithmic approaches have been suggested to identify FASTs, and speciﬁcally the maximum FASTs © 2012 Ramu et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Ramu et al. BMC Bioinformatics 2012, 13:256 http://www.biomedcentral.com/1471-2105/13/256 A E B Page 2 of 15 C A D E B A B D C B A C D E C E A D B C E D A Majority Rule Consensus Maximum Agreement Subtree C A C 100% 60% 60% B E (a) D B D (b) Figure 1 (a) A collection of ﬁve input trees. The same subtree with taxa A, B, C, and D is present in all input trees, and only the position of taxa E changes. (b) The majority rule consensus and maximum agreement subtrees of the 5 input trees in Figure 1a. (MFASTs), or FASTs that contain the largest number of taxa. A variant of this problem seeks the maximal FASTs, i.e., FASTS that are not contained in any other FASTs. Notice that an MFAST is a maximal FAST, however, the inverse is not necessarily true. Zhang and Wang deﬁned algorithms, implemented in Phylominer, to identify FASTs from a collection of phylogenetic trees [19,20]. These algorithms are guaranteed to ﬁnd all FASTs but they may be prohibitively slow for data sets larger than 20 taxa. Cranston and Rannala implemented Metropolis-Hastings and Threshold Accepting searches to identify large FASTs from a Bayesian posterior distribution of phylogenetic trees [21]. This approach can handle thousands of input trees but it may not be feasible if the trees have more than 100 taxa [21]. Another approach to reveal highly supported subtrees from a collection of trees is to identify and remove rogue taxa, or taxa whose position in the input trees is least consistent. Recently, several methods have been developed that can identify and remove rogue taxa from collections of trees with thousands of taxa [22-24]. However, unlike MAST or FAST approaches, they do not provide guarantees about the support for the remaining taxa. In this paper, we describe a heuristic approach for identifying MFASTs in collections of trees. Unlike previous methods, our method easily scales to datasets with over a thousand taxa and hundreds of trees. Towards this goal, we develop a heuristic solution that works in multiple phases. In the ﬁrst phase, it identiﬁes small candidate subtrees from the set of input trees which serve as the seeds of larger subtrees. In the second phase, it combines these seeds to build larger candidate MFASTs. In the ﬁnal phase, it performs a post processing step. This step ensures that the size (i.e., number of taxa) of the FAST found can not be increased further by adding a new taxon without reducing its frequency below a user supplied frequency threshold. We demonstrate that this heuristic can easily handle data sets with 1000 taxa. We test the eﬀectiveness of these approaches on simulated data sets and then demonstrate its performance on large, empirical data sets. Although our heuristic does not guarantee to ﬁnd all MFASTs or the largest MFAST in theory, it found the true MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on the empirical data sets. Its performance is robust with respect to the number of input trees and the size of the input trees. Methods In this section we describe our method that aims to ﬁnd Maximum Frequent Agreement SubTrees (MFASTs) in a given set of m phylogenetic trees T = {T1 , T2 , · · · , Tm }. Our method follows from the observation that an MFAST is present in a large number of trees in T . The method builds MFASTs bottom up from small subtrees of taxa in the trees in T . Brieﬂy, it works in three phases. • Phase 1. Seed generation (Section “Phase one: Seed generation”). In the ﬁrst phase, we identify small subtrees from the input trees that have a potential to be a part of an MFAST. We call each such subtree a seed. • Phase 2. Seed combination (Section “Phase two: Seed combination”). In the second phase, we construct an initial FAST by combining the seeds found in the ﬁrst phase. • Phase 3. Post processing (Section “Phase three: Post-processing”). In the third phase, we grow the FAST further to obtain the maximal FAST that contains it by individually considering the taxa which are not already in the FAST. We report the resulting maximal FAST as a possible MFAST. First, we present the the basic deﬁnitions needed for this paper in Section “Preliminaries and notation”. We then discuss each of the three phases above in detail. Ramu et al. BMC Bioinformatics 2012, 13:256 http://www.biomedcentral.com/1471-2105/13/256 Page 3 of 15 Preliminaries and notation In this section, we present the key deﬁnitions and notations needed to understand the rest of the paper. We describe our method using rooted and bifurcating phylogenetic trees. However, our method and deﬁnitions can easily be applied to unrooted or multifurcating trees with minor or no modiﬁcations. Also, we assume that all the taxa are placed at the leaf level nodes of the phylogenetic tree, and all the internal nodes are inferred ancestors. Figure 2(a) shows a sample phylogenetic tree built on ﬁve taxa. We deﬁne the size of a tree as the number of taxa in that tree. We start by deﬁning key terms. Deﬁnition 1 (Clade). Let T be a phylogenetic tree. Given an internal node of T, we deﬁne the set of all nodes and edges of T contained under that node as the clade rooted at that node. but the inverse is not true all the time. Figures 2(b) and 2(c) illustrate two subtrees of the tree in Figure 2(a). Let us denote the number of combinations of k taxa from a set of n taxa with nk .In general, if a tree has n taxa, then that tree contains nk subtrees with k taxa. As a consequence, that tree contains 2n − 1 subtrees of any size including itself. Deﬁnition 4 (Frequency). Let T = {T1 , T2 , · · · , Tm } be a set of m phylogenetic trees and T be a phylogenetic tree. Let us denote the number of trees in T at which T is present with the variable m . We deﬁne the frequency of T in T as freq(T, T ) = m . m Each internal node of a phylogenetic tree corresponds to a clade of that tree. Figure 2(b) depicts the clade of the tree in Figure 2(a) rooted at x1 . Deﬁnition 5 (FAST). Let T = {T1 , T2 , · · · , Tm } be a set of m phylogenetic trees and T be a phylogenetic tree. Let γ be a number in [0, 1] interval that denotes frequency cutoﬀ. We say that T is a Frequent Agreement SubTree (FAST) of T if its frequency in T is at least γ (i.e., freq(T, T ) ≥ γ ). Deﬁnition 2 (Contraction). Let T be a phylogenetic tree with n taxa. The contraction operation transforms T into a tree with n − 1 taxa by removing a given taxon in T along with the edge that connects that taxon to T. We say that a FAST is maximal if there is no other FAST that contains all the taxa in that FAST. Clearly, larger FASTs indicate biologically more relevant consensus patterns. The following deﬁnition summarizes this. The contraction operation can extract the clades of a tree by removing all the taxa that are not a part of that clade. It can also extract parts of the tree that are not necessarily clades. We use the term subtree to denote a tree that is obtained by applying contractions to arbitrary set of taxa in a given tree. Formal deﬁnition is as follows. Deﬁnition 6 (MFAST). Let T = {T1 , T2 , · · · , Tm } be a set of m phylogenetic trees. Let γ be a number in [0, 1] interval that denotes frequency cutoﬀ. A FAST T of T is a Maximum Frequent Agreement SubTree (MFAST) of T if there is no other FAST T of T that has a larger size than T. Deﬁnition 3 (Subtree). Let T and T be two phylogenetic trees. We say that T is a subtree of T if T can be transformed into T by applying a series of contractions on T. If a tree T is a subtree of another tree T, we say that T is present in T. Notice that a clade is always a subtree, x0 x1 x2 a x3 b c (a) d e a b (b) c b c d (c) Figure 2 (a) A rooted, bifurcating phylogenetic tree T built on ﬁve taxa labeled with a, b, c, d and e. The internal nodes are shown with x0 , x1 , x2 and x3 . (b) A clade of T rooted at x1 . (b) and (c) Two subtrees of T by contracting the taxa sets {d, e} and {a, e}. Formally, given a set of phylogenetic trees T = {T1 , T2 , · · · , Tm } and a frequency cutoﬀ, γ , we would like to ﬁnd the MFASTs in T in this paper. We develop an algorithm that aims to solve this problem. Table 1 lists the variables used throughout the rest of this paper. Table 1 Commonly used variables and functions in this paper T A set of phylogenetic trees Ti ith tree m Number of trees in T n Number of taxa in each input tree ai ith taxa freq(T, T ) Frequency of the subtree T in T γ Frequency cutoﬀ Si ith seed (each seed is a subtree of a tree in T ) k Size of a seed c Number of contractions used to create a seed Ramu et al. BMC Bioinformatics 2012, 13:256 http://www.biomedcentral.com/1471-2105/13/256 Page 4 of 15 Phase one: Seed generation The ﬁrst phase extracts small subtrees from the given set of trees. From these subtrees we extract the basic building blocks which are used to construct MFASTs. We call these building blocks seeds. Conceptually each seed is a phylogenetic tree that contains a small subset of the taxa that make up the trees in T . We characterize each seed with three features that are listed below. We elaborate on each feature later in this section. 1. Seed size (k ) is the number of taxa in the seed. 2. Number of contractions (c) is the number of taxa we prune from a clade taken from an input tree in order to extract the seed. 3. Frequency (f ) is the fraction of input trees in which the seed is present. We explain the seed features with the help of Figures 3 and 4. The ﬁrst two characteristics explain how a seed can be found in one of the trees in T . They indicate that there is a clade of a tree in T such that this clade contains k + c taxa and it can be transformed into that seed after c contractions from that clade. For instance in Figure 3, when k = 2 and c = 0, only seed S1 can be extracted from T1 by choosing the clade rooted at x2 . When k = 2 and c = 1, seeds S1 , S2 and S3 can be obtained using one contraction (a3 , a2 and a1 respectively) from the clade rooted at x1 . The last feature denotes the number of trees in T in which the seed is present. For example in Figure 4, there are nine seeds S1 , S2 , · · · , S9 extracted from the three input trees using only one contraction. Among these, the frequency of S1 is 1 as it is present in all the trees. Frequency of S2 is about 0.67 for it is present in only two out of three x0 T1 x1 x2 a1 a2 S1 a3 S2 a4 trees (T1 and T2 ). The frequency of the rest of the seeds is only about 0.33. Recall that, by deﬁnition, an MFAST is present in at least a fraction γ of the trees in T . Therefore, we consider only the seeds whose frequency values are equal to or greater than this number ( i.e. , f ≥ γ ). Given the values of k, c and γ , we extract all the seeds which possess the desired feature values from the set of input trees as follows. In the newick string representation of a tree, a pair of matching parentheses corresponds to an internal node in the tree. The number of taxa in the clade rooted at this internal node is given by the number of labels between the two matching parentheses. Following from this observation, we scan the newick string of each tree one by one. For each such tree, we identify the clades which have k + c taxa. Notice that, if a tree conn clades of size tains n taxa, then it contains at most k+c k + c as no two such clades can contain common taxa. We then extract all combinations of k taxa from each of these clades by contracting the remaining c taxa. The number of ways this can be done is k+c c . Notice that all the small trees extracted this way possess the ﬁrst two characteristics explained above. At this point, we however do not know their frequencies. Therefore, we call them potential seeds. It is worth mentioning that the same seed might be extracted from diﬀerent trees. As we extract a new potential seed, before storing it in the list of potential seeds, we check if it is already present there. We include it in the potential seed list only if it does not exist there yet. Otherwise, we ignore it. This way, we maintain only one copy of each seed. Once we build our potential seed list for all the trees in T , we go over them one by one and count their frequency in T as the fraction of trees that contain them. We ﬁlter all the potential seeds whose frequencies are less than the frequency cutoﬀ. We keep the remaining ones as the list of seeds along with the frequency of each seed. In Figure 4, consider the tree T1 that has four taxa. For k = 3 and c = 1, there is only one clade of size k + c = 4 which is the tree T1 itself. We extract four potential seeds, each having three leaves from this tree. The potential seeds in this ﬁgure are given by S1 , S2 , S5 and S7 which we extract by contracting a4 , a3 , a2 and a1 respectively from T1 . S3 Phase two: Seed combination a1 a2 a1 a3 a2 a3 Figure 3 T1 is an input tree built on four taxa a1 , a2 , a3 and a4 . The internal nodes of T1 are labeled as x0 , x1 and x2 . S1 is the only seed obtained from T1 when k = 2 and c = 0. That is S1 is identical to the clade rooted at x2 . S1 , S2 and S3 are the seeds extracted from T1 when k = 2 and c = 1. They are all extracted from the clade rooted at x1 by contracting a3 , a2 and a1 respectively. At the end of the ﬁrst phase, we obtain a set of frequent seeds from the input trees. Notice that each seed is a FAST as each seed is present in suﬃcient number of trees speciﬁed by γ . These seeds are the basic building blocks of our method. In the second phase of our method, we combine subsets of these seeds to construct larger FASTs. We ﬁrst deﬁne what it means to combine two seeds. In order to combine two seeds, it is a necessary condition Ramu et al. BMC Bioinformatics 2012, 13:256 http://www.biomedcentral.com/1471-2105/13/256 Page 5 of 15 T1 a1 T2 a2 a3 a4 a1 S1 a1 a2 a3 a1 a3 a3 a4 a1 a2 a4 a2 a1 a2 a4 a3 a4 a1 a2 a4 a4 a3 a3 a4 S6 a3 S8 a3 a4 S3 S5 S7 a2 a2 S2 S4 a1 T3 a4 a1 S9 a4 a3 a2 Figure 4 The set of input trees T1 , T2 , T3 and the set of all nine potential seeds S1 , S2 · · · S9 when the seed characteristics are set to k = 3 and c = 1. All the potential seeds have three taxa as k=3. We need one contraction from the input tree to obtain each seed. S1 has frequency 1.0 as it is present in T1 , T2 and T3 . Seed S2 has frequency ∼0.67 as it is present in T1 and T2 . Remaining seeds have frequency ∼0.33 as each appears in only one of the three trees. that both seeds are present in at least one common tree T in T . We call such a tree T as the reference tree. We combine two seeds with the guidance of a reference tree. Let S1 and S2 be two seeds and let T be their reference tree. Let L1 , L2 and L be the set of taxa in S1 , S2 and T respectively. Combining S1 and S2 results in the tree that is equivalent to the one obtained by contracting the taxa in L − (L1 ∪ L2 ) from T. For simplicity, we will denote the combine operation using T as the reference network with the ⊕T symbol. For instance we denote combining S1 and S2 with T being the reference tree as S1 ⊕T S2 . To simplify our notation, whenever the identity of the reference tree is irrelevant, we will use the symbol ⊕ instead of ⊕T . Figure 5 demonstrates how two seeds S1 and S2 are combined with the help of the reference tree T. In this ﬁgure, both S1 and S2 are subtrees of T. Thus, it is possible to use T as the reference tree. We have L1 = {a1 , a3 , a4 }, L2 = {a1 , a2 , a5 , a7 }. Thus, we build C = S1 ⊕T S2 by contracting the taxa in L − (L1 ∪ L2 ) = {a6 , a8 } from T. So far, we have explained how to combine two seeds S1 and S2 using a reference tree. It is possible that many trees in T have both seeds present in them. Thus, one question is which of these trees should we use as the reference tree to combine the two seeds? The brief answer is that all such trees need to be considered. However, we make several observations that helps us avoid combining S1 and S2 using each such reference tree one by one exhaustively without ignoring any of such trees. We explain them next. Consider two trees T1 and T2 from T where both seeds are present in. There are two cases for T1 and T2 . • C ASE 1: S1 ⊕T1 S2 = S1 ⊕T2 S2 . In this case, it does not matter whether we use T1 or T2 as the reference tree. They will both lead to the same combined subtree. Thus, we use only one. • C ASE 2: S1 ⊕T1 S2 = S1 ⊕T2 S2 . In this case, the trees T1 and T2 lead to alternative combination topologies. So, we consider both of them separately. We utilize the observations above as follows. We start by picking one reference tree arbitrarily. Once we create a combined subtree using that tree, we check whether that subtree is present in the remaining trees in T . We mark those trees that contain it as considered for reference tree and never use them as reference for the same seed pair again. This is because those trees fall into the ﬁrst case described above. This way, we also store the frequency of the combined subtree in T . If the number of unmarked trees is too small (i.e., less than γ × m) then it means that even if all the remaining trees agree on the same combined topology for the two seeds under consideration, they are not suﬃcient to make it a FAST. Thus, we do not use any of the remaining trees as reference for those two seeds. Otherwise, we pick another unmarked tree arbitrarily and repeat the same process until we run out of reference trees. The next question we need to answer is which seed pairs should we combine? To answer this question we ﬁrst make the following proposition. Proposition 1. Assume that we are given a set of phylogenetic trees T . Let S1 and S2 be two seeds constructed from Ramu et al. BMC Bioinformatics 2012, 13:256 http://www.biomedcentral.com/1471-2105/13/256 Page 6 of 15 intuition, we develop two approaches for combining the seeds. T 1. In-order Combination (Section “In-order combination”). 2. Minimum Overlap Combination (Section “Minimum overlap combination”). a1 a2 a3 a4 a5 a6 a7 a8 S2 S1 Both approaches accept the list of seeds computed in the ﬁrst phase as input and produce a larger FAST that is a combination of multiple seeds. Both of them also assume that the list of input seeds are already sorted in decreasing order of their frequencies. We discuss these approaches next. In-order combination a1 a3 a2 a1 a4 a7 a5 C a1 a2 a3 a4 a5 a7 Figure 5 T is the reference tree. S1 and S2 are the seeds to be combined, both are present in T. C is obtained by pruning the subtree containing taxa a1 , a2 , a3 , a4 , a5 and a7 from T. the trees in T . For all trees T ∈ T , we have the following inequality freq(S1 ⊕T S2 , T ) ≤ min{freq(S1 , T ), freq(S2 , T )} Proof. For any T, both S1 and S2 are subtrees of S1 ⊕T S2 . Thus if S1 ⊕T S2 is present in a tree, then both S1 and S2 are present in that tree. As a result, freq(S1 ⊕T S2 , T ) ≤ freq(S1 , T ) and freq(S1 ⊕T S2 , T ) ≤ freq(S2 , T ). Hence, freq(S1 ⊕T S2 , T ) ≤ min{freq(S1 , T ), freq(S2 , T )} Proposition 1 states that as we combine pairs of seeds to grow them, their frequency monotonically decreases. This suggests that it is desirable to combine two seeds if both of them have large frequencies. This is because if one of them has a small frequency, regardless of the frequency of the other, the combined tree will have a small frequency. As a result its chance to grow into a larger tree through additional combine operations gets smaller. Following this The in-order combination approach follows from Proposition 1. It assumes that the seeds with higher frequencies have greater potential to be a part of an MFAST. It exploits this assumption as follows, ﬁrst it picks a seed as the starting point to create a FAST. It then grows this seed by combining it with other seeds starting from the most frequent one as long as the frequency of the resulting tree remains at least as large as the given cutoﬀ γ . It repeats this process by trying each seed as the starting point, Algorithm 1 presents this approach. Algorithm 1 In order combination FAST ← ∅ for all seeds Si do FAST ← Si Mark Si as considered repeat Sj ← seed with highest frequency among unconsidered seeds Mark Sj as considered CUTOFF ← γ t FAST ← FAST repeat Pick the next unconsidered tree T ∈ T as reference Mark all the trees as that contain FAST ⊕T Sj as considered if freq(FAST ⊕T Sj , T ) ≥ CUTOFF then t FAST ← FAST ⊕T Sj CUTOFF ← freq(FAST ⊕T Sj , T ) end if until Less than γ × m unmarked reference trees are left in T FAST ← t FAST Unmark all trees in T until all seeds are considered if size of FAST ≥ size of FAST then FAST ← FAST Ramu et al. BMC Bioinformatics 2012, 13:256 http://www.biomedcentral.com/1471-2105/13/256 end if Unmark all seeds end for In Algorithm 1 we ﬁrst initialize the FAST as empty. We then consider each seed one by one. We initialize a temporary subtree denoted by FAST’ with the seed Si under consideration and mark Si as considered. We combine the FAST’ with a seed Sj which has the highest frequency amongst the seeds that have not been added. If multiple seeds have the highest frequency, we randomly pick one of them and mark that seed Sj as added to the FAST’. There can be alternative ways to combine FAST’ with Sj leading to diﬀerent topologies. We use the trees in T that contain both FAST’ and Sj as guides to try only the topologies that exist in T . We stop constructing alternative topologies as soon as we ensure that there are not suﬃcient number of trees to yield frequency of γ . We set FAST’ to the combined seed if the combined seed has large enough frequency. We then consider the seed with the next highest frequency for addition and repeat this step till all Sj have been considered. If the resulting temporary FAST is larger than FAST we replace the smaller FAST with the larger one. In the next iteration, we initialize the FAST with the next Si . Using this approach we can initialize the FAST with all Si , alternatively if the user wishes to limit the amount of time spent using a maximum time cutoﬀ we stop the outermost loop (i.e., alternative initializations of FAST’) as soon as the allowed running time budget is reached. Notice that in Algorithm 3 each seed Si can lead to a different FAST. We record only the FAST that has the largest size. However, it is trivial to maintain the top k FASTs with the largest size instead if the user is looking for k alternative maximal FASTs. Minimum overlap combination The purpose of combining seeds is to construct a FAST that is large in size. Our in-order combination approach (Section “In-order combination”) aimed to maximize the frequency of the combined seeds. In this section, we develop our second approach, named Minimum Overlap Combination. This approach picks seeds so that their combination produced as large subtree as possible. We elaborate on this approach next. When we combine two seeds, the size of the resulting tree becomes at least as big as the size of each of these seeds. Formally let S1 and S2 be two seeds (i.e., trees). Let L1 and L2 be the set of taxa combined in S1 and S2 . We denote the size of a set, say L1 , with |L1 |. The size of the tree resulting from combination of S1 and S2 is |L1 |+|L2 |− |L1 ∩ L2 |. For a given ﬁxed seed size, the ﬁrst two terms of this formulation remains unchanged regardless of the seed. The last term determines the growth in the size of Page 7 of 15 the FAST. Thus, in order to grow the FAST rapidly, it is desirable to combine two frequent subtrees with a small number of common taxa. Our second approach follows from the observation above. We introduce a criteria called the overlap between two subtrees as the number of taxa common between them. Our minimum overlap combination approach works the same as Algorithm 1 with a minor diﬀerence in selecting the seed Sj that will be combined with the current temporary FAST (i.e., FAST’). Rather than choosing the seed with the largest frequency, this approach chooses the one that has the least overlap with FAST’ among all the unconsidered and frequent seeds. If multiple seeds have the same smallest overlap, it considers the frequency as the tie breaker and chooses the one with the largest frequency among those. Phase three: Post-processing So far we described how to obtain seeds (Section “Phase one: Seed generation”) and how to combine them to construct FAST (Section “Phase two: Seed combination” ). The two approaches we developed for combining seeds aim to maximize the size of FAST. However, they do not ensure the maximality of the resulting FAST. There are two main reasons that prevent our seed combining algorithms from constructing maximal FAST. First, some of the taxa of a maximal FAST may not appear in any seed (i.e. false negatives). As a result no combination of seeds will lead to that maximal FAST. Second, even if all the taxa of a maximal FAST are parts of at least one seed, our algorithms will reject combining that seed with the FAST of the seeds if those seeds contain other taxa that are not part of the maximal FAST (i.e. false positives). In the post-processing phase, we tackle abovementioned problem. Algorithm 3 describes the post processing phase in detail. We do this by considering all taxa which are not already present in the FAST one by one. We iteratively grow the current FAST by including one more taxon at a time if the frequency of the resulting FAST remains at least as large as the frequency cutoﬀ γ . We repeat these iterations until no new taxon can be included in the FAST. Thus the resulting FAST is guaranteed to be maximal. Algorithm 2 Post processing INPUT = FAST from the seed combination phase INPUT = T OUTPUT = Maximal FAST RESULT ← FAST for all ai not in FAST do CUTOFF ← γ t RESULT ← RESULT repeat Ramu et al. BMC Bioinformatics 2012, 13:256 http://www.biomedcentral.com/1471-2105/13/256 Pick the next unconsidered tree T ∈ T as reference RESULT’ ← RESULT ⊕T ai Mark all the trees that contain RESULT’ as considered if frequency of RESULT’ ≥ CUTOFF then t RESULT ← RESULT’ CUTOFF ← frequency of RESULT’ end if until Less than γ × m unmarked reference trees are left in T RESULT ← t RESULT Unmark all trees in T end for return RESULT We expect the post processing step to identify quickly the taxa that have a potential to be in an MFAST that might have not been considered during the seed generation and seed combination phases. At the end of the post processing step we obtain an MFAST. Complexity analysis of our method In this section we discuss the complexity of our method in terms of the three phases involved in it. Let T be a set of m phylogenetic trees having n leaves each. The complexity of the diﬀerent phases of our method are as follows. Phase one. Finding the seeds involves enumerating all the subtrees and checking their frequencies. Given seed size k and number of contractions c, each will con tree n clades each leading to k+c tain at most k+c c alternative mn k+c subtrees. Thus, in total there can be up to k+c seeds c (possibly many of them identical) from all the trees in T . Typically, the values of k and c are ﬁxed and small (in our experiments we have k ∈ {3, 4, 5} and c ∈ {0, 1, 2, 3, 4, 5}) leading to O(mn) seeds. The complexity of ﬁnding whether a seed is present in a single tree is O(n log n). Given that there are m trees in T , the cost of computing the frequency of a single seed is O(mn log n). Thus, the time complexity for ﬁnding the frequency of all the seeds is this expression multiplied by the number of seeds, which is O(m2 n2 log n). Phase two. Consider a set of p frequent seeds that will be considered for combining in this phase. Recall that we have two approaches to combine them. Below, we focus on each. INORDER COMBINATION We try to combine each seed with every other seed leading to O(p2 ) iterations. The complexity of checking the frequency of each combined subtree is O(mn log n). Also, there can be up to O(m) diﬀerent reference trees for guiding the combine operation. Multiplying these terms, we obtain the complexity of phase using this approach as O(p2 m2 n log n). Page 8 of 15 MINIMUM OVERLAP COMBINATION The complexity of combining the frequent seeds using the minimum overlap combination approach is very similar to the inorder approach except for an additional term. The additional complexity is because we maintain the overlap between the subtrees. This leads to the complexity O(p2 n2 + p2 m2 n log n). Phase three. Here, we consider the FAST obtained from each of the p frequent seeds in phase two. For each FAST, we sequentially go over each taxa one by one leading to O(n) iterations. There can be up to O(γ ×m) references to add a taxon. So the cost of extending all p FASTs is O(γ × mnp). Notice that each frequent seed has to appear in at least γ × m trees. Thus, the number of unique frequent seeds n p is bounded by O( γmn ×m )= O( γ ). Thus, adding the cost of all the three phases, the overall time complexity of our method using inorder combination is O(m2 n2 log n + m2 n3 log n + mn2 ). γ2 That using minimum overlap combination is O(m2 n2 log n + m2 n3 log n n4 log n + + mn2 ). γ2 γ2 In the two summations above, the second term is asymptotically larger than the ﬁrst and the last terms. Thus, we can simplify the asymptotic time complexity of inorder and minimum overlap combinations as O( m2 n3 log n ) γ2 and O( n3 log n 2 (m + n)) γ2 respectively. Results and discussion This section evaluates the performance of our MFAST algorithm experimentally. Implementation details. We implemented our MFAST algorithm using C and Perl. More speciﬁcally, we implemented the ﬁrst two phases (seed generation and seed combination) in C and the third phase (post processing) in Perl. We utilize the functions provided in the newick Utilities [25] package by modifying the source code provided in that package. We use k ∈ {3, 4, 5} and c ∈ {0, 1, 2, 3, 4, 5} in all of our experiments unless otherwise stated. In our experiments, we observed that the minimum overlap combination produced larger MFASTs than the in-order combination approach. Therefore, we Ramu et al. BMC Bioinformatics 2012, 13:256 http://www.biomedcentral.com/1471-2105/13/256 limit our experimental results to the minimum overlap combination approach. Methods compared against. We have compared our method against Phylominer [20] and the MAST command implemented in PAUP* [26]. Among these, Phylominer also seeks MFASTs in a collection of trees. However, the time complexity of this method is exponential in the size of the input trees, and hence it becomes intractable for large trees. In our experiments, we observed that it does not scale beyond 50 taxa. PAUP* is primarily a program for phylogenetic inference, although it also can compute MASTs. MASTs have a strict 100% agreement criterion unlike the arbitrary frequency cutoﬀ values γ in our method. Evaluation Criteria. We evaluate our algorithm based on the size of the MFAST found. Larger MFASTs are preferable. When possible, we report the size of the optimal solution as well. Test Environment. We ran our experiments on Linux servers equipped with dual AMD Opteron dual core processors running at 2.2 GHz and 3 GB of main memory to test the performance of our method. Datasets We test the performance and verify the results of our method on synthetic datasets and real datasets. • SYNTHETIC DATASET We built synthetic datasets in which we embedded an MFAST as described below. We characterize each synthetic dataset using ﬁve parameters. 1. 2. 3. 4. 5. Tree size (n). Number of trees (m). MFAST frequency (f ). MFAST size (n ). Noise percentage (). The ﬁrst two parameters denote the size and number of trees in T . MFAST frequency speciﬁes the fraction of trees in T which contain an MFAST. MFAST size is the number of taxa in the embedded MFAST. The noise percentage is the percentage of taxa that is not a part of the embedded MFAST but is placed on the branches within the clade that contains the MFAST. We place all the other taxa on the branches outside this clade. Given an instantiation of these parameters, we ﬁrst created a tree that has n taxa. This tree serves as the MFAST. We then created m × f trees that contain this MFAST. We build each of these trees by inserting n − n taxa randomly in the MFAST. With probability we insert each taxa within the clade that contains Page 9 of 15 MFAST. With probability 1 − we insert it outside that clade. We then created m − (m × f ) trees that do not contain the current MFAST. We simply do this by inserting all the taxa one by one at a random location. • R EAL DATASETS . We use two empirical datasets to evaluate the performance of our heuristic. The data sets contain 200 bootstrap trees generated from phylogenetic analysis of the Gymnosperm [27] and Saxifragales (Burleigh, unpublished) plant clades. To make the bootstrap trees, we assembled super-matrices, matrices of concatenated gene alignments with partial taxon overlap, from gene sequence data available in GenBank. We performed a maximum likelihood bootstrap analysis on each super-matrix using RAxML v. 7.0.4 [28]. The Gymnosperm trees each contain 959 taxa, and the Saxifragales trees each contain 950 taxa. Eﬀects of number of input trees In our ﬁrst experiment, we analyze how the number of input trees in T aﬀects the performance of our algorithm. For this purpose, we created 30 synthetic datasets. The size of the embedded MFAST in all the datasets was 15. Among these 30 datasets, 10 contained 50 trees, 10 contained 100 trees and 10 contained 200 trees. We set the noise percentage to 20% in all the datasets. The frequency of the embedded MFAST was 0.8. We set the number of taxa in all the trees in these datasets to 100. We ran our algorithm on each of these datasets to ﬁnd the size of the MFAST for γ = 0.7. Table 2 lists the average MFAST size we found for each of the dataset sizes before post processing (i.e., at the end of phase two) and after post processing (i.e., at the end of phase three). The results demonstrate that our method can identify an MFAST that is almost as big as the embedded one even without post processing, regardless of the number of trees in the dataset. Post processing improves the MFAST size slightly. On the average, we always ﬁnd an MFAST that is as large as or larger than the embedded one. An MFAST larger than 15 here implies that while randomly inserting the taxa that are not in the embedded MFAST, at least one of them was placed under the same clade at least a fraction γ of the time. More importantly, our method successfully located such taxa along with the rest of the MFAST. Eﬀects of tree size Our second experiment considers the impact of the number of taxa in the input trees contained in T on the success of our method. To carry out this test, we built datasets with varying tree sizes (i.e., n). Particularly, we used n = 100, 250, 500 and 1000. For each value of n, we repeated the experiment 10 times by creating 10 datasets with the same properties. In all datasets, we set the number of trees to m = 100, the noise percentage at = 20%, the size of the Ramu et al. BMC Bioinformatics 2012, 13:256 http://www.biomedcentral.com/1471-2105/13/256 Page 10 of 15 Table 2 Evaluation of the eﬀect of the number of trees in T Number of trees MFAST size Before post processing After post processing 50 14.5 16.0 100 15.3 15.8 200 14.4 15.4 The number of trees is set to 50, 100 and 200. For each number of trees we run our experiments on ten datasets. Each dataset contains trees with 100 taxa and an embedded MFAST of size 15. We report the average size of the MFAST obtained by our method across the ten datasets. embedded MFAST at 15% of n, and the MFAST frequency at 0.8. Table 3 reports the average MFAST size found by our method for varying tree sizes. Second column shows the embedded MFAST size. Last two columns list the average size of the MFAST found by our method across the ten datasets. Before, going into detailed discussion of the results, it is crucial to observe that our method could run to completion for datasets that have as many as 1000 taxa. When we tried to run Phylominer, it did not return any results for datasets that have more than 100 taxa. The results also demonstrate that our method could successfully identify the embedded MFAST in all the datasets regardless of the size of the input trees. In some datasets, the reported MFAST was slightly larger than the embedded one. This indicates that while randomly inserting the taxa that are not part of the embedded MFAST, it is possible that a few taxa was consistently placed under the same same clade. The results also suggest that our method identiﬁes a signiﬁcant percentage of the taxa in the embedded MFAST after the second phase (i.e., before post-processing) when the tree size is small. As the tree size grows, it starts missing some taxa at this phase. It however recovers the missing taxa during the post-processing phase even for the largest tree size. This indicates that at the end of phase Table 3 Evaluation of the eﬀect of the size of the trees in T Number of taxa MFAST size Embedded two our method could identify a backbone of the actual MFAST. The unidentiﬁed taxa at this phase are scattered throughout the clades in the input trees. Thus, there is no clade of size k + c that contains them with c contractions for small k or c. As evident from Table 3, this however does not prevent our method from recovering them. This is because the backbone reported at the end of phase two is large enough, and thus speciﬁc enough, to recover the missing taxa one by one in the last phase. This is a signiﬁcant observation as it demonstrates that our method works well even with small values of k and c. Eﬀects of noise percentage Recall that the noise percentage denotes the percentage of taxa that is added inside the clade that contains the MFAST. As increases, the pairs of taxa in the MFAST get farther away from each other in the tree that contains it. As a result, fewer taxa from MFAST will be contained in small clades of size k + c. This raises the question whether our method works well as increase and thus the MFAST taxa gets scattered around in the trees that contain it. In this experiment, we answer the question above and analyze the eﬀect of the noise percentage on the success of our method. We create synthetic datasets with various values. Particularly, we use = 20, 40 and 60%. We set the size of the embedded MFAST to n = 15, the tree size to n = 100, number of trees to m = 100 and the MFAST frequency to f = 0.8. We repeat our experiment for each parameter 10 times by recreating the dataset randomly using the same parameters. We set the frequency cutoﬀ to γ = 0.7. We report the average MFAST size found by our method in Table 4. The results suggest that our method can identify the embedded MFAST successfully even when the noise percentage is very high. We observe that the size of the MFAST found by our method before post processing decreases slowly with increasing amount of noise. This is not surprising as the taxa contained in the embedded MFAST gets more spread out (and thus farther away from each other) in the trees in T with increasing noise. As a result, if there are taxa that are not part of any seed with the provided values of k and c, they will never be included Reported Before post After post processing processing Table 4 Evaluation of the eﬀect of the noise in the trees in T 100 15 15.3 15.8 250 38 32.3 38.8 Before post processing After post processing 500 75 43.7 76.0 20 15.3 15.8 1000 150 69.8 151.0 40 13.6 15.0 60 12.7 15.0 The tree size is set to 100, 250, 500 and 1000. For each tree size we run our experiments on ten datasets. Each dataset contains 100 trees with an embedded MFAST of size 15% of the input tree size. Second column shows the embedded MFAST size. Last two columns list the average size of the MFAST found by our method across the ten datasets. Noise (%) MFAST size The size of the embedded MFAST in all the experiments is 15. We list the average size of the MFAST found by our method before and after the post processing phase.

A scalable method for identifying frequent subtrees in sets of large phylogenetic trees

Nội dung