HISEA: HIerarchical SEed Aligner for PacBio data

pdf
Số trang HISEA: HIerarchical SEed Aligner for PacBio data 13 Cỡ tệp HISEA: HIerarchical SEed Aligner for PacBio data 1 MB Lượt tải HISEA: HIerarchical SEed Aligner for PacBio data 0 Lượt đọc HISEA: HIerarchical SEed Aligner for PacBio data 0
Đánh giá HISEA: HIerarchical SEed Aligner for PacBio data
4.9 ( 11 lượt)
Nhấn vào bên dưới để tải tài liệu
Đang xem trước 10 trên tổng 13 trang, để tải xuống xem đầy đủ hãy nhấn vào bên trên
Chủ đề liên quan

Nội dung

Khiste and Ilie BMC Bioinformatics (2017) 18:564 DOI 10.1186/s12859-017-1953-9 RESEARCH ARTICLE Open Access HISEA: HIerarchical SEed Aligner for PacBio data Nilesh Khiste and Lucian Ilie* Abstract Background: The next generation sequencing (NGS) techniques have been around for over a decade. Many of their fundamental applications rely on the ability to compute good genome assemblies. As the technology evolves, the assembly algorithms and tools have to continuously adjust and improve. The currently dominant technology of Illumina produces reads that are too short to bridge many repeats, setting limits on what can be successfully assembled. The emerging SMRT (Single Molecule, Real-Time) sequencing technique from Pacific Biosciences produces uniform coverage and long reads of length up to sixty thousand base pairs, enabling significantly better genome assemblies. However, SMRT reads are much more expensive and have a much higher error rate than Illumina’s – around 10-15% – mostly due to indels. New algorithms are very much needed to take advantage of the long reads while mitigating the effect of high error rate and lowering the required coverage. Methods: An essential step in assembling SMRT data is the detection of alignments, or overlaps, between reads. High error rate and very long reads make this a much more challenging problem than for Illumina data. We present a new pairwise read aligner, or overlapper, HISEA (Hierarchical SEed Aligner) for SMRT sequencing data. HISEA uses a novel two-step k-mer search, employing consistent clustering, k-mer filtering, and read alignment extension. Results: We compare HISEA against several state-of-the-art programs – BLASR, DALIGNER, GraphMap, MHAP, and Minimap – on real datasets from five organisms. We compare their sensitivity, precision, specificity, F1-score, as well as time and memory usage. We also introduce a new, more precise, evaluation method. Finally, we compare the two leading programs, MHAP and HISEA, for their genome assembly performance in the Canu pipeline. Discussion: Our algorithm has the best alignment detection sensitivity among all programs for SMRT data, significantly higher than the current best. The currently best assembler for SMRT data is the Canu program which uses the MHAP aligner in its pipeline. We have incorporated our new HISEA aligner in the Canu pipeline and benchmarked it against the best pipeline for multiple datasets at two relevant coverage levels: 30x and 50x. Our assemblies are better than those using MHAP for both coverage levels. Moreover, Canu+HISEA assemblies for 30x coverage are comparable with Canu+MHAP assemblies for 50x coverage, while being faster and cheaper. Conclusions: The HISEA algorithm produces alignments with highest sensitivity compared with the current state-of-the-art algorithms. Integrated in the Canu pipeline, currently the best for assembling PacBio data, it produces better assemblies than Canu+MHAP. Keywords: PacBio sequencing, Read aligner, Read overlapper, Genome assembly Background De novo genome assembly is the problem of reconstructing the entire genome of an organism from sequencing reads without using a reference genome. The high throughput NGS technologies produce short reads, of few hundred base pairs, which are much smaller than *Correspondence: ilie@uwo.ca Department of Computer Science, University of Western Ontario, N6A 5B7 London, Ontario, Canada most of the repeated regions in microbial and eukaryotic genomes. The repeated regions that are longer than read length pose serious challenges to the genome assembly algorithm. This imbalance of read versus repeat length increases the complexity and processing requirements of the assembly algorithm. This is the reason many assemblies using NGS data are fragmented and incomplete [1], and often not useful for downstream analysis. © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Khiste and Ilie BMC Bioinformatics (2017) 18:564 The advent of SMRT sequencing technology from Pacific Biosciences has encouraged researchers to look into the genome assembly problem from a fresh perspective. The long reads spanning across many repeated regions enable the production of significantly better assemblies. The SMRT technology is also less biased [2] than previous NGS technologies. However, two important drawbacks of SMRT sequencing are high error rate, of 1015%, and high cost. For comparison, the dominant technology of Illumina has up to 100 times lower error rate and is over 100 times cheaper in terms of cost per Gbp [3]. On the positive side, it has been found that the errors are random and it is possible to correct them algorithmically [4] by increasing the coverage of sequencing data. Thus, SMRT sequencing makes it possible to produce more continuous and higher quality genome assemblies than what has been achieved with previous technologies. In most of the published SMRT genome assembly pipelines [5–7], a critical step is finding all-vs-all raw read alignments. The outcome of this step can have a large impact on the processing of subsequent steps and the overall outcome of the assembly pipeline. It is therefore essential to use a highly sensitive aligner. We present a new long read aligner, HISEA, which is much more sensitive than all existing ones. We compared the sensitivity of our aligner with BLASR [8], DALIGNER [9], GraphMap [10], MHAP [11], and MiniMap [5]. Note that we use the terms “alignment” and “overlap” interchangeably. The comparatively high cost of SMRT sequencing has prevented its widespread use. It is very expensive to sequence large genomes with high coverage using SMRT technology, therefore it is still beyond the reach of many research labs. Recently, Koren et al. [6] showed that their Canu assembler can generate assemblies using only 20x coverage that are comparable with 150x coverage hybrid assemblies generated with SPAdes [12]. It has also shown that it can achieve maximum assembly continuity around 50x coverage. As indicated by Koren et al. [6], Canu pipeline is currently the best. It uses the MHAP aligner [11] and therefore we incorporated HISEA in this assembly pipeline, in place of MHAP. We have compared the two pipelines, Canu+MHAP and Canu+HISEA for five organisms, E.coli, S.cerevisiae, C.elegans, A.thaliana, and D.melanogaster at two coverage levels: 30x and 50x. The pipeline using HISEA is shown to produce better assemblies for both coverage levels. Moreover, the Canu+HISEA assemblies for 30x coverage are comparable with those of Canu+MHAP for 50x coverage. Our HISEA software is implemented in C++ and OpenMP and its source code is freely available. It can be used as a stand alone aligner or as an all-vs-all read aligner in other assembly pipelines. We have tested it in the Canu [6] assembly pipeline and the modified pipeline source code is also freely available for download. Page 2 of 13 Methods The HISEA algorithm Let  = {A, C, G, T} be the DNA alphabet;  ∗ is the set of all DNA sequences, that is, all finite strings over . Our setup assumes two sets of reads: the set of reference reads, R = {r1 , r2 , ..., rn } ⊂  ∗ , and the set of query reads, Q = {q1 , q2 , ..., qm } ⊂  ∗ . A k-mer is a string of length k over . Storing reads and hashing the reference set Each read ri is encoded using 2 bits per nucleotide and stored as an array of unsigned 64-bit integers, that is, as blocks of 32 nucleotides. The reverse complement of r is stored in the same array and it starts at the next unsigned 64-bit integer. A precomputed 16-bit reverse complement array of all possible values is used to quickly compute the reverse complement of reads. All k-mers that occur in reads of R are quickly computed using bitwise operations and bit masking and stored in a hash table using double hashing technique. In the hash table, each entry stores the value of the k-mer and a pointer to another hash table which stores the set of read ids rj , and positions within rj , where this k-mer occurs. Any k-mer which occurs more than MAX_KMER_COUNT times is ignored. The MAX_KMER_COUNT is a user configurable parameter with a default value of 10000. Similarly, k-mers appearing in low count can be ignored. These k-mers do not impact the alignment and ignoring them speeds up the alignment process. The default value for low count k-mers is 2 and it can be controlled by a user configurable parameter. Searching the query set The k-mers occurring in the query read set Q are not stored; they are quickly computed as needed using bit operations. Then they are efficiently searched for in the hash table built for the set R. Every time a matching k-mer is found in the hash table, the corresponding reference read id and its position is recorded. Note that the reads in the query set are only searched in forward direction. Clustering and filtering For a given query q ∈ Q and a reference read r ∈ R, the reference read direction and all matching k-mer positions are stored in the previous step. For a pair of reads (q, r), further processing is considered either in forward or reverse direction of r. The decision is taken based on the read direction of r which has higher number of matching k-mers. The next step is to perform clustering of all the matching k-mers. Clustering is an essential step in identifying the best alignment out of multiple possible alignments. Our algorithm reports only the best alignment between a pair of reads. Figure 1 shows an example of all k-mer matches between read q and read r before and after clustering. The Khiste and Ilie BMC Bioinformatics (2017) 18:564 Page 3 of 13 Fig. 1 All k-mer matches between reads q and r before (a) and after (b) clustering example shown here is one simple case; in reality many complex cases are possible where clustering is essential. The initial matches can have contradictory information, such as the ones in Fig. 1a, and the clustering phase involves collecting together consistent matches. A consistent set of k-mer matches is defined as a set of all k-mer matches arranged in ascending order of their positions and are equidistant from neighboring k-mer matches within defined threshold. The threshold is governed by a global parameter maxShift. The parameter maxShift is a user configurable parameter that accommodates the indel errors during k-mer matching, clustering and extension algorithms. The default value of this parameter has been experimentally determined to be 0.2 (or 20%). Figure 1b shows the set of k-mers as divided into three consistent groups. It can be seen from the diagram that the rightmost cluster of k-mers is expected to produce the best alignment. Algorithm 2.1 gives the details of the clustering algorithm. The input to the algorithm is an array V which contains all k-mer matches for a pair of reads (q, r). The input k-mer matches in V are sorted beforehand, first by query read positions and then by reference read positions. If the clustering algorithm fails to produce any meaningful clusters, we reverse the sort order i.e. first sort by reference read positions and then by query read positions and retry the algorithm. The algorithm uses two global parameters, kmerSize and maxShift. The parameter kmerSize is the size of the k-mers used for the initial hashing. The parameter maxShift is defined previously. The output of the clustering algorithm is a set of matches, ClusterArray, segregated in groups such that each group has a consistent set of k-mers. Note that the first two values in ClusterArray store the left and right k-mer positions in V for that cluster. The third and fourth values are the number of matching bps and k-mer hit counts respectively. From the output of Algorithm 2.1, the cluster with the maximum number of matching base pairs is selected for further processing. The expected number of k-mer matches is estimated with the help of k-mer bounds in read q and read r; see Fig. 2. The leftmost and rightmost query k-mers start and end at positions qL and qR , respectively. Similarly, the corresponding positions in the reference read are rL and rR . The alignment length is L = rR + querySize − qR . The number of k-mer hits in the overlapping region is approximated as a binomial distribution with probability p = (1 − e)2k and L trials. Overlaps that have fewer k-mer matches than three standard deviations below the mean, that is, less than μ−3σ =  Lp − 3 Lp(1 − p), are eliminated as having too low similarity. This procedure is employed several times during different steps of the algorithm and will be referred to as the μ − 3σ criterion. Algorithm 2.1: C LUSTER KMERS(V ) global kmerSize, maxShift local k ← 0, j ← 0, ClusterArray ← (0, 0, kmerSize, 1) local found ← false , refDiff ← 0, queryDiff ← 0 for k ← 1 to V .size ⎧ found ← false ⎪ ⎪ ⎪ ⎪ ⎪ for j ← 0 to ClusterArray.size ⎪ ⎪   ⎧ ⎪ ⎪ ⎪ refDiff ← (V [k] .r − V ClusterArray[ j] [1] .r) ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ if (refDiff < 0) ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ then continue ⎪ ⎪ ⎪ ⎪   ⎪ ⎪ ⎪ ⎪ ⎪ queryDiff ← (V [k] .q − V ClusterArray[ j] [0] .q) ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎨ do if (queryDiff < 0) ⎪ ⎪ do ⎪ then continue ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ if (refDiff and queryDiff within maxShift limits) ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ found ← true ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ then ⎪ ⎪ Update values in ClusterArray[ j] ⎪ ⎪ ⎪ ⎪ ⎪ if (found = false ) ⎪ ⎧ ⎪ ⎪ ⎪ ⎨comment: Add new cluster in ClusterArray ⎪ ⎪ ⎪ ⎪ then ⎪ ⎩ ⎩ClusterArray[ j + 1] ← (k, k, kmerSize, 1) return (ClusterArray) Computing alignments The alignment between the two given reads starts as the shortest interval that contains all k-mer matches, shown in dark grey in Fig. 2. This region is extended using a smaller seed, that is, using k  -mer matches, for some k  < k. The default values are k = 16 and k  = 12. These values have been determined experimentally to produce reasonably good results for most datasets. Note that MHAP uses 16-mers as well. Khiste and Ilie BMC Bioinformatics (2017) 18:564 Page 4 of 13 Fig. 2 Computing the alignment. The dark grey region contains all k-mer matches and is extended by the light grey ones using k -mer matches The first step is to compute the maximum bounds of the alignment considering the maximum amount of allowable indels in the overlapping region. This is given by the user configurable parameter maxShift we mentioned above. As an example, for the situation depicted in Fig. 2, we set the maximum bounds for read q and read r as (queryStart, querySize) and (0, refEnd) respectively (see Fig. 2) where: queryStart = qL − (1 + maxShift)rL refEnd = rR + (1 + maxShift)(querySize − qR ) Then, all k  -mer matches within these bounds are computed as done previously for k-mers. These matches are used to extend the alignment we have computed so far; in Fig. 2, the dark grey region is extended by the light grey ones on both sides. Each k  -mer match is added if together with the ones already added they satisfy the μ − 3σ criterion described above. The structure of the extension step is given in Algorithm 2.2. The input bounds are either (qL , rL ) or (qR , rR ). The extension is performed as long as k  -mer matches exist that satisfy the μ − 3σ criterion. Algorithm 2.2: E XTENDA LIGNMENT(queryBound, refBound, V ) local refDiff ← 0, queryDiff ← 0, hits ← 0, currIndex ← −1 for i ← 1 to V .size ⎧ if (currIndex = −1) ⎪ ⎪ ⎪ ⎪ ⎪ refDiff ← |refBound − V [i] .r| ⎪ ⎪ then ⎪ ⎪ ⎪ queryDiff ← |queryBound − V [i] .q| ⎪ ⎪ ⎪ ⎪ ⎪ refDiff ← |V [currIndex] .r − V [i] .r| ⎪ ⎪ else ⎪ ⎪ ⎪ queryDiff ← |V [currIndex] .q − V [i] .q| ⎨ do if (refDiff and queryDiff within maxShift limits) ⎧ ⎪ ⎪ ⎪ ⎪ estimate ← μ − 3σ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪if (hits ≥ estimate) ⎪ ⎨ ⎪ ⎪ ⎪ ⎪ then hits ← hits + 1 ⎪ ⎪ ⎪ then ⎪ ⎪ ⎪ ⎪ ⎪ currIndex ← i ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ ⎩ else if (currIndex  = −1)break if (currIndex  = −1) then return (V [currIndex] .r, V [currIndex] .q) comment: Could not extend bounds else return (0) Finally, all the k  -mers within the initial region – dark grey colour in Fig. 2 – are computed. Note also that the process is now guided by the original k-mers and therefore the clustering step is not required. The μ − 3σ criterion is applied once more to the total number of k  -mer matches for the entire overlap (light and dark grey); if the condition is satisfied, then the reads are considered to be overlapping and the alignment is reported. Note that HISEA computes only the alignment boundaries, not the actual alignments. The same is true for other programs, such as MHAP [11], Minimap [5], and GraphMap [10]. Once identified, the alignments can be computed by dynamic programming; we avoid this step as it is very time consuming and not necessary for assembly, which is the goal of HISEA. Alignment evaluation procedures The EstimateROC utility estimates the sensitivity, specificity, and precision for the alignments reported. The original EstimateROC utility of Berlin et al. [11] relies heavily on BLASR mappings for the verification of reported alignments. This is not the most accurate procedure since BLASR can make errors. Ideally, each alignment needs to be verified against the optimally computed alignment using the Smith-Waterman dynamic programming algorithm [13]. We modified the functions estimating sensitivity, specificity, and precision accordingly. The modified function ComputeDP first computes an optimal alignment, Aopt , between two reads using SmithWaterman dynamic programming algorithm; it ensures that this is a good alignment. Then, assuming the program reported an alignment, Arep , for these two reads, it compares the length, direction, and bounds of the alignment reported by the program with those of the optimal alignment. This is essential since the program could report a very different alignment between the same reads and that should not be considered correct. The use of an optimal alignment algorithm increases the accuracy of evaluation. The three functions used for evaluation, EstimateSensitivity, EstimateSpecificity, and EstimatePrecision are modified to correspond with our new ComputeDP Khiste and Ilie BMC Bioinformatics (2017) 18:564 function. The details are given in pseudo code below; see Algorithms 3.3-3.6. Note that our evaluation is more accurate than the one of Berlin et al. [11] and all programs exhibit an decline in performance. The Results section contains a comparison of several evaluation procedures. Algorithm 3.3: C OMPUTE DP(r1 , r2 , Arep ) Aopt ← optimal alignment between r1 and r2 if (Aopt .length < minOverlapLength) or (Aopt .score < minOverlapScore) then return (bad_Aopt ) if (Arep = void) then return (no_Arep ) if (Aopt .direction = Arep .direction) or (|Aopt .length − Arep .length| < 0.3Aopt .length) or (|Aopt .left − Arep .left| > 0.3Aopt .length) or (|Aopt .right − Arep .right| > 0.3Aopt .length) then return (bad_Arep ) return (good_Arep ) Page 5 of 13 Algorithm 3.6: E STIMATE PRECISION( ) for i ← 1 to numTrials ⎧ Pick random alignment Arep (r1 , r2 ) from program output. ⎪ ⎪ ⎪ ⎨if (ComputeDP(r , r , A ) = good_A ) 1 2 rep rep do ⎪ then TP ← TP + 1 ⎪ ⎪ ⎩ else FP ← FP + 1 TP ) return ( TP+FP Results Datasets All the datasets have been downloaded from Pacific Biosciences DevNet Datasets (https://github.com/ PacificBiosciences/DevNet/wiki/Datasets). The datasets used for this evaluation are given in Table 1. Details are provided in the Additional file 1. The tests were performed on a DELL PowerEdge R620 computer with 12 cores Intel Xeon at 2.0 GHz and 256 GB of RAM, running Linux Red Hat, CentOS 6.3. Competing programs Algorithm 3.4: E STIMATE SENSITIVIT Y( ) for i ← 1 to numTrials ⎧ Pick random overlap from BLASR reference mapping. ⎪ ⎪ ⎪ ⎪ Assume the reads are r1 and r2 . ⎪ ⎪ ⎪ ⎨if (ComputeDP(r , r , A ) = bad_A ) 1 2 rep opt ⎧ do ⎪ ⎪ if (ComputeDP(r , r , A ) 1 2 rep = good_Arep ) ⎨ ⎪ ⎪ ⎪ ⎪ then then TP ← TP +1 ⎪ ⎪ ⎪ ⎩ ⎩ else FN ← FN + 1 TP return ( TP+FN ) Algorithm 3.5: E STIMATE SPECIFICIT Y( ) for i ← 1 to numTrials ⎧ ⎪Generate two random read IDs: r1 and r2 . ⎪ ⎪ ⎪ if (overlap Arep (r1 , r2 ) exists in program output) ⎪ ⎪ ⎨ if (ComputeDP(r1 , r2 , Arep )  = good_Arep ) then do then FP ← FP + 1 ⎪ ⎪ ⎪ ⎪ ⎪ if (ComputeDP(r 1 , r2 , Arep ) = bad_Aopt ) ⎪ ⎩ else then TN ← TN + 1 TN ) return ( TN+FP We evaluated the performance of HISEA against the currently best programs for PacBio read alignment: BLASR [8], DALIGNER [9], GraphMap [10], MHAP [11], and Minimap [5]. We then assessed the performance of HISEA for assembling PacBio data by including HISEA in the Canu assembly pipeline [6] and comparing it with the Canu assembly using MHAP as the aligner. The programs were run according to their own developers’ suggestions or better, as follows. Minimap and DALIGNER were run as suggested by the developers. BLASR was run according to what the MHAP paper claimed to be the best choice of parameters. This is clearly better than the default parameters of BLASR. GraphMap was run with default parameters as the only choice in overlapping mode. MHAP was run with default parameters, except the number of hashes, which was set to 1256, instead of the default 512, for increased sensitivity. Minimap was run with window size 5 (default is 10), as recommended by the designers. HISEA was run with default parameters. Table 1 SMRT datasets used in for evaluation Genome Reference number Coverage Chemistry Genome size (Mbp) E.coli NC_000913 85x P5C3 4.64 S.cerevisiae NC_001133.9 117x P4C2 12.16 C.elegans WS222 80x P6C4 100.2 A.thaliana TAIR10 110x P4C2 134.6 D.melanogaster Ref v5 90x P5C3 129.7 Khiste and Ilie BMC Bioinformatics (2017) 18:564 The Additional file 1 contains all the details concerning the versions used, download websites, and command lines. Alignment comparison The first tests we performed, as done also by Berlin et al. [11], use subdatasets of 1Gbp randomly sampled from the initial datasets; for the two smallest genomes, E.coli and S.cerevisiae, full datasets are used since they are close to 1Gbp with the given coverage. The sensitivity, specificity, and precision values for all five programs are given in Table 2. They were computed using the EstimateSensitivity, EstimateSpecificity, and EstimatePrecision procedures that we described in the Methods section. Similarly to MHAP [11] evaluation parameters for EstimateROC, we use minimum alignment length 2000 bps and 50,000 trials. The other mandatory inputs to EstimateROC are the reference genome, the reads and the mapping of the reads to the reference. The mapping of Page 6 of 13 the reads to the reference is computed using the BLASR program. HISEA has clearly the highest sensitivity, over 16% higher, on the average, than the second best program, MHAP. The specificity is high for all programs. Minimap has the highest specificity but low sensitivity. BLASR has the highest precision but, again, low sensitivity. HISEA is second for precision, not far from BLASR. To better compare the performance with respect to sensitivity and precision, we have computed the F1 -scores, also shown in Table 2. The F1 -score for HISEA is much higher than all the other programs, with DALIGNER and MHAP following 18% and 19% behind. Next are BLASR and Minimap and last comes GraphMap with a very low score. The time and memory comparison for the same 1Gbp datasets is presented in Table 3. Minimap and GraphMap are clearly the fastest and BLASR the slowest. HISEA is in the middle, behind MHAP and DALINER. Space-wise, Minimap is again the best, followed closely by BLASR, Table 2 Comparison for the 1Gbp datasets (coverage levels in parentheses) Sensitivity, specificity, precision, and F1 -score are given as percentages; A dash mean that the program crashed with segmentation fault. The best values are shown in bold. The bottom of the table shows the average values, each computed from the five corresponding values in the table Khiste and Ilie BMC Bioinformatics (2017) 18:564 Page 7 of 13 Table 3 Time and memory comparison for the 1Gbp datasets Genome E.coli S.cerevisiae C.elegans A.thaliana D.melanogaster Time (h) Memory (GB) BLASR DALIGNER GraphMap MHAP Minimap HISEA Time 113.0 3.0 0.3 3.0 0.1 4.0 Memory 7.1 124.6 42.3 210.0 8.8 25.5 Time 283.2 – 0.6 10.6 0.3 23.5 Memory 13.3 – 71.0 210.0 15.1 56.5 Time 333.6 4.1 0.6 4.3 0.2 23.6 Memory 14.5 248.2 59.0 210.0 9.8 46.4 Time 43.2 8.1 0.6 5.9 0.2 12.2 Memory 10.3 248.2 60.0 210.0 9.9 45.3 Time 355.2 12.5 0.4 4.8 0.1 95.1 Memory 16.7 204.2 59.0 210.0 9.7 48.1 CPU time is in hours and the memory in GB. The best results are in bold Table 4 Comparison of several types of sensitivity computations on the 1Gbp datasets For each dataset, four types of sensitivity computations are used: “presence” only checks for the read pair, “length” also checks the correct length, “bounds” checks for correct alignment bounds (the one used in this paper), and the last one is from Berlin et al. [11] Khiste and Ilie BMC Bioinformatics (2017) 18:564 Page 8 of 13 and at some distance by HISEA and GraphMap. MHAP and DALIGNER used the most memory. MHAP is implemented in JAVA which generally requires more memory. The java command-line parameter -Xmx is used to set the maximum heap size for MHAP stand alone invocation. The default maximum java heap size depends on the platform and the amount of memory in the system. For our systems, the default was not sufficient to perform the tests. We set -Xmx parameter to 200G which was sufficient for all tests but it does not capture true overlapper memory for MHAP. The reported memory usage for MHAP consists of the overlapper memory and the memory required for Java Virtual Machine environment. the sensitivity are compared. In our evaluation we check for precise bounds of the alignment; this is given in the rows labelled as “bounds” in the table. We can relax this condition by checking only the length of the alignment; labelled as “length” in the table, this is the closest to the procedure of Berlin et al.. Finally, the weakest check we can have is simply for the “presence” of an alignment between the reads. While there are differences among all these sensitivity modes, HISEA remains clearly the first, followed by DALIGNER and MHAP, and then at some distance by the other three programs. It is interesting to note the very high sensitivity of DALIGNER in the “presence” only scenario. MHAP sketch size and Minimap minimizers Sensitivity variations As we have described above, we use a more precise evaluation compared to the one of Berlin et al. [11]. As a result the programs exhibit a decrease in sensitivity. It is therefore interesting to compare our procedure with the one of Berlin et al. [11]. In Table 4, four ways of evaluating Both MHAP and Minimap can have their parameters adjust to improve sensitivity. We investigate here this effect. MHAP uses a technique called MinHash [14] in order to compute the overlaps. MinHash reduces a string to a set of fingerprints, called sketch. It is clear that using Table 5 Testing larger sketch sizes for MHAP. Starting with the value we have used for testing, 1256, the sketch size is increased with increments of 512 up to 3816 Genome E.coli S.cerevisiae C.elegans A.thaliana D.melanogaster Parameter MHAP skecth size 1256 1768 2280 2792 3304 3816 Sensitivity 83.74 85.75 86.52 86.87 87.05 87.16 Specificity 99.90 99.86 99.84 99.82 99.81 99.80 Precision 97.15 96.99 96.82 96.89 96.88 97.70 F1-score 89.95 91.02 91.38 91.61 91.70 92.13 Sensitivity 62.08 63.67 64.32 64.52 64.62 64.69 Specificity 99.77 99.72 99.66 99.63 99.58 99.56 Precision 89.29 88.79 88.69 88.62 88.55 88.30 F1-score 73.24 74.16 74.56 74.67 74.72 74.67 Sensitivity 80.43 81.81 82.37 82.62 82.69 82.73 Specificity 99.97 99.93 99.90 99.88 99.85 99.82 Precision 45.46 35.71 29.32 25.80 23.75 22.13 F1-score 58.09 49.72 43.25 39.32 36.90 34.92 Sensitivity 76.19 77.05 77.38 77.49 77.55 77.57 Specificity 99.91 99.87 99.86 99.85 99.84 99.83 Precision 88.78 88.50 88.68 88.35 88.55 88.33 F1-score 82.00 82.38 82.65 82.56 82.69 82.60 Sensitivity 71.86 73.36 73.89 74.12 74.24 74.30 Specificity 99.94 99.92 99.91 99.88 99.87 99.86 Precision 72.47 72.00 72.07 72.46 71.45 71.62 F1-score 72.16 72.67 72.97 73.28 72.82 72.94 Note that the results for the first column (sketch size 1256) appear also in Table 2. They are repeated here for comparison convenience Khiste and Ilie BMC Bioinformatics (2017) 18:564 a larger sketch increases the sensitivity at the cost of speed decrease. Given the excellent speed of MHAP, it is worth investigating the effect of this parameter. Note that we already tested sketch size 1256 instead of the default 512, for improved sensitivity. Table 5 shows the results for sketch size increased with increments of 512 from 1256 to 3816. The sensitivity increases slightly but never comes close to that of HISEA. Also, precision decreases and so the F1 -score increases very little (or decreases dramatically, as it happens for C.elegans). Also, the running time increases up to 10 times when changing sketch size from 1256 to 3816. Overall, increasing the sketch size is clearly not improving the performance of MHAP. Similarly, the sensitivity of Minimap can be increased by using more minimizers. A minimizer is the smallest k-mer in a window of w consecutive k-mers. The default value is w = 10 but the recommended value by the designers for all-vs-all PacBio read self-mapping is w = 5 and this is what we used in our tests. We have investigated the effect of increasing the number of minimizers by decreasing w. The results are presented in Table 6. The improvement is more significant for Minimap but it starts from lower values. The improved performance is still far from the top programs. Page 9 of 13 Table 6 Testing higher number of minimizers for Minimap. Starting with the value we have used for testing, w = 5, we increase the number of minimizers by decreasing w all the way to the smallest value w = 1. Note that the results for the first column (w = 5) appear also in Table 2. They are repeated here for comparison convenience Genome E.coli S.cerevisiae C.elegans A.thaliana Parameter Minimap window size 5 4 3 2 1 Sensitivity 91.80 93.08 94.13 95.24 96.29 Specificity 99.93 99.92 99.93 99.92 99.91 Precision 97.13 97.22 97.42 97.51 97.58 F1-score 94.39 95.10 95.75 96.36 96.93 Sensitivity 9.35 9.64 9.94 10.36 11.00 Specificity 99.98 99.98 99.97 99.97 99.97 Precision 94.30 94.18 93.28 91.90 88.58 F1-score 17.01 17.49 17.97 18.62 19.57 Sensitivity 85.38 86.63 87.63 88.77 89.80 Specificity 99.98 99.98 99.98 99.98 99.97 Precision 89.80 89.77 89.05 88.11 85.76 F1-score 87.53 88.17 88.33 88.44 87.73 Sensitivity 23.55 26.90 31.21 37.08 45.56 Specificity 99.97 99.98 99.96 99.96 99.96 Sensitivity vs overlap size Precision 84.00 84.77 85.48 86.43 87.94 It is easier to find long overlaps with correct bounds compared to short overlaps. We have plotted in Fig. 3 the aligners’ sensitivity as a function of overlap length. The sensitivity increases with the overlap length for all aligners except DALIGNER. The sensitivity of HISEA remains very high for both short and long overlaps and it improves with longer overlap lengths. MHAP shows a similar trend but its sensitivity for short overlaps is low. BLASR, Minimap, and GraphMap seem to have been optimized for more recent chemistry; note the very low performance on the oldest chemistry P4C2 datasets. F1-score 36.79 40.84 45.73 51.90 60.02 Sensitivity 40.72 42.82 45.51 49.11 54.00 Specificity 99.99 99.98 99.98 99.98 99.97 HISEA vs MHAP Since sensitivity is the most important parameter, as long as the difference in precision is not too large, we compare for the remaining tests only the top two programs, HISEA and MHAP. It turns out that the way MHAP is run within the Canu assembly pipeline is different from running it in stand alone mode. Therefore, we are comparing again the sensitivity, specificity and precision of the alignment produced by the two programs, this time while run in the pipeline mode. We consider the same datasets as above but with higher coverage: 30x and 50x. As mentioned by Koren et al. [6], Canu+MHAP pipeline reaches the best assemblies around 50x coverage. Our goal is to produce similar quality assemblies with only 30x coverage. The 30x D.melanogaster Precision 83.93 83.12 82.87 81.85 81.25 F1-score 54.84 56.52 58.75 61.39 64.88 and 50x coverage datasets were sampled using the utility fastqSample available from the Canu pipeline [6]. The alignments computed by MHAP and HISEA while run in the Canu pipeline were extracted and analyzed as above. The results are shown in Table 7. HISEA has better sensitivity, precision, and F1 score in all tests with very large differences for the 50x coverage datasets. The specificity of both programs is very high for all tests, with HISEA edging ahead for 30x coverage and MHAP for 50x. Assembly comparison We have integrated the HISEA program in the Canu assembly pipeline, which is currently the best. Our alignment output is similar to the M4 format used by BLASR and MHAP programs1 . HISEA can also be integrated in other assembly pipelines, e.g., Miniasm [5] and Falcon [7], by converting HISEA output to the format required by these pipelines. Khiste and Ilie BMC Bioinformatics (2017) 18:564 Page 10 of 13 Fig. 3 Sensitivity as a function of mean overlap length Table 7 Sensitivity, specificity, precision, and F1 -score for HISEA and MHAP program output within the Canu pipeline Two coverage levels are considered for each dataset: 30x and 50x. The best values are shown in bold. The bottom of the table shows the average values, each computed from the five corresponding values in the table. All values are percentages
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.