Aggressive Assembly of Pyrosequencing Reads With Mates
Abstract
Motivation: DNA sequence reads from Sanger and pyrosequencing platforms differ in cost, accuracy, typical coverage, average read length and the variety of available paired-end protocols. Both read types can complement one another in a 'hybrid' approach to whole-genome shotgun sequencing projects, but assembly software must be modified to accommodate their different characteristics. This is true even of pyrosequencing mated and unmated read combinations. Without special modifications, assemblers tuned for homogeneous sequence data may perform poorly on hybrid data.
Results: Celera Assembler was modified for combinations of ABI 3730 and 454 FLX reads. The revised pipeline, called CABOG (Celera Assembler with the Best Overlap Graph), is robust to homopolymer run length uncertainty, high read coverage and heterogeneous read lengths. In tests on four genomes, it generated the longest contigs among all assemblers tested. It exploited the mate constraints provided by paired-end reads from either platform to build larger contigs and scaffolds, which were validated by comparison to a finished reference sequence. A low rate of contig mis-assembly was detected in some CABOG assemblies, but this was reduced in the presence of sufficient mate pair data.
Availability: The software is freely available as open-source from http://wgs-assembler.sf.net under the GNU Public License.
Contact: jmiller@jcvi.org
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
New DNA sequencing technologies demand new assembly software to stitch together the short strings of nucleotide bases, called reads, that a sequencer determines. Many mature assemblers were developed when virtually all DNA sequence data were generated using Sanger chemistry to produce high-fidelity long reads. De novo assemblers, for which sequence data are the only input, include: Phrap (www.phrap.org), TIGR Assembler (Sutton et al., 1995), Celera Assembler (Myers et al., 2000), Euler (Pevzner et al., 2001), PCAP (Huang and Yang, 2005) and Arachne (Jaffe et al., 2003). The pyrosequencing platform produced by 454 Life Sciences is sold with Newbler, an assembler specifically for 454's medium-length reads (Margulies et al., 2005). New assemblers including Velvet (Zerbino and Birney, 2008) offer functionality specifically for short-read sequencing technologies, such as Solexa (Bentley, 2006).
Hybrid sequencing: hybrid sequencing strategies leverage the strengths of two or more sequencing platforms and may require assembly software tuned for specific read-type combinations (Hall, 2007). At least three groups have introduced software for hybrids of pyrosequencing and other read types. We introduce a package that best exploits paired-end mate information.
The first protocols for assembly of hybrid data applied a multiple-assembler pipeline. Newbler combined 'pyro' reads into initial contigs that were shredded to produce overlapping pseudo-Sanger reads. These were processed with Sanger reads by a Sanger-specific assembler: using Celera Assembler (Goldberg et al., 2006) or Arachne for whole genomes, or using Phrap (Wicker et al., 2006) for cloned targets. Recent protocols use a single assembler tuned for hybrid data. Newbler was updated to accept non-pyro data (Roche, 2007), and Euler was modified to accept pyro data in a version called Euler-SR (Chaisson and Pevzner, 2008). Now, Celera Assembler has been modified to accept pyrosequencing data natively, alone or in combination with Sanger data.
Modified Celera Assembler: the Celera Assembler software has modules for successive phases of assembly: pairwise overlap detection; initial ungapped multiple sequence alignments called unitigs; unitig consensus calculation; combination of unitigs with mate constraints to form contigs and scaffolds, which are ungapped and gapped multiple sequence alignments, respectively; and finally, scaffold consensus determination (Myers et al., 2000). Our approach to hybrid data assembly reuses the Celera Assembler scaffold and consensus modules. Independent of the hybrid problem, the scaffold module was revised to recover trimmed base calls confirmed by co-locating reads, and the consensus module was revised to determine alternate consensus sequences in regions of apparent polymorphism (Denisov et al., 2008). Our analysis narrowed the source of hybrid assembly problems to the overlap and unitig stages.
For speed, Celera Assembler relies on short exact matches between reads as seeds for overlap detection. Its exact-match algorithms were sensitive to the different proclivities for stutter observed between platforms. Stutter, that is, incorrect determination of the number of bases in homopolymer (single-letter) runs, is more prevalent in pyro reads than in Sanger reads. We therefore modified the software to search for matches in compressed sequence, in which all single-letter repeats are reduced to a single base. The uncompressed sequence is consulted afterwards, before the seeds become overlaps.
Celera Assembler was sensitive to the different average read lengths between platforms. The shorter reads are more likely to be entirely contained within genomic repeats. Over-collapsed alignments of short repeat reads induce true and false overlaps to the interior of longer reads. Where the longer reads extend beyond the genomic repeats, they do not all overlap each other. The result is short reads with containment overlaps to multiple long reads that do not overlap each other. These overlap tangles were triggering Celera heuristics designed to detect mis-assembly, leading to unnecessarily short contigs.
Celera Assembler was also sensitive to the higher coverage typical of lower cost pyrosequencing. Higher coverage leads to increased collisions of reads with exactly the same prefix sequence. The assembler's arbitrary tie-breaking heuristics, sufficient for exceptional ties, had the potential to lead the assembler away from the global optimum in hybrid data. To address these problems we developed an aggressive approach to unitig construction that builds unitigs in greedy fashion, always following a read's best overlap (by an appropriate criterion), and ignoring contained reads at first. The aggressive unitigs initially contain mistakes that, ideally, are caught and corrected later by pattern analysis applied to best overlaps and mate constraints.
High coverage could also increase the number of spurs, that is, reads with invalid sequence at one end. These seemed to contribute to fractured unitigs on hybrid data. We realized the software could turn higher coverage to its advantage by carefully trimming reads of unconfirmed sequence.
The new pipeline for hybrid data assembly is named CABOG (Celera Assembler with the Best Overlap Graph). It was challenged to assemble small genomes from 454 GS FLX reads in combination with paired-end mates from either FLX, or Sanger sequencing, or both. It was compared with other hybrid assembly protocols for continuity, accuracy and performance.
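The effect of homopolymer compression on seed matching can be illustrated with a minimal sketch. It is not the Celera Assembler implementation; the function names and the two toy reads are hypothetical, chosen so that a single under-called T run removes every shared 7-mer from the raw reads while the compressed reads still match exactly.

```python
from itertools import groupby


def compress_homopolymers(seq: str) -> str:
    """Reduce every run of one repeated base to a single base."""
    return "".join(base for base, _run in groupby(seq))


def kmers(seq: str, k: int) -> set:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def share_seed(read_a: str, read_b: str, k: int) -> bool:
    """True if the two reads have at least one exact k-mer in common."""
    return not kmers(read_a, k).isdisjoint(kmers(read_b, k))


# Hypothetical reads: the second under-calls the T run by one base (stutter).
true_read = "ACGTTTTACG"
stutter_read = "ACGTTTACG"

raw_match = share_seed(true_read, stutter_read, k=7)
compressed_match = share_seed(
    compress_homopolymers(true_read),
    compress_homopolymers(stutter_read),
    k=7,
)
```

Because compression discards run lengths, a seed found this way is only a candidate; the uncompressed sequence still has to confirm it, as the text notes.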
2 ALGORITHM
CABOG parses the native SFF files produced by the 454 FLX pyrosequencing machines. It discards 454 reads that include at least one unresolved base (the letter N). It recognizes mated reads as those whose sequence contains 454 linker sequences. From these mated reads it generates one or two shorter, linker-free pseudo-reads, plus a distance constraint set to the estimated mean separation (default 3 kb).
Overlap-based trimming: to exploit the increased expected read coverage, CABOG employs a read-trimming step. This functionality has been explored previously in Lucy (Chou and Holmes, 2001), PCAP, Arachne, UMD Overlapper (Roberts et al., 2004), and Figaro (White et al., 2008). The new read trimmer first computes, for all read pairs, local alignments, or partial overlaps that may not span the end of either read. On reads with sequence beyond the initially specified clear range, it extends the clear range to the extent confirmed by overlaps. It flags regions with discontinuous overlap coverage and trims the clear range to the longest covered span, possibly length zero, using heuristics to identify the precise boundaries. It identifies probable spurs and chimers (reads that join discontinuous genomic loci) and trims each to one trusted clear range.
Anchors and overlaps: CABOG uses exact-match seeds to detect possibly overlapping reads quickly. It finds these seed matches in compressed sequence, where consecutive instances of the same base are reduced to one base. Compression compensates for the stutter that is observed more frequently in pyro reads. Each seed is a k-mer (a substring of k bases, with k=22 by default). CABOG counts the number of instances of each distinct k-mer observed in the compressed input sequence. To avoid highly repetitive k-mers, it dynamically computes a threshold M so that k-mers with more than M occurrences constitute at most 1% of all k-mer occurrences. Only k-mers with between two and M occurrences are used for overlap seeds. CABOG identifies read pairs as likely to overlap if they share sufficiently many k-mers, with a sliding threshold that favors rare k-mers but accepts more common k-mers if they cover longer spans. For the selected read pairs, it chooses a single 0-length anchor position from the rarest k-mer shared by that pair of reads.
CABOG then determines which anchors extend to overlaps. Iteratively and in parallel, it considers each read as a reference to which it aligns all other reads anchored to it. It calculates pairwise alignments by first extending from the anchor in one direction to find the last aligning position X. If X is at the end of either read, the full-alignment extension is computed from X back across the anchor. Calculating all alignments in a consistent direction with respect to the reference read produces a more accurate multialignment, especially near homopolymer runs. A modified version of the Landau–Vishkin algorithm (Gusfield, 1997) is used to efficiently find the position X. The multialignment of overlapping reads to each reference read is used to detect likely sequencing errors and modify alignments accordingly. It is also used to average the homopolymer run lengths, applying different weighting criteria for Sanger and pyro reads. Finally, it outputs directed overlaps from the corrected reference sequences, with each overlap region required to span at least 40 bases covering two read ends, and to have 94% or better sequence identity.
Best overlap graph: conceptually, reads and overlaps are represented in a multigraph, G, with both directed and undirected edges. Each read is represented by a pair of nodes, corresponding to the two ends of the read, connected by an undirected edge. Directed edges represent dovetail overlaps, that is, those that span exactly one end of each read. Dovetail paths in G are acyclic paths that include an undirected edge immediately before and after every directed edge. Path length is the number of implicit reads traversed.
When CABOG creates G, it disregards overlaps from reads with containment overlaps. It also disregards overlaps that do not satisfy a quality threshold (by default, at most 1.5% alignment error and at least 40 bp spanned). It loads at most one directed edge per node, which represents the corresponding read end's best overlap. By default 'best' is measured as most bases spanned by the overlap alignment, although other criteria can be used. Ties are broken by alignment percent error, or failing that, arbitrarily by read ID. G is a best overlap graph, or BOG. It is implemented as multiple linked lists in an array of reads, where each array element includes one left and one right (possibly null) pointer to a particular end and strand of another read. The BOG represents a drastic and lossy data reduction of the overlap set. It is a greedy heuristic to avoid the overlap tangles expected in high-coverage hybrid data that present a wide mix of read lengths.
Unitig construction: any cycles in G are eliminated by deletion of one edge, chosen arbitrarily. The resulting BOG paths cannot diverge because each node has out-degree of at most one in the overlap edges. The BOG paths can converge due to overlaps that are not mutually best for both reads involved. See Figure 1.
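The k-mer frequency filter can be sketched as follows. This is an illustration under stated assumptions, not CABOG's code: the function names are hypothetical, the reads are taken to be already homopolymer-compressed, and M is interpreted as the smallest threshold for which k-mers seen more than M times account for at most 1% of all k-mer occurrences.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all (compressed) reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def repeat_threshold(counts, max_fraction=0.01):
    """Smallest M such that k-mers seen more than M times make up
    at most max_fraction of all k-mer occurrences (assumed reading of M)."""
    total = sum(counts.values())
    occurrences_at = Counter()  # multiplicity -> occurrences contributed
    for c in counts.values():
        occurrences_at[c] += c
    above = 0  # occurrences from k-mers already excluded as too frequent
    for mult in sorted(occurrences_at, reverse=True):
        if above + occurrences_at[mult] > max_fraction * total:
            return mult  # excluding this multiplicity too would exceed the budget
        above += occurrences_at[mult]
    return 0


def seed_kmers(counts, max_fraction=0.01):
    """Keep k-mers with between two and M occurrences."""
    m = repeat_threshold(counts, max_fraction)
    return {kmer for kmer, c in counts.items() if 2 <= c <= m}
```

The lower bound of two drops k-mers seen only once, which cannot seed an overlap between two reads; the upper bound M drops the most repetitive k-mers, which would seed many spurious candidates.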
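The selection of one best edge per read end can be sketched as below. The `Overlap` record, its field names and the `contained` set are hypothetical; the sketch stores the graph in a dictionary rather than the linked-list array described above, and assumes 'best' means the default criterion (most bases spanned, then lower error, then read ID).

```python
from typing import NamedTuple


class Overlap(NamedTuple):
    read: str        # read whose end is being extended
    end: str         # which end of `read` the overlap spans: "5" or "3"
    other: str       # overlapping read
    other_end: str   # end of `other` spanned by the overlap
    span: int        # bases spanned by the alignment
    error: float     # alignment error as a fraction (0.015 = 1.5%)


def _rank(ov: Overlap):
    # Smaller tuple is better: longest span, then lowest error, then read ID.
    return (-ov.span, ov.error, ov.other)


def best_overlap_graph(overlaps, contained=frozenset(),
                       max_error=0.015, min_span=40):
    """Map each (read, end) node to its single best dovetail overlap."""
    best = {}
    for ov in overlaps:
        if ov.read in contained or ov.other in contained:
            continue  # contained reads are ignored at this stage
        if ov.error > max_error or ov.span < min_span:
            continue  # fails the quality threshold
        key = (ov.read, ov.end)
        if key not in best or _rank(ov) < _rank(best[key]):
            best[key] = ov
    return best
```

Keeping only one outgoing edge per node is what makes the reduction lossy: every alternative overlap for that read end is discarded, whether or not it was correct.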
Fig. 1.
Two representations of a best overlap graph. In (a), the layout resembles a multiple sequence alignment. In (b) each read is represented by two nodes joined by an undirected edge. Arrows represent best overlaps, where best means covering the most sequence. There are mutual best overlaps between successive pairs of reads A through D. Due to erroneous bases at one end (wavy line), read E has a non-mutual best overlap to B. Paths span undirected and directed edges alternately. Path EBA converges on path ABCD. CABOG scores read E lower than the others since only three reads are on paths from it. Starting with any one of the high-scoring reads, CABOG would build initial unitig ABCD, then E. Using saved information about each path intersection, CABOG would discount the intersection at B because the path from E spanned only one read before B. It would break ABCD only if there were also a change in read arrival rate at B, which is not the case here. Although linear-time directed-path following finds the longest possible unitig in this synthetic case, it is not guaranteed to do so when paths span multiple intersections.
CABOG scores each read by the number of other reads reachable from it along BOG paths. Each read's score is the sum of the lengths of the paths from both of the read's nodes. CABOG finds path lengths efficiently by reusing saved path lengths to short-circuit path following.
CABOG sorts the reads by score. Starting at higher scoring reads, it follows paths and builds a unitig from each path. More precisely, starting at the next-highest scoring read R, it skips R if R is already in a unitig. Otherwise, it begins a new unitig with R and follows the two dovetail paths from R's two nodes, adding reads to the unitig until it encounters a path end or a read already belonging to some unitig.
This completes the greedy, aggressive phase of unitig construction. At this point, the unitigs partition the reads. The initial unitigs are called 'promiscuous' because their paths could span non-mutually best overlaps. The read visitation order ensures that the longest path through each intersection becomes a unitig first. Shorter convergent paths become shorter unitigs that terminate at the intersections. On paths with a single intersection, CABOG always selects the longer path first. On paths with multiple intersections, it can miss opportunities for even larger unitigs. Such unitigs would be revealed by making all overlap edges bi-directional in G. However, in such a graph, path following would be a non-linear operation.
Unitig splitting: CABOG breaks promiscuous unitigs at sites corresponding to selected path intersections. Each BOG path intersection can indicate a genomic repeat boundary or represent noise. CABOG uses heuristics designed to select most repeat-induced intersections while avoiding noise-induced intersections. Spurs are a common class of noise, and not all spurs would be corrected during the overlap-based trimming step. Spur-induced path intersections produce an 'intruder' path of length one. It would be incorrect to ignore all length=1 intruders; a valid read that spans a genomic repeat boundary can have no overlaps at its non-repetitive end due to random low coverage. Therefore, CABOG ignores most, but not all, length=1 intruders, according to the heuristics below.
A BOG path is called 'long' if its unitig contains more than one read and has a total sequence length >500 bases. CABOG visits every BOG path P and applies the following operations in order. If it splits P, then it also splits the corresponding unitig. (i) At each point where P is intersected by path L, it breaks P if L is long. (ii) Between every consecutive pair of intersections with P, if neither incident path is long, then it examines the bracketed interval of P. If the interval's read-arrival rate is approximately double or more compared to the surroundings, it breaks P at both intersection points. A path's arrival rate is measured by the average spacing between read starts in the corresponding unitig. (iii) If P has one or more incident paths that are not long, and if the intersection points correspond to changes in arrival rate in P, then CABOG chooses one such point and breaks P there. It chooses the intersection point beyond which P's read arrival rate changes most dramatically.
After intersection-based splitting, CABOG breaks unitigs further using mate constraints. First, CABOG incorporates into unitigs the contained reads, according to their best overlaps. It tabulates mate pair satisfaction and violation from mate pairs that co-locate to a unitig. Satisfaction means placement within the predicted mean length ± 5 SDs at the proper orientation relative to each other; anything else is a violation. CABOG ejects from unitigs any contained reads whose placement violates a mate constraint. It breaks unitigs where total mate coverage is sufficient but the number of violations is above a given threshold.
Contigs, scaffolds, consensus: the rest of the Celera Assembler pipeline runs without any special modification for hybrid assembly. Note the scaffold module may re-incorporate reads ejected previously for mate constraint violations; it can use mate constraints to guide individual reads to their appropriate contigs.
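The scoring and greedy path following can be sketched on the five-read example of Figure 1. The dictionary encoding of the best overlap graph and the function names are assumptions made for illustration. Unlike CABOG, this sketch recomputes path lengths instead of reusing saved ones, and it does not track read orientation.

```python
def opposite(end):
    return "3" if end == "5" else "5"


def walk(best, read, end):
    """Yield the reads reached by following best overlaps out of one read end.

    `best` maps (read, end) to (next_read, entry_end): the overlap enters
    next_read at entry_end, and the path leaves by its opposite end.
    """
    seen = {read}
    node = (read, end)
    while node in best:
        next_read, entry_end = best[node]
        if next_read in seen:  # guard only; cycles are assumed already removed
            break
        seen.add(next_read)
        yield next_read
        node = (next_read, opposite(entry_end))


def score(best, read):
    """Number of other reads reachable from both ends of `read`."""
    return sum(1 for end in ("5", "3") for _ in walk(best, read, end))


def build_unitigs(best, reads):
    """Greedy construction: visit reads by descending score."""
    order = sorted(reads, key=lambda r: (-score(best, r), r))
    placed, unitigs = set(), []
    for read in order:
        if read in placed:
            continue
        placed.add(read)
        left, right = [], []
        for side, end in ((left, "5"), (right, "3")):
            for nxt in walk(best, read, end):
                if nxt in placed:  # intersection with an existing unitig
                    break
                placed.add(nxt)
                side.append(nxt)
        unitigs.append(left[::-1] + [read] + right)
    return unitigs


# Figure 1: mutual best overlaps along A-B-C-D, plus E's non-mutual edge to B.
figure1 = {
    ("A", "3"): ("B", "5"), ("B", "5"): ("A", "3"),
    ("B", "3"): ("C", "5"), ("C", "5"): ("B", "3"),
    ("C", "3"): ("D", "5"), ("D", "5"): ("C", "3"),
    ("E", "5"): ("B", "3"),
}
unitigs = build_unitigs(figure1, "ABCDE")
```

In this encoding reads A through D each score 3 and E scores 2, so the walk starts on the A-to-D path and E is left as a one-read unitig that stops at its intersection with B.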
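The arrival-rate comparison in operation (ii) can be sketched as follows. The function names and the representation of a unitig as a list of read start coordinates are hypothetical, and the factor of 2.0 stands in for 'approximately double'; the heuristics CABOG actually applies are not specified here in enough detail to reproduce.

```python
def arrival_rate(n_starts, length):
    """Read starts per base; the reciprocal of the average start spacing."""
    return n_starts / length if length > 0 else 0.0


def interval_is_dense(starts, lo, hi, unitig_len, factor=2.0):
    """True if reads start at least `factor` times as often inside
    [lo, hi) as in the rest of the unitig."""
    inside = sum(lo <= s < hi for s in starts)
    outside = len(starts) - inside
    rate_in = arrival_rate(inside, hi - lo)
    rate_out = arrival_rate(outside, unitig_len - (hi - lo))
    if rate_out == 0.0:
        return False  # nothing to compare against
    return rate_in >= factor * rate_out


# Hypothetical 1000-base unitig: starts every 100 bases, but every 40 bases
# within the interval 400..600 bracketed by two path intersections.
starts = (list(range(0, 400, 100)) + list(range(400, 600, 40))
          + list(range(600, 1000, 100)))
dense = interval_is_dense(starts, 400, 600, unitig_len=1000)
```

An interval where reads start about twice as often as elsewhere is what a two-copy repeat collapsed into one unitig would be expected to look like, which is why the rate is used as corroborating evidence at short intruder paths.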
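The mate satisfaction test can be sketched as below. The `Placement` record and the function name are hypothetical, and 'proper orientation' is assumed here to mean that the two reads face each other on opposite strands; the text does not state the convention.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    start: int      # leftmost unitig coordinate of the read
    end: int        # rightmost coordinate (exclusive)
    forward: bool   # True if the read lies on the forward strand


def mate_satisfied(a: Placement, b: Placement, mean: float, sd: float,
                   n_sd: float = 5.0) -> bool:
    """True if two mates in one unitig face each other (assumed convention)
    and their outer separation is within mean +/- n_sd standard deviations."""
    left, right = (a, b) if a.start <= b.start else (b, a)
    facing = left.forward and not right.forward
    separation = right.end - left.start
    return facing and abs(separation - mean) <= n_sd * sd
```

For a 3 kb library with a hypothetical SD of 300 bases, mates placed 3000 bases apart and facing each other are satisfied, while the same pair placed 9100 bases apart, or on the same strand, counts as a violation.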
3 METHODS
Assemblers: the Celera Assembler was used for the CABOG, Goldberg and traditional Celera pipelines; version 5.0 from 5/2008 was used everywhere except the human trial, which used version 5.2 from 10/2008. The latest production version of Newbler (1.1.03.24) was used on FLX data, with the large option for the human trial. The software for Arachne, PCAP and Euler-SR was current through 5/2008. Velvet version 0.7, from 10/2008, ran with expected coverage set to 24. Assemblies ran under SuSE Linux on 64-bit Intel or AMD processors with 24 GB or 32 GB RAM, although the human assembly also exploited 48 GB of a high-RAM node. CABOG and Newbler were fed 454 reads in SFF format. Arachne was fed files from NCBI, slightly modified to satisfy the input parser. Euler-SR, PCAP and Velvet were fed files generated by CABOG's parser following instructions in each program's documentation.
Analysis: continuity statistics were gathered from each assembler using Perl analysis of the FASTA output files. Assembly alignments were generated with MUMmer (Kurtz et al., 2004), ATAC (http://kmer.sf.net) and Stretcher (http://emboss.sf.net). Repeat annotation was generated with REPuter (Kurtz et al., 2001) with a post-process to aggregate repeat classes by overlapping sequence. EST alignments were generated with the ESTmapper (http://kmer.sf.net) extension to Sim4 (Florea et al., 1998).
Reference: the Psychromonas sp. CNPT3 reference (RefSeq NZ_AAPG00000000), with 2 945 265 bases in one linear contig, had been produced at JCVI using Celera Assembler plus finishing. The Porphyromonas gingivalis W83 reference (GenBank NC_AE015924), with 2 343 476 bases in one circular contig, had been sequenced by Sanger chemistry, assembled with TIGR Assembler, and finished at TIGR/JCVI (Nelson et al., 2003). The Sanger reads assembled here were a distinct set. The Escherichia coli K12 MG1655 reference (GenBank NC_000913), with 4 639 221 bases in a circular contig, had been produced independently by a method other than whole-genome shotgun sequencing (Blattner et al., 1997). There was no reference for Cryptosporidium muris RN66, a eukaryotic genome estimated at 9 Mb. The ESTs were obtained from NCBI via CryptoDB.
Reads: many reads were obtained directly from JCVI. All reads are available at the NCBI Trace Archive or Short Read Archive (see Supplementary Material for details). The homogeneous component sets of reads were combined to make hybrid datasets with realistic levels of genome coverage.
4 RESULTS
Reads were combined from several genome sequencing projects as shown in Table 1. Pyrosequencing reads were used in the half-plate units provided by 454 FLX sequencers. Half-plates of unpaired reads (∼250 bases/read) were combined with Sanger mate pairs (∼800 bases/read) to make hybrid sets. Half-plates from paired-end libraries were considered hybrid sets in themselves because they consist of mostly unpaired reads mixed with some (∼30%) mate pairs (∼100 bases/read).
Table 1.
Homogeneous components for hybrid datasets
Sp | Cmp | Library | #Unmated | Len | #Mated | Len | Cov |
---|---|---|---|---|---|---|---|
P. gingivalis | F1 | FLX unmated | 255 329 | 259 | 0 | - | 28.2 |
P. gingivalis | F2 | FLX unmated | 254 703 | 259 | 0 | - | 28.2 |
P. gingivalis | M1 | FLX 3-6Kbp | 184 680 | 243 | 80 304 | 116 | 23.1 |
P. gingivalis | M2 | FLX 3-6Kbp | 187 012 | 243 | 81 926 | 116 | 23.4 |
P. gingivalis | S1 | Sanger 40Kbp | 90 | 601 | 2786 | 728 | 1.0 |
E. coli | F1 | FLX unmated | 230 517 | 253 | 0 | - | 12.6 |
E. coli | F2 | FLX unmated | 216 458 | 253 | 0 | - | 11.8 |
E. coli | M1 | FLX 3-6Kbp | 234 299 | 232 | 65 118 | 115 | 13.3 |
P. CNPT3 | F1 | FLX unmated | 298 610 | 266 | 0 | - | 26.0 |
P. CNPT3 | F2 | FLX unmated | 278 142 | 267 | 0 | - | 24.3 |
P. CNPT3 | S1 | Sanger 40Kbp | 38 | 537 | 1522 | 830 | 0.4 |
C. muris | F1 | FLX unmated | 434 956 | 243 | 0 | - | 11.7 |
C. muris | S1 | Sanger 40Kbp | 3272 | 434 | 21 092 | 713 | 1.7 |
C. muris | | Sanger 6-8Kbp | 4108 | 727 | 17 382 | 892 | 1.7 |
C. muris | | Sanger 2-3Kbp | 2652 | 508 | 27 296 | 826 | 2.7 |
Sequence contribution from each component dataset. Sp, species name; Cmp, component name; Unmated/Mated, number of non-paired or paired-end reads; Len, for unmated and mated, the average clear range per read in bases; Cov, fold coverage of the genome by reads. FLX reads originate from the 454 GS FLX sequencer. Sanger reads originate from the ABI 3730 sequencer.
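The Cov column can be checked from the other columns. The sketch below assumes that fold coverage is total clear-range bases divided by genome length, a formula that is not stated explicitly; with the reference lengths given in Methods it reproduces the tabulated values for the rows shown. The function name is hypothetical.

```python
def fold_coverage(n_unmated, len_unmated, n_mated, len_mated, genome_size):
    """Total clear-range bases divided by genome length (assumed formula)."""
    total_bases = n_unmated * len_unmated + n_mated * len_mated
    return total_bases / genome_size


P_GINGIVALIS = 2_343_476  # reference length in bases
E_COLI = 4_639_221

# Rows from Table 1: P. gingivalis F1 and M1, E. coli M1.
pg_f1 = round(fold_coverage(255_329, 259, 0, 0, P_GINGIVALIS), 1)
pg_m1 = round(fold_coverage(184_680, 243, 80_304, 116, P_GINGIVALIS), 1)
ec_m1 = round(fold_coverage(234_299, 232, 65_118, 115, E_COLI), 1)
```

The mated column counts reads rather than pairs under this reading, since each mated read contributes its own clear range to the total.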
CABOG and other assemblers were run on each combination dataset. Contig and scaffold statistics were tabulated for every assembly by an automated process. CABOG assemblies were compared with reference genomes and also with the outputs from other assemblers. The comparisons included the recent version of Newbler designed to handle FLX mate and Sanger mate hybrid sets, and Euler-SR, which had been demonstrated on a hybrid set of 454 GS 20 reads plus simulated Sanger mates (Chaisson and Pevzner, 2008). Velvet was tested on one dataset; it was designed for short reads but recent versions also accept long reads. PCAP, Arachne and the traditional Celera Assembler were included though they were designed for Sanger reads only. The last two were abandoned part way through testing after they produced fractured or no assemblies of several datasets. The Goldberg pipeline (Goldberg et al., 2006), which applies Newbler to pyrosequencing reads and Celera Assembler to Sanger mates, was run on those sets that included Sanger data.
4.1 Contig analysis
Contig size is one measure of assembly utility. Table 2 presents four contig size statistics for assemblies of selected hybrid datasets, with CABOG assemblies compared with Newbler assemblies.
Table 2.
CABOG and Newbler assemblies of hybrid data sets
Assembler | #Contigs | Contig N50 | Contig Max | Contig Sum |
---|---|---|---|---|
P.gingivalis / FLX reads+FLX mates (F1+M2) | | | | |
CABOG | 48 | 67 993 | 205 585 | 2 332 097 |
Newbler | 119 | 27 561 | 134 859 | 2 183 278 |
P.gingivalis / FLX reads+Sanger mates (F1+S1) | | | | |
CABOG | 65 | 51 745 | 169 923 | 2 266 305 |
Newbler | 104 | 32 377 | 154 008 | 2 184 009 |
P.gingivalis / FLX mates+Sanger mates (M2+S1) | | | | |
CABOG | 34 | 101 101 | 307 732 | 2 314 836 |
Newbler | 115 | 29 216 | 110 686 | 2 179 717 |
E.coli / FLX reads+FLX mates (F2+M1) | | | | |
CABOG | 22 | 440 632 | 861 331 | 4 642 198 |
Newbler | 87 | 87 223 | 240 232 | 4 516 116 |
P.sp CNPT3 / FLX reads+Sanger mates (F1+S1) | | | | |
CABOG | 39 | 126 165 | 336 216 | 2 992 650 |
Newbler | 70 | 79 879 | 203 365 | 2 963 428 |
P.sp CNPT3 / FLX reads+Sanger mates (F2+S1) | | | | |
CABOG | 42 | 138 508 | 365 104 | 2 983 118 |
Newbler | 99 | 45 693 | 171 391 | 2 951 683 |
C.muris / FLX reads+Sanger mates (F1+S1) | | | | |
CABOG | 69 | 323 162 | 819 035 | 9 186 849 |
Newbler | 73 | 247 897 | 731 211 | 9 097 078 |
The analysis included all contigs 2 kb or longer found in each assembler's FASTA output. N50, the length of the shortest contig required to span 50% of the genome length; Max, the length of the longest contig; Sum, the total contig span. Contig size statistics are shown in bases. The codes in parentheses refer to component datasets described in Table 1. Assemblies are compared by contig size statistics. Selected combinations are shown; others are provided in the Supplementary Material.
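The N50 statistic as defined in this caption, measured against genome length rather than total assembly span, can be computed with a short sketch. The function name and the toy contig lengths are hypothetical; the 2 kb cutoff follows the caption.

```python
def n50(contig_lengths, genome_length, min_len=2000):
    """Length of the shortest contig needed, taking contigs from longest to
    shortest, to span half the genome. Returns None if the contigs of at
    least min_len bases together span less than half the genome."""
    target = 0.5 * genome_length
    covered = 0
    kept = (length for length in contig_lengths if length >= min_len)
    for length in sorted(kept, reverse=True):
        covered += length
        if covered >= target:
            return length
    return None


# Hypothetical assembly of a 10 kb genome; the 1 kb contig is filtered out.
example = n50([3000, 2500, 1000], genome_length=10_000)
```

Measuring against a fixed genome length keeps the statistic comparable between assemblers whose total contig spans differ.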
The differences in Table 2 are clearly significant. On average, CABOG's largest contig was twice Newbler's. Its N50 contig was more than twice as large. CABOG consistently assembled more total bases into fewer (larger) contigs. Thus, CABOG demonstrated greater continuity than Newbler on these data.
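The averages quoted here can be rechecked from the Table 2 rows. The sketch below copies the seven dataset rows into a dictionary and computes the mean CABOG-to-Newbler ratio for each column; the variable and function names are illustrative, and the unweighted mean of per-dataset ratios is an assumption about how "on average" was meant.

```python
from statistics import mean

# Per dataset: assembler -> (#contigs, N50, max, sum), copied from Table 2.
TABLE2 = {
    "P.gingivalis F1+M2": {"CABOG": (48, 67993, 205585, 2332097),
                           "Newbler": (119, 27561, 134859, 2183278)},
    "P.gingivalis F1+S1": {"CABOG": (65, 51745, 169923, 2266305),
                           "Newbler": (104, 32377, 154008, 2184009)},
    "P.gingivalis M2+S1": {"CABOG": (34, 101101, 307732, 2314836),
                           "Newbler": (115, 29216, 110686, 2179717)},
    "E.coli F2+M1": {"CABOG": (22, 440632, 861331, 4642198),
                     "Newbler": (87, 87223, 240232, 4516116)},
    "P.sp CNPT3 F1+S1": {"CABOG": (39, 126165, 336216, 2992650),
                         "Newbler": (70, 79879, 203365, 2963428)},
    "P.sp CNPT3 F2+S1": {"CABOG": (42, 138508, 365104, 2983118),
                         "Newbler": (99, 45693, 171391, 2951683)},
    "C.muris F1+S1": {"CABOG": (69, 323162, 819035, 9186849),
                      "Newbler": (73, 247897, 731211, 9097078)},
}
COUNT, N50, MAX, SUM = range(4)


def mean_ratio(column):
    """Unweighted mean of the per-dataset CABOG/Newbler ratio for one column."""
    return mean(row["CABOG"][column] / row["Newbler"][column]
                for row in TABLE2.values())


mean_n50_ratio = mean_ratio(N50)
mean_max_ratio = mean_ratio(MAX)
# True when CABOG has fewer contigs and more assembled bases on every dataset.
fewer_and_larger = all(
    row["CABOG"][COUNT] < row["Newbler"][COUNT]
    and row["CABOG"][SUM] > row["Newbler"][SUM]
    for row in TABLE2.values()
)
print(f"mean N50 ratio: {mean_n50_ratio:.2f}")
print(f"mean max ratio: {mean_max_ratio:.2f}")
print(f"fewer contigs and more bases on every dataset: {fewer_and_larger}")
```

The per-dataset ratios vary widely, so the means summarize the trend rather than any single genome.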
In Table 2, the rows for P.gingivalis F1+S1 and M2+S1 offer a comparison between sets containing the same Sanger reads (S1) but distinct FLX reads. The FLX paired-end reads in M2 give that set slightly shorter reads on average, but also additional mate constraints. In the combination with M2, the CABOG contig N50 doubled but the Newbler value actually dropped. CABOG may exploit mate constraints more fully during contig construction. It has been observed that Newbler uses mate constraints mostly to join contigs in scaffolds (Jarvie and Harkins, 2008).
In Table 2, the rows for P.gingivalis F1+M2 and E.coli F2+M1 are devoid of Sanger data. On both of these sets, CABOG assembled over 120 000 more total bases into 1/2 to 1/4 as many contigs. Thus, CABOG provided more continuity than Newbler on 454-only sets that included mate data. To address the question of whether the mates were critical, CABOG and Newbler were further tested on homogeneous sets of only 454 unpaired reads. Here, the statistics were similar between assemblers and the assembler ranking varied. This provides additional support for the observation that CABOG better exploits mate constraints during contig construction.
The genomes in Table 2 include three prokaryotes and a small eukaryotic genome from C.muris. Thus, CABOG provided more continuity across both domains. CABOG's gain over Newbler was smallest for the eukaryote, perhaps because coverage was lowest on that dataset.
CABOG's extra sequence may represent genomic repeats. To investigate this hypothesis, the P.gingivalis reference was annotated for repeats. CABOG and Newbler contigs for six datasets were mapped to the annotated reference. The repeat and contig spans were compared for overlap. The result indicated that CABOG contigs spanned more repeats and longer repeats in all the assemblies. In one example, using the F1+M2 combination, CABOG spanned 34 repeats of average length 1099 bases, but Newbler spanned 14 repeats of average length 703 bases. The difference was more pronounced in the M2+S1 combination, which included long-range Sanger mates. CABOG contigs spanned 26 repeats of average length 1981 bases. Newbler contigs spanned nine repeats of average length 815 bases. Thus, some of CABOG's continuity gain is attributable to increased resolution of repetitive sequence within assembled contigs, which increases with mate availability.
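The repeat-spanning comparison can be illustrated with a small interval test. The text does not state the exact criterion, so this sketch assumes that a contig "spans" a repeat when a single contig alignment fully contains the annotated repeat interval on the reference. The function names and the toy coordinates are hypothetical.

```python
def spanned_repeats(contig_alignments, repeats):
    """Return the repeats fully contained in at least one contig alignment.

    Both arguments are lists of (start, end) reference coordinates,
    half-open, with start < end.
    """
    spanned = []
    for r_start, r_end in repeats:
        if any(c_start <= r_start and r_end <= c_end
               for c_start, c_end in contig_alignments):
            spanned.append((r_start, r_end))
    return spanned


def summarize(spanned):
    """Count of spanned repeats and their average length in bases."""
    count = len(spanned)
    average = sum(end - start for start, end in spanned) / count if count else 0.0
    return count, average


# Toy reference annotation and contig alignments (not data from the paper).
repeats = [(100, 900), (5000, 6200), (9000, 9500)]
contigs = [(0, 3000), (5500, 12000)]
count, average_length = summarize(spanned_repeats(contigs, repeats))
print(count, average_length)
```

In the toy data the middle repeat straddles a contig boundary, so it is not counted as spanned even though a contig overlaps part of it.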
Table 3 shows contig size statistics on one dataset for all assemblers tested. PCAP produced surprisingly large contigs considering it was not designed for hybrid data. The table is representative of results on other datasets, provided as Supplementary Material. The statistics consistently ranked CABOG first, followed by Goldberg (when run), Newbler, PCAP and Euler-SR.
Table 3.
Assemblies of one hybrid data set by all assemblers
Assembler | #Contigs | Contig N50 | Contig max | Contig sum |
---|---|---|---|---|
Due east.coli / FLX reads+FLX mates (F1+M1) | ||||
CABOG | 27 | 285 910 | 833 636 | 4 629 501 |
Newbler | 89 | 82 668 | 209 279 | 4 519 532 |
PCAP | 152 | 50 897 | 175 160 | 4 554 652 |
Euler-SR | 328 | 22 159 | 71 505 | 4 343 338 |
Velvet | 490 | 11 510 | 53 664 | 4 230 559 |
The analysis is described in Table 2. Only CABOG and Newbler were designed for FLX hybrid datasets. Euler-SR had been introduced for 454 GS 20 reads+Sanger mates. PCAP was designed for Sanger mates only. Velvet was designed for short reads. The Goldberg method was not run since it requires Sanger mates to improve Newbler contigs. Arachne and the traditional Celera Assembler did not assemble this dataset. The assemblies are summarized and compared using contig length statistics.
4.2 Scaffold analysis
Scaffold size is another measure of assembly utility. Table 4 presents scaffold size statistics for CABOG and Newbler assemblies of selected combinations of P.gingivalis data. The table indicates that CABOG scaffolds were significantly larger. All but two measurements favored CABOG. In one of the exceptions, Newbler's largest scaffold was longer than CABOG's on the M2 set. It may be significant that this was a low-coverage dataset.
Table 4.
Scaffold analysis of CABOG and Newbler assemblies
Assembler | #Scaf. | Scaf. N50 | Scaf. max | Scaf. sum | Cov. (%) |
---|---|---|---|---|---|
P.gingivalis / FLX mates (M2) | |||||
CABOG | 7 | 392 892 | 661 267 | 2 324 483 | 98.7 |
Newbler | 9 | 268 678 | 718 704 | 2 187 430 | 94.1 |
P.gingivalis / FLX mates+FLX mates (M1+M2) | |||||
CABOG | 7 | 417 898 | 758 093 | 2 339 970 | 98.9 |
Newbler | 11 | 266 698 | 718 559 | 2 183 668 | 93.9 |
P.gingivalis / FLX reads+FLX mates (F1+M2) | |||||
CABOG | 9 | 450 308 | 758 275 | 2 335 950 | 98.8 |
Newbler | | 382 223 | 720 519 | 2 189 593 | 94.2 |
P.gingivalis / FLX reads+Sanger mates (F1+S1) | |||||
CABOG | 6 | 1 507 760 | 1 507 760 | 2 268 548 | 96.6 |
Newbler | 51 | 1 489 797 | 1 489 797 | 2 185 214 | 94.3 |
P.gingivalis / FLX mates+Sanger mates (M2+S1) | |||||
CABOG | 1 | 2 317 095 | 2 317 095 | 2 317 095 | 98.7 |
Newbler | 6 | 1 550 861 | 1 550 861 | 2 184 352 | 93.9 |
The analysis included all scaffolds 2 kb or longer found in each assembler's FASTA output. Scaffold length statistics are shown in bases excluding the lengths of the gaps between contigs. Note that scaffold sum may not equal contig sum (Table 2) due to the 2 kb threshold being applied at the scaffold not contig level. Cov, bases of the reference covered by a sum over single best alignments of each full or partial scaffold sequence.
In one case, CABOG produced exactly one scaffold, and it covered 99% of the reference sequence. That dataset, M2+S1, included short-range FLX mates and long-range Sanger mates. It is possible that the high concentration of mate constraints, or the combination of mate distances, enabled CABOG to resolve a single-scaffold assembly.
Scaffold span is an alternative measure of scaffold size. Span includes the estimated lengths of the gaps between the contigs, as well as the contig lengths. On many datasets, Newbler scaffold span statistics exceeded those of CABOG.
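The two size measures differ only in whether gap estimates are counted. A minimal sketch follows, assuming a scaffold is represented as an ordered list of contig lengths plus one gap estimate between each adjacent pair; this representation and the function names are illustrative and not taken from either assembler's output format.

```python
def scaffold_length(contig_lengths):
    """Scaffold size as reported in Table 4: contig bases only, gaps excluded."""
    return sum(contig_lengths)


def scaffold_span(contig_lengths, gap_estimates):
    """Scaffold span: contig bases plus the estimated gaps between contigs."""
    expected_gaps = max(len(contig_lengths) - 1, 0)
    if len(gap_estimates) != expected_gaps:
        raise ValueError("need exactly one gap estimate between adjacent contigs")
    return sum(contig_lengths) + sum(gap_estimates)


# Toy scaffold of three contigs joined by two estimated gaps (made-up values).
contigs = [120000, 45000, 80000]
gaps = [1500, 300]
print(scaffold_length(contigs))
print(scaffold_span(contigs, gaps))
```

Because span depends on gap estimates, an assembler that joins contigs across larger estimated gaps can report a larger span while placing fewer bases in contigs.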
4.3 Assembly correctness
Selected assemblies were tested for their coverage of the reference genome sequence. Table 4 indicates the genome coverage provided by CABOG and Newbler assemblies of hybrid sets of P.gingivalis reads. CABOG coverage was consistently above 96% and was always higher than Newbler coverage. This test considered best matches only, so collapsed assemblies of repeat copies would cover only one repeat copy.
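A reference coverage percentage of this kind can be sketched as the union of best-alignment intervals divided by the reference length. The sketch assumes that each scaffold segment contributes one best alignment as a half-open (start, end) interval on the reference and that overlapping alignments are counted once; the paper does not spell out either detail, and the function names are hypothetical.

```python
def covered_bases(intervals):
    """Number of reference bases in the union of half-open (start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block before starting a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def coverage_percent(intervals, reference_length):
    """Percentage of the reference covered by the given alignments."""
    return 100.0 * covered_bases(intervals) / reference_length


# Toy best alignments on a 1000-base reference (not data from the paper).
alignments = [(300, 700), (0, 400), (900, 1000)]
print(coverage_percent(alignments, 1000))
```

Under the best-match rule, a contig that collapses two repeat copies aligns to only one of them, so the second copy on the reference stays uncovered.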
The same assemblies were measured for consensus accuracy. The alignments to reference were parsed to count all inserted, deleted, substituted and unmapped bases. Accuracy was expressed as the fraction of assembled bases that did not fall into one of these categories. For the 4 datasets in Table 4, CABOG accuracy varied between 99.932% and 99.980%. Newbler accuracy varied between 99.995% and 99.998%.
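The accuracy measure described here reduces to a simple fraction. The sketch below follows the wording of the text: the four error counts are summed and divided by the number of assembled bases. The function name and the toy counts are assumptions, and how each count is extracted from the alignments is not specified in the text.

```python
def consensus_accuracy(assembled_bases, inserted, deleted, substituted, unmapped):
    """Fraction of assembled bases not counted in any of the four error categories."""
    if assembled_bases <= 0:
        raise ValueError("assembled_bases must be positive")
    errors = inserted + deleted + substituted + unmapped
    return 1.0 - errors / assembled_bases


# Toy counts chosen for illustration (not data from the paper).
accuracy = consensus_accuracy(
    assembled_bases=1_000_000,
    inserted=120, deleted=80, substituted=150, unmapped=50,
)
print(f"{accuracy:.3%}")
```

At the accuracies reported, a difference of a few hundredths of a percent corresponds to a few hundred bases per megabase of assembled sequence.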
Selected alignments were assessed by visual inspection to reveal assembly errors, such as mis-oriented or misordered contigs within scaffolds. No errors were found at the scaffold level. Some rearrangements within CABOG contigs were noticed; these were also revealed by the subsequent analysis.
Next, alignments of contigs were inspected in detail. The analysis relied on manual and scripted review of textual representations of alignments. It covered four CABOG assemblies of P.gingivalis, four CABOG assemblies of P.sp CNPT3, and the corresponding Newbler assemblies. Breaks in the alignments were counted and inspected. Results based on MUMmer were confirmed by analyses with other software.
Table 5 lists some minor problems found in CABOG contigs: bad ends, bad contigs and collapsed tandem repeats. Collapse of tandem repeats has been observed before in Celera Assembler (She et al., 2004) and other assemblers. Most CABOG collapses involved the omission of <100 bases. CABOG's bad end and bad contig problems also involved small (under 1 kb) bits of sequence. CABOG assemblies also showed more serious problems: chimeric joins and chimeric ends. The analysis of Newbler assemblies (data not shown) revealed no serious problems and three minor problems. Note that Newbler's rate of collapsed tandem repeats would have been underestimated because only contigs that spanned a repeat region could contribute to the alignment breaks that were counted here.
Table 5.
Errors in CABOG assemblies
Genome | Dataset | Chimeric join | Chimeric end | Bad end | Bad contig | Collapsed tandem |
---|---|---|---|---|---|---|
P.gingivalis | F1 | 0 | 0 | 0 | 0 | 4 |
P.gingivalis | F1+M2 | 3 | 8 | 1 | 1 | 11 |
P.gingivalis | F1+S1 | 0 | 0 | 1 | 0 | 7 |
P.gingivalis | M2+S1 | 0 | 1 | 2 | 0 | 9 |
P.sp CNPT3 | F1 | 0 | 1 | 0 | 0 | 1 |
P.sp CNPT3 | F2 | 0 | 2 | 0 | 0 | 4 |
P.sp CNPT3 | F1+S1 | 0 | 0 | 0 | 0 | 1 |
P.sp CNPT3 | F2+S1 | 0 | 0 | 0 | 0 | 0 |
The analysis included contigs at least 2 kb long. Chimeric join, a concatenation of unrelated sequences of at least 1 kb. Chimeric end, concatenation of less than 1 kb to a contig end. Bad end, less than 1 kb of unaligned sequence at a contig end. Bad contig, unaligned contig. Collapsed tandem, multiple alignments between a contig and the reference, partially overlapping in either sequence. Errors were estimated by analysis of alignments to reference sequences. Estimates were confirmed by two other alignment-based methods.
The chimeric joins in the CABOG assemblies corresponded to repetitive regions of the reference genome sequences. In no case did a Newbler contig span the corresponding region. Thus, it appears that CABOG was more aggressive than Newbler about including whole repeats within larger contigs while committing false joins in a few repetitive regions.
On both genomes for which alignments were studied, the chimer rate dropped when the S1 (Sanger mates) set was included. This is consistent with CABOG's use of long-range mate constraints to correct mis-assembly errors within unitigs. Sanger sequencing provided the long-range mates here, but long-range mate constraints may be available soon from the pyrosequencing platform (Jarvie and Harkins, 2008). In summary, CABOG has a chimeric join rate that may be acceptably low for some genome projects, and that is diminished by inclusion of long-range mate data.
Assemblies of the C.muris genome were validated by EST mapping since no independent reference was available. All available ESTs were mapped to the CABOG and Newbler scaffolds with a threshold of 95% identity over 95% of EST length. No EST mapping spanned multiple scaffolds in either assembly. Of 27 498 ESTs, there were 14 148 unspliced and 2312 spliced alignments to CABOG's assembly. Thus, over half the ESTs confirmed CABOG scaffolds by full-length, high-stringency alignments. There were 13 214 unspliced and 1883 spliced alignments to Newbler's assembly. Thus, the CABOG assembly showed a higher rate of EST confirmation than the Newbler assembly.
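The EST confirmation rates follow directly from the counts in this paragraph. The sketch below recomputes them and shows one way the 95%/95% acceptance rule could be expressed. The function `passes_threshold` and its argument names are hypothetical, and the sketch assumes each counted alignment corresponds to a distinct EST.

```python
TOTAL_ESTS = 27498
# Assembler -> (unspliced alignments, spliced alignments), from the text.
ALIGNED = {"CABOG": (14148, 2312), "Newbler": (13214, 1883)}


def passes_threshold(identical_bases, aligned_bases, est_length,
                     min_identity=0.95, min_fraction=0.95):
    """Accept an alignment with >= 95% identity over >= 95% of the EST length."""
    if aligned_bases == 0 or est_length == 0:
        return False
    identity = identical_bases / aligned_bases
    fraction = aligned_bases / est_length
    return identity >= min_identity and fraction >= min_fraction


def confirmation_rate(unspliced, spliced, total=TOTAL_ESTS):
    """Fraction of all ESTs with an accepted alignment to the assembly."""
    return (unspliced + spliced) / total


rates = {name: confirmation_rate(*counts) for name, counts in ALIGNED.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

With these counts the CABOG rate comes to roughly 60% and the Newbler rate to roughly 55%, consistent with the statement that over half the ESTs confirmed CABOG scaffolds.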
4.4 Large genomes
Large eukaryotic genome projects present additional problems of scale and complexity. To test whether CABOG would scale up to such problems, it was applied to human genome data. It was run on a hybrid set consisting of 6X 454 FLX unmated reads from the Watson genome project (Wheeler et al., 2008) plus 3X in 10 kb and larger Sanger mate pairs from the Venter genome project (Levy et al., 2007). The computation consumed 5209 CPU hours over 5 days on our grid. The assembly's statistics included:
- Contig count = 145 971
- Contig N50 = 36 460 bp
- Contig max = 310 470 bp
- Contig sum = 2 715 539 585 bp
- Scaffold N50 = 10 913 700 bp
Correctness is more difficult to evaluate on larger genomes. Using the NCBI B36 human reference sequence, a whole-genome alignment was generated by the ATAC method (Istrail et al., 2004). Reference coverage by ungapped matches was 97%, indicating completeness and short-range agreement. A measure of long-range agreement was provided by the maximal one-to-one mappings between reference chromosomes and assembled scaffolds 2 kb or longer. These mappings span at most one chromosome and one scaffold. Ninety-three percent of mapped scaffolds were included in exactly one mapping. The 223 discontinuously mapped scaffolds could indicate incorrect assembly or other factors including reference errors, population differences or alignment artifacts. For comparison, the Venter assembly was reported to have 12 chimera (Levy et al., 2007) though it has 116 discontinuous mappings by this technique. Thus, CABOG produced a reasonable assembly of the human genome from this hybrid mixture of pyrosequencing reads plus mates. On the same dataset, Newbler reported overflow conditions and terminated.
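The long-range agreement measure used in this section, the share of mapped scaffolds that take part in exactly one chromosome-to-scaffold mapping, can be sketched as a simple tally. The input format, a list of (chromosome, scaffold) pairs with one pair per maximal mapping, is an assumption made for illustration and is not the ATAC output format.

```python
from collections import Counter


def one_mapping_fraction(mappings):
    """Return (fraction of scaffolds in exactly one mapping, discontinuous scaffolds).

    `mappings` is an iterable of (chromosome, scaffold) pairs, one per
    maximal one-to-one mapping.
    """
    per_scaffold = Counter(scaffold for _chromosome, scaffold in mappings)
    if not per_scaffold:
        return 0.0, []
    discontinuous = sorted(s for s, n in per_scaffold.items() if n > 1)
    fraction = 1 - len(discontinuous) / len(per_scaffold)
    return fraction, discontinuous


# Toy mappings (hypothetical identifiers, not data from the paper).
toy_mappings = [
    ("chr1", "scf1"), ("chr1", "scf2"),
    ("chr2", "scf3"), ("chr3", "scf3"),
    ("chr4", "scf4"),
]
fraction, discontinuous = one_mapping_fraction(toy_mappings)
print(fraction, discontinuous)
```

A scaffold listed as discontinuous here is only a candidate mis-assembly, since reference errors or alignment artifacts can split a mapping as well.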
5 DISCUSSION
The rapid recent emergence of new sequencing technologies has made it hard for assembly software to keep pace. Particularly challenging has been the problem of assembling heterogeneous mixtures of data so as to exploit the relative advantage of each data type. The hybrid assembly problem is new but it will retain importance as long as different platforms each offer different characteristics and compelling advantages. It is not surprising that assemblers, such as Newbler and Velvet, explicitly support hybrid datasets. Hybrid assembly software is critical even for some seemingly homogeneous data. The 454 paired-end protocol produces a mixture of paired and non-paired reads, where the paired reads are less than half the length of the non-paired, on average. This phenomenon would persist even if the new 454 FLX 'Titanium' upgrade is able to deliver Sanger-length reads.
Here, we described improvements to the Celera Assembler that were embodied in a pipeline called CABOG. CABOG parses native 454 output and Sanger reads. It handles mate pairs of either type alone or in combination. These abilities make CABOG a versatile tool for modern assembly tasks.
CABOG assemblies of heterogeneous data compare favorably to those produced by other assembly software. CABOG assembles more bases into fewer and larger contigs and scaffolds. CABOG is more aggressive than Newbler at repeat resolution. Its large-contig and large-scaffold output would provide more substrate for manual review and automated annotation. CABOG is a valuable tool for projects where repeat resolution is desirable.
CABOG can generate mis-assemblies, but the problem appears to be mitigated by inclusion of long-range mate data. Indeed, CABOG makes broad use of mate constraints to build larger contigs, to span repeats, and to avoid mis-assemblies. CABOG should be valuable to sequencing projects that include mate pairs, whether those are derived from Sanger sequencing or pyrosequencing. With the expected availability of long-range, as well as short-range, mates from the 454 GS FLX platform, CABOG could become the preferred assembler for projects with 100% FLX data.
CABOG used Celera Assembler's consensus module without any modification specific to pyrosequencing reads. CABOG's consensus accuracy, though high, is less than Newbler's. Thus, the consensus module may need to be tuned for pyrosequencing reads.
To our knowledge, CABOG is the only software capable of calculating a de novo assembly of the human genome from pyrosequencing and Sanger whole-genome shotgun reads. CABOG is a modification to Celera Assembler, which had previously assembled Sanger-only data from human (Levy et al., 2007). The 454 technology had previously been applied to sequencing an individual human (Wheeler et al., 2008) and to comparing individual humans (Korbel et al., 2007), though neither experiment employed de novo whole-genome shotgun assembly. Our test on human data showed that CABOG is able to run on a large-eukaryotic dataset within the memory limitations of modern computers. Our attention has shifted toward the testing and tuning of CABOG for hybrid datasets from large genomes. Other proportions of mated reads and mate distances, or further adjustment to the software, may refine CABOG's large-genome capabilities.
ACKNOWLEDGEMENTS
Gennady Denisov, Aaron Halpern, Saul Kravitz, Laura Sheahan, Tim Stockwell, Shibu Yooseph and an anonymous reviewer provided helpful feedback. Swapna Annavarapu, Les Foster, Hernan Lorenzi, Diana Radune, Joana Da Silva and Indresh Singh assisted with the data preparation.
Funding: NIAID (contract No. HHSN266200400038C), 'Bioinformatics Resource Centers for Biodefense and Emerging/Re-emerging Infectious Diseases'; NIAID (contract No. N01-AI-30071), 'Microbial Genome Centers'; NIGMS (grant R01-GM077117); the J. Craig Venter Institute.
Conflict of Interest: none declared.
REFERENCES
Whole-genome re-sequencing, Curr. Opin. Genet. Dev., 2006, vol. 16 (pg. 545-552)
The complete genome sequence of Escherichia coli K-12, Science, 1997, vol. 277 (pg. 1453-1474)
Short read fragment assembly of bacterial genomes, Genome Res., 2008, vol. 18 (pg. 324-330)
DNA sequence quality trimming and vector removal, Bioinformatics, 2001, vol. 17 (pg. 1093-1104)
Consensus generation and variant detection by Celera Assembler, Bioinformatics, 2008, vol. 24 (pg. 1035-1040)
A computer program for aligning a cDNA sequence with a genomic DNA sequence, Genome Res., 1998, vol. 8 (pg. 967-974)
A Sanger/pyrosequencing hybrid approach for the generation of high-quality draft assemblies of marine microbial genomes, Proc. Natl Acad. Sci. USA, 2006, vol. 103 (pg. 11240-11245)
Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, 1997, Cambridge, UK, Cambridge University Press
Advanced sequencing technologies and their wider impact in microbiology, J. Exp. Biol., 2007, vol. 210 (pg. 1518-1525)
Generating a genome assembly with PCAP, Curr. Protoc. Bioinformatics, 2005, Chap. 11, Unit 11.3
Whole-genome shotgun assembly and comparison of human genome assemblies, Proc. Natl Acad. Sci. USA, 2004, vol. 101 (pg. 1916-1921)
Whole-genome sequence assembly for mammalian genomes: Arachne 2, Genome Res., 2003, vol. 13 (pg. 91-96)
De novo assembly and genomic structural variation analysis with genome sequencer FLX 3K long-tag paired end reads, Biotechniques, 2008, vol. 44 (pg. 829-831)
Paired-end mapping reveals extensive structural variation in the human genome, Science, 2007, vol. 318 (pg. 420-426)
REPuter: the manifold applications of repeat analysis on a genomic scale, Nucleic Acids Res., 2001, vol. 29 (pg. 4633-4642)
Versatile and open software for comparing large genomes, Genome Biol., 2004, vol. 5 pg. R12
The diploid genome sequence of an individual human, PLoS Biol., 2007, vol. 5 pg. e254
Genome sequencing in microfabricated high-density picolitre reactors, Nature, 2005, vol. 437 (pg. 376-380)
A whole-genome assembly of Drosophila, Science, 2000, vol. 287 (pg. 2196-2204)
Complete genome sequence of the oral pathogenic bacterium Porphyromonas gingivalis strain W83, J. Bacteriol., 2003, vol. 185 (pg. 5591-5601)
An Eulerian path approach to DNA fragment assembly, Proc. Natl Acad. Sci. USA, 2001, vol. 98 (pg. 9748-9753)
A preprocessor for shotgun assembly of large genomes, J. Comput. Biol., 2004, vol. 11 (pg. 734-752)
Genome Sequencer FLX Data Analysis Software Manual, 2007, Mannheim, Germany, Roche Applied Science
Shotgun sequence assembly and recent segmental duplications within the human genome, Nature, 2004, vol. 431 (pg. 927-930)
TIGR Assembler: a new tool for assembling large shotgun sequencing projects, Genome Sci. Technol., 1995, vol. 1 (pg. 9-19)
The complete genome of an individual by massively parallel DNA sequencing, Nature, 2008, vol. 452 (pg. 872-876)
Figaro: a novel statistical method for vector sequence removal, Bioinformatics, 2008, vol. 24 (pg. 462-467)
454 sequencing put to the test using the complex genome of barley, BMC Genomics, 2006, vol. 7 pg. 275
Velvet: algorithms for de novo short read assembly using de Bruijn graphs, Genome Res., 2008, vol. 18 (pg. 821-829)
Author notes
Associate Editor: Dmitrij Frishman
© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Source: https://academic.oup.com/bioinformatics/article/24/24/2818/197033