UGbS-Flex, a novel bioinformatics pipeline for imputation-free SNP discovery in polyploids without a reference genome: Finger millet as a case study.pdf (BMC Plant Biology)

Qi et al. BMC Plant Biology (2018) 18:117 https://doi.org/10.1186/s12870-018-1316-3 METHODOLOGY Open Access UGbS-Flex, a novel bioinformatics pipeline for imputation-free SNP discovery in polyploids without a reference genome: finger millet as a case study Peng Qi1,2, Davis Gimode1,3,7, Dipnarayan Saha1,2,8, Stephan Schröder1,2,9, Debkanta Chakraborty4, Xuewen Wang5, Mathews M. Dida6, Russell L. Malmberg2 and Katrien M. Devos1,2,4* Abstract Background: Research on orphan crops is often hindered by a lack of genomic resources. With the advent of affordable sequencing technologies, genotyping an entire genome or, for large-genome species, a representative fraction of the genome has become feasible for any crop. Nevertheless, most genotyping-by-sequencing (GBS) methods are geared towards obtaining large numbers of markers at low sequence depth, which excludes their application in heterozygous individuals. Furthermore, bioinformatics pipelines often lack the flexibility to deal with paired-end reads or to be applied in polyploid species. Results: UGbS-Flex combines publicly available software with in-house python and perl scripts to efficiently call SNPs from genotyping-by-sequencing reads irrespective of the species’ ploidy level, breeding system and availability of a reference genome. Noteworthy features of the UGbS-Flex pipeline are an ability to use paired-end reads as input, an effective approach to cluster reads across samples with enhanced outputs, and maximization of SNP calling. We demonstrate use of the pipeline for the identification of several thousand high-confidence SNPs with high representation across samples in an F3-derived F2 population in the allotetraploid finger millet. Robust high-density genetic maps were constructed using the time-tested mapping program MAPMAKER which we upgraded to run efficiently and in a semi-automated manner in a Windows Command Prompt Environment. We exploited comparative GBS with one of the diploid ancestors of finger millet to assign linkage groups to subgenomes and demonstrate the presence of chromosomal rearrangements. Conclusions: The paper combines GBS protocol modifications, a novel flexible GBS analysis pipeline, UGbS-Flex, recommendations to maximize SNP identification, updated genetic mapping software, and the first high-density maps of finger millet. The modules used in the UGbS-Flex pipeline and for genetic mapping were applied to finger millet, an allotetraploid selfing species without a reference genome, as a case study. The UGbS-Flex modules, which can be run independently, are easily transferable to species with other breeding systems or ploidy levels. Keywords: Chromosomal rearrangements, Eleusine coracana, E. indica, Finger millet, Genetic mapping, Genotyping-bysequencing (GBS), GBS-pipeline, Paired-end reads, Polyploid, SNP calling * Correspondence: kdevos@uga.edu 1 Institute of Plant Breeding, Genetics and Genomics (Department of Crop and Soil Sciences), University of Georgia, Athens, GA 30602, USA 2 Department of Plant Biology, University of Georgia, Athens, GA 30602, USA Full list of author information is available at the end of the article © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Qi et al. BMC Plant Biology (2018) 18:117 Background Efficient genotyping methods, whether used for mapping or population genetic studies, must be simple and reliable, and provide the allele composition at thousands or more of polymorphic loci that cover the entire genome [1]. Despite the recent advances in sequencing technologies, whole genome sequencing is still not cost effective for large-genome species, especially when multiple-fold coverage needs to be achieved of several hundred individuals. Several reduced representation methods based on selective sequencing of restriction fragments have been developed to simultaneously conduct high-throughput marker discovery and genotyping [2– 4]. We collectively refer to these methods as ‘genotyping-by-sequencing’ (GBS). To keep sequencing costs low, there is typically a trade-off between marker number and sequencing depth. As a result, genotyping-by-sequencing data sets often have large amounts of missing data and low sequence coverage at each locus [2, 5, 6]. Imputations can be used to infer missing genotypes [7]. Low sequence depth, however, is highly problematic when analyzing diversity panels in outcrossing species, and biparental backcross, F1 (in outcrossing species) and F2 populations where sufficient sequence depth at each locus is a prerequisite to unambiguously differentiate homozygous and heterozygous alleles. We therefore implemented several modifications to the experimental GBS protocol developed by Elshire et al. [2] and Poland et al. [8], and tested their effect on reducing the GBS fragment pool and providing more even read coverage across pooled samples for high-confidence imputation free single nucleotide polymorphism (SNP) identification. Analysis of GBS reads is fairly straightforward if whole genome sequence data is available to which the GBS reads can be aligned. Pipelines for reference-based GBS analysis include TASSEL-GBS [9, 10], Fast-GBS [11] and Stacks [12]. Pipelines such as TASSEL-UNEAK [5], Stacks [12] and GBS-SNP-CROP [13] can generate a de novo GBS reference from the experimental data and are aimed at the analysis of GBS data from species without reference genomes. Most pipelines are geared towards single-end sequencing data from diploid organisms. Finger millet, Eleusine coracana (L.) Gaertn. subsp. coracana, is an inbreeding allotetraploid (AABB) cereal belonging to the subfamily Chloridoideae with a haploid genome size of 1.7 Gb [14]. Despite being an important food security crop for Eastern Africa and parts of southern India, it has persistently been neglected by the international research community. The wild progenitor of finger millet, E. coracana subsp. africana, originated by hybridization between E. indica (AA genome) and an unknown B-genome species. The timing of the allopolyploidization event is not known. To date, only a single Page 2 of 19 linkage map has been generated of finger millet consisting of 332 loci, mostly detected by restriction fragment length polymorphism (RFLP) markers and single strand conformation polymorphic expressed sequenced tags [15, 16]. The map was generated in an F2 population derived from a cross between a wild accession, MD-20, and a cultivated accession, Okhale-1. Based on similarity between RFLP fragment sizes in the A-genomes of E. indica and E. coracana, linkage groups were allocated to subgenomes [15]. The linkage map was also used to establish gross comparative relationships between the finger millet and rice genomes [16]. There is a need for high-density SNP maps of finger millet to assist with trait analyses and planned genome sequencing efforts. Even with less than 10,000 markers, construction of an accurate linkage map is extremely challenging, in particular when dealing with less than perfect data sets such as, for example, obtained in F3-reconstituted F2 populations. Traditional software like MAPMAKER [17, 18] and JoinMap [19, 20] use maximum likelihood based three-point and multi-point analyses which provide highly accurate marker ordering but are highly memoryand time-intensive for large data sets. To deal with large marker numbers, software packages such as MSTmap [21] and Lep-Map [22] based on the traveling salesman principle have been developed. Map generation is fast, but marker ordering is more sensitive to genotyping errors. To be able to take advantage of MAPMAKER’s high accuracy for ordering large marker sets, we modified the original MAPMAKER package to run efficiently in a Windows Command Prompt Environment and developed in-house python scripts to automate several steps of the MAPMAKER mapping process. The mapping pipeline was applied to generate a new high-density genetic map of finger millet comprised of several thousand high-quality SNP markers. Our paper thus provides a modified GBS protocol, a new pipeline (UGbS-Flex) for analysis of paired-end GBS data suitable for application in species with different ploidy levels, breeding systems and polymorphism levels, irrespective of the availability of a reference genome sequence. We also provide a comprehensive solution for post-GBS data analysis, and a high-density genetic map of finger millet with new information on the organization of the allotetraploid finger millet genome. Methods Plant materials and DNA extraction The F2 mapping population was generated from a cross between E. coracana subsp. africana accession MD-20 and E. coracana subsp. coracana accession Okhale-1 [15]. One hundred and thirty-four F2:3 families with a minimum of 15 plants per F3 family plus the parents were grown in the greenhouse at UGA under day/night Qi et al. BMC Plant Biology (2018) 18:117 temperatures in the range 26–30 °C. KNE 796, an accession for which whole genome sequencing is ongoing, and E. indica accessions Ei-0, Ei-2 and Ei-5, collected in the wild in Kenya, were grown under the same conditions. All germplasm was obtained in compliance with national and international accords on export/import of seeds for research purposes. DNA was extracted using a modified CTAB method adapted from Doyle and Doyle [23] from leaves bulk-harvested for each F3 family from eight weeks old seedlings. DNA concentrations were measured on a Nanodrop (Thermo Scientific) and samples were diluted to 50 ng/μl. The quality of the DNA was assessed on a 0.8% agarose gel. GBS sample preparation Two-hundred nanogram of high-molecular-weight DNA was digested with a cocktail of either PstI/MspI, PstI/ NdeI or PstI/MspI + ApeKI. Digestions were done in volumes of 30 μl with 4 U PstI-HF and 8 U MspI in NEB CutSmart buffer for PstI/MspI digestions, 4 U PstI, 8 U MspI and 4 U ApeKI in NEB-buffer 3.1 for PstI/MspI plus ApeKI digestions, and 4 U PstI-HF and 4 U NdeI in NEB CutSmart buffer for PstI/NdeI digestions. Samples were incubated for 2 h at 37 °C. This was followed by an additional 2-h incubation at 75 °C for reaction mixtures comprising the enzyme ApeKI. Samples without ApeKI were incubated at 75 °C for 20 min to inactivate the restriction enzymes. Twenty microliter restriction digest was mixed with 1 μl barcoded PstI adapter (stock: 0.1 μM), 1.5 μl common MspI or NdeI Y-adapter (stock: 10 μM), 4 μl 10X T4-DNA ligase buffer and 200 U T4 DNA ligase (NEB) in a total volume of 40 μl. The common Y-adaptor and barcoded adapters were as described by Poland et al. [8]. Ligation was conducted at 22 °C for 2 h. Following ligation, fragments smaller than 300 bp were removed by incubating the samples with 0.7 volumes of Sera-Mag SpeedBeads (GE Healthcare Life Sciences) prepared according to Rohland and Reich [24] at room temperature for 5 min. The beads were separated from the supernatant using a magnetic stand and washed three times with 200 μl freshly prepared 70% ethanol. DNA was eluted from the air-dried beads with 40 μl 10 mM Tris.HCl (pH 8.0). Three microliter of the resulting eluate was added to a cocktail of 16 μl H2O, 5 μl 5X Taq master mix (NEB), 0.5 μl of a forward primer specific to the barcoded adaptor (stock: 10 μM) and 0.5 μl of a reverse primer with homology to the common adaptor (stock: 10 μM). PCR amplification was done for each sample separately using an initial denaturation at 95 °C for 30 s, 16 cycles of denaturation at 95 °C for 30 s, primer annealing at 62 °C for 20 s and fragment elongation at 68 °C for 15 s, followed by a final fragment elongation step at 68 °C for Page 3 of 19 5 min. Eight microliters of PCR product were checked on a 1.5% agarose gel. The DNA concentration of each GBS library was measured on a Qubit 2.0 using a Qubit™ dsDNA HS assay kit. Only GBS libraries with concentrations > 5.0 ng/μl were sequenced. Thirty nanograms of each GBS library were pooled. The number of samples pooled depended on the sequencing platform used; we aimed to obtain two million reads per sample. Primers, dNTP and small DNA fragments were removed from the pooled DNA with 0.7 volumes of AMPure Beads or Sera-Mag SpeedBeads. Pooled GBS libraries (100 ng) were sequenced on an Illumina NextSeq platform with paired-end 150 bp reads. The parents and 115 F2:3 samples were sequenced as part of the same sequencing run. An additional three and 26 F2:3 samples were sequenced as part of two separate NextSeq runs. Ten samples were sequenced in duplicate from independently generated libraries to ensure consistency across libraries and runs. GBS analysis pipeline with optional de novo generation of a reference The full UGbS-Flex pipeline is described below. All in-house perl and python scripts used in the UGbS-Flex pipeline with information on their use are provided under ‘Programs and Scripts’ on http://research.franklin.uga.edu/devoslab/. Detailed information on how to apply the UGbS-Flex pipeline is given in Additional file 1: Data S1 and Additional file 2: Figure S1. Preprocessing of the reads The read quality was checked with ‘FastQC’ v. 0.11.4 [25]. Reads were split by barcode using the module ‘Process_Radtags’ within the ‘Stacks’ program [12] with option –r (rescue barcodes and RAD-tags). Forward reads passed the filter if they carried both the barcode and the PstI restriction site. Because the first few bases of the reverse reads were low quality in some Illumina NextSeq sequencing runs, no selection was carried out for the restriction site of the second enzyme (MspI or NdeI) in the reverse reads. ‘FASTX_trimmer’ from the ‘FASTX Toolkit’ package (http://hannonlab.cshl.edu/fastx_toolkit/) was used to remove (1) the restriction sites, (2) the last 5 bp (typically) of each read that were more likely to contain errors (or, for lower quality runs, all bases at the 3′ end of a read with FastQC quality scores lower than 20) and (3) an additional 0 (for a 10 bp barcode) to 5 bases (for a 5 bp barcode) at the 3′ end of the forward read to make all reads the same length. Identical read length is a prerequisite of the ‘Stacks’ program [12] used in the generation of a de novo reference from the GBS reads. De novo generation of a GBS reference To facilitate handling of paired-end reads during the generation of a GBS reference, overlapping forward and Qi et al. BMC Plant Biology (2018) 18:117 reverse reads were merged using ‘Flash’ [26]. From the non-overlapping read files, we removed any reads that were shorter than the expected (trimmed) size by running python script ‘EL.1.2.py’. If a reverse read was removed, the corresponding forward read was also removed, and vice versa. The ‘EL.1.2.py’ script then reverse complemented the non-overlapping reverse reads within the remaining read pairs and artificially joined them to the 3′ end of the corresponding forwards reads. No Ns were added at the junction of the forward read and the reverse complemented reverse read. Because read clustering by ‘ustacks’ [12] requires reads of equal length, ‘EL.1.2.py’ also extended merged overlapping reads at the 3′ end with ‘As’ to make them the same length as the joined non-overlapping reads. The A-extended overlapping fragments are typically common to all samples, and hence the polyA tracts do not generate polymorphisms. Reads within each sample were clustered using the ‘ustacks’ module (options: -m 2 -M 1 -N 1) within the ‘Stacks’ program [12]. The ‘cstacks’ module within Stacks (options: -b 1 -n 1) was used to generate a set of representative tags by clustering the read stacks obtained from ‘ustacks’ across the two parents and 117 F2 progeny. Only a subset of the F2 samples was included in the ‘cstacks’ analysis due to the high memory requirements for running ‘cstacks’ with large numbers of samples. We also tested and validated an alternative approach, referred to as ‘across-sample ustacks’ (‘ASustacks’) to replace ‘cstacks’. Using in-house python scripts, consensus sequences generated in each sample by ‘ustacks’ were extracted, given a unique name including a sample identifier, and placed in an artificial fastq file by adding a sequence quality line consisting of Es to each consensus sequence. The ‘ustacks’ module was then applied to this file using parameters comparable to those Page 4 of 19 applied in ‘cstacks’. The minimum number of reads required to form a stack (−m) was set at 1. An overview of the steps involved in the generation of a GBS reference using the UGbS-Flex pipeline is shown in Fig. 1. Filtering of the GBS representative tags to generate a reference Two filtering steps were conducted on the representative tags identified across samples by either ‘cstacks’ or ‘ASustacks’ (Fig. 1). Firstly, representative tags that were present in less than a user-defined percentage of samples (50 and 70% in this study) in the ‘cstacks’ output were removed using the in-house perl script ‘FCT.pl’. This filtering step has been integrated into the ‘ASustacks’ module. Secondly, we removed representative tags that had similarity levels to another representative tag equal to or higher than a user-defined percentage (98% in this study). To achieve this, a blastn all-against-all analysis of the consensus tags was conducted (e-value threshold: 10e− 5). For tags with ≥98% homology, only a single tag was retained using python script ‘Ref_98.py’. The representative tags remaining were used as the GBS reference. SNP/indel calling and filtering Preprocessed reads were aligned against the GBS reference using Bowtie 2 v. 2.2.0 with default parameters [27]. If a whole genome reference sequence is available, de novo generation of a GBS reference can be omitted and trimmed reads are aligned against the reference genome. For SNP/indel calling, we tested both the ‘Unified Genotyper (GATK v. 3.4.0)’ (parameters –dcov 1000, −glm BOTH) and ‘Haplotype Caller (GATK v. 3.4.0)’ (parameters –genotyping_mode DISCOVERY –stand_emit_conf 10 –stand_call_conf 30 – minPruning 1 –emitRefConfidence GVCF) modules Fig. 1 Schematic overview showing use of the UGbS-Flex pipeline to generate a GBS reference Qi et al. BMC Plant Biology (2018) 18:117 Page 5 of 19 within the GATK suite [28]. Indels were treated the same as SNPs and, for simplicity, we use the term ‘SNPs’ to cover both SNPs and indels. Raw SNPs were filtered within GATK to only retain biallelic SNPs with an allele frequency between 10 and 90%. We also removed adjacent SNPs (scripts ‘SNPs_ISL.pl’ and ‘Rm_adj_SNPs.pl’) because some of these were artefacts caused by misalignments at the junction of the forward and reverse reads. The allelic depth (AD) information provided in the GATK .vcf file was then used to score the allelic status of the SNPs at each locus (script ‘SNP_genotyper.py’). Loci with a total AD < 8 were scored as missing data points (−). Loci with an ADref(erence allele)/ADalt(ernate allele) ratio ≥ 10 were scored as A (homozygous for the Parent 1 allele), ADref/ ADalt ≤ 0.10 as B, 10 > ADref/ADalt > 4 as D (ambiguous A or heterozygous (H)) and 0.25 > ADref/ADalt > 0.1 as C (ambiguous B or H). Loci with other ratios were scored as H. For F2 populations generated from two inbred parents, as was the case for finger millet, all SNPs that were not homozygous for different alleles in the parents were removed (script ‘SNP_selectByParent.py’). SNPs that were missing in more than 30% of the samples or had an A/B ratio < 10% or > 90% were also removed. This was done manually in Excel. Similarly, samples with more than 30% of missing data were removed. Comparison of the UGbS-flex and GBS-SNP-CROP pipelines To compare the performance of the UGbS-Flex and GBS-SNP-CROP [13] pipelines, the raw sequence data from 48 tetraploid kiwiberry genotypes used by Melo et al. [13] were downloaded from NCBI (SRR2296676). Guided by the fastQC report, we trimmed the forward and reverse reads to 121 bp. The parameters used for ‘ustacks’ were – m 2 –M 2 –N 4, and for ‘ASustacks’–m 1 –M 2 –N 4. We selected the same missing data threshold (25%) and sequencing depth for SNP scoring as Melo and colleagues [13]. The H-threshold was set at 4. Genetic mapping Reducing the dataset by consolidating SNPs located in the same GBS tag Information at SNP loci that were located within the same GBS reference tag and were in linkage disequilibrium was consolidated to further improve the robustness of the mapping scores (function included in ‘SNP_genotyper.py’). If loci within a representative tag were scored as a combination of ‘A’, ‘D’ and ‘-’, the consolidated score was ‘A’ (Table 1). Similarly, a combination of ‘B’, ‘C’ and ‘-’ was consolidated as ‘B’, and a combination of ‘H’,‘C’, ‘D’ and ‘-‘ as H. If all loci in a GBS representative tag were scored as ‘-’, ‘C’ or ‘D’, those scores were retained. Conflicting scores (A and B, A and C, A and H, B and H, and B and D) were flagged (‘F’) and treated as missing data for map generation. Reducing the dataset by removing cosegregating markers To generate a set of high-quality non-redundant markers for genetic map construction, each SNP marker was given a penalty score for the presence of a ‘C’ or ‘D’ (penalty = 1) or a missing data point (penalty = 2). Using the in-house python script ‘SNP_cosegregation.py’, the mapping scores of all SNP markers were compared in an all-against-all analysis with a greedy algorithm. SNP markers with the same multilocus genotypes were identified, and the marker with the smallest penalty score in each set was selected for mapping. Construction of genetic maps To construct the genetic maps, we removed all SNPs that had missing data in more than 20% of the progeny. The set of SNP markers was split into linkage groups using MSTmap [21]. Initial map orders were established with MST map and checked for double recombination events using MAPMAKER (adapted from [18]; adapted version available from http://research.franklin.uga.edu/ devoslab/). Markers with more than a defined number (four in this study) of double recombination events were removed. The process of MSTmap mapping and checking of double recombination events was repeated until the number of double recombination events between each marker and its flanking markers was ≤4. The corresponding MSTmap maps were used as starting points for map generation using MAPMAKER (adapted from [18]; adapted version available from http://research. franklin.uga.edu/devoslab/). Because the MAPMAKER version used was limited to ordering ~ 100 markers due to inherent program limitations, MSTmap maps with more than 100 markers were Table 1 Schematic representation of the approach used to consolidate SNPs within the same GBS reference tag S1b S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 a – Bc – C D A B H A B H A a SNP2 A – – – – D C C A D A A SNP3a – – H – – A C D C B A B After consolidation A B H C D A B H – – – – SNP1 a SNP1, SNP2 and SNP3 are SNPs located on the same GBS tag b S1 to S12 represent different situations c A, B, H, C, D and – are genotypic scores Qi et al. BMC Plant Biology (2018) 18:117 split into smaller subgroups of 60 to 100 markers. Subgroups overlapped by 40 markers. Genetic maps were constructed for each of the subgroups using the ‘order’ and ‘try’ commands. Subgroup maps were merged based on common marker orders in the overlapping segments. Each linkage group was split again into subgroups of < 100 markers, and marker orders were further adjusted using the ‘ripple’ command. ‘Try’ and ‘ripple’ were done in a semi-automated manner using the scripts ‘try.py’ and ‘ripple.py’, respectively. Final marker orders were merged across subgroups. Genetic map distances (in Kosambi) were obtained using the ‘map’ command with ‘error detection on’ in MAPMAKER. Map orders were scrutinized manually and, if necessary, further adjusted. Markers with the same multi-locus genotype were added to the map as cosegregating with their representative marker. Finally, we placed markers with ambiguous orders (not separated by clear recombination events) in bins. Identifying a and B genome linkage groups Reads from three E. indica accessions (AA genome) were aligned against the GBS reference tags using Bowtie 2 v. 2.2.9 with default parameters [27]. Presence (present in at least two of the three E. indica accessions analyzed) and absence (absent from all three E. indica accessions analyzed) of mapped GBS tags in the E. indica genome were charted along the length of the genetic map using Excel scatterplots. Linkage groups with E. indica tags along their entire length were allocated to the A genome and those that were largely devoid of E. indica tags were allocated to the B-genome. Results Efficiency of different enzyme combinations in generating polymorphic markers We tested two two-enzyme combinations (PstI/MspI and PstI/NdeI) and one three-enzyme combination (PstI/MspI + ApeKI) on three finger millet accessions for their efficiency in generating largely overlapping fragment pools that, when sequenced, yielded SNPs that were present in all three accessions at a depth of at least 8×. All samples were sequenced (paired-end 150 bp) on an Illumina NextSeq platform. The number of reads obtained for each of the nine sample/enzyme combinations (three accessions, three enzyme combinations) is given in Additional file 3: Table S1. To estimate the effect of read depth on the de novo generation of a GBS reference, we analyzed subsets of 0.2 million (M), 0.5 M, 1 M, 2 M and 3 M paired-end reads for each accession/ enzyme combination with our newly developed UGbS-Flex pipeline (Fig. 1). The smaller read numbers were subsets of the larger read sets. Page 6 of 19 For the enzyme combinations PstI/MspI and PstI/MspI plus ApeKI, as expected, the number of GBS tags that were common to all three accessions tested increased with increasing total read numbers, reaching a plateau around 2 M reads (Additional file 3: Table S1 ‘By Enzyme Combination’). For the enzyme combination PstI/NdeI, however, the number of common GBS tags increased from 0.2 M to 1 M total reads, but then decreased when total read numbers were increased from 1 M to 3 M. To test whether this was an artefact generated by the ‘cstacks’ [12] module, which clusters reads across samples, we developed the script ‘across-sample ustacks’ (‘ASustacks’). ‘ASustacks’ generated an artificial fastq file from each sample’s ‘ustacks’ output and these files were used as input for ‘ustacks’. The ‘ASustacks’ approach yielded similar numbers of reference tags as ‘cstacks’ except for read numbers ≥1 M in enzyme combination PstI/NdeI. We now saw the expected increase in common GBS tags with increasing read numbers across all enzyme combinations. More than 97% of GBS reference tags that were identified with ‘cstacks’ were also found in the reference generated using ‘ASustacks’. Interestingly, the read depth of the GBS reference tags that were identified by both ‘cstacks’ and ‘ASustacks’ was significantly lower than the read depth of the GBS reference tags that were uniquely identified by ‘ASustacks’ (Additional file 4: Table S2). This suggests that high read depth hampered the performance of ‘cstacks’, possibly because a higher read depth led to a higher absolute presence of SNPs caused by PCR or sequencing errors in allelic reads. The ‘cstacks’ module may have eliminated these clusters as likely consisting of repetitive DNA. We conducted a blastn analysis of all finger millet reference tags that were identified by ‘ASustacks’ in the subset of 3 M reads in the PstI/NdeI digested samples against the repeat-masked rice (Oryza sativa) genome. Blastn hits were identified (e-value threshold of e-5) for 30% of the tags that were common to ‘cstacks’ and ‘ASustacks’, and for 37% of the tags that were uniquely identified by ‘ASustacks’. This shows that the tags eliminated by ‘cstacks’ were not enriched for repeats. To further validate the ‘ASustacks’ approach, we compared read clusters generated by ‘cstacks’ and ‘ASustacks’ in GBS data from DNA of a set of 96 diverse foxtail lily lines belonging to different Eremurus species digested with PstI/MspI. Foxtail lily has an ~ 7.9 Gb (1C) genome (I. Leitch, pers. comm.). The number of tags that were common to at least 50% of the lines based on ‘cstacks’ and ‘ASustacks’ was 376 and 3552, respectively, with 98% of ‘cstacks’ clusters being present in the ‘ASustacks’ output. In addition to yielding higher numbers of reference tags, the ‘ASustacks’ approach was computationally much less intensive than ‘cstacks’. To estimate the effect of the different enzymes on reducing the fragment pool for sequencing, we compared the number of polymorphic GBS tags that were present Qi et al. BMC Plant Biology (2018) 18:117 Page 7 of 19 in all three accessions tested and the number of SNPs identified across the three enzyme combinations (Table 2; Additional file 3: Table S1 ‘By Read Number’). Minimum read depth for SNP scoring was 8×. The use of a third enzyme greatly decreased the number of GBS reference tags and hence the number of SNPs identified but, contrary to our expectations, only marginally increased SNP read depth (Table 2). The percentage of ApeKI sites in the PstI/MspI read set was some two-fold higher than in the PstI/MspI + ApeKI read set, and comparable to that found in the GBS reference tags (Additional file 5: Table S3). In the triple digest, however, ApeKI-containing reads were highly underrepresented in the GBS reference compared to the reads (Additional file 5: Table S3). This suggests that ApeKI-containing reads in the triple digests could not be clustered. They likely originated during the adapter-ligation step through random ligation of ApeKI fragments from different genomic regions. The highest number of polymorphic SNPs present at a minimum read depth of 8× in each of the three accessions tested was obtained with enzyme combination PstI/MspI for read numbers ≥1 M and with PstI/NdeI when the number of fragments sequenced per sample was ≤500,000 (Additional file 3: Table S1 ‘By Read Number’). Generation of a GBS reference in the MD-20 x Okhale-1 mapping population A total of 278,880,767 paired-end reads were obtained across 146 samples (SRA Study SRP136342). Genotypic scores of duplicated samples were merged using the rules applied to SNP consolidation. Approximately 2% of SNPs disagreed between duplicated samples and were entered as missing data. The average and median read number per sample was 1,910,142 and 1,317,595, respectively. One sample with less than 600,000 reads was removed from the analysis. The number of representative GBS tags that were present in at least 50% of the samples (hereafter referred to as Ref50) following ‘cstacks’ analysis was 34,960 (Table 3). This number decreased to 16,725 for tags present in at least 70% of the samples (Ref70). Following removal of representative tags with ≥98% homology, 28,579 tags remained in the Ref50 reference (Ref50_98) and 15,397 in the Ref70 reference (Ref70_98). SNP calling Trimmed reads were aligned against the generated Ref50, Ref50_98, Ref70 and Ref70_98 GBS references and the alignments were used for SNP calling. Both GATK’s Unified Genotyper and Haplotype Caller were employed and their outputs compared. The number of reference GBS tags that carried SNPs for each SNP caller/alignment combination is given in Table 3. The number of SNP-containing GBS tags common to different references and to different SNP callers is shown in Fig. 2 and Additional file 6: Figure S2. For construction of the genetic maps, the SNPs identified with Unified Genotyper and Haplotype Caller against the references Ref50_98 and Ref70_98 were pooled, yielding a total of 17,245 SNPs distributed over 7307 tags. Consolidation of the SNPs located on the same tag to one consensus SNP per tag reduced the number of SNPs (tags) to 7125. A total of 182 tags was removed because they carried SNPs with conflicting genotypic scores in 20% or more of the progeny. High numbers of conflicting scores across SNPs located on the same GBS tag is likely caused by alignment of both A and B genome reads to the same GBS reference tag. Table 2 Summary statistics obtained for each of the three enzyme combinations for a subset of 1 M reads Enzyme combination PstI/MspI PstI/MspI + ApeKI PstI/NdeI Accession Total read number Number of stacks (tags) within samplesa Number of stacks (tags) common to all samplesb 22,191 KNE 796 5,720,433 83,082 MD-20 5,418,738 64,382 Okhale-1 7,738,023 73,903 KNE 796 3,466,582 50,189 MD-20 4,474,070 37,619 Okhale-1 5,586,989 45,829 KNE 796 6,647,185 39,406 MD-20 4,269,463 35,218 Okhale-1 5,851,494 37,551 Number of polymorphic tagsc Number of polymorphic SNPsc 4440 4515 Average depth of scored SNPs 15.20 15.64 17.21 12,249 16.57 2475 2509 20.90 19.65 12,392 32.28 3232 3305 28.66 34.76 Determined using ‘ustacks’ b GBS tags that are common to all three accessions were selected from the ‘ASustacks’ output; if two or more tags had a level of homology ≥98%, only a single tag was retained c Only SNPs at a read depth ≥ 8× were scored a Qi et al. BMC Plant Biology (2018) 18:117 Page 8 of 19 Table 3 Number of SNPs identified with different combinations of SNP callers and GBS references Unified Genotyper (GATK) Haplotype Caller (GATK) Ref50 Ref50_98 Ref70 Ref70_98 Ref50 Ref50_98 Ref70 Ref70_98 No. of GBS tags 34,960 28,579 16,725 15,397 34,960 28,579 16,725 15,397 No. of SNPs 2358 5534 3939 4477 1766 4593 3413 3934 Comparison of UGbS-flex and GBS-SNP-CROP pipelines Using the same dataset and, to the extent possible, same thresholds as reported by Melo and colleagues [13], UGbS-Flex yielded, before filtering, a total of 86,810 SNPs compared to 56,598 reported by Melo and colleagues for GBS-SNP-CROP (Additional file 7: Table S4). After filtering, the total number of SNPs retained from the UGbS-Flex pipeline was 50,139 compared to 21,318 by GBS-SNP-CROP (Additional file 7: Table S4). The SNP number reported for the GBS-SNP-CROP pipeline is when a single accession with the most abundant read number was used in the generation of a reference. When all 48 accessions were used, the total number of filtered SNPs obtained by GBS-SNP-CROP was 14,712. We used all 48 accessions to generate a reference with UGbS-Flex. Genetic mapping Because some of the distorted markers caused spurious linkages, we initially removed all markers with segregation ratios that deviated from 1:2:1 (A:H:B). We also removed cosegregating markers to reduce marker load during map construction. After the initial map generation, three groups of distorted markers that linked together at high LOD scores and extended the preliminary maps were re-added to the dataset to generate the final maps. The maps consisted of a total of 3772 SNP markers organized in 18 linkage groups with the number of markers per linkage group varying from 39 (51 cM) to 301 (240 cM) (Figs. 3, 4 and 5, Table 4, Additional file 8: Table S5). After Fig. 2 Venn-diagram showing the number of unique and common SNPs identified using GATK’s Unified Genotyper and Haplotype Caller in combination with GBS references Ref50_98 and Ref70_98 integrating the cosegregating markers, the total number of markers mapped was 4453 (Additional file 9: Table S6). The number of recombination bins per chromosome varied from 25 to 120 (Additional file 8: Tables S5 and Additional file 9: Table S6). The sequences of the mapped GBS tags and the SNP positions in these tags are provided in Additional file 9: Table S6. Allocating linkage groups to a and B subgenomes For approximately 13% of the mapped E. coracana GBS tags, corresponding GBS reads were identified in all three E. indica accessions analyzed. An additional 14% of mapped tags were represented in two of the three analyzed E. indica accessions and 18% were present in only a single E. indica accession. Excel scatterplots showing the distribution of GBS tags absent from all three E. indica accessions and present in at least two of the three E. indica accessions in each of the 18 E. coracana linkage groups are shown in Fig. 6. Chromosomes within seven of the nine homoeologous groups were unambiguously assigned to the A or B subgenome. A/B translocations were identified for homoeologous groups 6 and 9. Discussion Optimization of the GBS technology We tested several modifications to the experimental GBS protocol developed by Elshire et al. [2] and Poland et al. [8]. The aim was to reduce the fragment pool for sequencing and to provide more even read coverage across pooled samples to increase read depth at each locus and SNP representation across samples. A reduction in the number of fragments that will be sequenced can be attained by preventing the addition of Illumina sequencing adapters to a subset of the DNA fragments during the PCR step. This can be achieved using primers with one or more selective bases as applied in Tunable GBS (tGBS®, Data2Bio, ISU Research Park, Iowa) or by cutting PCR-amplifiable fragments with a restriction enzyme. We pioneered the latter approach in finger millet, an inbreeding tetraploid species. DNA of three accessions was either double digested with the enzyme combination PstI/MspI or PstI/NdeI, or triple digested by adding ApeKI to the PstI/MspI digest. PstI and NdeI are 6-bp cutters, MspI is a 4-bp cutter and ApeKI is a 5-bp cutter. PstI, NdeI and ApeKI are sensitive to CNG methylation. No adapters were ligated to sites generated by the third enzyme. While employing a third enzyme Qi et al. BMC Plant Biology (2018) 18:117 Page 9 of 19 Fig. 3 High-density genetic maps of finger millet (homoeologous groups 1, 2 and 3). Marker names are on the right-hand side, centiMorgan (Kosambi) distances on the left-hand side. For readability, only the first marker of each marker bin is represented on the map. Locations of all markers are available from Additional file 9: Table S6 Qi et al. BMC Plant Biology (2018) 18:117 Page 10 of 19 Fig. 4 High-density genetic maps of finger millet (homoeologous groups 4, 5 and 6). Marker names are on the right-hand side, centiMorgan (Kosambi) distances on the left-hand side. For readability, only the first marker of each marker bin is represented on the map. Locations of all markers are available from Additional file 9: Table S6

UGbS-Flex, a novel bioinformatics pipeline for imputation-free SNP discovery in polyploids without a reference genome: Finger millet as a case study

Nội dung