SAMSA: A comprehensive metatranscriptome analysis pipeline.pdf (BMC Bioinformatics)

Westreich et al. BMC Bioinformatics (2016) 17:399 DOI 10.1186/s12859-016-1270-8 METHODOLOGY ARTICLE Open Access SAMSA: a comprehensive metatranscriptome analysis pipeline Samuel T. Westreich1,2, Ian Korf1,2, David A. Mills3 and Danielle G. Lemay2* Abstract Background: Although metatranscriptomics—the study of diverse microbial population activity based on RNA-seq data—is rapidly growing in popularity, there are limited options for biologists to analyze this type of data. Current approaches for processing metatranscriptomes rely on restricted databases and a dedicated computing cluster, or metagenome-based approaches that have not been fully evaluated for processing metatranscriptomic datasets. We created a new bioinformatics pipeline, designed specifically for metatranscriptome dataset analysis, which runs in conjunction with Metagenome-RAST (MG-RAST) servers. Designed for use by researchers with relatively little bioinformatics experience, SAMSA offers a breakdown of metatranscriptome transcription activity levels by organism or transcript function, and is fully open source. We used this new tool to evaluate best practices for sequencing stool metatranscriptomes. Results: Working with the MG-RAST annotation server, we constructed the Simple Annotation of Metatranscriptomes by Sequence Analysis (SAMSA) software package, a complete pipeline for the analysis of gut microbiome data. SAMSA can summarize and evaluate raw annotation results, identifying abundant species and significant functional differences between metatranscriptomes. Using pilot data and simulated subsets, we determined experimental requirements for fecal gut metatranscriptomes. Sequences need to be either long reads (longer than 100 bp) or joined paired-end reads. Each sample needs 40–50 million raw sequences, which can be expected to yield the 5–10 million annotated reads necessary for accurate abundance measures. We also demonstrated that ribosomal RNA depletion does not equally deplete ribosomes from all species within a sample, and remaining rRNA sequences should be discarded. Using publicly available metatranscriptome data in which rRNA was not depleted, we were able to demonstrate that overall organism transcriptional activity can be measured using mRNA counts. We were also able to detect significant differences between control and experimental groups in both organism transcriptional activity and specific cellular functions. Conclusions: By making this new pipeline publicly available, we have created a powerful new tool for metatranscriptomics research, offering a new method for greater insight into the activity of diverse microbial communities. We further recommend that stool metatranscriptomes be ribodepleted and sequenced in a 100 bp paired end format with a minimum of 40 million reads per sample. Keywords: Metatranscriptome, Pipeline, Microbiome, Metagenome, RNA-seq, Big data, Software package, Best practices * Correspondence: dglemay@ucdavis.edu 2 Genome Center, University of California - Davis, Davis, CA, USA Full list of author information is available at the end of the article © 2016 The Author(s). Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Westreich et al. BMC Bioinformatics (2016) 17:399 Background Metatranscriptomics, the large-scale sequencing of mRNAs from complex microbial communities, allows for the observation of gene expression patterns [1–4]. Metatranscriptomics is a relatively new field, with the first mention around 2008 [5], but it is growing quickly. Using high-throughput techniques developed in conjunction with “big data” computer-automated analysis approaches, metatranscriptomics offers a novel and complete method for looking at not just the organisms present, but also the transcriptional activity occurring within a complex and diverse population at any chosen specific point in time [2, 3]. Metatranscriptomics is especially useful for analyzing complex populations in flux, such as the gut microbiome, which can be impacted and altered by a large number of transitory factors [6–8]. However, because of the complexity of metatranscriptomic data, extensive analysis is needed to convert raw data into simplified and easily understood results. The raw data comprise tens of millions of individual reads per sample [4]. Simplifying and condensing this very large data set requires multiple steps in a software pipeline and generally requires a dedicated bioinformatician to perform the analysis. Current pipelines or in-house methods often require significant computing power or use several different tools, many of which were not originally intended for metatranscriptome analysis [2, 3, 9]. For researchers who want to utilize metatranscriptomic analysis but may not have the necessary bioinformatics experience, there is a strong need for a complete pipeline, designed to analyze this data from beginning to end without requiring extensive technical expertise. In addition, as metatranscriptomics is a new area of genomic exploration, there are few established guidelines or standard protocols. Standardized protocols exist for mRNA collection [5] and for stool-specific extraction [10, 11], but there is also a pressing need for an investigation of necessary sequencing parameters and depth, minimum read quality standards, and reliable reference databases against which metatranscriptome reads can be aligned. Any researcher planing to sequence metatranscriptomic data must know the target sequencing depth and minimum recommended read length, whether to filter out ribosomal sequences, and the relative confidence of their predicted results based on these parameters. In this paper, to develop a standard metatranscriptomic analysis pipeline that can be applied to a wide range of microbiome samples, we extracted and sequenced RNA from human fecal material using experimentally validated protocols [10]. By testing various parameters and modifications, we determined the best practices at each step in obtaining and analyzing the metatranscriptomic data. In addition, we incorporated a publicly available metatranscriptomic data set examining the fecal gut microbiome in Page 2 of 12 tyrosine kinase two knockout (Tyk2−/−) mice compared to wild-type controls over time during the outbreak of dextran sodium sulfate (DSS) induced colitis [12]. Examining this data, we provide new insights into differences in functional expression between the two different microbiomes over the course of the colitis development. By applying the approaches detailed here, researchers wishing to include metatranscriptomic analysis in their experiments can obtain accurate results without an investment in the development of bioinformatics tools. Results Creation of SAMSA, a new pipeline for metatranscriptome analysis We created the SAMSA (Simple Analysis of Metatranscriptome Sequence Annotations) pipeline, designed to fully analyze and characterize bacterial transcription within a metatranscriptome, illustrating relative level of transcriptional activity split by both organism and functional category. There are four phases to the pipeline: the preprocessing phase trims and combines reads for input to the annotation phase, the annotation phase provides an annotation for each read, the aggregation phase aggregates organsim and function information across all reads, and the analysis phase provides visualizations and statistical analysis (Fig. 1). Preprocessing phase During preprocessing, raw sequences are trimmed to remove reads containing low-quality bases and eliminate adaptor contamination using Trimmomatic, a flexible read trimming tool for Illumina NGS data [13]. Next, each pair of paired-end reads are aligned to each other using FLASh, a short read aligning program [14]. In our pilot samples, approximately 32–54 % of the raw reads in each sample were successfully aligned, with an average aligned read length of 178 base pairs. Annotation phase Next, these sequences are submitted for annotation to Metagenomic Rapid Annotations using Subsystems Technology (MG-RAST) [15]. MG-RAST includes several steps, including an initial sequence quality control check through SolexaQA, gene calling through FragGeneScan, clustering of amino acid sequences at 90 % identity through the uclust implementation of QIIME, and then using sBLAT on each protein sequence cluster to locate the best match reference. For each sequence cluster, MG-RAST selects the best match through the sBLAT similarity search. If multiple reference database matches tie for best matching score, they are both included in the final results. If the read does not achieve a match score above the minimum e-value cutoff, it is discarded. Each match is linked to MG- Westreich et al. BMC Bioinformatics (2016) 17:399 Page 3 of 12 Fig. 1 The SAMSA pipeline. This organizational chart shows the flow of data through the pipeline, beginning with raw reads at the top of the chart and ending with the graphical output of the results at the bottom. Note that blue boxes denote intermediate generated output files, red boxes denote Python scripts, orange boxes denote R scripts, and green boxes denote external reference databases Westreich et al. BMC Bioinformatics (2016) 17:399 RAST’s internal identifier system and assigned an M5nr ID, correlating with linked matches in all subsystems databases. The annotated output can be provided on a permatch basis, using the M5nr ID to link each read to its best match from the subsystems database of choice. To create sorted abundance measures of the metatranscriptome using the SAMSA pipeline, all annotations with an acceptable best-match to the NCBI Reference Database (RefSeq) [16] are downloaded from MG-RAST. Annotations are downloaded for the best match to both organism and individual transcript. Annotations are downloaded directly in tabdelimited form using MG-RAST’s RESTful API interface [17] and a custom Python program to assemble the API call command. In addition, the annotated output is also downloaded from the SEED Subsystems reference database [18] to provide ontology annotations. Aggregation phase A custom Python program parses each annotated output, storing each unique annotation match in a dictionary and maintaining counts of the number of occurrences of each unique annotation. After the annotation file is processed, unique annotations are sorted by abundance and exported as output. Analysis phase Annotation and abundance information from the Aggregation phase are inputs to custom R scripts, which generate barplots and dendograms (see Fig. 1). To compare experimental versus control metatranscriptomes and determine significantly differentially expressed transcripts, R’s DESeq2 package is used to test the output files for differential expression [19]. DESeq2’s testing method adjusts for multiple hypothesis testing, performing pairwise comparisons in the scope of a larger overall data set. The pipeline’s R scripts export sorted lists of log2fold change in both organism and functional category activity, sorted by adjusted p-value. Results are also stored in the local R environment and can be used for graphical output. This complete pipeline, coded in Python and R, is fully open source, is set up for streamlined use from the command line, does not require a local server for intense computation, and is freely available for download through GitHub. Measurement of host transcripts in fecal metatranscriptomes Given that metatranscriptomes theoretically contain the transcriptional profile of all organisms present, we asked whether the transcriptomes of host cells could still be elucidated from fecal metatranscriptomes for which the primary goal was bacterial metatranscriptome analysis. We tried two methods of RNA extraction followed by either ribosomal depletion or poly(A) enrichment (see Page 4 of 12 Methods). Neither method yielded many reads aligning to the human genome. We found 417 and 340 ribodepleted reads (0.0019 and 0.0085 % of all ribodepleted reads) matching the human genome, and 2273 and 2100 poly(A) enriched reads (0.116 and 0.097 % of all poly(A) enriched reads) matching the human genome. Thus, we conclude that host transcriptomes cannot be reliably extracted from stool metatranscriptomes that have been isolated and sequenced using standard methods. Use of ribosomal depletion methods in metatranscriptome preparation Transcriptomic and metatranscriptomic extractions are often treated with ribodepletion methods before sequencing for the removal of ribosomal reads, increasing the relative mRNA yield [4, 20]. Comparing ribodepleted and non-depleted RNA, we found that the Ribo-Zero Gold ribodepletion kit from Illumina showed an overall 63–82 % reduction in ribosomal reads present within our sample, but certain microbial ribosomes did not show any reduction, including members of the Actinobacteria, Cyanobacteria, and Spirochaetes phyla (Fig. 2, Table 1). Ribodepletion appears effective at depleting rRNA reads within a sample, but also preferentially removes ribosomes from certain bacterial phyla over others, skewing ribosomal output within the metatranscriptome’s sequencing results. Optimal sequencing depth for gut microbiome metatranscriptomics An important consideration when creating a metatranscriptome is sequencing depth: how many reads must be obtained in order to provide proper representation of all reads within a sample? As the proportionality of different reads within the sample is of crucial importance, sequencing depth must be large enough to ensure a balanced representation of each read’s relative abundance within the total sample. Too little depth can result in inaccurate abundance and activity measurements. To evaluate necessary sequencing depth using a bioinformatics based approach, we generated 100 randomly selected subsets of a large and comparatively over-sequenced metatranscriptome of 21.6 million annotated reads (derived from 38 M raw reads), creating ten smaller stand-alone simulated metatranscriptomes for each size point measured, moving in ten percent increments from 1 million up to 20 annotated million reads per subset. Within each simulated metatranscriptome, the relative percentage of various reads was measured, and these results were compared by simulated metatranscriptome size. This approach was performed looking at both high (in the 90th percentile of all reads by sorted abundance), medium (in the 50th percentile of all reads by sorted abundance), and low (in the 10th percentile of Westreich et al. BMC Bioinformatics (2016) 17:399 Page 5 of 12 Fig. 2 Level of depletion in ribodepleted vs. control metatranscriptomes. Identical pilot metatranscriptome samples were sequenced; protocol 1 included a ribodepletion step, while protocol 4 did not include this step. As is demonstrated, not all species were equally depleted, skewing the perceived abundances of different organisms within the metatranscriptome. Normalized data shows the combined average depletion percentage for each phylum of bacteria present within the sample all reads by sorted abundance) abundance reads within the parent metatranscriptome (Fig. 3). By comparing relative abundance in each subset metatranscriptome to the final full data, we quantified the accuracy of the abundance measurements. We defined accuracy as percentage deviation from final, stable abundance counts. This data shows that, for the human gut metatranscriptomes examined, a minimum a minimum of 5 and 10 million read annotations are needed to achieve above 90 % accuracy in low abundance reads. This is equivalent to about 40–50 million raw sequence reads. Note that abundance estimates for medium and high abundance reads can be reasonably accurate with fewer annotations (Fig. 3). Paired or single end sequencing for metatranscriptome analysis Another important consideration when sequencing a metatranscriptome is whether paired-end sequencing is necessary for proper read annotation. Paired-end sequencing is more expensive on a per-read basis but the average output read length is significantly increased, allowing for a more accurate best-match annotation in the pipeline. Statistical comparisons of the relative organism activity results and the relative transcript abundance between single and paired end read versions of the same original metatranscriptome file show significant differences in both organism-of-origin and transcript identification by the annotation pipeline. In identical metatranscriptomes containing either paired-end or single end reads, the paired-end reads resulted in fewer reads removed due to quality controls (51,507 versus 123,518, or 0.3 vs. 0.6 %), and a greater percentage of all reads matching at least one alignment in the reference database (77.2 %, or 4,321,429 paired end reads versus 65.9 %, or 3,194,911 single end reads) (Table 2). The paired-end sequencing data matched to a higher number of unique transcripts than the single read data (270,384 unique transcripts versus 215,599 transcripts), suggesting that increased read length leads to greater specificity in transcript annotation. Correlations between paired-end and single-read processed identical metatranscriptomes averaged only 0.72 between both pilot samples. Given that these pairedend and single read metatranscriptomes were originally identical before the trimming of excess bases beyond 100 bases per read, these results indicate significant mislabeling or lost information in the single read approach when compared to the more accurate paired-end sample. Taken with the organism matching results, these numbers suggest that the increased read length generated either through paired-end sequencing or through 150-bp single-end sequencing is necessary to ensure accurate metatranscriptome annotation. Westreich et al. BMC Bioinformatics (2016) 17:399 Page 6 of 12 Table 1 Comparison of rRNA abundance in control vs. ribodepleted samples Applying this pipeline to a public metatranscriptomic data set: organism and functional analysis Phylum Determining organisms present within fecal metatranscriptomes Publicly available metatranscriptomic data [12] was analyzed using the SAMSA pipeline described above (Fig. 1). Output tables generated by the pipeline are imported into R, where a stacked bar graph allows for comparing of relative activity levels both by organism (Fig. 5) and by functional category (Fig. 6). Due to the large number of both organism and functional categories, the R analysis scripts generate graphs which show only the top 30 most abundant organisms and/or functional categories, with remaining categories grouped under an “other” catch-all category (Fig. 6a) or removed from display (Fig. 6b). In the original paper [12], which focused primarily on the response of mouse gut epithelial cells to dextran sodium sulfate (DSS) induced colitis, the authors used primarily only the rRNA sequences from 16S amplicon sequencing of the samples to examine shifts in mouse gut microbiomes. They did report an increase in mRNA matching Enterobacteriaceae species in Tyrosine Kinase 2 (Tyk2) deficient mice, rising on day 3 of treatment, but provided no other mRNA analysis of these samples. Examining the results of our pipeline, we observed a 1.558-fold increase in Enterobacteriaceae mRNA expression in day 3 samples (p = 0.0055). Indeed, looking across all bacterial species with greater than 100 average expressed transcripts, we found significant increases in transcriptional activity in Escherichia and Providencia species, both of which are associated with ulcerative colitis [21]. In addition, we noted a significant decrease in Butyrivibrio species (p = 0.03), when comparing between wild-type and Tyk2-deficient mice over the course of DSS-induced colitis. Butyrivibrio is a probiotic-associated strain that may normally offer protection against colitis [22]. We also observed several significant changes in functional activity between the two groups, identifying more than 300 protein-coding transcripts differentially expressed between wild-type and Tyk2-deficient mice (adjusted p < 0.05, Additional file 1). Using publicly available metatranscriptomes that were not depleted of ribosomal RNAs [12], we compared estimates of organism abundance using either mRNA or rRNA transcripts, using matches in NCBI’s RefSeq database for mRNA and matches in the SILVA small subunit (SSU) database for rRNA. We found a consistent correlation between abundance estimates from mRNA vs. rRNA (Fig. 4) across all samples, suggesting that mRNA abundance estimates may be able to provide useful representative population data within a sample. With filtering to remove non-bacterial annotations, we observed an average Pearson correlation of 0.99 at the order level, 0.85 at the family level, and 0.81 at the genus level of identification across all 15 samples. Discussion Although still an emerging field, metatranscriptomics offers a powerful approach for analyzing complex functional activity within a heterogeneous microbial population, such as that found within the gut microbiome. Previous approaches for the study of the gut microbiome provide an incomplete picture, limited to only identifying variations in organism population levels and not including data on the gene expression of different species within the environment. Our pipeline fulfills the need for four main steps in metatranscriptomic data analysis: preprocessing, annotation, aggregation, and analysis. By leveraging MG-RAST’s 4012, P4 4012, P1 4015, P4 4015, P1 % depleted Gemmatimonadetes 3 1 28 5 74.40 Spirochaetes 3 4 14 55 163.10 Nitrospirae 4 1 22 0 87.50 Fusobacteria 5 5 24 9 31.25 Fibrobacteres 6 0 21 0 100.00 Synergistetes 6 5 14 2 51.19 Thermi 6 1 11 1 87.12 Chlorobi 6 1 34 0 91.67 Chloroflexi 7 6 29 2 53.69 Caldiserica 8 0 64 0 100.00 Caldithrix 11 0 70 2 98.57 Tenericutes 67 6 411 28 92.12 Chlamydiae 74 1 410 15 97.50 Acidobacteria 94 5 453 7 96.57 Euryarchaeota 175 15 942 693 58.93 Parvarchaeota 241 8 1298 22 97.49 Cyanobacteria 262 210 974 59 56.89 Planctomycetes 303 7 2019 23 98.28 Crenarchaeota 514 183 2474 106 80.06 Lentisphaerae 1045 19 6324 151 97.90 Verrucomicrobia 1107 29 6679 117 97.81 Actinobacteria 2739 5432 8394 5326 30.89 2612 Proteobacteria 43585 3780 61436 Bacteroidetes 486595 218584 1163164 98778 73.29 93.54 Firmicutes 939246 127130 2178859 38923 92.34 TOTALS 1476112 355433 3434266 146948 75.98 Two pilot metatranscriptomes were sequenced twice; the P1 protocol involved a ribodepletion step, while the P4 protocol did not. Comparisons of the organism annotations reveal that ribodepletion led to a decrease in 16S rRNA reads for most, but not all, bacterial phyla Westreich et al. BMC Bioinformatics (2016) 17:399 Page 7 of 12 Fig. 3 Effects of metatranscriptome size on read abundance variation. As the number of annotations in a metatranscriptome increases along the X axis, accuracy of abundance measurements increases for all reads. Red denotes the top 5 most abundant transcripts within the sampled metatranscriptome (by counts), while blue denotes transcripts of medium abundance (top 50 % by sorted counts) and green denotes low-abundance transcripts (bottom 10 % by sorted counts). Approximately 10–15 million annotations are needed before abundance accuracy for all transcripts tops 90 %. Abundance accuracy was measured out to 20 million annotations, but the accuracy was 100 % in all categories beyond 15 million (data not shown) annotation server, we can provide comprehensive annotation and analysis of a metatranscriptome without requiring a dedicated private server. SAMSA generates outputs at each step, creating a streamlined pipeline where each stage in data analysis can be independently examined. Using the SAMSA pipeline, we have established a set of “best practices” for metatranscriptome sequencing. Our work demonstrates that the increased specificity provided by a paired-end sequencing approach significantly increases specificity of read annotations. In addition, to provide a complete and accurate measure of read abundance within a metatranscriptome, approximately 40 million raw Table 2 Read count comparison between single and paired end files 4012 SR 4012 PE 4015 SR 4015 PE Forward reads 38,763,820 - 22,613,426 - Reverse reads 38,763,820 - 22,613,426 - Paired assembled reads - 19,269,224 - 7,245,834 Total annotations 11,840,915 21,683,084 1,928,992 3,041,963 Unique annotations 8,658,502 1,875,267 13,310,506 1,280,940 Two pilot paired-end metatranscriptomes were analyzed both in paired configuration, and using only the forward reads (to simulate a single-end metatranscriptome from the same data). Despite higher numbers of total reads, the single-end data matched to fewer total and unique annotations in both cases reads must be sequenced, providing an estimated 10 million mRNA annotations. Due to the vast majority of extracted total RNA originating from ribosomes, ribodepletion is strongly recommended for all metatranscriptome sample processing. We demonstrate that although ribodepletion is not successful at removing all rRNA, and skews the proportions of remaining rRNA within a sample, it greatly increases the number of mRNA annotations obtained per metatranscriptome. Although we only tested one commercially available ribodepletion kit, we believe that all ribodepletion methods based on hybridization with complementary rRNA oligonucleotide probes will carry some intrinsic bias, skewing the proportions of remaining rRNA within the sample. Therefore, we recommend discarding all remaining ribosomal reads before performing further analysis on these samples. Although ribodepletion results in a loss of information regarding organism abundance within a metatranscriptomic sample, we show that total mRNA can be used as an alternate method of evaluating overall organism presence within a sample. Compared to mRNA results, we found a higher number of false annotations among the rRNA reads, particularly for eukaryotic organisms which were not present in the original fecal samples, including plant species such as Sisymbrium (cabbage), Westreich et al. BMC Bioinformatics (2016) 17:399 Page 8 of 12 Fig. 4 Comparison of mRNA vs. rRNA based abundance estimates. a Stacked bar graph measuring percentage distribution of total metatranscriptome activity by organism, with pairwise comparisons between mRNA and rRNA transcripts. To reduce potential mislabeling of organisms in the “long tail” of low-abundance organisms, only the top 30 most abundant organisms are displayed; other results are included in the purple “Other” catchall category. b The same measurement, expressed in total number of annotations per metatranscriptome sample. Due to a lack of ribodepletion, rRNA transcripts dominate all samples Arabidopsis, and Populus (cottonwood), Bos taurus, human, and Coptotermes (termite). This increased mismatch rate may be due to the fact that MG-RAST is not tailored to work with eukaryotic organisms. Another reason that mRNA and rRNA results were not perfectly correlated is that there were insufficient mRNA reads in the data set used for direct comparison. Our data (Fig. 3) suggested that a minimum of 10 million annotated reads are required for accurate annotations, while the nonribosomal depleted data set averaged only 1–2 million annotated mRNA reads per sample. While total mRNA may not be a perfect proxy of organism population size, the extent of mismatches in the rRNA data suggest that mRNA is at least better than total rRNA for judging overall activity-based abundance. Analyzing a public data set, we confirmed the rRNA patterns previously described by the producers of that data [12], and also identified significant changes in other, more specific genera of microbial species, including increases in Escherichia and Providencia and decreases in Butyrivibrio. Both the increase in Escherichia and Providencia and the decrease in Butyrivibrio species were only notable at the genus level and would not be identifiable in 16S data, demonstrating the value of mRNA-based metatranscriptomics analysis. In addition, we successfully identified a variety of functional expressed activities of gut microbes that significantly differed between wild-type and Tyk2−/− strains of mice strains. As the field of metatranscriptomics continues to grow and expand, we expect that metatranscriptome analysis will become increasingly important to understand the functional responses of gut microbiome communities. By using mRNA transcripts to identify both the overall transcriptional activity levels of organisms within a sample and changes in specific gene or functional expression, we can gain a better understanding of the capabilities and actions within an active gut microbiome at any chosen point in time. Because all parts of SAMSA are open source and publicly accessible, this tool can be used even by researchers with little previous experience Westreich et al. BMC Bioinformatics (2016) 17:399 Page 9 of 12 Fig. 5 Organism output of 15 metatranscriptomes. a Stacked bar graph measuring percentage distribution of total metatranscriptome activity by organism, comparing 6 wild-type to 9 tyrosine kinase 2 knockout gut microbiomes. To reduce potential mislabeling of organisms in the “long tail” of low-abundance organisms, only the top 30 most abundant organisms are displayed; other results are included in the purple “Other” catchall category. b The same measurement, expressed in total number of annotations per metatranscriptome sample in working with metatranscriptomes. We hope to encourage the more widespread use of metatranscriptomics as the next “big data” tool for determining activity within complex microbiome populations. Although the SAMSA pipeline successfully annotated and summarized gut microbiome metatranscriptome data without the need for large server resources, several limitations still exist. MG-RAST’s annotation servers require waiting in a queue, and processing may be slowed by days to weeks. Additionally, although SAMSA can reveal shifts in functional expression patterns, shifts in microbial population sizes cannot be measured as accurately as using shotgun metagenomics, and future approaches may focus on incorporating both techniques to ensure completeness. Future versions of this pipeline will focus on a more rapid in-house annotation step, using custom built reference databases and allowing for greater speed and additional analysis options, drawing from multiple reference databases as data progresses through the pipeline. Conclusions We have created a new pipeline for metatranscriptome analysis, functioning in conjunction with the MGRAST annotation pipeline. This pipeline is capable of determining functional expression activity within a sample at the transcript level, and provides measures of total transcription, differentiated either by organism or by functional category of transcript. This pipeline will enable more rapid adoption of metatranscriptomics Westreich et al. BMC Bioinformatics (2016) 17:399 Page 10 of 12 Fig. 6 Transcript functional category output of 13 metatranscriptomes. a Stacked bar graph measuring percentage distribution of total metatranscriptome activity by functional category, comparing 5 wild-type to 8 tyrosine kinase 2 knockout mouse gut microbiomes. To reduce graph complexity, only the top 30 most abundant functions in each sample are displayed. Two samples were excluded due to very low total read counts. b The same measurements, expressed in total number of annotations per metatranscriptome methods. Finally, we recommend that stool metatranscriptomes be ribodepleted and sequenced in a 100 bp paired end format with a minimum of 40 million reads per sample. Methods Sample Collection Stool samples were collected from two normal healthy adults. Stool was added to 50 mL tubes containing 25 mL RNAlater until the total volume reached 30 mL. Samples were then sealed and shaked vigourously until the contents appeared to be uniform in consistency. The samples were then temporarily stored at −20 C before being moved to −80 C. RNA extraction Two different RNA extraction protocols were applied to the two human subject stool samples, adapted from Giannoukos et al.[10]. Samples were initially frozen at −80° C in RNAlater. Samples were partially thawed and mixed with bacterial lysis buffer, incubated, and then homogenized through both bead-beating and QIAshredder treatments. Extraction was performed using the Qiagen RNeasy isolation kit, with additional rigorous Turbo DNAse treatment to remove DNA contamination. Ribodepletion was performed on two of the samples (designated P1) using the RiboZero Magnetic GOLD kit. Poly(A) selection was performed on two samples (designated P4) using the Illumina TruSeq kit protocol.

SAMSA: A comprehensive metatranscriptome analysis pipeline

Nội dung