Estimate the relative abundance of sequence reads originating from different species in a sample.

phac-nml/speciesabundance

SpeciesAbundance Pipeline

This is the nf-core-based pipeline for SpeciesAbundance. It estimates the relative abundance of sequence reads originating from different species in a sample. It is designed to be integrated into IRIDA Next, but it may also be run as a stand-alone pipeline.

This pipeline is designed to estimate taxonomic abundance using both single- and paired-end Illumina short-read data. It does not currently accommodate long-read sequencing data (Nanopore or PacBio).

Input

The input to the pipeline is a standard sample sheet (passed as --input samplesheet.csv) that looks like:

sample   fastq_1          fastq_2
SampleA  file_1.fastq.gz  file_2.fastq.gz

An example samplesheet has been provided with the pipeline.

The structure of this file is defined in assets/schema_input.json. Validation of the sample sheet is performed by nf-validation.
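As a rough illustration of the layout described above, the following Python sketch builds the text of a samplesheet. It assumes the file is comma-delimited (suggested by the .csv extension), that sample values must be unique, and that single-end samples leave fastq_2 empty; the last point is an assumption borrowed from common nf-core practice, not something stated here. The helper name build_samplesheet and the SampleB row are hypothetical. The authoritative rules remain those in assets/schema_input.json.

```python
import csv
import io

# Column names taken from the samplesheet example in this README.
COLUMNS = ["sample", "fastq_1", "fastq_2"]


def build_samplesheet(rows):
    """Return samplesheet text for (sample, fastq_1, fastq_2) tuples.

    Assumption: fastq_2 is left empty for single-end samples.
    """
    seen = set()
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    for sample, fastq_1, fastq_2 in rows:
        if sample in seen:
            raise ValueError(f"duplicate sample identifier: {sample}")
        seen.add(sample)
        writer.writerow([sample, fastq_1, fastq_2 or ""])
    return buffer.getvalue()


sheet = build_samplesheet([
    ("SampleA", "file_1.fastq.gz", "file_2.fastq.gz"),
    ("SampleB", "single.fastq.gz", None),  # hypothetical single-end sample
])
print(sheet)
```

The resulting text can be written to samplesheet.csv and passed with --input.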

IRIDA-Next Optional Input Configuration

speciesabundance accepts the IRIDA-Next format for samplesheets, which can contain an additional column: sample_name.

sample_name: An optional column that overrides sample for outputs (filenames and sample names) and reference assembly identification.

sample_name allows more flexibility in naming output files or identifying samples. Unlike sample, sample_name is not required to contain unique values. Because Nextflow requires unique sample names, when a sample_name is repeated, sample will be suffixed to that sample_name. Non-alphanumeric characters (excluding _, -, and .) will be replaced with "_".

The sample sheet, when including the optional sample_name column, should look like:

sample   sample_name  fastq_1          fastq_2
SampleA  A1           file_1.fastq.gz  file_2.fastq.gz

An example samplesheet has been provided with the pipeline, which includes the sample_name column.
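The sample_name handling described in this section can be approximated with the short Python sketch below. It is not the pipeline's own code: the function name resolve_output_names is hypothetical, and the "_" separator used when suffixing sample, the choice to detect repeats after character replacement, and the ASCII-only reading of "alphanumeric" are all assumptions.

```python
import re
from collections import Counter


def resolve_output_names(rows):
    """Map each sample to the name used for its outputs.

    rows is a list of (sample, sample_name) pairs; sample_name may be empty.
    """
    cleaned = []
    for sample, sample_name in rows:
        # Keep letters, digits, "_", "-", and "."; replace everything else with "_".
        name = re.sub(r"[^A-Za-z0-9_.\-]", "_", sample_name) if sample_name else sample
        cleaned.append((sample, name))

    counts = Counter(name for _, name in cleaned)
    resolved = {}
    for sample, name in cleaned:
        # Assumption: repeated names are made unique as "<sample_name>_<sample>".
        resolved[sample] = f"{name}_{sample}" if counts[name] > 1 else name
    return resolved


names = resolve_output_names([
    ("S1", "A1"),
    ("S2", "A1"),            # repeated sample_name
    ("S3", "my sample#3"),   # contains characters that get replaced
    ("S4", ""),              # no sample_name, falls back to sample
])
print(names)
```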

Parameters

Mandatory

The mandatory parameters are as follows:

  • --input : a URI to the samplesheet as specified in the Input section.
  • --outdir : to specify the output results directory.

Database Selection

You must provide either --database on its own, or both --kraken2_db and --bracken_db together.

Use only:

  • --database /path/to/database : to specify the directory containing the database files required by both Kraken2 and Bracken

Or:

  • --kraken2_db /path/to/kraken2database : to specify the directory containing the Kraken2 database files, and
  • --bracken_db /path/to/brackendatabase : to specify the directory containing the Bracken database files
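The either/or rule for database selection can be expressed as a small check. The Python sketch below is only an illustration of that rule for anyone wrapping the pipeline in a script; the function name database_args is hypothetical, and the pipeline performs its own validation.

```python
def database_args(database=None, kraken2_db=None, bracken_db=None):
    """Return the database flags for the command line, or raise if the
    combination does not match the rule described in this section."""
    if database and (kraken2_db or bracken_db):
        raise ValueError("use --database OR --kraken2_db with --bracken_db, not both")
    if database:
        return ["--database", database]
    if kraken2_db and bracken_db:
        return ["--kraken2_db", kraken2_db, "--bracken_db", bracken_db]
    raise ValueError("provide --database, or both --kraken2_db and --bracken_db")


shared = database_args(database="/path/to/database")
split = database_args(
    kraken2_db="/path/to/kraken2database",
    bracken_db="/path/to/brackendatabase",
)
print(shared)
print(split)
```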

Optional

Additionally, you may choose to provide:

SpeciesAbundance Parameters

  • --taxonomic_level : to specify the taxonomic level of the Bracken abundance estimation.
    • Must be one of: S (species, default), G (genus), O (order), F (family), P (phylum), or K (kingdom)
  • --kmer_len : to specify the k-mer length for the Bracken distribution file used to estimate the abundance at the specified taxonomic level.
    • Must be one of: 50, 75, 100 (default), 150, 200, 250, or 300
    • Selecting a lower k-mer length enhances sensitivity, while a higher k-mer length increases specificity.
  • --top_n : to specify the number of top results to keep and include in the metadata for IRIDA Next.
    • Default: 5
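For scripted runs, the allowed values listed above can be checked before launching the pipeline. This Python sketch mirrors the documented choices and defaults; the function name optional_args is hypothetical, and the requirement that --top_n be a positive integer is an assumption rather than something stated here.

```python
# Allowed values and defaults as listed in this README.
TAXONOMIC_LEVELS = ("S", "G", "O", "F", "P", "K")
KMER_LENGTHS = (50, 75, 100, 150, 200, 250, 300)


def optional_args(taxonomic_level="S", kmer_len=100, top_n=5):
    """Validate the SpeciesAbundance options and return them as CLI flags."""
    if taxonomic_level not in TAXONOMIC_LEVELS:
        raise ValueError(f"taxonomic_level must be one of {TAXONOMIC_LEVELS}")
    if kmer_len not in KMER_LENGTHS:
        raise ValueError(f"kmer_len must be one of {KMER_LENGTHS}")
    if not isinstance(top_n, int) or top_n < 1:
        # Assumption: top_n is a positive integer.
        raise ValueError("top_n must be a positive integer")
    return [
        "--taxonomic_level", taxonomic_level,
        "--kmer_len", str(kmer_len),
        "--top_n", str(top_n),
    ]


defaults = optional_args()
genus_level = optional_args(taxonomic_level="G", kmer_len=150, top_n=10)
print(defaults)
print(genus_level)
```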

Other Parameters

  • -profile : to specify which profile to use (ex: -profile singularity)
  • -r [branch] : to specify which GitHub branch you would like to use (ex: -r dev)

Other parameters (defaults from nf-core) are defined in nextflow_schema.json.

Running

Test Data

To run the pipeline using the test profile, please run:

nextflow run phac-nml/speciesabundance -profile docker,test --outdir results

The pipeline output will be written to a directory named results. A JSON file for integrating with IRIDA Next will be written to results/iridanext.output.json.gz (as detailed in the Output section).

Output

Results

The following output files are generated by the pipeline:

  • fastp/
    • sampleID_{R1/R2}_trimmed.fastq.gz
    • sampleID.fastp.json
    • sampleID.fastp.html
  • kraken2/
    • sampleID_kraken2_output.tsv.gz
    • sampleID_kraken2_report.txt
  • bracken/
    • sampleID_S_bracken_abundance_unsorted.tsv
    • sampleID_S_bracken.txt
  • failure/
    • failures_report.csv
  • adjust/
    • sampleID_S_bracken_abundances.csv
    • sampleID_S_adjusted_report.txt
  • top/
    • sampleID_S_top_N.csv
  • csvtk/
    • merged_topN.csv
  • bracken2krona/
    • sampleID.txt
  • krona/
    • sampleID.krona.html

IRIDA Next Integration File

A JSON file for loading metadata into IRIDA Next is output by this pipeline. The format of this JSON file is specified in our Pipeline Standards for the IRIDA Next JSON. This JSON file is written directly within the --outdir provided to the pipeline with the name iridanext.output.json.gz (ex: [outdir]/iridanext.output.json.gz).

An example of what the contents of the IRIDA Next JSON file look like for this particular pipeline is as follows:

{
    "files": {
        "global": [
            {
                "path": "failure/failures_report.csv"
            }
        ],
        "samples": {
            "sampleID": [
                {"path": "adjust/sampleID_S_bracken_abundances.csv"},
                {"path": "krona/sampleID.krona.html"},
                {"path": "fastp/sampleID.fastp.html"}
            ]
        }
    },
    "metadata": {
        "samples": {
            "sampleID": {
                "taxonomy_level": "S",
                "abundance_1_name": "Bacteroides fragilis",
                "abundance_1_ncbi_taxonomy_id": "817",
                "abundance_1_num_assigned_reads": "28877",
                "abundance_1_fraction_total_reads": "57.77018",
                "abundance_2_name": "Escherichia coli",
                "abundance_2_ncbi_taxonomy_id": "562",
                "abundance_2_num_assigned_reads": "21065",
                "abundance_2_fraction_total_reads": "42.1418",
                "abundance_3_name": "",
                "abundance_3_ncbi_taxonomy_id": "",
                "abundance_3_num_assigned_reads": "",
                "abundance_3_fraction_total_reads": "",
                "abundance_4_name": "",
                "abundance_4_ncbi_taxonomy_id": "",
                "abundance_4_num_assigned_reads": "",
                "abundance_4_fraction_total_reads": "",
                "abundance_5_name": "",
                "abundance_5_ncbi_taxonomy_id": "",
                "abundance_5_num_assigned_reads": "",
                "abundance_5_fraction_total_reads": "",
                "unclassified_name": "unclassified",
                "unclassified_ncbi_taxonomy_id": "0",
                "unclassified_num_assigned_reads": "44",
                "unclassified_fraction_total_reads": "0.08802"
            }
        }
    }
}

Within the files section of this JSON file, all of the output paths are relative to the outdir. Therefore, "path": "adjust/sampleID_S_bracken_abundances.csv" refers to the file located at [outdir]/adjust/sampleID_S_bracken_abundances.csv.
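To make the structure above concrete, here is a minimal Python sketch that reads iridanext.output.json.gz, resolves a sample's file paths against the output directory, and collects the filled abundance_N_* slots. The function names are hypothetical, and the demo writes a trimmed copy of the example to a temporary directory so that the snippet runs without real pipeline output. It assumes unused slots are empty strings, as in the example.

```python
import gzip
import json
import tempfile
from pathlib import Path


def load_iridanext(outdir):
    """Read [outdir]/iridanext.output.json.gz and return the parsed JSON."""
    path = Path(outdir) / "iridanext.output.json.gz"
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def sample_files(data, outdir, sample):
    """Resolve a sample's relative "path" entries against the output directory."""
    return [Path(outdir) / entry["path"] for entry in data["files"]["samples"][sample]]


def top_abundances(data, sample):
    """Return (name, taxonomy id, assigned reads, share of reads) per filled slot."""
    meta = data["metadata"]["samples"][sample]
    hits = []
    rank = 1
    while f"abundance_{rank}_name" in meta:
        if meta[f"abundance_{rank}_name"]:  # empty strings mark unused slots
            hits.append((
                meta[f"abundance_{rank}_name"],
                meta[f"abundance_{rank}_ncbi_taxonomy_id"],
                int(meta[f"abundance_{rank}_num_assigned_reads"]),
                float(meta[f"abundance_{rank}_fraction_total_reads"]),
            ))
        rank += 1
    return hits


# Trimmed copy of the example above, used only for this demonstration.
example = {
    "files": {"samples": {"sampleID": [
        {"path": "adjust/sampleID_S_bracken_abundances.csv"},
        {"path": "krona/sampleID.krona.html"},
    ]}},
    "metadata": {"samples": {"sampleID": {
        "taxonomy_level": "S",
        "abundance_1_name": "Bacteroides fragilis",
        "abundance_1_ncbi_taxonomy_id": "817",
        "abundance_1_num_assigned_reads": "28877",
        "abundance_1_fraction_total_reads": "57.77018",
        "abundance_2_name": "Escherichia coli",
        "abundance_2_ncbi_taxonomy_id": "562",
        "abundance_2_num_assigned_reads": "21065",
        "abundance_2_fraction_total_reads": "42.1418",
        "abundance_3_name": "",
        "abundance_3_ncbi_taxonomy_id": "",
        "abundance_3_num_assigned_reads": "",
        "abundance_3_fraction_total_reads": "",
    }}},
}

with tempfile.TemporaryDirectory() as outdir:
    target = Path(outdir) / "iridanext.output.json.gz"
    with gzip.open(target, "wt", encoding="utf-8") as handle:
        json.dump(example, handle)
    data = load_iridanext(outdir)
    files = sample_files(data, outdir, "sampleID")
    hits = top_abundances(data, "sampleID")

print(hits)
```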

Failures

If one or more samples fail during pipeline execution, the workflow will still run all other samples in the samplesheet. The samples that fail will be reported in a file named results/failure/failures_report.csv. This CSV file has three columns:

  • sample : the name of the sample that failed (matching the input samplesheet)
  • module : the module (or process) where the error occurred
  • error_message : suggestions that aim to provide insights into potential reasons for sample failure in the respective process

For example:

sample,module,error_message
[SAMPLE1],FASTP,The input FASTQ file(s) might exhibit either a mismatch in PAIRED files; corruption in one or both SINGLE/PAIRED file(s); or file(s) may not exist in PATH provided by input samplesheet
[SAMPLE2],KRAKEN2,The reads may not have passed the quality control and trimming process OR the database directory may be missing required KRAKEN2 files
[SAMPLE3],BRACKEN,The reads may have failed to classify against the selected Kraken2 database OR the database directory may be missing the Bracken distribution files
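A report in this three-column layout can be summarised with Python's csv module. The sketch below groups failed samples by module; the function name failures_by_module is hypothetical, the error messages in the demo are shortened placeholders, and it assumes that any message containing a comma would be quoted in the file.

```python
import csv
import io
from collections import defaultdict


def failures_by_module(csv_text):
    """Group failed sample names by the module (process) that reported them."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["module"]].append(row["sample"])
    return dict(grouped)


# Placeholder messages; a real report carries the longer suggestions shown above.
report = """sample,module,error_message
SAMPLE1,FASTP,placeholder message about the input FASTQ files
SAMPLE2,KRAKEN2,placeholder message about quality control or the database
SAMPLE3,BRACKEN,placeholder message about classification or distribution files
SAMPLE4,FASTP,placeholder message about the input FASTQ files
"""

summary = failures_by_module(report)
for module, samples in summary.items():
    print(f"{module}: {', '.join(samples)}")
```

For a real run, the text would come from reading [outdir]/failure/failures_report.csv rather than an inline string.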

Legal

Copyright 2024 Government of Canada

Licensed under the MIT License (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:

https://opensource.org/license/mit/

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Derivative Work

This pipeline includes source code from a Nextflow pipeline for taxon-abundance and an IRIDA plugin for SpeciesAbundance developed by Dan Fornika as a work of the BC Centre for Disease Control Public Health Laboratory (BCCDC-PHL).

The included source code developed by Dan Fornika as a work of the BCCDC-PHL was distributed within the public domain under the Apache Software License version 2.0.

Any such source files in this project that are included from or derived from the original work by Dan Fornika will include a notice.
