diff --git a/README.md b/README.md index 00f5f4e..6346713 100644 --- a/README.md +++ b/README.md @@ -1,40 +1,36 @@ -# RNA-seek - -[![DOI](https://zenodo.org/badge/305525443.svg)](https://zenodo.org/badge/latestdoi/305525443) [![GitHub releases](https://img.shields.io/github/release/skchronicles/RNA-seek)](https://github.com/skchronicles/RNA-seek/releases) ![Docker Pulls](https://img.shields.io/docker/pulls/nciccbr/ccbr_arriba_2.0.0) [![Build](https://github.com/skchronicles/RNA-seek/workflows/Tests/badge.svg)](https://github.com/skchronicles/RNA-seek/actions) [![GitHub issues](https://img.shields.io/github/issues/skchronicles/RNA-seek?color=brightgreen)](https://github.com/skchronicles/RNA-seek/issues) [![GitHub license](https://img.shields.io/github/license/skchronicles/RNA-seek)](https://github.com/skchronicles/RNA-seek/blob/main/LICENSE) +# RENEE - **R**na s**E**quencing a**N**alysis pip**E**lin**E** An open-source, reproducible, and scalable solution for analyzing RNA-seq data. ### Table of Contents -1. [Introduction](#1-Introduction) -2. [Overview](#2-Overview-of-Pipeline) - 2.1 [RNA-seek Pipeline](#21-RNA-seek-Pipeline) - 2.2 [Reference Genomes](#22-Reference-Genomes) - 2.3 [Dependencies](#23-Dependencies) - 2.4 [Installation](#24-Installation) -3. [Run RNA-seek pipeline](#3-Run-RNA-seek-pipeline) - 3.1 [Using Singularity](#31-Using-Singularity) - 3.2 [Using Docker](#32-Using-Docker) - 3.3 [Biowulf](#33-Biowulf) -4. [Contribute](#4-Contribute) -5. [References](#5-References) +- [RENEE - **R**na s**E**quencing a**N**alysis pip**E**lin**E**](#renee---rna-sequencing-analysis-pipeline) + - [Table of Contents](#table-of-contents) + - [1. Introduction](#1-introduction) + - [2. Overview](#2-overview) + - [2.1 RENEE Pipeline](#21-renee-pipeline) + - [2.2 Reference Genomes](#22-reference-genomes) + - [2.3 Dependencies](#23-dependencies) + - [3. Run RENEE pipeline](#3-run-renee-pipeline) + - [3.3 Biowulf](#33-biowulf) + - [5. References](#5-references) ### 1. 
Introduction RNA-sequencing (*RNA-seq*) has a wide variety of applications. This popular transcriptome profiling technique can be used to quantify gene and isoform expression, detect alternative splicing events, predict gene-fusions, call variants and much more. -**RNA-seek** is a comprehensive, open-source RNA-seq pipeline that relies on technologies like [Docker20](https://www.docker.com/why-docker) and [Singularity21](https://singularity.lbl.gov/) to maintain the highest-level of reproducibility. The pipeline consists of a series of data processing and quality-control steps orchestrated by [Snakemake19](https://snakemake.readthedocs.io/en/stable/), a flexible and scalable workflow management system, to submit jobs to a cluster or cloud provider. +**RENEE** is a comprehensive, open-source RNA-seq pipeline that relies on technologies like [Docker20](https://www.docker.com/why-docker) and [Singularity21](https://singularity.lbl.gov/) to maintain the highest level of reproducibility. The pipeline consists of a series of data processing and quality-control steps orchestrated by [Snakemake19](https://snakemake.readthedocs.io/en/stable/), a flexible and scalable workflow management system, to submit jobs to a cluster or cloud provider. -![RNA-seek_overview_diagram](https://github.com/skchronicles/RNA-seek/blob/main/resources/overview.svg) +![RENEE_overview_diagram](./resources/overview.svg) **Fig 1. Run locally on a compute instance, on-premise using a cluster, or on the cloud using AWS.** A user can define the method or mode of execution. The pipeline can submit jobs to a cluster using a job scheduler like SLURM, or run on AWS using Tibanna (feature coming soon!). A hybrid approach ensures the pipeline is accessible to all users. As an optional step, relevelant output files and metadata can be stored in object storage using HPC DME (NIH users) or Amazon S3 for archival purposes (coming soon!). ### 2. 
Overview -#### 2.1 RNA-seek Pipeline -A bioinformatics pipeline is more than the sum of its data processing steps. A pipeline without quality-control steps provides a myopic view of the potential sources of variation within your data (i.e., biological verses technical sources of variation). RNA-seek pipeline is composed of a series of quality-control and data processing steps. -The accuracy of the downstream interpretations made from transcriptomic data are highly dependent on initial sample library. Unwanted sources of technical variation, which if not accounted for properly, can influence the results. RNA-seek's comprehensive quality-control helps ensure your results are reliable and _reproducible across experiments_. In the data processing steps, RNA-seek quantifies gene and isoform expression and predicts gene fusions. Please note that the detection of alternative splicing events and variant calling will be incorporated in a later release. +#### 2.1 RENEE Pipeline +A bioinformatics pipeline is more than the sum of its data processing steps. A pipeline without quality-control steps provides a myopic view of the potential sources of variation within your data (i.e., biological versus technical sources of variation). The RENEE pipeline is composed of a series of quality-control and data processing steps. +The accuracy of downstream interpretations made from transcriptomic data is highly dependent on the initial sample library. Unwanted sources of technical variation, if not accounted for properly, can influence the results. RENEE's comprehensive quality-control helps ensure your results are reliable and _reproducible across experiments_. In the data processing steps, RENEE quantifies gene and isoform expression and predicts gene fusions. Please note that the detection of alternative splicing events and variant calling will be incorporated in a later release. 
-![RNA-seq quantification pipeline](https://github.com/skchronicles/RNA-seek/blob/main/resources/RNA-seek_Pipeline.svg) **Fig 2. An Overview of RNA-seek Pipeline.** Gene and isoform counts are quantified and a series of QC-checks are performed to assess the quality of the data. This pipeline stops at the generation of a raw counts matrix and gene-fusion calling. To run the pipeline, a user must select their raw data, a reference genome, and output directory (i.e., the location where the pipeline performs the analysis). Quality-control information is summarized across all samples in a MultiQC report. +![RNA-seq quantification pipeline](./resources/RENEE_Pipeline.svg) **Fig 2. An Overview of RENEE Pipeline.** Gene and isoform counts are quantified and a series of QC-checks are performed to assess the quality of the data. This pipeline stops at the generation of a raw counts matrix and gene-fusion calling. To run the pipeline, a user must select their raw data, a reference genome, and output directory (i.e., the location where the pipeline performs the analysis). Quality-control information is summarized across all samples in a MultiQC report. **Quality Control** [*FastQC*2](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) is used to assess the sequencing quality. FastQC is run twice, before and after adapter trimming. It generates a set of basic statistics to identify problems that can arise during sequencing or library preparation. FastQC will summarize per base and per read QC metrics such as quality scores and GC content. It will also summarize the distribution of sequence lengths and will report the presence of adapter sequences. 
@@ -52,7 +48,7 @@ The accuracy of the downstream interpretations made from transcriptomic data are **Quantification** [*Cutadapt*3](https://cutadapt.readthedocs.io/en/stable/) is used to remove adapter sequences, perform quality trimming, and remove very short sequences that would otherwise multi-map all over the genome prior to alignment. -[*STAR*4](https://github.com/alexdobin/STAR) is used to align reads to the reference genome. The RNA-seek pipeline runs STAR in a two-passes where splice-junctions are collected and aggregated across all samples and provided to the second-pass of STAR. In the second pass of STAR, the splice-junctions detected in the first pass are inserted into the genome indices prior to alignment. +[*STAR*4](https://github.com/alexdobin/STAR) is used to align reads to the reference genome. The RENEE pipeline runs STAR in two passes: splice-junctions are collected in the first pass, aggregated across all samples, and provided to the second pass of STAR. In the second pass of STAR, the splice-junctions detected in the first pass are inserted into the genome indices prior to alignment. [*RSEM*5](https://github.com/deweylab/RSEM) is used to quantify gene and isoform expression. The expected counts from RSEM are merged across samples to create a two counts matrices for gene counts and isoform counts. @@ -60,45 +56,28 @@ The accuracy of the downstream interpretations made from transcriptomic data are #### 2.2 Reference Genomes Reference files are pulled from an S3 bucket to the compute instance or local filesystem prior to execution. 
-RNA-seek comes bundled with pre-built reference files for the following genomes: +RENEE comes bundled with pre-built reference files for the following genomes: | Name | Species | Genome | Annotation | | -------- | ------- | ------------------ | -------- | | hg38_30 | Homo sapiens (human) | [GRCh38](http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh38.primary_assembly.genome.fa.gz) | [Gencode6 Release 30](http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/gencode.v30.annotation.gtf.gz) | | mm10_M21 | Mus musculus (mouse) | [GRCm38](http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M18/GRCm38.primary_assembly.genome.fa.gz) | [Gencode Release M21](http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M21/gencode.vM21.annotation.gtf.gz) | > **Warning:** This section contains FTP links for downloading each reference file. Open the link in a new tab to start a download. +> **Note:** Release 30 for hg38 and Release M21 for mm10 were the only annotation versions available at the time of writing this documentation. Newer annotation versions may be added upon request and may already be available. Please contact [Vishal Koparde](mailto:vishal.koparde@nih.gov) for details. #### 2.3 Dependencies **Requires:** `singularity>=3.5` `snakemake>=6.0` -[Snakemake](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html) and [singularity](https://singularity.lbl.gov/all-releases) must be installed on the target system. Snakemake orchestrates the execution of each step in the pipeline. To guarantee reproducibility, each step relies on pre-built images from [DockerHub](https://hub.docker.com/orgs/nciccbr/repositories). Snakemake uses singaularity to pull these images onto the local filesystem prior to job execution, and as so, snakemake and singularity are the only two dependencies. 
- -#### 2.4 Installation -Please clone this repository to your local filesystem using the following command: -```bash -# Clone Repository from Github -git clone https://github.com/skchronicles/RNA-seek.git -# Change your working directory to the RNA-seek repo -cd RNA-seek/ -``` - -### 3. Run RNA-seek pipeline - -#### 3.1 Using Singularity -```bash -# Coming Soon! -``` +[Snakemake](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html) and [singularity](https://singularity.lbl.gov/all-releases) must be installed on the target system. Snakemake orchestrates the execution of each step in the pipeline. To guarantee reproducibility, each step relies on pre-built images from [DockerHub](https://hub.docker.com/orgs/nciccbr/repositories). Snakemake pulls these Docker images, converts them to Singularity images on the fly, and saves them onto the local filesystem prior to job execution; as a result, Snakemake and Singularity are the only two dependencies. -#### 3.2 Using Docker -```bash -# Coming Soon! -``` +### 3. Run RENEE pipeline #### 3.3 Biowulf ```bash -# rna-seek is configured to use different execution backends: local or slurm +# RENEE is configured to use different execution backends: local or slurm # view the help page for more information -./rna-seek run --help +module load ccbrpipeliner +RENEE run --help # @local: uses local singularity execution method # The local MODE will run serially on compute @@ -112,27 +91,16 @@ cd RNA-seek/ sinteractive --mem=110g --cpus-per-task=12 --gres=lscratch:200 module purge module load singularity snakemake -./rna-seek run --input .tests/*.R?.fastq.gz --output /data/$USER/RNA_hg38 --genome hg38_30 --mode local +RENEE run --input .tests/*.R?.fastq.gz --output /data/$USER/RNA_hg38 --genome hg38_30 --mode local # @slurm: uses slurm and singularity execution method # The slurm MODE will submit jobs to the cluster. -# It is recommended running rna-seek in this mode. +# Running RENEE in this mode is recommended. 
module purge module load singularity snakemake -./rna-seek run --input .tests/*.R?.fastq.gz --output /data/$USER/RNA_hg38 --genome hg38_30 --mode slurm +RENEE run --input .tests/*.R?.fastq.gz --output /data/$USER/RNA_hg38 --genome hg38_30 --mode slurm ``` -### 4. Contribute - -This section is for new developers working with the RNA-seek pipeline. If you have added new features or adding new changes, please consider contributing them back to the original repository: - -1. [Fork](https://help.github.com/en/articles/fork-a-repo) the original repo to a personal or org account. -2. [Clone](https://help.github.com/en/articles/cloning-a-repository) the fork to your local filesystem. -3. Copy the modified files to the cloned fork. -4. Commit and push your changes to your fork. -5. Create a [pull request](https://help.github.com/en/articles/creating-a-pull-request) to this repository. - - ### 5. References **1.** Daley, T. and A.D. Smith, Predicting the molecular complexity of sequencing libraries. Nat Methods, 2013. 10(4): p. 325-7. @@ -161,5 +129,5 @@ This section is for new developers working with the RNA-seek pipeline. If you ha

- Back to Top + Back to Top