rG4-seeker

rG4-seeker is a pipeline for processing and analyzing rG4-seq data (https://www.nature.com/articles/nmeth.3965)

Requirements

  • 16GB RAM
  • At least 30GB free hard disk space
  • At least one CPU core (recommended: a 4-core desktop CPU)
  • Linux operating system (tested on Ubuntu 16.04 and 18.04)
  • Python 3

Installation

git clone https://github.com/TF-Chan-Lab/rG4-seeker
cd rG4-seeker
pip install .

Usage

  1. Prepare input files for rG4-seeker

    • Reference genome in FASTA format

    • One or more gene annotations in GFF3 or GTF format

    • Genome-aligned, coordinate-sorted rG4-seq reads in BAM format

      • Each BAM file must be indexed (the index is generated by the “samtools index” command)
      • Currently, only output from the STAR aligner is supported/tested
      • Support for the HISAT2 and TopHat2 aligners will be added in the future
    • A working directory for rG4-seeker to write its intermediate/output files

  2. Construct a configuration file

    • rG4-seeker reads program settings and the locations of input files from a configuration file (config.ini)
    • A template configuration file ( example.ini ) is provided
    • Please refer to the ‘Configuration file formatting’ section for details on modifying the configuration file
  3. Run rG4-seeker:

    rG4-seeker config.ini
    
  4. Obtain results:

    • RTS sites identified by rG4-seeker are reported in 2 CSV files, which can be opened directly in spreadsheet editors such as Microsoft Excel.
    • The 2 CSV files have identical content except for the level of detail in the sequence diagram:

    [SAMPLE_NAME].rG4_list.full_combined.csv

    The sequence diagram shows the RTS sites aggregated from all rG4-seq K+ and K+/PDS replicate datasets

    [SAMPLE_NAME].rG4_list.full_combined_breakdown.csv

    The sequence diagram shows the RTS sites identified in each individual dataset
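
As a rough sketch of picking up the two reports programmatically, the Python snippet below builds the expected file names from a sample name and searches a working directory for them. The helper names (find_reports, count_data_rows) are hypothetical. The exact location of the reports inside the working directory is not stated above, so the search is recursive, and treating the first CSV line as a header is also an assumption.

```python
import csv
from pathlib import Path

# File name endings taken from the two report names listed above.
REPORT_SUFFIXES = (
    "rG4_list.full_combined.csv",
    "rG4_list.full_combined_breakdown.csv",
)


def find_reports(working_dir, sample_name):
    """Map each expected report file name to its path, or None if not found."""
    root = Path(working_dir)
    found = {}
    for suffix in REPORT_SUFFIXES:
        file_name = f"{sample_name}.{suffix}"
        # Recursive search, because the output sub-directory is an assumption.
        matches = sorted(root.rglob(file_name))
        found[file_name] = matches[0] if matches else None
    return found


def count_data_rows(csv_path):
    """Count rows after the first line (assumed here to be a header line)."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.reader(handle))
    return max(len(rows) - 1, 0)
```

For example, find_reports("/home/user/rg4seeker_working_dir/", "HeLa-rG4seq") would return a dictionary with one entry per report, using the WORKING_DIR and SAMPLE_NAME values from the configuration file.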

Example dataset

  • An example dataset hela-2016-chr20.tar.gz, derived from the published HeLa rG4-seq dataset (Kwok et al. 2016), is available for testing new rG4-seeker installations. It runs for approximately 60 minutes.

  • The example dataset is extracted into a directory containing all input files and a configuration file, and can be run directly:

    tar -axzf hela-2016-chr20.tar.gz
    cd hela-2016-chr20
    rG4-seeker hela-2016-chr20.ini
    

Configuration file formatting

  • rG4-seeker reads configuration files with the Python 3 configparser module
  • An rG4-seeker configuration file consists of 4 kinds of sections, each led by a [section] header and followed by option/value entries separated by ‘=’
  • Unless specified, all options are required
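
To illustrate the file format, here is a minimal sketch of how Python's configparser module reads such a file. It only demonstrates the standard library's behaviour and is not rG4-seeker's own parsing code. The variable example_text and the paths in it are placeholders.

```python
import configparser

# Illustrative configuration text; the paths and values are placeholders.
example_text = """
[global]
WORKING_DIR = /home/user/rg4seeker_working_dir/
SAMPLE_NAME = HeLa-rG4seq
THREADS = 8
HAVE_KPDS_CONDITION = True

[genome]
GENOME_FASTA = /home/user/references/genome.fa
"""

parser = configparser.ConfigParser()
parser.read_string(example_text)

# [section] headers become section names; "option = value" lines become entries.
section_names = parser.sections()
sample_name = parser["global"]["SAMPLE_NAME"]

# Values are stored as strings; typed getters convert them on request.
threads = parser.getint("global", "THREADS")
have_kpds = parser.getboolean("global", "HAVE_KPDS_CONDITION")

print(section_names, sample_name, threads, have_kpds)
```

By default configparser treats option names as case-insensitive, and getboolean accepts values such as ‘True’ and ‘False’ in any letter case.
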
  1. [global] section

    • Global parameters are configured in this section

    | Option | Value | Remarks |
    |---|---|---|
    | WORKING_DIR | Path to the working directory for rG4-seeker to place intermediate/output files | |
    | SAMPLE_NAME | An arbitrary identifier for the rG4-seq sample (e.g. HeLa-rG4seq) | The identifier will be the prefix for all output files |
    | THREADS | No. of CPU threads rG4-seeker can use | |
    | NO_OF_ANNOTATIONS | No. of gene annotation sets to use | rG4-seeker can simultaneously use multiple annotation sets (e.g. GENCODE and RefSeq) |
    | NO_OF_REPLICATES | No. of replicates present in the rG4-seq dataset | |
    | HAVE_KPDS_CONDITION | ‘True’ or ‘False’, indicating whether the K+/PDS condition is present in the rG4-seq dataset | |
    | ALIGNER | Name of the short read aligner used (e.g. ‘STAR’) | Currently only the ‘STAR’ aligner is supported. |
    | READS_TYPE | ‘SE’ or ‘PE’, corresponding to single-end or paired-end Illumina read types | |

    • Example configuration for [global] section:

      [global]
      WORKING_DIR = /home/user/rg4seeker_working_dir/
      SAMPLE_NAME = HeLa-rG4seq
      THREADS = 8
      NO_OF_ANNOTATIONS = 2
      NO_OF_REPLICATES = 2
      HAVE_KPDS_CONDITION = True
      ALIGNER = STAR
      READS_TYPE = SE
      
  2. [genome] section

    • The reference genome to use is specified in this section

    | Option | Value | Remarks |
    |---|---|---|
    | GENOME_FASTA | Path to the reference genome sequence in FASTA format | The FASTA file must be in uncompressed format |
    | GENOME_FASTA_FAI | Path to the fai index file of the reference genome sequence | A fai index can be generated using samtools |

    • Example configuration for [genome] section:

      [genome]
      GENOME_FASTA = /home/user/references/GRCh38.primary_assembly.genome.fa
      GENOME_FASTA_FAI = /home/user/references/GRCh38.primary_assembly.genome.fa.fai
      
  3. [annotation_n] section

    • The gene annotation set(s) to use are specified in this section

    | Option | Value | Remarks |
    |---|---|---|
    | ANNOTATION_NAME | An identifier for the gene annotation (e.g. GENCODE) | |
    | ANNOTATION_GFF | Path to the annotation GFF3/GTF file | The GFF3/GTF file can be compressed (in .gz format) |

    • Note: Please provide one [annotation_n] section per annotation set, numbered from 1, so that the number of sections matches NO_OF_ANNOTATIONS

    • Example configuration for [annotation_n] sections when 2 annotation sets are used:

      [annotation_1]
      ANNOTATION_NAME = Gencode
      ANNOTATION_GFF = /home/user/references/gencode.v29.primary_assembly.annotation.gff3.gz
      
      [annotation_2]
      ANNOTATION_NAME = RefSeq
      ANNOTATION_GFF = /home/user/references/GRCh38.RefSeqGeneAnnotation.gff.gz
      
  4. [replicate_n] section

    • The rG4-seq datasets to use (in format of aligned reads) are specified in this section

    | Option | Value | Remarks |
    |---|---|---|
    | LI_BAM_FILE | Path to the BAM file containing aligned reads from rG4-seq (Li+ condition) | |
    | K_BAM_FILE | Path to the BAM file containing aligned reads from rG4-seq (K+ condition) | |
    | KPDS_BAM_FILE | Path to the BAM file containing aligned reads from rG4-seq (K+/PDS condition) | Required if ‘HAVE_KPDS_CONDITION’ is set as ‘True’ |

    • Note: Please provide one [replicate_n] section per rG4-seq replicate, numbered from 1, so that the number of sections matches NO_OF_REPLICATES

    • Example configuration for [replicate_n] sections when NO_OF_REPLICATES = 2 and HAVE_KPDS_CONDITION = True:

      [replicate_1]
      LI_BAM_FILE = /home/user/HeLa-rG4Seq/Li-rep1.Aligned.sortedByCoord.out.bam
      K_BAM_FILE = /home/user/HeLa-rG4Seq/K-rep1.Aligned.sortedByCoord.out.bam
      KPDS_BAM_FILE = /home/user/HeLa-rG4Seq/KPDS-rep1.Aligned.sortedByCoord.out.bam
      
      [replicate_2]
      LI_BAM_FILE = /home/user/HeLa-rG4Seq/Li-rep2.Aligned.sortedByCoord.out.bam
      K_BAM_FILE = /home/user/HeLa-rG4Seq/K-rep2.Aligned.sortedByCoord.out.bam
      KPDS_BAM_FILE = /home/user/HeLa-rG4Seq/KPDS-rep2.Aligned.sortedByCoord.out.bam
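
The section rules described above can be checked before a run with a short script. The following is a minimal sketch, assuming only the rules stated in this README (one [annotation_n] and one [replicate_n] section per declared set, and KPDS_BAM_FILE required when HAVE_KPDS_CONDITION is ‘True’). The function check_config is hypothetical and is not part of rG4-seeker, and the file names in the demo text are placeholders.

```python
import configparser


def check_config(parser):
    """Return a list of human-readable problems found in a parsed config."""
    problems = []
    for section in ("global", "genome"):
        if not parser.has_section(section):
            problems.append(f"missing [{section}] section")
    if problems:
        return problems

    g = parser["global"]
    n_annotations = g.getint("NO_OF_ANNOTATIONS", fallback=0)
    n_replicates = g.getint("NO_OF_REPLICATES", fallback=0)
    have_kpds = g.getboolean("HAVE_KPDS_CONDITION", fallback=False)

    # The K+/PDS BAM file is only required when that condition is declared.
    bam_options = ["LI_BAM_FILE", "K_BAM_FILE"]
    if have_kpds:
        bam_options.append("KPDS_BAM_FILE")

    expected = [(f"annotation_{i}", ["ANNOTATION_NAME", "ANNOTATION_GFF"])
                for i in range(1, n_annotations + 1)]
    expected += [(f"replicate_{i}", bam_options)
                 for i in range(1, n_replicates + 1)]

    for section, options in expected:
        if not parser.has_section(section):
            problems.append(f"missing [{section}] section")
            continue
        for option in options:
            if not parser.has_option(section, option):
                problems.append(f"[{section}] lacks {option}")
    return problems


# Demo: replicate_2 deliberately omits KPDS_BAM_FILE.
demo_text = """
[global]
NO_OF_ANNOTATIONS = 1
NO_OF_REPLICATES = 2
HAVE_KPDS_CONDITION = True

[genome]
GENOME_FASTA = genome.fa
GENOME_FASTA_FAI = genome.fa.fai

[annotation_1]
ANNOTATION_NAME = Gencode
ANNOTATION_GFF = gencode.gff3.gz

[replicate_1]
LI_BAM_FILE = Li-rep1.bam
K_BAM_FILE = K-rep1.bam
KPDS_BAM_FILE = KPDS-rep1.bam

[replicate_2]
LI_BAM_FILE = Li-rep2.bam
K_BAM_FILE = K-rep2.bam
"""

demo = configparser.ConfigParser()
demo.read_string(demo_text)
problems = check_config(demo)
print(problems)
```

To check a real file, replace read_string(demo_text) with read("config.ini"). The sketch does not verify that the referenced files exist or that the BAM files are indexed.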
      

Docker image distribution

  • rG4-seeker is also available as a Docker image

  • Installation

    1. Install Docker following the instructions on the Docker homepage https://docs.docker.com/

    2. Download the rG4-seeker Docker image rg4_seeker.docker.tar.gz

    3. Import rG4-seeker Docker image:

      sudo docker load -i rg4_seeker.docker.tar.gz
      sudo docker run rg4_seeker
      
  • Usage

    • When using the Docker version of rG4-seeker, we strongly recommend putting all input files (Genome/Annotation/Reads) and the configuration file in the same working directory, so that a single ‘-v’ mount covers everything.

    • Running rG4-seeker from Docker:

      cd working_dir
      sudo docker run -v [working_dir]:[working_dir] rg4_seeker [abs_path_to_config.ini]
      
      
      * Note: The ‘-v’ option lets a dockerized program read/write files outside its container, and is required for rG4-seeker to access input files and write result files.
      
  • Running the example data

    1. Download the example dataset hela-2016-chr20.tar.gz

    2. Decompress the example dataset and enter the working directory:

      tar -axzf hela-2016-chr20.tar.gz
      cd hela-2016-chr20
      
    3. Update the configuration file with the current working directory:

      cat hela-2016-chr20.ini | awk -v srch="./" -v repl="$PWD/" '{ sub(srch,repl,$0); print $0 }' >hela-2016-chr20.docker.ini
      
    4. Run rG4-seeker:

      sudo docker run -v $PWD:$PWD rg4_seeker $PWD/hela-2016-chr20.docker.ini
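
For readers less familiar with awk, the path rewrite in step 3 can be sketched in Python. This is an approximation, not a drop-in replacement: it replaces the first literal "./" on each line, whereas the awk command treats "./" as a regular expression. It assumes, as the awk command implies, that paths in the example configuration start with "./". The function names are hypothetical.

```python
import os


def absolutize_line(line, base_dir):
    """Replace the first literal "./" in a line with base_dir + "/"."""
    return line.replace("./", base_dir.rstrip("/") + "/", 1)


def absolutize_config(src_path, dst_path, base_dir=None):
    """Copy a configuration file, rewriting "./" paths against base_dir."""
    # Default to the current working directory, like $PWD in the shell command.
    base_dir = base_dir or os.getcwd()
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(absolutize_line(line, base_dir))
```

Run from inside the extracted hela-2016-chr20 directory, absolutize_config("hela-2016-chr20.ini", "hela-2016-chr20.docker.ini") would produce the file used in step 4.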
      
