Skip to content
/ ALLHiC Public

A package to scaffolding polyploidy or heterozygosis genome using Hi-C data

Notifications You must be signed in to change notification settings

sanvva/ALLHiC

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 

Repository files navigation

ALLHiC

A package to scaffolding polyploidy or heterozygosis genome using Hi-C data

Introduction

The major problem of scaffolding polyploid genome is that Hi-C signals are frequently detected between allelic haplotypes and any existing stat of art Hi-C scaffolding program links the allelic haplotypes together. To solve the problem, we developed a new Hi-C scaffolding pipeline, called ALLHIC, specifically tailored to the highly heterozygous diploids or polyploid genomes. ALLHIC pipeline contains a total of 4 steps: prune, partition, optimize and build.

Installation

$ git clone https://github.com/tangerzhang/ALLHiC
$ cd ALLHiC
$ chmod +x bin/*
$ cp bin/* ~/bin

Dependencies

Following is a list of thirty-party programs that will be used in ALLHIC pipeline.

Running the pipeline

  • Data Preparation

    • Closely related 'diploid' species (chromosomal level) with gene annotation
    • Hi-C reads (with some coverage)
    • Contig level assembly of target genome with decent N50 and gene annotation
  • Identification of alleles based on BLAST results

Blast CDS in target genome to CDS file in close related reference
Note: Please modify cds name before running BLAST. The cds name should be same with gene name present in GFF3

$ blastn -query rice.cds -db Bd.cds -out rice_vs_Sb.blast.out -evalue 0.001 -outfmt 6 -num_threads 4 -num_alignments 1

Remove blast hits with identity < 60% and coverage < 80%

blastn_parse.pl -i rice_vs_Bd.blast.out -o Erice_vs_Bd.blast.out -q ../4_rice_alleles/T2/riceT2.cds.fasta -b 1 -c 0.8 -d 0.6 

Classify alleles based on BLAST results

classify.pl -i Eblast.out -p 2 -r Bdistachyon_314_v3.1.gene.gff3 -g riceT2.gff3   

After running the scripts above, two tables will be genrated. Allele.gene.table lists the allelic genes in the order of diplod refernece genome and Allele.ctg.table lists corresponding contig names in the same order.

  • Map Hi-C reads to draft assembly

use bwa index and samtools faidx to index your draft genme assembly

bwa index -a bwtsw draft.asm.fasta  
samtools faidx draft.asm.fasta  

Aligning Hi-C reads to the draft assembly

bwa aln -t 24 draft.asm.fasta reads_R1.fastq.gz > sample_R1.sai  
bwa aln -t 24 draft.asm.fasta reads_R2.fastq.gz > sample_R2.sai  
bwa sampe draft.asm.fasta sample_R1.sai sample_R2.sai reads_R1.fastq.gz reads_R2.fastq.gz > sample.bwa_aln.sam  

Filtering SAM file

PreprocessSAMs.pl sample.bwa_aln.sam draft.asm.fasta MBOI
perl ~/software/script/filterBAM_forHiC.pl sample.bwa_aln.REduced.paired_only.bam sample.clean.sam  
samtools view -bt draft.asm.fasta.fai sample.clean.sam > sample.clean.bam  
  • Prune

Next, we will used Allele.ctg.table to prune 1) signals that link alleles and 2) weak signals from BAM files

prune.pl -i Allele.ctg.table -b bam.list -r draft.asm.fasta   
  • Partition

Apply LACHESIS clustering algorthm to partition contig (require Lachesis installed)

Lachesis conf.ini
  • Optimize

ordering and orientation for each group

bam2CLM.pl -b sample.clean.bam -r draft.asm.fasta -d out/main_results/
allhic optimize group0.clm
allhic optimize group1.clm
allhic optimize group2.clm
...
  • Build

convert tour format to fasta sequences and agp location file

tour2asm.pl draft.asm.fasta

This step will output a list of superscaffolds and un-achored contigs. Check groups.asm.fasta file.

Sample data

Test data can be found in the following link:

https://pan.baidu.com/s/1_EW7N5qOgpa1hdn95LP26A
or
https://www.dropbox.com/s/yk3o2hh7as0iz6y/sample_data.tar.gz?dl=0

Test data starts from cleaning bam files.

About

A package to scaffolding polyploidy or heterozygosis genome using Hi-C data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages