Skip to content

Commit

Permalink
Merge pull request #165 from pangenome/high-div-wfmash-ok
Browse files Browse the repository at this point in the history
Support higher divergence between input sequences
  • Loading branch information
AndreaGuarracino committed Mar 28, 2022
2 parents 15ba28f + 3a80e69 commit 0897ff9
Show file tree
Hide file tree
Showing 3 changed files with 65 additions and 36 deletions.
11 changes: 6 additions & 5 deletions Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ RUN apt-get update \
libjemalloc-dev \
libhts-dev \
build-essential \
pkg-config \
time \
curl \
pigz \
Expand All @@ -33,7 +34,7 @@ RUN apt-get update \
RUN git clone --recursive https://github.com/ekg/wfmash \
&& cd wfmash \
&& git pull \
&& git checkout 948f1683d14927745aef781cdabeb66ac6c7880b \
&& git checkout c746c7328344d9da7c3cfa1d6e8a4cf09d634a3a \
&& git submodule update --init --recursive \
&& sed -i 's/-mcx16 //g' CMakeLists.txt \
&& sed -i 's/-march=native //g' CMakeLists.txt \
Expand All @@ -50,7 +51,7 @@ RUN git clone --recursive https://github.com/ekg/wfmash \
RUN git clone --recursive https://github.com/ekg/seqwish \
&& cd seqwish \
&& git pull \
&& git checkout 51ee55079ccb89402fdb50999268f3382678b083 \
&& git checkout 5a159f51b6617c559539ed7283a06b4394a4c7ff \
&& git submodule update --init --recursive \
&& cmake -H. -Bbuild && cmake --build build -- -j $(nproc) \
&& cp bin/seqwish /usr/local/bin/seqwish \
Expand All @@ -59,7 +60,7 @@ RUN git clone --recursive https://github.com/ekg/seqwish \
RUN git clone --recursive https://github.com/ekg/smoothxg \
&& cd smoothxg \
&& git pull \
&& git checkout 0f383e5033c6af18d95f5d8dca0a6e17e5dbf524 \
&& git checkout 79cc2d131a818000401d451ecf14c00afc12dc8c \
&& git submodule update --init --recursive \
&& sed -i 's/-march=native/-march=haswell/g' deps/abPOA/CMakeLists.txt \
&& sed -i 's/-mcx16 //g' deps/WFA/CMakeLists.txt \
Expand All @@ -75,12 +76,12 @@ RUN cargo --help
RUN git clone https://github.com/marschall-lab/GFAffix.git \
&& cd GFAffix \
&& git pull \
&& git checkout a22e828 \
&& git checkout 9581a29d6dfe1e76a98f8c360ed33adf0348fa27 \
&& cargo install --force --path . && mv /root/.cargo/bin/gfaffix /usr/local/bin/gfaffix

RUN apt-get update && apt-get install -y pip && pip install multiqc && apt-get install -y bcftools

RUN apt-get install wget && wget http://hypervolu.me/~erik/vg/vg-e5be425.gz && zcat vg-e5be425.gz >vg && chmod +x vg && cp vg /usr/local/bin/vg
RUN apt-get install wget && wget https://github.com/vgteam/vg/releases/download/v1.39.0/vg && chmod +x vg && mv vg /usr/local/bin/vg

COPY pggb /usr/local/bin/pggb
RUN chmod 777 /usr/local/bin/pggb
Expand Down
14 changes: 8 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,30 +19,32 @@ First, [install `pggb`](https://github.com/pangenome/pggb#installation) using Do
Put your sequences in one FASTA file and index it with `samtools faidx`.
If you have many genomes, we suggest using the [PanSN prefix naming pattern](https://github.com/pangenome/PanSN-spec).

To build a graph from `input.fa`, which contains 9 haplotypes, in the directory `output`, scaffolding the graph using 5kb matches at >= 90% identity, and using 16 parallel threads for processing, execute:
To build a graph from `input.fa`, which contains 9 haplotypes, in the directory `output`, scaffolding the graph using 10kb matches at >= 90% identity, and using 16 parallel threads for processing, execute:

```
pggb \
-i input.fa \
-o output \
-t 16 \
-p 90 \
-s 5000 \
-s 10000 \
-n 9 \
-v
```

The final process output will be called `outdir/input.fa*smooth.gfa`.
By default, several intermediate files are produced.
We add `-v` to render 1D and 2D visualizations of the graph with `odgi`.
These are generally useful but do require some processing time, so they are not currently done by default.
We render 1D and 2D visualizations of the graph with `odgi`, which are very useful to understand the result of the build.

## establishing parameters

### essential

`pggb` requires that the user set a mapping identity minimum `-p`, a segment length `-s`, and a number of mappings `-n` per segment.
These 3 key parameters define most of the structure of the pangenome graph.
`pggb` requires that the user set a number of mappings `-n` per segment.
Although the defaults (`-p 95 -s 10k`) should work for most pangenome contexts, it is recommended that suitable mapping identity minimum `-p` and a segment length `-s`.
In particular, for high divergence problems (e.g. models built from separate species) it can be necessary to set `-p` and `-s` to different levels.
Increasing `-p` and `-s` will increase the stringency of the initial alignment, while reducing them will make this more sensitive.
These 3 key parameters (`-n`, `-p`, and `-s`) define most of the structure of the pangenome graph.
They can be set using some prior information about the sequences that you're using.

_Estimate divergence_:
Expand Down
76 changes: 51 additions & 25 deletions pggb
Original file line number Diff line number Diff line change
Expand Up @@ -7,11 +7,11 @@ input_fasta=false
output_dir=""
input_paf=false
resume=false
map_pct_id=false
map_pct_id=95
n_mappings=false
segment_length=false
segment_length=10000
block_length=false
mash_kmer=16
mash_kmer=false
min_match_length=47
transclose_batch=10000000
n_haps=false
Expand All @@ -20,12 +20,7 @@ pad_max_depth=100
max_path_jump=0
max_edge_jump=0
target_poa_length=4001,4507
# poa param suggestions from minimap2
# - asm5, --poa-params 1,19,39,3,81,1, ~0.1 divergence
# - asm10, --poa-params 1,9,16,2,41,1, ~1 divergence
# - asm20, --poa-params 1,4,6,2,26,1, ~5% divergence
# between asm10 and asm20 ~ 1,7,11,2,33,1
poa_params="1,19,39,3,81,1"
poa_params=false
poa_padding=0.03
do_viz=true
do_layout=true
Expand Down Expand Up @@ -107,23 +102,54 @@ fi

if [[
"$input_fasta" == false
|| $map_pct_id == false
|| $n_mappings == false
|| $segment_length == false
]];
then
show_help=true
>&2 echo "Mandatory arguments -i, -s, -n, -p"
>&2 echo "Mandatory arguments -i, -n"
fi

if [[ $poa_threads == 0 ]];
then
poa_threads=$threads
fi

if [[ $block_length == false ]];
block_length_cmd=""
if [[ $block_length != false ]];
then
block_length=$(echo $segment_length '*' 3 | bc)
block_length_cmd="-l $block_length"
fi
# wfmash auto-sets our block length based on its parameters

mash_kmer_cmd=""
if [[ $mash_kmer != false ]];
then
mash_kmer_cmd="-k $mash_kmer"
fi
# wfmash auto-sets our kmer length based on its parameters

# our partial order alignment parameters
# poa param suggestions from minimap2
# - asm5, --poa-params 1,19,39,3,81,1, ~0.1 divergence
# - asm10, --poa-params 1,9,16,2,41,1, ~1 divergence
# - asm20, --poa-params 1,4,6,2,26,1, ~5% divergence
# between asm10 and asm20 ~ 1,7,11,2,33,1
poa_params_cmd=""
if [[ $poa_params == false ]];
then
poa_params_cmd="-P 1,19,39,3,81,1"
else
if [[ $poa_params == "asm5" ]]; then
poa_params_cmd="-P 1,19,39,3,81,1"
elif [[ $poa_params == "asm10" ]]; then
poa_params_cmd="-P 1,9,16,2,41,1"
elif [[ $poa_params == "asm15" ]]; then
poa_params_cmd="-P 1,7,11,2,33,1"
elif [[ $poa_params == "asm20" ]]; then
poa_params_cmd="-P 1,4,6,2,26,1"
else
poa_params_cmd="-P $poa_params"
fi
fi

if [[ $n_haps == false ]];
Expand All @@ -134,15 +160,14 @@ fi
if [ $show_help ];
then
padding=`printf %${#0}s` # prints as many spaces as the length of $0
echo "usage: $0 -i <input-fasta> -s <segment-length> -n <n-mappings>"
echo " $padding -p <map-pct-id> [options]"
echo "usage: $0 -i <input-fasta> -n <n-mappings> [options]"
echo "options:"
echo " [wfmash]"
echo " -i, --input-fasta FILE input FASTA/FASTQ file"
echo " -s, --segment-length N segment length for mapping"
echo " -l, --block-length N minimum block length filter for mapping [default: 3*segment-length]"
echo " -s, --segment-length N segment length for mapping [default: 10k]"
echo " -l, --block-length N minimum block length filter for mapping [default: 5*segment-length]"
echo " -N, --no-split disable splitting of input sequences during mapping [enabled by default]"
echo " -p, --map-pct-id PCT percent identity for mapping/alignment"
echo " -p, --map-pct-id PCT percent identity for mapping/alignment [default: 95]"
echo " -n, --n-mappings N number of mappings to retain for each segment"
echo " -K, --mash-kmer N kmer size for mapping [default: 16]"
echo " -Y, --exclude-delim C skip mappings between sequences with the same name prefix before"
Expand All @@ -156,7 +181,8 @@ then
echo " -e, --edge-jump-max N maximum edge jump before breaking [default: 0 / off]"
echo " -G, --poa-length-target N,M target sequence length for POA, first pass = N, second pass = M [default: 4001,4507]"
echo " -P, --poa-params PARAMS score parameters for POA in the form of match,mismatch,gap1,ext1,gap2,ext2"
echo " [default: 1,19,39,3,81,1]"
echo " may also be given as presets: asm5, asm10, asm15, asm20"
echo " [default: 1,19,39,3,81,1 = asm5]"
echo " -O, --poa-padding N pad each end of each sequence in POA with N*(longest_poa_seq) bp [default: 0.03]"
echo " -d, --pad-max-depth N depth/haplotype at which we don't pad the POA problem [default: 100]"
echo " -M, --write-maf write MAF output representing merged POA blocks [default: off]"
Expand Down Expand Up @@ -279,7 +305,7 @@ smoothxg:
path-jump-max: $max_path_jump
edge-jump-max: $max_edge_jump
poa-length-target: $target_poa_length
poa-params: $poa_params
poa-params: $poa_params_cmd
write-maf: $write_maf
consensus-prefix: $consensus_prefix
consensus-spec: $consensus_spec
Expand Down Expand Up @@ -307,12 +333,12 @@ if [[ ! -s $prefix_paf.$mapper.paf || $resume == false ]]; then
($timer -f "$fmt" wfmash \
$exclude_cmd \
-s $segment_length \
-l $block_length \
$block_length_cmd \
$merge_cmd \
$split_cmd \
$mash_kmer_cmd \
-p $map_pct_id \
-n $n_mappings \
-k $mash_kmer \
-t $threads \
"$input_fasta" "$input_fasta" \
> "$prefix_paf".$mapper.paf) 2> >(tee -a "$log_file")
Expand Down Expand Up @@ -377,7 +403,7 @@ do
-j $max_path_jump \
-e $max_edge_jump \
-l $poa_length \
-p "$poa_params" \
$poa_params_cmd \
-O $poa_padding \
-Y $(echo "$pad_max_depth * $n_haps" | bc) \
-d 0 -D 0 \
Expand All @@ -400,7 +426,7 @@ do
-j $max_path_jump \
-e $max_edge_jump \
-l $poa_length \
-p "$poa_params" \
$poa_params_cmd \
-O $poa_padding \
-Y $(echo "$pad_max_depth * $n_haps" | bc) \
-d 0 -D 0 \
Expand Down

0 comments on commit 0897ff9

Please sign in to comment.