Clustering Hi-C contact graphs using Graph Neural Networks

Bioinformatics Institute spring project, 2022. Student: Velikonivtsev Fyodor. Supervisors: Tolstoganov Ivan, Korobeynikov Anton.

Goal & objectives

Goal: explore the properties, opportunities and prospects of clustering metagenomic Hi-C data with graph deep learning methods.
Objectives:

Develop a core understanding of the problem's crucial concepts and current advances
Apply effective GNN clustering models - DMoN, GraphMB
Apply a VAE model - VAMB
Create an interface and modify the models' APIs to process Hi-C data formats correctly
Explore the models' hyperparameter space and compare results
Compare the tools' efficiency, with VAMB and bin3C as baselines, using AMBER & CheckM

Key results:

A bin was considered HQ (a high-quality metagenome-assembled genome, MAG) if it had >95% completeness and <5% contamination.

Three datasets were used:

Zymo dataset (supervised) [6625 contigs, 76799 Hi-C links]
IC9 dataset (unsupervised) [165712 contigs, 1150887 Hi-C links]
CAMI AIRWAYS synthetic dataset (supervised) [728682 contigs, 70405 Hi-C links]
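
The HQ criterion stated above can be written as a small predicate. This is only a minimal sketch: the function name and the example bins are hypothetical, and it assumes completeness and contamination are given as percentages, as in a CheckM summary.

```python
def is_high_quality(completeness: float, contamination: float) -> bool:
    """Return True if a bin meets the HQ criterion used in this project.

    Both values are assumed to be percentages in the range 0-100.
    """
    return completeness > 95.0 and contamination < 5.0


# Hypothetical bins for illustration: (name, completeness %, contamination %)
bins = [
    ("bin_1", 99.1, 0.4),
    ("bin_2", 96.0, 7.5),  # complete enough, but too contaminated
    ("bin_3", 88.3, 1.2),  # clean, but too incomplete
]

hq_bins = [name for name, comp, cont in bins if is_high_quality(comp, cont)]
print(hq_bins)
```

Both thresholds are strict inequalities, so a bin with exactly 95% completeness is not counted as HQ.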

DMoN:

Zymo dataset - 0 HQ genomes
Was excluded from further experiments

GraphMB:

Restored 7-8/10 HQ MAGs vs. VAMB’s 10/10 vs. 6/10 for bin3C (a non-DL tool for clustering Hi-C data) in the Zymo dataset (a)
Restored 98/600 HQ MAGs vs. VAMB’s 93/600 in the CAMI AIRWAYS dataset (b)
Restored 12 HQ bins vs. VAMB’s 12 in the IC9 dataset (c)

Conclusions:

DMoN strongly depends on input node features and doesn’t support edge features, so it is unsuitable for contact map clustering
GraphMB showed better results on the larger dataset, where VAMB alone isn’t enough - this suggests that new information from the contact graph was used to resolve bins
GraphMB relies on the effective VAMB workflow - this explains why its results are close to VAMB’s on small datasets such as Zymo
GraphMB proved competitive in the problem of Hi-C contact map clustering

References:

Nissen, J.N., Johansen, J., Allesøe, R.L. et al. Improved metagenome binning and assembly using deep variational autoencoders. Nat Biotechnol 39, 555–560 (2021). https://doi.org/10.1038/s41587-020-00777-4
Lamurias, A., Sereika, M., Albertsen, M., Hose, K., Nielsen, T.D. Metagenomic binning with assembly graph embeddings. bioRxiv 2022.02.25.481923 (2022). https://doi.org/10.1101/2022.02.25.481923
Tsitsulin, A., Palowitch, J., Perozzi, B., Müller, E. Graph Clustering with Graph Neural Networks (2020).
Meyer, F., Hofmann, P., Belmann, P., Garrido-Oter, R., Fritz, A., Sczyrba, A., McHardy, A.C. AMBER: Assessment of Metagenome BinnERs. GigaScience 7(6), giy069 (2018). https://doi.org/10.1093/gigascience/giy069
Parks, D.H., Imelfort, M., Skennerton, C.T., Hugenholtz, P., Tyson, G.W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 25(7), 1043–1055 (2015). https://doi.org/10.1101/gr.186072.114
DeMaere, M., Darling, A. bin3C: exploiting Hi-C sequencing data to accurately resolve metagenome-assembled genomes. Genome Biol 20, 46 (2019). https://doi.org/10.1186/s13059-019-1643-1

Toolkit

This repository contains tools for:

Preprocessing a Hi-C contact map and the related assembly graph with scaffolds to create compatible input formats for DMoN & GraphMB - preprocess_files.py
Transforming ground truth labels (if any) to the AMBER-specific input format and converting VAMB clustering output for AMBER - vamb2amber.py
Basic exploratory data analysis: length distributions for contigs with and without Hi-C links, which might help in choosing a minimal length threshold for clustering - preprocess_files.py
Comparing CheckM quality assessment summaries of several binning runs by plotting metrics for each run (an example can be seen in plot (c)) - compare_checkm_results.py
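
The idea behind the length-distribution analysis listed above can be sketched as follows. This is not the code of preprocess_files.py: the function names are hypothetical, and the sketch assumes a header-less, tab-separated contact map whose first two columns are contig names.

```python
import io
import statistics


def read_fasta_lengths(handle):
    """Map each sequence name (first word of the header) to its length."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def linked_contigs(handle):
    """Collect every contig that appears in at least one Hi-C link."""
    linked = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            linked.update(fields[:2])
    return linked


# Tiny in-memory stand-ins for scaffolds.fasta and contact_map.tsv
fasta = io.StringIO(">c1\nACGTACGT\n>c2\nACG\n>c3\nACGTAC\nGT\n")
contacts = io.StringIO("c1\tc3\t12\n")

lengths = read_fasta_lengths(fasta)
linked = linked_contigs(contacts)

with_links = sorted(n_bp for name, n_bp in lengths.items() if name in linked)
without_links = sorted(n_bp for name, n_bp in lengths.items() if name not in linked)

print("median length with links:", statistics.median(with_links))
print("median length without links:", statistics.median(without_links))
```

If contigs without Hi-C links turn out to be mostly short, a minimal length threshold just above their typical length would drop few linked contigs.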

Requirements

Software requirements

The GNN to be tested - DMoN or GraphMB
QC tools - AMBER, CheckM
Python 3.8+
Python packages (see installation for details)
UNIX command line

Hardware requirements

A GPU is preferable
CPU - any, multicore is preferable
RAM - 16 GB
Disk space - depends on data, plus 2 GB for the CheckM database

Installation

Install AMBER and CheckM into separate environments
Install the desired tool into a separate environment (crucial for GraphMB and AMBER): GraphMB - my modification, DMoN - my modification, VAMB - my modification (minimally modified tool)
Install Python packages:

pip install -U numpy scipy pandas scikit-learn tqdm plotly kaleido

Clone repository and add it to PATH (e.g. for bash):

git clone https://github.com/Abusagit/GNN_plus_HiC.git && cd GNN_plus_HiC && echo 'export PATH="your-dir:$PATH"' >> ~/.bashrc && source ~/.bashrc

Workflow

Preprocess data for the given GNN (e.g. for GraphMB, transform contact_map.tsv and scaffolds.fasta):

py preprocess_files.py --graphmb -c contact_map.tsv --scaling log -f scaffolds.fasta --mimic-jgi -o graphmb_input/
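
Hi-C link counts usually span several orders of magnitude, which is the motivation for rescaling edge weights before clustering. The sketch below illustrates one plausible form of log scaling; it is an assumption that this resembles what --scaling log does, and the function name and the edge list are hypothetical.

```python
import math


def log_scale(weight: float) -> float:
    """Compress a raw Hi-C link count with log(1 + w).

    log1p keeps zero counts at zero and preserves the ordering of weights.
    """
    return math.log1p(weight)


# Hypothetical edges: (contig_1, contig_2, raw number of Hi-C links)
edges = [("c1", "c2", 1), ("c1", "c3", 99), ("c2", "c3", 9999)]

scaled = [(u, v, log_scale(w)) for u, v, w in edges]
for u, v, w in scaled:
    print(f"{u}\t{v}\t{w:.3f}")
```

After scaling, the strongest edge is only about an order of magnitude heavier than the weakest one instead of four orders, so a few very dense contig pairs dominate the graph less.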

Run GNN:

graphmb --assembly graphmb_input/ [--other-parameters-for-graphmb]

Run AMBER or CheckM
You could also run VAMB - its clustering output is incompatible with AMBER but suitable for CheckM. To convert it for AMBER, you can use the following:

py vamb2amber.py -i amber_result.tsv -g golden_standard_with_amber_format.tsv -o outdir/vamb_for_amber.tsv
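
The kind of conversion involved can be sketched as follows. This is not the code of vamb2amber.py: it assumes VAMB writes tab-separated lines of cluster name and contig name, and that AMBER reads a Bioboxes-style binning file with @Version and @SampleID header lines followed by a @@SEQUENCEID/BINID table. The function name and the sample ID are hypothetical.

```python
import io


def vamb_to_amber(vamb_handle, out_handle, sample_id="sample"):
    """Rewrite 'cluster<TAB>contig' lines as an AMBER-style binning table.

    Returns the number of contig assignments written.
    """
    out_handle.write("@Version:0.9.1\n")
    out_handle.write(f"@SampleID:{sample_id}\n")
    out_handle.write("@@SEQUENCEID\tBINID\n")
    written = 0
    for line in vamb_handle:
        line = line.rstrip("\n")
        if not line:
            continue
        cluster, contig = line.split("\t")[:2]
        # AMBER expects the sequence first and the bin second
        out_handle.write(f"{contig}\t{cluster}\n")
        written += 1
    return written


# Tiny in-memory stand-in for a VAMB clusters file
vamb = io.StringIO(
    "cluster_1\tcontig_a\ncluster_1\tcontig_b\ncluster_2\tcontig_c\n"
)
amber = io.StringIO()
written = vamb_to_amber(vamb, amber)
print(amber.getvalue())
```

When adapting this sketch, the sample ID in the header should be checked against the one used in the gold standard file.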

If you have ground truth labels, you can compare binning results directly with AMBER - it provides comprehensive visualization plots. CheckM does not offer such a comparison across multiple runs - here you can use compare_checkm_results.py to compare the joint distribution of completeness and purity metrics among HQ genomes across m binning results (any number starting from 1):

py compare_checkm_results.py -i [checkm_input_1 ... checkm_input_m] --labels [label_1 ... label_m] --min-completeness 0.95 --min-purity 0.95
