fastMSA

fastMSA builds multiple sequence alignments (MSAs) efficiently by replacing the time-consuming sequence-identity search with a bi-encoder-powered similarity search, achieving a 34-fold speed-up.

How does it work?

Pipeline overview

[Figure: pipeline overview] The top-K most similar sequences are retrieved by dot-product similarity, and JackHMMER is then run on this small retrieved set to build the MSA for downstream tasks such as 3D structure prediction or protein function prediction. UniRef90 can be encoded into vectors offline before retrieval, so encoding does NOT add to the inference time of building an MSA.
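As a rough sketch of the retrieval step, exact dot-product search with Faiss looks like the following. The embedding dimension, database size, and K here are illustrative assumptions, and the random arrays stand in for real encoder outputs; the actual demo lives in app/retrieve_app-v6.py.

```python
import numpy as np
import faiss  # in-memory similarity search library

# Illustrative values only, not taken from the repo.
d, n_ctx, k = 768, 100_000, 1000

# Stand-in for the offline-encoded UniRef90 vectors (loaded from path_to_ctx in practice).
ctx_vectors = np.random.rand(n_ctx, d).astype("float32")

index = faiss.IndexFlatIP(d)          # exact inner-product (dot-product) search
index.add(ctx_vectors)

# Stand-in for the query encoder's output for one protein sequence.
query_vector = np.random.rand(1, d).astype("float32")

scores, ids = index.search(query_vector, k)   # top-K most similar database entries
# The K retrieved sequences are then written out as FASTA and handed to JackHMMER
# (e.g. `jackhmmer query.fasta retrieved.fasta`) to build the final MSA.
```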

Bi-encoder model

[Figure: training setup] We use a query encoder and a context encoder. The database is encoded offline with the context encoder, while the query encoder runs at inference time.
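A minimal sketch of this bi-encoder scheme in PyTorch, using a trivial mean-pooled embedding as each encoder; the layer sizes and pooling are assumptions for illustration, not the repo's actual model:

```python
import torch
import torch.nn as nn

class BiEncoder(nn.Module):
    """Two independent encoders whose outputs are compared by dot product."""
    def __init__(self, vocab_size=25, dim=256):
        super().__init__()
        self.query_encoder = nn.EmbeddingBag(vocab_size, dim)  # used at inference time
        self.ctx_encoder = nn.EmbeddingBag(vocab_size, dim)    # used offline on the database

    def forward(self, query_tokens, ctx_tokens):
        q = self.query_encoder(query_tokens)   # (batch, dim)
        c = self.ctx_encoder(ctx_tokens)       # (batch, dim)
        return q @ c.T                         # pairwise dot-product similarity scores

# Toy usage: batches of amino-acid token indices, shape (batch, seq_len).
model = BiEncoder()
q_tok = torch.randint(0, 25, (4, 50))
c_tok = torch.randint(0, 25, (4, 50))
scores = model(q_tok, c_tok)                   # (4, 4) similarity matrix
```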

Quick start

Prerequisites

Hardware requirements

  • At least one Nvidia GPU. We ran our experiments on four Nvidia V100 GPUs with 32 GB of memory each; the batch size should be set in proportion to GPU memory, e.g. an 80 GB Nvidia A100 may take a training batch size of 64, while a 32 GB V100 may take 24.
  • A large amount of RAM is also needed to hold the context file. Faiss performs the similarity search entirely in memory, so the default search needs around 250 GB of RAM. If that much memory is not available, you can split the search into several parts (see the sketch after this list).
  • It is best to place this repo on an SSD, since the demo writes the retrieved data to disk and an SSD significantly speeds up serialization.
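When 250 GB of RAM is not available, the search can be sharded so that only one chunk of the context vectors is resident at a time, merging per-shard results into a global top-K. This is a sketch assuming the context vectors are stored as several .npy shards (the file layout and names are assumptions):

```python
import glob
import numpy as np
import faiss

def sharded_search(query, shard_paths, k=1000, d=768):
    """Search each context shard separately and merge the top-k results,
    keeping only one shard's vectors in memory at a time.
    Assumes each shard holds at least k vectors."""
    all_scores, all_ids, offset = [], [], 0
    for path in shard_paths:
        shard = np.load(path).astype("float32")   # one chunk of context vectors
        index = faiss.IndexFlatIP(d)
        index.add(shard)
        scores, ids = index.search(query, k)
        all_scores.append(scores)
        all_ids.append(ids + offset)              # make ids global across shards
        offset += shard.shape[0]
        del index, shard                          # free the shard before loading the next
    scores = np.concatenate(all_scores, axis=1)
    ids = np.concatenate(all_ids, axis=1)
    order = np.argsort(-scores, axis=1)[:, :k]    # global top-k across all shards
    return np.take_along_axis(scores, order, 1), np.take_along_axis(ids, order, 1)

# Hypothetical usage, with shards written by the offline encoding step:
# top_scores, top_ids = sharded_search(query_vector, sorted(glob.glob("path_to_ctx/*.npy")))
```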

Environment setup

  • Anaconda (https://www.anaconda.com/distribution/) or Miniconda (https://conda.io/miniconda.html) is used to manage the environment. If you do not have it, you can set it up with wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.10.3-Linux-x86_64.sh && bash Miniconda3-py37_4.10.3-Linux-x86_64.sh
  • Clone this repo: git clone git@github.com:heathcliff233/fastMSA.git
  • Set up dependencies: conda create --name fastMSA --file ./fastMSA/requirements.txt
  • Activate the env: conda activate fastMSA
  • Install the Bio-data I/O package following the instructions at phylopandas
  • Download the model .pth file and set its path (path_to_model)
  • Download the context encoding into a directory (path_to_ctx). The provided context encoding is based on UniRef90 2018.3; convert the database into two-line FASTA format (see the sketch after this list) and set its path (path_to_ur90)
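Converting a multi-line FASTA file to two-line format (one header line followed by the full sequence on a single line) can be done with a small script such as this sketch; use whatever tool you prefer:

```python
def to_two_line_fasta(src_path, dst_path):
    """Rewrite a FASTA file so each record is exactly two lines:
    one '>' header line followed by the full sequence on one line."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        seq_parts = []
        for line in src:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if seq_parts:                       # flush the previous record's sequence
                    dst.write("".join(seq_parts) + "\n")
                    seq_parts = []
                dst.write(line + "\n")
            elif line:
                seq_parts.append(line)
        if seq_parts:                               # flush the final record
            dst.write("".join(seq_parts) + "\n")

# e.g. to_two_line_fasta("uniref90.fasta", "path_to_ur90/uniref90_2line.fasta")
```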

Run a demo

  • Go to the directory of this repo and cd app
  • Set the three paths in retrieve_app-v6.py, then run streamlit run retrieve_app-v6.py

Customize dataset or context

If you would like to use your own database:

  • cd dense_encode
  • mkdir $CTXDIR
  • Set the save_path arg in generate_vec_one.py to $CTXDIR
  • Decide whether to use multiple GPUs; note that encoding the ~70M sequences takes around 24 hours on four V100 GPUs. A sketch of the per-rank pattern follows this list.
    • single card:
      • set DISTRIBUTED=False in gen_u90.py
      • python3 generate_vec_one.py
    • distributed mode:
      • set DISTRIBUTED=True in gen_u90.py
      • python3 -m torch.distributed.launch --nproc_per_node=4 generate_vec_one.py
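The pattern behind the single-card/distributed toggle is that each launched process encodes a disjoint stripe of the database and writes its own shard. A hedged sketch, with load_sequences and encode_batch as placeholders standing in for the repo's actual data loading and encoder:

```python
import os
import numpy as np

# Each GPU process launched by `torch.distributed.launch` encodes a disjoint
# stripe of the database and saves its own shard of context vectors.

def load_sequences():
    # Placeholder: in the repo this is the two-line FASTA database.
    return ["MKTAYIAK", "GAVLKVLT", "LLSDEFGH"]

def encode_batch(seqs):
    # Placeholder: in the repo this is the context encoder's forward pass.
    return np.random.rand(len(seqs), 768).astype("float32")

rank = int(os.environ.get("LOCAL_RANK", 0))    # set by torch.distributed.launch
world = int(os.environ.get("WORLD_SIZE", 1))   # falls back to single-card mode

my_slice = load_sequences()[rank::world]       # disjoint stripe for this rank
np.save(f"ctx_shard_{rank}.npy", encode_batch(my_slice))
```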

Training

  • cd train
  • mkdir saved_model
  • Identify the number of GPUs and decide whether to use parallel training; a sketch of the usual training objective follows this list.
    • single card mode:
      • set the DISTRIBUTED arg in main.py to False
      • python3 main.py
    • distributed mode:
      • set the DISTRIBUTED arg in main.py to True
      • python3 -m torch.distributed.launch --nproc_per_node=$NUM_OF_GPU main.py
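Bi-encoders of this kind are commonly trained with an in-batch contrastive objective, where each query's positive context serves as a negative for every other query in the batch. This is a sketch of that standard loss, which may differ from the exact objective in train/main.py:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_vecs, c_vecs):
    """In-batch negatives loss for a bi-encoder: row i of the score matrix
    should put its highest score on column i (its positive context)."""
    scores = q_vecs @ c_vecs.T                 # (batch, batch) dot-product scores
    targets = torch.arange(q_vecs.size(0))     # positive pairs sit on the diagonal
    return F.cross_entropy(scores, targets)

# Toy usage with random embeddings (batch size 24, as on a 32 GB V100):
q = torch.randn(24, 256)   # query embeddings
c = torch.randn(24, 256)   # matching context embeddings
loss = in_batch_contrastive_loss(q, c)
```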

Dependencies

Dependencies are listed in ./requirements.txt and installed by the conda create command above; phylopandas is installed separately (see Environment setup).
