Skip to content

Subsurface Aquifer Tutorial

Elizabeth McDaniel edited this page Dec 18, 2019 · 6 revisions

So you have a set of reconstructed genomes from a metagenomic experiment you performed form your favorite environment. In addition to assigning phylogenetic classifications to your set of population genomes and performing functional annotations to describe broad metabolic guilds, you may possibly have some specific biological question in mind or a specific biogeochemical cycle that you are super interested in. This example workflow uses a set of publicly available metagenome-assembled genomes (MAGs) from a subsurface aquifer system in Anantharaman et al. 2016 "Thousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system." These genomes are used to explore the Wood-Ljungdahl pathway for carbon fixation through constructing phylogenies and making presence/absence matrices. This example and all of the metabolisHMM workflows assume that you have a set of population genomes that you have performed quality statistics and curation on, and have preliminary or defined phylogenetic classifications through using an automatic classification tool or manual approach. This tool does not check the quality or validate the phylogenetic assignment of your MAGs/SAGs/genomes, but rather facilitates downstream analyses for your specific biological question.

Downloading Publicly Available Genomes

For this example, all bacterial and archaeal genomes from the subsurface aquifer study were downloaded using ncbi-genome-download. These steps for downloading and reformatting genome files can also be found in the genomes-MAGs-database repository. In addition to analyzing your own genomes, you may want to download genomes from other metagenomic experiments or reference isolates to compare against. ncbi-genome-download is a great tool for downloading genomes from NCBI through the Refseq and/or Genbank databases. To download genomes from a specific sequencing project, go to the corresponding BioProject page and click "Download":


Take this file and make a list of genbank accessions with awk '{print $1}'PRJNA288027_AssemblyDetails.txt | tail -n +3 > aquifer-accessions.txt. The accession list file looks like:

GCA_001766905.1
GCA_001766895.1
GCA_001766875.1
GCA_001766955.1
GCA_001766965.1
GCA_001766975.1
GCA_001766985.1
GCA_001768595.1
GCA_001767145.1
GCA_001767035.1
...

Then pass this file to ncbi-genome-download with the command:

ncbi-genome-download -s genbank -F fasta -A aquifer-accessions.txt -o aquifer-genomes -m aquifer-genomes-metadata.txt all

The files may need a little reformatting. Additionally, I would recommend having a set of the raw nucleotide files and a set of the predicted protein files. Although metabolisHMM will perform this step for you if you provide nucleotide files, it's good practice to have those files for yourself. To reformat the downloaded genomes and get protein fasta files, run this entire script:

#! /bin/bash
# Reformat and call genes on all sequences, outputting the protein files

# unzip downloaded files
gunzip */*.gz
echo Unzipped all files

# rename based on directory name
for filename in */*.fna; do mv $filename ${filename%/*}/${filename%/*}.fna; done
echo Renamed all files

# call genes with Prodigal and output proteins
for file in */*.fna; do 
    N=$(basename $file .fna);
    prodigal -i $file -a $N/$N.faa;
done

You are now ready to run the metabolisHMM workflows on your genomes. Move either all of the .fna or .faa into one directory. Additionally, if you want to keep your bacterial and archaeal genomes separate for making ribosomal phylogenies, create separate directories of copies of those specific files.

Single Marker Phylogeny

We want to explore the distribution and phylogeny of the folD marker, which catalyzes the oxidation of 5,10-methylenetetrahydrofolate to 5,10-methenyltetrahydrofolate as part of the methyl branch of the Wood-Ljungdahl pathway. I have provided a custom HMM profile for the folD marker from the KofamKOALA distribution of HMMs. To create a single marker phylogeny of the folD protein, we would run:

single-marker-phylogeny --input prots/ --output folD --marker WLJ/folD.hmm --phylogeny fastree

This will create a directory called folD, and contains all reformatted and intermediate files, and the results for a list of genomes that contain a hit for the folD marker, fasta sequences for the proteins of those hits, alignment file, and tree file constructed with FastTree. If you decide to use RaxML for building phylogenies, you can also add an optional argument for adding more threads.

If you want to compare the phylogeny of folD of these genomes to their corresponding ribosomal phylogeny, you can make a corresponding ribosomal phylogeny by turning the --ribo_tree argument ON, and providing what domain your genomes are from, such as bacteria, archaea, or all, as the specific markers differ between bacteria and archaea. This takes advantage of the create-genome-phylogeny workflow to only give you a tree of the folD hits. You can also output necessary metadata files configured for exploring your trees in iTOL. An example of this command is:

single-marker-phylogeny --input prots/ --output folD_and_ribo --marker WLJ/folD.hmm --phylogeny fastree --ribo_tree ON --domain all --metadata ON --names aquifer-group-names.csv

Where the metadata file is formatted as follows:

GCA_001940655.1,Asgard
GCA_001786585.1,DPANN
GCA_001784635.1,DPANN
GCA_001787285.1,DPANN
GCA_001786395.1,DPANN
GCA_001786425.1,DPANN
GCA_001786415.1,DPANN
GCA_001791825.1,DPANN
GCA_001800665.1,Euryarchaeota
GCA_001800675.1,Euryarchaeota

The metadata file is a CSV formatted file where the first column are the filenames of your genomes without the .fna or .faa extension, and the second column is the group name you want to appear in your tree. This will create an iTOL configured file for all of your genomes in your set, not just the ones with the single-marker hits, but iTOL will only create the figure for the names of genomes that are included in the tree.

With some formatting in iTOL and Illustrator, you can create a comparison of the single marker gene phylogeny of folD and the corresponding ribosomal protein phylogeny of the genomes containing folD:

Ribosomal Phylogeny

If you wish to create a ribosomal tree of all your input genomes, and not just certain ones with a particular marker like above, you can use the create-genome-phylogeny workflow, such as:

create-genome-phylogeny --input prots/ --output ribo --domain all --phylogeny fastree

Where you need to specify the domain for which your genomes belong to, as this will change which markers are used for creating the alignment. The options are bacteria, archaea, or all if you want to create a tree with both bacterial and archaeal members. To output associated metadata for use with iTOL:

create-genome-phylogeny --input prots/ --output ribo --domain all --phylogeny fastree --metadata aquifer-groups.csv --itol_files ON

Broad Metabolic Summaries

For a set of newly assembled genomes, most of the time you want the breakdown of major biogeochemical cycles or metabolic guilds your genomes fall into. The curate metabolic summaries workflow screens all genomes for markers spanning the carbon, nitrogen, sulfur, oxygen, and hydrogen cycles so you can understand the main functions present in your genomes. In order to use these visualization workflows, you will need to have the make-heatmap.R script in the current working directory that you are running the workflow from. The basic command for running the summarize-metabolism workflow is:

summarize-metabolism --input prots/ --output summaries --metadata groups.csv

Where you have a CSV formatted metadata file that either names the genomes as the names of the filenames, or specific names you have for your genome groups such as:

bacteria00190,bacteria00190
bacteria00193,bacteria00193
bacteria00203,bacteria00203
bacteria00229,bacteria00229
bacteria01060v2,bacteria01060v2
bacteria23257,bacteria23257
bacteria23258,bacteria23258
bacteria23259,bacteria23259
bacteria23260,bacteria23260
bacteria23263,bacteria23263

or

GCA_001940655.1,Asgard
GCA_001786585.1,DPANN
GCA_001784635.1,DPANN
GCA_001787285.1,DPANN
GCA_001786395.1,DPANN
GCA_001786425.1,DPANN
GCA_001786415.1,DPANN
GCA_001791825.1,DPANN
GCA_001800665.1,Euryarchaeota
GCA_001800675.1,Euryarchaeota
GCA_001800715.1,Euryarchaeota
GCA_001800735.1,Euryarchaeota
GCA_001800745.1,Euryarchaeota
GCA_001800755.1,Euryarchaeota
GCA_001800795.1,Euryarchaeota
GCA_001800815.1,Euryarchaeota
GCA_001800825.1,Euryarchaeota
GCA_001775955.1,TACK
GCA_001774235.1,TACK
GCA_001774245.1,TACK

The visualization workflows also have an optional argument that can change how you explore the metabolic summaries within your genomes. This is specified using the --aggregate option. If you have a subset of your genomes you are interested in, and want to look at metabolic functions for each genome, then the --aggregate option by default is OFF.

If you have 100s of genomes and want to understand the breakdown of metabolic markers within those groups, you can turn the --aggregate option on, and in your metadata file are the names of groups for which the genomes belong to, which results in:

Where the intensity of shading within a cell is the proportion of genomes within that group that contain that marker. By default, the output plot doesn't include the legends or names of the cycles at the top as appears in the manuscript. The purpose of this workflow is to get a cursory look at broad summaries of biogeochemical transformations. To get a prettier version of the figure with legends and whatnot, this requires some manual editing of the automatic output from the workflow, and you can get something like this:

Wood-Ljungdahl Pathway Characterization

Sometimes you want to create presence/absence summaries and heatmaps of a custom set of markers for a pathway of interest, or a random selection of markers that you quickly want to explore and visualize the distribution for. The search-custom-markers workflow takes a list of any set of HMM profiles with corresponding threshold cutoffs, and the list is ordered in any way that you want to be represented, and will output a heatmap visualization and summary statistics for presence/absence of those markers. As an example, a subset of the aquifer genomes were used to search for the bacterial subunits of the Wood-Ljungdahl pathway for carbon fixation. For a set of markers within any directory, list them in the order you want them to appear in the table and heatmap in a text file, such as wlj-markers.txt:

fdhA.hmm
fdhB.hmm
fhs.hmm
folD.hmm
metF.hmm
cdhA.hmm
cdhB.hmm
cdhC.hmm
cdhD.hmm
cdhE.hmm
acsE.hmm

Then the command will look like:

search-custom-markers --input prots/ --output WLJ --markers_dir WLJ_markers/ --markers_list wlj-markers.txt --metadata aquifer-genomes.csv

You can also add the same --aggregate options as in the summarize-metabolism workflow, so the heatmap summarizes the breakdown of a marker per group:

search-custom-markers --input prots/ --output WLJ --markers_dir WLJ_markers/ --markers_list wlj-markers.txt --metadata aquifer-genomes.csv --aggregate ON

Where the heatmap would look like:

Concluding Remarks

Overall, metabolisHMM provides flexible & reproducible workflows for performing common comparative genomics analyses with any set of HMM markers so you can further explore the functional potential of your genomes.