Skip to content

Snakefile

Francisco Zorrilla edited this page Mar 29, 2021 · 7 revisions

Overview

The Snakefile is the backbone of the metaGEM workflow. This file is the recipe book that contains all the instructions necessary to take inputs from one bioinformatic tool, process them, and pass them on to the next set of analyses. When experiencing errors, you should always look at what the Snakefile rule is trying to do in order to troubleshoot.

Snakefiles are split into rules, which are structured like this:

rule nameOfRule:  
    input:
        youCanHave = rules.previousRule.output,
        multipleInputs = f'{config["path"]["root"]}/{config["folder"]["qfiltered"]}'
    output:
        same = directory(f'{config["path"]["root"]}/{config["folder"]["concoct"]}/{{IDs}}/cov'),
        with = directory(f'{config["path"]["root"]}/{config["folder"]["metabat"]}/{{IDs}}/cov'),
        outputs = directory(f'{config["path"]["root"]}/{config["folder"]["maxbin"]}/{{IDs}}/cov')
    benchmark:
        f'{config["path"]["root"]}/benchmarks/{{IDs}}.ruleName.benchmark.txt'
    message:
        """
        Explanation of what the rule is trying to do goes here, usually with some warnings or word of caution.
        """
    shell:
        """
        # some bash code goes here that probably runs some super fancy algos
        """

Wildcards

The Snakefile rules generally make use of wildcards, which are useful for expanding job submissions for all samples in a dataset. We can easily switch wildcards to expand jobs based on sample IDs, or bin IDs, or GEM IDs, etc. as needed. The wildcard {{IDs}} is expanded based on the sample IDs using the following code at the top of the Snakefile:

import os
import glob

def get_ids_from_path_pattern(path_pattern):
    ids = sorted([os.path.basename(os.path.splitext(val)[0])
                  for val in (glob.glob(path_pattern))])
    return ids

IDs = get_ids_from_path_pattern('dataset/*')

DATA_READS_1 = f'{config["path"]["root"]}/{config["folder"]["data"]}/{{IDs}}/{{IDs}}_R1.fastq.gz'
DATA_READS_2 = f'{config["path"]["root"]}/{config["folder"]["data"]}/{{IDs}}/{{IDs}}_R2.fastq.gz'

That is, the Snakefile will search for the folder called dataset and will take the sub-folder names to be sample IDs. For practical purposes we usually delete the raw dataset after quality filtering, especially if working with multiple/large datasets, in which case you should replace the dataset folder with qfiltered, for example, inside the get_ids_from_path_pattern() function.

Note that the root folder is defined in the config.yaml file, and is automatically set as the working directory when running the metaGEM.sh parser.