
Snakefile

Francisco Zorrilla edited this page Mar 26, 2021 · 7 revisions

The Snakefile is the backbone of the metaGEM workflow. This file is the recipe book containing all the instructions needed to take inputs from one bioinformatic tool, process them, and pass them on to the next set of analyses. When you run into errors, start troubleshooting by looking at what the failing Snakefile rule is trying to do.

Snakefiles are split into rules, which are structured like this:

rule nameOfRule:  
    input:
        youCanHave = rules.previousRule.output,
        multipleInputs = f'{config["path"]["root"]}/{config["folder"]["qfiltered"]}'
    output:
        same = directory(f'{config["path"]["root"]}/{config["folder"]["concoct"]}/{{IDs}}/cov'),
        goesFor = directory(f'{config["path"]["root"]}/{config["folder"]["metabat"]}/{{IDs}}/cov'),
        outputs = directory(f'{config["path"]["root"]}/{config["folder"]["maxbin"]}/{{IDs}}/cov')
    benchmark:
        f'{config["path"]["root"]}/benchmarks/{{IDs}}.ruleName.benchmark.txt'
    message:
        """
        Explanation of what the rule is trying to do goes here, usually with some warnings or word of caution.
        """
    shell:
        """
        # some bash code goes here that probably runs some super fancy algos
        """

The Snakefile rules generally make use of wildcards, which are useful for expanding job submissions to all samples in a dataset. In plain paths the wildcard is written {IDs}; inside f-strings it appears as {{IDs}} so that the braces survive formatting. It is expanded based on the sample IDs using the following code at the top of the Snakefile:

import os
import glob

def get_ids_from_path_pattern(path_pattern):
    ids = sorted([os.path.basename(os.path.splitext(val)[0])
                  for val in (glob.glob(path_pattern))])
    return ids

IDs = get_ids_from_path_pattern('dataset/*')

DATA_READS_1 = f'{config["path"]["root"]}/{config["folder"]["data"]}/{{IDs}}/{{IDs}}_R1.fastq.gz'
DATA_READS_2 = f'{config["path"]["root"]}/{config["folder"]["data"]}/{{IDs}}/{{IDs}}_R2.fastq.gz'
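Conceptually, Snakemake expands such a pattern over all sample IDs, for example when a downstream rule requests every sample's output. A minimal pure-Python sketch of that expansion (not metaGEM code; the sample names are hypothetical):

```python
# Minimal sketch of what Snakemake's expand() does with the {IDs}
# wildcard: substitute every sample ID into the path pattern.
def expand_pattern(pattern, ids):
    return [pattern.format(IDs=sample) for sample in ids]

IDs = ["sampleA", "sampleB"]
print(expand_pattern("dataset/{IDs}/{IDs}_R1.fastq.gz", IDs))
# → ['dataset/sampleA/sampleA_R1.fastq.gz', 'dataset/sampleB/sampleB_R1.fastq.gz']
```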

That is, the Snakefile will search for the folder called dataset and take its sub-folder names to be the sample IDs. For practical purposes we usually delete the raw dataset after quality filtering, especially when working with multiple or large datasets, in which case you should replace the dataset folder in the pattern with qfiltered, for example.
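If the IDs should instead come from the quality-filtered reads, only the glob pattern changes. The helper behaves the same either way; a runnable sketch using a throwaway directory as a stand-in for the qfiltered folder (the sample IDs are hypothetical):

```python
import glob
import os
import tempfile

def get_ids_from_path_pattern(path_pattern):
    # Same helper as in the Snakefile: basenames of the matched
    # paths (minus any extension) become the sample IDs.
    return sorted(os.path.basename(os.path.splitext(val)[0])
                  for val in glob.glob(path_pattern))

# Throwaway directory standing in for qfiltered/.
with tempfile.TemporaryDirectory() as tmp:
    for sample in ("ERR260132", "ERR260133"):  # hypothetical sample IDs
        os.makedirs(os.path.join(tmp, sample))
    print(get_ids_from_path_pattern(os.path.join(tmp, "*")))
    # → ['ERR260132', 'ERR260133']
```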

Note that the root folder is defined in the config.yaml file, and is automatically set as the working directory when running the metaGEM.sh parser.
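For reference, a hypothetical sketch of what the config.yaml keys used above might look like; metaGEM's actual file defines more paths and folders, so treat this only as an illustration of the path/folder structure the Snakefile reads:

```yaml
# Hypothetical excerpt of config.yaml, covering only the keys
# referenced in the rule examples on this page.
path:
  root: /home/user/metaGEM   # set automatically by the metaGEM.sh parser
folder:
  data: dataset
  qfiltered: qfiltered
  concoct: concoct
  metabat: metabat
  maxbin: maxbin
```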