
Quick Start and Full Example

Lucas Czech edited this page Jul 19, 2024 · 39 revisions

This page is meant for the impatient.

Get grenepipe

Download grenepipe to a location of your choice. We recommend using a release version, so that you have a stable point of reference. No installation is needed; just extract the files.

You will also need one of conda / miniconda / anaconda / mamba / micromamba. At the moment, we recommend micromamba, as it seems to be the easiest to install. However, the whole conda ecosystem is rather fast-moving, fragile, and unpredictable, so use whatever works for you at the moment. On computer clusters, one of these might already be available as a module, or you can install it locally for your user. We highly recommend using mamba (or micromamba) instead of conda, for speed. We will then use it to install and run Snakemake, which is the backend that grenepipe runs on.

Preparing the environment

Snakemake, conda/mamba, python, pandas, and numpy are notorious for causing trouble when their versions are mixed. To make sure that the same versions of these tools are always used, we run the main pipeline in an environment of its own, instead of relying on your local versions of the tools.

First, install micromamba locally or on your cluster. Then, use that to install and activate the grenepipe environment, from within the main grenepipe directory:

# Create and activate a conda environment for running snakemake.
cd /path/to/grenepipe
micromamba env create -f workflow/envs/grenepipe.yaml
micromamba activate grenepipe

Instead of micromamba, you can also use mamba or conda, depending on which one you decided to use.
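
Before moving on, it can help to confirm that the environment is active and actually provides Snakemake. The following is a minimal sketch: the function name check_grenepipe_env is our own invention and not part of grenepipe, and it only relies on the standard snakemake --version flag.

```shell
# Hypothetical helper (not part of grenepipe): check that Snakemake is available.
check_grenepipe_env() {
    if ! command -v snakemake >/dev/null 2>&1; then
        echo "snakemake not found; activate the grenepipe environment first" >&2
        return 1
    fi
    # Print the Snakemake version provided by the currently active environment.
    snakemake --version
}
```

Calling check_grenepipe_env right after activating the environment should print a version number; an error message instead suggests that the activation step did not work.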

Example dataset

We provide a small example dataset for testing at grenepipe/example. It contains the minimal set of files needed to run the pipeline:

  • samples.tsv: table listing all input fastq files.
  • samples/*.fastq.gz: actual sequence data, referenced from the table.
  • TAIR10_chr_all.fa.gz: reference genome (here, Arabidopsis thaliana).
  • known-variants.vcf.gz: known variants used to constrain the variant calling process. This file is based on the 1001 Genomes dataset, and was imputed and subset to serve as an example and for testing.
  • Lastly, a config.yaml file is needed to set up which input files, tools, and settings we are using. The main grenepipe directory contains the base config file, which we will use.

Both the config.yaml file and the samples.tsv table contain file paths, which need to match the location of grenepipe on your system. We now need to adjust them for the example.

To do so, call

# Prepare the config.yaml and samples.tsv as described above.
./example/prepare.sh 

This copies config/config.yaml to the example directory, and adjusts the paths in config.yaml and samples.tsv as needed. All other settings are left at their defaults.
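
If the pipeline later complains about missing input files, the adjusted paths are a good first thing to check. Below is a minimal sketch of such a check. The function check_fastq_paths is hypothetical and not part of grenepipe; it assumes only that the table is tab-separated, that the sequence files end in .fastq.gz, and that the paths contain no spaces.

```shell
# Hypothetical helper (not part of grenepipe): report fastq paths in a table that do not exist.
check_fastq_paths() {
    table="$1"
    missing=0
    # Split the table into one field per line, and keep the fields that look like fastq paths.
    # Assumption: paths contain no spaces, as the loop relies on word splitting.
    for fq in $(tr '\t' '\n' < "$table" | grep '\.fastq\.gz$'); do
        if [ ! -f "$fq" ]; then
            echo "missing: $fq"
            missing=1
        fi
    done
    # Exit status 0 if all listed files exist, 1 otherwise.
    return "$missing"
}
```

For the example, this would be called as check_fastq_paths example/samples.tsv from the main grenepipe directory; no output means that all listed files were found.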

Note: We are using Arabidopsis thaliana as a small example genome here; the pipeline itself is agnostic to the species under study.

Running the pipeline

The data can then be fully analyzed by running the following command from the main grenepipe directory:

# Run the pipeline!
snakemake --use-conda --directory example/

That's it.
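
If you would rather preview what Snakemake is going to do before committing compute time, its standard --dry-run option lists the planned jobs without executing them. The sketch below combines both steps; the wrapper function run_example is our own naming, not something grenepipe provides, and it assumes grenepipe v0.13.0 or later, so that no further options are needed.

```shell
# Hypothetical wrapper (not part of grenepipe): preview the jobs first, then run them.
run_example() {
    dir="${1:-example/}"
    # --dry-run only prints the planned jobs; nothing is executed.
    snakemake --use-conda --directory "$dir" --dry-run || return 1
    # The preview succeeded, so start the actual run.
    snakemake --use-conda --directory "$dir"
}
```

Stopping after a failed dry run is a deliberate choice here: problems with the config file or the sample table usually show up at that stage already.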

A note on grenepipe < v0.13.0

In grenepipe v0.13.0, we upgraded from Snakemake v6 to Snakemake v8. Snakemake v8 uses --conda-frontend mamba by default, and also sets the number of compute cores by default, so that it no longer has to be specified via, e.g., --cores 4. If you are still using a grenepipe version before v0.13.0, you will have to add both options to the above command.
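
For these older versions, the command would thus look roughly as in the following sketch. The wrapper function run_grenepipe_legacy and its default of 4 cores are purely illustrative; adjust the core count to your machine.

```shell
# Hypothetical wrapper (not part of grenepipe) for grenepipe < v0.13.0 with Snakemake v6.
run_grenepipe_legacy() {
    # $1: directory containing the config.yaml; $2: number of cores (assumed default: 4).
    snakemake --use-conda --conda-frontend mamba --cores "${2:-4}" --directory "$1"
}
```

For the example dataset, run_grenepipe_legacy example/ then corresponds to the command above with both options added.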

Note: Snakemake always needs to be run from within the directory where you downloaded grenepipe to. The --directory option then specifies where your config file is located, and hence where the output files are produced.

The most important outputs of the run are:

  • The calling/filtered-all.vcf.gz final variant call file (excluding SnpEff and VEP annotations).
  • The qc/multiqc.html MultiQC quality control statistics report.
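
To quickly confirm that a run produced both of these files, a check along the following lines can be used. This is a sketch only: check_outputs is a hypothetical helper, and it assumes that the two paths listed above are relative to the directory given via --directory.

```shell
# Hypothetical helper (not part of grenepipe): check that the main outputs exist and are not empty.
check_outputs() {
    dir="${1:-example}"
    missing=0
    for f in calling/filtered-all.vcf.gz qc/multiqc.html; do
        if [ -s "$dir/$f" ]; then
            echo "ok: $dir/$f"
        else
            echo "missing or empty: $dir/$f"
            missing=1
        fi
    done
    # Exit status 0 if both files are present, 1 otherwise.
    return "$missing"
}
```

After the example run, check_outputs example should report both files as ok.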

See Setup and Usage for more details.