This details the workflow configuration
Workflow Configuration

When the workflow was first downloaded (in the download section), a config.yaml file was downloaded as well. Open this file and modify the values to your needs.

To make the BED_FILE, RRNA_INTERVAL_LIST, and REF_FLAT_FILE, see slides 3 and 4 in [this Google Slides][] presentation, which includes examples for the human reference genome.

The most up-to-date version of this file can be found here.


The master control file (found under controls/master_control.csv) is a CSV file consisting of the following columns:

  • SRR codes
  • Tissue names + study number, replicate number, and (if applicable) a run number
  • Library layouts
    • PE for Paired-End
    • SE for Single-End
  • Library preparation column
    • total for total RNA seq
    • mrna for PolyA/mRNA RNA seq

If you have PERFORM_PREFETCH set to False in the config.yaml file, you do not need to modify the master control file. This assumes you are providing the sra files yourself. An example of this file is as follows:

SRR Tissue/Study/Replicate/Run Library Layout Library Prep
SRR7647658 naiveB_S1R1 PE mrna
SRR7647700 naiveB_S1R2 PE mrna
SRR7647769 naiveB_S2R1r1 PE mrna
SRR7647808 naiveB_S2R1r2 PE mrna
SRR5110334 naiveB_S3R1 SE total
SRR5110338 naiveB_S3R2 SE total
SRR10408536 m2Macro_S1R1r1 SE total
SRR10408537 m2Macro_S1R1r2 SE total
SRR10408538 m2Macro_S1R1r3 SE total
SRR10408539 m2Macro_S2R1 SE mrna
SRR10408540 m2Macro_S2R2 SE mrna
SRR10408541 m2Macro_S2R3 SE mrna
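
A malformed row in the master control file is easy to miss, so a quick check before running the workflow may help. The following is a minimal sketch, not part of the pipeline: the function name `check_master_control` is hypothetical, and it assumes the file is comma-separated with four fields per row and no header row.

```shell
# Hypothetical helper: check each row of a master control CSV.
# Assumes comma-separated rows with four fields and no header row.
check_master_control() {
    awk -F',' '
        { sub(/\r$/, "") }                    # tolerate Windows line endings
        NF == 0 { next }                      # skip blank lines
        NF != 4 { printf "line %d: expected 4 fields, found %d\n", NR, NF; bad = 1; next }
        $1 !~ /^SRR[0-9]+$/ { printf "line %d: unexpected SRR code %s\n", NR, $1; bad = 1 }
        $3 != "PE" && $3 != "SE" { printf "line %d: layout must be PE or SE, found %s\n", NR, $3; bad = 1 }
        $4 != "total" && $4 != "mrna" { printf "line %d: prep must be total or mrna, found %s\n", NR, $4; bad = 1 }
        END { exit bad + 0 }                  # non-zero exit status if any row failed
    ' "$1"
}

# Example usage (path taken from the text above):
# check_master_control controls/master_control.csv && echo "master control looks OK"
```

The function prints one message per problem and returns a non-zero status if any row fails, so it can be chained in front of the workflow command.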


This option is only required if you have set PERFORM_PREFETCH to False. It is the directory in which your input .fastq.gz files are located.


The relative file path where results should be placed. This is most likely going to be under a /work folder. The default value is results, which will place results in the results folder in the current directory.


The path to a refFlat file for your reference genome. This can be made using the following format:

# Download the gtf to refFlat converter

# Add execution permissions
chmod +x ./gtfToGenePred

# Execute the gtf to refFlat converter
# The `genome/Homo_sapiens.GRCh38.105.gtf` is the path to your gtf file
# The last argument, `refFlat.tmp.txt` is the output filename
./gtfToGenePred -genePredExt -geneNameAsName2 genome/Homo_sapiens.GRCh38.105.gtf refFlat.tmp.txt

# Modify values so Picard is able to parse the refFlat file correctly
# `refFlat.tmp.txt` is the output of the previous command
# `genome/refFlat_GRCh38.105.txt` is the path (and file name) you would like to save results to
paste <(cut -f 12 refFlat.tmp.txt) <(cut -f 1-10 refFlat.tmp.txt) > genome/refFlat_GRCh38.105.txt

# Remove the temporary refFlat file
rm refFlat.tmp.txt
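
The `paste` step above puts column 12 of the genePredExt output (the gene name) in front of columns 1 to 10, so every row of the final refFlat file should have 11 tab-separated fields. A minimal sanity check, assuming that layout, could look like the sketch below; the function name `check_refflat_columns` is hypothetical.

```shell
# Hypothetical helper: confirm every row of a refFlat file has 11 tab-separated fields.
check_refflat_columns() {
    awk -F'\t' '
        NF != 11 { printf "line %d has %d columns, expected 11\n", NR, NF; bad = 1 }
        END { exit bad + 0 }                  # non-zero exit status if any row failed
    ' "$1"
}

# Example usage (file name taken from the commands above):
# check_refflat_columns genome/refFlat_GRCh38.105.txt && echo "refFlat looks OK"
```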


The path to a ribosomal interval list built from the GTF file for Picard’s CollectRnaSeqMetrics command. This finds rRNA transcript quantities.

The file was downloaded with the pipeline. You should modify the values so they satisfy your folder paths.
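
If you need to build an interval list for a different reference, the body rows can be pulled from the GTF. The sketch below is one possible approach and not the method used to produce the bundled file. It assumes an Ensembl-style GTF in which rRNA genes carry `gene_biotype "rRNA"`, and the function name `rrna_intervals_from_gtf` is hypothetical. It prints only the tab-separated body (sequence, start, end, strand, name); Picard also expects a SAM-style sequence dictionary header above these rows, for example one produced by Picard's CreateSequenceDictionary for the same genome.

```shell
# Hypothetical helper: print interval-list body rows (no header) for rRNA genes.
# Assumes an Ensembl-style GTF where rRNA genes carry gene_biotype "rRNA".
rrna_intervals_from_gtf() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { next }                         # skip GTF comment lines
        $3 == "gene" && $9 ~ /gene_biotype "rRNA"/ {
            name = "."
            # Extract the gene_id value from the attributes column
            if (match($9, /gene_id "[^"]+"/)) {
                name = substr($9, RSTART + 9, RLENGTH - 10)
            }
            print $1, $4, $5, $7, name        # sequence, start, end, strand, name
        }
    ' "$1"
}

# Example usage (GTF path taken from the refFlat example; output name is illustrative):
# rrna_intervals_from_gtf genome/Homo_sapiens.GRCh38.105.gtf > rrna_body.txt
```

GTF coordinates and interval-list coordinates are both 1-based and inclusive, so the start and end values are copied unchanged.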


The path to a BED file for RSeQC, also built from the GTF_FILE, which corresponds to your reference genome. This can be made using the following:

# Change directories into your `genome` directory
cd genome

# Download the BED file

# Change the name to a readable name
mv download hg38_GENCODE.v38.bed.gz

# Unzip the file
gunzip hg38_GENCODE.v38.bed.gz

# Remove quotation marks from exon positions
sed -i 's/"//g' hg38_GENCODE.v38.bed

# Remove "chr" from chromosome indices
sed -i 's/^chr//' hg38_GENCODE.v38.bed


Should trimming of reads be performed? True or False


Screen against genomes of common contaminants?
True or False
The current contaminants screened against are:


Use Picard’s CollectRnaSeqMetrics?
True or False
This requires REF_FLAT_FILE and RRNA_INTERVAL_LIST to be set.
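
Before enabling this option, it can be worth confirming that both values are set and point to existing files. The sketch below is a rough check, not part of the workflow. It assumes config.yaml stores each option as a top-level `KEY: value` line, and the helper names `config_value` and `check_config_paths` are hypothetical.

```shell
# Hypothetical helper: print the value of a top-level "KEY: value" line.
# $1 = key, $2 = config file
config_value() {
    awk -v key="$1" '
        index($0, key ":") == 1 {
            value = substr($0, length(key) + 2)
            sub(/[ \t]+#.*$/, "", value)                   # drop a trailing comment
            gsub(/^[ \t"\047]+|[ \t"\047]+$/, "", value)   # trim spaces and quotes
            print value
            exit
        }
    ' "$2"
}

# Hypothetical helper: report keys that are unset or point to missing paths.
# $1 = config file, remaining arguments = keys to check
check_config_paths() {
    config=$1
    shift
    status=0
    for key in "$@"; do
        value=$(config_value "$key" "$config")
        if [ -z "$value" ]; then
            echo "$key is not set in $config"
            status=1
        elif [ ! -e "$value" ]; then
            echo "$key points to a missing path: $value"
            status=1
        fi
    done
    return "$status"
}

# Example usage:
# check_config_paths config.yaml REF_FLAT_FILE RRNA_INTERVAL_LIST
```

Relative paths are resolved against the directory the check is run from, so run it from the same directory you launch the workflow in.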


If you only have SRR codes (from MASTER_CONTROL), this option will download those sra files from NCBI.
True or False


Get the interval size using Picard? True or False


Get fragment sizes with RSeQC? True or False


The path to the directory where genome output should be saved. This should be under your /work folder, as large files will be created.


This is the input genome FASTA file, which has been previously downloaded.
It is most likely located under your /work folder.
It can be downloaded using the following:

# Change directories into your `genome` directory
cd genome

# Download the assembly
# To set the release number, set the following variable

# Unzip the file
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz


This is the input GTF genome file, which has also been previously downloaded.
It is most likely located under your /work folder.
The human genome annotation can be downloaded with:

# Change directories into your `genome` directory
cd genome

# Download the annotations
# To set releases, modify the following variable to the release number

You do not need to make any further changes. Snakemake will extract the configuration values you have set up.

Once these steps are complete, the workflow should be prepared to execute.
Continue to the next page to execute the workflow.