This details the workflow configuration
Edit me

Workflow Configuration

When the workflow was first downloaded (in the download section), a config.yaml file was downloaded as well. Open this file and modify the values to your needs.

To make the BED_FILE, RRNA_INTERVAL_LIST, and REF_FLAT_FILE, see slides 3 and 4 in [this Google Slides][https://docs.google.com/presentation/d/1gxlxbIObhxitgrPLp7lByYFwrFhEdvEm4mILmygAATY/edit#slide=id.g111a3589bd3_0_0] presentation, with examples for the Human Reference Genome

The most up-to-date version of this file can be found here.

MASTER_CONTROL

The master contol file (found under controls/master_control.csv) is a CSV file consisting of the following columns:

  • SRR codes
  • Tissue names + study number, replicate number, and (if applicable) a run number
  • Library layouts
    • PE for Paired-End
    • SE for Single-End
  • Library preparation column
    • total for total RNA seq
    • mrna for PolyA/mRNA RNA seq

If you have PERFORM_PREFETCH set to False in the config.yaml file, do not need to modify the master control file. This assumes you are providing the sra files yourself. An example of this file is as follows:

SRR Tissue/Study/Replicate/Run Library Layout Library Prep
SRR7647658 naiveB_S1R1 PE mrna
SRR7647700 naiveB_S1R2 PE mrna
SRR7647769 naiveB_S2R1r1 PE mrna
SRR7647808 naiveB_S2R1r2 PE mrna
SRR5110334 naiveB_S3R1 SE total
SRR5110338 naiveB_S3R2 SE total
SRR10408536 m2Macro_S1R1r1 SE total
SRR10408537 m2Macro_S1R1r2 SE total
SRR10408538 m2Macro_S1R1r3 SE total
SRR10408539 m2Macro_S2R1 SE mrna
SRR10408540 m2Macro_S2R2 SE mrna
SRR10408541 m2Macro_S2R3 SE mrna

DUMP_FASTQ_FILES

This options is only required if you have set PERFORM_PREFETCH to False. It is the location at which your input .fatsq.gz files are located

ROOTDIR

The relative file path where results should be placed. This is most likely going to be under a /work folder. The default value is results, which will place results in the results folder in the current directory.

REF_FLAT_FILE

The path to a refFlat file for your reference genome. This can be made using the following format:

# Download the gtf to refFlat converter
wget https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred

# Add execution permissions
chmod =rwx,g+s ./gtfToGenePred

# Execute the gtf to refFlat converter
# The `genome/Homo_sapiens.GRCh38.105.gtf` is the path to your gtf file
# The last argument, `refFlat.tmp.txt` is the output filename
./gtfToGenePRed -genePredExt -geneNameAsName2 genome/Homo_sapiens.GRCh38.105.gtf refFlat.tmp.txt

# Modify values so Picard is able to parse the refFlat file correctly
# `refFlat.tmp.txt` is the output of the previous command
# `genome/refFlat_GRCh38.105.txt` is the path (and file name) you would like to save results to
paste <(cut -f 12 refFlat.tmp.txt) <(cut -f 1-10 refFlat.tmp.txt) > genome/refFlat_GRCh38.105.txt

# Remove the temporary refFlat file
rm refFlat.tmp.txt

RRNA_INTERVAL_LIST

THe path to a ribosomal interval list built from the GTF file for Picard’s GetRNASeqMetrics command. This finds rRNA transcript quantities.

The riboInt.sh file was downloaded with the pipeline. You should modify the values so they satisfy your folder paths.

BED_FILE

The path to a BED file for RSeQC, also built from the GTF_FILE, which corresponds to your reference genome. This can be made using the following:

# Change directories into your `genome` directory
cd genome

# Download the BED file
wget https://sourceforge.net/projects/rseqc/files/BED/Human_Homo_sapiens/hg38_GENCODE.v38.bed.gz/download

# Change the name to a readable name
mv download hg38_GENCODE.v38.bed.gz

# Unzip the file
gunzip hg38_GENCODE.v38.bed.gz

# Remove quotation marks from exon positions
sed -i 's/"//g' hg38_GENCODE.v38.bed

# Remove "chr" from chromosome indices
sed -i 's/chr//g' hg38_GENCODE.v38.bed

PERFORM_TRIM

Should trimming of reads be performed? True or False

PERFORM_SCREEN

Screen against genomes of common contaminants?
True or False
The current contaminants screened against are:

PERFORM_GET_RNASEQ_METRICS

Use Picard’s getRNASeqMetrics?
True or False
This required REF_FLAT_FILE and RRNA_INTERVAL_LIST to be set.

PERFORM_PREFETCH

If you only have SRR codes (from MASTER_CONTROL), this option will download those sra files from NCBI.
True or False

PERFORM_GET_INSERT_SIZE

Get the interval size using Picard? True or False

GET_FRAGMENT_SIZE

Get fragment sizes with RSeQC? True or False

GENOME_SAVE_DIR

The path to the directory where genome output should be saved to. This should be under your /work folder, as large files will be created

GENOME_FASTA_FILE

This is the input genome fasta file that has been previously downloaded.
Most likely located under your /work folder.
Can be downloaded using the following:

# Change directories into your `genome` directory
cd genome

# Download the assembly
# To set the release number, set the following variable
assembly_release=105
wget ftp://ftp.ensembl.org/pub/release-${assembly_release}/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

# Unzip the file
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

GTF_FILE

This is the input GTF genome file that has also been previously downloaded
Most likely located under your /work folder
Can be downloaded the human genome annotation with:

# Change directories into your `genome` directory
cd genome

# Download the annotations
# To set releases, modify the following variable to the release number
annotation_release=105
wget ftp://ftp.ensembl.org/pub/release-${annotation_release}/gtf/homo_sapiens/Homo_sapiens.GRCh38.${annotation_release}.gtf.gz

You do not need to make any further changes. SnakeMake will extract the configuration values you have set up

Once these steps are complete, the workflow should be prepared to execute.
Continue to the next page to execute the workflow