Workflow Configuration
When the workflow was first downloaded (in the download section), a `config.yaml` file was downloaded as well. Open this file and modify the values to suit your needs.
To make the `BED_FILE`, `RRNA_INTERVAL_LIST`, and `REF_FLAT_FILE`, see slides 3 and 4 in [this Google Slides presentation](https://docs.google.com/presentation/d/1gxlxbIObhxitgrPLp7lByYFwrFhEdvEm4mILmygAATY/edit#slide=id.g111a3589bd3_0_0), which includes examples for the Human Reference Genome.
The most up-to-date version of this file can be found here.
MASTER_CONTROL
The master control file (found under `controls/master_control.csv`) is a CSV file consisting of the following columns:
- SRR codes
- Tissue names + study number, replicate number, and (if applicable) a run number
- Library layouts
  - `PE` for Paired-End
  - `SE` for Single-End
- Library preparation column
  - `total` for total RNA seq
  - `mrna` for PolyA/mRNA RNA seq
If you have `PERFORM_PREFETCH` set to `False` in the `config.yaml` file, you do not need to modify the master control file. This assumes you are providing the `sra` files yourself. An example of the master control file is as follows:
SRR | Tissue/Study/Replicate/Run | Library Layout | Library Prep |
---|---|---|---|
SRR7647658 | naiveB_S1R1 | PE | mrna |
SRR7647700 | naiveB_S1R2 | PE | mrna |
SRR7647769 | naiveB_S2R1r1 | PE | mrna |
SRR7647808 | naiveB_S2R1r2 | PE | mrna |
SRR5110334 | naiveB_S3R1 | SE | total |
SRR5110338 | naiveB_S3R2 | SE | total |
SRR10408536 | m2Macro_S1R1r1 | SE | total |
SRR10408537 | m2Macro_S1R1r2 | SE | total |
SRR10408538 | m2Macro_S1R1r3 | SE | total |
SRR10408539 | m2Macro_S2R1 | SE | mrna |
SRR10408540 | m2Macro_S2R2 | SE | mrna |
SRR10408541 | m2Macro_S2R3 | SE | mrna |
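The column rules above can be checked before a run. The following is a minimal sketch, not part of the workflow: `check_master_control` is a hypothetical helper, and it assumes the file is comma-separated with four fields per row and no header row. The tissue-name pattern is inferred from the example rows only.

```shell
# Hypothetical helper (not shipped with the workflow).
# Assumes: comma-separated, four fields per row, no header row.
check_master_control() {
    awk -F',' '
        NF != 4 {
            printf "line %d: expected 4 columns, found %d\n", NR, NF
            bad = 1
            next
        }
        # Tissue name pattern inferred from the example rows, e.g. naiveB_S2R1r2
        $2 !~ /_S[0-9]+R[0-9]+(r[0-9]+)?$/ {
            printf "line %d: unexpected tissue name: %s\n", NR, $2
            bad = 1
        }
        $3 != "PE" && $3 != "SE" {
            printf "line %d: library layout must be PE or SE, found: %s\n", NR, $3
            bad = 1
        }
        $4 != "total" && $4 != "mrna" {
            printf "line %d: library prep must be total or mrna, found: %s\n", NR, $4
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

For example, `check_master_control controls/master_control.csv` prints one message per problem and returns a non-zero exit status if any row fails a check.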
DUMP_FASTQ_FILES
This option is only required if you have set `PERFORM_PREFETCH` to `False`. It is the location of your input `.fastq.gz` files.
ROOTDIR
The relative file path where results should be placed. This is most likely going to be under a `/work` folder. The default value is `results`, which will place results in the `results` folder in the current directory.
REF_FLAT_FILE
The path to a `refFlat` file for your reference genome. This can be made using the following commands:
# Download the gtf to refFlat converter
wget https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
# Add execution permissions
chmod +x ./gtfToGenePred
# Execute the gtf to refFlat converter
# The `genome/Homo_sapiens.GRCh38.105.gtf` is the path to your gtf file
# The last argument, `refFlat.tmp.txt` is the output filename
./gtfToGenePred -genePredExt -geneNameAsName2 genome/Homo_sapiens.GRCh38.105.gtf refFlat.tmp.txt
# Modify values so Picard is able to parse the refFlat file correctly
# `refFlat.tmp.txt` is the output of the previous command
# `genome/refFlat_GRCh38.105.txt` is the path (and file name) you would like to save results to
paste <(cut -f 12 refFlat.tmp.txt) <(cut -f 1-10 refFlat.tmp.txt) > genome/refFlat_GRCh38.105.txt
# Remove the temporary refFlat file
rm refFlat.tmp.txt
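As a quick sanity check on the result, a refFlat file has 11 tab-separated columns (gene name, transcript name, chromosome, strand, transcript start and end, coding start and end, exon count, exon starts, exon ends). The sketch below counts rows that deviate from that shape; `count_bad_refflat_rows` is a hypothetical helper name and is not part of the workflow.

```shell
# Hypothetical helper: print the number of rows that do not have
# exactly 11 tab-separated fields.
count_bad_refflat_rows() {
    awk -F'\t' 'NF != 11 { n++ } END { print n + 0 }' "$1"
}
```

Running `count_bad_refflat_rows genome/refFlat_GRCh38.105.txt` and getting `0` means every row has the expected number of columns.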
RRNA_INTERVAL_LIST
The path to a ribosomal interval list built from the GTF file for Picard’s CollectRnaSeqMetrics command, which uses it to quantify rRNA transcripts.
The `riboInt.sh` file was downloaded with the pipeline. You should modify the values in it so they match your folder paths.
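The contents of `riboInt.sh` are not reproduced here. For orientation only, the sketch below shows one common way to build such an interval list: a header of `@SQ` lines followed by one line per rRNA gene (chromosome, start, end, strand, name). It assumes an Ensembl-style GTF with a `gene_biotype` attribute and a FASTA index (`.fai`, for example from `samtools faidx`) for the chromosome lengths. `make_rrna_interval_list` is a hypothetical name, and the script shipped with the pipeline may work differently.

```shell
# Hypothetical sketch, not the contents of riboInt.sh.
# $1 = GTF file (assumed Ensembl-style, with gene_biotype attributes)
# $2 = FASTA index (.fai): column 1 = sequence name, column 2 = length
make_rrna_interval_list() {
    gtf="$1"
    fai="$2"

    # Header: one @SQ line per reference sequence
    awk -F'\t' 'BEGIN { OFS = "\t" } { print "@SQ", "SN:" $1, "LN:" $2 }' "$fai"

    # Body: one interval per gene annotated exactly as gene_biotype "rRNA"
    awk -F'\t' 'BEGIN { OFS = "\t" }
        $3 == "gene" && $9 ~ /gene_biotype "rRNA"/ {
            # Extract the value of the gene_id attribute
            match($9, /gene_id "[^"]+"/)
            id = substr($9, RSTART + 9, RLENGTH - 10)
            print $1, $4, $5, $7, id
        }' "$gtf"
}
```

The function writes to standard output, so redirect it to the file you want to reference in `RRNA_INTERVAL_LIST`. Note that the filter matches the biotype `rRNA` only; other biotypes such as `Mt_rRNA` are not included in this sketch.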
BED_FILE
The path to a BED file for RSeQC, also built from the `GTF_FILE`, which corresponds to your reference genome. This can be made using the following:
# Change directories into your `genome` directory
cd genome
# Download the BED file
wget https://sourceforge.net/projects/rseqc/files/BED/Human_Homo_sapiens/hg38_GENCODE.v38.bed.gz/download
# Change the name to a readable name
mv download hg38_GENCODE.v38.bed.gz
# Unzip the file
gunzip hg38_GENCODE.v38.bed.gz
# Remove quotation marks from exon positions
sed -i 's/"//g' hg38_GENCODE.v38.bed
# Remove "chr" from chromosome indices
sed -i 's/^chr//' hg38_GENCODE.v38.bed
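The `chr` prefix is removed because Ensembl annotations name chromosomes `1`, `2`, and so on, while the GENCODE BED file uses `chr1`, `chr2`, and so on. As a hedged sketch (the helper name `bed_chroms_missing_from_gtf` is hypothetical and not part of the workflow), the following lists chromosome names that appear in the BED file but not in the GTF, which can reveal leftover naming mismatches:

```shell
# Hypothetical helper: print each chromosome name found in the BED file
# ($1) that never appears in the first column of the GTF file ($2).
bed_chroms_missing_from_gtf() {
    bed="$1"
    gtf="$2"
    awk -F'\t' '
        # First file (GTF): remember every chromosome name, skipping comment lines
        FNR == NR { if ($0 !~ /^#/) seen[$1] = 1; next }
        # Second file (BED): print each unseen chromosome name once
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$gtf" "$bed"
}
```

An empty result means every chromosome name in the BED file also occurs in the GTF. The GTF must be uncompressed for this check.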
PERFORM_TRIM
Should trimming of reads be performed?
`True` or `False`
PERFORM_SCREEN
Screen against genomes of common contaminants?
`True` or `False`
The current contaminants screened against are:
- Arabidopsis
- Drosophila
- Escherichia coli
- Lambda Phage
- Mitochondria
- Mouse
- PhiX Bacteriophage
- Brown Rat
- rRNA (specifically, GRCm38 rRNA)
- Vectors
- Caenorhabditis elegans (worm)
- Saccharomyces cerevisiae (yeast)
PERFORM_GET_RNASEQ_METRICS
Use Picard’s CollectRnaSeqMetrics?
`True` or `False`
This requires `REF_FLAT_FILE` and `RRNA_INTERVAL_LIST` to be set.
PERFORM_PREFETCH
If you only have SRR codes (from `MASTER_CONTROL`), this option will download the corresponding `sra` files from NCBI.
`True` or `False`
PERFORM_GET_INSERT_SIZE
Get the insert size using Picard?
`True` or `False`
GET_FRAGMENT_SIZE
Get fragment sizes with RSeQC?
`True` or `False`
GENOME_SAVE_DIR
The path to the directory where genome output should be saved. This should be under your `/work` folder, as large files will be created.
GENOME_FASTA_FILE
This is the input genome FASTA file, which should have been downloaded previously.
It is most likely located under your `/work` folder.
It can be downloaded using the following:
# Change directories into your `genome` directory
cd genome
# Download the assembly
# To set the release number, set the following variable
assembly_release=105
wget ftp://ftp.ensembl.org/pub/release-${assembly_release}/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# Unzip the file
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
GTF_FILE
This is the input GTF genome file, which should also have been downloaded previously.
It is most likely located under your `/work` folder.
The human genome annotation can be downloaded with:
# Change directories into your `genome` directory
cd genome
# Download the annotations
# To set releases, modify the following variable to the release number
annotation_release=105
wget ftp://ftp.ensembl.org/pub/release-${annotation_release}/gtf/homo_sapiens/Homo_sapiens.GRCh38.${annotation_release}.gtf.gz
# Unzip the file
gunzip Homo_sapiens.GRCh38.${annotation_release}.gtf.gz
You do not need to make any further changes. Snakemake will extract the configuration values you have set up.
Once these steps are complete, the workflow should be prepared to execute.
Continue to the next page to execute the workflow.