This is an overview of how to set up the `genome` folder

Overview

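This page describes how to set up the `genome` folder: the genome FASTA file, the GTF annotations, the refFlat file, the BED file, and the rRNA interval list.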

Variable Definition

# Ensembl release number and GRCh assembly version used by the commands below
assembly_release=109
GRCh_version=38
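As a quick check of how these two variables feed into the rest of the page, the sketch below expands them into the file names used in the later sections. The helper variable names (`fasta_name`, `gtf_name`, `refflat_name`) are hypothetical and exist only for this illustration.

```shell
assembly_release=109
GRCh_version=38

# Hypothetical helper variables: the unzipped file names produced by later steps
fasta_name="Homo_sapiens.GRCh${GRCh_version}.dna.primary_assembly.fa"
gtf_name="Homo_sapiens.GRCh${GRCh_version}.${assembly_release}.gtf"
refflat_name="refFlat_GRCh${GRCh_version}.${assembly_release}.txt"

# Print the names so they can be compared with the files in `genome`
printf '%s\n' "$fasta_name" "$gtf_name" "$refflat_name"
```

If you change either variable, rerun the download steps so the file names on disk stay in sync.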

Genome FASTA File

# Change into the pipeline directory, then create and enter the `genome` directory
cd FastqToGeneCounts

mkdir -p genome
cd genome

# Download the assembly
wget ftp://ftp.ensembl.org/pub/release-${assembly_release}/fasta/homo_sapiens/dna/Homo_sapiens.GRCh${GRCh_version}.dna.primary_assembly.fa.gz

# Unzip the file
gunzip Homo_sapiens.GRCh${GRCh_version}.dna.primary_assembly.fa.gz

GTF File

# Execute this in the `genome` directory!

# Download the annotations
# To use a different release, change `assembly_release` in the Variable Definition section
wget ftp://ftp.ensembl.org/pub/release-${assembly_release}/gtf/homo_sapiens/Homo_sapiens.GRCh${GRCh_version}.${assembly_release}.gtf.gz
gunzip Homo_sapiens.GRCh${GRCh_version}.${assembly_release}.gtf.gz
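Once both the FASTA and the GTF are in place, it can be worth confirming that they name their sequences the same way (for example `1` rather than `chr1`). The sketch below shows one possible check on two tiny stand-in files whose contents are made up; `toy.fa` and `toy.gtf` are hypothetical names, and you would point the commands at your real files instead.

```shell
tmpdir=$(mktemp -d)

# Tiny stand-in files (made-up content) so the sketch runs without any downloads
printf '>1 dna:chromosome\nACGT\n>MT dna:chromosome\nACGT\n' > "$tmpdir/toy.fa"
printf '#!genome-build GRCh38\n1\tensembl\tgene\t1\t4\t.\t+\t.\tgene_id "G1";\nMT\tensembl\tgene\t1\t4\t.\t+\t.\tgene_id "G2";\n' > "$tmpdir/toy.gtf"

# FASTA sequence names: header lines start with '>', the name ends at the first space
grep '^>' "$tmpdir/toy.fa" | sed -e 's/^>//' -e 's/ .*//' | sort -u > "$tmpdir/fa_names.txt"

# GTF sequence names: first column of every non-comment line
grep -v '^#' "$tmpdir/toy.gtf" | cut -f 1 | sort -u > "$tmpdir/gtf_names.txt"

# Names that appear in the GTF but not in the FASTA (empty when the naming agrees)
missing=$(comm -13 "$tmpdir/fa_names.txt" "$tmpdir/gtf_names.txt")
printf 'GTF sequence names missing from FASTA: [%s]\n' "$missing"

rm -r "$tmpdir"
```

An empty list means every sequence name in the GTF also exists in the FASTA.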

Ref Flat File

# Execute this in the `genome` directory!

# Download the gtf to refFlat converter
wget https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred

# Add execution permissions
chmod +x ./gtfToGenePred

# Execute the gtf to refFlat converter
# The second-to-last argument is the unzipped GTF file from the previous section
# The last argument, `refFlat.tmp.txt`, is the output filename
./gtfToGenePred -genePredExt -geneNameAsName2 Homo_sapiens.GRCh${GRCh_version}.${assembly_release}.gtf refFlat.tmp.txt

# Modify values so Picard is able to parse the refFlat file correctly
# `refFlat.tmp.txt` is the output of the previous command
# `refFlat_GRCh${GRCh_version}.${assembly_release}.txt` (for example, `refFlat_GRCh38.109.txt`) is the final output filename
paste <(cut -f 12 refFlat.tmp.txt) <(cut -f 1-10 refFlat.tmp.txt) > refFlat_GRCh${GRCh_version}.${assembly_release}.txt

# Remove the temporary refFlat file
rm refFlat.tmp.txt
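To see what the `paste`/`cut` step does, the sketch below applies the same column reordering to a single made-up genePredExt record: column 12 (the gene name, because of `-geneNameAsName2`) moves to the front, followed by columns 1-10. The record's values and the temporary file names are hypothetical. Intermediate files are used here instead of `<( ... )` so the sketch also runs in shells without process substitution.

```shell
tmpdir=$(mktemp -d)

# One hypothetical genePredExt record with 15 tab-separated columns (values are made up)
printf 'ENST0001\t1\t+\t100\t900\t150\t850\t2\t100,500,\t300,900,\t0\tGENE1\tcmpl\tcmpl\t0,1,\n' > "$tmpdir/genepred.txt"

# Column 12 holds the gene name; columns 1-10 hold the transcript coordinates
cut -f 12 "$tmpdir/genepred.txt" > "$tmpdir/gene_name.txt"
cut -f 1-10 "$tmpdir/genepred.txt" > "$tmpdir/first_ten.txt"

# Gene name first, then the ten genePred columns: 11 columns in total
paste "$tmpdir/gene_name.txt" "$tmpdir/first_ten.txt" > "$tmpdir/refFlat.txt"

refflat_line=$(cat "$tmpdir/refFlat.txt")
field_count=$(awk -F '\t' '{ print NF }' "$tmpdir/refFlat.txt")
printf '%s\n' "$refflat_line"
printf 'columns: %s\n' "$field_count"

rm -r "$tmpdir"
```

The resulting line starts with the gene name and has 11 columns, which is the layout Picard expects from a refFlat file.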

BED File

# Execute this in the `genome` directory!

# Download the BED file from the RSeQC SourceForge project and save it under its proper file name
wget -O hg38_GENCODE.v${GRCh_version}.bed.gz https://sourceforge.net/projects/rseqc/files/BED/Human_Homo_sapiens/hg38_GENCODE.v${GRCh_version}.bed.gz/download

# Unzip the file
gunzip hg38_GENCODE.v${GRCh_version}.bed.gz

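# Remove (1) quotation marks from exon positions and (2) the "chr" prefix from chromosome names
# The `^` anchor restricts the second substitution to the start of each line (the chromosome column)
sed -i 's/"//g' hg38_GENCODE.v${GRCh_version}.bed
sed -i 's/^chr//' hg38_GENCODE.v${GRCh_version}.bed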
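The effect of the two substitutions can be previewed on a single line before editing the real file in place. The BED line below is made up for illustration. The sketch uses an anchored pattern (`^chr`), a choice made here so that only the chromosome column at the start of the line is touched.

```shell
# Hypothetical 12-column BED line (values are made up), with quoted exon fields
bed_line=$(printf 'chr1\t100\t900\tENST0001\t0\t+\t150\t850\t0\t2\t"200,400,"\t"0,400,"')

# Strip every quotation mark, then strip a leading "chr" from the chromosome name
cleaned=$(printf '%s\n' "$bed_line" | sed -e 's/"//g' -e 's/^chr//')

printf 'before: %s\n' "$bed_line"
printf 'after:  %s\n' "$cleaned"
```

After cleaning, the chromosome column reads `1`, which matches the naming used by the Ensembl FASTA and GTF files.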

rRNA Interval List

This step builds a ribosomal interval list from the GTF file. Picard's CollectRnaSeqMetrics command uses this list to quantify rRNA transcripts.

The riboInt.sh file was downloaded with the pipeline.

The paths in `riboInt.sh` should match the layout produced by these instructions. Even so, check that the values inside the script fit your own setup before running it.

To run the riboInt.sh file, perform the following:

# Change directories to the location where the pipeline was downloaded; for example:

# WARNING: Change the next line to your own installation directory!
cd /work/helikarlab/joshl/FastqToGeneCounts
sh riboInt.sh