Welcome!

This is a Snakemake workflow that runs the steps listed below, parallelizing as much of the work as possible.

Given a CSV file containing SRR codes, target output names, library layouts, and library preparation methods, the workflow will:

1. Generate genome files using STAR
2. Download each SRR code in parallel using prefetch
3. Unpack the .sra files using parallel-fastq-dump, generating .fastq.gz files
4. Optionally trim the resulting .fastq.gz files using Trim Galore
5. Run FastQC on the parallel-fastq-dump output, and optionally on the trimmed files
6. Align the parallel-fastq-dump (or trimmed) files to the generated genome files using STAR
7. Optionally get RNA-seq metrics using Picard
8. Optionally get insert sizes using Picard
9. Optionally get fragment sizes using RSeQC
10. Run MultiQC on the output of parallel-fastq-dump, FastQC, and STAR
11. Organize a MADRID_inputs file that interfaces directly with our MADRID package to aid metabolic drug discovery and repurposing
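
As a rough illustration of the input described above, the following Python sketch reads a four-column CSV into one record per sample. It is not the workflow's own parser. The column order, the absence of a header row, the field names (srr, name, layout, prep), and the example values are all assumptions made for illustration; check the workflow's own documentation for the exact format it expects.

```python
import csv
import io
import re
from dataclasses import dataclass

# Assumption: SRA run accessions look like "SRR" followed by digits.
SRR_PATTERN = re.compile(r"^SRR\d+$")


@dataclass(frozen=True)
class Sample:
    srr: str      # SRR code to download
    name: str     # target output name
    layout: str   # library layout
    prep: str     # library preparation method


def read_samples(handle):
    """Parse four-column CSV rows into Sample records, skipping blank lines."""
    samples = []
    for line_no, row in enumerate(csv.reader(handle), start=1):
        fields = [field.strip() for field in row]
        if not any(fields):
            continue
        if len(fields) != 4:
            raise ValueError(f"line {line_no}: expected 4 columns, got {len(fields)}")
        if not SRR_PATTERN.match(fields[0]):
            raise ValueError(f"line {line_no}: {fields[0]!r} is not an SRR code")
        samples.append(Sample(*fields))
    return samples


# Hypothetical rows; the accession numbers, names, and labels are made up.
EXAMPLE_CSV = """\
SRR0000001,tissueA_S1R1,PE,total
SRR0000002,tissueA_S1R2,SE,mrna
"""

samples = read_samples(io.StringIO(EXAMPLE_CSV))
for sample in samples:
    print(sample.srr, "->", sample.name, sample.layout, sample.prep)
```

Validating the file up front like this means a malformed row is reported with its line number before any download or alignment job starts.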

This pipeline is primarily designed to interface the GEO database with MADRID. It should be run on a high-performance computing cluster, because STAR has a high memory requirement (about 40 GB for the human genome). Even if you do not plan to use MADRID, this pipeline may be useful if your goal is to align FASTQ files from bulk RNA-seq, perform essential quality control, and output STAR gene count files for transcription-based model construction, such as differential gene expression analysis.
