Overview
This example identifies drug targets for Rheumatoid Arthritis using a genome-scale metabolic network (GSMN) of naïve CD4+ T-cell subtypes.
Follow the steps found at Starting the Container to start the container.
Step 1
The outputs of the STAR aligner run with the --quantMode GeneCounts option can be passed directly to COMO. Under the data/STAR_output/ folder, we provide a template structure with naïve B control cells from bulk RNA-sequencing experiments found in NCBI's Gene Expression Omnibus. Running src/rnaseq_preprocess.py with the create_count_matrix argument set to True merges the counts from each replicate in each study into a single matrix.
This follows Step 1 in the COMO container.
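The merge described in Step 1 can be pictured with a short sketch. This is not the code in src/rnaseq_preprocess.py; it is a minimal illustration that assumes each replicate is a STAR ReadsPerGene.out.tab file (tab-separated, four summary rows starting with N_, then one row per gene with the unstranded count in the second column). The function names, sample names, and gene IDs are hypothetical.

```python
import csv
import io


def read_star_counts(handle, column=1):
    """Return {gene_id: count} from one STAR ReadsPerGene.out.tab stream.

    column=1 selects the unstranded counts; 2 or 3 select the stranded ones.
    """
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and the N_unmapped / N_multimapping / ... summary rows.
        if not row or row[0].startswith("N_"):
            continue
        counts[row[0]] = int(row[column])
    return counts


def merge_counts(samples):
    """Merge {sample_name: {gene_id: count}} into a header and matrix rows.

    Genes missing from a sample are filled with 0 (an assumption of this sketch).
    """
    names = list(samples)
    genes = sorted(set().union(*(c.keys() for c in samples.values())))
    header = ["gene_id"] + names
    rows = [[g] + [samples[n].get(g, 0) for n in names] for g in genes]
    return header, rows


# Two tiny in-memory replicates standing in for files on disk (hypothetical data).
rep1 = io.StringIO(
    "N_unmapped\t10\t10\t10\nN_multimapping\t5\t5\t5\n"
    "N_noFeature\t3\t3\t3\nN_ambiguous\t1\t1\t1\n"
    "geneA\t12\t6\t6\ngeneB\t0\t0\t0\n"
)
rep2 = io.StringIO(
    "N_unmapped\t8\t8\t8\nN_multimapping\t2\t2\t2\n"
    "N_noFeature\t1\t1\t1\nN_ambiguous\t0\t0\t0\n"
    "geneA\t7\t3\t4\ngeneB\t4\t2\t2\ngeneC\t9\t5\t4\n"
)

header, rows = merge_counts(
    {"rep1": read_star_counts(rep1), "rep2": read_star_counts(rep2)}
)
for line in [header] + rows:
    print("\t".join(str(v) for v in line))
```

For real data, the StringIO objects would be replaced by opened files from the replicate folders, and the column would be chosen to match the library strandedness.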
Alternatively, you can provide a count matrix directly and run with create_count_matrix set to False to fetch only the information required for normalization.
Step 2
In the `` folder, we include the following files:
microarray_data_inputs.xlsx contains GEO accession numbers of microarray data
proteomics_data_inputs.xlsx contains sample names of proteomics data
Protein abundance is located under /ProteomicsDataMatrix_Naive.csv. Sample names for bulk RNA-sequencing data are given in /bulk_data_inputs.xlsx. Running src/rnaseq_preprocess.py will create data/results/Gene_Info_naiveB.csv and data/data_matrices/BulkRNAseqMatrix_naiveB.csv if create_count_matrix is set to True.
/microarray_data_inputs.xlsx includes microarray samples of naive CD4+ T cells from GSE22886, GSE43005, GSE22045, and GSE24634.
The file data/config_sheets/proteomics_data_inputs.xlsx contains sample names of naive CD4+ T cells, with its results found in the file data/data_matrices/naiveB/protein_abundance_naiveB.csv.
Using src/merge_xomics.py, you can specify any number of available data sources (microarray, bulk RNA-seq, and proteomics) as inputs. You can also set the expression_requirement parameter, which defines the minimum number of data sources in which a gene's expression must exceed the threshold for that gene to be considered active. The expression_requirement value decreases by one for each input data source that does not support the gene.
Running Step 1 in the COMO container will generate "gene activity" files based on transcriptomics and proteomics data, as described by Puniya et al., 2020.
This will save the final output in GeneExpression_Naive_merged.csv and its path in step1_results_files.json.
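The expression_requirement rule can be read as a vote across data sources. The sketch below is one possible reading, not the logic of src/merge_xomics.py: it assumes that "does not support the gene" means the source does not measure that gene, and that the requirement is never lowered below one. The function name is_active, the input layout, and the gene labels are hypothetical.

```python
def is_active(gene, sources, expression_requirement):
    """Decide whether a gene counts as active across several data sources.

    sources maps a source name to {gene: True/False}, where the boolean says
    whether the gene's expression is above that source's threshold. A gene
    absent from a source's dict is treated as not measured by that source.
    """
    covering = [calls for calls in sources.values() if gene in calls]
    missing = len(sources) - len(covering)
    # Lower the requirement by one per source that cannot report on the gene,
    # but keep it at one or more (an assumption of this sketch).
    required = max(1, expression_requirement - missing)
    supporting = sum(1 for calls in covering if calls[gene])
    return supporting >= required


# Hypothetical above-threshold calls for three sources.
sources = {
    "microarray": {"G1": True, "G2": False},
    "bulk_rnaseq": {"G1": True, "G2": True, "G3": True},
    "proteomics": {"G1": False},
}

calls = {g: is_active(g, sources, expression_requirement=2)
         for g in ["G1", "G2", "G3", "G4"]}
print(calls)
```

Here G2 is measured by only two of the three sources, so its requirement drops from 2 to 1 and a single supporting source is enough; G4 is measured nowhere and is never called active.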
Step 3
Our pipeline includes a modified version of the Recon3D model to use as a reference for model contextualization. The modified version of Recon3D is available at data/GeneralModelUpdatedV2.mat.
Step 4 in COMO will use GeneExpression_Naive_merged.csv (from Step 1, above) in combination with data/GeneralModelUpdatedV2.mat to construct a cell-type specific model of Naive CD4+ cells.
We can use this model in the next steps. However, we advise users to properly investigate, manually curate, and reupload the refined version in data/results to use in Step 4.
Step 4
We used a Rheumatoid Arthritis dataset (GSE56649) to identify differentially expressed genes (disease genes). We defined the accession IDs of this dataset in the inputs found under /naiveB/disease/gene_counts_matrix_arthritis_naiveB.csv.
This step will generate the files data/results/naiveB/Disease_UP_GSE56649.txt and data/results/naiveB/Disease_DOWN_GSE56649.txt, and record their paths in step2_results_files.json. Finally, this step will create a disease_files variable that holds the paths of the files for up- and down-regulated genes.
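To make the up/down split concrete, here is a minimal sketch of how differential expression results could be divided into two gene lists and written to files whose paths are then collected. It is not the pipeline's own code. The cutoffs (absolute log2 fold change of 1, adjusted p-value below 0.05), the one-ID-per-line file layout, the function names, and the example values are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def split_disease_genes(results, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Split {gene: (log2_fold_change, adjusted_p)} into up and down lists."""
    up, down = [], []
    for gene, (lfc, padj) in results.items():
        if padj >= padj_cutoff:
            continue  # not significant under the assumed cutoff
        if lfc >= lfc_cutoff:
            up.append(gene)
        elif lfc <= -lfc_cutoff:
            down.append(gene)
    return sorted(up), sorted(down)


def write_gene_lists(up, down, out_dir, dataset="GSE56649"):
    """Write one gene ID per line and return {"UP": path, "DOWN": path}."""
    out_dir = Path(out_dir)
    paths = {}
    for label, genes in (("UP", up), ("DOWN", down)):
        path = out_dir / f"Disease_{label}_{dataset}.txt"
        path.write_text("\n".join(genes) + "\n")
        paths[label] = str(path)
    return paths


# Hypothetical results keyed by placeholder gene IDs.
results = {
    "1001": (2.3, 0.001),   # significant, up
    "1002": (-1.8, 0.01),   # significant, down
    "1003": (0.2, 0.5),     # small change, not significant
    "1004": (3.0, 0.2),     # large change, not significant
}

up, down = split_disease_genes(results)
with tempfile.TemporaryDirectory() as tmp:
    disease_files = write_gene_lists(up, down, tmp)
    up_text = Path(disease_files["UP"]).read_text()
print(up, down)
```

The returned dictionary plays the role of the disease_files variable in the text: a small record of where the up- and down-regulated gene lists were written.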
Step 5
This step will use the model (constructed in Step 2 or uploaded as a curated version) to perform knock-out simulations of genes that overlap with the drug-target data file obtained from the ConnectivityMap database. We refined the drug-target data file and included it at data/RepurposingHub.txt.
This step will use the following files:
Naive_SpecificModel.json (or a pre-curated version uploaded as NaiveModel.mat)
Disease_UP_GSE56649.txt
Disease_DOWN_GSE56649.txt
RepurposingHub.txt
The final output files will include drug targets ranked by Perturbation Effect Score (PES), as described by Puniya et al., 2020.
The output file data/output/d_score.csv will contain Entrez IDs of ranked genes and their corresponding PES. The file drug_score.csv will contain PES ranked drug targets (Entrez IDs and Gene Symbols) with mapped repurposed drugs.
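The bookkeeping around the knock-out step can be sketched as follows: pick the genes that are both in the model and in the drug-target file, then rank scored genes and attach the drugs that target them. This sketch does not compute PES and does not run any simulation; the scores are placeholder numbers standing in for simulation output. It also assumes that a higher score ranks first, and the function names, tuple layout, drug names, and gene labels are hypothetical.

```python
def knockout_candidates(model_genes, drug_targets):
    """Entrez IDs present in the model and targeted by one or more drugs."""
    targeted = {entrez for _, _, entrez in drug_targets}
    return sorted(set(model_genes) & targeted)


def rank_targets(scores, drug_targets):
    """Rank scored genes and attach gene symbols and mapped drugs.

    scores: {entrez_id: score}; drug_targets: (drug, gene_symbol, entrez_id).
    Returns (entrez_id, gene_symbol, score, [drugs]) tuples, highest score
    first (an assumed ordering), with ties broken by Entrez ID.
    """
    symbols, drugs = {}, {}
    for drug, symbol, entrez in drug_targets:
        symbols[entrez] = symbol
        drugs.setdefault(entrez, []).append(drug)
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [
        (entrez, symbols.get(entrez, ""), score, sorted(drugs.get(entrez, [])))
        for entrez, score in ranked
    ]


# Hypothetical drug-target table and model gene list.
drug_targets = [
    ("drugA", "SYM1", "101"),
    ("drugB", "SYM1", "101"),
    ("drugC", "SYM2", "202"),
    ("drugD", "SYM3", "303"),
]
model_genes = ["101", "202", "404"]

candidates = knockout_candidates(model_genes, drug_targets)
# Placeholder scores; in the pipeline these come from the knock-out simulations.
scores = {"101": 0.4, "202": 0.9}
table = rank_targets(scores, drug_targets)
for row in table:
    print(row)
```

The first function corresponds to the overlap between the model and RepurposingHub.txt, and the second mirrors the shape of drug_score.csv: ranked targets with Entrez IDs, gene symbols, and mapped drugs.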