BioXpress DESeq step

Step 3 of the BioXpress pipeline.

General Flow of Scripts

run_per_study.py -> run_per_tissue.py -> run_per_case.py
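
The flow above can be chained from a small driver. The following is a minimal sketch, assuming the three shell wrappers named in the Procedure section (run_per_study.sh, run_per_tissue.sh, run_per_case.sh) sit in the current working directory. The function name run_steps and its runner argument are hypothetical helpers for illustration and are not part of the pipeline.

```python
import subprocess

# Shell wrappers named in the Procedure section, in pipeline order.
STEPS = ["run_per_study.sh", "run_per_tissue.sh", "run_per_case.sh"]


def run_steps(steps=STEPS, runner=subprocess.run):
    """Run each step in order and stop at the first failure.

    `runner` is injectable so the function can be exercised without
    launching the real scripts (hypothetical helper, not part of BioXpress).
    """
    completed = []
    for script in steps:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps do not start after a failed one.
        runner(["sh", script], check=True)
        completed.append(script)
    return completed
```

Calling run_steps() with no arguments launches the three scripts one after another; given the run times noted below, expect this to take several hours.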

Procedure

DESeq Step 1: Run the script run_per_study.sh

Summary

The Python script run_per_study.py provides arguments to the R script deseq.R. The count and category files generated from the Annotation step are used to calculate differential expression and statistical significance. The result is a series of files per study, including the normalized reads (DESeq normalization method), the DE results and significance, and QC files such as the PCA plot.

Note: this step is time-consuming (about 2-3 hours of run time)
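
To illustrate how a Python wrapper can hand arguments to an R script, here is a minimal sketch. It is not the actual contents of run_per_study.py: the function names and the argument order (input folder, output folder, study name) are assumptions made for illustration, so check deseq.R for the arguments it really reads.

```python
import subprocess


def build_deseq_command(in_dir, out_dir, name, r_script="deseq.R"):
    """Return the argument list for one deseq.R invocation.

    The argument order (input dir, output dir, study name) is an
    assumption for illustration only.
    """
    return ["Rscript", r_script, in_dir, out_dir, name]


def run_deseq(in_dir, out_dir, name):
    """Launch deseq.R for one study and fail loudly if R exits with an error."""
    cmd = build_deseq_command(in_dir, out_dir, name)
    # check=True turns a non-zero exit status from Rscript into an exception.
    subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems when folder names contain spaces.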

Method

Edit the hard-coded paths in the script run_per_study.py

Set in_dir to the folder containing the final per-study output files of the Annotation step

Specify the out_dir

Ensure that the file list_files/studies.csv contains all of the studies you wish to process - Note: the studies can be run separately (if 2-3 hours cannot be dedicated to running all of the studies at once) by creating separate list files with specific studies to run

Run the shell script: sh run_per_study.sh

Note: the R libraries loaded in deseq.R must be installed before running on a new server or system, as the scripts do not install them

Output

A set of files:

log file

deSeq_reads_normalized.csv - Normalized read counts (DESeq normalization method applied)

results_significance.csv - log2fc differential expression results and statistical significance (t-test)

dispersion.png

distance_heatmap.png

pca.png - Principal component analysis plot, important for observing how well the Primary Tumor and Solid Tissue Normal group together
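
After a run, it can help to confirm that every expected file was written. The sketch below assumes that all outputs for one study land in a single folder; the function name missing_outputs is hypothetical, and the log file is left out because its name is not given above.

```python
from pathlib import Path

# File names taken from the Output list above; the log file is omitted
# because its name is not specified.
EXPECTED_OUTPUTS = [
    "deSeq_reads_normalized.csv",
    "results_significance.csv",
    "dispersion.png",
    "distance_heatmap.png",
    "pca.png",
]


def missing_outputs(result_dir, expected=EXPECTED_OUTPUTS):
    """Return the expected file names that are absent from result_dir."""
    result_dir = Path(result_dir)
    return [name for name in expected if not (result_dir / name).is_file()]
```

An empty list means all five files are present; any names returned point to a run that likely stopped early, in which case the log file is the first place to look.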

DESeq Step 2: Run the script run_per_tissue.sh

Summary

The Python script run_per_tissue.py provides arguments to the R script deseq.R. The count and category files generated from the Annotation step are used to calculate differential expression and statistical significance. The result is a series of files per tissue, including the normalized reads (DESeq normalization method), the DE results and significance, and QC files such as the PCA plot.

Note: this step is time-consuming (about 2-3 hours of run time)

Method

Edit the hard-coded paths in the script run_per_tissue.py

Set in_dir to the folder containing the final per-tissue output files of the Annotation step

Specify the out_dir

Ensure that the file list_files/tissue.dat contains all of the tissues you wish to process - Note: the tissues can be run separately (if 2-3 hours cannot be dedicated to running all of the tissues at once) by creating separate dat files with specific tissues to run

Run the shell script: sh run_per_tissue.sh
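
The note about running tissues separately can be sketched as a small helper that splits a list file into batches. This assumes the list file holds one tissue per line, which is not confirmed here; the function name split_list_file and the batch file names are hypothetical.

```python
from pathlib import Path


def split_list_file(list_path, batch_size, out_dir):
    """Write the non-empty lines of list_path into batch files.

    Assumes one tissue per line. Each batch file holds at most
    batch_size entries; the file naming scheme is hypothetical.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    lines = Path(list_path).read_text().splitlines()
    names = [line.strip() for line in lines if line.strip()]
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(names), batch_size):
        batch_number = start // batch_size + 1
        batch_path = out_dir / f"tissue_batch_{batch_number}.dat"
        batch_path.write_text("\n".join(names[start:start + batch_size]) + "\n")
        written.append(batch_path)
    return written
```

Each batch file can then be used in place of list_files/tissue.dat for one run; how run_per_tissue.py locates its list file is not shown here, so adjust the path in the script accordingly.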

Output

A set of files:

log file

deSeq_reads_normalized.csv - Normalized read counts (DESeq normalization method applied)

results_significance.csv - log2fc differential expression results and statistical significance (t-test)

dispersion.png

distance_heatmap.png

pca.png - Principal component analysis plot, important for observing how well the Primary Tumor and Solid Tissue Normal group together

DESeq Step 3: Run the script run_per_case.sh

Summary

The Python script run_per_case.py provides arguments to the R script deseq.R. The count and category files generated from the Annotation step are used to calculate differential expression and statistical significance. The result is a series of files per case, including the normalized reads (DESeq normalization method), the DE results and significance, and QC files such as the PCA plot.

Note: this step is time-consuming (about 2-3 hours of run time)

Method

Edit the hard-coded paths in the script run_per_case.py

Set in_dir to the folder containing the final per-case output files of the Annotation step

Specify the out_dir

Ensure that the file list_files/cases.csv contains all of the cases you wish to process - Note: the cases can be run separately (if 2-3 hours cannot be dedicated to running all of the cases at once) by creating separate list files with specific cases to run

Run the shell script: sh run_per_case.sh

Output

A set of files:

log file

deSeq_reads_normalized.csv - Normalized read counts (DESeq normalization method applied)

results_significance.csv - log2fc differential expression results and statistical significance (t-test)

dispersion.png

distance_heatmap.png

pca.png - Principal component analysis plot, important for observing how well the Primary Tumor and Solid Tissue Normal group together
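
As an example of working with results_significance.csv from any of the steps above, the sketch below keeps rows that pass a fold-change and significance cutoff. The column names log2FoldChange and padj, and the default thresholds, are assumptions for illustration; check the header of your own file and pass the real column names.

```python
import csv


def significant_rows(csv_path, lfc_col="log2FoldChange", p_col="padj",
                     min_abs_lfc=1.0, max_p=0.05):
    """Return rows whose |log2fc| and p-value pass the given cutoffs.

    Column names and thresholds are assumptions, not taken from deseq.R.
    """
    selected = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            try:
                lfc = float(row[lfc_col])
                p_value = float(row[p_col])
            except (ValueError, TypeError):
                # Skip rows where a value is missing or not numeric (e.g. "NA").
                continue
            if abs(lfc) >= min_abs_lfc and p_value <= max_p:
                selected.append(row)
    return selected
```

Each returned row is a dictionary keyed by the CSV header, so the other columns remain available for downstream use.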