Major Changes from v-4.0

Major updates to the BioXpress from the previous version (v-4.0)

Tumor samples added for each tissue

Tissue

TCGA Studies

New Samples

Bladder

BLCA

126

Breast

BRCA

159

Colorectal

COAD/READ

159 (141/18)

Esophageal

ESCA

25

Head and Neck

HNSC

118

Kidney

KICH/KIRP/KIRC

289(15/82/192)

Liver

LIHC

169

Lung

LUAD/LUSC

264 (174/90)

Prostate

PRAD

116

Stomach

STAD

22

Thyroid

THCA

176

Uterine

UCEC

216

Mapping files updated to reflect most recent mapping of DOIDs to UBERON IDs.

The following is a list of the current cancer tissue (DOID) to healthy tissue (UBERON ID) mapping:

DO Name (DOID)

UBERON Name (UBERON ID)

Stomach Cancer (DOID:10534)

Stomach (UBERON:0000945)

Thyroid Cancer (DOID:1781)

Thyroid Gland (UBERON:0002046)

Esophageal Cancer (DOID:5041)

Esophagus (UBERON:0001043)

Kidney Cancer (DOID:263)

Adult Mammalian Kidney

(UBERON:0000082)

Lung Cancer (DOID:1324)

Lung (UBERON:0002048)

Uterine Cancer (DOID:363)

Uterine Cervix (UBERON:0000002)

Bladder Cancer (DOID:11054)

Urinary Bladder (UBERON:0001255)

Prostate Cancer (DOID:10283)

Prostate Gland (UBERON:0002367)

Colorectal Cancer (DOID:9256)

Colon (UBERON:0001155)

Rectum (UBERON:0001052)

Liver Cancer (DOID:3571)

Liver (UBERON:0002107)

Breast Cancer (DOID:1612)

Thoracic Mammary Gland (UBERON:0005200)

Head and Neck Cancer (DOID:11934)

Oral Cavity (UBERON:0000167)

Automatic alphabetical re-ordering of count matrices for DESeq2

Due to the added samples in v-5.0, the ordering of samples in the count matrices needed for DESeq2 was disrupted and DESeq2 was producing randomized results. Column and row names in count matrices are now re-ordered as part of the DESeq.R script, so that samples are aligned correctly. This re-ordering should account for instances of added samples in future versions.

Issue Running DESEq per case

The step for DESeq per case was performed, however the results were not used to calculate subjects up/down/total in the publisher step, as was the case in v-4.0. Also, a final publisher file per case was not generated.

The run_per_case.py script performs DESeq analysis using both the tumor and normal count files per case. For most cases, there is only one tumor counts file and one normal counts file. DESeq encounters an error when running analysis with a sample size of 1 per group:

Error in checkForExperimentalReplicates(object, modelMatrix):`

The design matrix has the same number of samples and coefficients to fit, so estimation of dispersion is not possible. Treating samples as replicates was deprecated in v1.20 and no longer supported since v1.22.

The DESeq2 vignette also mentions DESeq analysis with no replicates in their FAQ:

Can I use DESeq2 to analyze a dataset without replicates? No. This analysis is not possible in DESeq2.

This is likely sue to the read count normalization model used by DESeq. DESeq’s model contains a variable called the dispersion estimate, which relies on the variance of the one sample’s read counts for a gene to the mean read count for that gene across the whole group (condition). If there are no other replicates on the group then there is no comparison to be made and no normalization can occur.

Even for cases that have only 2-3 replicates, the significance of the DE analysis should be heavily scrutinized as such a low replicate number is not a standard statistical practice, because low sample sizes may lead to an increase in false positive and false negatives.