In this notebook, we explore the transcriptomic and clinical data from the TCGA Prostate Adenocarcinoma (PRAD) cohort. The goal is to perform data exploration and apply unsupervised machine learning methods such as hierarchical clustering, k-means, and dimensionality reduction techniques (PCA and t-SNE).
We use three datasets located in the data/
folder:
feno.RDS
: Clinical and phenotypic data.exp.RDS
: Complete gene expression matrix
(RNA-Seq).exp_top.RDS
: A subset of the expression matrix
including the top most variable genes, useful for clustering and
visualization.library(ggplot2)
library(tidyverse)
library(pheatmap)
library(broom)
library(skimr)
library(gtsummary)
library(ggfortify)
library(Rtsne)
Warning: package ‘Rtsne’ was built under R version 4.3.3
feno <- readRDS("data/feno.RDS")
exp <- readRDS("data/exp.RDS")
exp_top <- readRDS("data/exp_top.RDS")
dim(exp)
[1] 20530 549
dim(exp_top)
[1] 549 100
We can also take a glimpse of the data structures:
glimpse(feno)
Rows: 549
Columns: 16
$ sample_type <fct> Primary Tumor, Primary Tumor, Primary Tumor, Primary Tumor,…
$ age <int> 51, 57, 47, 52, 70, 54, 69, 57, 57, 56, 64, 73, 72, 65, 57,…
$ initial_weight <int> 140, 110, 140, 140, 160, 140, 150, 150, 80, 140, 380, NA, N…
$ histological_type <chr> "Prostate Adenocarcinoma Acinar Type", "Prostate Adenocarci…
$ ISUP_group <fct> 1, 1, 5, 1, 4, 3, 5, 2, 1, 1, 4, 3, 1, 2, 2, 5, 2, 3, 2, 2,…
$ clinical_T_simple <fct> NA, T1, T2, T2, T2, T1, T2, T1, T1, T1, T2, NA, NA, NA, NA,…
$ pathologic_T_simple <fct> T2, T3, T4, T2, T3, T3, T3, T3, T2, T3, T3, T2, NA, T3, T2,…
$ laterality <chr> "Left", "Bilateral", "Bilateral", "Right", "Bilateral", "Le…
$ PFI <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
$ PFI.time <int> 621, 1701, 1373, 671, 1378, 112, 198, 1364, 1272, 615, 350,…
$ PSA <dbl> 2.7, 5.0, 44.9, 5.6, 2.4, 7.0, 7.1, 11.6, 4.6, 8.4, 8.4, 14…
$ targeted_molecular_therapy <fct> NO, NO, YES, NO, NO, NA, NO, NO, NO, NO, YES, NA, NO, NO, N…
$ radiation_therapy <fct> NO, NO, YES, NO, NO, NA, NO, NO, NO, NO, NO, NA, NO, NO, NA…
$ is_ffpe <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
$ initial_pathologic_diagnosis_method <chr> "Core needle biopsy", "Core needle biopsy", "Core needle bi…
$ progression <lgl> NA, FALSE, FALSE, NA, FALSE, NA, TRUE, FALSE, FALSE, NA, NA…
We will:
exp <- t(exp)
exp <- exp[, colSums(exp != 0) > 0]
# Keep only primary tumor samples
feno <- feno %>% filter(sample_type == "Primary Tumor")
exp <- exp[rownames(feno), ]
exp_top <- exp_top[rownames(feno), ]
# Set ISUP group as ordered factor
feno$ISUP_group <- factor(feno$ISUP_group, ordered = TRUE)
feno <- droplevels(feno)
# Add numeric version of ISUP group
feno$ISUP_group_num <- as.numeric(feno$ISUP_group)
We summarize the clinical characteristics using
gtsummary
.
feno %>%
tbl_summary(
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 2
)
Characteristic | N = 4971 |
---|---|
sample_type | |
    Primary Tumor | 497 (100%) |
age | 61.02 (6.82) |
initial_weight | 243.13 (293.09) |
    Unknown | 227 |
histological_type | |
    Prostate Adenocarcinoma Acinar Type | 482 (97%) |
    Prostate Adenocarcinoma, Other Subtype | 15 (3.0%) |
ISUP_group | |
    1 | 45 (9.1%) |
    2 | 146 (29%) |
    3 | 101 (20%) |
    4 | 64 (13%) |
    5 | 141 (28%) |
clinical_T_simple | |
    T1 | 177 (44%) |
    T2 | 173 (43%) |
    T3 | 53 (13%) |
    T4 | 2 (0.5%) |
    Unknown | 92 |
pathologic_T_simple | |
    T2 | 187 (38%) |
    T3 | 293 (60%) |
    T4 | 10 (2.0%) |
    Unknown | 7 |
laterality | |
    Bilateral | 432 (88%) |
    Left | 19 (3.9%) |
    Right | 38 (7.8%) |
    Unknown | 8 |
PFI | 93 (19%) |
PFI.time | 964.55 (756.34) |
PSA | 11.05 (12.37) |
    Unknown | 15 |
targeted_molecular_therapy | 52 (12%) |
    Unknown | 50 |
radiation_therapy | 59 (13%) |
    Unknown | 49 |
is_ffpe | |
    NO | 440 (100%) |
    Unknown | 57 |
initial_pathologic_diagnosis_method | |
    Core needle biopsy | 477 (97%) |
    Transurethral resection (TURBT) | 15 (3.0%) |
    Unknown | 5 |
progression | 72 (30%) |
    Unknown | 254 |
ISUP_group_num | |
    1 | 45 (9.1%) |
    2 | 146 (29%) |
    3 | 101 (20%) |
    4 | 64 (13%) |
    5 | 141 (28%) |
1 n (%); Mean (SD) |
We use the top variable genes (exp_top
) to reduce
dimensionality for visualization.
pheatmap(t(exp_top), show_rownames = FALSE, show_colnames = FALSE, scale = "row")
Now we add clinical annotations to the heatmap.
annotations <- data.frame(feno %>%
select(ISUP_group,
ISUP_group_num,
histological_type,
clinical_T_simple,
pathologic_T_simple,
initial_pathologic_diagnosis_method))
pheatmap(t(exp_top),
scale = "row",
annotation = annotations,
show_rownames = FALSE,
show_colnames = FALSE)
We apply PCA on the full expression matrix and visualize the principal components.
exp <- exp[, colSums(exp != 0) > 0]
pca <- prcomp(exp, scale. = TRUE)
autoplot(pca)
We can also color the PCA plot by clinical covariates:
autoplot(pca, data = feno, colour = "ISUP_group")
autoplot(pca, data = feno, colour = "age")
We apply k-means clustering using 5 clusters on the
exp_top
matrix.
set.seed(123)
km <- kmeans(exp_top, centers = 5, nstart = 25)
feno$cluster_kmeans <- factor(km$cluster)
# Visualize clusters in PCA space
autoplot(pca, data = feno, colour = "cluster_kmeans")
We use t-SNE to project samples into two dimensions and visualize the separation between clusters and clinical groups.
set.seed(123)
tsne <- Rtsne(exp_top, perplexity = 30)
plot(tsne$Y, col = km$cluster, pch = 19, main = "t-SNE + K-means Clusters")
plot(tsne$Y, col = feno$ISUP_group, pch = 19, main = "t-SNE Colored by ISUP Group")
Finally, we test whether the identified clusters differ in clinical features. Here, we compare PSA levels across k-means clusters.
boxplot(feno$PSA ~ feno$cluster_kmeans,
xlab = "K-means Cluster",
ylab = "Preoperative PSA",
main = "PSA Levels by Cluster")