Final project for the course “22100 - R for Bio Data Science”. The contributors to the project are Mikkel Spallou Eriksen, Rut Mas de Les Valls, Anna Oliver Almirall, Paul Jeremy Simon and Elisa Catafal Tardos.
Original paper : Yeoh EJ, Ross ME, Shurtleff SA, et al. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell. 2002;1(2):133-143. doi:10.1016/s1535-6108(02)00032-6
The dataset has been downloaded from
library(stjudem)
It has been tidied to reach a clean dataset similar to the one given here.
All R scripts are available in the folder R/
1. 01_load loads the dataset and writes it into tsv.gz files
2. 02_clean filters cancer types and convert probe names into gene
names
3. 03_pre_analysis selects the top40 differetially expressed gene for
each cancer subtype
4. 04_plot performs several kind of statistical descriptive plots
5. 05_plots_heatmap aims at reproducing heatmaps from the paper
6. 06_kmeans performs a k means clustering on raw dataset and top40
genes dataset
7. 07_pca performs a PCA on raw dataset and top40 genes dataset
8. 08_regression.R performs a simple multinomial regression
A shiny app with plots has been created and is available here.
The final presentation can be found under doc/presentation.html