A multi-modal data resource for investigating topographic heterogeneity in patient-derived xenograft tumors
This repository contains the code for the paper "A multi-modal data resource for investigating topographic heterogeneity in patient-derived xenograft tumors" by Rajaram, Roth et al., published in Scientific Data. The deposited code allows a user to generate the figures in the paper, assuming they have downloaded the associated data from https://ncihub.org/groups/nci_physci/wiki/PSON0010 and the intermediate results from figshare. Instructions for generating the intermediate results (which will likely require a cluster) are also provided.
Software Note: The code here is written almost entirely in MATLAB (except for a single R function). It was tested on MATLAB 2019b on Linux (Ubuntu 18.04). Because the code is essentially an analysis of data, paths must be set up appropriately so that it can access the data. Attempts have been made to keep the code OS-agnostic, but special care must be taken with paths on non-Unix operating systems.
i. Download the data from https://ncihub.org/groups/nci_physci/wiki/PSON0010. You can confirm the fidelity of the downloads by comparing them to the MD5 sums listed in ftp://caftpd.nci.nih.gov/psondcc/casix/Manifest.txt
ii. Unzip every zip file so that it produces a directory (in the root directory) with the same name as the zip file.
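As one way to carry out the checksum comparison from within MATLAB, the sketch below computes the MD5 sum of a file through the Java runtime bundled with MATLAB. It is a minimal illustration, not part of this repository: the stand-in file and the variable names (`filePath`, `md5hex`) are assumptions, and the layout of `Manifest.txt` is not assumed, so the final comparison is left to the reader.

```matlab
% Stand-in file so that the sketch runs anywhere. In practice, set filePath
% to one of the downloaded zip files instead.
filePath = [tempname '.txt'];
fid = fopen(filePath, 'w');
fwrite(fid, 'abc');
fclose(fid);

% Feed the file to the digest in chunks so that large archives need not
% fit in memory all at once.
md  = java.security.MessageDigest.getInstance('MD5');
fid = fopen(filePath, 'r');
while true
    chunk = fread(fid, 2^24, '*uint8');   % up to 16 MiB per read
    if isempty(chunk), break; end
    md.update(chunk);
end
fclose(fid);

% Java returns signed bytes; reinterpret them as uint8 before hex encoding.
digest = typecast(md.digest(), 'uint8');
md5hex = lower(reshape(dec2hex(digest, 2)', 1, []));
fprintf('%s  %s\n', md5hex, filePath);
```

The printed value can then be compared by eye (or with `strcmpi`) against the corresponding entry in the manifest.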
Because of the large data set size, calculations are highly time-consuming and many need to be performed on a cluster. Consequently, the figures generated by the code here are based on intermediate result files, which are available for download. Instructions on generating the intermediate data from the raw data are provided at the bottom of this document.
i. Download the intermediate data from figshare
ii. Download every dataset individually. Unzip all the zip files into folders bearing the same name as the zip file in the root directory.
Special note: As the PhenoRipper files were too large to upload as a single file, they are split into multiple zip files. Please create a folder called `PR_Results` and copy the contents of each `PR_XX.zip` into this folder. The final expected directory structure (and md5sums) are listed in the "Online Resource" called Final Directory Structure in the figshare project.
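The PhenoRipper step can be scripted along the following lines. This is a hedged sketch rather than a tested recipe: it assumes the `PR_XX.zip` archives were downloaded directly into the intermediate-results root, that the pattern `PR_*.zip` matches only those archives, and that each archive holds its files at the top level (if an archive contains its own top-level folder, the contents would need to be moved up afterwards). The path is the example value used elsewhere in this document.

```matlab
% Example location of the intermediate results (adjust to your system).
rootInfoDir = '/home/myUserName/Data/RajaramEtAlIntermediateResults/';
prDir = fullfile(rootInfoDir, 'PR_Results');

% Collect the split PhenoRipper archives; dir returns an empty struct
% array if nothing matches, so the loop below is simply skipped.
parts = dir(fullfile(rootInfoDir, 'PR_*.zip'));

if ~isempty(parts) && ~isfolder(prDir)
    mkdir(prDir);
end

% Extract every archive into the same PR_Results folder.
for k = 1:numel(parts)
    unzip(fullfile(parts(k).folder, parts(k).name), prDir);
end
fprintf('Extracted %d PhenoRipper archive(s) into %s\n', numel(parts), prDir);
```

After running it, the result should still be checked against the Final Directory Structure resource mentioned above.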
i. Clone this repository
ii. In the file `GetParams.m` (present in the top-level folder of the code) change:
a) Line number 5 to point to the root directory of the location where you have saved the primary data:
rootDir='/home/myUserName/Data/RajaramEtAlData/'
b) Line number 6 to point to the location where you have saved the intermediate results:
rootInfoDir='/home/myUserName/Data/RajaramEtAlIntermediateResults/'
c) Line number 7 to point to a location where you would like to save figure files (please ensure you have write permission in this folder):
figSaveDir='/home/myUserName/Figures/RajaramEtAlFigures/'
iii. Fire up MATLAB and navigate to the base directory where you have saved this repository
iv. In the MATLAB command line, type the script name without the `.m` extension:
GenerateFigures
This should sequentially generate all the figures for the paper. Note that the code for individual figures is available as scripts in the `Figures` folder, and these can be run individually as indicated in `GenerateFigures.m`.
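Since most failures at this stage are likely to come from path problems, it can help to check the three directories before generating figures. The following is a minimal sketch that is not part of the repository; the paths are the example values from the steps above, and the probe file name `write_probe.tmp` is an arbitrary choice.

```matlab
% Example values, as they would be entered in GetParams.m.
rootDir     = '/home/myUserName/Data/RajaramEtAlData/';
rootInfoDir = '/home/myUserName/Data/RajaramEtAlIntermediateResults/';
figSaveDir  = '/home/myUserName/Figures/RajaramEtAlFigures/';

names = {'rootDir', 'rootInfoDir', 'figSaveDir'};
dirs  = {rootDir, rootInfoDir, figSaveDir};

% Report which of the configured folders actually exist.
found = cellfun(@isfolder, dirs);
for k = 1:numel(dirs)
    if found(k), status = 'found'; else, status = 'MISSING'; end
    fprintf('%-12s %-8s %s\n', names{k}, status, dirs{k});
end

% Probe write permission by creating and deleting a throwaway file.
% fopen returns -1 (without raising an error) if the file cannot be created.
probe = fullfile(figSaveDir, 'write_probe.tmp');
fid = fopen(probe, 'w');
canWrite = fid ~= -1;
if canWrite
    fclose(fid);
    delete(probe);
end
fprintf('figSaveDir writable: %d\n', canWrite);
```

If any folder is reported as missing, or the figure folder is not writable, correct the corresponding line in `GetParams.m` before running the figure scripts.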
As noted above, the large data set size necessitates performing calculations on a cluster to generate the intermediate result files, which can be downloaded from figshare. In addition to the results of these time-consuming calculations, this download contains several convenience files. Here, we explain the relationship between these files and the primary data, and the process to generate intermediate results from the raw data. Note: All data filenames described here are relative to the root directory for the intermediate data.
`Sample_Info.csv`: a text file mapping the 36 PDX samples back to the PDX model, tumor (i.e., mouse & passage number) and tumor sector that they were extracted from. This is an easier-to-understand version of the ISA-TAB samples files in the deposited data. Please note the following mapping between model names and those used in the paper:
- PDX-L1 = CN1571
- PDX-L2 = CR-0104-O
- PDX-R1 = CN1572
- PDX-R2 = CN1574
There are two sets of intermediate results for the DNA data.
- AnnoVar Annotation of VCF calls: in the folder `annovar_results\`. These are essentially annotations of the mutation calls (i.e. the vcf files deposited in the `DNA_Processed` folder of the primary data) by annovar. These files were generated using the annovar program against the hg19 alignment, with refGene, cytoBand, exac03 and avsnp147 annotations. Steps to install annovar and produce these annotations from the deposited primary vcf files can be found at http://annovar.openbioinformatics.org/en/latest/user-guide/startup/
- Combined DNA Results: `DNA_Processed_Results_Final.mat`. This file contains the results generated by combining results from all the sequenced samples. It can be generated from the primary data by running the script `DNA/Generate_DNA_Results.m`.
We make use of two files for the RNA:
- Summary of RNA Quality: `RNA_Quality.txt`. This file is generated by scraping page 7 of the file 'RNA_Combined/AmpliSeq Transcriptome Performance Summary.pdf' deposited in the primary data.
- FPKM normalized read counts: `normalized_reads.xls`. This is a vendor-supplied file providing the FPKM normalized read counts obtained by combining across the different "replicate" chips for the same sample. Essentially the same data can be generated from the deposited read counts using the function `RNA/Load_Raw_RNA.m`. Specifically, the function returns:
[combinedVals,singleChipVals,geneInfo,combinedData]=Load_Raw_RNA()
Here, `combinedVals` is a numeric array with rows corresponding to genes and columns to samples, containing the raw counts obtained by combining the different chips. FPKM normalizing the columns of `combinedVals` (i.e., scaling so that they sum up to 1E6) essentially produces `normalized_reads.xls`.
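The column scaling described above can be sketched as follows. The small matrix is a made-up stand-in for the `combinedVals` output of `Load_Raw_RNA()`, and `scaledVals` is a name introduced here for illustration; the sketch only shows the "scale each column to sum to 1E6" step and makes no claim about reproducing the vendor file exactly.

```matlab
% Toy stand-in for combinedVals: 3 genes (rows) x 2 samples (columns).
combinedVals = [10  0;
                30  5;
                60 15];

% Total raw counts per sample (one value per column).
colTotals = sum(combinedVals, 1);

% Scale every column so that it sums to 1E6. The division relies on
% implicit expansion (MATLAB R2016b and later).
scaledVals = 1e6 * combinedVals ./ colTotals;

disp(sum(scaledVals, 1));   % each column total is now 1E6
```

Note that a sample with zero total counts would produce NaN values in its column, so such columns would need to be handled separately.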
The results of the pathway analysis are stored in `gsva_results/gsvaHallmark_rnaseq.txt`. This analysis is performed using the GSVA toolbox in R https://bioconductor.org/packages/release/bioc/html/GSVA.html, based on pathways indicated by the Hallmark Gene Set of MSIGDB http://software.broadinstitute.org/gsea/msigdb/index.jsp which was downloaded and saved to `h.all.v6.0.symbols.gmt`. The code to generate `gsva_results/gsvaHallmark_rnaseq.txt` from `normalized_reads.xls` and `h.all.v6.0.symbols.gmt` can be found in the R file `RNA/R_Pathway_Activation_GSVA.R`. To run this file, the user will need to install the R packages indicated in the file and manually change the file paths to point to the correct save locations.
- File Mapping: The mapping information linking IF/H&E filenames and the PDX samples is stored in `IF_Imaging_lookup.xlsx` and `HE_lookup.xlsx`. The function `Microscopy\GetImages.m` uses this information to pull the relevant image files given sample requirements.
- Background Subtraction: The function `Microscopy\Generate_ImageList.m` is used to perform background subtraction on the raw images. The results are stored as instances of the `TissueImageCorrected` class in the mat file `Images\scaledImages.mat` (technically the file pointed to by `params.microscopy.bgSubtractedImgList`).
- Average Intensity Calculations: The intensities of the different markers in various tissue compartments are calculated on the 36 samples by the script `Microscopy\Calculate_Marker_Intensities.m` and stored in the mat file pointed to by `params.microscopy.avgValueFile`.
This analysis depends on a hierarchical decomposition of variation across scales being performed on every section of every sample on every marker set. As these are time-consuming calculations, they were performed on a cluster and stored as .mat files in the `Spatial_Downsampling` folder. To generate this data, one can use the cluster-friendly function `Microscopy/Spatial/Downsampler.m`. This function takes as its only input the sample number (1 to 36), and generates the output file in the folder specified by `params.microscopy.downsamplingResultsDir`.
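Because the function is indexed by a single sample number, it maps naturally onto a cluster job array with one task per sample. The sketch below illustrates this under stated assumptions: it presumes a SLURM scheduler that exposes the task index through the `SLURM_ARRAY_TASK_ID` environment variable (other schedulers use different variables), and it falls back to looping over all 36 samples when that variable is absent. The guard around the call only serves to keep the sketch runnable when the repository is not on the MATLAB path.

```matlab
% Read the job-array index, if any (assumption: SLURM-style job array).
taskStr = getenv('SLURM_ARRAY_TASK_ID');
if isempty(taskStr)
    sampleNumbers = 1:36;                 % no scheduler: process all samples
else
    sampleNumbers = str2double(taskStr);  % one sample per array task
end
assert(all(ismember(sampleNumbers, 1:36)), 'Sample number must be 1 to 36');

for s = sampleNumbers
    if exist('Downsampler', 'file') == 2
        Downsampler(s);                   % repository function, one sample
    else
        fprintf('Downsampler.m not on the path; would process sample %d\n', s);
    end
end
```

Running all 36 samples in a single local session is possible in principle, but given the computation time noted above, one array task per sample is the more practical arrangement.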
PhenoRipper profiling involves two steps: i) Training of Models and ii) Profiling based on Trained Models. Models need to be trained separately for each marker set. Additionally, the process of model training is itself stochastic, so we repeat the model training procedure several times to account for this randomness. The models and the profiles based on them are saved in the folder `PR_Results`. To generate these results from the raw data, follow these steps.
- Training: Because of the massive computations involved, these calculations need to be performed on a cluster. Please use the cluster-friendly function `Microscopy/PhenoRipper/SerializePhenoRipperTraining.m` to generate models in a parallelizable way. This function takes as input numbers from 1-144, corresponding to the 36 samples x 4 marker sets, and automatically generates all the required models in the folder pointed to by `params.microscopy.bgSubtractedPRDir`.
- Profiling: This is similarly time-consuming, and we use the function `Microscopy/PhenoRipper/SerializePhenoRipper.m` to parallelize these computations. See the script `Microscopy/PhenoRipper/Run_PR_Profiling_Locally.m` for a demonstration of how this would be run locally to save results in the folder pointed to by `params.microscopy.bgSubtractedPRDir`. To load these results back into MATLAB, one can use the function `Microscopy/PhenoRipper/Load_PR_Results.m`. Additionally, the results for a single randomization, stored in the file `PR_Result_Clean.mat`, can be saved by running the script `Save_Clean_PR_Results.m`.