AltschulerWu-Lab/MultimodalPDXHeterogeneity

A multi-modal data resource for investigating topographic heterogeneity in patient-derived xenograft tumors

This repository contains the code for the paper "A multi-modal data resource for investigating topographic heterogeneity in patient-derived xenograft tumors" by Rajaram, Roth et al., published in Scientific Data. The deposited code is intended to allow a user to generate the figures in the paper, assuming they have downloaded the associated data from https://ncihub.org/groups/nci_physci/wiki/PSON0010 and the intermediate results from figshare. Instructions for generating the intermediate results (which will likely require a cluster) are also provided.

Setup/Usage

Software Note: The code here is written almost entirely in MATLAB (except for a single R function). It was tested on MATLAB 2019b on Linux (Ubuntu 18.04). Because the code is an analysis of data, paths must be set up correctly so that it can access the data. Attempts have been made to keep the code OS-agnostic, but special care must be taken with paths on non-Unix operating systems.

  1. Primary Data:

    i. Download the data from https://ncihub.org/groups/nci_physci/wiki/PSON0010. You can confirm the fidelity of the downloads by comparing against the MD5 sums listed in ftp://caftpd.nci.nih.gov/psondcc/casix/Manifest.txt

    ii. Unzip every zip file so that it produces a directory (in the root directory) with the same name as the zip file.

  2. Intermediate Results:

    Because the data set is large, the calculations are highly time-consuming and many need to be performed on a cluster. Consequently, the figures generated by the code here are based on intermediate result files, which are available for download. Instructions for generating the intermediate data from the raw data are provided at the bottom of this document.

    i. Download the intermediate data from figshare.

    ii. Download every dataset individually. Unzip all the zip files into folders bearing the same name as the zip file in the root directory.

    Special note: As the PhenoRipper files were too large to upload as a single file, they are split into multiple zip files. Please create a folder called PR_Results and copy the contents of each PR_XX.zip into this folder. The final expected directory structure (and md5sums) are listed in the "Online Resource" called Final Directory Structure in the figshare project.
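The unzip step for the intermediate results could be scripted in MATLAB along the following lines. This is a minimal sketch, not part of the repository: it assumes all downloaded zip files sit directly in the intermediate-results root folder (the placeholder path below), and that the PhenoRipper parts are the only files whose names start with PR_. Whether each archive already contains a top-level folder is not stated here, so please compare the outcome against the Final Directory Structure resource on figshare.

```matlab
% Sketch (assumption-laden): unzip every archive in the intermediate-results root.
% rootInfoDir is a placeholder; replace it with your own download location.
rootInfoDir = '/home/myUserName/Data/RajaramEtAlIntermediateResults/';

zipFiles = dir(fullfile(rootInfoDir, '*.zip'));   % empty struct array if nothing is found
for k = 1:numel(zipFiles)
    zipPath = fullfile(rootInfoDir, zipFiles(k).name);
    [~, baseName] = fileparts(zipFiles(k).name);   % file name without the .zip extension

    if strncmp(baseName, 'PR_', 3)
        % Assumption: PR_XX.zip parts are all collected into one PR_Results folder.
        targetDir = fullfile(rootInfoDir, 'PR_Results');
    else
        % All other archives go into a folder named after the zip file.
        targetDir = fullfile(rootInfoDir, baseName);
    end

    unzip(zipPath, targetDir);                     % creates targetDir if it does not exist
    fprintf('Unzipped %s into %s\n', zipFiles(k).name, targetDir);
end
```

If the loop prints nothing, no zip files were found at the given path.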

  3. Code:

    i. Clone this repository.

    ii. In the file GetParams.m (present in the top-level folder of the code), change:

      a) Line number 5 to point to the root directory of the location where you have saved the primary data:
      rootDir='/home/myUserName/Data/RajaramEtAlData/'

      b) Line number 6 to point to the location where you have saved the intermediate results:
      rootInfoDir='/home/myUserName/Data/RajaramEtAlIntermediateResults/'

      c) Line number 7 to point to a location where you would like to save figure files. Please ensure you have write permission in this folder:
      figSaveDir='/home/myUserName/Figures/RajaramEtAlFigures/'

    iii. Start MATLAB and navigate to the base directory where you have saved this repository.

    iv. In the MATLAB command window, run the script GenerateFigures.m (i.e., type GenerateFigures). This should sequentially generate all the figures for the paper. Note that the code for individual figures is available as scripts in the Figures folder, and these can be run individually as indicated in GenerateFigures.m.
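Before generating the figures, it may help to confirm that the three folders configured in GetParams.m actually exist. The snippet below is an illustrative check only and is not part of the repository; the three paths are the placeholder examples from the steps above, and the variable dirExists is introduced here purely for the demonstration.

```matlab
% Placeholder values for the three path settings edited in GetParams.m.
rootDir     = '/home/myUserName/Data/RajaramEtAlData/';
rootInfoDir = '/home/myUserName/Data/RajaramEtAlIntermediateResults/';
figSaveDir  = '/home/myUserName/Figures/RajaramEtAlFigures/';

% Optional sanity check (hypothetical helper code): report any missing folder.
dirsToCheck = {rootDir, rootInfoDir, figSaveDir};
dirExists = false(1, numel(dirsToCheck));
for k = 1:numel(dirsToCheck)
    dirExists(k) = (exist(dirsToCheck{k}, 'dir') == 7);   % 7 means "is a folder"
    if ~dirExists(k)
        fprintf('Folder not found: %s\n', dirsToCheck{k});
    end
end
```

Any folder reported as missing should be created or corrected in GetParams.m before running GenerateFigures.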

Generation of Intermediate Results

As noted above, the large data set size necessitates calculations being performed on a cluster to generate intermediate result files, which can be downloaded from figshare. In addition to results for these time consuming calculations, this download contains several convenience files. Here, we explain the relationship between these files and the primary data, and the process to generate intermediate results from the raw data. Note: All data filenames described here are relative to the root directory for the intermediate data.

Overall Data Description

'Sample_Info.csv' is a text file mapping the 36 PDX samples back to the PDX model, tumor (i.e., mouse & passage number) and tumor sector that they were extracted from. This is an easier-to-understand version of the ISA-TAB sample files in the deposited data. Please note the following mapping between model names and those used in the paper:

  1. PDX-L1 = CN1571
  2. PDX-L2 = CR-0104-O
  3. PDX-R1 = CN1572
  4. PDX-R2 = CN1574

DNA Data

There are two sets of intermediate results for the DNA data.

  1. AnnoVar Annotation of VCF calls: in the folder annovar_results\. These are annotations, produced by annovar, of the mutation calls (i.e., the vcf files deposited in the DNA_Processed folder of the primary data). The files were generated by running the annovar program against the hg19 alignment, with the refGene, cytoBand, exac03 and avsnp147 annotations. Steps to install annovar and produce these annotations from the deposited primary vcf files can be found at http://annovar.openbioinformatics.org/en/latest/user-guide/startup/
  2. Combined DNA Results: DNA_Processed_Results_Final.mat. This file contains the results generated by combining results from all the sequenced samples. It can be generated from the primary data by running the script:
DNA/Generate_DNA_Results.m

RNA Data

We make use of two files for the RNA:

  1. Summary of RNA Quality: RNA_Quality.txt. This file is generated by scraping page 7 of the file 'RNA_Combined/AmpliSeq Transcriptome Performance Summary.pdf' deposited in the primary data.
  2. FPKM normalized read counts: normalized_reads.xls. This is a vendor-supplied file providing the FPKM normalized read counts, obtained by combining across the different "replicate" chips for the same sample. Essentially the same data can be generated from the deposited read counts using the function RNA/Load_Raw_RNA.m. Specifically, the function returns:
[combinedVals,singleChipVals,geneInfo,combinedData]=Load_Raw_RNA()

Here, combinedVals is a numeric array, with rows corresponding to genes and columns to samples, containing the raw counts obtained by combining the different chips. FPKM normalizing the columns of combinedVals (i.e., scaling so that they sum up to 1E6) essentially produces normalized_reads.xls.
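The column scaling described above can be sketched as follows. The small matrix is made-up toy data standing in for the output of Load_Raw_RNA, and normalizedVals is a name introduced here for illustration; this only demonstrates the "scale each column to sum to 1E6" step and is not the vendor's pipeline.

```matlab
% Toy stand-in for combinedVals: rows are genes, columns are samples (made-up counts).
combinedVals = [10  0;
                30  5;
                60 15];

% Scale every column (sample) so that its entries sum to 1E6.
colTotals = sum(combinedVals, 1);                  % 1 x nSamples row of total counts
normalizedVals = 1e6 * combinedVals ./ colTotals;  % implicit expansion (MATLAB R2016b+)

% Each column of normalizedVals now sums to 1E6.
disp(sum(normalizedVals, 1));
```

A sample with zero total counts would produce NaN values under this scaling, so such columns would need to be handled separately.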

RNA Pathway Analysis

The results of the pathway analysis are stored in gsva_results/gsvaHallmark_rnaseq.txt. This analysis is performed using the GSVA toolbox in R https://bioconductor.org/packages/release/bioc/html/GSVA.html, based on pathways indicated by the Hallmark Gene Set of MSIGDB http://software.broadinstitute.org/gsea/msigdb/index.jsp which was downloaded and saved to h.all.v6.0.symbols.gmt. The code to generate gsva_results/gsvaHallmark_rnaseq.txt from normalized_reads.xls and h.all.v6.0.symbols.gmt can be found in the R file RNA/R_Pathway_Activation_GSVA.R. To run this file the user will need to install the R packages indicated in the file and manually change filepaths to point to correct save locations.

Image Data: Loading/Background Correction/Gross Analysis

  1. File Mapping: The mapping information linking IF/H&E filenames and the PDX samples is stored in IF_Imaging_lookup.xlsx and HE_lookup.xlsx. The function Microscopy\GetImages.m uses this information to pull the relevant image files given sample requirements.
  2. Background Subtraction: The function Microscopy\Generate_ImageList.m is used to perform background subtraction on the raw images. The results are stored as instances of the TissueImageCorrected class in the mat file Images\scaledImages.mat (technically the file pointed to by params.microscopy.bgSubtractedImgList).
  3. Average Intensity Calculations: The intensities of the different markers in various tissue compartments are calculated on the 36 samples by the script Microscopy\Calculate_Marker_Intensities.m and stored in the mat file pointed to by params.microscopy.avgValueFile.

Image Data: Spatial Variation Analysis

This analysis depends on a hierarchical decomposition of variation across scales being performed on every section of every sample for every marker set. As these are time-consuming calculations, they were performed on a cluster and stored as .mat files in the Spatial_Downsampling folder. To generate this data, one can use the cluster-friendly function Microscopy/Spatial/Downsampler.m. This function takes just the sample number (1 to 36) as input, and generates the output file in the folder specified by params.microscopy.downsamplingResultsDir.
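As an illustration of how the sample number might be supplied on a cluster, the sketch below reads it from a job-array index. It assumes a Slurm scheduler (SLURM_ARRAY_TASK_ID is Slurm's job-array variable), and the call form Downsampler(sampleNumber) is inferred from the description above rather than checked against the function's actual signature. Other schedulers would need a different environment variable.

```matlab
% Sketch: pick the sample number from a Slurm job-array index (assumed scheduler).
taskStr = getenv('SLURM_ARRAY_TASK_ID');   % empty string when not running under Slurm
sampleNumber = str2double(taskStr);        % NaN if the variable is unset or not numeric
if isnan(sampleNumber)
    sampleNumber = 1;                      % fallback for an interactive test run
end

% Guard against indices outside the 36 samples.
if sampleNumber < 1 || sampleNumber > 36 || sampleNumber ~= round(sampleNumber)
    error('Sample number must be an integer from 1 to 36.');
end

if exist('Downsampler', 'file') == 2
    % Assumed call form: a single sample number as input.
    Downsampler(sampleNumber);
else
    fprintf('Downsampler.m is not on the MATLAB path; would process sample %d.\n', sampleNumber);
end
```

Submitting this as a job array with indices 1 to 36 would then cover all samples, one per task.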

Image Data: PhenoRipper Analysis

PhenoRipper profiling involves two steps: i) Training of Models and ii) Profiling based on Trained Models. Models need to be trained separately for each marker set. Additionally, the process of model training is itself stochastic, so we repeat the model training procedure several times to account for this randomness. The models and the profiles based on them are saved in the folder PR_Results. To generate these results from the raw data, follow the steps below.

  1. Training: Because of the massive computations involved, these calculations need to be performed on a cluster. Please use the cluster-friendly function Microscopy/PhenoRipper/SerializePhenoRipperTraining.m to generate models in a parallelizable way. This function takes as input a number from 1 to 144, corresponding to the 36 samples x 4 marker sets, and automatically generates all the required models in the folder pointed to by params.microscopy.bgSubtractedPRDir.
  2. Profiling: This is similarly time-consuming, and we use the function Microscopy/PhenoRipper/SerializePhenoRipper.m to parallelize these computations. See the script Microscopy/PhenoRipper/Run_PR_Profiling_Locally.m for a demonstration of how this would be run locally to save results in the folder pointed to by params.microscopy.bgSubtractedPRDir. To load these results back into MATLAB, one can use the function Microscopy/PhenoRipper/Load_PR_Results.m. Additionally, the results for a single randomization, stored in the file PR_Result_Clean.mat, can be saved by running the script Save_Clean_PR_Results.m.
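To make the 1 to 144 job numbering concrete, the sketch below shows one way a single index can be split into a (sample, marker set) pair. The ordering used here (sample index varying fastest) is purely an assumption for illustration; the actual mapping is defined inside SerializePhenoRipperTraining.m and may differ, so consult that file before relying on any particular pairing.

```matlab
% Hypothetical decomposition of a job index into sample and marker-set indices.
numSamples    = 36;
numMarkerSets = 4;
jobIdx        = 37;   % example job number in the range 1..144

% ind2sub uses column-major order, so the sample index varies fastest here.
[sampleIdx, markerSetIdx] = ind2sub([numSamples, numMarkerSets], jobIdx);
fprintf('Job %d -> sample %d, marker set %d\n', jobIdx, sampleIdx, markerSetIdx);

% Round trip: the pair maps back to the same job index.
recoveredIdx = sub2ind([numSamples, numMarkerSets], sampleIdx, markerSetIdx);
```

Under this assumed ordering, jobs 1 to 36 would cover all samples for the first marker set, jobs 37 to 72 the second, and so on.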
