br Construction of the decision
Construction of the decision tree
We used a conditional reference tree algorithm for automated selection of the most discriminative variables (proteins) between conventional subtypes and for the generation of a decision tree, using ctree() function of R package party (Hothorn et al., 2006) with the control parameters set to default, except for the minimum sum of weights in a node in order to be considered for splitting (minsplit = 50) and the minimum sum of weights in a terminal node (minbucket = 3). The analysis was based on the set of 22 proteins with significantly different abundances between different conventional subtypes and related clinical-pathological characteristics, as determined by MSstats and presented in the Results part. The selected proteins were further validated in gene AM251 datasets and through immunohistochemistry. The decision tree was constructed also on gene expression data (see section g below).
Analysis of ERBB2 gene expression in the same sample set
The data were extracted from our previously published dataset (Bouchal et al., 2015).
Analysis of gene expression in independent microarray and RNA-Seq sets of samples
Publicly available gene expression datasets DFHCC, DFHCC2, IRB, PNC and SUPERTAM_HGU133PLUS_2 (all platform Affymetrix Human Genome U133A, 937 samples in total) were downloaded in the log2 normalized form (Haibe-Kains et al., 2012) and used in order to confirm at the transcriptome level the hypotheses derived from our analysis of proteomic SWATH-MS data. For this purpose, subsets of 883 samples with available information on gene expression, ER status, tumor grade, HER2 status or lymph node status were used, based on the type of comparison (see Figure S2).
First, we performed analysis of differential expression between conventional subtypes (pairwise) and between the categories of clinical-pathological variables using moderated t-statistics (method implemented in the R package Limma of R 3.0.2), on the set of 6,895 probesets representing 2,782 genes with corresponding products (proteins) measured also by SWATH-MS in our experi-ment. This means that out of 2,842 proteins measured by SWATH-MS, 97.9% had corresponding genes in the gene expression data-sets. P values were adjusted for multiple hypothesis testing by Benjamini-Hochberg FDR correction; see Data S4 for details. In this analysis we also validated INPP4B, CDK1 and ERBB2 (the proteins selected in the proteotype classification tree) as differentially regulated between ER+ versus ER- tumors, high versus low grade tumors, and HER2+ versus HER2- tumors, respectively.
Second, we correlated log2 protein fold-changes (log2FC) as obtained from pairwise group comparison with the respective tran-script log2FCs from the same comparisons in the five transcriptomic datasets. The same analysis was performed also for a subset of ERBB2, INPP4B and CDK1 protein-transcript pairs.
Third, a decision tree based on gene transcript levels was constructed in order to classify the samples into the five conventional subtypes and thus to compare the resulting model in terms of performance to the model (decision tree) based on proteotypes. In other words, we asked whether transcriptomics data are better at predicting the conventional subtypes. For this purpose, the same procedure as described above (‘‘Construction of the Decision Tree’’ section) was applied on 1036 most variable (top 5%) pro-besets representing unique gene symbols and a set of 474 samples.
Fourth, preprocessed Level 3 RNA-seq data (Cancer Genome Atlas, 2012; Ciriello et al., 2015) were downloaded from the TCGA data portal (https://portal.gdc.cancer.gov). Filtering and normalization was performed using edgeR package (Robinson et al., 2010; McCarthy et al., 2012). Limma (Ritchie et al., 2015) ‘‘RemoveBatchEffect’’ function was executed on log2 transformed Count Per Million (CPM) data. Batch corrected log2CPM values were then used in order to validate the hypotheses derived from our analysis of proteomic SWATH-MS data also on transcript level. Subsets of 1078 samples with available information on gene expression, ER status, or HER2 status were used to perform analysis of differential expression between the categories of clinical-pathological variable: 791 ER+ versus 237 ER- patients were compared in term of INPP4B expression, and 161 HER2+ versus 554 HER2- patients were compared in term of ERBB2 expression using Wilcoxon rank sum test. As the information on tumor grade was unavailable for the dataset, we performed Spearman correlation of CDK1 expression with expression data on commonly used proliferation marker MKI67.
Analysis of patient survival
Survival analysis was performed using Kaplan-Meier Plotter (http://kmplot.com) for relapse-free survival (RFS) involving a microarray dataset from 3951 breast cancer tissues (2018 database version) (Gyo¨rffy et al., 2010). Each gene was represented by user-defined probe set, Affymetrix IDs were as follows: 205376_at (INPP4B), 203213_at (CDK1, referenced as CDC2 in kmplot database), 216836_s_at (ERBB2), for reference genes 205225_at (ESR1) and 212023_s_at (MKI67). The population was split into high and low expression groups based on the incidence: (i) upper tertile for INPP4B and ESR1 based on approximate proportion of ER+ and ER- tumors, (ii) median for CDK1 and MKI67 genes based on approximate proportion of high and low grade tumors, and (iii) lower quartile for ERBB2 based on approximate proportion of HER2+ and HER2- tumors in the breast cancer population. 120 months follow up threshold was applied.