How the dynamic re-training algorithm works

The dynamic re-training algorithm makes a prognosis prediction without employing a pre-defined set of genes.

In brief, the system analyzes gene expression data from 3,534 breast cancers with clinical annotation including survival. For each test case, a case-specific training subset is selected that contains only the cases with the highest molecular similarity to the test case; similarity is measured by Euclidean distance over all genes. Informative genes are then identified in the case-specific training cohort by computing Cox regression coefficients, and a case-specific predictor is developed from them. This fixed predictor is applied to the test case. Because the predictor is built dynamically, each new case yields a different training set and a different predictor.
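The similarity ranking step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name and the toy data are hypothetical, and a real run would operate on the 9,886-gene profiles described below.

```python
import math

def rank_by_similarity(test_profile, database):
    """Rank database cases by Euclidean distance to the test case.

    test_profile: list of expression values (one per gene)
    database: dict mapping case id -> expression list (same gene order)
    Returns case ids ordered from most to least similar.
    """
    def dist(profile):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(test_profile, profile)))
    return sorted(database, key=lambda cid: dist(database[cid]))

# Toy example: three stored cases, two genes
db = {"A": [1.0, 2.0], "B": [5.0, 5.0], "C": [1.1, 2.1]}
print(rank_by_similarity([1.0, 2.0], db))  # → ['A', 'C', 'B']
```

Taking the top of this ranking yields the case-specific training subset.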

In more detail:

Database construction

In order to establish the largest possible pool of potential training cases for predictor building, we assembled all publicly available breast cancer gene expression data annotated with survival and treatment information. We searched the GEO database (http://www.ncbi.nlm.nih.gov/geo/) using the keywords "breast", "cancer", "microarray", and "affymetrix". Only publications with raw gene expression data, clinical survival information, and at least 30 patients were included. We identified a total of 6,197 cases in 25 datasets. We further restricted our search to data generated on the HG-U133A (GPL96) and HG-U133 Plus 2.0 (GPL570) arrays only, to minimize the difficulties of predictor building across different platforms. We performed a quality check on all arrays and included only arrays with a background between 19 and 218, a raw Q between 0.5 and 14, percent present calls > 30%, a GAPDH 3':5' ratio < 4.3, a beta-actin 3':5' ratio < 18, and the presence of the bioB/C/D spike-in controls. We also removed duplicate samples (n=1,418); when multiple GEO entries existed for the same case, we retained the first published copy of the array. The final number of unique cases that passed the above QC filters and were included in our master database was n=3,999. Of these cases, 3,534 had relapse-free survival information.

For predictor building, only probe sets measured by both the GPL96 and GPL570 arrays (n=22,277) were used. We also performed a second scaling normalization to set the average expression of each array to 1,000 to reduce batch effects, and subsequently applied an intensity and frequency filter: only probe sets for which at least one of the 3,534 samples showed a normalized expression value of at least 1,000 were retained for predictor building. For genes targeted by multiple probe sets, only the JetSet best probe is retained. The final number of probe sets/genes included in the training database pool for each case is n=9,886.
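The array-level QC thresholds listed above can be expressed as a single filter. A minimal sketch follows; the function and metric key names are illustrative (the numeric limits are the ones stated in the text), and real QC values would come from the Affymetrix array report.

```python
def passes_qc(metrics):
    """Return True if one array satisfies all QC thresholds from the text.

    metrics: dict with illustrative keys 'background', 'raw_q',
    'percent_present', 'gapdh_ratio', 'actin_ratio', 'spikes_present'.
    """
    return (19 <= metrics["background"] <= 218          # background range
            and 0.5 <= metrics["raw_q"] <= 14           # raw Q range
            and metrics["percent_present"] > 30         # % present calls
            and metrics["gapdh_ratio"] < 4.3            # GAPDH 3':5'
            and metrics["actin_ratio"] < 18             # beta-actin 3':5'
            and metrics["spikes_present"])              # bioB/C/D spikes

good = dict(background=100, raw_q=5, percent_present=45,
            gapdh_ratio=1.2, actin_ratio=3.0, spikes_present=True)
print(passes_qc(good))  # → True
```

An array failing any one criterion (e.g. percent present calls of 20%) would be excluded.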

Selection of case-matched training subset and predictor building

To select samples for model building (i.e., the training subset), we identify the cases most similar to the cancer under investigation (i.e., the test case). The Euclidean distance across all genes is computed between the test case and every sample in the master database, and the potential training cases are ranked by this distance. Informative genes are then selected for predictor building by performing a Kaplan-Meier survival analysis for each gene in the training subset, using the median expression value as a cutoff. Because some genes correlate positively with survival (higher expression in the good prognosis group) while others show the opposite relationship, for each gene the difference to the training-set median is used, and if the hazard ratio is < 1, the expression value is inverted to a negative value. The same processing steps are performed for the test case. The average expression of the informative genes in the test case is then compared to the median of the average expression of these genes (the sample-specific cutoff value) in the good and poor outcome groups of the training set. This process is termed the "molecular classification" of the patient sample.
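The median-centering, sign inversion, and cutoff comparison can be sketched as follows. This is a simplified illustration under two stated assumptions: the per-gene hazard ratios are supplied as inputs (in practice they would come from the survival analysis described above), and the cutoff is taken as the median of the per-sample average scores rather than the full good/poor group comparison. All names are hypothetical.

```python
from statistics import median

def classify_test_case(train_expr, hazard_ratios, test_expr):
    """Sketch of the 'molecular classification' step.

    train_expr: dict gene -> list of training-subset expression values
    hazard_ratios: dict gene -> hazard ratio from the per-gene survival
        analysis (assumed computed elsewhere, e.g. with a survival package)
    test_expr: dict gene -> expression value in the test case
    """
    genes = list(train_expr)
    n_samples = len(next(iter(train_expr.values())))

    def score(values):
        total = 0.0
        for g in genes:
            diff = values[g] - median(train_expr[g])  # difference to median
            # invert sign for protective genes (HR < 1) so that a higher
            # score always means worse prognosis
            total += -diff if hazard_ratios[g] < 1 else diff
        return total / len(genes)

    sample_scores = [score({g: train_expr[g][i] for g in genes})
                     for i in range(n_samples)]
    cutoff = median(sample_scores)  # sample-specific cutoff value
    return "poor" if score(test_expr) > cutoff else "good"
```

For a single risk gene (hazard ratio > 1) with training values [1, 2, 3, 4], a test value of 5 classifies as "poor" and a test value of 0 as "good"; with a protective gene (hazard ratio < 1) the calls are reversed.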

Comparison of the training set to the remaining patients

Besides the molecular classification described above, we also want to estimate how reliably the training set itself performs in comparison to all breast cancer patients. For example, if the training set contains only patients with a very bad clinical prognosis, the molecular classification might erroneously designate the test subject as having a good prognosis. To this end, the entire training set is compared to all the remaining patients using a Kaplan-Meier analysis. The results of this analysis, termed the "training set assessment", are used in the final classification.
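The survival comparison between the training set and the remaining patients can be illustrated with a minimal log-rank test statistic. This is a bare-bones sketch for two groups with right-censored data; a real analysis would use a dedicated survival package (e.g. lifelines' `logrank_test`), and the function name here is hypothetical.

```python
def logrank_stat(times_a, events_a, times_b, events_b):
    """Chi-square (1 df) log-rank statistic comparing two survival groups.

    times_*: follow-up times; events_*: 1 if the event occurred, 0 if censored.
    Group a would be the training set, group b the remaining patients.
    """
    data = ([(t, e, 0) for t, e in zip(times_a, events_a)] +
            [(t, e, 1) for t, e in zip(times_b, events_b)])
    event_times = sorted({t for t, e, _ in data if e})
    obs_a = exp_a = var = 0.0
    for t in event_times:
        n_a = sum(1 for tt, _, g in data if tt >= t and g == 0)  # at risk, a
        n_b = sum(1 for tt, _, g in data if tt >= t and g == 1)  # at risk, b
        d_a = sum(1 for tt, e, g in data if tt == t and e and g == 0)
        d_b = sum(1 for tt, e, g in data if tt == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_a += d_a
        exp_a += d * n_a / n               # expected events in group a
        if n > 1:
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    return (obs_a - exp_a) ** 2 / var if var else 0.0
```

Identical groups give a statistic of 0; a clearly separated pair of groups gives a large value, flagging a training set whose outcome distribution differs from the rest of the cohort.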

The final classification rule

The final classification rule takes into account both the risk assignment from the "training set assessment" and the output of the "molecular classification". When the two predictors are concordant and both assign good or poor prognosis, the decision rule follows the concordant vote. When the "molecular classification" is not significant for either good or poor prognosis, or when the training set assessment (the clinical prediction) contradicts the molecular prediction, the final output is "intermediate".
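The decision rule above reduces to a few lines of logic. A minimal sketch, with hypothetical names; `None` stands in for a non-significant molecular classification.

```python
def final_classification(training_set_risk, molecular_call):
    """Combine the two assessments into the final output.

    training_set_risk: 'good' or 'poor' from the training set assessment
    molecular_call: 'good', 'poor', or None when not significant
    """
    # follow the vote only when both assessments agree
    if molecular_call in ("good", "poor") and molecular_call == training_set_risk:
        return molecular_call
    # non-significant or discordant calls fall back to 'intermediate'
    return "intermediate"

print(final_classification("poor", "poor"))  # → poor
print(final_classification("good", None))    # → intermediate
```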