Fig. 7. Feature Selector Stability: (a) Jaccard index for Feature Subsets with different cardinality; (b) MDS plot of the Feature Ranking Algorithms.
Top-40 Pearson and top-41 SVM-Wrapper feature sets.
Rank  Ranker: Pearson (classifier: LR)  Ranker: SVM-wrapper (classifier: SVM)
1     Level of education                Family history of CRC
2     Age                               Level of education
3     RED MEAT                          Ethanol in the past decade
4     Cholesterol                       Red Meat
5     Ethanol in the past decade        Ethanol in the present
6     Polysaccharides                   Age
7     Family history of CRC             Fiber
8     Sex                               Physical exercise in the last decade
16    Zinc                              Legume
19    Monounsaturated fats              FRUITS
22    Polyunsaturated fats              Sex
25    Phosphorus                        Vitamin D
26    Vegetables                        Carotenoids
30    Thiamin                           FISH
31    Total grams                       RS
32    Cobalamin                         Niacin
36    Retinoids                         Race
39    Water                             Zinc
population are known to have a high association with CRC, while others are not considered to be as strongly correlated, which deserves further study with other cancer datasets.
Stability is assessed in the conventional way by computing a scalar metric. Additionally, we propose a graphical approach that works in 2D or 3D to evaluate not only the stability of the algorithms but also their similarity to other ranking algorithms. This graphical approach, based on an MDS projection, shows at a glance and in a single picture that: (a) the most stable algorithm is RF, (b) the most unstable is NN-Wrapper, (c) the rankings yielded by RF and Pearson are very similar, so the analysis can focus on one of them, (d) the aforementioned group leads to a ranking that is very different from those of Relief and SVM-RFE, and (e) the SVM-Wrapper ranking is moderately stable and similar to the Pearson one.
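As an illustration of the scalar stability metric, the following minimal sketch computes the mean pairwise Jaccard index of the top-k feature subsets obtained from several runs of a ranking algorithm, in the spirit of Fig. 7(a). The exact averaging scheme, the function names and the toy rankings are assumptions made for this example, not details taken from the study.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two feature subsets: size of intersection over size of union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty subsets are treated as identical
    return len(a & b) / len(a | b)


def subset_stability(rankings, k):
    """Mean pairwise Jaccard index of the top-k subsets of several rankings."""
    if len(rankings) < 2:
        raise ValueError("at least two rankings are needed")
    tops = [ranking[:k] for ranking in rankings]
    pairs = list(combinations(tops, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical rankings from three resampling runs (best feature first).
runs = [
    ["age", "sex", "red_meat", "fiber", "zinc"],
    ["age", "red_meat", "sex", "zinc", "fiber"],
    ["sex", "age", "fiber", "red_meat", "zinc"],
]
stability_top3 = subset_stability(runs, 3)
stability_top5 = subset_stability(runs, 5)
```

Evaluating `subset_stability` over a range of `k` values gives one stability curve per ranking algorithm, which is how subsets of different cardinality can be compared.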
The main strength of this study is that it analyzes stability and predictive power together. Additionally, feature selection techniques allow both improving the performance of risk prediction models and identifying relevant features related to CRC. In this study, the SVM-wrapper was one of the best ranking techniques regarding model performance, and it performs moderately in terms of robustness. This study (limited to the multicase-control study of the Spanish population) also shows that the simple Pearson correlation coefficient offers a good trade-off between performance and robustness and can easily scale to high-dimensional datasets. A comprehensive evaluation with more colorectal cancer datasets and more feature ranking algorithms could make these findings more generalizable.
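To make the scalability argument concrete, the sketch below ranks features by the absolute value of their Pearson correlation with the class label. Each feature is scored independently in a single pass, so the cost grows linearly with the number of features. Using the absolute value as the score, as well as the feature names and toy data, are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear information
    return sxy / math.sqrt(sxx * syy)


def rank_features(columns, target):
    """Return feature names sorted by decreasing |Pearson correlation| with the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: three features and a binary case/control label.
data = {
    "red_meat": [1, 2, 3, 4, 5, 6],
    "fiber": [6, 5, 3, 4, 2, 2],
    "water": [3, 1, 2, 3, 1, 2],
}
case = [0, 0, 0, 1, 1, 1]
ranking = rank_features(data, case)
```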
4.3. Comparison with the state-of-the-art knowledge
To date, there have been many attempts to predict colorectal cancer risk in general population settings [4,21,29,33]. In this section, we assess the performance of a CRC prediction model built with 46 features (29 SNPs and 17 environmental) selected by experts in the field according to state-of-the-art knowledge. Table 6 shows this list of features.
Each of the classifiers considered in this study was assessed in terms of AUC with four different feature sets (see Table 7): the set selected by the experts (46 features), a feature set with the 28 variables that are common to the experts' set and the Top-40 union (that is, keeping only the features suggested by the experts that were found in a relevant position in our study) and, finally, the Top-40 union set (64 features that appear in Table 5).
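The comparison described above can be sketched as follows: the reduced set is the intersection of the experts' set with the Top-40 union, and each classifier's scores are summarized by the AUC. This is only a minimal sketch; the feature names, labels and scores are hypothetical placeholders, and the AUC is computed here with the pairwise (Mann-Whitney) formulation rather than with whatever tool the study used.

```python
def auc(labels, scores):
    """AUC as the fraction of (case, control) pairs in which the case scores higher.

    Ties between a case and a control count as half a win.
    """
    cases = [s for label, s in zip(labels, scores) if label == 1]
    controls = [s for label, s in zip(labels, scores) if label == 0]
    if not cases or not controls:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


# Hypothetical feature sets (placeholders, not the sets used in the study).
expert_set = {"age", "sex", "red_meat", "snp_a", "snp_b"}
top40_union = {"age", "red_meat", "fiber", "snp_a", "zinc"}

# Features suggested by the experts that were also ranked as relevant.
common_set = sorted(expert_set & top40_union)

# Hypothetical classifier outputs for four subjects (1 = case, 0 = control).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
example_auc = auc(y_true, y_score)
```

In practice the same `auc` call would be repeated once per classifier and per feature set, giving a grid of values like the one reported in Table 7.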
It can be observed that performance is not degraded when the feature set is reduced from the 46 features provided by the experts to 28 features. Removing the features that were not found relevant in our study either maintains or increases the AUC (Table 7). Thus, the AUC increases from 0.652 to 0.667 for the SVM classifier and from 0.679 to 0.683 for the LR approach. It is notewor-