Article
Comparative evaluation of classifiers in the presence of statistical interaction between features in high-dimensionality data settings
International Journal of Biostatistics (2012)
  • Yu Guo
  • Raji Balasubramanian, University of Massachusetts - Amherst
Abstract

Background: A central challenge in high dimensional data settings in biomedical investigations involves the estimation of an optimal prediction algorithm to distinguish between different disease phenotypes. A significant complication in these analyses is the presence of features that exhibit statistical interactions. Indeed, in several clinical investigations such as genetic studies of complex diseases, it is of interest to specifically identify such features. In this paper, we compare the performance of four commonly used classifiers (K-Nearest Neighbors, Prediction Analysis for Microarrays, Random Forests and Support Vector Machines) in settings involving high dimensional datasets that include statistically interacting feature subsets. We evaluate the performance of these classifiers under conditions of varying sample size, signal-to-noise ratio and strength of statistical interactions among features. We summarize two datasets from studies in diabetes and cardiovascular disease involving gene expression, metabolomics and proteomics measurements and compare results obtained using the four classifiers.

Results: Simulation studies revealed that Prediction Analysis for Microarrays had the highest classification accuracy when noise and statistical interactions were absent and feature distributions were multivariate Gaussian within each class. In the presence of statistical interactions, modest effect sizes and the absence of noise, Support Vector Machines achieved the best performance, followed closely by Random Forests. Random Forests was optimal in settings that included both significant levels of high dimensional noise features and statistical interactions between biomarker pairs. The data applications revealed similar trends in the relative performance of the four classifiers.

Conclusion: Random Forests had the highest classification accuracy among the four classifiers and was successful in incorporating interaction effects between features in the presence of noise in high dimensional datasets.
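
The distinction drawn above between settings with and without statistical interactions can be illustrated with a small simulation. The following is a minimal sketch and not the authors' simulation design: the function names (`simulate`, `class_mean_gap`), the sample size, the number of noise features and the sign-based labelling rule are all assumptions chosen for illustration. It generates two signal features whose product alone determines the class, pads them with pure noise features, and compares the class-mean difference of each signal feature with that of their product.

```python
import random
import statistics

def simulate(n, n_noise=0, seed=0):
    """Draw n samples. Hypothetical rule: class 1 when the two signal
    features share a sign, class 0 otherwise (a pure interaction)."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        noise = [rng.gauss(0, 1) for _ in range(n_noise)]  # uninformative features
        rows.append([x1, x2] + noise)
        labels.append(1 if x1 * x2 > 0 else 0)
    return rows, labels

def class_mean_gap(rows, labels, j):
    """Absolute difference between the two class means of feature j."""
    in_class_1 = [r[j] for r, y in zip(rows, labels) if y == 1]
    in_class_0 = [r[j] for r, y in zip(rows, labels) if y == 0]
    return abs(statistics.fmean(in_class_1) - statistics.fmean(in_class_0))

rows, labels = simulate(n=2000, n_noise=5, seed=1)

# Each signal feature on its own has almost the same mean in both classes.
marginal_gaps = [class_mean_gap(rows, labels, j) for j in range(2)]

# The product of the two signal features separates the classes clearly.
product_rows = [[r[0] * r[1]] for r in rows]
product_gap = class_mean_gap(product_rows, labels, 0)

print("marginal gaps:", marginal_gaps)
print("product gap:", product_gap)
```

Under this assumed rule, neither signal feature differs in mean between the classes, so a method that relies on per-feature class means has little to work with, while a method that can combine features may still recover the signal. This is only a toy illustration of why interactions complicate classification; it does not reproduce the comparisons reported in the abstract.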

Keywords
  • classifiers
  • high dimensional data
  • statistical interactions
Publication Date
2012
Publisher Statement
DOI: 10.1515/1557-4679.1373
Citation Information
Yu Guo and Raji Balasubramanian. "Comparative evaluation of classifiers in the presence of statistical interaction between features in high-dimensionality data settings" International Journal of Biostatistics (2012)
Available at: http://works.bepress.com/raji_balasubramanian/7/