Knowledge discovery from large data sets using classic data mining techniques has proved difficult because of their large size in both dimensionality and number of samples. In real applications, data sets often contain many noisy, redundant, and irrelevant features, which degrade classification accuracy and increase complexity exponentially. Because of this inherent nature, analyzing the quality of data sets is difficult, and very few approaches to this issue can be found in the literature. This paper presents a novel method to investigate the quality and structure of data sets, i.e., to analyze whether noisy and irrelevant features are embedded in them. To do so, a wrapper-based feature selection method using a genetic algorithm and an external classifier is employed to select the discriminative features. The importance of each feature is ranked by the frequency with which it appears in the selected chromosomes. The effectiveness of the proposed idea is investigated and discussed using several sample data sets.
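
The following is a minimal sketch of the general idea, not the implementation used in the paper. It assumes a nearest-centroid rule as the external classifier, a small per-feature penalty in the fitness, single-point crossover with elitist truncation selection, synthetic data, and frequency counts taken over the best chromosome of several independent runs. All function names and parameter values (`make_data`, `run_ga`, `rank_features`, `penalty`, and so on) are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def make_data(rng, n_samples, n_informative=2, n_noise=6):
    """Synthetic two-class data (assumption): informative features shift with the label."""
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # alternate labels so both classes are always present
        informative = [4.0 * label + rng.gauss(0, 1) for _ in range(n_informative)]
        noise = [rng.gauss(0, 1) for _ in range(n_noise)]
        X.append(informative + noise)
        y.append(label)
    return X, y


def nearest_centroid_accuracy(X_train, y_train, X_test, y_test, mask):
    """External classifier (assumption): nearest centroid on the selected features."""
    idx = [i for i, bit in enumerate(mask) if bit]
    if not idx:
        return 0.0  # an empty feature subset cannot classify anything
    centroids = {}
    for label in set(y_train):
        rows = [x for x, lab in zip(X_train, y_train) if lab == label]
        centroids[label] = [sum(r[i] for r in rows) / len(rows) for i in idx]
    correct = 0
    for x, lab in zip(X_test, y_test):
        pred = min(
            centroids,
            key=lambda c: sum((x[i] - m) ** 2 for i, m in zip(idx, centroids[c])),
        )
        correct += pred == lab
    return correct / len(y_test)


def run_ga(fitness, n_features, rng, pop_size=20, generations=15, p_mut=0.05):
    """Simple GA over bit-mask chromosomes; returns the fittest chromosome found."""
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # single-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < p_mut) for bit in child]  # bit-flip mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)


def rank_features(chromosomes, n_features):
    """Rank features by how often they appear in the selected chromosomes."""
    counts = Counter()
    for chrom in chromosomes:
        for i, bit in enumerate(chrom):
            if bit:
                counts[i] += 1
    return sorted(((i, counts[i]) for i in range(n_features)),
                  key=lambda t: (-t[1], t[0]))


def demo(n_runs=5, penalty=0.01):
    rng = random.Random(0)
    X, y = make_data(rng, 80)
    half = len(X) // 2
    X_tr, y_tr, X_te, y_te = X[:half], y[:half], X[half:], y[half:]

    def fitness(mask):
        # Wrapper fitness: classifier accuracy minus an assumed size penalty.
        acc = nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te, mask)
        return acc - penalty * sum(mask)

    n_features = len(X[0])
    best = [run_ga(fitness, n_features, rng) for _ in range(n_runs)]
    for index, freq in rank_features(best, n_features):
        print(f"feature {index}: selected in {freq}/{n_runs} runs")


if __name__ == "__main__":
    demo()
```

In this sketch, features that are rarely or never selected across runs are candidates for being noisy or irrelevant, while consistently selected features are candidates for being discriminative. The cut-off between the two groups is left to the analyst.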
Available at: http://works.bepress.com/craig_valli/62/