Feature Selection Method Based on Category-weighted and Variance Statistics
-
Abstract
To improve the accuracy and stability of text classification on unbalanced datasets, a feature selection method based on a category-weighting strategy and a variance statistics strategy is proposed. First, larger weights are assigned to rare categories, so that the features characterizing those categories are strengthened and performance on rare categories improves. Second, a variance statistics method is introduced for feature selection. Finally, based on these two strategies, a new feature selection algorithm combined with Information Gain (IG) and the χ²-statistic (CHI) is developed. Experiments on the Reuters-21578 corpus and the Fudan corpus (both unbalanced datasets) show that the new algorithm achieves better MicroF1 and MacroF1 scores than IG, CHI, and DFICF.
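The category-weighting idea can be illustrated with a minimal sketch. The abstract does not give the actual weighting formula, so everything here is an assumption: the weight `log(N / n_c) + 1` and the function names `category_weights` and `weighted_chi` are hypothetical choices made for illustration, and the variance statistics step is not modeled. Only the per-category χ² statistic follows the standard 2x2 contingency-table definition.

```python
from math import log


def chi_square(a, b, c, d):
    """Standard chi-square statistic for a term/category 2x2 table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def category_weights(doc_counts):
    """Hypothetical weighting (an assumption, not the paper's formula):
    rarer categories receive larger weights via log(N / n_c) + 1."""
    total = sum(doc_counts.values())
    return {cat: log(total / n) + 1.0 for cat, n in doc_counts.items()}


def weighted_chi(term_counts, doc_counts):
    """Sum of per-category chi-square scores, scaled by category weight.

    term_counts[cat]: documents in `cat` that contain the term
    doc_counts[cat]:  total documents in `cat`
    """
    total_docs = sum(doc_counts.values())
    total_term = sum(term_counts.values())
    weights = category_weights(doc_counts)
    score = 0.0
    for cat, n_cat in doc_counts.items():
        a = term_counts.get(cat, 0)
        b = total_term - a
        c = n_cat - a
        d = total_docs - n_cat - b
        score += weights[cat] * chi_square(a, b, c, d)
    return score


# Toy unbalanced corpus: 90 documents in one category, 10 in another.
doc_counts = {"common": 90, "rare": 10}
rare_term_score = weighted_chi({"rare": 9}, doc_counts)
common_term_score = weighted_chi({"common": 9}, doc_counts)
```

In this toy corpus, a term concentrated in the rare category receives a higher score than a term that appears equally often but only in the common category, which is the kind of effect the weighting strategy aims for.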