JI Jun-zhong, WU Jin-yuan, WU Chen-sheng, DU Fang-hua. Feature Selection Method Based on Category-weighted and Variance Statistics[J]. Journal of Beijing University of Technology, 2014, 40(10): 1593-1602.

Feature Selection Method Based on Category-weighted and Variance Statistics

  • Received Date: November 12, 2013
  • Available Online: January 10, 2023
  • To improve the accuracy and stability of text classification on unbalanced datasets, a feature selection method based on a category-weighting strategy and a variance statistics strategy was proposed. First, larger weights were assigned to rare categories, so that the features characterizing those categories were strengthened and classification performance on rare categories could be improved. Then, a variance statistics method was presented to perform feature selection. Finally, based on the two strategies, a new feature selection algorithm combined with Information Gain (IG) and the χ²-statistic (CHI) was developed. Experiments on the Reuters-21578 corpus and the Fudan corpus, both unbalanced datasets, show that the new algorithm achieves better MicroF1 and MacroF1 than IG, CHI, and DFICF.
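The abstract names two ingredients, category weighting and variance statistics, but does not give their formulas. The sketch below is therefore only an illustration of the general idea and not the authors' algorithm. It assumes inverse-category-frequency weights (so rare categories count more), the standard χ² statistic for a term/category contingency table, and the variance of a term's per-category document rate as the variance statistic. The function names `chi_square` and `score_terms`, and the choice to multiply the weighted χ² sum by the variance, are hypothetical.

```python
from collections import Counter
from statistics import pvariance


def chi_square(a, b, c, d):
    """Standard chi-square statistic for a 2x2 term/category table.

    a: docs in the category that contain the term
    b: docs outside the category that contain the term
    c: docs in the category without the term
    d: docs outside the category without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def score_terms(docs, labels):
    """Score every term; docs is a list of token collections, one label per doc."""
    n = len(docs)
    cat_sizes = Counter(labels)

    # ASSUMPTION: weight each category by inverse frequency, normalized to sum to 1.
    raw = {cat: n / size for cat, size in cat_sizes.items()}
    total = sum(raw.values())
    weights = {cat: w / total for cat, w in raw.items()}

    df = Counter()      # document frequency of each term
    df_cat = Counter()  # document frequency of each (term, category) pair
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_cat[(term, label)] += 1

    scores = {}
    for term in df:
        weighted_chi = 0.0
        rates = []
        for cat, size in cat_sizes.items():
            a = df_cat[(term, cat)]
            b = df[term] - a
            c = size - a
            d = n - size - b
            weighted_chi += weights[cat] * chi_square(a, b, c, d)
            rates.append(a / size)
        # ASSUMPTION: a term spread evenly over categories (low variance of its
        # per-category rate) is less discriminative, so variance scales the score.
        scores[term] = weighted_chi * pvariance(rates)
    return scores


# Toy unbalanced corpus: four "grain" documents, two "metal" documents.
docs = [
    {"wheat", "price"}, {"wheat", "export"}, {"wheat", "price"},
    {"corn", "price"}, {"gold", "price"}, {"gold", "mine"},
]
labels = ["grain"] * 4 + ["metal"] * 2
scores = score_terms(docs, labels)
top_terms = sorted(scores, key=scores.get, reverse=True)[:3]
```

In this toy corpus, "gold" appears only in the rare category and scores far higher than "price", which occurs in both categories at similar rates. Any real comparison against IG, CHI, or DFICF would need the formulas from the full paper.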
