A Joint Feature Selection Method Based on Document Frequency and CHI, With Application to Web Page Categorization
Abstract: This paper analyzes two feature selection methods: the CHI statistic, which uses class information, and document frequency (DF), which is independent of class. Building on this analysis, it proposes a joint feature selection method that combines document frequency with the CHI statistic to select term features with strong discriminative power and thereby improve web page categorization. A web page categorization system built on the joint method ranked third in Macro-F1 among the eight teams participating in the SEWM2007 categorization evaluation.
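The abstract does not state how the two criteria are combined, so the following is only a minimal sketch of one plausible combination: terms are first filtered by a document frequency threshold and the survivors are then ranked by their maximum CHI score over all classes. The function names, the `min_df` and `top_k` parameters, the max-over-classes aggregation, and the toy corpus are all assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter, defaultdict


def chi_square(a, b, c, d):
    """CHI statistic for one (term, class) pair from a 2x2 contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def select_features(docs, labels, min_df=2, top_k=3):
    """Hypothetical joint selection: DF filter first, then rank by max CHI."""
    n = len(docs)
    class_sizes = Counter(labels)
    df = Counter()                      # class-independent document frequency
    df_by_class = defaultdict(Counter)  # per-class document frequency
    for tokens, label in zip(docs, labels):
        for term in set(tokens):        # count each term once per document
            df[term] += 1
            df_by_class[label][term] += 1

    scores = {}
    for term, term_df in df.items():
        if term_df < min_df:            # DF step: drop rare terms
            continue
        best = 0.0
        for label, size in class_sizes.items():
            a = df_by_class[label][term]
            b = term_df - a
            c = size - a
            d = n - size - b
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best             # CHI step: keep the best class score

    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Toy corpus (invented for illustration): tokenized pages with class labels.
docs = [
    ["goal", "match", "team"],
    ["goal", "team", "score"],
    ["stock", "market", "team"],
    ["stock", "price", "market"],
]
labels = ["sports", "sports", "finance", "finance"]

print(select_features(docs, labels, min_df=2, top_k=3))
```

On this toy corpus, "team" passes the DF filter because it is frequent, yet it ranks below "goal", "market", and "stock" because it appears in both classes. Conversely, "match" and "price" occur in only one class but are dropped by the DF threshold as too rare. This illustrates the general motivation for pairing a class-independent criterion with a class-aware one.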