Semi-supervised Text Classification Algorithm Based on Feature Mapping

    • Abstract: Many semi-supervised text classification algorithms rely on the data distribution, and their performance can degrade when the labeled data and the unlabeled data are distributed differently. To address this shortcoming, a semi-supervised text classification algorithm based on feature mapping is proposed. First, different feature selection methods are used to select a separate feature set from each of the labeled training data, the unlabeled training data, and the test data, and the feature weights are initialized. Next, three feature mapping functions are built, between the labeled and unlabeled data, between the labeled and test data, and between the unlabeled and test data, and the feature weights are recalculated with these three functions. Finally, the expectation maximization (EM) algorithm is used to perform semi-supervised text classification. Experimental results on standard data sets show that the proposed algorithm is effective.
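
    The final EM step can be illustrated with a minimal sketch. The abstract does not name the base classifier or give the form of the three feature mapping functions, so the code below makes two loud assumptions: it uses a multinomial naive Bayes model, which is a common pairing with EM for text, and it takes the (already reweighted) document-term matrices as given rather than reproducing the feature mapping. All function names and the toy data are hypothetical.

```python
import numpy as np

def m_step(X, resp, alpha=1.0):
    """Estimate log priors and log word probabilities from soft class labels.

    X: (n_docs, n_features) term weights; resp: (n_docs, n_classes) memberships.
    alpha is a Laplace smoothing constant (an assumption, not from the paper).
    """
    n_classes = resp.shape[1]
    priors = (resp.sum(axis=0) + alpha) / (resp.sum() + alpha * n_classes)
    word_counts = resp.T @ X + alpha                  # (n_classes, n_features)
    word_probs = word_counts / word_counts.sum(axis=1, keepdims=True)
    return np.log(priors), np.log(word_probs)

def e_step(X, log_priors, log_word_probs):
    """Posterior class probabilities of each document under the current model."""
    log_joint = X @ log_word_probs.T + log_priors
    log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)

def em_naive_bayes(X_lab, y_lab, X_unl, n_classes, n_iter=20):
    """Semi-supervised EM: labeled documents keep their labels, unlabeled ones get soft labels."""
    hard = np.eye(n_classes)[y_lab]                   # one-hot labels, never updated
    params = m_step(X_lab, hard)                      # initialise from labeled data only
    X_all = np.vstack([X_lab, X_unl])
    for _ in range(n_iter):
        soft = e_step(X_unl, *params)                 # E-step on unlabeled data
        params = m_step(X_all, np.vstack([hard, soft]))  # M-step on all data
    return params

def predict(X, params):
    return e_step(X, *params).argmax(axis=1)

# Toy data: features 0-1 are typical of class 0, features 2-3 of class 1.
X_lab = np.array([[3, 1, 0, 0], [0, 0, 2, 3]], dtype=float)
y_lab = np.array([0, 1])
X_unl = np.array([[2, 2, 0, 1], [0, 1, 3, 2], [4, 0, 0, 0], [0, 0, 1, 4]], dtype=float)
X_test = np.array([[1, 2, 0, 0], [0, 0, 2, 1]], dtype=float)

params = em_naive_bayes(X_lab, y_lab, X_unl, n_classes=2)
pred = predict(X_test, params)
print(pred)
```

    In the proposed algorithm, the matrices passed to this step would presumably carry the feature weights recalculated by the three mapping functions; here they are plain term counts.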

       
