Adaptive Method for Chinese New Word Identification Based on Multi-features (基于多特征的自适应新词识别)

    • Abstract: To improve how well automatic Chinese word segmentation systems recognize out-of-vocabulary words, the authors propose and implement an adaptive, multi-feature method for new word identification in offline corpus processing. Candidate words are the repeated strings in the text being processed, and each is evaluated with several features: contextual statistics (context entropy), internal cohesion (likelihood ratio), contrast with a background corpus (relative frequency ratio), and boundary verification assisted by a basic segmenter. The features are integrated into an adaptive SVM classifier that is trained automatically on the text from which the words are extracted. Candidates are extracted on a PAT-Array built over the character string, with low space overhead, so new words of arbitrary length can be identified. Experimental results show that the method discovers new words quickly and saves storage space.
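The context-entropy feature named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a plain sorted list of suffix positions stands in for the PAT-Array, the function names (`build_suffix_array`, `repeated_ngrams`, `context_entropy`) are hypothetical, and the naive construction is chosen for readability, not speed. The idea is that a repeated string whose left and right neighbours vary widely (high entropy on both sides) is more likely to be a free-standing word.

```python
import math
from collections import Counter


def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order.

    Naive construction, used here only as a stand-in for a PAT-Array.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def repeated_ngrams(text, n, min_freq=2):
    """Map each n-gram that occurs at least min_freq times to its positions.

    Equal n-grams are adjacent in suffix order, so a single pass over the
    sorted suffixes collects all occurrences of every n-gram.
    """
    groups = {}
    for pos in build_suffix_array(text):
        gram = text[pos:pos + n]
        if len(gram) == n:  # skip suffixes shorter than n
            groups.setdefault(gram, []).append(pos)
    return {g: ps for g, ps in groups.items() if len(ps) >= min_freq}


def entropy(counter):
    """Shannon entropy (in bits) of the distribution given by the counts."""
    total = sum(counter.values())
    return sum(-(c / total) * math.log2(c / total) for c in counter.values())


def context_entropy(text, positions, n):
    """Return (left, right) entropy of the characters adjacent to an n-gram.

    Text boundaries are counted as the pseudo-characters <s> and </s>.
    """
    left = Counter(text[p - 1] if p > 0 else "<s>" for p in positions)
    right = Counter(
        text[p + n] if p + n < len(text) else "</s>" for p in positions
    )
    return entropy(left), entropy(right)
```

As a usage example, `repeated_ngrams(text, 2)` yields the candidate two-character strings, and `context_entropy(text, positions, 2)` scores each one. In the method described above, such scores would be only one of several features passed to the classifier.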

       
