Semi-structured Text Information Extraction Based on the Boosting Algorithm

Abstract: To extract information automatically from semi-structured text, an information extraction method based on the Boosting algorithm is presented. The method automatically generates a rule from a single training example, applies that rule to the set of positive examples, and updates the weight distribution over that set; the positive example with the largest weight is then used to generate the next rule. A pattern-matching constraint that can describe words which do not conform to English lexical rules is also given. Experiments show that, when learning extraction rules with simple features, both precision and recall can reach 100%. When learning extraction rules with more complex features, the F₁ measure still exceeds 80%.
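
The rule-learning loop summarized in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the names generalize and learn_rules, the use of regular expressions as the rule language, the initial uniform weights, and the update that multiplies the weight of each covered example by a factor beta before renormalizing. The abstract only states that a rule is generated from one example, that the weights of the positive examples are changed after the rule is applied, and that the highest-weight positive example seeds the next rule.

```python
import re


def generalize(example: str) -> str:
    """Hypothetical rule generator: build a regex from one example by
    generalizing digit runs and ASCII letter runs, keeping other
    characters literal."""
    parts = []
    for match in re.finditer(r"\d+|[A-Za-z]+|.", example, re.S):
        token = match.group(0)
        if token.isdigit():
            parts.append(r"\d+")
        elif token.isascii() and token.isalpha():
            parts.append("[A-Za-z]+")
        else:
            parts.append(re.escape(token))
    return "".join(parts)


def learn_rules(positives, beta=0.5, max_rules=10):
    """Boosting-style covering loop (assumed update scheme, see lead-in)."""
    n = len(positives)
    weights = [1.0 / n] * n          # assumed: start from uniform weights
    covered = [False] * n
    rules = []
    while len(rules) < max_rules and not all(covered):
        # Seed the next rule with the highest-weight positive example.
        seed = max(range(n), key=lambda i: weights[i])
        rule = re.compile(generalize(positives[seed]))
        rules.append(rule)
        # Down-weight every positive example the new rule already covers.
        for i, example in enumerate(positives):
            if rule.fullmatch(example):
                covered[i] = True
                weights[i] *= beta
        # Renormalize so the weights remain a distribution.
        total = sum(weights)
        weights = [w / total for w in weights]
    return rules


if __name__ == "__main__":
    examples = ["2004-05-17", "1999-12-01", "May 2004", "Oct 1998"]
    for learned in learn_rules(examples):
        print(learned.pattern)
```

Because covered examples lose weight after every round, the next seed is drawn from the examples that no existing rule explains yet, which is the behaviour the abstract describes. Negative examples and rule pruning are left out of this sketch.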