USPTO Patent Information Extraction System Based on XML

Abstract: To provide basic data for research on Beijing's intellectual property early-warning capability and the competitiveness of its high-tech industries, patent information can be retrieved as dynamic web pages by searching the online patent database of the United States Patent and Trademark Office (USPTO). Based on XML-related technologies, this paper proposes techniques and a method for extracting the patent data from these web pages into a local relational database. Pages are first filtered by regular expression matching; each remaining page is parsed into a Document Object Model (DOM) and cleaned; patent information is then extracted with Extensible Stylesheet Language Transformations (XSLT) templates and stored in the relational database through object mapping. A prototype patent information extraction system was designed and implemented. Experimental results show that the prototype achieves high recall and precision.
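
The pipeline described in the abstract (filter, parse and clean, extract, store) can be illustrated with a minimal sketch. It is not the authors' implementation. The sample page, the `class` attribute names, the `Patent` fields, and the table schema are all hypothetical. Because the Python standard library has no XSLT processor, ElementTree path queries stand in for the XSLT templates here; a real system would apply a stylesheet with an XSLT engine and would first repair malformed HTML.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET
from dataclasses import dataclass, astuple

# Hypothetical page that is already well-formed; real result pages are
# messier HTML and would need a cleaning/repair step before parsing.
SAMPLE_PAGE = """<html><head><title>United States Patent: 1234567</title></head>
<body>
  <table id="biblio">
    <tr><td class="number">1234567</td></tr>
    <tr><td class="title">Example widget</td></tr>
    <tr><td class="inventor">Doe; Jane</td></tr>
  </table>
</body></html>"""

# Step 1: regular expression used to keep only patent detail pages.
# The title pattern is an assumption made for this sketch.
PATENT_PAGE = re.compile(
    r"<title>\s*United States Patent:\s*\d+\s*</title>", re.IGNORECASE
)


@dataclass
class Patent:
    """Hypothetical object that is mapped onto one row of the patent table."""
    number: str
    title: str
    inventor: str


def is_patent_page(page: str) -> bool:
    """Page filtering by regular expression matching."""
    return PATENT_PAGE.search(page) is not None


def extract(page: str) -> Patent:
    """Parse the page into a tree and pull out the fields.

    ElementTree path queries replace the XSLT templates of the paper.
    """
    root = ET.fromstring(page)

    def cell(css_class: str) -> str:
        node = root.find(f".//td[@class='{css_class}']")
        return (node.text or "").strip() if node is not None else ""

    return Patent(cell("number"), cell("title"), cell("inventor"))


def store(conn: sqlite3.Connection, patent: Patent) -> None:
    """Object mapping: one Patent instance becomes one table row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS patent "
        "(number TEXT PRIMARY KEY, title TEXT, inventor TEXT)"
    )
    conn.execute("INSERT OR REPLACE INTO patent VALUES (?, ?, ?)", astuple(patent))
    conn.commit()


def run_pipeline(pages, conn: sqlite3.Connection) -> int:
    """Filter, extract, and store; returns the number of patents stored."""
    stored = 0
    for page in pages:
        if not is_patent_page(page):
            continue  # discard pages that are not patent detail pages
        store(conn, extract(page))
        stored += 1
    return stored
```

Calling `run_pipeline([SAMPLE_PAGE], sqlite3.connect(":memory:"))` stores a single row. Keeping the filter, extraction, and storage steps as separate functions mirrors the staged design in the abstract and lets each stage be replaced independently, for example by swapping `extract` for a real XSLT transformation.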