A Collection Strategy Model for a Focused Crawler for Mongolian Websites

Abstract: A focused crawler for Mongolian-language topics faces two core problems: predicting which URLs to collect and discovering tunnels. To address them, this paper proposes a collection model based on topic groups that combines site clustering, site ranking, and tunnel discovery. First, topic identification at the site level divides the URLs awaiting crawling into site links and non-site links. Second, a priority-ordering algorithm for predicted URLs is built from text similarity and hyperlink graph analysis, and a site-adaptive tunnel discovery algorithm is designed at site granularity. Finally, a focused web crawler system for Mongolian-language topics is constructed. Experimental results show that the proposed approach improves collection precision, the total amount of information collected, and the collection rate, and clearly outperforms the baseline algorithm.
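
The abstract names the ingredients of the URL ordering and tunnel discovery steps but not their formulas, so the following is only a minimal sketch of the general idea, not the paper's algorithm. It assumes a linear blend of anchor-text cosine similarity and a normalized in-link count as the priority, and a fixed cap on consecutive off-topic hops as a stand-in for tunneling. The names `priority`, `Frontier`, `alpha`, and `max_tunnel_depth`, the whitespace tokenization, and the English sample words are all hypothetical; real Mongolian text would need a suitable tokenizer.

```python
import heapq
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def priority(anchor_text: str, topic: Counter, inlinks: int,
             max_inlinks: int, alpha: float = 0.7) -> float:
    """Blend text similarity with a simple link-based score.

    alpha is an assumed weighting; the source does not give one.
    """
    text_score = cosine(Counter(anchor_text.lower().split()), topic)
    link_score = inlinks / max_inlinks if max_inlinks else 0.0
    return alpha * text_score + (1 - alpha) * link_score


class Frontier:
    """Priority queue of URLs with a cap on consecutive off-topic hops."""

    def __init__(self, max_tunnel_depth: int = 2):
        self._heap = []
        self._seen = set()
        self.max_tunnel_depth = max_tunnel_depth

    def push(self, url: str, score: float, parent_on_topic: bool,
             parent_tunnel_depth: int = 0) -> bool:
        # Tunnel depth counts consecutive off-topic ancestors of this URL.
        depth = 0 if parent_on_topic else parent_tunnel_depth + 1
        if url in self._seen or depth > self.max_tunnel_depth:
            return False  # duplicate, or the tunnel has grown too long
        self._seen.add(url)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, url, depth))
        return True

    def pop(self):
        neg_score, url, depth = heapq.heappop(self._heap)
        return url, -neg_score, depth


topic = Counter("mongolian news culture".split())
frontier = Frontier(max_tunnel_depth=2)
frontier.push("http://example.org/a", priority("Mongolian news", topic, 5, 10), True)
frontier.push("http://example.org/b", priority("sports", topic, 1, 10), False, 0)
best_url, best_score, best_depth = frontier.pop()
```

In this sketch a link found on an off-topic page is still queued as long as the chain of off-topic pages leading to it stays within `max_tunnel_depth`, which lets the crawler pass through a few irrelevant pages to reach relevant ones. A site-adaptive variant, as the abstract describes, would presumably adjust that limit per site rather than use one global constant.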