
Online Public Opinion Hotspot Detection and Analysis Based on Short Text Clustering Using String Distance

YANG Zhen, DUAN Li-juan, LAI Ying-xu

Citation: YANG Zhen, DUAN Li-juan, LAI Ying-xu. Online Public Opinion Hotspot Detection and Analysis Based on Short Text Clustering Using String Distance[J]. Journal of Beijing University of Technology, 2010, 36(5): 669-673.


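Funding:

National Basic Research Program of China (973 Program) (2007CB311100)

Beijing Natural Science Foundation (4102012, 4102013)

Science and Technology Development Program of the Beijing Municipal Education Commission, General Project (KM200810005030)

Youth Science Foundation of Beijing University of Technology.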

Details
    Author biography:

    YANG Zhen (born 1979), male, from Liupanshui, Guizhou; lecturer.

  • CLC number: TP393


  • Abstract: The distinctive language characteristics of short texts degrade the performance of traditional natural language processing methods, or even make them unusable. Accurately representing short texts and computing the similarity between them is therefore very helpful for content-based clustering. This paper treats each short text as a string of characters, digits, and punctuation, and computes similarity directly from the properties of the strings themselves. On this basis, the short texts are clustered hierarchically in order to detect online public opinion hotspots. Because the method skips feature extraction and text representation, it avoids, to a certain extent, the sparse feature vectors that hamper traditional short-text representations, and it effectively addresses content-based clustering of short texts. Experimental results show that the proposed method is effective.
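The pipeline described in the abstract (string similarity, then hierarchical clustering, then hotspots) can be illustrated with a minimal sketch. The abstract does not give the exact similarity formula or linkage rule, so everything here is an assumption made for illustration: similarity is taken to be normalized Levenshtein edit distance, clustering is average-linkage agglomerative merging, and the function names and the 0.5 stopping threshold are hypothetical rather than the authors' choices.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed measure: 1 - distance / longer length, so 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def cluster(texts, threshold=0.5):
    """Average-linkage agglomerative clustering over raw strings.

    Merging stops once no pair of clusters has an average pairwise
    similarity of at least `threshold` (an illustrative value).
    Returns clusters as lists of indices into `texts`.
    """
    sim = {(i, j): similarity(texts[i], texts[j])
           for i, j in combinations(range(len(texts)), 2)}
    clusters = [[i] for i in range(len(texts))]

    def linkage(c1, c2):
        total = sum(sim[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (x, y), best = max(
            (((x, y), linkage(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1])
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]  # x < y, so deleting y is safe
        del clusters[y]
    return clusters


def hotspots(texts, threshold=0.5):
    """Rank clusters by size; the largest groups are candidate hotspots."""
    groups = cluster(texts, threshold)
    return sorted(([texts[i] for i in g] for g in groups), key=len, reverse=True)


if __name__ == "__main__":
    messages = [
        "flight delayed again",
        "flight delayed again!",
        "my flight delayed again",
        "happy birthday to you",
        "happy birthday 2 you",
    ]
    for group in hotspots(messages):
        print(len(group), group)
```

No tokenization or feature vector appears anywhere in the sketch, which is the point the abstract makes about avoiding sparse representations. The pairwise comparison is quadratic in the number of texts, so a real system would likely need blocking or sampling for large collections.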
Publication history
  • Received:  2009-12-09
  • Published online:  2022-12-14
