Online Public Opinion Hotspot Detection and Analysis Based on Short Text Clustering Using String Distance

YANG Zhen; DUAN Li-juan; LAI Ying-xu


College of Computer Science, Beijing University of Technology, Beijing 100124, China

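Funding:

National Basic Research Program of China ("973" Program) (2007CB311100)

Beijing Natural Science Foundation (4102012, 4102013)

General Program of the Science and Technology Development Plan of the Beijing Municipal Education Commission (KM200810005030)

Beijing University of Technology Youth Science Foundation.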

Details

About the author:
YANG Zhen (born 1979), male, from Liupanshui, Guizhou; lecturer.

Chinese Library Classification (CLC) number: TP393
Publication history
- Received: 2009-12-09
- Published online: 2022-12-14


Abstract

Abstract: The distinctive language of short texts degrades the performance of traditional natural language processing methods, and can even make them unusable. Representing short texts accurately and computing the similarity between them greatly helps content-based clustering. This paper treats each short text as a string of characters, digits, and punctuation marks, and computes similarity directly from the properties of the strings themselves. The short texts are then clustered hierarchically on this basis to discover online public opinion hotspots. Because the method skips feature extraction and text representation, it avoids, to some extent, the sparse feature vectors that traditional methods produce when representing short texts. Experimental results show that the proposed method is effective.

Keywords:
- public opinion analysis
- short text processing
- hierarchical clustering
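
The pipeline summarized in the abstract (string similarity, then hierarchical clustering, then hotspots) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which string measure or linkage rule is used, so normalized Levenshtein (edit) distance and average linkage are stand-ins, and the function names, the sample messages, and the `threshold` value of 0.5 are all hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (an assumed measure)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def cluster(texts, threshold=0.5):
    """Average-linkage agglomerative clustering over raw strings.

    Repeatedly merges the two most similar clusters until no pair
    reaches `threshold` (a hypothetical cut-off). Returns clusters as
    lists of indices into `texts`, largest cluster first.
    """
    n = len(texts)
    sim = [[similarity(texts[i], texts[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]

    def linkage(c1, c2):
        # Mean pairwise similarity between the members of two clusters.
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        (x, y), best = max(
            ((pair, linkage(clusters[pair[0]], clusters[pair[1]]))
             for pair in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]  # x < y, so deleting y is safe
        del clusters[y]
    return sorted(clusters, key=len, reverse=True)


# Made-up short messages; the largest cluster is the candidate hotspot.
texts = [
    "subway line 4 delayed again this morning",
    "subway line 4 delayed again this evening",
    "subway line 4 is delayed again this morning!",
    "road closed",
    "road closed!",
]
for members in cluster(texts, threshold=0.5):
    print(len(members), [texts[i] for i in members])
```

No tokenization or feature vector appears anywhere in the sketch, which is the property the abstract emphasizes. The pairwise comparison is quadratic in the number of texts, so a real system would likely need indexing or blocking that this sketch omits.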


