Online Public Opinion Hotspot Detection and Analysis Based on Short Text Clustering Using String Distance
Abstract: The distinctive language characteristics of short texts degrade the performance of traditional natural language processing methods, and in some cases make them unusable. Accurately representing short texts and calculating the similarity between them is of great help to content-based clustering. This paper treats each short text document as a string of characters, digits, and punctuation, and computes similarity directly from the properties of the strings themselves. On this basis, the short texts are clustered hierarchically, and a system for detecting and analyzing online public opinion hotspots is built. Because the method skips feature extraction and text representation, it avoids, to a certain extent, the sparse feature vectors that weaken traditional short text representations, and it effectively addresses content-based clustering of short texts. Experimental results show that the proposed method is effective.
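The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not name the string similarity measure or the linkage rule, so the sketch assumes a length-normalized edit (Levenshtein) distance and average-linkage agglomerative clustering. The function names `edit_distance`, `similarity`, and `cluster`, and the `threshold` parameter, are hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; letters, digits, and punctuation are all plain characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed measure: 1 - edit distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def cluster(texts, threshold=0.6):
    """Average-linkage agglomerative clustering over raw strings.

    Merging stops once the most similar pair of clusters falls below
    `threshold` (an assumed stopping rule, not taken from the paper).
    """
    clusters = [[i] for i in range(len(texts))]
    sim = {(i, j): similarity(texts[i], texts[j])
           for i, j in combinations(range(len(texts)), 2)}

    def linkage(c1, c2):
        total = sum(sim[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (x, y), best = max(
            (((p, q), linkage(clusters[p], clusters[q]))
             for p, q in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]  # x < y, so deleting y is safe
        del clusters[y]
    return [[texts[i] for i in c] for c in clusters]


if __name__ == "__main__":
    posts = [
        "earthquake hits city",
        "earthquake hit city",
        "earthquake hits the city",
        "stock market falls",
        "stock markets fall",
    ]
    # Treating the largest clusters as candidate hotspots is an assumption of this sketch.
    for group in sorted(cluster(posts), key=len, reverse=True):
        print(len(group), group)
```

No feature vectors are built at any point: the only input to clustering is the pairwise string similarity, which is the property the abstract credits with avoiding sparse representations.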