Abstract:
The distinctive linguistic characteristics of short texts degrade the performance of traditional natural language processing methods, or even render them unusable. Accurately representing and computing the similarity between short texts is of great help to content-based clustering. This paper treats each short text as a sequence of characters, numbers, and punctuation, and proposes a similarity measure based on string similarity. On this basis, a public opinion hotspot detection and analysis system based on hierarchical clustering of short texts is built. The method computes similarity directly, which to a certain extent skips the feature extraction and representation steps for short texts and avoids the use of sparse feature vectors. Experimental results show the effectiveness of the proposed method.
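
As a rough illustration of the idea, the sketch below clusters short texts using a similarity computed directly on the raw character strings, with no feature vectors. It is only a sketch under stated assumptions: the abstract does not specify the similarity measure, so Python's `difflib.SequenceMatcher` ratio is used as a stand-in, and the average-linkage rule, the function names, and the `threshold` value are hypothetical choices rather than the paper's method.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] computed on raw characters.

    Stand-in measure (assumption): difflib's matching-block ratio,
    not the measure proposed in the paper.
    """
    return SequenceMatcher(None, a, b).ratio()


def cluster_similarity(c1, c2, texts):
    """Average-linkage similarity between two clusters of text indices."""
    total = sum(string_similarity(texts[i], texts[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hierarchical_cluster(texts, threshold=0.8):
    """Agglomerative clustering over short texts.

    Repeatedly merges the most similar pair of clusters and stops when
    no pair reaches the (hypothetical) similarity threshold.
    """
    clusters = [[i] for i in range(len(texts))]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sim = cluster_similarity(clusters[x], clusters[y], texts)
                if sim > best_sim:
                    best_sim, best_pair = sim, (x, y)
        if best_sim < threshold:
            break
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


if __name__ == "__main__":
    posts = [
        "flood warning issued downtown",
        "flood warning issued downtown!!",
        "new phone released today",
        "new phone released today :)",
    ]
    for group in hierarchical_cluster(posts):
        print([posts[i] for i in group])
```

With these sample posts, the two flood messages end up in one cluster and the two phone messages in another; in a hotspot detection setting, larger clusters would presumably correspond to more widely discussed topics.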