Hierarchical Subtrees Agglomerative Clustering Algorithms

Abstract: Traditional hierarchical agglomerative clustering algorithms (HACA) may produce more than one binary tree as the clustering result for the same data set. To address this non-uniqueness, this paper proposes the Hierarchical Subtrees Agglomerative Clustering Algorithm (HSACA). Its basic idea is to find the maximal θ-distant subtrees in a minimum spanning tree of the data set and to merge the vertex set of each such subtree. HSACA can therefore merge many objects into one cluster in a single step, and its clustering result is represented by a multiway tree. The paper proves that this multiway tree is unique for a given data set up to the order of its branches. Computational experiments show that, when the sample contains many pairs of points at equal distances, the multiway tree describes a clearly more reasonable clustering than the binary tree produced by traditional HACA.
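The merging idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the paper's HSACA: the abstract does not define θ-distant subtrees precisely, so the sketch assumes a simplified reading in which, for each distinct edge weight θ of a minimum spanning tree (taken in ascending order), all current clusters linked by edges of weight θ are merged in one step into a single multiway node. The function names `minimum_spanning_tree` and `tie_aware_clustering`, the tolerance `tol`, and the nested `(theta, children)` output format are all assumptions made for illustration.

```python
from math import dist


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (weight, i, j) edges."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        w, i, j = min(
            (dist(points[i], points[j]), i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((w, i, j))
        in_tree.add(j)
    return edges


def tie_aware_clustering(points, tol=1e-9):
    """Assumed simplification, not the paper's exact procedure: for each
    distance level theta, merge every group of clusters linked by MST edges
    of weight theta into one multiway node (theta, [children])."""
    edges = sorted(minimum_spanning_tree(points))
    parent = list(range(len(points)))
    node = {i: i for i in range(len(points))}  # cluster root -> subtree

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    k = 0
    while k < len(edges):
        theta = edges[k][0]
        level = []
        # Gather all MST edges whose weight equals theta (within tol).
        while k < len(edges) and edges[k][0] - theta <= tol:
            level.append(edges[k])
            k += 1
        # Record the clusters touched at this level before merging them.
        touched = set()
        for _, i, j in level:
            touched.update((find(i), find(j)))
        for _, i, j in level:
            parent[find(i)] = find(j)
        # Each connected group of old clusters becomes one multiway node.
        groups = {}
        for r in sorted(touched):
            groups.setdefault(find(r), []).append(node.pop(r))
        for root, children in groups.items():
            node[root] = (theta, children)
    return node[find(0)]


# Four corners of a unit square plus one distant point.
tree = tie_aware_clustering([(0, 0), (0, 1), (1, 1), (1, 0), (5, 0)])
print(tree)  # (4.0, [(1.0, [0, 1, 2, 3]), 4])
```

In this example the four square corners are pairwise tied at distance 1, so a binary merge order would be arbitrary, whereas the sketch joins them in a single four-way node before attaching the distant point at distance 4. The order of children inside a node carries no meaning, in line with the abstract's statement that uniqueness holds up to the order of branches.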