Marker Gene Selection for Gastric Cancer Subtypes Based on Multiple Microarray Data Sets
Abstract: Analyzing gastric cancer microarray data with machine learning methods to discover new genes associated with gastric cancer subtype classification can provide markers and evidence for further study of the molecular mechanism of gastric carcinogenesis and for gene-level diagnosis and treatment. Most existing methods extract marker genes from a single data set; sample sizes are small, and the extracted genes classify other data sets of the same kind poorly. This paper proposes a marker gene selection method that combines a genetic algorithm (GA) with a support vector machine (SVM) and analyzes three gastric cancer microarray data sets in parallel. The experiment was run 4,580 times, and genes were ranked by the number of times they occurred in the final populations of the GA; this occurrence count can, in a sense, represent a gene's significance for classification. The 20 genes with the highest occurrence counts were selected as marker genes, and the ranking identified genes that may play a key role in gastric cancer subtype classification (AGT, FBLN1, and others). With the selected genes, classification accuracy is above 90% in each of the three data sets. Analysis of the biological significance of the selected genes shows that the method identifies gastric cancer subtype classification genes effectively and that the selected genes are important for the diagnosis and subtyping of human gastric tumors.
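The ranking step summarized in the abstract (repeat the GA many times, count how often each gene occurs in the final populations, keep the most frequent genes) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not say how the three data sets are combined into one fitness value, so the sketch assumes the minimum score across data sets. The SVM accuracy is replaced by a hypothetical toy evaluator (`make_toy_evaluator`) so that the example runs with the standard library only. All function names, the selection, crossover and mutation operators, and the parameter values are illustrative choices.

```python
import random
from collections import Counter


def make_toy_evaluator(informative):
    """Hypothetical stand-in for SVM cross-validation accuracy on one data set."""
    informative = frozenset(informative)
    return lambda subset: len(informative.intersection(subset)) / len(informative)


def multi_dataset_fitness(subset, evaluators):
    # Assumption: a gene subset is only as good as its worst data set.
    return min(evaluate(subset) for evaluate in evaluators)


def run_ga(n_genes, subset_size, fitness, pop_size=20, generations=30,
           mutation_rate=0.1, rng=None):
    """Run one GA and return its final population of gene-index tuples."""
    rng = rng or random.Random()
    population = [tuple(sorted(rng.sample(range(n_genes), subset_size)))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: the fitter half survives unchanged.
        parents = sorted(population, key=fitness, reverse=True)[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: draw the child from the union of both parents.
            child = set(rng.sample(sorted(set(a) | set(b)), subset_size))
            if rng.random() < mutation_rate:
                # Mutation: drop one gene and draw a random replacement.
                child.discard(rng.choice(sorted(child)))
                while len(child) < subset_size:
                    child.add(rng.randrange(n_genes))
            children.append(tuple(sorted(child)))
        population = parents + children
    return population


def rank_genes(n_runs, top_k, **ga_kwargs):
    """Count gene occurrences in the final populations of n_runs seeded GA runs."""
    counts = Counter()
    for seed in range(n_runs):
        for subset in run_ga(rng=random.Random(seed), **ga_kwargs):
            counts.update(subset)
    return [gene for gene, _ in counts.most_common(top_k)]


# Three toy "data sets" that share genes 3, 7 and 11 (made-up indices).
evaluators = [make_toy_evaluator(s)
              for s in ({3, 7, 11, 20}, {3, 7, 11, 30}, {3, 7, 11, 40})]
top_genes = rank_genes(
    n_runs=20, top_k=3, n_genes=50, subset_size=5,
    fitness=lambda subset: multi_dataset_fitness(subset, evaluators),
)
print(top_genes)
```

To move from the toy setting toward the one the abstract describes, each evaluator would be replaced by a function that trains an SVM on one data set restricted to the candidate genes and returns its cross-validated accuracy; `n_runs` and `top_k` would then correspond to the 4,580 runs and the top 20 genes reported above.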