
K-Nearest Neighbor Algorithm Based on Information Gain and Gini Impurity


Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:: 29
Issue:: 2019, No. 09

Pages:: 51-54

Column:: Intelligence, Algorithms, and Systems Engineering

Publication date:: 2019-09-10

Article Info

Title:: K-Nearest Neighbor Algorithm Based on Information Gain and Gini Impurity

Article ID:: 1673-629X(2019)09-0051-04

Author(s):: SUN Ao; ZHAO Li-feng; School of Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China

Keywords:: data mining; K-nearest neighbor; information gain; Gini impurity

Classification number:: TP301.6

DOI:: 10.3969/j.issn.1673-629X.2019.09.010

Abstract:: The traditional K-nearest neighbor (KNN) algorithm ignores how much each attribute contributes to classification: it treats all attributes equally and gives them the same weight when computing the distance between samples, which reduces classification accuracy. Judging attribute importance by a single indicator is one-sided and cannot fully reflect an attribute's importance to classification. To address this problem, a composite index combining information gain and Gini impurity is used to measure attribute importance; the larger the composite index, the more important the attribute is to classification. Attribute weights are constructed from the composite index, and samples are classified using the weighted distance between them. To verify the effectiveness of the method, simulation experiments were run on the Iris and Wine datasets from the UCI database, comparing the weighted KNN algorithm based on the composite index with the traditional KNN algorithm and with an information-gain-weighted KNN algorithm. On both datasets, the proposed algorithm had a lower error rate than the other two. The results show that the proposed method classifies samples more effectively than the traditional KNN method and single-indicator weighted KNN algorithms.
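The weighting scheme described in the abstract can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not say how the two indicators are combined, so the code assumes the composite index is the sum of the information gain and the decrease in Gini impurity obtained by splitting on an attribute. It also assumes continuous attributes are discretized into equal-width bins and that weights are normalized to sum to 1. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def discretize(values, bins=3):
    """Map continuous values to equal-width bin indices (assumed preprocessing)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant column
    return [min(int((v - lo) / width), bins - 1) for v in values]


def impurity_decrease(column, labels, impurity):
    """Impurity of the labels minus the size-weighted impurity after splitting."""
    groups = {}
    for v, y in zip(column, labels):
        groups.setdefault(v, []).append(y)
    n = len(labels)
    after = sum(len(g) / n * impurity(g) for g in groups.values())
    return impurity(labels) - after


def attribute_weights(X, y, bins=3):
    """Assumed composite index: information gain + Gini decrease, normalized."""
    scores = []
    for j in range(len(X[0])):
        col = discretize([row[j] for row in X], bins)
        score = impurity_decrease(col, y, entropy) + impurity_decrease(col, y, gini)
        scores.append(max(0.0, score))  # clamp tiny negative rounding errors
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)  # no signal: equal weights
    return [s / total for s in scores]


def weighted_knn_predict(X, y, query, weights, k=3):
    """Majority vote among the k nearest samples under weighted Euclidean distance."""
    dists = sorted(
        (
            (math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, row, query))), label)
            for row, label in zip(X, y)
        ),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Toy data: attribute 0 separates the classes, attribute 1 is noise.
X = [[1.0, 5.0], [1.2, 1.0], [0.9, 3.0], [3.0, 5.1], [3.2, 0.9], [2.9, 3.1]]
y = ["a", "a", "a", "b", "b", "b"]
weights = attribute_weights(X, y)
prediction = weighted_knn_predict(X, y, [1.1, 5.0], weights, k=3)
```

On this toy data the informative attribute receives nearly all of the weight, so the noisy attribute barely affects the distance. Setting every weight to the same value recovers the traditional unweighted KNN that the abstract uses as a baseline.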

