[1] LIU Jia, ZHU Peng-yun, XUN Ya-ling. An Extended Isolation Forest Outlier Detection Algorithm Based on Relevant Subspace[J]. Computer Technology and Development, 2022, 32(10): 26-33. [doi:10.3969/j.issn.1673-629X.2022.10.005]

An Extended Isolation Forest Outlier Detection Algorithm Based on Relevant Subspace

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
32
Issue:
2022, No. 10
Pages:
26-33
Column:
Big Data and Cloud Computing
Publication date:
2022-10-10

Article Info

Title:
An Extended Isolation Forest Outlier Detection Algorithm Based on Relevant Subspace
Article ID:
1673-629X(2022)10-0026-08
Author(s):
LIU Jia (刘佳), ZHU Peng-yun (朱鹏云), XUN Ya-ling (荀亚玲)
School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China
Keywords:
outlier detection; extended isolation forest; relevant subspace; Gaussian mixture model; sparse data area
CLC number:
TP311
DOI:
10.3969/j.issn.1673-629X.2022.10.005
Abstract:
Extended isolation forest, an ensemble outlier detection method, splits data with hyperplanes of random slope; it separates outliers from normal data objects quickly and has low time complexity. However, when the hyperplanes of an isolation tree are chosen in dense regions of the data set or in regions containing irrelevant dimensions, detection performance degrades severely. An extended isolation forest outlier detection algorithm based on the idea and method of relevant subspaces is therefore proposed. The algorithm uses a Gaussian mixture model to determine the relevant subspace of data objects, which ensures that the cutting hyperplanes of an isolation tree are selected in sparse data regions. When a tree branches, the random intercept points of the hyperplanes are preferentially selected in sparse data regions, so that outliers are isolated from those regions quickly and the interference of irrelevant attribute dimensions on the choice of random slope is avoided. The average path length of each data object over the isolation trees is normalized to give its outlier score, and the data objects with the largest outlier scores are reported as outliers. Experiments on UCI data sets validate the effectiveness of the algorithm and show how the sub-sample size, the number of isolation trees, and the number of nearest neighbors affect detection performance.
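
The scoring pipeline summarized in the abstract (hyperplanes with random slope and random intercept, average path length, normalized outlier score, top-scoring objects reported as outliers) can be illustrated with a minimal sketch. This is an assumption-laden illustration of a plain extended isolation forest using the usual isolation forest normalization; it does not implement the paper's Gaussian-mixture relevant-subspace step or its preference for sparse regions, and all function names (`c_factor`, `build_tree`, `path_length`, `outlier_scores`, `top_outliers`) are hypothetical.

```python
import numpy as np

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def c_factor(n):
    """Expected path length of an unsuccessful binary-search-tree lookup among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(X, depth, max_depth, rng):
    """Recursively split X with hyperplanes of random slope and random intercept."""
    n = len(X)
    if depth >= max_depth or n <= 1:
        return {"size": n}  # leaf node
    normal = rng.normal(size=X.shape[1])                   # random slope (normal vector)
    intercept = rng.uniform(X.min(axis=0), X.max(axis=0))  # random point in the bounding box
    go_left = (X - intercept) @ normal <= 0
    return {
        "normal": normal,
        "intercept": intercept,
        "left": build_tree(X[go_left], depth + 1, max_depth, rng),
        "right": build_tree(X[~go_left], depth + 1, max_depth, rng),
    }


def path_length(x, node, depth=0):
    """Depth at which x reaches a leaf, plus a correction for unsplit points in that leaf."""
    if "normal" not in node:
        return depth + c_factor(node["size"])
    side = "left" if (x - node["intercept"]) @ node["normal"] <= 0 else "right"
    return path_length(x, node[side], depth + 1)


def outlier_scores(X, n_trees=100, sample_size=128, seed=0):
    """Normalized score s = 2^(-E[h(x)] / c(psi)); larger means more outlying."""
    X = np.asarray(X, dtype=float)
    if len(X) < 2:
        raise ValueError("need at least two data objects")
    rng = np.random.default_rng(seed)
    psi = min(sample_size, len(X))          # sub-sample size per tree
    max_depth = int(np.ceil(np.log2(psi)))  # height limit, as in isolation forest
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(len(X), size=psi, replace=False)
        trees.append(build_tree(X[idx], 0, max_depth, rng))
    mean_path = np.array([np.mean([path_length(x, t) for t in trees]) for x in X])
    return 2.0 ** (-mean_path / c_factor(psi))


def top_outliers(scores, k):
    """Indices of the k data objects with the largest outlier scores."""
    return np.argsort(scores)[::-1][:k]
```

Calling `outlier_scores(X)` on an `(n, d)` array and passing the result to `top_outliers(scores, k)` returns the indices of the `k` highest-scoring objects. The sub-sample size and the number of trees correspond to two of the parameters the abstract says were studied; the nearest-neighbor parameter belongs to the relevant-subspace step, which this sketch omits.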

Last Update: 2022-10-10