An Extended Isolation Forest Outlier Detection Algorithm Based on Relevant Subspace - Computer Technology and Development

Article Information

Title:: An Extended Isolation Forest Outlier Detection Algorithm Based on Relevant Subspace

Author(s):: LIU Jia; ZHU Peng-yun; XUN Ya-ling; School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China

Keywords:: outlier detection; extended isolation forest; relevant subspace; Gaussian mixture model; sparse data area

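Abstract (translated from the Chinese):: Extended isolation forest is an ensemble outlier detection method that splits data with hyperplanes of random slope. It separates outliers from normal data objects quickly and has low time complexity. However, when the hyperplane of an isolation tree is chosen in a dense region of the data set, or in a region containing irrelevant dimensions, detection performance degrades severely. Drawing on the idea and methods of relevant subspaces, this paper proposes an extended isolation forest outlier detection algorithm. The algorithm uses a Gaussian mixture model to determine the relevant subspace of each data object, which ensures that the cutting hyperplanes of the isolation trees are selected in sparse data regions. When an isolation tree branches, the random intercept point of the hyperplane is chosen preferentially in sparse data regions, so outliers are isolated from those regions quickly and irrelevant attribute dimensions do not interfere with the choice of the random slope. The average path length of each data object over the isolation trees is normalized to give its outlier score, and the data objects with the largest scores are reported as outliers. Experiments on UCI data sets verify the effectiveness of the algorithm and examine how the sub-sample size, the number of isolation trees, and the number of nearest neighbors affect detection performance.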

Abstract:: The extended isolation forest outlier detection algorithm, as an ensemble outlier detection method, selects hyperplanes of random slope and has the advantages of separating outliers from normal data quickly and of low time complexity. However, when the hyperplane of an isolation tree is selected in a dense area of the data set or in an area with irrelevant dimensions, the outlier detection effect is seriously degraded. An extended isolation forest outlier detection algorithm is proposed using the idea and method of relevant subspace. It uses a Gaussian mixture model to determine the relevant subspace of data objects, which guarantees that the branching hyperplane of the isolation tree is selected in the sparse data area. During the construction of each extended isolation tree, random intercept points of hyperplanes are preferentially selected in the data-sparse region so as to isolate outliers from that region quickly, which also avoids the interference of irrelevant attribute dimensions when selecting the hyperplane's random slope. The outlier score of each data object is then obtained by normalizing its average path length over the isolation trees, and the data objects with the largest outlier scores are identified as outliers. Experimental results on UCI data sets validate the effectiveness of the algorithm and show the effects of its parameters, including the sub-sample size, the number of isolation trees, and the number of nearest neighbors, on outlier detection.
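The scoring pipeline summarized in the abstract (random-slope hyperplanes, average path length, normalized outlier score, top-ranked objects) can be illustrated with a minimal sketch of a baseline extended isolation forest. It is not the authors' implementation. The relevant-subspace step based on the Gaussian mixture model is omitted because the abstract does not specify it, so intercept points here are drawn uniformly from the bounding box of the current node rather than preferentially from sparse regions. The normalization is assumed to be the standard isolation forest score s(x) = 2^(-E[h(x)] / c(ψ)), where ψ is the sub-sample size. All function and parameter names (`c_factor`, `build_tree`, `outlier_scores`, `top_outliers`, `n_trees`, `sample_size`) are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def c_factor(n):
    """Expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def signed_side(x, normal, intercept):
    """Dot product (x - intercept) . normal; its sign gives the side of the hyperplane."""
    return sum((xi - pi) * ni for xi, pi, ni in zip(x, intercept, normal))


def build_tree(points, depth, max_depth, rng):
    """Recursively split points with a hyperplane of random slope and random intercept."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}  # leaf node
    dim = len(points[0])
    # Random slope: normal vector with standard-normal components.
    normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Assumption: intercept drawn uniformly from the bounding box of this node's points
    # (the paper instead prefers sparse regions of the relevant subspace).
    intercept = [rng.uniform(min(p[d] for p in points), max(p[d] for p in points))
                 for d in range(dim)]
    left = [p for p in points if signed_side(p, normal, intercept) <= 0.0]
    right = [p for p in points if signed_side(p, normal, intercept) > 0.0]
    return {"normal": normal, "intercept": intercept,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def path_length(x, node):
    """Depth at which x reaches a leaf, plus the adjustment for unsplit points in that leaf."""
    depth = 0
    while "size" not in node:
        go_left = signed_side(x, node["normal"], node["intercept"]) <= 0.0
        node = node["left"] if go_left else node["right"]
        depth += 1
    return depth + c_factor(node["size"])


def outlier_scores(data, n_trees=100, sample_size=256, seed=0):
    """Return one score per data object; higher means more outlying."""
    if len(data) < 2:
        raise ValueError("need at least two data objects")
    rng = random.Random(seed)
    psi = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(data, psi), 0, max_depth, rng) for _ in range(n_trees)]
    norm = c_factor(psi)
    return [2.0 ** (-sum(path_length(x, t) for t in trees) / (n_trees * norm))
            for x in data]


def top_outliers(scores, k):
    """Indices of the k data objects with the largest outlier scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

Calling `outlier_scores(data)` on a list of equal-length numeric tuples and then `top_outliers(scores, k)` gives the k highest-scoring objects. The sub-sample size and the number of trees correspond to two of the three parameters the abstract studies. The nearest-neighbor parameter belongs to the omitted relevant-subspace step and has no counterpart in this sketch.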