Optimization of Data Cleaning Algorithm for Similar Duplicate Data Detection - Computer Technology and Development (《计算机技术与发展》)

Article Info

Title:: Optimization of Data Cleaning Algorithm for Similar Duplicate Data Detection

Author(s):: JIANG Yuan; HAN Xu; MA Dan-xuan; LUO Deng-chang; Changjiang Institute of Survey Planning Design and Research, Changjiang Water Resource Commission, Wuhan 430010, China

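Abstract (translated from the Chinese):: Data is something large enterprises compete for, yet the data that enterprises collect, process, and finally load into databases often contains similar duplicate records; such data is known as “dirty data”. Left unprocessed, dirty data affects subsequent data operations and ultimately degrades data quality. Data cleaning is a widely used technique for handling dirty data and improving data quality, and similar duplicate detection is an important part of it. For example, dike engineering data contains many place names, latitude/longitude coordinates, and borehole records, which show a high degree of similarity and duplication when entered into a database. The most widely used algorithm for duplicate detection today is SNM (the basic Sorted Neighborhood Method), which first sorts the original dataset and then compares the similarity of neighboring records in the sorted order. However, the time complexity of this algorithm is high. This paper optimizes SNM: it first classifies database records by attribute value, then sorts them with a three-interval sorting algorithm to narrow the comparison range, and finally assigns a weight to each attribute, sums the weighted attribute similarities, and judges duplicates by the resulting record similarity. Experimental results confirm the correctness of the algorithm.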

Abstract:: Data has always been an object of competition among large enterprises, and the data that enterprises collect, process, and finally enter into databases often contains similar duplicate records, also known as “dirty data”. If dirty data is not processed, it affects subsequent data operations and ultimately degrades data quality. Data cleaning is a popular technique for improving data quality by processing dirty data, and similar duplicate detection is an important part of it. For example, a dike engineering database holds many place names, latitude/longitude values, and borehole records that are highly similar to one another. At present, the most widely used algorithm for duplicate detection is SNM (the basic Sorted Neighborhood Method), which first sorts the original data and then compares the similarity of adjacent records. However, the time complexity of this algorithm is high. This paper optimizes the SNM algorithm: the database records are first classified to reduce the comparison range, and attribute weights are then set to compute and judge the similarity of records. Finally, an example is given to demonstrate the correctness of the algorithm.
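
The pipeline outlined in the abstract (classify records, sort within each class, compare neighboring records, and judge duplicates by a weighted sum of attribute similarities) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `window` and `threshold` parameters, the sample records, and the attribute weights are all hypothetical; Python's built-in sort stands in for the paper's three-interval sorting algorithm; and `difflib.SequenceMatcher` is an assumed per-attribute similarity measure.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity of two attribute values in [0, 1] (assumed metric: difflib ratio)."""
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted sum of per-attribute similarities; weights are assumed to sum to 1."""
    return sum(
        w * field_similarity(str(r1[attr]), str(r2[attr]))
        for attr, w in weights.items()
    )


def find_duplicates(records, class_attr, sort_attr, weights, window=3, threshold=0.85):
    """Return index pairs (i, j), i < j, of records judged to be similar duplicates."""
    # Step 1: classify records by an attribute value, so that comparisons
    # only ever happen between records of the same class.
    classes = defaultdict(list)
    for idx, rec in enumerate(records):
        classes[rec[class_attr]].append(idx)

    pairs = []
    for ids in classes.values():
        # Step 2: sort within the class (built-in sort used here in place of
        # the three-interval sort named in the abstract).
        ids.sort(key=lambda i: str(records[i][sort_attr]))
        # Step 3: slide a window over the sorted order and compare each
        # record with the next (window - 1) records only.
        for pos, i in enumerate(ids):
            for j in ids[pos + 1 : pos + window]:
                if record_similarity(records[i], records[j], weights) >= threshold:
                    pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical sample data, loosely modeled on the dike example in the abstract.
records = [
    {"type": "borehole", "name": "Wuhan Dike Section 1", "lon": "114.30", "lat": "30.59"},
    {"type": "borehole", "name": "Wuhan Dike Section 01", "lon": "114.30", "lat": "30.59"},
    {"type": "borehole", "name": "Jingzhou Dike North", "lon": "112.24", "lat": "30.33"},
    {"type": "gauge", "name": "Wuhan Dike Section 1", "lon": "114.30", "lat": "30.59"},
]
weights = {"name": 0.6, "lon": 0.2, "lat": 0.2}  # assumed weights

duplicates = find_duplicates(records, "type", "name", weights)
print(duplicates)  # prints [(0, 1)]
```

In this sketch, record 3 is never compared with record 0 because the two fall into different classes. That is the trade-off of narrowing the comparison range: fewer comparisons, at the risk of missing duplicates whose classifying attribute differs.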