[1] JIANG Yuan, HAN Xu, MA Dan-xuan, et al. Optimization of Data Cleaning Algorithm for Similar Duplicate Data Detection[J]. Computer Technology and Development, 2019, 29(10): 79-82. [doi:10.3969/j.issn.1673-629X.2019.10.017]

Optimization of Data Cleaning Algorithm for Similar Duplicate Data Detection

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
29
Issue:
No. 10, 2019
Pages:
79-82
Column:
Application Development Research
Publication date:
2019-10-10

Article Info

Title:
Optimization of Data Cleaning Algorithm for Similar Duplicate Data Detection
Article ID:
1673-629X(2019)10-0079-04
Author(s):
JIANG Yuan (蒋园), HAN Xu (韩旭), MA Dan-xuan (马丹璇), LUO Deng-chang (罗登昌)
Changjiang Institute of Survey Planning Design and Research, Changjiang Water Resource Commission, Wuhan 430010, China
Keywords:
dirty data; similar repetition; data cleaning; SNM
CLC number:
TP31
DOI:
10.3969/j.issn.1673-629X.2019.10.017
Abstract:
Data is an asset that enterprises compete for, yet the data they collect, process, and finally load into databases often contains similar or duplicate records, commonly called "dirty data". Left unprocessed, dirty data affects subsequent data operations and ultimately degrades data quality. Data cleaning is a widely used technique for handling dirty data and improving data quality, and similar-duplicate detection is one of its most important parts. In dike engineering, for example, data such as place names, longitude and latitude, and borehole data show a high degree of similarity and duplication when entered into a database. The most widely used algorithm for duplicate detection is currently SNM (the sorted neighborhood method), which first sorts the original data set and then compares the similarity of adjacent records after sorting. However, the time complexity of this algorithm is high. This paper optimizes the SNM algorithm: the attribute values of database records are first classified and then sorted with a three-interval sorting algorithm to narrow the comparison range; finally, weights are assigned to the attributes and summed, and duplicates are judged by the resulting record similarity. Experimental results confirm the correctness of the algorithm.
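
The abstract describes the approach only at a high level, so the following is a minimal Python sketch of a basic sorted-neighborhood pass with weighted attribute similarity, not the authors' optimized implementation. The function names, the `window` and `threshold` parameters, the sample records, and the use of `difflib.SequenceMatcher` as the per-attribute similarity measure are all assumptions made for illustration; the paper's attribute classification and three-interval sorting steps are not reproduced here.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity of two attribute values in [0, 1] (difflib ratio, an assumed stand-in)."""
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted sum of per-attribute similarities; weights are assumed to sum to 1."""
    return sum(w * field_similarity(str(r1[k]), str(r2[k])) for k, w in weights.items())


def snm_duplicates(records, key, weights, window=3, threshold=0.85):
    """Sort records by key, compare each record with the next (window - 1)
    records in sorted order, and return index pairs whose weighted
    similarity reaches the threshold."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if record_similarity(records[i], records[j], weights) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical dike-style records: place name, longitude, latitude.
records = [
    {"place": "Wuhan Dike Section A", "lon": "114.30", "lat": "30.59"},
    {"place": "Jingzhou Dike", "lon": "112.24", "lat": "30.33"},
    {"place": "Wuhan Dike Section A.", "lon": "114.30", "lat": "30.59"},
]
weights = {"place": 0.5, "lon": 0.25, "lat": 0.25}  # assumed attribute weights

pairs = snm_duplicates(records, key=lambda r: r["place"], weights=weights)
```

With these sample records, sorting by place name puts records 0 and 2 next to each other, and they are the only pair reported. Because each record is compared only with its neighbors inside the window, the number of comparisons grows with the window size rather than with the square of the record count, which is the property the sort-then-compare design relies on.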


Last Update: 2019-10-10