[1] BAI Wen-lei, CHANG Li-qiong, GUO Jun, et al. A Data Deduplication Algorithm for Scientific Literature Based on Paper Portrait[J]. Computer Technology and Development, 2022, 32(08): 148-154. [doi:10.3969/j.issn.1673-629X.2022.08.024]

A Data Deduplication Algorithm for Scientific Literature Based on Paper Portrait

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
32
Issue:
2022, No. 08
Pages:
148-154
Section:
Application Frontiers and General Topics
Publication date:
2022-08-10

Article Info

Title:
A Data Deduplication Algorithm for Scientific Literature Based on Paper Portrait
Article ID:
1673-629X(2022)08-0148-07
Author(s):
BAI Wen-lei 1; CHANG Li-qiong 1; GUO Jun 1,2; LIU Bao-ying 1; GAN Da-guang 3
1. School of Information Science and Technology, Northwest University, Xi'an, Shaanxi 710127, China;
2. Jingdong Joint Research Institute of AI and Internet of Things, Northwest University, Xi'an, Shaanxi 710127, China;
3. Wanfang Data Co., Ltd., Beijing 100038, China
Keywords:
paper portrait; data cleaning; data deduplication; tf-idf; word2vec
CLC number:
TP391
DOI:
10.3969/j.issn.1673-629X.2022.08.024
Abstract:
Filtering and removing duplicate data from different databases quickly and accurately is one of the key techniques for building a data warehouse. In scientific literature services, traditional deduplication methods mainly rely on database retrieval to match field contents and filter out papers with identical content. However, papers stored in different databases usually have different field information and field types, and even shared fields may contain garbled content, which weakens the robustness of these methods. This is a major challenge for traditional search-and-match approaches. To address it, and inspired by the item-portrait and person-portrait algorithms used in recommender systems, this paper proposes a data deduplication algorithm for scientific literature based on paper portraits. The algorithm uses tf-idf to extract keywords from each paper's abstract, converts the keywords into word vectors with word2vec, computes the similarity between papers, and filters out duplicates accordingly. Experimental results show that, on a large real-world paper data set, the algorithm removes duplicate records effectively, with a mean AUC above 0.98.
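The pipeline described in the abstract (tf-idf keyword extraction, keyword-to-vector conversion, similarity comparison) can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the smoothed idf formula, the averaging of keyword vectors, the cosine measure, the 0.8 threshold, and all function names are assumptions made for illustration, and the small `EMBEDDINGS` table is a hand-written stand-in for a trained word2vec model.

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "and", "for", "in", "of", "the", "using", "with"}

# Toy 3-d vectors standing in for a trained word2vec model (assumption).
EMBEDDINGS = {
    "deduplication": [0.9, 0.1, 0.0], "scientific": [0.8, 0.2, 0.0],
    "paper": [0.9, 0.0, 0.1], "papers": [0.9, 0.0, 0.1],
    "portraits": [0.8, 0.1, 0.1],
    "soil": [0.0, 0.1, 0.9], "moisture": [0.1, 0.0, 0.9],
    "sensing": [0.1, 0.2, 0.8], "wireless": [0.0, 0.2, 0.8],
    "signals": [0.1, 0.1, 0.8],
}


def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def top_keywords(docs, k=3):
    """Return the k tokens with the highest tf-idf score for each document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    result = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed idf (an assumed variant): log((1 + n) / (1 + df)) + 1.
        scores = {
            t: (c / len(toks)) * (math.log((1 + n) / (1 + doc_freq[t])) + 1)
            for t, c in tf.items()
        }
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:k])
    return result


def paper_vector(keywords, embeddings):
    """Average the vectors of the keywords found in the embedding table."""
    vectors = [embeddings[w] for w in keywords if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_duplicates(docs, embeddings, k=3, threshold=0.8):
    """Return index pairs (i, j), i < j, whose paper vectors are similar."""
    vectors = [paper_vector(kw, embeddings) for kw in top_keywords(docs, k)]
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if vectors[i] is None or vectors[j] is None:
                continue
            if cosine(vectors[i], vectors[j]) >= threshold:
                pairs.append((i, j))
    return pairs


ABSTRACTS = [
    "Deduplication of scientific papers using paper portraits",
    "Scientific paper deduplication with paper portraits",
    "Soil moisture sensing with wireless signals",
]
print(find_duplicates(ABSTRACTS, EMBEDDINGS))  # [(0, 1)]
```

Comparing keyword vectors rather than raw field strings is what lets two differently worded records of the same paper match, which is the robustness issue the abstract raises. The all-pairs loop is quadratic and only suits small inputs; a large collection would need some form of candidate blocking or indexing.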

Last Update: 2022-08-10