[1] ZHANG Ya-nan, CHEN Wei-wei, FU Yin-jin, et al. Improved Text Deduplication Algorithm Based on Simhash[J]. Computer Technology and Development, 2022, 32(08): 26-32. [doi:10.3969/j.issn.1673-629X.2022.08.005]

Improved Text Deduplication Algorithm Based on Simhash

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
32
Issue:
2022, No. 08
Pages:
26-32
Section:
Big Data Analysis and Mining
Publication date:
2022-08-10

Article Info

Title:
Improved Text Deduplication Algorithm Based on Simhash
Article ID:
1673-629X(2022)08-0026-07
Author(s):
ZHANG Ya-nan, CHEN Wei-wei, FU Yin-jin, XU Kun
School of Command and Control Engineering, Army Engineering University, Nanjing 210007, Jiangsu, China
Keywords:
Simhash; text deduplication; term frequency-inverse document frequency (TF-IDF); Jaccard similarity; binary dimension reduction (BDR); position feature
CLC number:
TP301
DOI:
10.3969/j.issn.1673-629X.2022.08.005
Abstract:
To improve the accuracy with which Simhash, a large-scale text deduplication algorithm, detects duplicate data, an improved Simhash algorithm (P-Simhash) is proposed to address the inability of the Bag of Words (BoW) model to reflect the positional distribution of feature words. First, the way Simhash computes feature-word weights is improved: starting from the weights computed by the TF-IDF algorithm, the Jaccard similarity measure is introduced to adjust the weights of co-occurring words, reducing the effect that excessively high co-occurrence-word weights have on detecting differences between texts. Second, drawing on the dimension-reduction idea of the BDR algorithm, a signature scheme is designed that reflects differences in feature-word position, converting the positions at which feature words appear in the text into a set of signatures represented as binary vectors. Finally, the weighted sum of each feature word's hash signature and its position signature is taken as that word's feature vector, which is weighted a second time by the optimized feature-word weight; the results are merged and reduced in dimension to obtain the new text signature. Experiments are conducted on the open Sogou News dataset, and performance is compared with other algorithms. The results show that P-Simhash clearly improves on the traditional Simhash algorithm in both deduplication effectiveness and execution efficiency.
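
For orientation, the baseline that the abstract builds on can be sketched in a few lines of Python. This is a minimal sketch of classic Simhash with a Jaccard helper, not the authors' P-Simhash: the abstract does not give the Jaccard re-weighting formula or the BDR-based position signature, so neither is reproduced here. The function names, the 64-bit signature width, the use of MD5 as the token hash, and the raw term counts standing in for TF-IDF weights are all assumptions made for illustration.

```python
import hashlib
from collections import Counter
from typing import Mapping


def token_hash(token: str, bits: int = 64) -> int:
    """Stable hash of a token, truncated to `bits` bits (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(weights: Mapping[str, float], bits: int = 64) -> int:
    """Classic Simhash: each token casts a weighted +/- vote on every bit."""
    acc = [0.0] * bits
    for token, w in weights.items():
        h = token_hash(token, bits)
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w
    # Collapse the accumulated votes into a binary signature.
    sig = 0
    for i, v in enumerate(acc):
        if v > 0:
            sig |= 1 << i
    return sig


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical example documents; raw counts stand in for TF-IDF weights.
doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "the quick brown fox jumped over the lazy dog".split()
sig_a = simhash(dict(Counter(doc_a)))
sig_b = simhash(dict(Counter(doc_b)))
print(hamming(sig_a, sig_b), jaccard(set(doc_a), set(doc_b)))
```

Two documents are usually treated as near-duplicates when the Hamming distance between their signatures falls below a small threshold. Because the per-token vote depends only on the token and its weight, this baseline ignores word order, which is the BoW limitation the abstract says P-Simhash targets.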


Last Update: 2022-08-10