[1] NI Gao-wei, LI Tao, LIU Zheng. Similarity Calculation of Short Text Combined with Semantic and Structure[J]. Computer Technology and Development, 2018, 28(08): 104-108. [doi:10.3969/j.issn.1673-629X.2018.08.022]

Similarity Calculation of Short Text Combined with Semantic and Structure

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
28
Issue:
2018, No. 08
Pages:
104-108
Column:
Intelligence, Algorithms, Systems Engineering
Publication date:
2018-08-10

Article Info

Title:
Similarity Calculation of Short Text Combined with Semantic and Structure
Article ID:
1673-629X(2018)08-0104-05
Author(s):
NI Gao-wei, LI Tao, LIU Zheng
School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210046, Jiangsu, China
Keywords:
earth mover's distance (EMD); Word2Vec; short text similarity calculation; semantic similarity; word order structure
CLC number:
TP181
DOI:
10.3969/j.issn.1673-629X.2018.08.022
Document code:
A
Abstract:
Short text similarity comprises both semantic similarity and syntactic similarity. Most existing short text similarity measures analyze only semantic-level similarity and overlook the substantial influence of syntactic structure. As a result, they fail to capture much of the semantic information in the text, and recall in short text classification tasks is unsatisfactory. Based on an analysis of the features of short texts, this paper applies EMD (earth mover's distance), which solves for the optimal solution of the transportation problem in linear programming, to measure the similarity of two short texts. Word2Vec is used to measure the semantic similarity of two words, and the concept of word-order position similarity is proposed, so that the contribution of the order of words and phrases in a sentence is taken into account when computing short text similarity. Experimental results show that, while capturing a large amount of textual semantic information, the algorithm achieves good accuracy and recall when applied to k-nearest neighbor (KNN) text classification.
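
The abstract describes measuring the distance between two short texts as a transportation problem (EMD), with word-to-word semantic similarity supplying the ground cost and a word-order term added on top. The Python sketch below illustrates that general idea only and should not be read as the authors' algorithm. It rests on several assumptions: hand-made toy vectors (`TOY_VECTORS`) stand in for trained Word2Vec embeddings, every word carries equal weight, and the blended ground cost controlled by the hypothetical `order_weight` parameter is an illustrative choice, since the abstract does not give the paper's word-order position similarity formula. The transportation problem is solved here with a small successive-shortest-path min-cost-flow routine rather than a general linear-programming solver.

```python
import math

# Toy 3-d vectors standing in for trained Word2Vec embeddings (assumption).
TOY_VECTORS = {
    "dog": [1.0, 0.1, 0.0],
    "puppy": [0.9, 0.2, 0.1],
    "man": [0.0, 1.0, 0.1],
    "bites": [0.1, 0.0, 1.0],
}


def cosine_distance(u, v):
    """1 - cosine similarity, clamped at 0 to absorb rounding error."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return max(0.0, 1.0 - dot / norm)


def min_cost_flow(num_nodes, edges, source, sink, required):
    """Successive shortest paths (Bellman-Ford) over integer capacities."""
    graph = [[] for _ in range(num_nodes)]
    for u, v, cap, cost in edges:
        # Each arc: [target, residual capacity, cost, index of reverse arc].
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])
    flow, total = 0, 0.0
    while flow < required:
        dist = [math.inf] * num_nodes
        prev = [None] * num_nodes
        dist[source] = 0.0
        for _ in range(num_nodes - 1):
            updated = False
            for u in range(num_nodes):
                if dist[u] == math.inf:
                    continue
                for k, (v, cap, cost, _) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v] - 1e-12:
                        dist[v] = dist[u] + cost
                        prev[v] = (u, k)
                        updated = True
            if not updated:
                break
        if prev[sink] is None:
            raise ValueError("required flow cannot be routed")
        # Find the bottleneck along the cheapest path, then push flow.
        push, v = required - flow, sink
        while v != source:
            u, k = prev[v]
            push = min(push, graph[u][k][1])
            v = u
        v = sink
        while v != source:
            u, k = prev[v]
            graph[u][k][1] -= push
            graph[v][graph[u][k][3]][1] += push
            v = u
        flow += push
        total += push * dist[sink]
    return total


def short_text_distance(words_a, words_b, vectors, order_weight=0.0):
    """EMD between two word lists with uniform word weights.

    Ground cost (illustrative assumption, not the paper's formula):
    (1 - order_weight) * cosine distance + order_weight * |relative position gap|.
    """
    n, m = len(words_a), len(words_b)
    if n == 0 or m == 0:
        raise ValueError("both texts need at least one word")
    source, sink = 0, n + m + 1
    # Scale masses to integers: each word in A ships m units, each in B takes n.
    edges = [(source, 1 + i, m, 0.0) for i in range(n)]
    edges += [(1 + n + j, sink, n, 0.0) for j in range(m)]
    for i, wa in enumerate(words_a):
        for j, wb in enumerate(words_b):
            semantic = cosine_distance(vectors[wa], vectors[wb])
            positional = abs(i / max(n - 1, 1) - j / max(m - 1, 1))
            cost = (1.0 - order_weight) * semantic + order_weight * positional
            edges.append((1 + i, 1 + n + j, n * m, cost))
    return min_cost_flow(n + m + 2, edges, source, sink, n * m) / (n * m)


reference = ["dog", "bites", "man"]
print(short_text_distance(reference, ["man", "bites", "dog"], TOY_VECTORS, 0.0))
print(short_text_distance(reference, ["man", "bites", "dog"], TOY_VECTORS, 0.5))
print(short_text_distance(reference, ["puppy", "bites", "man"], TOY_VECTORS, 0.5))
```

With `order_weight=0.0` the measure is a pure bag-of-words distance, so "dog bites man" and "man bites dog" come out as essentially identical; a positive `order_weight` separates them while keeping the near-synonym paraphrase close. A distance of this kind could then serve as the metric in a KNN classifier, as the abstract suggests.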


Last Update: 2018-09-10