[1] SUN Chang-nian, ZHENG Cheng, XIA Qing-song. Chinese Text Similarity Computing Based on LDA [J]. Computer Technology and Development, 2013, (01): 217-220.

Chinese Text Similarity Computing Based on LDA

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume::
Issue:: 2013, No. 01

Pages:: 217-220

Column:: Application Development Research

Publication date:: 1900-01-01

Article Info

Title:: Chinese Text Similarity Computing Based on LDA

Article ID:: 1673-629X(2013)01-0217-04

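Author(s):: SUN Chang-nian [1,2]; ZHENG Cheng [1,2]; XIA Qing-song [1,2]

Affiliations:: [1] School of Computer Science and Technology, Anhui University; [2] Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education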

Keywords:: vector space model; text similarity; natural language processing; latent Dirichlet allocation; topic model

Document code:: A

Abstract:: Text similarity computation based on the traditional TF-IDF vector space model suffers from high dimensionality, data sparsity, a lack of semantics, and unnormalized dimensions. Semantically extended TF-IDF vector space models built on it partly address the semantic problem, but their reliance on dictionary-based word similarity limits their range of application. This paper proposes a text similarity computation method based on latent Dirichlet allocation (LDA). The LDA model can address all of the above problems without a dictionary: texts are modeled into topic space by Gibbs sampling, and the Jensen-Shannon (JS) distance is then used to compute text similarity. Clustering experiments show that the method achieves a relatively high F-measure.
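
The final step described in the abstract, comparing two documents by the Jensen-Shannon divergence of their topic distributions, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three document-topic vectors are hypothetical values standing in for the output of a Gibbs-sampled LDA model, the function names are made up for this example, and the mapping from divergence to similarity (`1 - JS`, with base-2 logarithms so that JS lies in [0, 1]) is an assumption, since the abstract does not state which normalization the paper uses.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    With base-2 logarithms the value lies in [0, 1].
    """
    # Mixture distribution; m_i > 0 whenever p_i > 0 or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def js_similarity(p, q):
    """Assumed similarity score: 1 minus the JS divergence."""
    return 1.0 - js_divergence(p, q)


# Hypothetical document-topic distributions over three topics.
# In the method described above these would come from LDA inference.
theta_a = [0.70, 0.20, 0.10]
theta_b = [0.65, 0.25, 0.10]
theta_c = [0.05, 0.15, 0.80]

print("sim(a, b) =", js_similarity(theta_a, theta_b))
print("sim(a, c) =", js_similarity(theta_a, theta_c))
```

Because each document is represented by a short, dense probability vector over topics rather than a sparse vector over the vocabulary, the comparison sidesteps the dimensionality and sparsity issues the abstract attributes to TF-IDF. Unlike the plain KL divergence, the JS divergence is symmetric and always finite, which makes it convenient as input to a clustering procedure.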
