
A Topic Modeling Method for Social Science Literature Based on LDA

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:: 28
Issue:: 2018, No. 02

Pages:: 182-187

Column:: Application Development Research

Publication date:: 2018-02-10

Article Information

Title:: A Topic Modeling Method for Social Science Literature Based on LDA

Article ID:: 1673-629X(2018)02-0182-06

Author(s):: LI Chang-ya (李昌亚); LIU Fang-fang (刘方方); School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China

Keywords:: topic model; LDA; social science literature; Gibbs sampling

Classification number:: TP31

DOI:: 10.3969/j.issn.1673-629X.2018.02.039

Document code:: A

Abstract:: With the development of the Internet, text classification and topic extraction are applied more and more widely, and topic models play an important role in extracting topics from text. LDA (latent Dirichlet allocation) is a widely used and mature topic model. It is a probabilistic generative model that handles both synonymy (several words sharing one meaning) and polysemy (one word carrying several meanings) well. However, when LDA is used to model a document collection from the social science literature, it ignores the topic characteristics of the collection itself, and the extracted topic distributions are biased toward the high-frequency words in the documents. As a result, the extracted topics deviate from the essential topics of the documents and the results are not accurate enough. This paper examines the process by which LDA models document topics, takes the characteristics of social science documents into account, and improves the modeling process accordingly. It proposes a new topic modeling method so that the extracted topics are more accurate and more consistent with the topic characteristics of the document collection itself.
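
For orientation, the baseline that the abstract refers to, standard LDA fitted by Gibbs sampling, can be sketched as below. This is a minimal illustration of plain collapsed Gibbs sampling only; it is not the improved method the paper proposes, which this record does not describe. The function name `lda_gibbs`, the hyperparameter values, and the toy corpus are assumptions made for the example.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iters=200, seed=0):
    """Collapsed Gibbs sampling for standard LDA (illustrative sketch).

    docs: list of documents, each a list of integer word ids
          in the range [0, vocab_size).
    Returns (doc_topic, topic_word) count matrices.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n_dk
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n_kw
    topic_total = [0] * num_topics                              # n_k
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional p(z = t | all other assignments).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Hypothetical toy corpus: word ids 0-2 and 3-5 tend to co-occur.
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 3, 4, 5]]
doc_topic, topic_word = lda_gibbs(toy_docs, num_topics=2, vocab_size=6)
```

The term `topic_word[t][w] + beta` in the sampling weight grows with how often word `w` is already assigned to topic `t`, which is consistent with the abstract's observation that frequent words can dominate the extracted topics.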
