Computer Technology and Development (《计算机技术与发展》) [ISSN:1006-6977/CN:61-1281/TN]

Volume: 28
Issue: 2018, No. 08

Pages: 1-5

Section: Intelligence, Algorithms, and Systems Engineering

Publication date: 2018-08-10

Article Info

Title: Selection Method of LDA Optimal Topic Number Based on Frequent Word Network

Article ID: 1673-629X(2018)08-0001-05

Author(s): LI Fei-fei; WANG Yi-zhi (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China)

Keywords: LDA (latent Dirichlet allocation); topic model; frequent word network; clustering; community partition

CLC number: TP393

DOI: 10.3969/j.issn.1673-629X.2018.08.001

Document code: A

Abstract: The latent Dirichlet allocation (LDA) topic model is widely used in large-scale document processing, typically for topic extraction, sentiment analysis, and text dimensionality reduction. Such models use an algorithm similar to expectation maximization to extract low-dimensional semantic distributions from a document collection and combine the per-dimension distributions to form topics. During model construction, the initial number of topics K strongly affects both the iterative process and the result. To address this problem, and based on the observation that the number of document clusters (that is, the number of communities) is consistent with the number of latent topics in the document set, we propose a method that uses the number of communities found in a frequent word set network as the input number of topics for the LDA topic model. The method builds frequent word pairs from the documents, constructs a word co-occurrence network from them, partitions that network with an unsupervised community partition algorithm, and takes the resulting number of communities as the number of topics for LDA. Experimental results show that the method specifies the number of topics K automatically, significantly improves topic precision and recall, and yields more independent topics.
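
The pipeline described in the abstract (frequent word pairs, then a word co-occurrence network, then community partition, then K) can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the abstract does not name the community partition algorithm, so a simple deterministic label propagation is used as a stand-in; the whitespace tokenizer, the `min_support` threshold, and all function names are hypothetical. Chinese text would need word segmentation before this step.

```python
from collections import Counter, defaultdict
from itertools import combinations


def frequent_word_pairs(docs, min_support=2):
    """Count word pairs that co-occur in a document and keep the pairs
    that appear in at least `min_support` documents (assumed threshold)."""
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc.lower().split()))  # naive whitespace tokenizer
        counts.update(combinations(words, 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


def build_network(pairs):
    """Build a weighted undirected graph: word -> {neighbor: co-occurrence count}."""
    graph = defaultdict(dict)
    for (u, v), weight in pairs.items():
        graph[u][v] = weight
        graph[v][u] = weight
    return dict(graph)


def label_propagation(graph, max_iter=100):
    """Stand-in community partition: each word repeatedly adopts the label
    with the largest total edge weight among its neighbors. Ties go to the
    alphabetically smallest label so the result is deterministic."""
    labels = {node: node for node in graph}
    for _ in range(max_iter):
        changed = False
        for node in sorted(graph):
            votes = Counter()
            for neighbor, weight in graph[node].items():
                votes[labels[neighbor]] += weight
            best = max(votes.values())
            new_label = min(l for l, v in votes.items() if v == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    communities = defaultdict(set)
    for node, label in labels.items():
        communities[label].add(node)
    return list(communities.values())


def estimate_topic_number(docs, min_support=2):
    """Number of communities in the frequent word network, used as K."""
    pairs = frequent_word_pairs(docs, min_support)
    return len(label_propagation(build_network(pairs)))


# Toy corpus (made up for illustration): two themes plus one noisy document.
docs = [
    "stock market price",
    "stock market price",
    "goal match team",
    "goal match team",
    "price team",  # this pair occurs once, so min_support=2 filters it out
]
print(estimate_topic_number(docs))  # 2
```

The resulting value would then be passed as the topic count to whatever LDA implementation is in use. The support threshold matters here: with `min_support=1` the noisy pair would link the two word groups, and the partition could change.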

Last Update: 2018-09-27
