[1] WU Zhe, YANG Fang. A Thematic Analysis Method of Academic Documents Based on TF-IDF and LDA [J]. Computer Technology and Development, 2020, 30(01): 194-200. [doi:10.3969/j.issn.1673-629X.2020.01.035]

Time-Weighted TF-LDA Topic Analysis of Academic Literature Abstracts

Computer Technology and Development [ISSN: 1006-6977 / CN: 61-1281/TN]

Volume:
30
Issue:
2020, No. 01
Pages:
194-200
Column:
Application Development Research
Publication date:
2020-01-10

Article Info

Title:
A Thematic Analysis Method of Academic Documents Based on TF-IDF and LDA
Article ID:
1673-629X(2020)01-0194-07
Author(s):
WU Zhe, YANG Fang
School of Computer Science, Xi'an University of Posts and Telecommunications, Xi'an 710121, Shaanxi, China
Keywords:
LDA; thematic model; academic literature; TF-IDF; time factor
CLC number:
TP31
DOI:
10.3969/j.issn.1673-629X.2020.01.035
Abstract:
With the development of the Internet, topic extraction is being applied ever more widely, especially to academic literature. Although abstracts of academic papers are short texts, their high dimensionality makes them difficult for text topic models to handle, and although they are time-sensitive, topic mining tends to ignore the time factor, which leads to uneven and unclear topic distributions. To address these problems, a topic clustering model for academic literature abstracts based on TTF-LDA (time + TF-IDF + latent Dirichlet allocation) is proposed. TF-IDF feature extraction is used to extract feature words from the abstracts, which effectively reduces the dimensionality of the text fed to the LDA model. The publication time of each paper is then incorporated: a time window is established to limit the period covered by the topic analysis, feature words receive additional time weight according to the publication time of the documents in which they appear, and the sum of a feature word's time weights, together with a topic-guided feature lexicon, serves as the influence factor for LDA. In experiments on a crawled data set, with the same number of topics, the model shows a lower degree of confusion (perplexity) than standard LDA and MVC-LDA and a higher degree of distinction between topics, which better matches the characteristics of academic literature.
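The preprocessing steps named in the abstract (time window, TF-IDF feature selection, summed time weights) can be illustrated with a minimal sketch. The abstract gives no formulas, so everything specific here is an assumption: the exponential decay in `time_weight`, the `decay` parameter, the plain `tf * log(N/df)` scoring, and all function names are hypothetical. The LDA step itself is omitted; the sketch only produces the reduced documents and per-word time weights that such a model might consume.

```python
import math
from collections import Counter

def tfidf_top_terms(docs, k):
    """Return, for each tokenized document, its k highest-scoring TF-IDF terms."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    selected = []
    for doc in docs:
        tf = Counter(doc)
        scores = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        # Sort by score (descending), then alphabetically so ties are deterministic.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        selected.append(ranked[:k])
    return selected

def time_weight(year, window_end, decay=0.5):
    """Assumed weighting (not from the paper): newer documents weigh closer to 1."""
    return math.exp(-decay * (window_end - year))

def time_weighted_features(records, window, k=3, decay=0.5):
    """records: list of (publication_year, tokens); window: (start_year, end_year)."""
    start, end = window
    # Time window: keep only documents published inside the analysis period.
    in_window = [(year, tokens) for year, tokens in records if start <= year <= end]
    feats = tfidf_top_terms([tokens for _, tokens in in_window], k)
    # Sum each feature word's time weight over the documents that selected it.
    weights = Counter()
    for (year, _), terms in zip(in_window, feats):
        w = time_weight(year, end, decay)
        for term in terms:
            weights[term] += w
    return feats, dict(weights)

records = [
    (2015, ["lda", "topic", "model", "topic"]),
    (2019, ["lda", "time", "weight", "topic"]),
    (2010, ["old", "paper"]),  # falls outside the window below
]
features, weights = time_weighted_features(records, window=(2014, 2019), k=1)
print(features, weights)
```

In this sketch a word that appears as a feature in several recent abstracts accumulates a larger weight than one that appears only in older ones, which is one plausible reading of "the sum of a feature word's time weights"; the paper's actual weighting may differ.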


Last Update: 2020-01-10