A Patent Document Classification Method Based on Word Weighted LDA Model

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:: 29
Issue:: 2019, No. 03

Pages:: 23-29

Column:: Intelligence, Algorithms, and Systems Engineering

Publication date:: 2019-03-10

Article Info

Title:: A Patent Document Classification Method Based on Word Weighted LDA Model

Article ID:: 1673-629X(2019)03-0023-07

Author(s):: SUN Wei; LIU Wen-jing; GE Li-ge; YU Xuan; School of Information Engineering, Shanghai Maritime University, Shanghai 201306, China

Keywords:: weighted model; latent Dirichlet allocation (LDA); KeyGraph algorithm; patent literature classification

Classification number:: TP18

DOI:: 10.3969/j.issn.1673-629X.2019.03.005

Abstract:: When traditional topic models are used for text classification, the feature words they select are mostly high-frequency words chosen by statistical regularity. In patent literature classification, however, most domain-specific terms are drowned out by high-frequency words, which lowers the accuracy of topic models on patent documents. To address this, we propose a word-weighted supervised LDA topic model for patent document classification. Starting from the co-occurrence relationship between domain-specific terms and high-frequency words, the KeyGraph algorithm is used to select keywords with stronger feature representation ability, and a mutual information function is then used to compute the weight of each keyword and build a dictionary of domain-specific terms. On this basis, a supervised LDA model is built, word weighting is extended into the LDA model, and Gibbs sampling is used for parameter estimation. In classification experiments on patent documents, the model improves classification accuracy by an average of 4.62%, 3.74%, and 3.26% over the LDA model and its two variants, respectively. This indicates that the highly discriminative domain-specific terms selected by the model are more strongly associated with the topics, and that both classification efficiency and accuracy improve noticeably.
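To make the idea of extending word weights into LDA with Gibbs sampling more concrete, the following is a minimal sketch, not the authors' implementation. It assumes one common weighting scheme: each token adds its word's weight, rather than 1, to the count tables used by the collapsed Gibbs sampler. The function name `weighted_lda_gibbs`, its parameters, and the toy weights are hypothetical. The sketch is unsupervised, so the supervised (class label) part of the model described in the abstract is omitted, and the weights are taken as given rather than computed by KeyGraph and mutual information.

```python
import random


def weighted_lda_gibbs(docs, weights, n_topics, vocab_size,
                       alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA with per-word weights.

    docs    : list of documents, each a list of word ids in [0, vocab_size)
    weights : weights[w] is the (assumed, precomputed) weight of word w
    Returns (theta, topic_word): per-document topic proportions and the
    weighted topic-word count table.
    """
    rng = random.Random(seed)
    doc_topic = [[0.0] * n_topics for _ in docs]                # weighted n_{d,k}
    topic_word = [[0.0] * vocab_size for _ in range(n_topics)]  # weighted n_{k,w}
    topic_total = [0.0] * n_topics                              # weighted n_k
    assign = []

    # Random initial topic assignment; each token contributes its word weight.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += weights[w]
            topic_word[k][w] += weights[w]
            topic_total[k] += weights[w]
        assign.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, wt = assign[d][i], weights[w]
                # Remove the current token's weighted contribution.
                doc_topic[d][k] -= wt
                topic_word[k][w] -= wt
                topic_total[k] -= wt
                # Full conditional p(z = t | everything else), unnormalised.
                probs = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=probs)[0]
                # Add the token back under its newly sampled topic.
                assign[d][i] = k
                doc_topic[d][k] += wt
                topic_word[k][w] += wt
                topic_total[k] += wt

    # Smoothed per-document topic proportions.
    theta = []
    for d in range(len(docs)):
        denom = sum(doc_topic[d]) + n_topics * alpha
        theta.append([(doc_topic[d][k] + alpha) / denom for k in range(n_topics)])
    return theta, topic_word


# Toy usage: word 0 stands in for a domain-specific term with a higher weight.
toy_docs = [[0, 1, 0, 2], [3, 4, 3, 4], [0, 2, 1, 1]]
toy_weights = [2.0, 1.0, 1.0, 0.5, 0.5]
toy_theta, _ = weighted_lda_gibbs(toy_docs, toy_weights, n_topics=2, vocab_size=5)
for row in toy_theta:
    print([round(p, 3) for p in row])
```

Under this assumed scheme, a word with a larger weight pulls the topic-word and document-topic counts more strongly than an unweighted high-frequency word, which is one way the domain-specific terms could avoid being drowned out during sampling. The resulting `theta` rows could then serve as document features for a separate classifier.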

