[1]武高敏[],张宇晨[],韩京宇[]. 基于隐含语义分析的在线新闻话题发现方法[J].计算机技术与发展,2016,26(09):1-7.
 WU Gao-min[],ZHANG Yu-chen[],HAN Jing-yu[]. Online News Topics Extraction Based on Latent Semantic Analysis[J].,2016,26(09):1-7.
点击复制

 基于隐含语义分析的在线新闻话题发现方法()
分享到:

《计算机技术与发展》[ISSN:1006-6977/CN:61-1281/TN]

卷:
26
期数:
2016年09期
页码:
1-7
栏目:
应用开发研究
出版日期:
2016-09-10

文章信息/Info

Title:
 Online News Topics Extraction Based on Latent Semantic Analysis
文章编号:
1673-629X(2016)09-0001-07
作者:
 武高敏[1] 张宇晨[1] 韩京宇[2]
1.南京邮电大学 计算机学院;2.东南大学 计算机网络和信息集成教育部重点实验室
Author(s):
 WU Gao-min[1] ZHANG Yu-chen[1] HAN Jing-yu[2]
关键词:
 话题发现 向量空间模型 隐含语义分析 文本聚类 奇异值分解
Keywords:
 topic detection vector space model latent semantic analysis text clustering singular value decomposition
分类号:
TP181
文献标志码:
A
摘要:
 互联网的飞速发展和海量数据的不断增长,使得如何快速、有效地识别当前新闻热点信息成为迫切需求。在线新闻话题发现已成为当前研究热点。对于在线环境下的新闻文本特征表示,传统向量空间模型随着数据的增长向量维度不断增长,使得数据稀疏和同名异议问题愈加明显,导致文本相似度难以准确度量。使用基于特征加权的隐含语义分析将高维、稀疏的词-文档矩阵映射到隐藏的k维语义空间,充分挖掘词、文档之间的语义信息,以提高同主题文档间的语义相似度,克服在线环境下文本稀疏性和同名异议问题。此外,对于不断增长的大规模新闻数据,传统聚类算法存在时间复杂度过高或者输入依赖等问题,难以快速、有效地得到理想结果。基于新闻报道在时间上的顺序性和相关性,提出改进的Single-pass在线增量聚类算法检测话题类,并引入话题热度值的概念来筛选当前关注度较高的热点话题。实验结果表明,该方法能够有效提高话题检测的准确率,实现基于真实新闻数据集的在线话题捕捉。
Abstract:
 With the rapid development of the Internet and the continuous increasing of massive data,how to identify the current news topic quickly and effectively is becoming an urgent demand,and online hot news topic detection has become an hot area of research. For online news stream,the degree of traditional Vector Space Model ( VSM) will grow with the increasing of data,resulting in obvious problem of data sparsity and synonymy,which makes it difficult to quickly and accurately calculate the similarity of texts. The latent semantic analysis based on weighted features is used to map the sparse matrix with high-dimension of words and documents to the hidden k-dimension se-mantic space,making full use of the semantic information between words and documents to improve the semantic similarity between the same subject documents,overcoming the problems of text sparsity and synonymy in Internet. In addition,traditional clustering algorithm exists the problem of high time complexity and input dependency for increasing massive news data,which is difficult to get the expected result quickly and efficiently. A Single-pass online clustering algorithm is used to detect the topic clusters based on succession and corre-lation in time for news,and the concept of topic heat is introduced to screen the public attention of news topics. Experiment shows that the method proposed can effectively improve the accuracy of the detection of topics.

相似文献/References:

[1]张志宏,吴庆波,邵立松,等.基于飞腾平台TOE协议栈的设计与实现[J].计算机技术与发展,2014,24(07):1.
 ZHANG Zhi-hong,WU Qing-bo,SHAO Li-song,et al. Design and Implementation of TCP/IP Offload Engine Protocol Stack Based on FT Platform[J].,2014,24(09):1.
[2]梁文快,李毅. 改进的基因表达算法对航班优化排序问题研究[J].计算机技术与发展,2014,24(07):5.
 LIANG Wen-kuai,LI Yi. Research on Optimization of Flight Scheduling Problem Based on Improved Gene Expression Algorithm[J].,2014,24(09):5.
[3]黄静,王枫,谢志新,等. EAST文档管理系统的设计与实现[J].计算机技术与发展,2014,24(07):13.
 HUANG Jing,WANG Feng,XIE Zhi-xin,et al. Design and Implementation of EAST Document Management System[J].,2014,24(09):13.
[4]侯善江[],张代远[][][]. 基于样条权函数神经网络P2P流量识别方法[J].计算机技术与发展,2014,24(07):21.
 HOU Shan-jiang[],ZHANG Dai-yuan[][][]. P2P Traffic Identification Based on Spline Weight Function Neural Network[J].,2014,24(09):21.
[5]李璨,耿国华,李康,等. 一种基于三维模型的文物碎片线图生成方法[J].计算机技术与发展,2014,24(07):25.
 LI Can,GENG Guo-hua,LI Kang,et al. A Method of Obtaining Cultural Debris’ s Line Chart Based on Three-dimensional Model[J].,2014,24(09):25.
[6]翁鹤,皮德常. 混沌RBF神经网络异常检测算法[J].计算机技术与发展,2014,24(07):29.
 WENG He,PI De-chang. Chaotic RBF Neural Network Anomaly Detection Algorithm[J].,2014,24(09):29.
[7]刘茜[],荆晓远[],李文倩[],等. 基于流形学习的正交稀疏保留投影[J].计算机技术与发展,2014,24(07):34.
 LIU Qian[],JING Xiao-yuan[,LI Wen-qian[],et al. Orthogonal Sparsity Preserving Projections Based on Manifold Learning[J].,2014,24(09):34.
[8]尚福华,李想,巩淼. 基于模糊框架-产生式知识表示及推理研究[J].计算机技术与发展,2014,24(07):38.
 SHANG Fu-hua,LI Xiang,GONG Miao. Research on Knowledge Representation and Inference Based on Fuzzy Framework-production[J].,2014,24(09):38.
[9]叶偲,李良福,肖樟树. 一种去除运动目标重影的图像镶嵌方法研究[J].计算机技术与发展,2014,24(07):43.
 YE Si,LI Liang-fu,XIAO Zhang-shu. Research of an Image Mosaic Method for Removing Ghost of Moving Targets[J].,2014,24(09):43.
[10]余松平[][],蔡志平[],吴建进[],等. GSM-R信令监测选择录音系统设计与实现[J].计算机技术与发展,2014,24(07):47.
 YU Song-ping[][],CAI Zhi-ping[] WU Jian-jin[],GU Feng-zhi[]. Design and Implementation of an Optional Voice Recording System Based on GSM-R Signaling Monitoring[J].,2014,24(09):47.

更新日期/Last Update: 2016-10-24