[1] QI Hou-lin (戚后林), GU Lei (顾磊). KNN Text Classification Algorithm with Probabilistic Latent Semantic Analysis[J]. Computer Technology and Development (计算机技术与发展), 2017, 27(07): 57-61.

 KNN Text Classification Algorithm with Probabilistic Latent Semantic Analysis

Computer Technology and Development (计算机技术与发展) [ISSN: 1006-6977 / CN: 61-1281/TN]

Volume:
27
Issue:
2017, No. 07
Pages:
57-61
Column:
Intelligence, Algorithms, and Systems Engineering
Publication date:
2017-07-10

Article Info

Title:
 KNN Text Classification Algorithm with Probabilistic Latent Semantic Analysis
Article ID:
1673-629X(2017)07-0057-05
Author(s):
 QI Hou-lin (戚后林), GU Lei (顾磊)
 School of Computer Science, Nanjing University of Posts and Telecommunications
Keywords:
 text classification; KNN algorithm; text representation model; semantic classification; probabilistic latent semantic analysis
CLC number:
TP301.6
Document code:
A
Abstract:
 When computing the similarity between texts, the traditional KNN text classification algorithm performs only simple concept matching and ignores the semantic information carried by the terms in the training and test texts. As a result, semantic information may be lost during KNN classification and the results may be inaccurate. To address this problem, a KNN text classification algorithm based on a probabilistic latent topic model is proposed. The algorithm first uses the probabilistic topic model to build text-topic and topic-term models of the training texts, mapping the semantic information carried by each text into a low-dimensional topic space. Text similarity is then expressed through the text-topic and topic-term probability distributions, and KNN is applied to this low-dimensional semantic representation to classify texts. Experimental results show that, on larger training datasets and datasets to be classified, the proposed algorithm can perform semantic classification of texts with a KNN classifier and improves the accuracy, recall, and F1 measure of KNN classification.
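
The pipeline summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the EM details, the way unseen texts are projected into topic space, or the similarity measure. The sketch below therefore assumes the standard PLSA EM updates, the usual "fold-in" step (topic-term distributions held fixed) for test texts, and cosine similarity between text-topic vectors. The function names `plsa` and `knn_predict` and the toy corpus are hypothetical.

```python
from collections import Counter

import numpy as np

EPS = 1e-12  # guards against division by zero


def plsa(X, n_topics, p_w_z=None, n_iter=50, seed=0):
    """EM for PLSA on a (docs x words) count matrix X.

    If p_w_z is given it is held fixed ("fold-in" for unseen texts) and
    only P(z|d) is estimated. Returns (P(z|d), P(w|z)).
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    fold_in = p_w_z is not None
    if not fold_in:
        p_w_z = rng.random((n_topics, X.shape[1]))
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((X.shape[0], n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]
        post /= post.sum(axis=1, keepdims=True) + EPS
        # M-step: re-estimate distributions from count-weighted posteriors
        weighted = X[:, None, :] * post
        if not fold_in:
            p_w_z = weighted.sum(axis=0)
            p_w_z /= p_w_z.sum(axis=1, keepdims=True) + EPS
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + EPS
    return p_z_d, p_w_z


def knn_predict(train_topics, train_labels, test_topics, k=3):
    """Majority vote among the k training texts most similar in topic space.

    Cosine similarity is an assumption; the abstract does not name the measure.
    """
    a = train_topics / (np.linalg.norm(train_topics, axis=1, keepdims=True) + EPS)
    b = test_topics / (np.linalg.norm(test_topics, axis=1, keepdims=True) + EPS)
    sims = b @ a.T  # (test docs x train docs)
    preds = []
    for row in sims:
        nearest = np.argsort(row)[::-1][:k]
        votes = Counter(train_labels[i] for i in nearest)
        preds.append(votes.most_common(1)[0][0])
    return preds


# Toy corpus (hypothetical): words 0-2 occur only in "sports" texts,
# words 3-5 only in "tech" texts.
X_train = [[3, 2, 1, 0, 0, 0], [2, 3, 2, 0, 0, 0],
           [0, 0, 0, 3, 1, 2], [0, 0, 0, 2, 2, 3]]
y_train = ["sports", "sports", "tech", "tech"]
X_test = [[1, 2, 1, 0, 0, 0], [0, 0, 0, 1, 1, 2]]

train_topics, p_w_z = plsa(X_train, n_topics=2)
test_topics, _ = plsa(X_test, n_topics=2, p_w_z=p_w_z)
preds = knn_predict(train_topics, y_train, test_topics, k=1)
print(preds)
```

The dense (texts x topics x terms) posterior array keeps the sketch short but only suits small corpora; a realistic corpus would need sparse counts and per-document updates.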

Last Update: 2017-08-22