
Study on Comparison of Multi-class Text Classification Methods

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:: 32
Issue:: 2022, No. 01

Pages:: 54-60

Section:: Big Data Analysis and Mining

Publication date:: 2022-01-10

Article Info

Title:: Study on Comparison of Multi-class Text Classification Methods

Article ID:: 1673-629X(2022)01-0054-07

Author(s):: YU Wei-hong; School of Transportation Economics and Management, Dalian Maritime University, Dalian 116026, China

Keywords:: text classification; multi-class; machine learning; text feature representation; classification algorithm

CLC number:: TP391.1

DOI:: 10.3969/j.issn.1673-629X.2022.01.010

Abstract:: Text classification, especially multi-class text classification, is an important classical problem with wide applications in public opinion monitoring, news recommendation, sentiment analysis of online comments, and other fields. Many algorithms are available for multi-class text classification, but each has its own assumptions, advantages, and disadvantages. To help users and researchers choose and improve classification methods, a comparison scheme for multi-class text classification methods is designed. The scheme considers two dimensions, text feature representation and classification algorithm: three text feature representation methods are combined with five classification algorithms to form 15 classification models for comparison. Following the designed comparison process, 3,000 news texts of different categories crawled from a media reading website serve as the corpus. The 15 models are compared over several runs at different data scales, with the Kappa coefficient and running time as the evaluation metrics. The overall evaluation concludes that word embedding for text feature representation has clear advantages in both running speed and classification performance, and that KNN+CBOW, SVM+CBOW, and Naive Bayes+CBOW are all good models for multi-class text classification.
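The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes the Kappa coefficient is Cohen's kappa computed between true and predicted labels, which the abstract does not state explicitly. The function names `cohen_kappa` and `evaluate`, the toy labels, and the `predict_fn` callback are hypothetical and are not taken from the paper.

```python
from collections import Counter
import time


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: label agreement corrected for chance agreement."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of samples whose prediction matches the truth.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Expected chance agreement from the marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined when both lists contain a single identical class;
        # this sketch returns 1.0 because the labels agree completely.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def evaluate(predict_fn, texts, y_true):
    """Return (kappa, elapsed seconds) for one hypothetical model."""
    start = time.perf_counter()
    y_pred = [predict_fn(text) for text in texts]
    elapsed = time.perf_counter() - start
    return cohen_kappa(y_true, y_pred), elapsed


# Toy labels for illustration only; they are not data from the paper.
y_true = ["sports", "finance", "tech", "sports", "tech", "finance"]
y_pred = ["sports", "finance", "tech", "tech", "tech", "sports"]
print(cohen_kappa(y_true, y_pred))  # approximately 0.5
```

In a comparison like the one described, `evaluate` would be called once per combination of feature representation and classifier, and the resulting pairs of kappa and running time would then be ranked.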
