[1] YU Wei-hong. Study on Comparison of Multi-class Text Classification Methods[J]. Computer Technology and Development, 2022, 32(01): 54-60. [doi:10.3969/j.issn.1673-629X.2022.01.010]

Study on Comparison of Multi-class Text Classification Methods

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
32
Issue:
2022, No. 01
Pages:
54-60
Column:
Big Data Analysis and Mining
Publication date:
2022-01-10

Article Info

Title:
Study on Comparison of Multi-class Text Classification Methods
Article ID:
1673-629X(2022)01-0054-07
Author(s):
YU Wei-hong
School of Transportation Economics and Management, Dalian Maritime University, Dalian 116026, Liaoning, China
Keywords:
text classification; multi-class; machine learning; text representation; classification algorithm
CLC number:
TP391.1
DOI:
10.3969/j.issn.1673-629X.2022.01.010
Abstract:
Text classification, especially multi-class text classification, is an important classical problem with wide applications in public opinion monitoring, news recommendation, and sentiment analysis of online comments. Many algorithms can be used for multi-class text classification, but each has its own assumptions, strengths, and weaknesses. To help users and researchers choose and improve classification methods, a comparison scheme for multi-class text classification methods is designed that considers two dimensions: text feature representation and classification algorithm. Three text feature representation methods and five classification algorithms are combined to form 15 classification models as the objects of comparison. Following the designed comparison process, 3,000 news texts of different categories crawled from a media reading website serve as the corpus; the 15 models are compared several times at different data scales, with the Kappa coefficient and running time as evaluation metrics. The overall evaluation concludes that word embedding for text feature representation has clear advantages in both running speed and classification performance, and that KNN+CBOW, SVM+CBOW, and Naive Bayes+CBOW are all good models for multi-class text classification.
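The evaluation described in the abstract can be illustrated with a minimal sketch. It is not the paper's code. It assumes the standard Cohen's kappa definition, kappa = (p_o - p_e) / (1 - p_e), and it assumes that "running time" can be approximated by wall-clock time around a prediction call; the abstract does not say exactly what is timed. The names cohen_kappa, timed_kappa, FEATURES, CLASSIFIERS and MODELS are hypothetical, and only CBOW, KNN, SVM and Naive Bayes are named in the abstract, so the remaining grid entries are placeholders.

```python
import time
from collections import Counter
from itertools import product

# Only CBOW, KNN, SVM and Naive Bayes are named in the abstract; the
# other entries are placeholders, not the paper's actual choices.
FEATURES = ["CBOW", "feature_B", "feature_C"]
CLASSIFIERS = ["KNN", "SVM", "NaiveBayes", "classifier_D", "classifier_E"]

# 3 feature representations x 5 classifiers = 15 candidate models.
MODELS = [f"{clf}+{feat}" for feat, clf in product(FEATURES, CLASSIFIERS)]


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa, (p_o - p_e) / (1 - p_e), for two label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(y_true)
    # Observed agreement: fraction of positions where the labels match.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: sum over labels of the product of marginal counts.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Both sequences hold one identical label; returning 1.0 here is
        # a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def timed_kappa(predict, texts, y_true):
    """Call predict(texts) once and return (kappa, elapsed_seconds)."""
    start = time.perf_counter()
    y_pred = predict(texts)
    elapsed = time.perf_counter() - start
    return cohen_kappa(y_true, y_pred), elapsed


if __name__ == "__main__":
    y_true = ["sports", "tech", "finance", "tech", "sports", "finance"]
    y_pred = ["sports", "tech", "tech", "tech", "sports", "finance"]
    print(len(MODELS))                             # 15
    print(round(cohen_kappa(y_true, y_pred), 2))   # 0.75
```

Unlike plain accuracy, kappa subtracts the agreement expected by chance, which matters when one category dominates the corpus. In a real comparison, each entry of MODELS would be mapped to a trained pipeline and passed to timed_kappa as the predict callable.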


Last Update: 2022-01-10