

A Chinese Text Classification Algorithm Based on Weight Preprocessing


Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:: 32
Issue:: 2022, No. 03

Pages:: 40-45

Column:: Big Data Analysis and Mining

Publication date:: 2022-03-10

Article Info

Title:: A Chinese Text Classification Algorithm Based on Weight Preprocessing

Article number:: 1673-629X(2022)03-0040-06

Author(s):: HE Kai; GUAN You-qing; GONG Rui; School of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing 210003, China

Keywords:: natural language processing; word frequency algorithm; Chinese text classification; weight preprocessing; word density weight

CLC number:: TP391

DOI:: 10.3969/j.issn.1673-629X.2022.03.007

Abstract:: Text classification is an important branch of natural language processing (NLP). Information retrieval and text mining, key technologies in NLP, have brought many conveniences to everyday life, and text classification is an important foundation for both. As an active research topic in NLP, text classification automatically assigns text data to categories according to given classification rules. Common approaches currently fall into two groups, machine learning based and deep learning based; in essence, both have the computer learn on its own to extract rules from the text and then classify with them. For application scenarios with small amounts of data and limited hardware computing power, text classification models derived from machine learning algorithms are often used. Using journal papers as experimental data, this paper studies Chinese text classification and, by improving the traditional word frequency algorithm, proposes a Chinese text classification algorithm based on weight preprocessing, PRE-TF-IDF (pre-processing term frequency inverse document frequency). When weighting words, the traditional word frequency algorithm considers only how often a word occurs, not where it occurs in the text. PRE-TF-IDF adds two steps to the TF-IDF (term frequency inverse document frequency) algorithm: weight preprocessing and word density weight. Experimental results show that the PRE-TF-IDF algorithm can effectively improve the accuracy of text classification.
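For orientation, the baseline TF-IDF weighting that the abstract refers to can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name `tf_idf`, the toy documents, and the particular variant (relative term frequency multiplied by the natural log of N over document frequency) are assumptions made here. The abstract does not specify the weight preprocessing or word density weight steps of PRE-TF-IDF, so they are not reproduced.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    Note that only frequency is used; word position plays no role.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already segmented toy documents. Real Chinese text
# would first need a word segmenter, which is outside this sketch.
docs = [
    ["文本", "分类", "算法", "文本"],
    ["文本", "挖掘", "技术"],
    ["深度", "学习", "分类"],
]
weights = tf_idf(docs)
```

In this variant a term that appears in every document receives weight zero, and moving a term from the title to the last sentence leaves its weight unchanged, which is the limitation the abstract attributes to the traditional algorithm.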

