KNN Text Classification Algorithm Based on Spark and Word Relatedness - Computer Technology and Development (《计算机技术与发展》)

Article Info

Title:: KNN Text Classification Based on Word Relatedness and Spark Framework

Author(s):: YU Ping-ping (于苹苹) 1; NI Jian-cheng (倪建成) 2; WEI Jin-tao (韦锦涛) 2; CAO Bo (曹博) 1; YAO Bin-xiu (姚彬修) 1
1. School of Information Science and Engineering, Qufu Normal University, Rizhao 276826, Shandong, China;
2. School of Software Engineering, Qufu Normal University, Qufu 273100, Shandong, China

Abstract:: To address the reduced efficiency and unsatisfactory accuracy of the K-nearest neighbor (KNN) classification algorithm on big data, we propose an efficient KNN text classification algorithm based on the Spark framework and optimized with word relatedness. In the similarity computation, word relatedness is used to take the relationships between the words of a text into account, which optimizes the similarity calculation and improves the accuracy of text classification. We rely on the in-memory processing mechanism of Spark to parallelize text classification and thereby raise the efficiency of the KNN algorithm; during parallelization, a class-distance vector is built to further speed up classification. Experimental results show that the proposed Spark-based algorithm greatly improves classification efficiency while preserving classification quality, achieves a better speedup than the Hadoop platform, and can effectively classify big data.
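The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea and not the authors' implementation. It assumes (1) that word relatedness enters the similarity as a soft cosine, where related but different words contribute to the dot product; (2) that the class-distance vector is the per-class sum of similarities among the k nearest neighbours; and (3) that plain Python lists stand in for Spark partitions, with `local_top_k` playing the role of a per-partition map step and `merge_top_k` the role of a reduce step. The `RELATEDNESS` table, the toy documents, and all function names are hypothetical.

```python
import heapq
import math
from collections import defaultdict
from functools import reduce

# Hypothetical symmetric word-relatedness scores in [0, 1] (assumption, not from the paper).
RELATEDNESS = {
    frozenset(("football", "soccer")): 0.9,
    frozenset(("stock", "market")): 0.7,
}

def relatedness(w1, w2):
    """Identical words score 1.0; unknown pairs score 0.0."""
    if w1 == w2:
        return 1.0
    return RELATEDNESS.get(frozenset((w1, w2)), 0.0)

def soft_dot(a, b):
    """Dot product in which every word pair is weighted by its relatedness."""
    return sum(wa * relatedness(i, j) * wb
               for i, wa in a.items() for j, wb in b.items())

def soft_cosine(a, b):
    """Cosine similarity that also credits related (not only identical) words."""
    denom = math.sqrt(soft_dot(a, a) * soft_dot(b, b))
    return soft_dot(a, b) / denom if denom else 0.0

def local_top_k(partition, query, k):
    """Map step: score one partition of (vector, label) pairs, keep its k best."""
    scored = ((soft_cosine(query, vec), label) for vec, label in partition)
    return heapq.nlargest(k, scored)

def merge_top_k(left, right, k):
    """Reduce step: merge two candidate lists into a single top-k list."""
    return heapq.nlargest(k, left + right)

def classify(partitions, query, k):
    """Pick the class with the largest summed similarity among the k neighbours."""
    candidates = reduce(lambda l, r: merge_top_k(l, r, k),
                        (local_top_k(p, query, k) for p in partitions))
    class_scores = defaultdict(float)  # assumed form of the class-distance vector
    for sim, label in candidates:
        class_scores[label] += sim
    return max(class_scores, key=class_scores.get)

# Toy training set split into two "partitions" (term weights are made up).
partitions = [
    [({"football": 1.0, "goal": 1.0}, "sports"),
     ({"stock": 1.0, "price": 1.0}, "finance")],
    [({"soccer": 1.0, "match": 1.0}, "sports"),
     ({"market": 1.0, "bank": 1.0}, "finance")],
]
query = {"soccer": 1.0, "goal": 1.0}

print(classify(partitions, query, k=3))
```

Because each partition forwards only its k best candidates, the merge step handles short lists regardless of the training-set size. On an actual Spark cluster the same structure would plausibly map onto RDD operations such as `mapPartitions` followed by `reduce`, though the abstract does not state how the paper implements it.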