[1] HUANG Jian-yu, ZHOU Ai-wu, XIAO Yun, et al. Text Clustering Based on Feature Space[J]. Computer Technology and Development, 2017, 27(09): 75-77.

 Text Clustering Based on Feature Space

Computer Technology and Development [ISSN: 1006-6977 / CN: 61-1281/TN]

Volume:
27
Issue:
2017, No. 09
Pages:
75-77
Section:
Intelligence, Algorithms, and Systems Engineering
Publication date:
2017-09-10

Article Info

Title:
 Text Clustering Based on Feature Space
Article ID:
1673-629X(2017)09-0075-03
Author(s):
 HUANG Jian-yu, ZHOU Ai-wu, XIAO Yun, TAN Tian-cheng
 School of Computer Science and Technology, Anhui University
Keywords:
 HowNet; domain dictionary; theme; sememes; clustering; optimized K value
CLC number:
TP301.6
Document code:
A
Abstract:
 Text clustering is a specific application of clustering algorithms. With the development of the Internet, it has been applied ever more widely, for example in information retrieval and intelligent search engines. Text clustering consists mainly of text preprocessing and the clustering algorithm itself, so improvements can start from either of these two aspects. Traditional text clustering uses the vector space model (VSM) for preprocessing; because VSM ignores both the semantic similarity and the correlation between words, clustering accuracy is very low. To address this problem, a text clustering method based on feature space is proposed. It constructs a substitute word library from the feature space of the document collection, derives each document's theme from that library, and then replaces the words in the document according to the theme and its corresponding domain dictionary. In addition, traditional text clustering uses the K-means algorithm, which requires the value of K to be specified manually. An improved K-means algorithm based on K-value optimization is therefore proposed. Experimental results show that the proposed text clustering method and the improved K-means algorithm significantly improve the intelligence and accuracy of text clustering.
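The word-replacement idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `SUBSTITUTES` mapping, the sample documents, and the helper names `normalize` and `cosine` are all hypothetical, and a hand-written dictionary stands in for the substitute word library and domain dictionary that the paper builds from the feature space. The sketch only shows why a plain bag-of-words VSM underrates two documents that use different words for the same concept, and how replacing words with a canonical term before vectorizing raises their similarity.

```python
import math
from collections import Counter

# Hypothetical substitute lexicon (an assumption for illustration only):
# maps a word to a canonical domain term. In the paper this role is played
# by a library built from the document collection's feature space.
SUBSTITUTES = {
    "automobile": "car",
    "vehicle": "car",
    "physician": "doctor",
}

def normalize(tokens, substitutes):
    """Replace each token with its canonical term when one is known."""
    return [substitutes.get(t, t) for t in tokens]

def cosine(a, b):
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two toy documents about the same topic with different vocabulary.
doc1 = ["automobile", "engine", "repair"]
doc2 = ["car", "engine", "maintenance"]

# Plain VSM: only "engine" matches.
raw_sim = cosine(doc1, doc2)

# After substitution: "automobile" becomes "car", so two terms match.
sub_sim = cosine(normalize(doc1, SUBSTITUTES), normalize(doc2, SUBSTITUTES))

print(f"raw: {raw_sim:.3f}, substituted: {sub_sim:.3f}")
```

In this toy case the shared vocabulary grows from one term to two out of three, so the cosine similarity doubles. Any clustering step run on the substituted vectors would see the two documents as closer than it would on the raw vectors.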


Last Update: 2017-10-20