[1] HAN Yue-yang, DENG Shi-kun, JIA Shi-yin, et al. Chinese Word Segmentation Research Based on Classification of Words[J]. Computer Technology and Development, 2011, (07): 29-31.

Chinese Word Segmentation Research Based on Classification of Words

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
Issue:
No. 07, 2011
Pages:
29-31
Column:
Intelligence, Algorithms, and Systems Engineering
Publication date:
1900-01-01

Article Info

Title:
Chinese Word Segmentation Research Based on Classification of Words
Article ID:
1673-629X(2011)07-0029-03
Author(s):
HAN Yue-yang (韩月阳), DENG Shi-kun (邓世昆), JIA Shi-yin (贾时银), LI Yuan-fang (李远方)
College of Information, Yunnan University
Keywords:
Chinese word segmentation; mutual information; t-test; classification
CLC number:
TP391.1
Document code:
A
Abstract:
Chinese word segmentation is a prerequisite and foundation for natural language processing. This paper implements segmentation with a character-classification method, that is, it treats word segmentation as a process of classifying characters. Each character is placed in the context of its immediately preceding and following characters and, based on mutual information statistics, assigned to one of four categories: a character that combines with the preceding character, a character that combines with the following character, a character that combines with both neighbors, and an independent character. A t-test algorithm is applied during segmentation, which resolves ambiguity to some extent. With the People's Daily corpus used for training and testing, the experimental results show that the method handles ambiguity well and that segmentation accuracy reaches 90.3%, a clear improvement.
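The four-way character classification described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact mutual information formula, the threshold, or the t-test formulation, so the code assumes standard pointwise mutual information over adjacent character pairs and a hypothetical fixed threshold, and it leaves out the t-test disambiguation step. The names `CharStats`, `classify`, `segment`, and the label constants are made up for this example.

```python
import math
from collections import Counter

# Hypothetical labels for the four categories named in the abstract.
LEFT, RIGHT, BOTH, SINGLE = "left", "right", "both", "single"


class CharStats:
    """Character and adjacent-pair counts collected from raw sentences."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for s in sentences:
            self.unigrams.update(s)              # single characters
            self.bigrams.update(zip(s, s[1:]))   # adjacent character pairs
        self.n_uni = sum(self.unigrams.values())
        self.n_bi = sum(self.bigrams.values())

    def pmi(self, x, y):
        """Pointwise mutual information log2(p(x,y) / (p(x) * p(y)))."""
        pair = self.bigrams[(x, y)]
        if pair == 0:
            return float("-inf")  # unseen pair: treated as never binding
        p_xy = pair / self.n_bi
        p_x = self.unigrams[x] / self.n_uni
        p_y = self.unigrams[y] / self.n_uni
        return math.log2(p_xy / (p_x * p_y))


def classify(sentence, stats, threshold=1.0):
    """Label each character by whether it binds to its left/right neighbor.

    The threshold is an assumed tuning parameter, not a value from the paper.
    """
    labels = []
    for i, ch in enumerate(sentence):
        binds_prev = i > 0 and stats.pmi(sentence[i - 1], ch) > threshold
        binds_next = (i + 1 < len(sentence)
                      and stats.pmi(ch, sentence[i + 1]) > threshold)
        if binds_prev and binds_next:
            labels.append(BOTH)
        elif binds_prev:
            labels.append(LEFT)
        elif binds_next:
            labels.append(RIGHT)
        else:
            labels.append(SINGLE)
    return labels


def segment(sentence, labels):
    """Join each character to the previous word if it binds leftward."""
    words = []
    for ch, label in zip(sentence, labels):
        if words and label in (LEFT, BOTH):
            words[-1] += ch
        else:
            words.append(ch)
    return words


# Toy corpus, invented for illustration only.
corpus = ["我们", "我们", "学习", "学习", "我们学习", "学习我们"]
stats = CharStats(corpus)
print(segment("我们学习", classify("我们学习", stats)))  # ['我们', '学习']
```

In this sketch a word boundary falls wherever the mutual information of an adjacent pair does not exceed the threshold, so the labels and the boundaries are two views of the same decision. A realistic system would need smoothing for rare pairs and a threshold tuned on held-out data.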

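Memo:
Funding: Natural Science Foundation of Yunnan Province (2007F174M); Yunnan University Graduate Research Project Fund (ynny200928). HAN Yue-yang (born 1985), male, from Henan, master's student, works mainly on Chinese information processing and information retrieval; DENG Shi-kun, professor, works mainly on computer networks and intelligent buildings.
Last Update: 1900-01-01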