[1] QI Bin-chuan, YANG Duan-duan, DING Jian-guo. Compression Method of Language Model Based on Clustering Algorithm and Multistep Indexing[J]. Computer Technology and Development (计算机技术与发展), 2012, (12): 25-28.

Compression Method of Language Model Based on Clustering and Indexing Techniques

Computer Technology and Development (计算机技术与发展) [ISSN: 1006-6977 / CN: 61-1281/TN]

Volume:
Issue:
No. 12, 2012
Pages:
25-28
Column:
Intelligence, Algorithms, and Systems Engineering
Publication date:
1900-01-01

Article Info

Title:
Compression Method of Language Model Based on Clustering Algorithm and Multistep Indexing
Article ID:
1673-629X(2012)12-0025-04
Author(s):
QI Bin-chuan (祁斌川)1, YANG Duan-duan (杨端端)2, DING Jian-guo (丁建国)1
[1] Dept. of Beam Instrumentation & Control, Shanghai Institute of Applied Physics, Chinese Academy of Sciences; [2] Dept. of Speech Research, SNDA Institute of Innovation Research
Keywords:
language model; compression method; K-means clustering algorithm; multilevel index
Classification number:
TP319.14
Document code:
A
Abstract:
Because the training corpus is very large, the ARPA-format statistical language model file that SRILM produces is oversized, which slows lookup and consumes a great deal of storage. To address this, the paper borrows ideas from clustering and index-based search and proposes a compression method for N-gram Chinese language model files: the transition probabilities and back-off probabilities in the model are compressed with the K-means clustering algorithm, and a multilevel index is used to speed up lookup. Theoretical analysis and experiments show that the method achieves a very good compression ratio while limiting the effect of compression-induced distortion on word selection, improves lookup efficiency on the language model file, and noticeably improves the response speed of the input method.
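As a rough illustration of the clustering step described in the abstract, the sketch below replaces each log-probability with the index of its nearest K-means centroid, so that only a small codebook plus one short code per entry needs to be stored. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`kmeans_1d`, `nearest`, `quantize`), the quantile-based initialisation, the sample values, and the choice of k are all hypothetical, and the multilevel index is not shown.

```python
from bisect import bisect_left


def nearest(centroids, v):
    """Index of the centroid closest to v (centroids must be sorted)."""
    i = bisect_left(centroids, v)
    if i == 0:
        return 0
    if i == len(centroids):
        return len(centroids) - 1
    return i if centroids[i] - v < v - centroids[i - 1] else i - 1


def kmeans_1d(values, k, iterations=20):
    """Lloyd's algorithm on scalars; returns k centroids in ascending order."""
    ordered = sorted(values)
    # Assumption: start from evenly spaced quantiles of the data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        sums = [0.0] * k
        counts = [0] * k
        for v in values:
            j = nearest(centroids, v)
            sums[j] += v
            counts[j] += 1
        # An empty cluster keeps its previous centroid.
        updated = sorted(
            sums[j] / counts[j] if counts[j] else centroids[j] for j in range(k)
        )
        if updated == centroids:
            break
        centroids = updated
    return centroids


def quantize(log_probs, k):
    """Return (codebook, codes): each value is stored as a codebook index."""
    codebook = kmeans_1d(log_probs, k)
    codes = [nearest(codebook, p) for p in log_probs]
    return codebook, codes


# Hypothetical log10 probabilities, as they might appear in an ARPA file.
log_probs = [-0.31, -0.29, -0.30, -2.10, -2.05, -4.80, -4.95, -5.00]
codebook, codes = quantize(log_probs, k=3)
restored = [codebook[c] for c in codes]
max_error = max(abs(a - b) for a, b in zip(log_probs, restored))
print(codes, max_error)
```

With k centroids, each probability costs about log2(k) bits instead of a full floating-point value; the distortion is the gap between a value and its centroid, which is the quantity the abstract says must stay small enough not to affect word selection.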


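Memo

Supported by the National "973" Key Basic Research Development Program of China (2011CB808300). QI Bin-chuan (祁斌川, born 1986), male, from Xinyi, Jiangsu; master's student; research interest: input method design.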
Last Update: 1900-01-01