基于上下文模型的超长哈夫曼码校正算法-《计算机技术与发展》

文章信息/Info

Title:: Correction Algorithm for Ultra-long Huffman Codes Based on Context Model

Author(s):: ZHANG Yong-xing; WU Rui-zhen; JIA Xiao-long; CHEN Jing-jing; SUN Hua-jin; Inspur Artificial Intelligence Research Institute Co. ,Ltd. ,Xi’an 710077,China

Keywords:: Deflate; Huffman coding; Huffman tree; ultra-long Huffman codes; correction of ultra-long Huffman codes correction

摘要:: 常见的 Gzip、 Zlib 数据压缩标准都采用 Deflate 协议压缩封装数据, Deflate 协议中采用哈夫曼码编码源符号(Source symbols) 。哈夫曼编码算法通过构建哈夫曼树生成哈夫曼码,Deflate 协议限定源符号的哈夫曼码的码长不能超过最大值。源符号的哈夫曼码长最大值等于哈夫曼树的高度,因此当哈夫曼树的高度超过限定值时,需要先把哈夫曼树进行“校正”，随后再为每个符号分配。 Gzip、Zlib 软件参考代码中使用的基于二叉树搜索的“ 校正” 算法,校正时需要遍历搜索哈夫曼树,寻找嫁接“节点”。校正流程时间消耗非常大,而且硬件实现难度较大。该文探索一种基于上下文模型校正超长哈夫曼树的算法,与参考二叉树搜索算法相比:该算法可以快速校正超长哈夫曼树,将校正的时间消耗降至为 0,而且对压缩效果几乎没有影响( 压缩比平均下降率仅为 0. 372% )。该算法也易于硬件化实现,可以实时校正超长哈夫曼码。

Abstract:: Common data compression standards such as Gzip and Zlib compress and encapsulate data with the Deflate format. In Deflateprotocol,Source symbols are encoded with Huffman codes that are distributed by constructing the Huffman tree. In the " Deflate "protocol,the Huffman codes for the various alphabets must not exceed certain maximum code lengths. Therefore,if the height of FirstHuffman tree was greater than the maximum value,First Huffman tree needs to be " corrected" before generating Huffman code. Thecurrent " correction" algorithm based on the binary tree consumes a lot of time in the correction process and is difficult to implement inhardware. We explore an algorithm to quickly correct ultra-long Huffman trees based on context models. Compared to exist correctingalgorithm,the ultra-long Huffman trees could be quickly corrected with a consuming time of nearly 0. Moreover,mean compression ratiois only reduced by 0. 372% . The proposed algorithm is suitable for hardware implementation and can correct ultra-long Huffman codesin real time.