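Realization and Optimization of Double-precision Matrix Vector Multiplication Based on HXDSP - Computer Technology and Development (《计算机技术与发展》)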

Article Info

Title:: Realization and Optimization of Double-precision Matrix Vector Multiplication Based on HXDSP

Author(s):: LIAO Xiao-qun (廖晓群)¹; WANG Jia-yi (王佳仪)¹; SU Tao (苏涛)²; LI Min (李敏)¹; ZHANG Mei-chun (张美春)¹; 1. School of Communication and Information Engineering, Xi’an University of Science and Technology, Xi’an 710054, Shaanxi, China;
2. National Lab of Radar Signal Processing, Xidian University, Xi’an 710071, Shaanxi, China

Keywords:: multiple clusters; single instruction multiple data (SIMD); 64-bit data operation; software pipelining; digital signal processor (DSP)

Abstract:: The programming model of the HXDSP1042 compiler now supports byte-granular addressing as well as loads, stores, and arithmetic on 64-bit data, which is significant for improving the precision of floating-point computation. Matrix algorithms are common in radar signal processing, and matrix operations account for a large share of the computation in adaptive beamforming and direction estimation. Many DSP processors, however, cannot automatically make full use of their own hardware architecture, so getting the compiler to handle matrix operations efficiently has become particularly important. HXDSP1042 is a processor for digital signal processing and embedded applications. Designing matrix operations around the hardware characteristics of the chip, within the HXDSP1042 instruction framework, is an important step toward high-performance applications of the chip. Building on the characteristics of the multi-cluster VLIW instruction architecture, this paper applies parallel optimization techniques such as loop unrolling, instruction scheduling, and software pipelining to make full use of the chip's internal hardware resources, and uses them to optimize the double-precision floating-point matrix-vector multiplication function on the HXDSP1042. Experimental results show that, relative to the serial algorithm structure before optimization, the parallel-optimized function achieves a speedup of 11 or more.
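To make the loop-unrolling idea in the abstract concrete, here is a minimal sketch in portable C. It is not the authors' HXDSP1042 code. The function names `matvec_ref` and `matvec_unrolled4`, the row-major layout, and the unroll factor of 4 are all assumptions chosen for illustration. The sketch unrolls the row loop so that four independent accumulators are live at once, which is the kind of independent work a VLIW scheduler or software pipeliner can overlap. Cluster-level SIMD and the software pipelining itself are not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Serial baseline: y = A * x, with A stored row-major as rows x cols.
 * (Hypothetical helper, not taken from the paper.) */
void matvec_ref(const double *a, const double *x, double *y,
                size_t rows, size_t cols)
{
    assert(a != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < rows; ++i) {
        double sum = 0.0;
        for (size_t j = 0; j < cols; ++j)
            sum += a[i * cols + j] * x[j];
        y[i] = sum;
    }
}

/* Row loop unrolled by 4 (assumed factor): four accumulators with no
 * dependencies between them, and each x[j] is loaded once per four rows. */
void matvec_unrolled4(const double *a, const double *x, double *y,
                      size_t rows, size_t cols)
{
    assert(a != NULL && x != NULL && y != NULL);
    size_t i = 0;
    for (; i + 4 <= rows; i += 4) {
        const double *r0 = a + (i + 0) * cols;
        const double *r1 = a + (i + 1) * cols;
        const double *r2 = a + (i + 2) * cols;
        const double *r3 = a + (i + 3) * cols;
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (size_t j = 0; j < cols; ++j) {
            const double xj = x[j];  /* reused by all four rows */
            s0 += r0[j] * xj;
            s1 += r1[j] * xj;
            s2 += r2[j] * xj;
            s3 += r3[j] * xj;
        }
        y[i + 0] = s0;
        y[i + 1] = s1;
        y[i + 2] = s2;
        y[i + 3] = s3;
    }
    /* Remainder rows when rows is not a multiple of 4. */
    for (; i < rows; ++i) {
        double sum = 0.0;
        for (size_t j = 0; j < cols; ++j)
            sum += a[i * cols + j] * x[j];
        y[i] = sum;
    }
}
```

Unrolling across rows rather than along a single dot product keeps each row's summation order the same as in the baseline, so the two versions can be compared directly. Any speedup depends on the target and compiler, and this sketch makes no attempt to reproduce the figure reported in the abstract.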