[1] LIAO Xiao-qun, WANG Jia-yi, SU Tao, et al. Realization and Optimization of Double-precision Matrix Vector Multiplication Based on HXDSP[J]. Computer Technology and Development, 2021, 31(11): 101-107. [doi:10.3969/j.issn.1673-629X.2021.11.017]

Realization and Optimization of Double-precision Matrix Vector Multiplication Based on HXDSP

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
31
Issue:
2021, No. 11
Pages:
101-107
Column:
Systems Engineering
Publication date:
2021-11-10

Article Info

Title:
Realization and Optimization of Double-precision Matrix Vector Multiplication Based on HXDSP
Article ID:
1673-629X(2021)11-0101-07
Author(s):
LIAO Xiao-qun1 WANG Jia-yi1 SU Tao2 LI Min1 ZHANG Mei-chun1
1. School of Communication and Information Engineering, Xi'an University of Science and Technology, Xi'an 710054, China;
2. National Lab of Radar Signal Processing, Xidian University, Xi'an 710071, China
Keywords:
multiple clusters; single instruction multiple data (SIMD); 64-bit data operation; software pipelining; digital signal processor (DSP)
CLC number:
TP301
DOI:
10.3969/j.issn.1673-629X.2021.11.017
Abstract:
The programming model of the HXDSP1042 compiler now supports byte-based addressing as well as the access and operation of 64-bit data, which is of great significance for improving the accuracy of floating-point operations. Matrix algorithms are common in radar signal processing, and matrix operations account for a large share of the computation in adaptive beamforming and direction estimation. Many DSP processors today cannot automatically make full use of their own hardware architecture, so making the compiler handle matrix operations efficiently has become particularly important. HXDSP1042 is a processor for digital signal processing and embedded applications; designing matrix operations around the chip's hardware characteristics within the HXDSP1042 instruction framework is an important step toward high-performance applications for the chip. Exploiting the characteristics of the multi-cluster VLIW instruction architecture, this paper applies parallel optimization techniques such as loop unrolling, instruction scheduling, and software pipelining to make full use of the chip's internal hardware resources, and optimizes the double-precision floating-point matrix-vector multiplication function on the HXDSP1042 chip. Experimental results show that, relative to the serial algorithm structure before optimization, the parallel-optimized function achieves a speedup of more than 11.
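The loop unrolling named in the abstract can be illustrated with a minimal sketch in portable C. This is not the authors' HXDSP1042 implementation: the function names `matvec_ref` and `matvec_unrolled`, the row-major layout, and the unroll factor of 4 are all assumptions made for illustration, and no HXDSP intrinsics or cluster-specific instructions are used. The point is only that splitting each row's dot product across independent accumulators removes the serial dependence on a single sum, which gives a VLIW scheduler or software pipeliner independent multiply-add chains to overlap.

```c
#include <stddef.h>

/* Serial reference: y = A * x, where A is rows x cols in row-major order. */
void matvec_ref(const double *a, const double *x, double *y,
                size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; ++i) {
        double sum = 0.0;
        for (size_t j = 0; j < cols; ++j)
            sum += a[i * cols + j] * x[j];
        y[i] = sum;
    }
}

/* Unrolled variant (unroll factor 4 is an illustrative assumption).
 * Four independent accumulators break the dependence chain on one sum,
 * so the four multiply-adds in the loop body can be issued in parallel. */
void matvec_unrolled(const double *a, const double *x, double *y,
                     size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; ++i) {
        const double *row = a + i * cols;
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        size_t j = 0;

        /* Main body: process four columns per iteration. */
        for (; j + 4 <= cols; j += 4) {
            s0 += row[j]     * x[j];
            s1 += row[j + 1] * x[j + 1];
            s2 += row[j + 2] * x[j + 2];
            s3 += row[j + 3] * x[j + 3];
        }

        /* Combine partial sums, then handle the leftover columns. */
        double sum = (s0 + s1) + (s2 + s3);
        for (; j < cols; ++j)
            sum += row[j] * x[j];

        y[i] = sum;
    }
}
```

Because the unrolled version adds the products in a different order, its result can differ from the serial version in the last bits for general floating-point inputs, so comparisons between the two are usually made with a tolerance.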
Last Update: 2021-11-10