[1] PEI Hang, WANG Lei, WANG Wei, et al. Vectorization Parallelism of Video Decoding Based on Shenwei 421[J]. Computer Technology and Development, 2021, 31(10): 81-86. [doi:10.3969/j.issn.1673-629X.2021.10.014]

Vectorization Parallelism of Video Decoding Based on Shenwei 421

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
31
Issue:
No. 10, 2021
Pages:
81-86
Column:
Systems Engineering
Publication date:
2021-10-10

Article Info

Title:
Vectorization Parallelism of Video Decoding Based on Shenwei 421
Article ID:
1673-629X(2021)10-0081-06
Author(s):
PEI Hang (1, 2), WANG Lei (1, 2), WANG Wei (1, 2), ZHANG Shu-qin (1)
1. School of Computer Science, Zhongyuan University of Technology, Zhengzhou 451191, China;
2. Research Institute of Frontier Information Technology, Zhongyuan University of Technology, Zhengzhou 451191, China
Keywords:
H.264 decoder; FFmpeg codec library; Shenwei processor; single instruction multiple data (SIMD); parallel computing
CLC number:
TP302
DOI:
10.3969/j.issn.1673-629X.2021.10.014
Abstract:
After being ported to the Shenwei platform, the H.264 decoder suffered from low decoding efficiency and choppy video playback. To improve video decoding performance and meet the multimedia needs of users of the domestic Shenwei platform, the H.264 decoder in the open-source FFmpeg codec library is first analyzed in detail, and a performance analysis tool is used to identify the hot functions in video decoding. Then, making full use of the vector extension unit of the Shenwei processor, the code of key decoder modules such as motion compensation and the inverse DCT is rewritten with hand-written inline assembly that substitutes vector instructions, shortening instruction cycles and achieving vectorized parallelism. Finally, loops in the loop filter code that cannot be vectorized directly are restructured, by array reorganization and similar means, so that they satisfy vectorization analysis and can then be computed with vector instructions, exploiting multimedia parallelism more deeply and speeding up multimedia programs. Experimental results show that vectorization improves video decoding performance by up to 35.3%, frees up CPU resources, resolves the choppy playback, and effectively promotes the market adoption of Shenwei processors.
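The array reorganization mentioned for the loop filter can be illustrated with a minimal C sketch. This is not the authors' code: it uses no Shenwei vector instructions or inline assembly, the filter is a simplified H.264-style edge filter, and every name in it (`filter_px`, `filter_edge_strided`, `filter_edge_reorganized`, `demo_forms_agree`, `ROWS`) is hypothetical. It only shows why gathering the pixels on each side of a vertical edge into contiguous arrays turns a strided loop into an element-wise loop of the kind a vector unit can process.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ROWS 8   /* rows crossed by the edge (illustrative value) */
#define COLS 8   /* picture width used by the demo (illustrative value) */

static int clip3(int lo, int hi, int v) { return v < lo ? lo : (v > hi ? hi : v); }

/* Simplified edge filter on the four pixels around an edge; updates *p0, *q0.
   Assumes arithmetic right shift for negative values, as on common compilers. */
static void filter_px(int p1, uint8_t *p0, uint8_t *q0, int q1,
                      int alpha, int beta, int tc)
{
    int a = *p0, b = *q0;
    if (abs(a - b) < alpha && abs(p1 - a) < beta && abs(q1 - b) < beta) {
        int delta = clip3(-tc, tc, ((b - a) * 4 + (p1 - q1) + 4) >> 3);
        *p0 = (uint8_t)clip3(0, 255, a + delta);
        *q0 = (uint8_t)clip3(0, 255, b - delta);
    }
}

/* Direct form: pix points at q0 of row 0; successive iterations are
   `stride` bytes apart, so the loop cannot use contiguous vector loads. */
static void filter_edge_strided(uint8_t *pix, int stride,
                                int alpha, int beta, int tc)
{
    for (int r = 0; r < ROWS; r++) {
        uint8_t *row = pix + r * stride;
        filter_px(row[-2], &row[-1], &row[0], row[1], alpha, beta, tc);
    }
}

/* Reorganized form: gather into contiguous arrays, run an element-wise
   loop (the part a vector unit could take over), then scatter back. */
static void filter_edge_reorganized(uint8_t *pix, int stride,
                                    int alpha, int beta, int tc)
{
    uint8_t p1[ROWS], p0[ROWS], q0[ROWS], q1[ROWS];

    for (int r = 0; r < ROWS; r++) {            /* gather */
        const uint8_t *row = pix + r * stride;
        p1[r] = row[-2]; p0[r] = row[-1]; q0[r] = row[0]; q1[r] = row[1];
    }
    for (int r = 0; r < ROWS; r++)              /* element-wise, unit stride */
        filter_px(p1[r], &p0[r], &q0[r], q1[r], alpha, beta, tc);
    for (int r = 0; r < ROWS; r++) {            /* scatter */
        uint8_t *row = pix + r * stride;
        row[-1] = p0[r]; row[0] = q0[r];
    }
}

/* Usage example: returns 1 if both forms produce the same picture. */
int demo_forms_agree(void)
{
    uint8_t a[ROWS * COLS], b[ROWS * COLS];

    /* Synthetic picture with a step across the vertical edge at column 4. */
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            a[r * COLS + c] = (uint8_t)(c < 4 ? 100 + r : 110 - r);
    memcpy(b, a, sizeof a);

    filter_edge_strided(a + 4, COLS, 8, 4, 2);
    filter_edge_reorganized(b + 4, COLS, 8, 4, 2);
    return memcmp(a, b, sizeof a) == 0;
}
```

The gather and scatter loops add work of their own, so whether the reorganized form pays off depends on how much computation sits in the middle loop; the sketch makes no claim about the speedups reported in the abstract.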
Last Update: 2021-10-10