[1] LYU Xiang-lin, ZANG Zhao-xiang, LI Si-bo, et al. Attention-based Recurrent PPO Algorithm and Its Application [J]. Computer Technology and Development, 2024, 34(01): 136-142. [doi:10.3969/j.issn.1673-629X.2024.01.020]

Attention-based Recurrent PPO Algorithm and Its Application

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume: 34
Issue: 2024(01)
Pages: 136-142
Column: Artificial Intelligence
Publication Date: 2024-01-10

Article Info

Title:
Attention-based Recurrent PPO Algorithm and Its Application
Article ID:
1673-629X(2024)01-0136-07
Author(s) (Chinese):
吕相霖 1,2, 臧兆祥 1,2, 李思博 1,2, 王俊英 1,2
Author(s):
LYU Xiang-lin 1,2, ZANG Zhao-xiang 1,2, LI Si-bo 1,2, WANG Jun-ying 1,2
1. Hubei Key Laboratory of Intelligent Visual Monitoring for Hydropower Engineering, Three Gorges University,Yichang 443002,China;
2. School of Computer and Information,Three Gorges University,Yichang 443002,China
Keywords:
deep reinforcement learning; partially observable; attention mechanism; LSTM network; proximal policy optimization algorithm
CLC Number:
TP242.6
DOI:
10.3969/j.issn.1673-629X.2024.01.020
Abstract:
To address the problems that deep reinforcement learning algorithms face in partially observable environments, such as insufficient information about the environment and the presence of random factors, a proximal policy optimization algorithm combining an attention mechanism with a recurrent neural network (ARPPO) is proposed. The algorithm first extracts features from the encoded environment image through convolutional network layers; it then applies an attention mechanism to highlight the key information in the state; next, an LSTM network captures the temporal characteristics of the data; finally, policy learning and training are carried out with the PPO algorithm under an Actor-Critic structure. Ablation and comparative experiments on two exploration tasks were designed in the Gym-Minigrid environment. The experimental results show that ARPPO converges faster than the existing A2C, PPO, and RPPO algorithms, remains highly stable after convergence, and adapts better to unknown environments containing random factors.
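To make the pipeline described in the abstract concrete, the following is a minimal sketch of an ARPPO-style network in PyTorch. It is not the authors' code: the class name ARPPONet, the layer sizes, the single attention head, and the 7x7x3 observation shape (Gym-Minigrid's default egocentric view) are all illustrative assumptions; only the conv -> attention -> LSTM -> Actor-Critic ordering comes from the paper.

import torch
import torch.nn as nn

class ARPPONet(nn.Module):
    # Illustrative sketch of the ARPPO architecture; sizes are assumptions.
    def __init__(self, obs_channels=3, embed_dim=64, hidden_dim=128, n_actions=7):
        super().__init__()
        # 1) Convolutional layers extract spatial features from the observation.
        self.conv = nn.Sequential(
            nn.Conv2d(obs_channels, 16, kernel_size=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=2), nn.ReLU(),
            nn.Conv2d(32, embed_dim, kernel_size=2), nn.ReLU(),
        )
        # 2) Self-attention highlights the key cells of the partially observed state.
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=1, batch_first=True)
        # 3) An LSTM captures the temporal characteristics across environment steps.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # 4) Actor-Critic heads for PPO policy learning.
        self.actor = nn.Linear(hidden_dim, n_actions)
        self.critic = nn.Linear(hidden_dim, 1)

    def forward(self, obs, hidden=None):
        # obs: (batch, channels, H, W); hidden: LSTM (h, c) carried between steps.
        x = self.conv(obs)                      # (B, E, h, w)
        tokens = x.flatten(2).transpose(1, 2)   # (B, h*w, E): one token per cell
        attended, _ = self.attn(tokens, tokens, tokens)
        x = attended.mean(dim=1, keepdim=True)  # pool cells into one timestep: (B, 1, E)
        out, hidden = self.lstm(x, hidden)      # recurrent state across env steps
        out = out.squeeze(1)
        return self.actor(out), self.critic(out), hidden

# One environment step (batch of 1), assuming a 7x7 encoded MiniGrid view:
net = ARPPONet()
obs = torch.zeros(1, 3, 7, 7)
logits, value, hidden = net(obs, None)

In a PPO training loop, the actor logits would feed a categorical distribution for the clipped surrogate objective, the critic value would serve as the baseline for advantage estimation, and the LSTM state (h, c) would be carried across steps and reset at episode boundaries.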


Last Update: 2024-01-10