ZHOU Xian-wei, BAO Ming-hao, YE Xin, et al. Two-stage TD3 Deep Reinforcement Learning Algorithm with Q Network Filtration [J]. Computer Technology and Development, 2023, 33(10): 101-108. [doi:10.3969/j.issn.1673-629X.2023.10.016]

Two-stage TD3 Deep Reinforcement Learning Algorithm with Q Network Filtration

Computer Technology and Development (《计算机技术与发展》) [ISSN: 1006-6977 / CN: 61-1281/TN]

Volume:
33
Issue:
2023, No. 10
Pages:
101-108
Section:
Artificial Intelligence
Publication Date:
2023-10-10

Article Info

Title:
Two-stage TD3 Deep Reinforcement Learning Algorithm with Q Network Filtration
Article ID:
1673-629X(2023)10-0101-08
Author(s):
ZHOU Xian-wei (周娴玮), BAO Ming-hao (包明豪), YE Xin (叶鑫), YU Song-sen (余松森)
School of Software, South China Normal University, Foshan 528000, China
Keywords:
two-stage deep reinforcement learning; cold start; imitation learning; pretraining network; TD3
CLC Number:
TP391.9
DOI:
10.3969/j.issn.1673-629X.2023.10.016
Abstract:
Training of a conventional deep reinforcement learning model starts from "zero": the initial policy is randomly initialized, which leads to low exploration efficiency and a low sample learning rate for the agent in the early stage of training, and makes the network hard to converge; this stage is also known as the cold start problem. To address it, most current work adopts a two-stage deep reinforcement learning training scheme. However, an agent trained this way may forget the demonstrated actions after transitioning from imitation learning to the deep reinforcement learning stage, which manifests as an abrupt drop in performance and reward. Therefore, a two-stage TD3 deep reinforcement learning method with Q network filtering is proposed. Firstly, expert demonstration data are collected, and the Actor and Critic networks are pre-trained with imitation learning (behavior cloning) and the TD3 Q network update formula, respectively. Further, to prevent the pre-trained Actor network from mistakenly selecting over-valued actions outside the demonstration data set during the policy gradient update and thereby forgetting the demonstrated actions, a Q network filtering algorithm is proposed: it filters out the over-estimated values that the pre-trained Critic network assigns to actions outside the demonstration data set and keeps the demonstrated action as the highest-valued action, effectively alleviating the forgetting phenomenon. Experiments on the MuJoCo robot simulation platform provided by DeepMind verify the effectiveness of the proposed algorithm.
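The pre-training stage described above could be sketched roughly as follows: behavior cloning for the Actor on the expert data and a TD3-style clipped double-Q update for the Critics on the same demonstration transitions. This is a minimal PyTorch sketch under assumed interfaces (actor(state) returns an action, each critic(state, action) returns a Q-value, and demo_loader yields demonstration transitions); it is illustrative only and not the authors' released implementation.

import torch
import torch.nn as nn

def pretrain_from_demonstrations(actor, critic1, critic2,
                                 target_critic1, target_critic2,
                                 demo_loader, gamma=0.99, epochs=10, lr=3e-4):
    # Stage-1 sketch (assumed interfaces): behavior-clone the Actor and
    # pre-train the Critics with a TD3 clipped double-Q target on demo data.
    actor_opt = torch.optim.Adam(actor.parameters(), lr=lr)
    critic_opt = torch.optim.Adam(
        list(critic1.parameters()) + list(critic2.parameters()), lr=lr)
    mse = nn.MSELoss()

    for _ in range(epochs):
        for state, action, reward, next_state, done in demo_loader:
            # Behavior cloning: regress the Actor's output onto the expert action.
            bc_loss = mse(actor(state), action)
            actor_opt.zero_grad()
            bc_loss.backward()
            actor_opt.step()

            # TD3 Critic update on the demonstration transition
            # (target policy smoothing noise omitted for brevity).
            with torch.no_grad():
                next_action = actor(next_state)
                target_q = torch.min(target_critic1(next_state, next_action),
                                     target_critic2(next_state, next_action))
                y = reward + gamma * (1.0 - done) * target_q
            critic_loss = mse(critic1(state, action), y) + mse(critic2(state, action), y)
            critic_opt.zero_grad()
            critic_loss.backward()
            critic_opt.step()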

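The Q network filtering idea (suppressing over-estimated values of actions outside the demonstration set so that the demonstrated action stays the highest-valued one at each demonstration state) could plausibly be expressed as an extra penalty on the pre-trained Critic. The sampling scheme, margin and hinge form below are assumptions made for illustration and are not the paper's exact filtering rule.

import torch

def q_filter_penalty(critic, state, demo_action,
                     num_samples=10, noise_std=0.3, margin=0.0):
    # Hypothetical Q network filtering penalty: push down Q-values of
    # non-demonstration actions that overtake the demonstrated action's value.
    batch_size, action_dim = demo_action.shape
    q_demo = critic(state, demo_action)  # assumed shape: (batch, 1)

    # Perturbed copies of the demo action stand in for actions outside the
    # demonstration set (an illustrative choice, not the paper's definition).
    noise = noise_std * torch.randn(num_samples, batch_size, action_dim)
    sampled = (demo_action.unsqueeze(0) + noise).clamp(-1.0, 1.0)

    q_sampled = torch.stack([critic(state, a) for a in sampled])  # (num_samples, batch, 1)

    # Hinge penalty: only values exceeding the demo action's value contribute,
    # which keeps the demonstrated action ranked highest by the Critic.
    excess = (q_sampled - q_demo.unsqueeze(0) + margin).clamp(min=0.0)
    return excess.mean()

Adding such a penalty to the Critic's pre-training loss on demonstration states would be one way to realize the "keep the demonstrated action as the highest-valued action" behaviour before the agent switches to the standard TD3 stage.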

Last Update: 2023-10-10