ZHOU Xian-wei, BAO Ming-hao, YE Xin, et al. Two-stage TD3 Deep Reinforcement Learning Algorithm with Q Network Filtration [J]. Computer Technology and Development, 2023, 33(10): 101-108. [doi:10.3969/j.issn.1673-629X.2023.10.016]

Two-stage TD3 Deep Reinforcement Learning Algorithm with Q Network Filtration

Computer Technology and Development (《计算机技术与发展》) [ISSN: 1006-6977 / CN: 61-1281/TN]

Volume:
33
Issue:
2023, No. 10
Pages:
101-108
Section:
Artificial Intelligence
Publication Date:
2023-10-10

Article Info

Title:
Two-stage TD3 Deep Reinforcement Learning Algorithm with Q Network Filtration
Article ID:
1673-629X(2023)10-0101-08
Author(s):
ZHOU Xian-wei (周娴玮), BAO Ming-hao (包明豪), YE Xin (叶鑫), YU Song-sen (余松森)
School of Software, South China Normal University, Foshan 528000, China
Keywords:
two-stage deep reinforcement learning; cold start; imitation learning; pretraining network; TD3
CLC Number:
TP391.9
DOI:
10.3969/j.issn.1673-629X.2023.10.016
Abstract:
Training of a conventional deep reinforcement learning model starts from "zero": the initial policy is randomly initialized, which leads to low exploration efficiency and a low sample learning rate for the agent in the early stage of training, and makes the network hard to converge; this stage is also known as the cold start problem. To address it, most current work adopts a two-stage deep reinforcement learning training scheme. However, an agent trained this way may forget the demonstrated actions after transitioning from imitation learning to the deep reinforcement learning stage, which manifests as an abrupt drop in performance and reward. Therefore, a two-stage TD3 deep reinforcement learning method with Q network filtering is proposed. Firstly, expert demonstration data are collected, and the Actor and Critic networks are pre-trained with imitation learning (behavior cloning) and the TD3 Q network update formula, respectively. Further, to prevent the pre-trained Actor network from mistakenly selecting over-valued actions outside the demonstration data set during the policy gradient update and thereby forgetting the demonstrated actions, a Q network filtering algorithm is proposed: it filters out the over-estimated values that the pre-trained Critic network assigns to actions outside the demonstration data set and keeps the demonstrated action as the highest-valued action, effectively alleviating the forgetting phenomenon. Experiments on the MuJoCo robot simulation platform provided by DeepMind verify the effectiveness of the proposed algorithm.
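The pre-training stage described above could be sketched roughly as follows: behavior cloning for the Actor on the expert data and a TD3-style clipped double-Q update for the Critics on the same demonstration transitions. This is a minimal PyTorch sketch under assumed interfaces (actor(state) returns an action, each critic(state, action) returns a Q-value, and demo_loader yields demonstration transitions); it is illustrative only and not the authors' released implementation.

import torch
import torch.nn as nn

def pretrain_from_demonstrations(actor, critic1, critic2,
                                 target_critic1, target_critic2,
                                 demo_loader, gamma=0.99, epochs=10, lr=3e-4):
    # Stage-1 sketch (assumed interfaces): behavior-clone the Actor and
    # pre-train the Critics with a TD3 clipped double-Q target on demo data.
    actor_opt = torch.optim.Adam(actor.parameters(), lr=lr)
    critic_opt = torch.optim.Adam(
        list(critic1.parameters()) + list(critic2.parameters()), lr=lr)
    mse = nn.MSELoss()

    for _ in range(epochs):
        for state, action, reward, next_state, done in demo_loader:
            # Behavior cloning: regress the Actor's output onto the expert action.
            bc_loss = mse(actor(state), action)
            actor_opt.zero_grad()
            bc_loss.backward()
            actor_opt.step()

            # TD3 Critic update on the demonstration transition
            # (target policy smoothing noise omitted for brevity).
            with torch.no_grad():
                next_action = actor(next_state)
                target_q = torch.min(target_critic1(next_state, next_action),
                                     target_critic2(next_state, next_action))
                y = reward + gamma * (1.0 - done) * target_q
            critic_loss = mse(critic1(state, action), y) + mse(critic2(state, action), y)
            critic_opt.zero_grad()
            critic_loss.backward()
            critic_opt.step()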

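The Q network filtering idea (suppressing over-estimated values of actions outside the demonstration set so that the demonstrated action stays the highest-valued one at each demonstration state) could plausibly be expressed as an extra penalty on the pre-trained Critic. The sampling scheme, margin and hinge form below are assumptions made for illustration and are not the paper's exact filtering rule.

import torch

def q_filter_penalty(critic, state, demo_action,
                     num_samples=10, noise_std=0.3, margin=0.0):
    # Hypothetical Q network filtering penalty: push down Q-values of
    # non-demonstration actions that overtake the demonstrated action's value.
    batch_size, action_dim = demo_action.shape
    q_demo = critic(state, demo_action)  # assumed shape: (batch, 1)

    # Perturbed copies of the demo action stand in for actions outside the
    # demonstration set (an illustrative choice, not the paper's definition).
    noise = noise_std * torch.randn(num_samples, batch_size, action_dim)
    sampled = (demo_action.unsqueeze(0) + noise).clamp(-1.0, 1.0)

    q_sampled = torch.stack([critic(state, a) for a in sampled])  # (num_samples, batch, 1)

    # Hinge penalty: only values exceeding the demo action's value contribute,
    # which keeps the demonstrated action ranked highest by the Critic.
    excess = (q_sampled - q_demo.unsqueeze(0) + margin).clamp(min=0.0)
    return excess.mean()

Adding such a penalty to the Critic's pre-training loss on demonstration states would be one way to realize the "keep the demonstrated action as the highest-valued action" behaviour before the agent switches to the standard TD3 stage.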

Last Update: 2023-10-10