ZHAO Chun-yu, LAI Jun*, CHEN Xi-liang, et al. Model-agnostic Meta Reinforcement Learning Based on Similarity Weighting [J]. Computer Technology and Development, 2024, 34(05): 133-140. [doi:10.20165/j.cnki.ISSN1673-629X.2024.0051]

Model-agnostic Meta Reinforcement Learning Based on Similarity Weighting

《计算机技术与发展》(Computer Technology and Development) [ISSN: 1006-6977 / CN: 61-1281/TN]

Volume: 34
Issue: 2024, No. 05
Pages: 133-140
Column: Artificial Intelligence
Publication Date: 2024-05-10

Article Info

Title:
Model-agnostic Meta Reinforcement Learning Based on Similarity Weighting
Article Number:
1673-629X(2024)05-0133-08
Author(s):
ZHAO Chun-yu, LAI Jun*, CHEN Xi-liang, ZHANG Ren-wen
School of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, China
Keywords:
meta-learning; reinforcement learning; meta-reinforcement learning; gradient descent; model-agnostic
CLC Number:
TP181
DOI:
10.20165/j.cnki.ISSN1673-629X.2024.0051
Abstract:
Reinforcement learning has achieved excellent performance in fields such as game playing and robot control. To further improve training efficiency, meta-learning has been extended to reinforcement learning, and the resulting meta-reinforcement learning has become a research hotspot in the reinforcement learning community. The quality of the meta-knowledge is the key factor determining the effectiveness of meta-reinforcement learning; gradient-based meta-reinforcement learning takes the model's initial parameters as the meta-knowledge that guides subsequent learning. To improve the quality of this meta-knowledge, we propose a general meta-reinforcement learning method in which a weighting mechanism explicitly expresses each subtask's contribution to the training effect. The method uses the similarity between the gradient-update vector obtained on each subtask and the gradient-update vectors of all tasks in the task set as the update weight, refining the gradient-update process, raising the quality of the meta-knowledge encoded in the initial parameters, and letting the trained model start from a good initialization when solving new tasks. The method can be applied to any gradient-based meta-reinforcement learning algorithm and reaches the goal of solving new tasks quickly from a small number of samples. In comparative experiments on 2D navigation tasks and simulated robot locomotion control tasks, the method outperforms the baseline algorithms, demonstrating the soundness of the weighting mechanism.
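The weighting mechanism described in the abstract lends itself to a short illustration. The sketch below is not the paper's implementation: it assumes cosine similarity as the similarity measure and a softmax normalization of the weights, and all names (`similarity_weighted_update`, `task_updates`) are hypothetical. It shows one MAML-style outer step in which each subtask's gradient-update vector is weighted by its average similarity to every update vector in the task batch, so subtasks whose updates agree with the batch contribute more to the meta-update.

```python
# Minimal sketch of similarity-weighted meta-gradient aggregation.
# Assumptions (not from the paper): cosine similarity, softmax weights.
import numpy as np

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two flattened gradient-update vectors."""
    return float(u @ v) / (float(np.linalg.norm(u) * np.linalg.norm(v)) + eps)

def similarity_weighted_update(task_updates, outer_lr=0.01):
    """Combine per-subtask gradient-update vectors into one meta-update.

    Each subtask's weight is its average similarity to all update
    vectors in the task set (including itself), normalized with a
    softmax, so updates aligned with the batch are weighted up.
    """
    n = len(task_updates)
    sims = np.array([
        np.mean([cosine(task_updates[i], task_updates[j]) for j in range(n)])
        for i in range(n)
    ])
    weights = np.exp(sims) / np.exp(sims).sum()  # softmax normalization
    meta_update = sum(w * g for w, g in zip(weights, task_updates))
    return outer_lr * meta_update

# Toy usage: three subtasks, a 5-dimensional parameter vector.
rng = np.random.default_rng(0)
updates = [rng.normal(size=5) for _ in range(3)]
theta = np.zeros(5)
theta += similarity_weighted_update(updates)
```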


Last Update: 2024-05-10