LIU Xiao-feng*, LIU Zhi-bin, DONG Zhao-an. Research on Memory Heuristic Reinforcement Learning[J]. Computer Technology and Development, 2023, 33(06): 168-172. [doi:10.3969/j.issn.1673-629X.2023.06.025]

Research on Memory Heuristic Reinforcement Learning

Computer Technology and Development (《计算机技术与发展》) [ISSN: 1006-6977 / CN: 61-1281/TN]

Volume: 33
Issue: 2023, No. 06
Pages: 168-172
Section: Artificial Intelligence
Publication Date: 2023-06-10

Article Info

Title: Research on Memory Heuristic Reinforcement Learning
Article No.: 1673-629X(2023)06-0168-05
Authors: LIU Xiao-feng 1*, LIU Zhi-bin 2, DONG Zhao-an 2
1. Library, Qufu Normal University, Rizhao 276826, China;
2. School of Computer Science, Qufu Normal University, Rizhao 276826, China
Keywords: reinforcement learning; Q-learning; heuristic search; Shaping function; path planning
CLC Number: TP301
DOI: 10.3969/j.issn.1673-629X.2023.06.025
Abstract:
This work studies reinforcement learning problems in the field of artificial intelligence. When handling optimization problems, reinforcement learning does not depend on model information, and it has gradually found applications in the information and production industries with good results. However, traditional reinforcement learning algorithms obtain optimized behavior through random exploration and therefore suffer from slow learning and delayed convergence. To improve the efficiency of reinforcement learning, a method is proposed that lets the agent use the knowledge obtained from its own learning to guide and accelerate its subsequent learning. Q-learning is combined with a heuristic Shaping reward function, so that the knowledge held in memory accelerates the agent's learning process. In addition, the consistency of policy optimization with and without the heuristic function is proved. For a path planning problem, the potential field function generated during learning is used as the heuristic function to guide the exploration process of reinforcement learning. The method is validated in experiments, the effects of different parameter settings are analyzed, and a method for solving the dead point problem is proposed. The results show that the proposed method significantly accelerates the reinforcement learning process and yields an optimized search path.
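To make the abstract's approach concrete, below is a minimal sketch (not the authors' code) of tabular Q-learning with a potential-based Shaping reward on a small grid path-planning task. The potential Phi is read from the agent's own remembered value estimates, mirroring the idea of reusing learned knowledge as a heuristic; the grid size, reward values, and hyperparameters are illustrative assumptions. With the potential-based form F(s, a, s') = gamma * Phi(s') - Phi(s), the shaping term is known not to change which policy is optimal, which is the kind of consistency the abstract refers to.

import numpy as np

GRID = (10, 10)          # grid-world size (assumed)
GOAL = (9, 9)            # goal cell (assumed)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1         # illustrative hyperparameters

Q = np.zeros(GRID + (len(ACTIONS),))

def phi(state):
    # Heuristic potential taken from the agent's remembered value estimates.
    return Q[state].max()

def step(state, action):
    # Deterministic grid transition with a step cost and a goal reward (assumed).
    di, dj = ACTIONS[action]
    nxt = (min(max(state[0] + di, 0), GRID[0] - 1),
           min(max(state[1] + dj, 0), GRID[1] - 1))
    reward = 1.0 if nxt == GOAL else -0.01
    return nxt, reward, nxt == GOAL

for episode in range(500):
    s = (0, 0)
    for t in range(1000):                       # cap episode length
        # epsilon-greedy exploration
        a = (np.random.randint(len(ACTIONS)) if np.random.rand() < EPSILON
             else int(Q[s].argmax()))
        s2, r, done = step(s, a)
        # potential-based shaping: F = gamma*phi(s') - phi(s) added to the reward
        shaped = r + GAMMA * phi(s2) - phi(s)
        # standard Q-learning update on the shaped reward
        Q[s + (a,)] += ALPHA * (shaped + GAMMA * Q[s2].max() - Q[s + (a,)])
        s = s2
        if done:
            break

In this sketch the heuristic does not have to be hand-designed: as the value estimates improve, the shaping term increasingly pulls exploration toward states the agent already believes to be promising, which is the accelerating effect the abstract describes.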

Last Update: 2023-06-10