WEI Jing-yi, LAI Jun, CHEN Xi-liang. Research on Hierarchical Reinforcement Learning of Intelligent Game Confrontation Based on Mutual Information [J]. Computer Technology and Development, 2022, 32(09): 142-147. [doi:10.3969/j.issn.1673-629X.2022.09.022]

Research on Hierarchical Reinforcement Learning of Intelligent Game Confrontation Based on Mutual Information

Computer Technology and Development [ISSN:1006-6977 / CN:61-1281/TN]

Volume: 32
Issue: 2022, No. 09
Pages: 142-147
Section: Artificial Intelligence
Publication Date: 2022-09-10

Article Information

Title:
Research on Hierarchical Reinforcement Learning of Intelligent Game Confrontation Based on Mutual Information
Article Number:
1673-629X(2022)09-0142-06
Author(s):
WEI Jing-yi, LAI Jun, CHEN Xi-liang
School of Command Information System, Army Engineering University, Nanjing 210007, China
Keywords:
intelligent game; reinforcement learning; mutual information; hierarchical; A3C algorithm; unit commander
CLC Number:
TP181
DOI:
10.3969/j.issn.1673-629X.2022.09.022
Abstract:
Intelligent gaming is a hot topic in the current development of artificial intelligence. As AI continues to advance, it is also finding increasingly wide application in the field of battle command: led in particular by the US DARPA, artificial intelligence is used to provide commanders with comprehensive strategic support for battlefield decisions, and how to use AI to simulate confrontation in a battlefield environment is one aspect of this research. At present, although an agent can be optimized continuously through the rewards it obtains, its policy is usually chosen according to the immediate reward, i.e. whatever yields the largest payoff at that moment. In a real battlefield environment, however, some decisions bring no immediate benefit yet later push the overall situation in a more favorable direction and lead to better outcomes. To address this problem, hierarchical reinforcement learning is used to train agents for intelligent game confrontation and is applied in a simple battlefield environment to simulate a virtual commander, and a mutual-information-based hierarchical reinforcement learning algorithm for intelligent game confrontation, MI-A3C, is proposed. In the simulated battlefield environment, MI-A3C achieves a win rate of 86.7% and completes the main tasks, and the experiments also reveal some decisions that are beneficial to long-term payoff.
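The abstract gives no implementation details, but the general idea it describes, an A3C-style hierarchical actor-critic in which a mutual-information term ties the high-level options to the states they reach, can be sketched roughly as below. Everything in this sketch (network sizes, the DIAYN-style discriminator bound log q(z|s) - log p(z), the environment interface) is an illustrative assumption, not the authors' MI-A3C implementation.

# Illustrative sketch only -- NOT the authors' MI-A3C code.
# A high-level policy ("commander") picks a discrete option z; a low-level
# A3C-style actor-critic acts conditioned on z; a discriminator q(z|s) gives a
# mutual-information bonus log q(z|s) - log p(z) (p(z) assumed uniform), which
# can be added to the environment reward so that option choices remain
# informative about the battlefield states they lead to.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighLevelPolicy(nn.Module):
    """Chooses an option z from the current observation."""
    def __init__(self, obs_dim: int, n_options: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_options))

    def forward(self, s: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(s))

class LowLevelActorCritic(nn.Module):
    """A3C-style actor-critic conditioned on the active option."""
    def __init__(self, obs_dim: int, n_options: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim + n_options, hidden), nn.Tanh())
        self.pi_head = nn.Linear(hidden, n_actions)
        self.v_head = nn.Linear(hidden, 1)

    def forward(self, s: torch.Tensor, z_onehot: torch.Tensor):
        h = self.body(torch.cat([s, z_onehot], dim=-1))
        dist = torch.distributions.Categorical(logits=self.pi_head(h))
        value = self.v_head(h).squeeze(-1)
        return dist, value

class OptionDiscriminator(nn.Module):
    """q(z | s): recovers the option from the reached state, used for the MI bonus."""
    def __init__(self, obs_dim: int, n_options: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_options))

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return F.log_softmax(self.net(s), dim=-1)

def mutual_info_bonus(disc: OptionDiscriminator, s: torch.Tensor,
                      z: torch.Tensor, n_options: int) -> torch.Tensor:
    """Variational lower bound on I(z; s): log q(z|s) - log p(z), with p(z) uniform."""
    log_q = disc(s).gather(-1, z.unsqueeze(-1)).squeeze(-1)
    return log_q + math.log(n_options)

In such a scheme, the low-level update would use r_t plus a weighted mutual_info_bonus term when computing advantages, the discriminator would be trained with a cross-entropy loss on (state, option) pairs, and the high-level policy would be updated on the accumulated environment return of each option segment; the weighting coefficient and option length are hyperparameters that the paper, not this sketch, would specify.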


Last Update: 2022-09-10