Guidance Design for Escape Flight Vehicle Using Evolution Strategy Enhanced Deep Reinforcement Learning
Guidance commands of flight vehicles can be regarded as a series of data sets having fixed time intervals, thus guidance design constitutes a typical sequential decision problem and satisfies the basic conditions for using the deep reinforcement learning (DRL) technique. In this paper, we consider t...
Main Authors: | Xiao Hu, Tianshu Wang, Min Gong, Shaoshi Yang |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2024-01-01 |
Series: | IEEE Access |
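Subjects: | Deep reinforcement learning; evolution strategy (ES); guidance design; max-min problem; proximal policy optimization (PPO) |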
Online Access: | https://ieeexplore.ieee.org/document/10485410/ |
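
The optimization problem described in the abstract can be written compactly as follows. This is only a sketch of the verbal definitions: the symbols $\mathbf{r}_E$, $\mathbf{r}_P$ (EFV and PFV positions), $\mathbf{v}_E$ (EFV velocity), $\mathbf{u}$ (the sequence of EFV guidance commands), $T$ (engagement duration) and $d_{\mathrm{req}}$ (the given evasion distance) are notation assumed here, and reading the constraint as a lower bound is likewise an assumption. The last line checks the reported 2.7% gain against the two residual velocities quoted in the record.

```latex
\begin{align}
  d(\mathbf{u}) &= \min_{t \in [0, T]} \bigl\| \mathbf{r}_E(t) - \mathbf{r}_P(t) \bigr\|,
  &
  t^{\ast} &= \arg\min_{t \in [0, T]} \bigl\| \mathbf{r}_E(t) - \mathbf{r}_P(t) \bigr\|, \\
  v_{\mathrm{res}}(\mathbf{u}) &= \bigl\| \mathbf{v}_E(t^{\ast}) \bigr\|,
  &
  \max_{\mathbf{u}} \; v_{\mathrm{res}}(\mathbf{u})
  \quad &\text{s.t.} \quad d(\mathbf{u}) \ge d_{\mathrm{req}}, \\
  \frac{69.04 - 67.24}{67.24} &= \frac{1.80}{67.24} \approx 0.0268 \approx 2.7\%.
\end{align}
```
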
_version_ | 1797217377989951488 |
---|---|
author | Xiao Hu; Tianshu Wang; Min Gong; Shaoshi Yang |
author_facet | Xiao Hu; Tianshu Wang; Min Gong; Shaoshi Yang |
author_sort | Xiao Hu |
collection | DOAJ |
description | Guidance commands of flight vehicles can be regarded as a series of data sets having fixed time intervals, thus guidance design constitutes a typical sequential decision problem and satisfies the basic conditions for using the deep reinforcement learning (DRL) technique. In this paper, we consider the scenario where the escape flight vehicle (EFV) generates guidance commands based on the DRL technique and the pursuit flight vehicle (PFV) generates guidance commands based on the proportional navigation method. Evasion distance is described as the minimum distance between the EFV and the PFV during the escape-and-pursuit process. For the EFV, the objective of the guidance design entails progressively maximizing the residual velocity, which is described as the EFV’s velocity when the evasion distance occurs, subject to the constraint imposed by the given evasion distance. Thus an irregular dynamic max-min problem of extremely large-scale is formulated. In this problem, the time instant when the optimal solution (i.e., the maximum residual velocity satisfying the evasion distance constraint) can be attained is uncertain and the optimum solution is dependent on all the intermediate guidance commands generated before. For solving this challenging problem, a two-step strategy is conceived. In the first step, we use the proximal policy optimization (PPO) algorithm to generate the guidance commands of the EFV. The results obtained by PPO in the global search space are coarse, despite the fact that the reward function, the neural network parameters and the learning rate are designed elaborately. Therefore, in the second step, we propose to invoke the evolution strategy (ES) based algorithm, which uses the result of PPO as the initial value, to further improve the quality of the solution by searching in the local space. Extensive simulation results demonstrate that the proposed guidance design method based on the PPO algorithm is capable of achieving a residual velocity of 67.24 m/s, higher than the residual velocities achieved by the benchmark soft actor-critic and deep deterministic policy gradient algorithms. Furthermore, the proposed ES-enhanced PPO algorithm outperforms the PPO algorithm by 2.7%, achieving a residual velocity of 69.04 m/s. |
first_indexed | 2024-04-24T12:00:54Z |
format | Article |
id | doaj.art-747423c0bf2a427f95cef6a10c059f71 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-04-24T12:00:54Z |
publishDate | 2024-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-747423c0bf2a427f95cef6a10c059f71; 2024-04-08T23:01:21Z; eng; IEEE; IEEE Access; 2169-3536; 2024-01-01; 12; 48210; 48222; 10.1109/ACCESS.2024.3383322; 10485410; Guidance Design for Escape Flight Vehicle Using Evolution Strategy Enhanced Deep Reinforcement Learning; Xiao Hu (School of Aerospace Engineering, Tsinghua University, Beijing, China); Tianshu Wang (School of Aerospace Engineering, Tsinghua University, Beijing, China); Min Gong (https://orcid.org/0009-0007-3011-7858; China Academy of Launch Vehicle Technology, Beijing, China); Shaoshi Yang (https://orcid.org/0000-0003-2395-1637; School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, China); https://ieeexplore.ieee.org/document/10485410/; Deep reinforcement learning; evolution strategy (ES); guidance design; max-min problem; proximal policy optimization (PPO) |
title | Guidance Design for Escape Flight Vehicle Using Evolution Strategy Enhanced Deep Reinforcement Learning |
topic | Deep reinforcement learning; evolution strategy (ES); guidance design; max-min problem; proximal policy optimization (PPO) |
url | https://ieeexplore.ieee.org/document/10485410/ |
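
The second step described in the abstract, an evolution strategy that starts from the PPO result and searches locally, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the record does not say which ES variant the authors use, so a simple elitist (1+λ) scheme is shown, and `es_refine`, `toy_objective`, `theta_coarse` and all hyperparameter values are hypothetical names and numbers. The toy objective merely stands in for the simulated residual velocity.

```python
import numpy as np


def es_refine(objective, theta0, sigma=0.05, population=32, iterations=200, seed=0):
    """Elitist (1+lambda) evolution strategy: local search around theta0.

    Illustrative only; not the ES variant used in the paper.
    """
    rng = np.random.default_rng(seed)
    best_theta = np.asarray(theta0, dtype=float).copy()
    best_score = objective(best_theta)
    for _ in range(iterations):
        # Sample candidates in a small Gaussian neighbourhood of the current best.
        noise = rng.standard_normal((population, best_theta.size))
        candidates = best_theta + sigma * noise
        scores = np.array([objective(c) for c in candidates])
        i = int(np.argmax(scores))
        # Elitist update: keep the parent unless a candidate is strictly better.
        if scores[i] > best_score:
            best_theta, best_score = candidates[i], float(scores[i])
    return best_theta, best_score


def toy_objective(theta):
    # Hypothetical stand-in for the simulated residual velocity:
    # a smooth function maximised at theta = (1, 1, 1).
    return -float(np.sum((theta - 1.0) ** 2))


# Hypothetical coarse solution, standing in for the parameters found by PPO.
theta_coarse = np.array([0.8, 1.3, 0.9])
theta_refined, score = es_refine(toy_objective, theta_coarse)
```

Because the update is elitist, the refined score can never fall below the score of the starting point, which mirrors the role the abstract assigns to ES: improving a coarse global-search result without discarding it.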