Guidance Design for Escape Flight Vehicle Using Evolution Strategy Enhanced Deep Reinforcement Learning

Guidance commands of flight vehicles can be regarded as a series of data sets having fixed time intervals; guidance design therefore constitutes a typical sequential decision problem and satisfies the basic conditions for using the deep reinforcement learning (DRL) technique. In this paper, we consider t...

Bibliographic Details
Main Authors: Xiao Hu, Tianshu Wang, Min Gong, Shaoshi Yang
Format: Article
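Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
Subjects: Deep reinforcement learning; evolution strategy (ES); guidance design; max-min problem; proximal policy optimization (PPO)
Online Access: https://ieeexplore.ieee.org/document/10485410/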
author Xiao Hu
Tianshu Wang
Min Gong
Shaoshi Yang
collection DOAJ
description Guidance commands of flight vehicles can be regarded as a series of data sets having fixed time intervals; guidance design therefore constitutes a typical sequential decision problem and satisfies the basic conditions for using the deep reinforcement learning (DRL) technique. In this paper, we consider the scenario where the escape flight vehicle (EFV) generates guidance commands based on the DRL technique and the pursuit flight vehicle (PFV) generates guidance commands based on the proportional navigation method. The evasion distance is defined as the minimum distance between the EFV and the PFV during the escape-and-pursuit process. For the EFV, the objective of the guidance design is to progressively maximize the residual velocity, defined as the EFV’s velocity at the instant when the evasion distance occurs, subject to the constraint imposed by the given evasion distance. Thus, an irregular dynamic max-min problem of extremely large scale is formulated. In this problem, the time instant at which the optimal solution (i.e., the maximum residual velocity satisfying the evasion distance constraint) is attained is uncertain, and the optimal solution depends on all the intermediate guidance commands generated before it. To solve this challenging problem, a two-step strategy is conceived. In the first step, we use the proximal policy optimization (PPO) algorithm to generate the guidance commands of the EFV. The results obtained by PPO in the global search space are coarse, even though the reward function, the neural network parameters and the learning rate are carefully designed. Therefore, in the second step, we invoke an evolution strategy (ES) based algorithm, which uses the result of PPO as the initial value, to further improve the quality of the solution by searching in the local space. Extensive simulation results demonstrate that the proposed guidance design method based on the PPO algorithm achieves a residual velocity of 67.24 m/s, higher than the residual velocities achieved by the benchmark soft actor-critic and deep deterministic policy gradient algorithms. Furthermore, the proposed ES-enhanced PPO algorithm outperforms the PPO algorithm by 2.7%, achieving a residual velocity of 69.04 m/s.
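The quantities named in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes sampled trajectories as lists of coordinate tuples, and every name in it (`evasion_metrics`, `constrained_score`, `es_refine`, `toy_objective`) is hypothetical. The penalty form of the constraint and the elitist (1+lambda) evolution strategy are assumptions chosen for brevity, since the record does not specify the ES variant or how the constraint is handled. A toy quadratic stands in for the escape-and-pursuit simulation.

```python
import math
import random


def evasion_metrics(efv_positions, pfv_positions, efv_speeds):
    """Return (evasion_distance, residual_velocity) from sampled trajectories.

    The evasion distance is the minimum EFV-PFV separation over the samples;
    the residual velocity is the EFV speed at that same sample.
    """
    separations = [math.dist(e, p) for e, p in zip(efv_positions, pfv_positions)]
    k = min(range(len(separations)), key=separations.__getitem__)
    return separations[k], efv_speeds[k]


def constrained_score(evasion_distance, residual_velocity, required_distance,
                      penalty=1e3):
    """Hypothetical penalty form of the constrained objective (an assumption).

    Returns the residual velocity, reduced in proportion to any shortfall
    below the required evasion distance.
    """
    shortfall = max(0.0, required_distance - evasion_distance)
    return residual_velocity - penalty * shortfall


def es_refine(objective, x0, sigma=0.05, population=16, generations=40, seed=0):
    """Elitist (1+lambda) evolution strategy for local search around x0.

    Each generation samples Gaussian perturbations of the current best
    candidate and keeps the best child only if it improves the objective.
    """
    rng = random.Random(seed)
    best, best_val = list(x0), objective(x0)
    for _ in range(generations):
        children = [[xi + rng.gauss(0.0, sigma) for xi in best]
                    for _ in range(population)]
        values = [objective(c) for c in children]
        top = max(range(population), key=values.__getitem__)
        if values[top] > best_val:
            best, best_val = children[top], values[top]
    return best, best_val


def toy_objective(commands):
    """Stand-in for a full escape-and-pursuit simulation (not the paper's model)."""
    return -sum((c - 1.0) ** 2 for c in commands)


if __name__ == "__main__":
    coarse = [0.6, 1.3, 0.9]  # plays the role of a coarse first-step result
    refined, value = es_refine(toy_objective, coarse)
    print(toy_objective(coarse), value)
```

Because the search is elitist, the refined value is never worse than the starting value, which mirrors the role the description gives to the second step. The reported gain is also consistent with the two reported velocities: (69.04 - 67.24) / 67.24 is approximately 0.0268, or about 2.7%.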
first_indexed 2024-04-24T12:00:54Z
format Article
id doaj.art-747423c0bf2a427f95cef6a10c059f71
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-04-24T12:00:54Z
publishDate 2024-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-747423c0bf2a427f95cef6a10c059f71
2024-04-08T23:01:21Z
eng
IEEE
IEEE Access
2169-3536
2024-01-01
volume 12
pages 48210-48222
doi 10.1109/ACCESS.2024.3383322
article number 10485410
Guidance Design for Escape Flight Vehicle Using Evolution Strategy Enhanced Deep Reinforcement Learning
Xiao Hu, School of Aerospace Engineering, Tsinghua University, Beijing, China
Tianshu Wang, School of Aerospace Engineering, Tsinghua University, Beijing, China
Min Gong (https://orcid.org/0009-0007-3011-7858), China Academy of Launch Vehicle Technology, Beijing, China
Shaoshi Yang (https://orcid.org/0000-0003-2395-1637), School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, China
https://ieeexplore.ieee.org/document/10485410/
Deep reinforcement learning
evolution strategy (ES)
guidance design
max-min problem
proximal policy optimization (PPO)
title Guidance Design for Escape Flight Vehicle Using Evolution Strategy Enhanced Deep Reinforcement Learning
topic Deep reinforcement learning
evolution strategy (ES)
guidance design
max-min problem
proximal policy optimization (PPO)
url https://ieeexplore.ieee.org/document/10485410/