Assessment of Reinforcement Learning Algorithms for Nuclear Power Plant Fuel Optimization

Bibliographic Details
Main Author: Seurin, Paul R.M.
Other Authors: Shirvan, Koroush
Format: Thesis
Published: Massachusetts Institute of Technology 2024
Online Access: https://hdl.handle.net/1721.1/155061
https://orcid.org/0000-0002-5940-7695
_version_ 1826216939051024384
author Seurin, Paul R.M.
author2 Shirvan, Koroush
author_facet Shirvan, Koroush
Seurin, Paul R.M.
author_sort Seurin, Paul R.M.
collection MIT
description The nuclear fuel loading pattern optimization problem belongs to the class of large-scale combinatorial optimization and has been studied since the dawn of the commercial nuclear energy industry. It is also characterized by multiple objectives and constraints, which makes it impossible to solve explicitly. Stochastic optimization methodologies, including Genetic Algorithms and Simulated Annealing, are used by different nuclear utilities and vendors to perform fuel cycle reload design. Nevertheless, hand-designed solutions remain the prevalent method in the industry. To improve on state-of-the-art core reload patterns, we aim to create a method that is as scalable as possible while meeting the designer's goals for performance and safety. To this end, Deep Reinforcement Learning (RL), in particular Proximal Policy Optimization (PPO), is leveraged. RL has recently experienced a strong impetus from its successes in games, sometimes even reaching "super-human" performance. This thesis presents a first-of-a-kind approach that utilizes deep RL to solve the loading pattern problem and could be leveraged for any engineering design optimization with an integer or combinatorial input structure. To our knowledge, this work is also the first to study the behavior of several hyper-parameters that influence the RL algorithm via a multi-measure approach supported by statistical tests. To demonstrate its superiority over industry-preferred computational methods, we compared its performance against the most widely adopted legacy Stochastic Optimization (SO)-based approaches in the literature and the industry, namely Parallel Simulated Annealing with Mixing of States (PSA), Genetic Algorithm (GA), and a novel first-of-a-kind parallel Tabu Search (TS) developed for this work. For this purpose, a full software stack was developed from scratch to enable optimization with both RL and SO-based algorithms coupled to SIMULATE3, along with visualization of the results. The algorithm is highly dependent on multiple factors, such as the shape of the objective function derived for the core design, which behaves as a fudge factor affecting the stability of learning, but also on an exploration/exploitation trade-off that manifests through parameters such as the number of loading patterns seen by the agents per episode, the number of samples collected before a policy update nsteps, and an entropy factor ent_coef that increases the randomness of the policy during training. We found that RL should be applied similarly to a Gaussian Process in which the acquisition function is replaced by a parametrized policy: in essence, a policy generates solutions, while a critic learns and evaluates the quality of these solutions. Then, once an initial set of hyper-parameters is found, reducing nsteps and ent_coef until no more learning is observed or instabilities occur yields the highest sample efficiency in a robust and stable manner. Applying this approach resulted in an average economic benefit of 540,000 and 650,000 $/year/plant for a 1000 MWe and a 1200 MWe Nuclear Power Plant, respectively. Extending this approach to eleven classical benchmarks, we demonstrated that the methodology developed in this work is problem agnostic and can be seamlessly leveraged to use RL as an optimization tool elsewhere for problems with an integer or combinatorial input space.
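The tuning recipe described above, fixing an initial configuration and then shrinking the rollout length and entropy bonus until learning degrades, can be written as a simple sweep. The sketch below is illustrative only: it assumes a Stable-Baselines3-style PPO (where `n_steps` and `ent_coef` correspond to the nsteps and ent_coef parameters discussed above) and a hypothetical `LoadingPatternEnv` wrapping the core simulator; the import path and parameter values are placeholders, not settings from the thesis.

```python
# Illustrative sweep over the exploration/exploitation knobs discussed above.
# Assumes Stable-Baselines3 PPO and a hypothetical LoadingPatternEnv wrapping
# the core simulator; parameter values are placeholders, not thesis settings.
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

from my_project.envs import LoadingPatternEnv  # hypothetical environment

env = LoadingPatternEnv()
best_mean_reward = float("-inf")

# Start from a conservative configuration, then reduce n_steps and ent_coef
# until no further learning is observed or training becomes unstable.
for n_steps, ent_coef in [(2048, 0.05), (1024, 0.02), (512, 0.01), (256, 0.005)]:
    model = PPO("MlpPolicy", env, n_steps=n_steps, ent_coef=ent_coef, verbose=0)
    model.learn(total_timesteps=50_000)
    mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=20)
    if mean_reward <= best_mean_reward:  # learning stalled: keep previous setting
        break
    best_mean_reward = mean_reward
```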
Although we did not demonstrate it on the nuclear power plant fuel optimization problem, the initialization of the state at the beginning of an episode was also investigated with the benchmarks. We established that initializing the episode with the state of the best-ever solution found may be more suitable for problems with complicated reward functions, which is the case for our problem and aligns with the way core designers operate by iterating on the best solution found. We suggest, however, comparing against initialization with random state instances on a case-by-case basis; hence, we have not included this observation as an essential element of the approach. We also showed that, by intrinsically learning which solution to generate next while marching down the objective space (in contrast to SO-based methods, which do so randomly), RL yielded an algorithm that systematically found solutions of greater quality, and found them faster, than legacy approaches. This opens the door to a new optimization paradigm that could result in significant contributions in engineering fields beyond loading pattern optimization, especially when an expensive physics solver is required. Additional key observations include: (1) RL algorithms cannot be applied without physics-based intuition provided during the search; this intuition can be built into the construction of the action space (e.g., through pre-defined templates) and the reward signal. (2) Defining the frame of the optimization (e.g., here the necessity to obtain results within a day), the shape of the reward (e.g., magnitude and curvature), and understanding the degree of exploration/exploitation needed in the problem influence the values of the hyper-parameters chosen. (3) RL algorithms are highly sensitive to these hyper-parameters, but there is an approach (presented here) for gaining sample efficiency by playing with the exploration/exploitation trade-off. (4) Because we ultimately aim at improving the economics of Nuclear Power Plants, utilizing the Levelized Cost of Electricity (LCOE) to rigorously assess the true economic performance of the different algorithm configurations was pivotal to measuring the true importance of hyper-parameter tuning and the superiority of RL over legacy approaches. Overall, the methodology developed in this research supports four important new capabilities for core designers: (1) accelerate the design of new reactors by proposing efficient solutions within a reasonable amount of time, (2) ensure feasibility and quality of the resulting design, limiting the overhead time allocated to re-design, (3) propose a new set of computational methodologies, more robust and stable than classical SO-based ones, to yield higher economic gains for the existing fleet of operating reactors, and (4) propose a tool that could be leveraged in the future to gain managerial insights about strategies for the loading pattern optimization problem beyond expert know-how. Keywords — Fuel loading pattern, Optimization, Reinforcement Learning, Proximal Policy Optimization
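For concreteness, the episode-initialization strategy discussed above can be sketched as a thin Gymnasium-style wrapper that tracks the best solution encountered so far and restarts each episode from it rather than from a random state. This is a hedged illustration only: the wrapper name, the `set_state` hook, the `"score"` entry in the info dict, and the assumption that observations directly encode the solution state are all placeholders, not the thesis implementation.

```python
# Minimal sketch of "restart each episode from the best solution found so far",
# assuming a Gymnasium-style environment whose observations encode the loading
# pattern and whose info dict exposes a scalar solution score. Names below
# (BestStateResetWrapper, set_state, "score") are hypothetical, not from the thesis.
import copy
import gymnasium as gym


class BestStateResetWrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        self.best_score = float("-inf")
        self.best_obs = None

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        if self.best_obs is not None and hasattr(self.env, "set_state"):
            # Re-seed the episode with the best pattern seen so far,
            # mirroring how core designers iterate on their current best core.
            self.env.set_state(copy.deepcopy(self.best_obs))
            obs = copy.deepcopy(self.best_obs)
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        score = info.get("score", reward)  # fall back to the step reward
        if score > self.best_score:
            self.best_score = score
            self.best_obs = copy.deepcopy(obs)
        return obs, reward, terminated, truncated, info
```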
first_indexed 2024-09-23T16:55:33Z
format Thesis
id mit-1721.1/155061
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T16:55:33Z
publishDate 2024
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/155061 2024-05-25T03:01:04Z Assessment of Reinforcement Learning Algorithms for Nuclear Power Plant Fuel Optimization Seurin, Paul R.M. Shirvan, Koroush Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology. Department of Nuclear Science and Engineering S.M. S.M.
2024-05-24T18:00:11Z 2024-05-24T18:00:11Z 2023-09 2023-10-24T19:28:17.443Z Thesis https://hdl.handle.net/1721.1/155061 https://orcid.org/0000-0002-5940-7695 In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology
spellingShingle Seurin, Paul R.M.
Assessment of Reinforcement Learning Algorithms for Nuclear Power Plant Fuel Optimization
title Assessment of Reinforcement Learning Algorithms for Nuclear Power Plant Fuel Optimization
title_full Assessment of Reinforcement Learning Algorithms for Nuclear Power Plant Fuel Optimization
title_fullStr Assessment of Reinforcement Learning Algorithms for Nuclear Power Plant Fuel Optimization
title_full_unstemmed Assessment of Reinforcement Learning Algorithms for Nuclear Power Plant Fuel Optimization
title_short Assessment of Reinforcement Learning Algorithms for Nuclear Power Plant Fuel Optimization
title_sort assessment of reinforcement learning algorithms for nuclear power plant fuel optimization
url https://hdl.handle.net/1721.1/155061
https://orcid.org/0000-0002-5940-7695
work_keys_str_mv AT seurinpaulrm assessmentofreinforcementlearningalgorithmsfornuclearpowerplantfueloptimization