Adversarial Network Optimization under Bandit Feedback: Maximizing Utility in Non-Stationary Multi-Hop Networks

Stochastic Network Optimization (SNO) concerns scheduling in stochastic queueing systems and has been widely studied in network theory. Classical SNO algorithms require network conditions to be stationary w.r.t. time, which fails to capture the non-stationary components in increasingly many real-wor...

Full description

Bibliographic Details
Main Authors:	Dai, Yan, Huang, Longbo
Other Authors:	Massachusetts Institute of Technology. Operations Research Center
Format:	Article
Language:	English
Published:	ACM 2025
Online Access:	https://hdl.handle.net/1721.1/158129

_version_	1824458193873403904
author	Dai, Yan Huang, Longbo
author2	Massachusetts Institute of Technology. Operations Research Center
author_facet	Massachusetts Institute of Technology. Operations Research Center Dai, Yan Huang, Longbo
author_sort	Dai, Yan
collection	MIT
description	Stochastic Network Optimization (SNO) concerns scheduling in stochastic queueing systems and has been widely studied in network theory. Classical SNO algorithms require network conditions to be stationary w.r.t. time, which fails to capture the non-stationary components in increasingly many real-world scenarios. Moreover, most existing algorithms in network optimization assume perfect knowledge of network conditions before decision, which again rules out applications where unpredictability in network conditions presents. Motivated by these issues, this paper considers Adversarial Network Optimization (ANO) under bandit feedback. Specifically, we consider the task of i) maximizing some unknown and time-varying utility function associated with scheduler's actions, where ii) the underlying network topology is a non-stationary multi-hop network whose conditions change arbitrarily with time, and iii) only bandit feedback (the effect of actually deployed actions) is revealed after decision-making. We propose the UMO2 algorithm, which does not require any pre-decision knowledge or counterfactual feedback, ensures network stability, and also matches the utility maximization performance of any "mildly varying" reference policy up to a polynomially decaying gap. To our knowledge, no previous algorithm can handle multi-hop networks or achieve utility maximization guarantees in ANO problems with bandit feedback, whereas ours is able to do both. Technically, our method builds upon a novel integration of online learning techniques into the Lyapunov drift-plus-penalty method. Specifically, we propose meticulous analytical techniques to jointly balance online learning and Lyapunov arguments, which is used to handle the complex inter-dependency among queues in multi-hop networks. To tackle the learning obstacles due to potentially unbounded queue sizes and negative queue differences, we design a new online linear optimization algorithm that automatically adapts to the unknown (potentially negative) loss magnitudes. Finally, we also propose a bandit convex optimization algorithm with novel queue-dependent learning rate scheduling that suites drastically varying queue lengths in utility maximization. Our new insights and techniques in online learning can also be of independent interest.
first_indexed	2025-02-19T04:22:00Z
format	Article
id	mit-1721.1/158129
institution	Massachusetts Institute of Technology
language	English
last_indexed	2025-02-19T04:22:00Z
publishDate	2025
publisher	ACM
record_format	dspace
spelling	mit-1721.1/1581292025-01-29T19:42:28Z Adversarial Network Optimization under Bandit Feedback: Maximizing Utility in Non-Stationary Multi-Hop Networks Dai, Yan Huang, Longbo Massachusetts Institute of Technology. Operations Research Center Stochastic Network Optimization (SNO) concerns scheduling in stochastic queueing systems and has been widely studied in network theory. Classical SNO algorithms require network conditions to be stationary w.r.t. time, which fails to capture the non-stationary components in increasingly many real-world scenarios. Moreover, most existing algorithms in network optimization assume perfect knowledge of network conditions before decision, which again rules out applications where unpredictability in network conditions presents. Motivated by these issues, this paper considers Adversarial Network Optimization (ANO) under bandit feedback. Specifically, we consider the task of i) maximizing some unknown and time-varying utility function associated with scheduler's actions, where ii) the underlying network topology is a non-stationary multi-hop network whose conditions change arbitrarily with time, and iii) only bandit feedback (the effect of actually deployed actions) is revealed after decision-making. We propose the UMO2 algorithm, which does not require any pre-decision knowledge or counterfactual feedback, ensures network stability, and also matches the utility maximization performance of any "mildly varying" reference policy up to a polynomially decaying gap. To our knowledge, no previous algorithm can handle multi-hop networks or achieve utility maximization guarantees in ANO problems with bandit feedback, whereas ours is able to do both. Technically, our method builds upon a novel integration of online learning techniques into the Lyapunov drift-plus-penalty method. Specifically, we propose meticulous analytical techniques to jointly balance online learning and Lyapunov arguments, which is used to handle the complex inter-dependency among queues in multi-hop networks. To tackle the learning obstacles due to potentially unbounded queue sizes and negative queue differences, we design a new online linear optimization algorithm that automatically adapts to the unknown (potentially negative) loss magnitudes. Finally, we also propose a bandit convex optimization algorithm with novel queue-dependent learning rate scheduling that suites drastically varying queue lengths in utility maximization. Our new insights and techniques in online learning can also be of independent interest. 2025-01-29T19:42:26Z 2025-01-29T19:42:26Z 2024-12-13 2025-01-01T08:51:24Z Article http://purl.org/eprint/type/JournalArticle 2476-1249 https://hdl.handle.net/1721.1/158129 Dai, Yan and Huang, Longbo. 2024. "Adversarial Network Optimization under Bandit Feedback: Maximizing Utility in Non-Stationary Multi-Hop Networks." Proceedings of the ACM on Measurement and Analysis of Computing Systems, 8 (3). PUBLISHER_CC en https://doi.org/10.1145/3700413 Proceedings of the ACM on Measurement and Analysis of Computing Systems Creative Commons Attribution https://creativecommons.org/licenses/by/4.0/ The author(s) application/pdf ACM Association for Computing Machinery
spellingShingle	Dai, Yan Huang, Longbo Adversarial Network Optimization under Bandit Feedback: Maximizing Utility in Non-Stationary Multi-Hop Networks
title	Adversarial Network Optimization under Bandit Feedback: Maximizing Utility in Non-Stationary Multi-Hop Networks
title_full	Adversarial Network Optimization under Bandit Feedback: Maximizing Utility in Non-Stationary Multi-Hop Networks
title_fullStr	Adversarial Network Optimization under Bandit Feedback: Maximizing Utility in Non-Stationary Multi-Hop Networks
title_full_unstemmed	Adversarial Network Optimization under Bandit Feedback: Maximizing Utility in Non-Stationary Multi-Hop Networks
title_short	Adversarial Network Optimization under Bandit Feedback: Maximizing Utility in Non-Stationary Multi-Hop Networks
title_sort	adversarial network optimization under bandit feedback maximizing utility in non stationary multi hop networks
url	https://hdl.handle.net/1721.1/158129
work_keys_str_mv	AT daiyan adversarialnetworkoptimizationunderbanditfeedbackmaximizingutilityinnonstationarymultihopnetworks AT huanglongbo adversarialnetworkoptimizationunderbanditfeedbackmaximizingutilityinnonstationarymultihopnetworks

Adversarial Network Optimization under Bandit Feedback: Maximizing Utility in Non-Stationary Multi-Hop Networks

Similar Items