Multi-armed linear bandits with latent biases

Bibliographic Details
Main Authors: Kang, Qiyu, Tay, Wee Peng, She, Rui, Wang, Sijie, Liu, Xiaoqian, Yang, Yuan-Rui
Other Authors: School of Electrical and Electronic Engineering
Format: Journal Article
Language: English
Published: 2024
Subjects:
Online Access: https://hdl.handle.net/10356/175416
Description
Summary: In a linear stochastic bandit model, each arm corresponds to a vector in Euclidean space, and the expected return observed at each time step is determined by an unknown linear function of the selected arm. This paper addresses the challenge of identifying the optimal arm in a linear stochastic bandit model where latent biases corrupt each arm's expected reward. Unlike traditional linear bandit problems, where the observed return directly represents the reward, this paper considers a scenario where the unbiased reward at each time step remains unobservable. This model is particularly relevant in situations where the observed return is influenced by latent biases that must be carefully excluded from the learning model. For example, in recommendation systems designed to prevent racially discriminatory suggestions, it is crucial to ensure that the users' race does not influence the system; however, the observed return, such as click-through rates, may already have been influenced by racial attributes. In the case where there are finitely many arms, we develop a strategy that achieves O(|D| log n) regret, where |D| is the number of arms and n is the number of time steps. In the case where each arm is chosen from an infinite compact set, our strategy achieves O(n^{2/3}(log n)^{1/2}) regret. Experiments verify the efficiency of our strategy.
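
The setting described in the summary can be made concrete with a small simulation. The sketch below is illustrative only and is not the paper's algorithm: it assumes a simple additive per-arm latent bias (the paper's exact bias model may differ), and it shows why naively averaging observed returns can converge to a suboptimal arm, which is precisely the difficulty the paper's strategy must overcome.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem instance (not taken from the paper):
# d-dimensional arms, an unknown linear parameter theta, and a latent
# per-arm bias that corrupts the *observed* return but not the true
# reward <theta, x>.
d, n_arms = 5, 10
theta = rng.normal(size=d)                 # unknown reward parameter
arms = rng.normal(size=(n_arms, d))        # finite decision set D
bias = rng.normal(scale=0.5, size=n_arms)  # latent biases, unobservable

true_rewards = arms @ theta                # unbiased expected rewards
best_arm = int(np.argmax(true_rewards))

def pull(k: int) -> float:
    """Observed return = unbiased reward + latent bias + noise."""
    return true_rewards[k] + bias[k] + rng.normal(scale=0.1)

# A naive strategy that averages observed returns learns the *biased*
# ranking: it converges to argmax(true_rewards + bias), which need not
# be the truly optimal arm.
means = np.array([np.mean([pull(k) for _ in range(500)])
                  for k in range(n_arms)])
print("truly optimal arm:        ", best_arm)
print("arm favoured by averaging:", int(np.argmax(means)))
```

Under this additive-bias assumption, the two printed arms can differ, illustrating why the observed return cannot be used as a direct reward signal and why a dedicated strategy is needed to reach the O(|D| log n) regret reported above.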