Constrained learning in the bandit setting: doubly optimistic strategies and fast rates
Abstract
The (stochastic) bandit problem is a classic formulation of the challenge of balancing exploration and exploitation under bandit feedback. This dissertation focuses on stochastic bandits with constraints, where each action (or arm) carries both a reward and a safety risk, and the objective is to maximize reward while avoiding unsafe options. We explore the application of the optimistic principle in this context. Specifically, we study two performance metrics, the reward regret and the constraint violation; our formulation penalizes reward deficit and excess risk on a per-round basis, making it pertinent to safety-critical scenarios. We propose a doubly optimistic approach that aggressively determines which arms may be considered feasible, together with several realizations of this strategy for specific settings. We instantiate this scheme for the multi-armed bandit, linear bandit, and generalized linear bandit cases. Our findings demonstrate that these algorithms achieve faster regret rates than existing results. For the constrained multi-armed bandit problem, we achieve logarithmic regret for both the reward and the safety risk; for the linear and generalized linear bandit settings, we carry out a dual analysis based on linear programming sensitivity, identify the discrete structure underlying the continuous problem, and establish logarithmic reward regret together with a safety violation between log(T) and √T, which is the best one can expect without prior information. The upper bounds are accompanied by lower bounds that match in terms of the time horizon and the gaps. We complement our study with illustrative simulations and conclude with several future directions.
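To make the "doubly optimistic" idea concrete, the following is a minimal sketch of how such a strategy could look in the constrained multi-armed bandit case: optimism is applied on the reward side (an upper confidence bound) and on the feasibility side (a lower confidence bound on the risk, which aggressively enlarges the set of arms deemed feasible). The function name, the `arms`/`threshold` arguments, and the feedback model are illustrative assumptions, not the dissertation's actual algorithm or interface.

```python
import numpy as np


def doubly_optimistic_cmab(arms, threshold, horizon):
    """Illustrative sketch of a doubly optimistic constrained MAB strategy.

    `arms` is assumed to be a list of callables, each returning a noisy
    (reward, risk) pair when pulled; `threshold` is the per-round safety
    budget. These names are hypothetical, not the dissertation's API.
    """
    K = len(arms)
    pulls = np.zeros(K)
    reward_sum = np.zeros(K)
    risk_sum = np.zeros(K)
    total_reward, total_violation = 0.0, 0.0

    for t in range(1, horizon + 1):
        if t <= K:
            a = t - 1                                     # pull each arm once to initialize
        else:
            width = np.sqrt(2.0 * np.log(t) / pulls)      # confidence radius
            reward_ucb = reward_sum / pulls + width       # optimism on the reward side
            risk_lcb = risk_sum / pulls - width           # optimism on the feasibility side
            feasible = np.where(risk_lcb <= threshold)[0]
            if feasible.size == 0:
                a = int(np.argmin(risk_lcb))              # fall back to the least risky arm
            else:
                a = int(feasible[np.argmax(reward_ucb[feasible])])

        reward, risk = arms[a]()
        pulls[a] += 1
        reward_sum[a] += reward
        risk_sum[a] += risk
        total_reward += reward
        total_violation += max(risk - threshold, 0.0)     # per-round excess risk

    return total_reward, total_violation
```

The design choice illustrated here is that optimism on the risk estimate expands the candidate set, so an arm is only excluded once the data make its infeasibility statistically clear, while optimism on the reward drives exploration among the arms that remain.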
License
Attribution 4.0 International