首页    期刊浏览 2024年12月13日 星期五
登录注册

文章基本信息

  • 标题:Optimal Exploration–Exploitation in a Multi-armed Bandit Problem with Non-stationary Rewards
  • 本地全文:下载
  • 作者:Omar Besbes ; Yonatan Gur ; Assaf Zeevi
  • 期刊名称:Stochastic Systems
  • 印刷版ISSN:1946-5238
  • 出版年度:2019
  • 卷号:9
  • 期号:4
  • 页码:319-337
  • DOI:10.1287/stsy.2019.0033
  • 语种:English
  • 出版社:Institute for Operations Research and the Management Sciences (INFORMS), Applied Probability Society
  • 摘要:In a multi-armed bandit problem, a gambler needs to choose at each round one of K arms, each characterized by an unknown reward distribution. The objective is to maximize cumulative expected earnings over a planning horizon of length T, and performance is measured in terms of regret relative to a (static) oracle that knows the identity of the best arm a priori. This problem has been studied extensively when the reward distributions do not change over time, and uncertainty essentially amounts to identifying the optimal arm. We complement this literature by developing a flexible non-parametric model for temporal uncertainty in the rewards. The extent of temporal uncertainty is measured via the cumulative mean change in the rewards over the horizon, a metric we refer to as temporal variation, and regret is measured relative to a (dynamic) oracle that plays the point-wise optimal action at each period. Assuming that nature can choose any sequence of mean rewards such that their temporal variation does not exceed V (a temporal uncertainty budget), we characterize the complexity of this problem via the minimax regret, which depends on V (the hardness of the problem), the horizon length T, and the number of arms K.
  • 关键词:multi-armed bandit;exploration/exploitation;nonstationary;dynamic oracle;minimax regret;dynamic regret
国家哲学社会科学文献中心版权所有