Perhaps the most general form of this problem is the non-stochastic bandit linear optimization problem, in which the decision set is a convex set in some Euclidean space and the cost functions are linear. Only recently was an efficient algorithm attaining $$\tilde{O}(\sqrt{T})$$ regret discovered in this setting.
In this paper we propose a new algorithm for the bandit linear optimization problem which obtains a regret bound of $$\tilde{O}(\sqrt{Q})$$, where $$Q$$ is the total variation in the cost functions. This regret bound, previously conjectured to hold in the full information case, shows that it is possible to incur much less regret in a slowly changing environment even in the bandit setting. Our algorithm is efficient and applies several new ideas to bandit optimization, such as reservoir sampling.
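As context for the last point, reservoir sampling maintains a uniform random sample of fixed size from a stream whose length is unknown in advance, using $$O(k)$$ memory. A minimal sketch of the classic single-pass variant (this is the general technique, not the paper's specific application of it to bandit optimization):

```python
import random

def reservoir_sample(stream, k):
    """Return a uniform random sample of k items from an iterable
    of unknown length, in a single pass using O(k) memory."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a reservoir slot with probability k / (i + 1),
            # which keeps each item's inclusion probability uniform.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Each of the first $$i+1$$ items ends up in the reservoir with probability exactly $$k/(i+1)$$, which is what makes the sample usable as an unbiased estimate of stream statistics.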