Elad Hazan and Satyen Kale
Proceedings of 27th Annual Conference on Neural Information Processing Systems (NeurIPS), 2011

We present an efficient algorithm for the problem of online multiclass prediction with bandit feedback in the fully adversarial setting. We measure its regret with respect to the log-loss defined in (Abernethy and Rakhlin, 2009), which is parameterized by a scalar \(\alpha\). We prove that the regret of Newtron is \(O(\log T)\) when \(\alpha\) is a constant that does not vary with horizon \(T\), and at most \(O(T^{2/3})\) if \(\alpha\) is allowed to increase to infinity with \(T\). For \(\alpha = O(\log T)\), the regret is bounded by \(O(\sqrt{T})\), thus solving the open problem of Kakade et al (2008) and Abernethy and Rakhlin (2009). Our algorithm is based on a novel application of the online Newton method (Hazan et al, 2007). We test our algorithm and show it to perform well in experiments, even when \(\alpha\)  is a small constant.