Minimizing Regret in Combinatorial Bandits and Reinforcement Learning
KTH, School of Electrical Engineering (EES), Automatic Control. ORCID iD: 0000-0002-1934-7421
2017 (English) Doctoral thesis, monograph (Other academic)
Abstract [en]

This thesis investigates sequential decision-making tasks that fall within the framework of reinforcement learning (RL). These tasks involve a decision maker who repeatedly interacts with an environment modeled by an unknown finite Markov decision process (MDP), and who wishes to maximize the reward accumulated during her experience. Her performance can be measured through the notion of regret, which compares her accumulated expected reward against that achieved by an oracle algorithm that always follows an optimal behavior. To maximize her accumulated reward, or equivalently to minimize the regret, she must balance a trade-off between exploration and exploitation.
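To make the regret notion concrete, here is a minimal sketch (the classical UCB1 index policy, not an algorithm from the thesis) on a Bernoulli bandit, measuring the cumulative pseudo-regret against the best arm; the arm means, horizon, and exploration constant below are illustrative choices.

```python
import math
import random

def ucb1_regret(means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms with the given means and return the
    cumulative pseudo-regret: the sum over rounds of (best mean - played mean)."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k          # number of pulls per arm
    est = [0.0] * k           # empirical mean reward per arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1       # initialization: play each arm once
        else:
            # optimistic index: empirical mean plus a confidence bonus
            arm = max(range(k),
                      key=lambda a: est[a] + math.sqrt(2.0 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        est[arm] += (reward - est[arm]) / counts[arm]  # incremental mean update
        regret += best - means[arm]
    return regret
```

Because an index policy of this type pulls each suboptimal arm only on the order of log(T)/gap² times, its pseudo-regret grows logarithmically in the horizon rather than linearly; logarithmic growth is the benchmark that asymptotic regret lower bounds make precise.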

The first part of this thesis investigates combinatorial multi-armed bandit (MAB) problems, which are RL problems whose state space is a singleton, along with some applications that can be cast as combinatorial MAB problems. The number of arms in such problems generically grows exponentially with the number of basic actions, but the rewards of the various arms are correlated; the challenge is therefore to exploit the underlying combinatorial structure. For these problems, we derive asymptotic (i.e., as the time horizon grows large) lower bounds on the regret of any admissible algorithm and investigate how these bounds scale with the dimension of the underlying combinatorial structure. We then propose several algorithms and provide finite-time analyses of their regret. The proposed algorithms efficiently exploit the structure of the problem, provide better performance guarantees than existing algorithms, and significantly outperform them in practice.
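As an illustration of exploiting combinatorial structure (a generic CUCB-style sketch, not one of the thesis's algorithms), consider the m-set semi-bandit: each round the learner picks m of d basic actions and observes the reward of every chosen action. Although the number of arms (subsets) is C(d, m), it suffices to maintain one optimistic index per basic action and let an exact combinatorial oracle (here, top-m selection) pick the subset; the names and constants below are illustrative.

```python
import math
import random

def cucb_topm_regret(base_means, m, horizon, seed=0):
    """CUCB-style index policy choosing m of d Bernoulli basic actions per
    round under semi-bandit feedback; returns the cumulative pseudo-regret."""
    rng = random.Random(seed)
    d = len(base_means)
    counts = [0] * d
    est = [0.0] * d
    best = sum(sorted(base_means, reverse=True)[:m])  # value of the best subset
    regret = 0.0
    for t in range(1, horizon + 1):
        # one optimistic index per *basic* action, never per subset
        index = [est[a] + math.sqrt(1.5 * math.log(t) / counts[a])
                 if counts[a] else float("inf")
                 for a in range(d)]
        # combinatorial oracle: for m-sets, the best subset is the top-m indices
        chosen = sorted(range(d), key=lambda a: index[a], reverse=True)[:m]
        for a in chosen:
            # semi-bandit feedback: observe each chosen basic action's reward
            reward = 1.0 if rng.random() < base_means[a] else 0.0
            counts[a] += 1
            est[a] += (reward - est[a]) / counts[a]
        regret += best - sum(base_means[a] for a in chosen)
    return regret
```

The correlation structure is what makes this tractable: statistics are shared across the exponentially many subsets through the d per-action estimates, so the regret scales with d rather than with C(d, m).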

The second part of the thesis concerns RL in an unknown and discrete MDP under the average-reward criterion. We develop variations of the transportation lemma that can serve as novel tools for the regret analysis of RL algorithms. Revisiting existing regret lower bounds allows us to derive alternative bounds, which suggest that the local variance of the bias function of the MDP, i.e., the variance with respect to the next-state transition laws, can serve as a notion of problem complexity for regret minimization in RL. Leveraging these tools also allows us to present a novel regret analysis of the KL-UCRL algorithm for ergodic MDPs. The leading term in our regret bound depends on the local variance of the bias function, in agreement with the observations drawn from our lower bounds. Numerical evaluations on benchmark MDPs indicate that the leading term of the derived bound can improve on previously known results for this algorithm by an order of magnitude.
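To illustrate the quantities involved (a standard relative-value-iteration sketch, not the thesis's analysis), the following computes, for a small average-reward MDP, the gain, the bias function h, and the local variance of h with respect to each state's next-state transition law under the greedy policy; the two-state MDP in the test is a hypothetical example.

```python
def bias_and_local_variance(P, R, iters=100000, tol=1e-12):
    """Relative value iteration for an average-reward MDP.
    P[a][s][t] is the probability of moving s -> t under action a,
    R[a][s] the reward. Returns (gain, bias h normalized so h[0] = 0,
    local variance of h under each state's greedy transition law).
    Assumes the MDP is such that relative value iteration converges
    (e.g., ergodic and aperiodic)."""
    nA, nS = len(P), len(P[0])
    h = [0.0] * nS
    gain = 0.0
    for _ in range(iters):
        # Bellman backup for the average-reward criterion
        q = [[R[a][s] + sum(P[a][s][t] * h[t] for t in range(nS))
              for a in range(nA)] for s in range(nS)]
        gain = max(q[0])                      # reference state 0
        new = [max(q[s]) - gain for s in range(nS)]
        done = max(abs(new[s] - h[s]) for s in range(nS)) < tol
        h = new
        if done:
            break
    local_var = []
    for s in range(nS):
        # greedy action, then variance of h under its next-state law
        a = max(range(nA),
                key=lambda a: R[a][s] + sum(P[a][s][t] * h[t] for t in range(nS)))
        mean_h = sum(P[a][s][t] * h[t] for t in range(nS))
        local_var.append(sum(P[a][s][t] * (h[t] - mean_h) ** 2 for t in range(nS)))
    return gain, h, local_var
```

Intuitively, states whose next-state distribution spreads mass over states with very different bias values have a large local variance, and so contribute more to this notion of problem complexity.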

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2017, p. 187
Series
TRITA-EE, ISSN 1653-5146 ; 2017:168
Keyword [en]
Multi-armed Bandits, Reinforcement Learning, Regret Minimization, Statistics
National Category
Engineering and Technology
Research subject
Electrical Engineering
Identifiers
URN: urn:nbn:se:kth:diva-219970
ISBN: 978-91-7729-618-8 (print)
OAI: oai:DiVA.org:kth-219970
DiVA, id: diva2:1166385
Public defence
2017-12-19, Q2, KTH, Osquldas väg 10, Stockholm, 10:00 (English)
Note

QC 20171215

Available from: 2017-12-15. Created: 2017-12-14. Last updated: 2017-12-15. Bibliographically approved.

Open Access in DiVA

Sadegh_Talebi_Doctoral_Thesis (3885 kB), 163 downloads
File information
File name: FULLTEXT01.pdf
File size: 3885 kB
Checksum: SHA-512
187da4790bf97112f8dc8a8d7b68149ae5eea6f7048ccfc06cc5f1f44655e878488a02b1dd14ba75285973b0c007290f3ce67953187bec835528247aef2285b9
Type: fulltext
Mimetype: application/pdf

By author/editor
Talebi Mazraeh Shahi, Mohammad Sadegh
By organisation
Automatic Control
