• Title/Summary/Keyword: Bandit problem

Combining Multiple Strategies for Sleeping Bandits with Stochastic Rewards and Availability (확률적 보상과 유효성을 갖는 Sleeping Bandits의 다수의 전략을 융합하는 기법)

  • Choi, Sanghee;Chang, Hyeong Soo
    • Journal of KIISE, v.44 no.1, pp.63-70, 2017
  • This paper considers the problem of combining multiple strategies for solving sleeping bandit problems with stochastic rewards and stochastic availability. It proposes an algorithm, called sleepComb($\Phi$), whose idea is to select an appropriate strategy at each time step based on $\epsilon_t$-probabilistic switching, the mechanism used in the well-known parameter-based $\epsilon_t$-greedy strategy. The algorithm converges to the "best" strategy, properly defined for the sleeping bandit problem. Experimental results show that sleepComb($\Phi$) does converge, that it converges to the "best" strategy more rapidly than other combining algorithms, and that it chooses the "best" strategy more frequently.
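
A minimal sketch of the $\epsilon_t$-probabilistic switching idea described above: with probability $\epsilon_t$ a base strategy is picked uniformly at random, otherwise the strategy with the best empirical mean reward so far is kept. The decay schedule, the strategies/bandit_step interfaces, and the bookkeeping are illustrative assumptions, not the paper's sleepComb($\Phi$) specification.

```python
import random

def epsilon_t(t, c=5.0):
    """Decaying exploration probability; the schedule c/t is an assumption."""
    return min(1.0, c / t)

def sleep_comb(strategies, bandit_step, n_steps):
    """Combine base strategies by epsilon_t-probabilistic switching.

    strategies:  list of callables, each mapping the set of currently
                 available (awake) arms to a chosen arm.
    bandit_step: callable returning (available_arms, reward_fn) for a step.
    """
    mean = [0.0] * len(strategies)
    pulls = [0] * len(strategies)
    for t in range(1, n_steps + 1):
        available, reward_fn = bandit_step()
        if random.random() < epsilon_t(t):
            s = random.randrange(len(strategies))                 # explore a strategy
        else:
            s = max(range(len(strategies)), key=lambda i: mean[i])  # exploit the best
        reward = reward_fn(strategies[s](available))
        pulls[s] += 1
        mean[s] += (reward - mean[s]) / pulls[s]                  # running mean
    return mean
```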

A Note on the Two Dependent Bernoulli Arms

  • Kim, Dal-Ho;Cha, Young-Joon;Lee, Jae-Man
    • Journal of the Korean Data and Information Science Society, v.13 no.2, pp.195-200, 2002
  • We consider the Bernoulli two-armed bandit problem. It is well known that the myopic strategy is optimal when the prior distribution is concentrated at two points in the unit square. We investigate, for several cases in the unit square, whether the myopic strategy is optimal. In general, the myopic strategy is not optimal when the prior distribution is not concentrated at two points in the unit square.
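
A worked sketch of the myopic rule in this two-point setting: with prior weight w on arm means (p1, q1) and weight 1-w on (p2, q2), the myopic strategy pulls the arm with the larger posterior-expected immediate success probability and then Bayes-updates w. The names and parameterization are illustrative, not the paper's notation.

```python
def myopic_arm(w, points):
    """Arm (0 or 1) with the larger posterior-expected success probability."""
    (p1, q1), (p2, q2) = points
    exp0 = w * p1 + (1 - w) * p2      # expected mean of arm 0
    exp1 = w * q1 + (1 - w) * q2      # expected mean of arm 1
    return 0 if exp0 >= exp1 else 1

def update_weight(w, points, arm, success):
    """Bayes update of the weight on the first point after one pull."""
    (p1, q1), (p2, q2) = points
    like1 = (p1, q1)[arm] if success else 1 - (p1, q1)[arm]
    like2 = (p2, q2)[arm] if success else 1 - (p2, q2)[arm]
    return w * like1 / (w * like1 + (1 - w) * like2)
```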

Opportunistic Spectrum Access Based on a Constrained Multi-Armed Bandit Formulation

  • Ai, Jing;Abouzeid, Alhussein A.
    • Journal of Communications and Networks, v.11 no.2, pp.134-147, 2009
  • Tracking and exploiting instantaneous spectrum opportunities are fundamental challenges in opportunistic spectrum access (OSA) in the presence of the bursty traffic of primary users and the limited spectrum sensing capability of secondary users. In order to take advantage of the history of spectrum sensing and access decisions, a sequential decision framework is widely used to design optimal policies. However, many existing schemes, based on a partially observed Markov decision process (POMDP) framework, reveal that optimal policies are non-stationary in nature, which renders them difficult to calculate and implement. Therefore, this work pursues stationary OSA policies, which are efficient yet low-complexity, while still incorporating many practical factors, such as spectrum sensing errors and a priori unknown statistical spectrum knowledge. First, with an approximation of the channel evolution, OSA is formulated in a multi-armed bandit (MAB) framework. As a result, the optimal policy is specified by the well-known Gittins index rule, where the channel with the largest Gittins index is always selected. Then, closed-form formulas are derived for the Gittins indices with tunable approximation, and a reinforcement learning algorithm is designed for calculating the Gittins indices, depending on whether the Markovian channel parameters are available a priori or not. Finally, the superiority of the scheme over other existing schemes, in terms of policy quality and optimality, is demonstrated via extensive experiments.
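
As a hedged illustration of the Gittins index rule mentioned above, the sketch below computes indices exactly for a small known Markov chain via the Katehakis-Veinott restart-in-state formulation and value iteration; the paper instead derives closed-form approximations and a reinforcement learning variant for unknown parameters, which this sketch does not reproduce.

```python
import numpy as np

def gittins_indices(P, r, beta=0.9, iters=500):
    """Gittins index of each state of a Markov reward chain.

    Solves, for every state s, the 'restart-in-s' MDP by value iteration:
    in each state either continue the chain or restart it from s; the
    index of s is (1 - beta) times the optimal value at s.
    """
    n = len(r)
    idx = np.zeros(n)
    for s in range(n):
        V = np.zeros(n)
        for _ in range(iters):
            cont = r + beta * (P @ V)            # keep playing the chain
            restart = r[s] + beta * (P[s] @ V)   # or restart from state s
            V = np.maximum(cont, restart)
        idx[s] = (1 - beta) * V[s]
    return idx

# Hypothetical two-state channel (0 = busy, 1 = idle); reward 1 when idle.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
r = np.array([0.0, 1.0])
print(gittins_indices(P, r))   # the OSA rule always accesses the channel
                               # whose current state has the largest index
```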

Reinforcement Learning-Based Illuminance Control Method for Building Lighting System (강화학습 기반 빌딩의 방별 조명 시스템 조도값 설정 기법)

  • Kim, Jongmin;Kim, Sunyong
    • Journal of IKEEE, v.26 no.1, pp.56-61, 2022
  • Various efforts have been made worldwide to respond to environmental problems such as climate change. Research on artificial intelligence (AI)-based energy management has been widely conducted as one of the most effective ways to alleviate the climate change problem. In particular, buildings, which account for more than 20% of the total energy delivered worldwide, have been a focus for energy management using the building energy management system (BEMS). In this paper, we propose a multi-armed bandit (MAB)-based energy management algorithm that can efficiently decide the energy consumption level of the lighting system in each room of a building while minimizing the discomfort level of each room's occupants.
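
A minimal sketch of how an MAB learner might set a per-room illuminance level: UCB1 over a discrete set of candidate levels, with reward taken as a negative weighted sum of energy use and occupant discomfort. The reward model, the feedback interface, and the weight alpha are assumptions for illustration, not the paper's formulation.

```python
import math

def ucb_lighting(levels, feedback, n_steps, alpha=0.5):
    """Pick illuminance levels by UCB1; return the best level found.

    levels:   candidate illuminance settings (the arms).
    feedback: callable, feedback(level) -> (energy, discomfort) observed.
    """
    n = [0] * len(levels)
    mean = [0.0] * len(levels)
    for t in range(1, n_steps + 1):
        if t <= len(levels):
            a = t - 1                      # play every level once first
        else:
            a = max(range(len(levels)),
                    key=lambda i: mean[i] + math.sqrt(2 * math.log(t) / n[i]))
        energy, discomfort = feedback(levels[a])
        reward = -(alpha * energy + (1 - alpha) * discomfort)  # assumed model
        n[a] += 1
        mean[a] += (reward - mean[a]) / n[a]
    return levels[max(range(len(levels)), key=lambda i: mean[i])]
```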

A Heuristic Time Sharing Policy for Backup Resources in Cloud System

  • Li, Xinyi;Qi, Yong;Chen, Pengfei;Zhang, Xiaohui
    • KSII Transactions on Internet and Information Systems (TIIS), v.10 no.7, pp.3026-3049, 2016
  • Cloud computing promises high performance and cost-efficiency. However, most cloud infrastructures operate at low utilization, which greatly hinders cost effectiveness. Previous works focus on seeking efficient virtual machine (VM) consolidation strategies to increase the utilization of virtual resources in production environments, but overlook the under-utilization of backup virtual resources. We propose a heuristic time-sharing policy for backup VMs derived from the restless multi-armed bandit problem. The proposed policy increases the utilization of backup virtual resources while providing high availability. Results from both simulation and prototype-system experiments show that the traditional 1:1 backup provision can be extended to 1:M (M≫1) between backup VMs and service VMs, and that the utilization of backup VMs can be enhanced significantly.
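
A schematic sketch of priority-index time sharing in the spirit of a restless-bandit heuristic: each epoch the limited pool of backup VMs covers the highest-index service VMs, while every VM's state keeps evolving whether covered or not (the "restless" property). The risk-score index and evolve interface are stand-ins, not the paper's derived policy.

```python
def time_share(risk, n_backups, evolve, n_epochs):
    """Schedule backup coverage over epochs.

    risk:      dict vm -> current priority index (e.g., a failure risk).
    evolve:    callable, evolve(vm, covered) -> next index value.
    n_backups: number of backup VMs shared among the service VMs (1:M).
    """
    schedule = []
    for _ in range(n_epochs):
        covered = set(sorted(risk, key=risk.get, reverse=True)[:n_backups])
        schedule.append(covered)
        # restless: every VM's state changes whether it is covered or not
        risk = {vm: evolve(vm, vm in covered) for vm in risk}
    return schedule
```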

Trust-based Relay Selection in Relay-based Networks

  • Wu, Di;Zhu, Gang;Zhu, Li;Ai, Bo
    • KSII Transactions on Internet and Information Systems (TIIS), v.6 no.10, pp.2587-2600, 2012
  • It has been demonstrated that choosing an appropriate relay node can improve the transmission rate of the system. However, such improvement may be degraded by the presence of malicious relay nodes, which are selected but deliberately refuse to cooperate in transmissions. In this paper, we formulate the relay selection issue as a restless bandit problem with the objective of maximizing the average rate, while considering the credibility of each relay node, which may differ at each time instant. The optimization problem is then solved effectively using the priority-index heuristic method. Furthermore, a low-complexity algorithm is offered to facilitate practical implementation. Simulation results demonstrate the effectiveness of the proposed trust-based relay selection scheme.
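
A hedged sketch of trust-weighted relay selection: in each slot the relay with the highest index is chosen, the index here being the estimated rate scaled by a trust score updated from observed cooperation. The multiplicative index and smoothing update are assumptions; the paper derives a priority-index heuristic from its restless bandit formulation.

```python
def select_relay(rate_est, trust):
    """Pick the relay maximizing estimated rate weighted by credibility."""
    return max(rate_est, key=lambda relay: rate_est[relay] * trust[relay])

def update_trust(trust, relay, cooperated, eta=0.1):
    """Exponentially smoothed trust update from observed behavior."""
    trust[relay] = (1 - eta) * trust[relay] + eta * (1.0 if cooperated else 0.0)
    return trust
```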

The UCT algorithm applied to find the best first move in the game of Tic-Tac-Toe (삼목 게임에서 최상의 첫 수를 구하기 위해 적용된 신뢰상한트리 알고리즘)

  • Lee, Byung-Doo;Park, Dong-Soo;Choi, Young-Wook
    • Journal of Korea Game Society, v.15 no.5, pp.109-118, 2015
  • The game of Go, which originated in ancient China, is regarded as one of the most difficult challenges in the field of AI. Over the past few years, the top computer Go programs based on MCTS have surprisingly beaten professional players with handicaps. MCTS is an approach that simulates random sequences of legal moves until the game ends, and it has replaced the traditional knowledge-based approach. We applied the UCT algorithm, an MCTS variant, to the game of Tic-Tac-Toe to find the best first move and compared the result with that generated by a pure MCTS. Furthermore, to better understand UCB, we introduced the epsilon-greedy and UCB algorithms for solving the multi-armed bandit problem and compared their performances.
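
For reference, the UCB1 rule that UCT applies at every node of the search tree, balancing a move's empirical mean reward against an exploration bonus that shrinks as the move is visited more often; the exploration constant sqrt(2) is the textbook default, used here as an assumption.

```python
import math

def ucb1_select(children, c=math.sqrt(2)):
    """Index of the child move to descend into, per UCB1.

    children: list of (total_reward, visits) pairs for each child move.
    """
    total_visits = sum(v for _, v in children)
    def score(stats):
        w, v = stats
        if v == 0:
            return float("inf")        # always try unvisited moves first
        return w / v + c * math.sqrt(math.log(total_visits) / v)
    return max(range(len(children)), key=lambda i: score(children[i]))
```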

Optimal Exploration-Exploitation Strategies in Reinforcement Learning for Online Banner Advertising: The Impact of Word-of-Mouth Effects (온라인 배너 광고 강화학습의 최적 탐색-활용 전략: 구전효과의 영향)

  • Bumsoo Kim;Gun Jea Yu;Joonkyum Lee
    • Journal of Service Research and Studies, v.14 no.2, pp.1-17, 2024
  • One of the most important decisions for managers in the online banner advertising industry is choosing the best banner alternative to expose to customers. Since the click probability of each banner alternative is difficult to know in advance, managers must experiment with multiple alternatives, estimate the click probability of each alternative based on customer clicks, and find the optimal alternative. In this reinforcement learning process, the main decision problem is finding the optimal balance between exploitation, which utilizes the accumulated estimated click-probability information, and exploration, which tries new alternatives to find potentially better options. In this study we analyze the impact of word-of-mouth effects and of the number of alternatives on the optimal exploration-exploitation strategy. More specifically, we focus on the word-of-mouth effect, whereby the click-through rate of a banner increases as customers who clicked the exposed banner promote the related product to those around them, and we add it to the overall reinforcement learning process. We analyze our problem with a Multi-Armed Bandit model, and the results show that the larger the word-of-mouth effect and the fewer the banner alternatives, the higher the optimal exploration level of advertising reinforcement learning. We find that as the probability of customers clicking on the banner increases due to the word-of-mouth effect, the value of previously accumulated click-through-rate knowledge decreases, and therefore the value of exploring new alternatives increases. Additionally, when the number of advertising alternatives is small, a larger increase in the optimal exploration level is observed as the magnitude of the word-of-mouth effect increases. This study provides meaningful academic and managerial implications at a time when online word-of-mouth and its impact on society and business are becoming more important.
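
A minimal sketch of the setting analyzed above: epsilon-greedy banner selection in which every click slightly raises that banner's true click-through rate, modeling the word-of-mouth effect. The drift model, the wom increment, and all parameters are illustrative assumptions, not the paper's exact simulation.

```python
import random

def run_campaign(ctr, epsilon, wom=0.01, n_steps=10000):
    """Epsilon-greedy banner exposure with word-of-mouth CTR drift.

    ctr:     list of true click probabilities, one per banner (mutated).
    epsilon: exploration level whose optimal value the study analyzes.
    wom:     assumed per-click lift in the clicked banner's CTR.
    """
    k = len(ctr)
    shows, clicks = [0] * k, [0] * k
    for _ in range(n_steps):
        if random.random() < epsilon:
            b = random.randrange(k)                          # explore
        else:
            b = max(range(k), key=lambda i: clicks[i] / shows[i]
                    if shows[i] else 0.0)                    # exploit
        shows[b] += 1
        if random.random() < ctr[b]:
            clicks[b] += 1
            ctr[b] = min(1.0, ctr[b] + wom)  # word-of-mouth raises the CTR
    return sum(clicks)
```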