• Title/Summary/Keyword: markov decision problem

Optimal Network Defense Strategy Selection Based on Markov Bayesian Game

  • Wang, Zengguang;Lu, Yu;Li, Xi;Nie, Wei
    • KSII Transactions on Internet and Information Systems (TIIS) / v.13 no.11 / pp.5631-5652 / 2019
  • The existing defense strategy selection methods based on game theory generally select the optimal defense strategy in the form of a mixed strategy. However, it is hard for network managers to understand and implement a defense strategy given in this form. To address this problem, we constructed an incomplete-information stochastic game model for dynamic analysis, predicting the multi-stage attack-defense process by combining Bayesian game theory with the Markov decision-making method. In addition, the payoffs are quantified from the impact value of attack-defense actions. On this basis, we designed an optimal defense strategy selection method that uses defense effectiveness as the selection criterion. The feasibility of the proposed method is verified via a representative experiment. Compared to classical game-theoretic strategy selection methods, the proposed method selects the optimal strategy for the multi-stage attack-defense process in the form of a pure strategy, which proves more operable in practice.
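As a rough illustration of the pure-strategy selection idea described above, the sketch below picks, for a single stage, the pure defense action that maximizes the expected defender payoff under a Bayesian belief over attacker types. All payoff matrices, beliefs, and attacker strategies here are hypothetical placeholders, not values from the paper.

```python
import numpy as np

# Hypothetical example: 2 attacker types, 3 attack actions, 3 pure defense strategies.
# payoff[t][a, d] = defender payoff when an attacker of type t plays attack a
# and the defender plays pure strategy d (illustrative numbers only).
payoff = {
    0: np.array([[ 2, -1,  0],
                 [-2,  3,  1],
                 [ 0,  1,  2]]),
    1: np.array([[ 1,  0, -1],
                 [ 0,  2,  1],
                 [-1,  1,  3]]),
}
belief = {0: 0.6, 1: 0.4}                           # belief over attacker types
attack_strategy = {0: np.array([0.5, 0.3, 0.2]),    # assumed attacker mixed strategies
                   1: np.array([0.2, 0.5, 0.3])}

# Expected payoff of each pure defense strategy, averaged over types and attack actions
expected = sum(belief[t] * (attack_strategy[t] @ payoff[t]) for t in belief)
best_defense = int(np.argmax(expected))
print("expected defense payoffs:", expected, "-> choose pure strategy", best_defense)
```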

A Simulation Sample Accumulation Method for Efficient Simulation-based Policy Improvement in Markov Decision Process (마르코프 결정 과정에서 시뮬레이션 기반 정책 개선의 효율성 향상을 위한 시뮬레이션 샘플 누적 방법 연구)

  • Huang, Xi-Lang;Choi, Seon Han
    • Journal of Korea Multimedia Society / v.23 no.7 / pp.830-839 / 2020
  • As a popular mathematical framework for modeling decision making, the Markov decision process (MDP) has been widely used to solve problems in many engineering fields. An MDP consists of a set of discrete states, a finite set of actions, and rewards received after reaching a new state by taking an action from the previous state. The objective of an MDP is to find an optimal policy, that is, the best action to take in each state so as to maximize the expected discounted reward (EDR) of the policy. In practice, the MDP model is typically unknown, so simulation-based policy improvement (SBPI), which improves a given base policy sequentially by selecting the best action in each state according to rewards observed via simulation, can be a practical way to find the optimal policy. However, the efficiency of SBPI is still a concern since many simulation samples are required to estimate the EDR precisely for each action in each state. In this paper, we propose a method to select the best action accurately in each state using a small number of simulation samples, thereby improving the efficiency of SBPI. The proposed method accumulates the simulation samples observed in previous states, so the EDR can be estimated precisely even with a small number of samples in the current state. Comparative experiments with the existing method demonstrate that the proposed method improves the efficiency of SBPI.
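The sketch below illustrates the SBPI structure the abstract describes: in a given state, estimate the EDR of each candidate action by averaging simulated discounted returns, then keep the best one. The simulator interface (`env_step`), base policy representation, horizon, and sample counts are assumptions for illustration, not the paper's implementation.

```python
def simulate_return(env_step, policy, state, first_action, gamma=0.95, horizon=50):
    """Roll out one episode: take `first_action` in `state`, then follow the base
    policy, and return the discounted sum of rewards (one EDR sample)."""
    total, discount, s, a = 0.0, 1.0, state, first_action
    for _ in range(horizon):
        s, r = env_step(s, a)      # hypothetical simulator: returns (next state, reward)
        total += discount * r
        discount *= gamma
        a = policy[s]
    return total

def improve_state(env_step, policy, state, actions, n_samples=20):
    """One SBPI step for a single state: estimate the EDR of each candidate action
    by averaging simulation samples, then keep the action with the best estimate."""
    estimates = {a: sum(simulate_return(env_step, policy, state, a)
                        for _ in range(n_samples)) / n_samples
                 for a in actions}
    return max(estimates, key=estimates.get)
```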

Online Selective-Sample Learning of Hidden Markov Models for Sequence Classification

  • Kim, Minyoung
    • International Journal of Fuzzy Logic and Intelligent Systems / v.15 no.3 / pp.145-152 / 2015
  • We consider an online selective-sample learning problem for sequence classification, where the goal is to learn a predictive model using a stream of data samples whose class labels can be selectively queried by the algorithm. Given that there is a limit to the total number of queries permitted, the key issue is choosing the most informative and salient samples for their class labels to be queried. Recently, several aggressive selective-sample algorithms have been proposed under a linear model for static (non-sequential) binary classification. We extend the idea to hidden Markov models for multi-class sequence classification by introducing reasonable measures for the novelty and prediction confidence of the incoming sample with respect to the current model, on which the query decision is based. For several sequence classification datasets/tasks in online learning setups, we demonstrate the effectiveness of the proposed approach.
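A minimal sketch of the selective-query idea, assuming one trained HMM per class that exposes a log-likelihood `score()` method (as in hmmlearn's GaussianHMM); the margin-based confidence measure and threshold are illustrative stand-ins for the paper's novelty and confidence criteria.

```python
def should_query(models, seq, margin_threshold=1.0):
    """models: class label -> trained HMM exposing .score(seq) (sequence log-likelihood).
    Returns (predicted_label, query_flag): the true label is queried only when the
    top-two class log-likelihoods are too close, i.e. prediction confidence is low."""
    scores = {c: m.score(seq) for c, m in models.items()}
    ordered = sorted(scores.values(), reverse=True)
    margin = ordered[0] - ordered[1]
    return max(scores, key=scores.get), margin < margin_threshold
```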

Machine Maintenance Policy Using Partially Observable Markov Decision Process

  • Pak, Pyoung Ki;Kim, Dong Won;Jeong, Byung Ho
    • Journal of Korean Society for Quality Management / v.16 no.2 / pp.1-9 / 1988
  • This paper considers a machine maintenance problem. The machine's condition is partially known by observing the machine's output products. This problem is formulated as an infinite-horizon partially observable Markov decision process to find an optimal maintenance policy. However, even though an optimal policy for the model exists, finding it is very time consuming. Thus, the intent of this study is to find an $\varepsilon$-optimal stationary policy minimizing the expected discounted total cost of the system. The $\varepsilon$-optimal policy is found using a modified version of the well-known policy iteration algorithm. A numerical example is also shown.
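To make the partially observable setting concrete, here is a minimal belief-state (Bayes filter) update for a hypothetical two-state machine observed through product quality; the transition and observation probabilities are invented for illustration and are not the paper's model.

```python
import numpy as np

# Hypothetical 2-state machine (0 = good, 1 = worn) observed through product quality.
P = np.array([[0.9, 0.1],     # state transition probabilities per period
              [0.0, 1.0]])
O = np.array([[0.95, 0.05],   # O[s, z]: probability of observing quality z in state s
              [0.30, 0.70]])

def belief_update(b, z):
    """Bayes filter: propagate the belief through the transition model, then
    condition on the observed product quality z (0 = good item, 1 = defective)."""
    predicted = b @ P
    posterior = predicted * O[:, z]
    return posterior / posterior.sum()

b = np.array([1.0, 0.0])           # start certain the machine is good
for z in [0, 0, 1]:                # example observation sequence
    b = belief_update(b, z)
print("belief after observations:", b)
```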

Determination of Ship Collision Avoidance Path using Deep Deterministic Policy Gradient Algorithm (심층 결정론적 정책 경사법을 이용한 선박 충돌 회피 경로 결정)

  • Kim, Dong-Ham;Lee, Sung-Uk;Nam, Jong-Ho;Furukawa, Yoshitaka
    • Journal of the Society of Naval Architects of Korea / v.56 no.1 / pp.58-65 / 2019
  • The stability, reliability, and efficiency of a smart ship are important issues as interest in autonomous ships has recently been high. An automatic collision avoidance system is an essential function of an autonomous ship. This system detects the possibility of collision and automatically takes avoidance actions in consideration of economy and safety. In order to construct an automatic collision avoidance system using reinforcement learning, in this work the sequential decision problem of ship collision avoidance is mathematically formulated as a Markov Decision Process (MDP). A reinforcement learning environment is constructed based on the ship maneuvering equations, and then the three key components (state, action, and reward) of the MDP are defined. The state uses parameters of the relationship between own-ship and target-ship, the action is the vertical distance away from the target course, and the reward is defined as a function considering safety and economics. In order to solve the sequential decision problem, the Deep Deterministic Policy Gradient (DDPG) algorithm, which can express a continuous action space and search for an optimal action policy, is utilized. The collision avoidance system is then tested assuming a $90^{\circ}$ intersection encounter situation and yields a satisfactory result.
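As a hedged illustration of a reward that trades off safety against economy, as described above, a function of this kind might look like the sketch below; the exact functional form, the safety radius, and the weights are assumptions, not the paper's definition.

```python
def reward(distance_to_target_ship, cross_track_error,
           safety_radius=1852.0, w_safety=1.0, w_economy=0.1):
    """Illustrative reward: penalize closing inside the safety radius (safety term)
    and penalize deviation from the planned course (economy term)."""
    safety_penalty = max(0.0, safety_radius - distance_to_target_ship) / safety_radius
    economy_penalty = abs(cross_track_error) / safety_radius
    return -(w_safety * safety_penalty + w_economy * economy_penalty)
```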

Opportunistic Spectrum Access Based on a Constrained Multi-Armed Bandit Formulation

  • Ai, Jing;Abouzeid, Alhussein A.
    • Journal of Communications and Networks / v.11 no.2 / pp.134-147 / 2009
  • Tracking and exploiting instantaneous spectrum opportunities are fundamental challenges in opportunistic spectrum access (OSA) in the presence of the bursty traffic of primary users and the limited spectrum sensing capability of secondary users. In order to take advantage of the history of spectrum sensing and access decisions, a sequential decision framework is widely used to design optimal policies. However, many existing schemes, based on a partially observed Markov decision process (POMDP) framework, reveal that optimal policies are non-stationary in nature, which renders them difficult to calculate and implement. Therefore, this work pursues stationary OSA policies, which are efficient yet low-complexity, while still incorporating many practical factors, such as spectrum sensing errors and a priori unknown statistical spectrum knowledge. First, with an approximation on channel evolution, OSA is formulated in a multi-armed bandit (MAB) framework. As a result, the optimal policy is specified by the well-known Gittins index rule, where the channel with the largest Gittins index is always selected. Then, closed-form formulas are derived for the Gittins indices with tunable approximation, and the design of a reinforcement learning algorithm is presented for calculating the Gittins indices, depending on whether the Markovian channel parameters are available a priori or not. Finally, the superiority of the scheme is demonstrated via extensive experiments, compared to other existing schemes, in terms of the quality of policies and optimality.
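The index rule itself is simple to state in code: always access the channel with the largest index. In the sketch below, `gittins_index` is a crude posterior-mean placeholder, not the closed-form index derived in the paper.

```python
import numpy as np

def gittins_index(idle_obs, busy_obs):
    """Placeholder stand-in for the paper's closed-form index: here, simply the
    posterior mean availability of the channel under a uniform prior."""
    return (idle_obs + 1) / (idle_obs + busy_obs + 2)

def select_channel(stats):
    """Index rule: always sense/access the channel with the largest index."""
    indices = [gittins_index(idle, busy) for idle, busy in stats]
    return int(np.argmax(indices))

# stats[k] = (observed idle slots, observed busy slots) for channel k
stats = [(5, 3), (2, 1), (8, 10)]
print("access channel", select_channel(stats))
```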

Markov Model-based Static Obstacle Map Estimation for Perception of Automated Driving (자율주행 인지를 위한 마코브 모델 기반의 정지 장애물 추정 연구)

  • Yoon, Jeongsik;Yi, Kyongsu
    • Journal of Auto-vehicle Safety Association / v.11 no.2 / pp.29-34 / 2019
  • This paper presents a new method for construction of a static obstacle map. A static obstacle map is important since it is utilized for path planning and decision making. Several established approaches generate a static obstacle map with a grid method and a counting algorithm. However, these approaches are occasionally ineffective since the density of the LiDAR layers is low. Our approach solves this problem by applying probability theory. First, we convert each LiDAR point to a Gaussian distribution to account for the uncertainty of the LiDAR point. This Gaussian distribution represents the likelihood of an obstacle. Second, we model the dynamic transition of the static obstacle map by adopting a Hidden Markov Model. Because the next state of the map depends only on the current conditions and the dynamic characteristics of the vehicle, a more accurate map of the obstacles can be obtained using the Hidden Markov Model. Experimental data obtained from test driving demonstrate that our approach is suitable for mapping static obstacles. In addition, this result shows that our algorithm has an advantage in estimating not only static obstacles but also the dynamic characteristics of moving targets such as driving vehicles.
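A minimal per-cell sketch of the HMM filtering idea: each grid cell carries a belief over {free, static obstacle}, propagated through an assumed two-state transition model and updated with the Gaussian-smoothed LiDAR hit likelihood. All probabilities here are illustrative, not the paper's calibrated values.

```python
import numpy as np

# Hypothetical per-cell model: hidden state in {free, static obstacle};
# the measurement is the Gaussian-smoothed LiDAR hit probability for the cell.
P = np.array([[0.95, 0.05],    # transition: free -> free/occupied
              [0.02, 0.98]])   #             occupied -> free/occupied

def cell_update(belief, hit_likelihood):
    """One HMM filtering step for a single grid cell.
    belief = [P(free), P(occupied)]; hit_likelihood approximates P(measurement | occupied)."""
    predicted = belief @ P
    likelihood = np.array([1.0 - hit_likelihood, hit_likelihood])
    posterior = predicted * likelihood
    return posterior / posterior.sum()

belief = np.array([0.5, 0.5])
for hit in [0.9, 0.8, 0.85]:        # repeated strong returns in this cell
    belief = cell_update(belief, hit)
print("P(static obstacle) =", belief[1])
```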

Optimal Packet Scheduling for Energy Harvesting Sources on Time Varying Wireless Channels

  • Kashef, Mohamed;Ephremides, Anthony
    • Journal of Communications and Networks / v.14 no.2 / pp.121-129 / 2012
  • In this paper, we consider a source node that operates over a time-varying channel with energy harvesting capability. The goal of the source is to maximize the average number of successfully delivered packets per time slot. The source is able to choose whether to transmit a packet or defer the transmission in each time slot. The decision chosen by the source depends on the available channel information and the length of the energy queue. We formulate the problem of finding the optimal policy as a Markovian decision problem. We show some properties of the value function that represents the discounted number of successfully delivered packets per time slot. We prove that the optimal policy is a threshold-type policy depending on the state of the channel and the length of the energy queue. We also derive an upper bound on the average number of packets per time slot successfully received by the destination, and we show using numerical results that this is a tight bound on the performance of the optimal policy. We then consider the case of a time-varying channel without channel state information (CSI) and study the impact of the channel's time-varying nature and the availability of CSI. In this case, we show that the optimal policy is a greedy policy, and its performance is also calculated.
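The threshold structure of the optimal policy can be sketched directly; the threshold values below are illustrative placeholders, not the thresholds derived in the paper.

```python
THRESHOLDS = {"good": 1, "bad": 3}   # illustrative values, not the paper's derived thresholds

def transmit_decision(channel_state, energy_queue):
    """Threshold-type policy: transmit only if the stored energy reaches the threshold
    associated with the current channel state (more conservative on a bad channel)."""
    return energy_queue >= THRESHOLDS[channel_state]

# Example: with 2 energy units stored, transmit on a good channel but defer on a bad one.
print(transmit_decision("good", 2), transmit_decision("bad", 2))
```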

Optimal LNG Procurement Policy in a Spot Market Using Dynamic Programming (동적 계획법을 이용한 LNG 현물시장에서의 포트폴리오 구성방법)

  • Ryu, Jong-Hyun
    • Journal of Korean Institute of Industrial Engineers / v.41 no.3 / pp.259-266 / 2015
  • Among many energy resources, natural gas has recently received a remarkable amount of attention, particularly from the electrical generation industry. This is in part due to increasing shale gas production, which provides an environmentally friendly fossil fuel, and the high risk of nuclear power. Because South Korea, the world's second largest LNG importing nation after Japan, has no international natural gas pipelines and relies on imports in the form of LNG, natural gas has traditionally been procured through long-term LNG contracts at relatively high prices. Thus, there is a need to develop an Asian LNG trading hub, where LNG can be traded at more competitive spot prices. In a natural gas spot market, the amount of natural gas to be bought should be carefully determined considering the limited storage capacity and future price dynamics. In this work, the problem of finding the optimal amount of natural gas to buy in a spot market is formulated as a Markov decision process (MDP) in a risk-neutral environment, and an optimal base stock policy that depends on the stage and price is established. Taking into account price and demand uncertainties, the base stock target levels are approximated by dynamic programming. The simulation results show that the base stock policy can be an effective way to procure LNG in a spot market.
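A minimal sketch of how a stage- and price-dependent base stock policy is applied once the targets are known; the price buckets and target levels below are placeholders, since the actual targets come from the paper's dynamic program.

```python
def order_quantity(stage, spot_price, inventory, base_stock, capacity):
    """Base stock policy: in each stage, buy up to the (stage, price-bucket)-dependent
    target level, limited by storage capacity. `base_stock[stage][bucket]` would come
    from the dynamic program; the values used here are illustrative only."""
    bucket = "low" if spot_price < 10.0 else "high"   # illustrative price buckets ($/MMBtu)
    target = min(base_stock[stage][bucket], capacity)
    return max(0.0, target - inventory)

base_stock = {0: {"low": 80.0, "high": 40.0},
              1: {"low": 60.0, "high": 30.0}}
print(order_quantity(stage=0, spot_price=8.5, inventory=25.0,
                     base_stock=base_stock, capacity=100.0))
```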

Topological measures for algorithm complexity of Markov decision processes (마르코프 결정 프로세스의 위상적 계산 복잡도 척도)

  • Yi, Seung-Joon;Zhang, Byoung-Tak
    • Proceedings of the Korean Information Science Society Conference / 2007.06c / pp.319-323 / 2007
  • Many real-world problems can be expressed as a Markov decision problem (MDP). When the model is known, an MDP can be solved using value iteration; when the model is unknown, it can still be solved using reinforcement learning algorithms. However, these algorithms have high time complexity and are not easy to apply to large real-world problems, so temporal abstraction methods, such as hierarchically decomposing the MDP or executing several steps as a single unit, have been proposed. The problems with these temporal abstraction methods are that the performance of solving the MDP can vary greatly depending on the design of the abstraction, and that in many cases the user must provide this design directly. Recently, methods that automatically construct temporal abstractions without user intervention have been proposed, but these methods also fail to provide theoretical performance guarantees for their results. To address this problem, this work examines complexity measures that relate the structure of an MDP to its solution performance. To this end, we measure the topological properties of the state trajectory graph obtained from an MDP using various network measurements, and analyze the relationship between these properties and the MDP's solution performance, both experimentally and theoretically, in a variety of situations.
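A small sketch of the kind of measurement described above: build a state trajectory graph from sampled MDP trajectories and compute a few standard topological measures with networkx (the specific measures chosen here are illustrative, not necessarily those used in the paper).

```python
import networkx as nx

def trajectory_graph_measures(trajectories):
    """Build a state trajectory graph (nodes = visited states, edges = observed
    transitions) and compute a few topological network measures of the kind the
    paper relates to MDP solution performance."""
    G = nx.Graph()
    for traj in trajectories:
        for s, s_next in zip(traj, traj[1:]):
            G.add_edge(s, s_next)
    return {
        "nodes": G.number_of_nodes(),
        "edges": G.number_of_edges(),
        "average_clustering": nx.average_clustering(G),
        "average_shortest_path": nx.average_shortest_path_length(G),
    }

print(trajectory_graph_measures([[0, 1, 2, 3], [0, 2, 3, 1]]))
```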
