• Title/Summary/Keyword: TD-learning


Goal-Directed Reinforcement Learning System (목표지향적 강화학습 시스템)

  • Lee, Chang-Hoon
    • The Journal of the Institute of Internet, Broadcasting and Communication, v.10 no.5, pp.265-270, 2010
  • Reinforcement learning learns through trial-and-error interaction with a dynamic environment. In such environments, reinforcement learning methods such as TD-learning and TD(λ)-learning learn faster than conventional stochastic learning methods. However, because many of the proposed reinforcement learning algorithms deliver the reinforcement value only when the learning agent reaches its goal state, most of them converge to the optimal solution very slowly. In this paper, we present the GDRLS algorithm for finding the shortest path faster in a maze environment. GDRLS selects the candidate states that can guide the agent toward the shortest path in the maze and learns only over those candidate states. Experiments show that GDRLS finds the shortest path faster than TD-learning and TD(λ)-learning in maze environments.
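
The abstract contrasts GDRLS with plain TD-learning but does not spell out the algorithm itself, so what follows is only a minimal sketch of the tabular TD(0) baseline it builds on. The 1-D corridor "maze", rewards, and parameters are illustrative assumptions, not the paper's setup:

```python
import random

N_STATES = 10          # states 0..9; state 9 is the goal
ALPHA, GAMMA = 0.1, 0.95

V = [0.0] * N_STATES   # state-value estimates

def step(s, a):
    """Move left (-1) or right (+1); reward 1 only on reaching the goal."""
    s2 = max(0, min(N_STATES - 1, s + a))
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r, s2 == N_STATES - 1

for episode in range(500):
    s, done = 0, False
    while not done:
        a = random.choice([-1, 1])          # random exploration policy
        s2, r, done = step(s, a)
        # one-step TD update: bootstrap from the successor state's value
        V[s] += ALPHA * (r + GAMMA * (0.0 if done else V[s2]) - V[s])
        s = s2

print([round(v, 2) for v in V])
```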

Random Balance between Monte Carlo and Temporal Difference in off-policy Reinforcement Learning for Less Sample-Complexity (오프 폴리시 강화학습에서 몬테 칼로와 시간차 학습의 균형을 사용한 적은 샘플 복잡도)

  • Kim, Chayoung;Park, Seohee;Lee, Woosik
    • Journal of Internet Computing and Services, v.21 no.5, pp.1-7, 2020
  • Deep neural networks (DNNs) used as function approximators in reinforcement learning (RL) can, in theory, deliver realistic results. In empirical benchmarks, temporal-difference (TD) learning usually outperforms Monte Carlo (MC) learning. However, some previous works show that MC is better than TD when rewards are very sparse or delayed, and another recent study shows that when the agent's observation of the environment is partial in complex control tasks, MC prediction is superior to TD-based methods. Most of these environments can be regarded as 5-step or 20-step Q-learning, where the experiment proceeds without long roll-outs to alleviate performance degradation. In other words, for noisy settings, regardless of controlled roll-outs, MC, which is robust to noisy rewards, learns as well as or better than TD. These studies break with the view that TD is always better than MC, and recent results suggest that combining MC and TD beats either alone. Therefore, in this study, building on those results, we exploit a random balance mixing TD and MC in RL, without the complicated reward formulas used in previous studies. Comparing a DQN that uses a random TD/MC mixture with the well-known DQN that uses only TD-based learning, we demonstrate through experiments in OpenAI Gym that even well-performing TD learning benefits from the mixture of TD and MC.
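
The abstract does not give the exact mixing rule, so the following is only a hedged sketch of the idea: for each transition of a finished episode, use the full Monte Carlo return as the target with some probability, and a one-step TD (Q-learning) target otherwise. The tabular setting and the 50/50 mixing probability are assumptions; the paper applies the idea inside a DQN.

```python
import random

GAMMA, ALPHA, P_MC = 0.99, 0.1, 0.5   # P_MC: assumed mixing probability
Q = {}                                 # tabular stand-in for the DQN

def q(s, a):
    return Q.get((s, a), 0.0)

def update_episode(transitions, actions):
    """transitions: list of (s, a, r, s2, done) from one finished episode."""
    # Monte Carlo returns, computed backwards from the episode's end
    G, returns = 0.0, [0.0] * len(transitions)
    for i in reversed(range(len(transitions))):
        G = transitions[i][2] + GAMMA * G
        returns[i] = G
    for (s, a, r, s2, done), mc in zip(transitions, returns):
        if random.random() < P_MC:
            target = mc                                 # Monte Carlo target
        else:                                           # one-step TD target
            boot = 0.0 if done else max(q(s2, b) for b in actions)
            target = r + GAMMA * boot
        Q[(s, a)] = q(s, a) + ALPHA * (target - q(s, a))

# toy usage: a two-step episode ending with reward 1
update_episode([(0, 'a', 0.0, 1, False), (1, 'b', 1.0, 2, True)], ['a', 'b'])
print(Q)
```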

Reinforcement Learning using Propagation of Goal-State-Value (목표상태 값 전파를 이용한 강화 학습)

  • Kim, Byeong-Cheon;Yun, Byeong-Ju
    • The Transactions of the Korea Information Processing Society, v.6 no.5, pp.1303-1311, 1999
  • To learn in dynamic environments, reinforcement learning algorithms such as Q-learning, TD(0)-learning, and TD(λ)-learning have been proposed. However, most of them learn very slowly because the reinforcement value is given only when the agent reaches its goal state. In this thesis, we propose a reinforcement learning method that approaches the goal state quickly in maze environments. The proposed method separates learning into a global phase and a local phase. Global learning uses the replacing-eligibility-trace method to search for the goal state. Local learning propagates the goal-state value found by global learning to neighboring states and then searches for the goal state among those neighbors. Experiments show that the proposed method finds an optimal solution faster than other reinforcement learning methods such as Q-learning, TD(0)-learning, and TD(λ)-learning.
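
The global-learning phase above relies on replacing eligibility traces. Below is a minimal sketch of TD(λ) with replacing traces, on an assumed 1-D corridor rather than the paper's maze; all parameters are illustrative:

```python
import random

N_STATES, GOAL = 10, 9
ALPHA, GAMMA, LAMB = 0.1, 0.95, 0.8
V = [0.0] * N_STATES

for episode in range(300):
    e = [0.0] * N_STATES          # eligibility traces, reset per episode
    s, done = 0, False
    while not done:
        s2 = max(0, min(N_STATES - 1, s + random.choice([-1, 1])))
        r = 1.0 if s2 == GOAL else 0.0
        done = s2 == GOAL
        delta = r + GAMMA * (0.0 if done else V[s2]) - V[s]
        e[s] = 1.0                # replacing trace: set to 1, don't accumulate
        for i in range(N_STATES):
            V[i] += ALPHA * delta * e[i]   # credit every eligible state
            e[i] *= GAMMA * LAMB           # decay all traces each step
        s = s2

print([round(v, 2) for v in V])
```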


Max-Mean N-step Temporal-Difference Learning Using Multi-Step Return (멀티-스텝 누적 보상을 활용한 Max-Mean N-Step 시간차 학습)

  • Hwang, Gyu-Young;Kim, Ju-Bong;Heo, Joo-Seong;Han, Youn-Hee
    • KIPS Transactions on Computer and Communication Systems, v.10 no.5, pp.155-162, 2021
  • n-step TD learning combines the Monte Carlo method with one-step TD learning. With an appropriate n, n-step TD learning is known to outperform both the Monte Carlo method and one-step TD learning, but choosing the best value of n is difficult. To address this difficulty, in this paper we exploit two observations: overestimating Q can improve early learning, and all n-step returns take similar values once Q ≈ Q*. We therefore propose a new learning target composed of the maximum and the mean of all k-step returns for 1 ≤ k ≤ n. Finally, in OpenAI Gym's Atari game environment, we compare the proposed algorithm with n-step TD learning and show that the proposed algorithm is superior.
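
The target described here, built from the maximum and the mean of all k-step returns for 1 ≤ k ≤ n, can be written down directly. A sketch follows; the 50/50 weighting between max and mean (beta) is an assumption, since the abstract does not specify how the two are combined:

```python
GAMMA = 0.99

def k_step_returns(rewards, boot_values, n):
    """rewards[t]: reward after step t; boot_values[k]: bootstrap estimate
    max_a Q(s_{t+k}, a) for the state reached after k steps."""
    targets = []
    for k in range(1, n + 1):
        g = sum(GAMMA ** i * rewards[i] for i in range(k))  # discounted sum
        g += GAMMA ** k * boot_values[k]                    # bootstrap tail
        targets.append(g)
    return targets

def max_mean_target(rewards, boot_values, n, beta=0.5):
    ts = k_step_returns(rewards, boot_values, n)
    return beta * max(ts) + (1 - beta) * (sum(ts) / len(ts))

# Example: a 3-step window with assumed rewards and bootstrap values.
print(max_mean_target([0.0, 0.0, 1.0], {1: 0.4, 2: 0.7, 3: 0.0}, 3))
```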

A Reinforcement Learning Method using TD-Error in Ant Colony System (개미 집단 시스템에서 TD-오류를 이용한 강화학습 기법)

  • Lee, Seung-Gwan;Chung, Tae-Choong
    • The KIPS Transactions: Part B, v.11B no.1, pp.77-82, 2004
  • In reinforcement learning, an agent receives a reward for the action it selects when it makes a state transition from the current state; assigning this reward to earlier actions is the temporal credit-assignment problem, an important subject in reinforcement learning. In this paper, we examine Ant-Q learning, a new meta-heuristic proposed to solve hard combinatorial optimization problems such as the Traveling Salesman Problem (TSP); it is a population-based approach that uses positive feedback as well as greedy search. We then propose an Ant-TD reinforcement learning method that applies a diversification strategy to its state transitions and uses the TD-error. Experiments show that the proposed method finds an optimal solution faster than other reinforcement learning methods such as ACS and Ant-Q learning.
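
The abstract gives only the outline of Ant-Q/Ant-TD, so the following is a rough, assumption-laden sketch of an Ant-Q-style edge-value update for TSP: the value of the edge just taken is moved toward a discounted bootstrap from the next city (a TD-style error), and finished tours add delayed reinforcement. The city coordinates, exploration rate, and the simplified global update (rewarding every tour rather than only the best) are all illustrative choices:

```python
import math, random

random.seed(0)
cities = [(random.random(), random.random()) for _ in range(8)]
n = len(cities)
dist = lambda a, b: math.dist(cities[a], cities[b])
AQ = [[1.0] * n for _ in range(n)]   # AQ(r, s): learned value of edge r -> s
ALPHA, GAMMA, W, EPS = 0.1, 0.3, 1.0, 0.1

def run_tour():
    cur = start = random.randrange(n)
    unvisited = set(range(n)) - {cur}
    edges, length = [], 0.0
    while unvisited:
        if random.random() < EPS:                      # diversification
            nxt = random.choice(sorted(unvisited))
        else:                                          # greedy on value/cost
            nxt = max(unvisited, key=lambda j: AQ[cur][j] / dist(cur, j))
        best_next = max(AQ[nxt][j] for j in range(n) if j != nxt)
        # local TD-style update: bootstrap from the best edge out of nxt
        AQ[cur][nxt] += ALPHA * (GAMMA * best_next - AQ[cur][nxt])
        edges.append((cur, nxt)); length += dist(cur, nxt)
        cur = nxt; unvisited.discard(nxt)
    edges.append((cur, start)); length += dist(cur, start)
    # delayed reinforcement: shorter tours push their edges' values up
    for r, s in edges:
        AQ[r][s] += ALPHA * (W / length)
    return length

print(min(run_tour() for _ in range(300)))
```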

Online Reinforcement Learning to Search the Shortest Path in Maze Environments (미로 환경에서 최단 경로 탐색을 위한 실시간 강화 학습)

  • Kim, Byeong-Cheon;Kim, Sam-Geun;Yun, Byeong-Ju
    • The KIPS Transactions: Part B, v.9B no.2, pp.155-162, 2002
  • Reinforcement learning is a method that learns through trial-and-error interaction with dynamic environments. It is classified into online reinforcement learning and delayed reinforcement learning. In this paper, we propose an online reinforcement learning system (ONRELS: ONline REinforcement Learning System). At the current state, ONRELS updates the estimated value of every selectable (state, action) pair before making a state transition. After compressing the state space of the maze environment, ONRELS learns by trial-and-error interaction with the compressed environment. Experiments show that ONRELS finds the shortest path faster than Q-learning using the TD-error and Q(λ)-learning using TD(λ) in maze environments.
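
ONRELS's distinctive step, per the abstract, is refreshing the estimates of all selectable (state, action) pairs at the current state before transitioning. Here is a minimal sketch under the assumption that the agent can evaluate each candidate transition before moving; a 1-D corridor stands in for the compressed maze, and all parameters are illustrative:

```python
import random

N, GOAL = 10, 9
ALPHA, GAMMA, EPS = 0.2, 0.95, 0.1
Q = {(s, a): 0.0 for s in range(N) for a in (-1, 1)}

def lookahead(s, a):
    """Evaluate a candidate move without committing to it."""
    s2 = max(0, min(N - 1, s + a))
    return s2, (1.0 if s2 == GOAL else 0.0)

for episode in range(300):
    s = 0
    while s != GOAL:
        # update every selectable (state, action) pair before transitioning
        for a in (-1, 1):
            s2, r = lookahead(s, a)
            best = 0.0 if s2 == GOAL else max(Q[(s2, b)] for b in (-1, 1))
            Q[(s, a)] += ALPHA * (r + GAMMA * best - Q[(s, a)])
        # then move epsilon-greedily, using the refreshed estimates
        if random.random() < EPS:
            a = random.choice((-1, 1))
        else:
            best = max(Q[(s, b)] for b in (-1, 1))
            a = random.choice([b for b in (-1, 1) if Q[(s, b)] == best])
        s, _ = lookahead(s, a)

print(max(Q.values()))
```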

Human Adaptive Device Development based on TD method for Smart Home

  • Park, Chang-Hyun;Sim, Kwee-Bo
    • Institute of Control, Robotics and Systems Conference Proceedings (제어로봇시스템학회 학술대회논문집), 2005.06a, pp.1072-1075, 2005
  • This paper applies the TD method to human-adaptive devices for a smart home with context-awareness (or recognition) techniques. For a smart home, the key problem is how the appliances (or devices) can adapt to the user. Because several people may manage the home appliances, managing them automatically is difficult, and satisfying users with automatically managed devices is harder still. Several methods could be used for this, such as a fuzzy controller, a neural network, or reinforcement learning; in this dynamic environment, reinforcement learning is the appropriate choice. Among reinforcement learning methods, we select temporal-difference (TD) learning as the core algorithm for adapting the devices to the user. Since this paper assumes a smart-home environment, we explain context awareness only briefly. We also treat the TD method briefly, implement an example in VC++, and then discuss how the devices can be applied to this problem.


Applying Neuro-fuzzy Reasoning to Go Opening Games (뉴로-퍼지 추론을 적용한 포석 바둑)

  • Lee, Byung-Doo
    • Journal of Korea Game Society, v.9 no.6, pp.117-125, 2009
  • This paper describes the result of applying neuro-fuzzy reasoning, which applies Go-term knowledge grounded in pattern knowledge, to the opening game of Go. We discuss the implementation of neuro-fuzzy reasoning for deciding the best next move throughout the opening game. We also let neuro-fuzzy reasoning play against TD(λ) learning to test its performance. The experimental results reveal that even this simple neuro-fuzzy reasoning model can compete with TD(λ) learning, showing great potential for application to the real game of Go.


A Localized Adaptive QoS Routing Using the TD(λ) Method (TD(λ) 기법을 사용한 지역적이며 적응적인 QoS 라우팅 기법)

  • Han Jeong-Soo
    • The Journal of Korean Institute of Communications and Information Sciences, v.30 no.5B, pp.304-309, 2005
  • In this paper, we propose a localized adaptive QoS routing scheme using the TD(λ) method and evaluate the performance of various exploration methods for path selection. In particular, extensive simulation shows that the proposed routing algorithm with an exploration-bonus method significantly reduces the overall blocking probability compared with other path-selection (exploration) methods, because the proposed exploration method adapts better to the network environment during path selection.
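
The abstract names an exploration bonus but not its form. One common shape for such a bonus (as in Dyna-Q+) adds to each path's learned estimate a term that grows with the time since the path was last tried; the sketch below uses that form and its constant purely as assumptions, not as the paper's rule:

```python
import math

KAPPA = 0.05   # assumed bonus weight

def select_path(estimates, last_tried, now):
    """estimates[p]: learned value of path p; last_tried[p]: last time step
    path p was used; now: current time step."""
    def score(p):
        # learned value plus a bonus favoring long-untried paths
        return estimates[p] + KAPPA * math.sqrt(now - last_tried[p])
    return max(estimates, key=score)

# Example: path 'b' has a slightly lower estimate but has gone untried
# for a long time, so the bonus tips selection toward it.
print(select_path({'a': 0.90, 'b': 0.85}, {'a': 99, 'b': 0}, 100))
```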

Dynamic power and bandwidth allocation for DVB-based LEO satellite systems

  • Satya Chan;Gyuseong Jo;Sooyoung Kim;Daesub Oh;Bon-Jun Ku
    • ETRI Journal, v.44 no.6, pp.955-965, 2022
  • A low Earth orbit (LEO) satellite constellation can provide network coverage for the entire globe. This study considers multi-beam frequency reuse in LEO satellite systems, where the channel is time-varying due to the fast movement of the satellites. We propose an efficient power and bandwidth allocation method that employs two linear machine-learning algorithms and takes channel conditions and traffic demand (TD) as input. With the aid of a simple linear system, the proposed scheme allows optimal allocation of resources under dynamic channel and TD conditions. Additionally, efficient projection schemes are added so that the provided capacity best approximates the TD when the TD exceeds the maximum allowable system capacity. Simulation results show that the proposed method outperforms existing methods.
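
The paper's projection schemes are not detailed in the abstract; a simple proportional projection conveys the idea of keeping the provided capacity feasible when total TD exceeds the system maximum. This rule is an assumption for illustration, not the paper's scheme:

```python
def project_allocation(demands, max_capacity):
    """Scale per-beam allocations so their total never exceeds capacity."""
    total = sum(demands)
    if total <= max_capacity:
        return list(demands)              # demand is feasible as-is
    scale = max_capacity / total          # proportional scaling onto the
    return [d * scale for d in demands]   # capacity constraint

# Example: 120 units of demand projected onto a 100-unit system capacity.
print(project_allocation([30.0, 50.0, 40.0], 100.0))
```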