Reinforcement learning (RL) is a technique for learning to act in an unknown environment. During learning, actions are selected in two main modes: exploration, which investigates untried actions, and exploitation, which selects the current best actions. Balancing exploration and exploitation is a central challenge in RL. In this work, we design an exploration algorithm for RL that introduces two parameters for this balancing purpose: the action-value function convergence error and the exploration time threshold. The first parameter evaluates actions and selects the best ones based on the convergent values of their action-value functions. The second forces the agent to exploit the current best policy when it is unable to explore the available actions within the threshold time. We show that this algorithm outperforms the well-known epsilon-greedy algorithm, and we then study the effects of the introduced parameters on performance.
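The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names (`select_action`, `conv_err`, `explore_time`) and the specific convergence test (treating an action's Q-value as converged when its latest update changed it by less than `conv_err`) are assumptions made for the sketch.

```python
import random

def select_action(q, last_q, counts, t, conv_err=0.01, explore_time=1000):
    """Sketch of an exploration rule with two balancing parameters.

    conv_err: an action's Q-value is treated as converged (hypothetical
        test) when its last update changed it by less than this error.
    explore_time: after this many steps, the agent stops exploring and
        exploits the current best policy.
    """
    n = len(q)
    if t < explore_time:
        # Explore any action that is untried or whose Q-value
        # has not yet converged.
        unconverged = [a for a in range(n)
                       if counts[a] == 0 or abs(q[a] - last_q[a]) >= conv_err]
        if unconverged:
            return random.choice(unconverged)
    # Otherwise exploit: pick the action with the highest Q-value.
    return max(range(n), key=q.__getitem__)
```

Unlike epsilon-greedy, which explores uniformly at random with a fixed probability, this rule stops exploring an action once its value estimate has stabilized, and stops exploring altogether once the time threshold is reached.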
Available at: http://works.bepress.com/zhengdao_wang/11/