When I was a young girl, I loved playing checkers with my brother. There was something about the thrill of getting to the end of the board and getting “crowned”. I liked pushing that stack of checkers around the board, knowing that I could move both backwards and forwards. Towards the end of a game, we would deliberate on each move for what seemed like an eternity. I guess when you are considering upwards of 500 quintillion combinations of moves during a game, then an eternity accurately describes the experience.
Reinforcement learning (RL) agents have famously learned to play games such as checkers, backgammon, and Othello, and more recently to land a lunar lander in OpenAI Gym. In these games, the long-term reward is winning, but many different sequences of moves can lead there. Furthermore, nothing is hidden, so complete information is available to all players. Lastly, the problem space is well defined, so anyone can learn to play by following the rules. Expertise develops through the experience of playing the game repeatedly. Games with win-lose-draw outcomes are not too dissimilar from real-world problems in robotics, process control, health care, trading, finance and much more.
The RL technique featured for scoring a model in the video below is the Deep-Q Network (DQN), which attempts to model, in real time, the actions that perform best in each state. Think of this as a player trying to determine which move in a game will lead to a win. A user-defined neural network outputs a value for each possible action that assesses that action’s quality. These values are often identified with a function Q, so the family of algorithms that rely on them has become collectively known as Q-learning. Using the output Q-values, an agent can follow a policy of choosing the highest-quality action at each time step. In this example, Deep-Q Learning is performed by applying a DQN.
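To make the idea of choosing the highest-quality action concrete, here is a minimal Python sketch. It is not the SAS Viya API or the code from the video: the function name `greedy_action` and the Q-values are made up for illustration, standing in for whatever a trained network would output for one state.

```python
# Hypothetical illustration of acting greedily on Q-values.
# The numbers below are invented; a real DQN would compute them from the state.

ACTIONS = ["Left", "Right"]


def greedy_action(q_values):
    """Return the index of the action with the highest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Pretend the network scored the two actions like this for the current state.
q_values = [0.42, 0.57]

best = greedy_action(q_values)
print(ACTIONS[best])  # prints "Right"
```

Repeating this choice at every time step is what turns a table or network of Q-values into a policy.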
The task the agent is trying to learn is known as CartPole-v0. This environment simulates a cart that moves along a track while balancing a pole, and the objective is to keep the pole upright. The rewards, states, and actions are as follows:
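- Reward: +1 for each time step until termination, which occurs when any of the following happens:
  - The cart reaches the end of the track
  - The pole angle becomes too great (more than 12 degrees from vertical)
  - The episode reaches 200 time steps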
- State Variables: Cart position on the track, cart velocity, pole angle, pole velocity (at the tip)
- Actions: Left, Right
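The reward and termination rules above can be sketched in a few lines of Python. This is an illustrative sketch only, not the environment's actual source: the helper names are hypothetical, and the track limit of 2.4 units from the center is an assumption based on the standard OpenAI Gym version of CartPole-v0, since the list above only says "end of the track".

```python
# Hypothetical sketch of the CartPole-v0 reward and termination rules.

TRACK_LIMIT = 2.4          # assumed distance from center to the end of the track
MAX_POLE_ANGLE_DEG = 12.0  # pole angle limit from the list above
MAX_STEPS = 200            # episode length limit from the list above


def is_done(cart_position, pole_angle_deg, steps_taken):
    """Return True when any of the three termination conditions is met."""
    return (
        abs(cart_position) > TRACK_LIMIT
        or abs(pole_angle_deg) > MAX_POLE_ANGLE_DEG
        or steps_taken >= MAX_STEPS
    )


def episode_return(trajectory):
    """Sum the +1 rewards over a list of (cart_position, pole_angle_deg) pairs.

    Each pair is the state observed after one time step. Counting stops
    at the first step that triggers termination.
    """
    total = 0
    for step, (position, angle) in enumerate(trajectory, start=1):
        total += 1  # +1 for every time step taken
        if is_done(position, angle, step):
            break
    return total


# The pole tips past 12 degrees on the third step, so the return is 3.
print(episode_return([(0.0, 2.0), (0.1, 7.5), (0.2, 13.0), (0.3, 20.0)]))
```

Because every step earns the same +1, the total reward for an episode is simply how long the agent kept the pole up, capped at 200.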
The video will walk you through the simple steps needed to create an RL model for Deep-Q Learning in SAS Viya, using a Jupyter Notebook. Do you have a problem with a complex sequence of decisions for which you want to maximize the outcome? Then check out reinforcement learning (RL) in SAS Viya. Not just for fun and games anymore, RL can be used to solve a variety of real-world problems.