Machine Learning: Reinforcement Learning
When Google DeepMind unveiled “AlphaGo — The Movie”, it wasn’t just a film; it was a revelation in the artificial intelligence (AI) world and a testament to the capabilities of reinforcement learning. Humanity’s greatest Go players were defeated by algorithms able to learn, adapt, and evolve, and this dynamic style of learning kickstarted the popularity of reinforcement learning. This article provides an introduction to the concepts of Reinforcement Learning and its processes.
Introduction
Unlike traditional machine learning approaches, reinforcement learning agents learn by interacting with their environment, taking actions, and receiving feedback in the form of rewards or penalties. A key difference is this dynamic learning approach: the agent chooses an action from millions of possibilities and learns through the mistakes it makes.
In supervised learning, we are given data as inputs and outputs, and the goal is to learn a function that maps the inputs to the outputs. In unsupervised learning, we only have access to unlabelled data; rather than predicting labels, we try to find some sort of underlying structure in the data.
Reinforcement learning, on the other hand, is only given data in the form of “state-action pairs”, and its objective is to maximize future rewards over many time steps.
Terminologies
Let us define some key terminologies:
· Agent: A being that can take actions.
· Environment: The world the agent exists in
· Actions, A: A move the agent can make in the environment
· Action space: The set of possible actions an agent can make in the environment. It can be discrete or continuous.
· Observations: What the agent sees based on the action it just took
· State, S: A situation which the agent perceives
· Reward, R: Feedback that measures the success or failure of the agent’s action
· Policy, π: The policy defines the strategy that the agent uses to select actions in different states.
Together, these terminologies form the interaction cycle of Reinforcement Learning: the agent observes the current state, takes an action according to its policy, and the environment returns a reward and a new observation.
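To make this cycle concrete, here is a minimal sketch of the interaction loop in Python. The `env` and `agent` objects, with Gym-style `reset`/`step` and `act`/`learn` methods, are hypothetical placeholders rather than anything defined in this article:

```python
# Minimal agent-environment interaction loop (env and agent are hypothetical interfaces).
def run_episode(env, agent):
    state = env.reset()                                   # initial observation from the environment
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(state)                         # the policy maps a state to an action
        next_state, reward, done = env.step(action)       # the environment returns feedback
        agent.learn(state, action, reward, next_state, done)  # update from the experience
        state = next_state
        total_reward += reward                            # cumulative reward over the episode
    return total_reward
```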
Indeed, Reinforcement Learning is based on the Markov decision process, a mathematical model of decision-making that uses discrete time steps. At every step, the agent takes a new action that results in a new environment state. Likewise, the current state is the result of the sequence of previous actions. More information can be found in my article on All You Need to Know About Markov Chains.
Through trial and error in moving through the environment, the agent builds a set of if-then rules, or policies. The policies help it decide which action to take next for optimal cumulative reward. The agent must also choose between exploring the environment further to learn new state-action rewards and exploiting known high-reward actions from a given state. This is called the exploration-exploitation trade-off (see Exploration-Exploitation Trade-Off).
Note that not all actions result in immediate rewards: rewards can be collected and accumulated over time, and some may be delayed until well after the action is taken (see Reward System).
Exploration-Exploitation Trade-Off
Exploration involves trying out different actions or strategies to gather information about the environment and potentially discover better outcomes. Exploitation, on the other hand, entails leveraging known knowledge or strategies to exploit actions that have previously yielded high rewards. Too much exploration may lead to inefficient use of resources and missed opportunities for immediate rewards, while excessive exploitation may result in premature convergence to suboptimal solutions and limited learning. Striking the right balance between exploration and exploitation is crucial for agents to effectively learn and optimize their behavior over time. By dynamically adjusting the exploration-exploitation strategy based on environmental feedback and learning progress, agents can adapt their decision-making strategies to maximize cumulative rewards and achieve robust performance across different scenarios.
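One common way to manage this trade-off is an ε-greedy strategy: explore with probability ε and exploit otherwise, decaying ε as learning progresses. The sketch below assumes a hypothetical list of value estimates `q_values` as a stand-in for whatever knowledge the agent keeps:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon (explore), else the best-known action (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                        # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])       # exploitation

# Decay epsilon over time so the agent explores early and exploits later.
epsilon, epsilon_min, decay = 1.0, 0.05, 0.995
for episode in range(1000):
    # ... run an episode, choosing actions with epsilon_greedy(...) ...
    epsilon = max(epsilon_min, epsilon * decay)
```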
Reward System
The total reward is the sum of all rewards from that point onwards. It is important to discount future rewards so that a reward received later is worth less than one received now; think of it as the time value of money in Corporate Finance. For instance, we would rather receive $5 now than $5 in the future, because the same $5 today can be invested in a bank and earn interest, ending up worth more than $5 received later. The same logic applies to reinforcement learning: a reward now has greater value than a reward in the future. The discount factor is denoted Gamma, γ.
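In standard notation (writing r_t for the reward received at time step t), the discounted total return from time t is:

$$ R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k}, \qquad 0 \le \gamma < 1 $$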
The goal of Reinforcement Learning is to maximize the total cumulative reward by optimizing the actions taken. This is captured by the Q-Function, where the total reward is the discounted sum of all rewards obtained from time t. The Q-Function gives the expected total future reward an agent in state s can receive by executing a certain action a. It can be defined recursively using the Bellman Equation, which expresses the relationship between the value of a state-action pair and the value of its successor state-action pairs. Its formula is defined below.
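In its standard form, consistent with the definitions above, the Q-Function and its Bellman recursion can be written as:

$$ Q(s_t, a_t) = \mathbb{E}\!\left[\, R_t \mid s_t, a_t \,\right] = \mathbb{E}\!\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t, a_t \right] $$

$$ Q(s, a) = \mathbb{E}\!\left[\, r + \gamma \max_{a'} Q(s', a') \,\right] $$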
In layman’s terms, the Q-Function tells us, for any possible action, what total future reward we can expect for taking that particular action. Ultimately, the agent needs a policy π(s) to infer the best action to take in its state s from its state-action values. The strategy is for the policy to choose the action that maximizes future reward.
Model Free versus Model Based Approaches
Model-based RL is typically used when environments are well-defined and unchanging and where real-world environment testing is difficult. The agent first builds an internal representation (model) of the environment where it first takes actions within the environment and notes the new state and reward value. Afterwards, it associates the action-state transition with the reward value. Once the model is complete, the agent simulates action sequences based on the probability of optimal cumulative rewards. It then further assigns values to the action sequences themselves. The agent thus develops different strategies within the environment to achieve the desired end goal.
On the other side of the spectrum, Model-free RL is best to use when the environment is large, complex, and not easily describable. It’s also ideal when the environment is unknown and changing, and environment-based testing does not come with significant downsides. The agent doesn’t build an internal model of the environment and its dynamics. Instead, it uses a trial-and-error approach within the environment. It scores and notes state-action pairs — and sequences of state-action pairs — to develop a policy.
Classifying Deep Reinforcement Learning Algorithms
· Value Learning: Tries to estimate and optimize the Q-Function. It finds the best possible action to take, given the state we are in.
· Policy Learning: Instead of first optimizing the Q-Function and using the resulting Q-Values to choose actions, as in Value Learning, we directly find the best policy for the state we are in. Essentially, the model learns the policy function directly from the data, effectively skipping one step.
Value Learning {Q-Learning and DQN Algorithm}
Value-based methods aim to estimate the value function, typically the state-value function, V(S) or the action-value function Q(S, A), which represent the expected return.
When trying to model Q-Functions, we may encounter a very large action space, which would require a large number of iterations and lead to computational inefficiency. Instead, the Q-Learning algorithm {Model-Free} evaluates all the different Q-Values at once {the Q-Values for all the different possible actions}: a single iteration produces the Q-Value of every possible action, and the model picks the action with the highest Q-Value.
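As an illustration, here is a minimal tabular Q-Learning sketch; the state/action sizes, learning rate α, discount γ, and exploration rate ε are hypothetical. A single lookup of `Q[state]` yields the Q-Values of every action at once, and the agent acts greedily over them:

```python
import numpy as np

# Hypothetical sizes and hyperparameters for illustration.
n_states, n_actions = 16, 4
alpha, gamma, epsilon = 0.1, 0.99, 0.1

Q = np.zeros((n_states, n_actions))   # one row per state: the Q-Values of every action at once

def select_action(state):
    # Epsilon-greedy: a single lookup gives the Q-Values of all actions in this state.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def q_learning_update(state, action, reward, next_state, done):
    # Bellman target: immediate reward plus the discounted best future value.
    target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
    Q[state, action] += alpha * (target - Q[state, action])
```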
To train the model, we construct a loss function that compares our prediction against a target representing the true expected return: the immediate reward plus the discounted expected return at the next state, obtained by recursively applying the same equation with the discount factor. Taking the expectation of the squared difference between this target (the ground truth of the expected return) and the predicted return gives the Q-Loss function, which measures the discrepancy between the predicted Q-Values and the target Q-Values, with the aim of minimizing this discrepancy through parameter updates. The most commonly used Q-Loss function is the mean squared error (MSE) loss.
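Written out, the target and the MSE-style Q-Loss take the standard form (θ denotes the network parameters):

$$ y = r + \gamma \max_{a'} Q(s', a'; \theta), \qquad \mathcal{L}(\theta) = \mathbb{E}\big[ \left( y - Q(s, a; \theta) \right)^2 \big] $$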
Another value-based algorithm is the Deep Q-Network (DQN). DQN employs experience replay, which stores past experiences (state, action, reward, next state) in a replay buffer to break correlations between consecutive updates and improve sample efficiency. To stabilize training, DQN uses a target network, a separate neural network with the same architecture as the main network but with frozen parameters. The target network is periodically updated to match the parameters of the main network. Training is driven by the loss function, which is again the MSE between predicted Q-Values and target Q-Values.
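A minimal PyTorch sketch of one DQN update step follows; the network sizes and hyperparameters are hypothetical, and the batch is assumed to have already been sampled from the replay buffer. The target network is the frozen copy described above:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the real values depend on the environment.
state_dim, n_actions, gamma = 8, 4, 0.99

q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())        # frozen copy, refreshed periodically

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(states, actions, rewards, next_states, dones):
    """One update on a replay batch: states (B, state_dim), actions (B,) long, rewards/dones (B,) float."""
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a) for the taken actions
    with torch.no_grad():                                               # target network is not trained directly
        q_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1 - dones) * q_next                 # Bellman target
    loss = nn.functional.mse_loss(q_pred, target)                       # MSE between predicted and target Q-Values
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```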
With our predicted Q-Values, we can then define a policy function that tells us, conditioned on a state, the best action to take. We obtain this policy by maximizing over the Q-Function, and then send the chosen action back to the environment to receive the next state. The training procedure is somewhat similar to supervised learning with gradient descent.
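In other words, the greedy policy induced by the Q-Function is:

$$ \pi^*(s) = \arg\max_{a} Q(s, a) $$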
Downsides:
· Complexity: Works well when the action space is discrete and small, but has difficulty handling continuous action spaces.
· Flexibility: The policy is deterministically computed from the Q-Function by maximizing the reward, so we cannot learn stochastic policies.
Policy Learning
To address these downsides, consider a new class of Reinforcement Learning training algorithms known as Policy-Gradient methods. Policy-Gradient methods directly optimize the policy parameters to increase the expected return: they estimate the gradient of the expected return with respect to the policy parameters and use it to update the parameters in the direction that improves performance. Instead of focusing on the numerical values of actions, Policy-Gradient methods model the likelihood that selecting a particular action will yield the best expected return. These likelihoods form a distribution whose action probabilities sum to 1, allowing for more nuanced action selection. Once the policy is learned, we can simply sample actions from its distribution.
The Policy-Gradient loss is a common objective function in which the negative log-likelihood of the selected action is multiplied by the received reward. The negative log-likelihood encourages the policy network to increase the probability of selecting actions that lead to higher rewards, while decreasing the probability of selecting actions that lead to lower rewards. The loss function is then minimized to improve the decision-making policy through techniques such as Gradient Descent.
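A minimal sketch of this loss in PyTorch, assuming a small hypothetical policy network and per-step discounted returns that have already been computed for one trajectory:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration.
state_dim, n_actions = 8, 4

policy_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                           nn.Linear(64, n_actions), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def policy_gradient_update(states, actions, returns):
    """states: (T, state_dim), actions: (T,) long, returns: (T,) discounted return from each step."""
    probs = policy_net(states)                                    # action distribution; each row sums to 1
    log_probs = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1))
    loss = -(log_probs * returns).mean()                          # negative log-likelihood weighted by reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```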
A big advantage of using Policy-Gradient probabilities instead of numerical values is that the approach applies to both discrete and continuous action spaces. The mode of the distribution is the best possible action to take, and if we want to explore, we can sample from the distribution instead. Rather than predicting a probability for every possible action, we can parameterize the action space, just as a Gaussian distribution can be modelled by its mean and variance.
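For continuous actions, the same idea with a Gaussian parameterization might look like this sketch (the dimensions are hypothetical): the network outputs a mean, a learned log standard deviation sets the spread, and actions are sampled from the resulting Normal distribution.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2   # hypothetical continuous-control setting

class GaussianPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.mean = nn.Linear(64, action_dim)                     # mean of the action distribution
        self.log_std = nn.Parameter(torch.zeros(action_dim))      # learned log standard deviation

    def forward(self, state):
        h = self.body(state)
        dist = torch.distributions.Normal(self.mean(h), self.log_std.exp())
        action = dist.sample()                                    # explore by sampling; the mean is the best guess
        return action, dist.log_prob(action).sum(-1)              # summed log-prob over action dimensions
```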
A Real-World Application
A real-world example of Reinforcement Learning is the concept of self-driving cars:
· Agent: Vehicle
· State: Vehicle Sensors
· Action: Steering Wheel Angle
· Reward: Distance Travelled Before Crashing
Training Algorithm:
· Initialize the agent
· Run a policy until termination
· Record all states, actions and rewards {our mini data set for first round of training}
· Decrease probability of actions {eliminating the last action as it crashes} that resulted in low reward {crash}
· Increase probability of actions that resulted in high reward {first few trajectories}
· Repeat the process until convergence
The loss function for the policy-gradient neural network is the negative log-likelihood of selecting a particular action, multiplied by its reward. If a particular action led to a high reward, we want to sample that action again in the future, while actions associated with low rewards become less likely. The policy is improved by minimizing this loss with a Gradient Descent algorithm driven by the policy gradient.
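Putting the case study together, here is a compact sketch of the rollout-and-update loop. The `env`, `select_action`, and update names are hypothetical placeholders; in practice the environment would be a simulator such as VISTA rather than a real car:

```python
# Rollout-and-update loop for the driving example (env and select_action are hypothetical placeholders).
GAMMA = 0.99

def run_policy_until_termination(env, select_action):
    """Record every state, action and reward until the episode ends (e.g. a crash)."""
    states, actions, rewards = [], [], []
    state, done = env.reset(), False
    while not done:
        action = select_action(state)                 # e.g. a steering wheel angle
        next_state, reward, done = env.step(action)   # reward: distance travelled this step
        states.append(state); actions.append(action); rewards.append(reward)
        state = next_state
    return states, actions, rewards

def discounted_returns(rewards, gamma=GAMMA):
    """Credit each action with the discounted sum of the rewards that followed it."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    return returns

# Repeat until convergence: actions followed by high returns become more likely,
# actions followed by low returns (e.g. those just before a crash) become less likely,
# using a policy-gradient update such as the one sketched earlier.
```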
Challenges of Reinforcement Learning Algorithms in Real-World Applications
The first challenge is practicality. In our case study, one of the steps is to run a policy until termination; obviously, it is not feasible to keep crashing real cars. Although some argue that we can run simulations, we then run into the Simulation-to-Real gap: simulation does not equate to reality. Our end goal is for the algorithm to be applied to real-world scenarios, not just constrained to simulation.
For our car-driving example, MIT has created a solution by providing a photorealistic, high-fidelity simulator for training and testing self-driving cars, known as the VISTA simulator. It works by feeding the reinforcement learning model videos of cars driving on real roads to simulate the car’s trajectories, overcoming the Simulation-to-Real gap along with the logistical hazards and dangers of crashing.
The second challenge is the sparse reward setting. In our car example, just because the car crashes, it does not mean all the actions leading up to the crash should be penalized. The crash could be caused by a single slip at the end, so it does not indicate that everything up to that point should be disregarded. This can lead to inaccuracy in the model’s learning.
A solution is reward shaping, where human intervention is used to manually guide the reward system. However, this is costly and difficult, as we have to create a new customized reward system for every new environment. Moreover, it may cause the model to overfit instead of generalizing. It is thus difficult to train with sparse rewards, yet at the same time not optimal to individually customize the reward system for each task.
Reinforcement learning also faces common issues such as high sensitivity to hyperparameter tuning. For example, if the learning rate is too large, a policy update can push the policy network into a poor region of parameter space, where it then collects the next batch of data under a poor policy and struggles to recover.
Conclusion
Reinforcement Learning stands as a powerful paradigm within the domain of artificial intelligence, offering a principled framework for agents to learn optimal behaviour through interaction with their environment. From its theoretical foundations to its practical applications, RL has witnessed remarkable advancements, driven by innovations in algorithms, computational resources, and real-world use cases. It is beginning to deliver transformative breakthroughs that push the boundaries of what is possible in autonomous decision-making and adaptive systems. Reinforcement Learning is an ever-evolving field, and it is crucial to stay updated with the latest developments.
This is the end of the article, phew, so much information to absorb. If you made it to the end, kudos to you! Thank you for reading this article and stay tuned for more Machine Learning blogs 😊
References
[1] MIT 6.S191: Reinforcement Learning — Alexander Amini