Table of Contents
- Introduction
- Uncertainty in Reinforcement Learning
- Q-values and V-values
a. V-value
b. Q-value - Monte Carlo and Temporal Difference Methods
a. Monte Carlo Method (MC)
i. MC Estimation Algorithm
ii. Relationship between G and V
iii. V-value Estimation Formula
iv. V-value and Policy Correlation
v. Characteristics of MC
b. Temporal Difference Method (TD)
i. TD Estimation Algorithm
ii. Detailed Explanation of TD
iii. V-value Estimation Formula
1. Introduction
Artificial Intelligence = Deep Learning + Reinforcement Learning — David Silver
With the rise of deep neural networks, reinforcement learning has seen significant growth. Reinforcement learning is a branch of machine learning distinct from supervised and unsupervised learning. It focuses on how an agent learns to achieve favorable outcomes through interactions with its enviroment.
For example, a vacuum cleaner robot turns on and autonomously cleans areas with dust or debris without explicit instructions from a human. This represents the goal of reinforcement learning: enabling robots to perform tasks independently.
Key Concepts:
- Agent: An entity capable of perception and decision-making.
- Environment: The external world that responds to the agent's actions and provides feedback.
- State (s): Represents the condition of the environment; all possible states form the state space S.
- Action (a): A move taken by the agent based on its policy; all actions form the action space A.
- Policy (π): A mapping from states to action probabilities, guiding the agent’s behavior.
- Reward (r): Feedback from the environment indicating the desirability of an action—positive for encouragement, negative for discouragement, or zero otherwise.
Rewards are subjective and defined by the designer to facilitate effective learning.
The interaction between the agent and environment is illustrated in Figure 1 below.
Figure 1
Figure 2 shows the core components of reinforcement learning algorithms.
Figure 2
2. Uncertainty in Reinforcement Learning
- Policy Uncertainty: The choice of actions affects future states. Policies define the probability distribution over actions given a state. The objective is to find the optimal policy that maximizes cumulative rewards.
- Environmental Uncertainty: External factors can cause variations in outcomes even with identical actions. However, Markov processes allow for this uncertainty.
3. Q-values and V-values
In many cases, we cannot judge the quality of an action solely based on immediate reward. We must consider long-term consequences.
For instance, in chess, a move might initially seem disadvantageous but could lead to a greater gain later.
- Q-value: Expected total reward starting from a state, taking an action, and continuing until termination.
- V-value: Expected total reward starting from a state and continuing until termination.
Higher values indicate better expected returns. Thus, agents should select actions with higher Q-values or choose states with higher V-values.
a. V-value
- Begin at a state s.
- Generate multiple simulations of the agent following policy π.
- For each simulation, compute the sum of rewards from s to terminal state.
- Compute the average of these sums—the resulting value is the V-value for s.
In essence, V-value is the average return obtained by repeatedly executing the policy from a state until reaching the end.
b. Q-value
- Begin at a state s and take action a.
- Simulate the process from s to terminal state under action a.
- Record the accumulated reward across simulations.
- Take the average of these rewards—the result is the Q-value for (s,a).
Unlike V-values, Q-values depend on the transition probabilities of the environment rather than the policy itself.
Both Q-values and V-values assess nodes in a Markov Decision Process. V-values evaluate states, while Q-values evaluate state-action pairs.
4. Monte Carlo and Temporal Difference Methods
Since reinforcement learning aims to maximize total rewards, both Q-values and V-values are used as value functions. Estimating these functions can be done via two main methods: Monte Carlo (MC) and Temporal Difference (TD). Both methodds estimate either Q-values or V-values.
A key concept introduced here is the discount factor γ, which determines how much future rewards contribute to current value estimates. It is a hyperparameter chosen subjectively, similar to how present value calculations work in finance.
a. Monte Carlo Method (MC)
Figure 4
i. MC Estimation Algorithm
Algorithm: Estimate V-value using MC
1. Place agent at initial state s0
2. From s0, follow policy π to select action and move to new state s1
3. Repeat step 2 until terminal state reached
4. Backwardly calculate G (return) for each state along the trajectory
5. Repeat steps 1–4 multiple times, then average all G values for each state to get V-value
ii. Relationship Between G and V
The return G is computed as:
G = r_t + γ * r_{t+1} + γ² * r_{t+2} + ...
This formula accumulates discounted rewards from time t onwards. V-value represents the expected return starting from a state s, so it equals the mean of G values over multiple episodes.
iii. V-value Estimation Formula
New Average = Old Average + α * (New Sample - Old Average)
Where α is the learning rate—a hyperparameter controlling update magnitude—and G is the target value approaching the true return.
iv. V-value and Policy Correlation
As shown in Figures 5 and 6, different policies yield varying distributions of returns. Even if the same set of returns occurs, different policies will produce different V-values. This confirms that V-values are dependent on the selected policy.
Figure 5 Policy 1
Figure 6 Policy 2
v. Characteristics of MC
One major drawback is that MC requires full episode trajectories before updating estimates. If terminal states are rare, convergence becomes slow.
b. Temporal Difference Method (TD)
TD improves upon MC by allowing updates after each step instead of waiting for episode completion.
i. TD Estimation Algorithm
Algorithm: Estimate V-value using TD
1. Initialize environment and obtain initial state s0
2. For each step within an episode:
a. Select action according to current V-function
b. Execute action, observe reward r and next state s'
c. Update V(s) using TD rule
d. Set current state to s'
ii. Detailed Explanation of TD
Consider transitioning from state A to B. If B has been visited previously, it holds a value V(B), representing the expected return from B onward. Using this, we compute an updated estimate for A.
In TD, unlike MC where we wait for a complete trajectory, updates occur immediately after observing a reward, making the method more efficient.
iii. V-value Estimation Formula
In TD, the update uses the difference between the observed reward and the estimated value of the next state:
V(s) ← V(s) + α [r + γ * V(s') - V(s)]
Here, the target is not the full return G but a partial estimate based on one-step lookahead. This makes TD more flexible and applicable in real-time scenarios.