A Markov Decision Process with states \(s \in S\) and actions \(a \in A\), where the transition model \(T(s, a, s^\prime)\) and the reward function \(R(s, a, s^\prime)\) are unknown!
The agent learns from its experiences.
Model-based learning: step one, learn an empirical MDP model from experience (estimate \(T\) and \(R\) from observed transitions); step two, solve the learned MDP with value iteration as with normal MDPs. Repeat steps one and two as needed.
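A minimal sketch of this model-based loop, assuming experience is given as episodes of \((s, a, s^\prime, r)\) tuples; the function name and helper structures are illustrative, not from the lecture:

```python
from collections import defaultdict

def model_based_learning(episodes, gamma=0.9, iterations=100):
    """Step 1: estimate T and R from experience; step 2: value-iterate on the learned MDP."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = N(s, a, s')
    rewards = defaultdict(float)                     # running mean of R(s, a, s')
    for episode in episodes:                         # episode: list of (s, a, s', r) tuples
        for (s, a, s_next, r) in episode:
            counts[(s, a)][s_next] += 1
            n = counts[(s, a)][s_next]
            rewards[(s, a, s_next)] += (r - rewards[(s, a, s_next)]) / n  # incremental mean

    # Step 1 result: normalize counts into an empirical transition model T_hat(s, a, s').
    T_hat = {sa: {s2: c / sum(cs.values()) for s2, c in cs.items()}
             for sa, cs in counts.items()}

    # Step 2: standard value iteration, but on the estimated model.
    states = {s for (s, _) in T_hat} | {s2 for cs in T_hat.values() for s2 in cs}
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        V = {s: max((sum(p * (rewards[(s, a, s2)] + gamma * V[s2])
                         for s2, p in T_hat[(s, a)].items())
                     for (s0, a) in T_hat if s0 == s),
                    default=0.0)
             for s in states}
    return V
```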
Task: “policy evaluation”
In this case the learner has no choice about which actions to take; it just executes the policy and learns from experience.
There are two methods for value learning, direct evaluation and temporal-difference (TD) learning, and one further method, Q-learning.
Goal: Compute values for each state under \(\pi\)
Idea: Average together observed sample values
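Under one common reading of this idea (direct evaluation), you act on \(\pi\), record the discounted return observed after each visit to a state, and average those returns per state. A minimal sketch, assuming episodes are lists of \((s, r)\) pairs collected while following \(\pi\); all names are illustrative:

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=0.9):
    """Average the observed discounted returns that follow each visit to a state."""
    totals = defaultdict(float)   # sum of sampled returns per state
    visits = defaultdict(int)     # number of samples per state
    for episode in episodes:      # episode: list of (state, reward) pairs under policy pi
        G = 0.0
        for (s, r) in reversed(episode):   # accumulate returns backwards through the episode
            G = r + gamma * G
            totals[s] += G
            visits[s] += 1
    return {s: totals[s] / visits[s] for s in totals}
```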
TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages.
\begin{align} \text{sample}_i &= R(s, \pi(s), s^\prime_i) + \gamma V_k^\pi (s^\prime_i)\\ V_{k+1}^{\pi} (s) & \leftarrow \frac{1}{n} \sum_i \text{sample}_i \end{align}
\begin{equation} \overline{x}_n = (1 - \alpha) \cdot \overline{x}_{n-1} + \alpha \cdot x_n \end{equation}
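Applying this running average to one new sample at a time gives the TD update \(V^\pi(s) \leftarrow (1 - \alpha)\, V^\pi(s) + \alpha \left[ R(s, \pi(s), s^\prime) + \gamma V^\pi(s^\prime) \right]\). A minimal sketch, assuming \(V\) is a dict from states to value estimates (names are illustrative):

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update: move V(s) toward the new sample by a step of alpha."""
    sample = r + gamma * V.get(s_next, 0.0)               # R(s, pi(s), s') + gamma * V(s')
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample   # running (exponential) average
    return V
```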
Q-learning learns the \(Q\)-values as you go, not state values. This makes action selection model-free too.
Off-policy learning: Q-learning converges to the optimal policy even when acting sub-optimally, given enough exploration and a learning rate that eventually becomes small enough.
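A minimal sketch of one update, \(Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ r + \gamma \max_{a^\prime} Q(s^\prime, a^\prime) \right]\), assuming \(Q\) is a dict keyed by \((s, a)\) pairs and `actions` lists the actions available in \(s^\prime\) (names are illustrative):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in actions), default=0.0)
    sample = r + gamma * best_next
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q
```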
The agent also needs to decide how to collect experiences. This applies to both model-based and model-free learning; in the lecture, only Q-learning is covered!
There is a tradeoff between exploration and exploitation: “stay in the usual place that is okay” vs. “risk something new.” There are two possibilities to force exploration (a sketch of both follows below):
Random exploration (\(\varepsilon\)-greedy): with a small probability, take a random action instead of the currently best one.
Exploration functions: take a value estimate \(u\) and a visit count \(n\), and return an optimistic utility such as \(f(u, n) = u + \frac{k}{n}\).
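A sketch of both strategies as action-selection rules, assuming a \(Q\)-table and a visit-count table \(N\); the names, and using \(f\) directly for action selection, are illustrative choices rather than prescribed by the lecture:

```python
import random

def epsilon_greedy_action(Q, s, actions, epsilon=0.1):
    """Random exploration: with probability epsilon act randomly, else act greedily on Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def exploration_function_action(Q, N, s, actions, k=1.0):
    """Exploration function: rank actions by the optimistic utility f(u, n) = u + k / n."""
    def f(a):
        n = N.get((s, a), 0)                            # visit count of (s, a)
        u = Q.get((s, a), 0.0)                          # current value estimate
        return float("inf") if n == 0 else u + k / n    # unvisited actions look best
    return max(actions, key=f)
```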
Regret is a measure of your total mistake cost: the difference between your (expected) rewards and the optimal (expected) rewards.
Random exploration has a higher regret than exploration functions.
Useful for large state spaces: in realistic scenarios you cannot possibly learn about every single state or \(q\)-value; there are too many states to visit them all and too many to keep in memory.
Non-approximate (tabular) learning methods would not even recognize a slightly different state. The goal here is to generalize experience to new, similar situations, saving time and space.
Feature-Based Representations: describe a state using a vector of features (properties)
Linear Value Functions: Using a feature representation, we can write a \(q\) function (or value function) for any state using a few weights: \begin{align} V(s) &= \sum_{i=1}^n w_i f_i (s)\\ Q(s, a) &= \sum_{i=1}^n w_i f_i (s, a) \end{align} Advantage: our experience is summed up in a few powerful numbers. Disadvantage: states may share features but actually be very different in value!
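As a sketch, the linear \(Q\)-function is just a dot product between the weight vector and the feature vector; here `features(s, a)` is assumed to return a dict of feature values (names are illustrative):

```python
def linear_q_value(weights, features, s, a):
    """Q(s, a) = sum_i w_i * f_i(s, a); features(s, a) returns a dict {feature_name: value}."""
    return sum(weights.get(name, 0.0) * value for name, value in features(s, a).items())
```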
Adjust weights of active features. If something unexpectedly bad happens, blame the features that were on: disprefer all states with that state’s features. \begin{align} \text{transition} &= (s, a, r, s^\prime)\\ \text{difference} &= \left[ r + \gamma \max_{a^\prime} Q(s^\prime, a^\prime) \right] - Q(s,a)\\ Q(s, a) & \leftarrow Q(s, a) + \alpha [\text{difference}]\\ w_i & \leftarrow w_i + \alpha [\text{difference}] f_i (s, a) \end{align}
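A minimal sketch of this update, again assuming a `features(s, a)` dict and a `weights` dict (names are illustrative):

```python
def approximate_q_update(weights, features, s, a, r, s_next, actions,
                         alpha=0.01, gamma=0.9):
    """One approximate Q-learning step: adjust the weights of the active features."""
    def q(state, action):                      # Q(s, a) = sum_i w_i f_i(s, a)
        return sum(weights.get(n, 0.0) * v for n, v in features(state, action).items())

    best_next = max((q(s_next, a2) for a2 in actions), default=0.0)
    difference = (r + gamma * best_next) - q(s, a)    # [r + gamma max_a' Q(s',a')] - Q(s,a)
    for name, value in features(s, a).items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value  # w_i update
    return weights
```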
An optimization technique that can be used for linear regression: compare the observation \(y\) to the prediction \(\hat{y}\) and minimize the squared difference.
\begin{align} \text{total error} &= \sum_i (y_i - \hat{y}_i)^2\\ &= \sum_i \left( y_i - \sum_k w_k f_k (x_i) \right)^2 \end{align}
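Minimizing this error by gradient descent on a single observation (with the conventional factor \(\frac{1}{2}\)) recovers the weight update from approximate Q-learning: \begin{align} \text{error}(w) &= \frac{1}{2} \left( y - \sum_k w_k f_k(x) \right)^2\\ \frac{\partial\, \text{error}(w)}{\partial w_m} &= - \left( y - \sum_k w_k f_k(x) \right) f_m(x)\\ w_m &\leftarrow w_m - \alpha \frac{\partial\, \text{error}(w)}{\partial w_m} = w_m + \alpha \left( y - \sum_k w_k f_k(x) \right) f_m(x) \end{align} With \(y = r + \gamma \max_{a^\prime} Q(s^\prime, a^\prime)\), \(\hat{y} = Q(s, a)\), and \(f_k(x) = f_k(s, a)\), this is exactly the approximate Q-learning weight update above.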
2023-07-05