Monte Carlo methods require only *experience*: sample sequences of states, actions, and rewards. They do not require complete knowledge of the environment's dynamics.

## 5.1 Monte Carlo Prediction

This section is about learning the state-value function for a given policy; in other words, estimating $v_\pi(s)$.

*First-visit MC method*: averages only the returns following the first visit to $s$ in each episode.

*Every-visit MC method*: averages the returns following every visit to $s$.
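The first-visit variant can be sketched as follows (a minimal sketch, assuming episodes are given as lists of `(state, reward)` pairs, with the reward being the one received on leaving the state):

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate v(s) by averaging the returns that follow the first
    visit to s in each episode."""
    returns = defaultdict(list)
    for episode in episodes:
        # Compute the return G_t following each time step, backwards.
        g = 0.0
        g_at = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            g = episode[t][1] + gamma * g
            g_at[t] = g
        # Record the return only at the first visit to each state.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns[s].append(g_at[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```

The every-visit variant would simply drop the `seen` check and record the return after every occurrence of $s$.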

Monte Carlo methods' advantages over Dynamic Programming:

- Can learn from actual experience, without a model of the environment's dynamics
- Can learn from simulated experience
- The cost of estimating one state's value is independent of the total number of states

## 5.2 Monte Carlo Estimation of Action Values

To solve the problem of *maintaining exploration*, one option is the assumption of *exploring starts*: we assume that every episode starts in a state-action pair, with every pair having a nonzero probability of being selected as the start.

Note: we estimate $q(s,a)$ in Monte Carlo methods instead of $v(s)$ as in Dynamic Programming because, without a model, state values alone are not enough to determine a policy: turning $v(s)$ into a policy requires a one-step lookahead through the dynamics $p(s',r|s,a)$, which Monte Carlo methods do not assume we know. (When the model is available, working with $v(s)$ is the computationally friendlier choice.)

## 5.3 Monte Carlo Control

Generalized policy iteration (GPI): a loop of policy evaluation and policy improvement.

Two approaches to the evaluation step:

- Complete the whole policy evaluation before each improvement step
- Truncate the policy evaluation (value iteration, which uses a single sweep, is one extreme case)

**Monte Carlo ES (Exploring Starts)**: an algorithm for estimating $\pi\approx\pi_\ast$. See page 99 of the book.
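A condensed sketch of the idea (not the book's pseudocode verbatim; `generate_episode` is an assumed helper that starts each episode from a random state-action pair, per the exploring-starts assumption, and returns a list of `(state, action, reward)` triples):

```python
from collections import defaultdict

def mc_es(generate_episode, actions, n_episodes=10000, gamma=1.0):
    """Monte Carlo ES sketch: first-visit MC estimation of Q,
    with greedy policy improvement after every update."""
    Q = defaultdict(float)
    counts = defaultdict(int)
    policy = {}
    for _ in range(n_episodes):
        episode = generate_episode(policy)
        g = 0.0
        # Process the episode backwards, accumulating the return.
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            g = gamma * g + r
            # First-visit check: skip if (s, a) occurs earlier.
            if (s, a) not in [(x, y) for x, y, _ in episode[:t]]:
                counts[(s, a)] += 1
                Q[(s, a)] += (g - Q[(s, a)]) / counts[(s, a)]
                # Improve the policy greedily at s.
                policy[s] = max(actions, key=lambda a2: Q[(s, a2)])
    return Q, policy
```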

## 5.4 Monte Carlo Control without Exploring Starts

On-policy methods and off-policy methods:

Background: exploring starts are often unrealistic in practice. As a result, we instead need to ensure that all actions are selected infinitely often. There are two families of methods that ensure this: *on-policy* methods and *off-policy* methods.

On-policy methods: attempt to evaluate or improve the policy that is used to make decisions.

Off-policy methods: evaluate or improve a policy different from that used to generate the data. (Introduced in the next section).

Monte Carlo ES is one kind of on-policy method.

**On-policy first-visit MC control (for $\varepsilon$-soft policies)**: an algorithm for estimating $\pi\approx\pi_\ast$. See page 101 of the book.
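The $\varepsilon$-soft action selection at the heart of this algorithm can be sketched as follows (a hypothetical helper, assuming Q is stored as a `(state, action) -> value` dict):

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Sample from an epsilon-soft policy derived from Q: with
    probability epsilon explore uniformly at random, otherwise
    act greedily with respect to Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

Because every action keeps probability at least $\varepsilon/|\mathcal{A}(s)|$, all actions continue to be selected without needing exploring starts.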

## 5.5 Off-policy Prediction via Importance Sampling

Idea: resolve the dilemma between behaving optimally and maintaining exploration by using *off-policy learning*: learn about one policy from data generated by another.

**Target policy**: the policy being learned about.

**Behavior policy**: the policy used to generate behavior.

On-policy:

- Simpler

Off-policy:

- Require additional concepts and notation
- Have greater variance and converge more slowly
- More powerful and general
- Have additional use in applications

On-policy methods are a special case of off-policy methods, in which the target policy and the behavior policy are identical.

*Prediction* problem: the case in which both the target policy and the behavior policy are fixed and given.

Assumption of *coverage*: every action taken under $\pi$ must also be taken, at least occasionally, under the behavior policy $b$. That is, $\pi(a|s)>0 \Rightarrow b(a|s)>0$.

**Importance sampling**: estimating expected values under one distribution given samples from another.

*Importance-sampling ratio*:

$$ \rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1}\pi(A_k|S_k)p(S_{k+1}|S_k,A_k)}{\prod_{k=t}^{T-1}b(A_k|S_k)p(S_{k+1}|S_k,A_k)}=\prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)} $$

Note: the transition probabilities cancel, so the ratio depends only on the two policies and the observed trajectory, not on the MDP's dynamics.
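Because of that cancellation, the ratio can be computed from the two policies alone. A minimal sketch (assuming each policy is given as a `(state, action) -> probability` dict):

```python
def importance_ratio(trajectory, pi, b):
    """Compute rho_{t:T-1}: the product of pi(a|s)/b(a|s) over the
    (state, action) pairs of a trajectory segment. Assumes coverage,
    i.e. b(a|s) > 0 wherever pi(a|s) > 0."""
    rho = 1.0
    for s, a in trajectory:
        rho *= pi[(s, a)] / b[(s, a)]
    return rho
```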

Connecting $v_\pi(s)$ with $\pi$ and $b$: $$ \mathbb{E}[\rho_{t:T-1}G_t|S_t=s] = v_\pi(s) $$

Why: multiplying the probability of a trajectory under $b$ by $\rho_{t:T-1}$ recovers its probability under $\pi$, so

$$ \mathbb{E}_b[\rho_{t:T-1}G_t \mid S_t=s] = \sum_{\text{traj}}\Pr\{\text{traj}\mid s,b\}\,\rho_{t:T-1}G_t = \sum_{\text{traj}}\Pr\{\text{traj}\mid s,\pi\}\,G_t = v_\pi(s). $$

Estimate $v_\pi(s)$ using *ordinary importance sampling*:

First visit: **unbiased, unbounded variance**

Every visit: **biased**

$$
V(s) \doteq \frac{
\sum_{t\in\mathcal{J}(s)} \rho_{t:T-1}G_t
}{
|\mathcal{J}(s)|
}
$$
Estimate $v_\pi(s)$ using *weighted importance sampling*:

First visit: **biased, bounded variance**

Every visit: **biased**

$$ V(s) \doteq \frac{ \sum_{t\in\mathcal{J}(s)} \rho_{t:T-1}G_t }{ \sum_{t\in\mathcal{J}(s)} \rho_{t:T-1} } $$
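A sketch contrasting the two estimators (given hypothetical `(rho, G)` samples collected for a single state under the behavior policy):

```python
def is_estimates(samples):
    """Compute the ordinary and weighted importance-sampling
    estimates of v_pi(s) from (rho, G) pairs: ordinary divides the
    weighted returns by the sample count, weighted divides by the
    sum of the ratios."""
    num = sum(rho * g for rho, g in samples)
    ordinary = num / len(samples)
    total_rho = sum(rho for rho, _ in samples)
    weighted = num / total_rho if total_rho > 0 else 0.0
    return ordinary, weighted
```

Note how with a single sample the weighted estimate is just the observed return itself, illustrating its bias toward $v_b(s)$, while the ordinary estimate scales the return by the ratio.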