background.tex

\chapter{Reinforcement Learning}
\label{ch:background}

% \section{Reinforcement Learning}


\label{sec:background_rl}

In this chapter, we provide a brief review of \acf{RL}. \ac{RL} emerges from the idea that humans and other animals tend to learn about the world through interaction. We can observe the world around us and act in certain ways to achieve our goals, and at the same time learn more about the world that we live in. \ac{RL} is a computational approach to learning from interactions with the final goal of maximizing a numerical reward signal \cite{rlbook}.

\ac{RL} is applicable to a wide variety of applications, but this generality can also be problematic.
The learner is not told which actions are better or worse and it needs to figure everything out itself using a possibly weak reward signal.
To make matters worse, the consequences of an action may not be immediately visible to the learner as they might be revealed only after a long duration of time.
This article is focused on introducing some structure into this formulation while keeping its flexibility.


\section{Markov Decision Process}

The problem formulation in \ac{RL} is based on the concept of a \ac{MDP}. The \ac{MDP} is defined by a tuple $\{\mathcal{S}, \mathcal{A}, P, r, \gamma\}$, where $S \in \mathbb R^n$ and $A \in \mathbb R^m$ are the state space and action space of the problem, respectively. The transition function $P: S \times A \times S \to [0, \infty)$ 
is a probability density function, with $P(s_{t+1} \mid s_t, a_t)$ being the probability density of visiting $s_{t+1}$ given that at state $s_t$, the system takes action $a_t$.
The reward function $r: S\times A \to \mathbb R$ gives a scalar reward for each transition of the system. 
$\gamma \in (0, 1]$ is the discount factor. A deterministic \ac{MDP} is a special case where the transition function and the initial state distribution are deterministic.

Tasks can be categorized into two categories: \textit{episodic} and \textit{continuing} \cite{rlbook}. Episodic tasks can naturally be divided into subsequences known as \textit{episodes}, such as a single match in sports or one play of a game. Each episode ends after a certain period of time, known as the \textit{time horizon}, has passed or a pre-specified \textit{terminal state} has been reached. In continuing tasks, the task goes on without limit and there is no natural notion of an episode present. Here, we will consider the episodic case with a fixed time horizon $T$. For a more in depth discussion please refer to \cite{rlbook}. The goal of reinforcement learning is to find a parameterized policy $\pi_\theta$, where $\pi_\theta: S \times A \to [0, \infty)$ is the probability density of $a_t$ given $s_t$, that solves the following optimization problem:

\begin{align}
\label{eq:return}
\mathop{\mathrm{max}}_\theta J(\theta) = &\mathbb{E}_{\tau\sim p_\theta}\left[\sum_{t=0}^{T-1}\gamma^t{r({s}_t, {a}_t)} \right].
\end{align}

Here, $\tau = (s_0, a_0, r_0, \dots, s_T)$ is known as a \textit{trajectory} and the probability of encountering a trajectory is computed according to $p_\theta(\tau) = P(s_0) \Pi_{t=0}^{T-1} \pi_\theta(a_t|s_t) P(s_{t+1}|s_t,a_t)$. The discount factor ($\gamma$) controls how much we care about future rewards rather than immediate rewards. The (possibly discounted) sum of the future rewards is also known as the \textit{return} of a trajectory:
\begin{align*}
    G(\tau) \doteq \sum_{t=0}^{T-1}\gamma^t{r({s}_t, {a}_t)}.
\end{align*}

Another important categorization of tasks is based on whether state and action spaces are discrete or continuous. Discrete spaces are generally known to have more well-defined solutions, namely tabular algorithms, than the continuous spaces. In this thesis, we focus on tasks where both state and action spaces are continuous.

\section{Policy Gradient Methods}
\label{sec:background_pg}

Many algorithms have been proposed for solving \ac{RL} tasks and each is useful in certain scenarios. This section will explain the \ac{PPO} algorithm which is used in the following chapters. For a discussion of other existing methods please refer to~\cite{rl_survey}.

\ac{PPO} belongs to the policy gradient methods class which have been shown to work well on continuous tasks. The idea behind the \ac{PG} algorithm is straight-forward, namely, to optimize the average return by computing an approximate gradient with respect to the underlying policy parameters, and then taking a gradient ascent step to increase it.
% However, doing this naively is not possible. and that is where the algorithm works its magic. Even with such a high-level description, we might be able to guess some of the problems that can arise, such as: how to find the global optima, and are the gradients well-behaved or not. We will get back to these problems soon enough, but let us see how the algorithm works first.

To optimize the objective, \ac{PG} directly optimizes the policy $\pi$. One of the underlying assumptions of \ac{PG} is that the policy should be stochastic rather than deterministic for this algorithm to work, although this assumption can be relaxed \cite{ddpg}. Furthermore, we assume that this stochastic policy is parametrized by parameters $\theta$ and therefore the algorithm's job is to find the optimal parameters $\theta^*$ that maximize \Cref{eq:return}. To maximize the objective, we need to know the gradient $\nabla_\theta J(\theta)$. Since computing this gradient requires knowledge about the dynamics of the \ac{MDP}, \ac{PG} approximates it by using the REINFORCE trick \cite{williams1992simple}:

\begin{align}
    \nabla_\theta J(\theta) &= \nabla_\theta \int G(\tau) p_\theta(\tau) \\
    &= \int G(\tau) \nabla_\theta p_\theta(\tau)\\
    &= \int G(\tau) p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)\\
    &= \mathbb{E}_{\tau \sim p_\theta} \left[ G(\tau) \nabla_\theta \log p_\theta(\tau) \right]
    \label{eq:obj_grad}
\end{align}

We can switch the integral and the gradient operator if the policy is differentiable everywhere. The equality is based on the following identity:

\begin{align*}
    p_\theta(\tau)\nabla_\theta \log p_\theta(\tau) = p_\theta(\tau) \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} = \nabla_\theta p_\theta(\tau)
\end{align*}

Next, we can expand $\nabla_\theta \log p_\theta$:

\begin{align}
    \nabla_\theta \log p_\theta &= \nabla_\theta \left[ \log p(s_1) + \sum_{t=1}^T \log \pi_\theta(a_t|s_t) + \log p(s_{t+1} | s_t, a_t) \right]\\
    &= \nabla_\theta \left[ \sum_{t=1}^T \log \pi_\theta(a_t|s_t) \right]\\
    &= \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t|s_t)
\end{align}

The extra terms are independent of $\theta$ and therefore they do not contribute to the gradient. Substituting this back into \Cref{eq:obj_grad} we arrive at the following expression. For simplicity the discount factor, $\gamma$, has been set to one:

\begin{align}
    \nabla_\theta J(\theta) &= \mathbb{E}_{\tau \sim p_\theta} \left[ G(\tau) \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \right]\\
    &= \mathbb{E}_{\tau \sim p_\theta} \left[ \left( \sum_{t=1}^T r_t \right) \left( \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \right) \right]
\end{align}

Using this formulation we can use Monte Carlo sampling \cite{SHAPIRO2003353} to approximately compute the gradient in order to iteratively improve the policy:


\begin{align}
    \nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \left[ \left( \sum_{t=1}^T r_{i,t} \right) \left( \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) \right) \right]
\end{align}

Where, $\tau_i = (s_{i,1}, a_{i,1}, r_{i,1}, \dots, s_{i,T})$ are sample trajectories from $p_\theta$. With this we arrive at the REINFORCE algorithm \cite{williams1992simple}.

Unfortunately, this is not a good estimator in practice as its variance can be quite high. Multiple tricks exists that try to alleviate this problem. The simplest is to increase the number of sampled trajectories $N$, however, this also makes the algorithm less efficient. One observation is that the reward time step $t$ only causally depends on actions that were made until time $t$ and are independent of decisions that are made afterwards. In other words, action $a_t$ can only be responsible for the cost to go from time $t$ forward. With some abuse of notation we can write:
\begin{align}
    \nabla_\theta J(\theta) &\approx \mathbb{E}_{\tau \sim p_\theta} \left[ \sum_{t=1}^T \left( \nabla_\theta \log \pi_\theta(a_t|s_t)  \sum_{t'=t}^T r_t \right) \right]\\
    &= \mathbb{E}_{s_t \sim p_\theta} \left[T \nabla_\theta \log \pi_\theta(a_t|s_t)  \sum_{t'=t}^T r_t \right],
\end{align}
where $s_t$ is any state randomly sampled from a randomly sampled trajectory.

Another trick is to use a learned value function $\hat{V}(s)$ also known as a \textit{critic}. It can be shown that subtracting out a fixed value from the cumulative return in the above formula does not change the value of the expectation. Therefore, if $\hat{V}(s_t)$ is a good estimate of $\sum_{t'=t}^T r_t$ then the following approximator would have lower variance:
\begin{align}
    \nabla_\theta J(\theta) &\approx \mathbb{E}_{s_t \sim p_\theta} \left[T \nabla_\theta \log \pi_\theta(a_t|s_t) \left( \sum_{t'=t}^T r_t - \hat{V}(s_t) \right) \right].
\end{align}

\section{Proximal Policy Optimization (PPO)}

\ac{PG} is difficult to get working in practice. This problem is partly attributed to destructive updates, where a bad update during training can make the performance drop rapidly.
The \ac{TRPO} algorithm~\cite{trpo} was introduced to solve this problem.
The authors explain that destructive updates happen because the algorithm makes large updates based on an optimistic guess.
Therefore \ac{TRPO} only allows the policy to be updated within a trusted region.
In other words, it is a step-size control mechanism.

Later on, \ac{PPO} was introduced as a faster alternative to \ac{TRPO}. Instead of explicitly enforcing a trust-region, \ac{PPO} slightly changes the \ac{PG} update rule as follows:
\begin{align}
    L_{CLIP} = \mathbb{E}_{s_t \sim p_\theta} \left[ \min \left(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t\right) \right],
\end{align}
where the probability ratio is defined as $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ and the advantage is just a shorthand for $\hat{A}_t = \sum_{t'=t}^T r_t - \hat{V}(s_t)$. The hyper-parameter $\epsilon$ controls the step-size and it commonly takes on a value in the range $[0.1,0.2]$. For more information please refer to \cite{ppo}.

Intuitively, by clipping the objective, the flow of gradient is blocked if it tries to push the current policy outside the interval $[1-\epsilon, 1+\epsilon]$ around the old policy. Therefore, \ac{PPO} has been advertised as enforcing a trust-region on the updates.
Although the validity of this claim has recently been challenged~\cite{truly_pg}, the fact remains that with the right hyper-parameter setting \ac{PPO} is one of the best model-free algorithms in practice today.