RL Cheatsheet for Olfactory Navigation

Olivia is doing some cool olfactory navigation experiments and I thought collecting the following concepts from RL would be useful for her (and for me).

  • States $s$: Quantities capturing properties of the environment and the agent relevant for the task.
  • Actions $a$: The set of things an agent can do.
  • Reward $r$: A quantity delivered by the environment after an agent takes an action.
  • Discount factor $\gamma$: How reward one time-step in the future retains its value relative to the same reward now. Higher $\gamma$ means more retention, less depreciation.
  • Return: The discounted sum of the rewards the agent will accumulate over the course of the task, $\sum_{k \ge 0} \gamma^k r_{t+k}$. Its expected value is what we’re trying to maximize.
  • Policy $\pi(a|s)$: Describes how the agent picks actions in any given state.
    • Greedy policy: In every state, takes the action it thinks will give the highest return.
    • $\epsilon$-greedy policy: Just like the greedy policy, except for a random $\epsilon$ fraction of the time, when it picks an action uniformly at random instead (a minimal sketch appears after this list).
  • Value: $V_\pi(s)$: The expected discounted future reward from a given state. Depends on the policy.
  • Q-values: $Q_\pi(s, a)$: The expected discounted future reward from a given state when taking action $a$. This makes it easy to e.g. take the most rewarding action. Depends on the policy.
    • Value can be derived from Q-values by averaging over actions according to the policy: $$V(s) = \sum_a \pi(a|s) Q(s, a).$$
  • Reward-prediction error: The value of a state should equal, on average, the reward received in it plus the discounted value of the next state. The difference between this quantity, evaluated for the transition just experienced, and the agent’s current estimate of the state’s value is the reward-prediction error: $$ \text{RPE} \triangleq \underbrace{r + \gamma V(s_{t+1})}_{\text{Observed value of } s_t} - V(s_t).$$
  • Temporal difference (TD)-learning: A method for learning the value of each state using the RPE. After transitioning from state $s_t$ to $s_{t+1}$ and receiving a reward $r$, the value of state $s_t$ is updated as $$ \Delta V(s_t) = \text{(learning rate)} \times \text{RPE}$$ (a code sketch of this update appears after this list).
  • Q-learning: A method for learning Q-values. After taking action $a_t$ in state $s_t$, receiving reward $r$, and transitioning to state $s_{t+1}$, it updates the Q-value of the state-action pair by assuming the best possible action will be taken in the next state (see the code sketch after this list):
    $$ \Delta Q(s_t, a_t) = \text{(learning rate)} \times (r + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)).$$
  • State-Action-Reward-State-Action (SARSA): Q-learning assumes the best action is taken in the next state; SARSA uses the action, $a_{t+1}$, that was actually taken: $$ \Delta Q(s_t, a_t) = \text{(learning rate)} \times (r + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)).$$
  • Model: A model of the environment used in planning. It takes an input state and action and reports a predicted output state and reward.
  • Model-based RL: Any approach where the agent uses a model to pick the best action, e.g. by simulating the future (a toy one-step version is sketched after this list).
    • Model-based navigation: learning a map of the environment, so that if a route is blocked the agent can pick the next-fastest route without necessarily having taken it before.
  • Model-free RL: Approaches that update value functions, Q tables, etc. by directly experiencing the association of states, actions and rewards in the environment, without using a model. Prominent examples are TD-learning and Q-learning.
    • Model-free navigation: learning a route in the environment by associating each location with a travel direction. If a route is blocked, the learning process has to start again, exploring different states and actions and updating the associations until a new route is found.
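
A few of the concepts above are easiest to see in code. Here is a minimal Python sketch of the $\epsilon$-greedy policy, assuming Q-values live in a tabular NumPy array indexed by (state, action); the names `q_table`, `state`, and `epsilon` are just illustrative.

```python
import numpy as np

def epsilon_greedy_action(q_table, state, epsilon, rng=None):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    rng = rng or np.random.default_rng()
    n_actions = q_table.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: uniform random action
    return int(np.argmax(q_table[state]))     # exploit: action with the highest current Q-value
```

Setting $\epsilon = 0$ recovers the greedy policy; $\epsilon = 1$ is purely random exploration.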
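
The TD-learning update is just as short. This sketch assumes a value table `V` (a NumPy array or dict keyed by state); `alpha` (learning rate) and `gamma` (discount factor) are illustrative parameter names.

```python
def td_update(V, s_t, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: nudge V[s_t] toward r + gamma * V[s_next]."""
    rpe = r + gamma * V[s_next] - V[s_t]   # reward-prediction error
    V[s_t] += alpha * rpe                  # move the estimate by (learning rate) x RPE
    return rpe
```

Applied over many transitions, this drives the average RPE toward zero, at which point $V$ satisfies the consistency condition in the RPE bullet.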
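
Q-learning and SARSA differ only in the bootstrap term, which is easiest to see side by side. As above, `Q` is assumed to be a tabular NumPy array indexed by (state, action), and the parameter names are illustrative.

```python
import numpy as np

def q_learning_update(Q, s_t, a_t, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the best action available in the next state."""
    target = r + gamma * np.max(Q[s_next])
    Q[s_t, a_t] += alpha * (target - Q[s_t, a_t])

def sarsa_update(Q, s_t, a_t, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action a_next that was actually taken."""
    target = r + gamma * Q[s_next, a_next]
    Q[s_t, a_t] += alpha * (target - Q[s_t, a_t])
```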
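
Finally, a toy illustration of the model-based idea: the model is just a dictionary from (state, action) to the observed (next state, reward), and planning is a one-step lookahead using a value table like the one in the TD sketch. All names are hypothetical and the environment is assumed deterministic.

```python
def update_model(model, s, a, r, s_next):
    """Record an observed transition in a deterministic one-step model."""
    model[(s, a)] = (s_next, r)

def plan_action(model, V, s, actions, gamma=0.9):
    """Pick the action whose predicted one-step outcome looks best under the model."""
    def predicted_return(a):
        if (s, a) not in model:
            return float("-inf")        # never tried from this state: no prediction yet
        s_next, r = model[(s, a)]
        return r + gamma * V[s_next]
    return max(actions, key=predicted_return)
```

If the environment changes (a route is blocked), a model-based agent only needs to update the affected entries of `model` before re-planning, whereas a model-free agent has to re-experience transitions to update its Q-values, which is the contrast drawn in the last two bullets.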

$$ \blacksquare$$

