Reinforcement Learning

Reinforcement learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. Unlike supervised learning, there are no labeled examples—instead, the agent learns from the consequences of its actions through trial and error.

States

A state (s) represents the current situation or configuration of the environment at a given time. It captures all the relevant information the agent needs to make a decision.

For example:

  • In a chess game, the state is the current position of all pieces on the board
  • In a self-driving car, the state includes position, velocity, and sensor readings
  • In a robot arm, the state is the joint angles and positions

The set of all possible states is called the state space (S).
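To make this concrete, here is a minimal sketch of a state space in Python, assuming a small 4×4 grid world where a state is just the agent's (row, column) cell; the grid size and variable names are illustrative choices, not part of any particular library.

```python
from itertools import product

# Assumed example: a 4x4 grid world where a state is the agent's (row, col) cell.
GRID_SIZE = 4

# The state space S is the set of all possible (row, col) positions.
STATE_SPACE = set(product(range(GRID_SIZE), range(GRID_SIZE)))

start_state = (0, 0)                 # top-left corner
goal_state = (3, 3)                  # bottom-right corner

print(len(STATE_SPACE))              # 16 states in total
print(start_state in STATE_SPACE)    # True
```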

Actions

An action (a) is a choice the agent can make that affects the environment. At each state, the agent selects an action from the set of available actions.

For example:

  • In a game, actions might be move left, right, jump, or shoot
  • In robotics, actions could be motor commands or joint movements
  • In trading, actions might be buy, sell, or hold

The set of all possible actions is called the action space (A).
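Continuing the assumed grid-world example, a discrete action space can be represented as a small mapping from action names to position changes; the helper below is only a sketch of how an action affects the state.

```python
# Assumed example: four movement actions for the 4x4 grid world above.
# Each action is a (row_delta, col_delta) applied to the current (row, col) state.
ACTION_SPACE = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}

def apply_action(state, action, grid_size=4):
    """Return the next position, or the same state if the move leaves the grid."""
    dr, dc = ACTION_SPACE[action]
    row, col = state[0] + dr, state[1] + dc
    if 0 <= row < grid_size and 0 <= col < grid_size:
        return (row, col)
    return state                      # bumping into a wall leaves the state unchanged

print(apply_action((0, 0), "right"))  # (0, 1)
print(apply_action((0, 0), "up"))     # (0, 0) -- blocked by the top wall
```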

Rewards

A reward (R) is a scalar feedback signal that tells the agent how good or bad its action was. The agent’s goal is to maximise cumulative reward over time.

Rewards can be:

  • Positive — encourages the agent to repeat the action
  • Negative — discourages the action (also called penalties)
  • Sparse — only given at certain key moments (e.g., winning a game)
  • Dense — given frequently to guide learning

The reward function R(s, a, s') defines the reward received when taking action a in state s and transitioning to state s'.
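As a sketch of what such a function might look like, here is an illustrative R(s, a, s') for the grid world assumed above: a sparse +1 for reaching the goal and a small dense step penalty everywhere else. The specific numbers are arbitrary choices for the example.

```python
# Illustrative reward function R(s, a, s') for the assumed 4x4 grid world.
GOAL_STATE = (3, 3)

def reward(state, action, next_state):
    """Scalar feedback for the transition from state to next_state via action."""
    if next_state == GOAL_STATE:
        return 1.0       # sparse positive reward for reaching the goal
    return -0.01         # dense step penalty that encourages short paths

print(reward((3, 2), "right", (3, 3)))  # 1.0
print(reward((0, 0), "right", (0, 1)))  # -0.01
```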

Discount Factor γ

The discount factor (γ, gamma) is a value between 0 and 1 that determines how much the agent values future rewards compared to immediate rewards.

  • γ = 0: Agent only cares about immediate rewards (myopic)
  • γ = 1: Agent values future rewards equally to immediate rewards
  • γ ≈ 0.9 to 0.99: Common values that balance present and future

A discount factor less than 1 ensures that (see the numerical sketch after this list):

  1. The total return remains finite in continuing tasks
  2. The agent prefers rewards sooner rather than later
  3. Uncertainty about the future is accounted for
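These points can be checked numerically. The sketch below discounts an assumed constant stream of +1 rewards: for γ < 1 the total converges to 1 / (1 - γ), and the same reward is worth less the later it arrives.

```python
# Illustrative only: discounting a constant stream of +1 rewards.
def discounted_sum(rewards, gamma):
    """Sum of gamma^k * R_{k+1} over a finite reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0] * 1000                    # a long stream of +1 rewards

for gamma in (0.0, 0.9, 0.99):
    total = discounted_sum(rewards, gamma)
    # For gamma < 1 the infinite sum converges to 1 / (1 - gamma).
    print(f"gamma={gamma}: return ~ {total:.2f} (limit {1 / (1 - gamma):.2f})")

# With gamma = 0.9, a +1 reward arriving 10 steps from now is worth only
# about a third as much as one arriving immediately.
print(0.9**10)                            # ~0.35
```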

Return

The return (G_t) is the total accumulated reward from time step t onwards. It represents what the agent is ultimately trying to maximise.

The discounted return is calculated as:

G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ⋯ = Σ_{k=0}^{∞} γ^k R_{t+k+1}

Where:

  • R_{t+1} is the immediate reward after time t
  • γ^k discounts rewards further in the future

The return can be written recursively:

G_t = R_{t+1} + γ G_{t+1}
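The two definitions agree, which a short sketch can verify on an assumed finite episode of rewards: computing G_t directly from the sum and backwards through the recursion gives the same number.

```python
# Illustrative check that the direct and recursive forms of the return agree.
rewards = [0.0, -0.01, -0.01, 1.0]    # assumed rewards R_{t+1}, R_{t+2}, ...
gamma = 0.9

# Direct definition: G_t = sum over k of gamma^k * R_{t+k+1}
g_direct = sum(gamma**k * r for k, r in enumerate(rewards))

# Recursive definition: G_t = R_{t+1} + gamma * G_{t+1}, computed backwards.
g_recursive = 0.0
for r in reversed(rewards):
    g_recursive = r + gamma * g_recursive

print(g_direct, g_recursive)          # both ~0.7119
assert abs(g_direct - g_recursive) < 1e-9
```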

Policy π

A policy (π) defines the agent’s behaviour—it maps states to actions. The policy is what the agent learns and improves over time.

Policies can be:

  • Deterministic: π(s) = a — always takes the same action in a given state
  • Stochastic: π(a|s) = P(A_t = a | S_t = s) — a probability distribution over actions

The goal of reinforcement learning is to find an optimal policy π* that maximises the expected return from any state.
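In code, a deterministic policy can be as simple as a lookup table from states to actions, while a stochastic policy stores a distribution per state and samples from it. The sketch below uses the assumed grid-world states and actions from earlier; none of these names come from a library.

```python
import random

# Deterministic policy pi(s) = a: a plain mapping from state to action.
deterministic_policy = {
    (0, 0): "right",
    (0, 1): "down",
}

# Stochastic policy pi(a|s): a probability distribution over actions per state.
stochastic_policy = {
    (0, 0): {"right": 0.7, "down": 0.3},
}

def sample_action(policy, state):
    """Draw an action from a stochastic policy pi(a|s)."""
    dist = policy[state]
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

print(deterministic_policy[(0, 0)])              # always "right"
print(sample_action(stochastic_policy, (0, 0)))  # "right" about 70% of the time
```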

Markov Decision Process

A Markov Decision Process (MDP) is the formal mathematical framework used to model reinforcement learning problems. An MDP brings together all the core concepts into a unified structure.

An MDP is defined by the tuple (S, A, P, R, γ):

  • S — the set of states
  • A — the set of actions
  • P(s'|s, a) — the state transition probability (probability of reaching state s' given state s and action a)
  • R(s, a, s') — the reward function
  • γ — the discount factor
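One way to see how the pieces fit is to bundle them into a single structure. The dataclass below is only an illustrative sketch for the assumed grid-world example; the field names and types are not a standard API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

State = Tuple[int, int]
Action = str

@dataclass
class MDP:
    """Illustrative container for the five ingredients of an MDP."""
    states: Set[State]                                           # S
    actions: Set[Action]                                         # A
    transition: Callable[[State, Action], Dict[State, float]]    # P(s'|s, a)
    reward: Callable[[State, Action, State], float]              # R(s, a, s')
    gamma: float                                                 # discount factor
```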

The Markov Property

The key assumption in an MDP is the Markov property: the future depends only on the current state, not on the history of how we got there.

P(S_{t+1} | S_t, A_t, S_{t-1}, A_{t-1}, …) = P(S_{t+1} | S_t, A_t)

This means the current state contains all the information needed to make optimal decisions—the past doesn’t matter once we know the present.

State Transition Dynamics

The transition function P(s'|s, a) defines the environment’s dynamics:

  • Deterministic: Taking action a in state s always leads to the same next state s'
  • Stochastic: Taking action a in state s leads to different states with certain probabilities

For example, in a grid world with slippery ice, moving “right” might have an 80% chance of going right but a 10% chance each of going up or down.
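A stochastic transition like this can be written as a distribution over next states. The sketch below hard-codes the assumed 80/10/10 split for the action "right" and ignores grid boundaries for brevity.

```python
import random

def slippery_transition(state, action="right"):
    """Return P(s'|s, a) as a dict of next-state probabilities (boundaries ignored)."""
    row, col = state
    return {
        (row, col + 1): 0.8,   # intended move: right
        (row - 1, col): 0.1,   # slip: up
        (row + 1, col): 0.1,   # slip: down
    }

def sample_next_state(state):
    dist = slippery_transition(state)
    states, probs = zip(*dist.items())
    return random.choices(states, weights=probs, k=1)[0]

print(slippery_transition((2, 2)))   # the full distribution P(s'|s, a)
print(sample_next_state((2, 2)))     # one stochastic outcome
```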

Solving an MDP

The goal is to find an optimal policy π* that maximises expected return. Common approaches include:

  • Value Iteration — iteratively compute optimal state values (sketched in code after this list)
  • Policy Iteration — iteratively improve the policy
  • Q-Learning — learn action-values through experience
  • Policy Gradient — directly optimise the policy using gradients
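As an example of the first approach, here is a minimal value-iteration sketch for the grid world assumed throughout (deterministic moves, +1 at the goal, -0.01 step cost, γ = 0.9). It is illustrative rather than optimised, and the convergence criterion is just a fixed number of sweeps.

```python
from itertools import product

# Assumed 4x4 grid world: deterministic moves, +1 at the goal, -0.01 step cost.
GRID, GOAL, GAMMA = 4, (3, 3), 0.9
STATES = list(product(range(GRID), range(GRID)))
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(s, a):
    """Deterministic dynamics: next state and reward for taking action a in state s."""
    r, c = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    s2 = (r, c) if 0 <= r < GRID and 0 <= c < GRID else s
    return s2, (1.0 if s2 == GOAL else -0.01)

V = {s: 0.0 for s in STATES}
for _ in range(100):                  # fixed number of sweeps for simplicity
    for s in STATES:
        if s == GOAL:
            continue                  # treat the goal as terminal (value 0)
        # Bellman optimality backup: V(s) = max_a [ R(s, a, s') + gamma * V(s') ]
        V[s] = max(r + GAMMA * V[s2] for s2, r in (step(s, a) for a in MOVES))

# Read off a greedy policy from the converged values.
policy = {s: max(MOVES, key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
          for s in STATES if s != GOAL}
print(policy[(0, 0)])                 # a move that heads toward the goal
```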

How It All Fits Together

```mermaid
graph LR
    A[Agent] -->|action a| E[Environment]
    E -->|state s| A
    E -->|reward r| A
```

  1. The agent observes the current state s
  2. Based on its policy π, it selects an action a
  3. The environment transitions to a new state s'
  4. The agent receives a reward r
  5. The agent updates its policy to maximise future returns (discounted by γ)
  6. Repeat

This cycle continues as the agent learns which actions lead to higher cumulative rewards.
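Putting it together, the sketch below runs this loop on the grid world assumed earlier with a purely random policy, just to show the state → action → reward cycle and the discounted return it produces; a learning agent would additionally update its policy at step 5.

```python
import random

# Assumed grid world (same as earlier sketches): 4x4, goal at (3, 3).
GRID, GOAL, GAMMA = 4, (3, 3), 0.9
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def env_step(state, action):
    """Environment side of the loop: next state s' and reward r for (s, a)."""
    r, c = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
    next_state = (r, c) if 0 <= r < GRID and 0 <= c < GRID else state
    return next_state, (1.0 if next_state == GOAL else -0.01)

state, g, discount = (0, 0), 0.0, 1.0
for t in range(200):                              # one episode, capped at 200 steps
    action = random.choice(list(MOVES))           # steps 1-2: observe state, pick action
    next_state, reward = env_step(state, action)  # steps 3-4: environment responds
    g += discount * reward                        # accumulate the discounted return
    discount *= GAMMA
    state = next_state
    if state == GOAL:
        break                                     # episode ends at the goal
print(f"episode finished at t={t}, discounted return G_0 = {g:.3f}")
```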
