Reinforcement Learning

Reinforcement learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. Unlike supervised learning, there are no labeled examples—instead, the agent learns from the consequences of its actions through trial and error.

States

A state (s) represents the current situation or configuration of the environment at a given time. It captures all the relevant information the agent needs to make a decision.

For example:

  • In a chess game, the state is the current position of all pieces on the board
  • In a self-driving car, the state includes position, velocity, and sensor readings
  • In a robot arm, the state is the joint angles and positions

The set of all possible states is called the state space (S).
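To make this concrete, here is a minimal sketch of a state space in Python, assuming a small 4×4 grid world where a state is just the agent's (row, column) cell; the grid size and variable names are illustrative choices, not part of any particular library.

```python
from itertools import product

# Assumed example: a 4x4 grid world where a state is the agent's (row, col) cell.
GRID_SIZE = 4

# The state space S is the set of all possible (row, col) positions.
STATE_SPACE = set(product(range(GRID_SIZE), range(GRID_SIZE)))

start_state = (0, 0)                 # top-left corner
goal_state = (3, 3)                  # bottom-right corner

print(len(STATE_SPACE))              # 16 states in total
print(start_state in STATE_SPACE)    # True
```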

Actions

An action (a) is a choice the agent can make that affects the environment. At each state, the agent selects an action from the set of available actions.

For example:

  • In a game, actions might be move left, right, jump, or shoot
  • In robotics, actions could be motor commands or joint movements
  • In trading, actions might be buy, sell, or hold

The set of all possible actions is called the action space (A).
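Continuing the assumed grid-world example, a discrete action space can be represented as a small mapping from action names to position changes; the helper below is only a sketch of how an action affects the state.

```python
# Assumed example: four movement actions for the 4x4 grid world above.
# Each action is a (row_delta, col_delta) applied to the current (row, col) state.
ACTION_SPACE = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}

def apply_action(state, action, grid_size=4):
    """Return the next position, or the same state if the move leaves the grid."""
    dr, dc = ACTION_SPACE[action]
    row, col = state[0] + dr, state[1] + dc
    if 0 <= row < grid_size and 0 <= col < grid_size:
        return (row, col)
    return state                      # bumping into a wall leaves the state unchanged

print(apply_action((0, 0), "right"))  # (0, 1)
print(apply_action((0, 0), "up"))     # (0, 0) -- blocked by the top wall
```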

Rewards

A reward (R) is a scalar feedback signal that tells the agent how good or bad its action was. The agent’s goal is to maximise cumulative reward over time.

Rewards can be:

  • Positive — encourages the agent to repeat the action
  • Negative — discourages the action (also called penalties)
  • Sparse — only given at certain key moments (e.g., winning a game)
  • Dense — given frequently to guide learning

The reward function R(s, a, s') defines the reward received when taking action a in state s and transitioning to state s'.
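As a sketch of what such a function might look like, here is an illustrative R(s, a, s') for the grid world assumed above: a sparse +1 for reaching the goal and a small dense step penalty everywhere else. The specific numbers are arbitrary choices for the example.

```python
# Illustrative reward function R(s, a, s') for the assumed 4x4 grid world.
GOAL_STATE = (3, 3)

def reward(state, action, next_state):
    """Scalar feedback for the transition from state to next_state via action."""
    if next_state == GOAL_STATE:
        return 1.0       # sparse positive reward for reaching the goal
    return -0.01         # dense step penalty that encourages short paths

print(reward((3, 2), "right", (3, 3)))  # 1.0
print(reward((0, 0), "right", (0, 1)))  # -0.01
```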

Discount Factor γ

The discount factor (γ, gamma) is a value between 0 and 1 that determines how much the agent values future rewards compared to immediate rewards.

  • γ = 0: Agent only cares about immediate rewards (myopic)
  • γ = 1: Agent values future rewards equally to immediate rewards
  • γ ≈ 0.9 to 0.99: Common values that balance present and future

A discount factor less than 1 ensures that (see the numerical sketch after this list):

  1. The total return remains finite in continuing tasks
  2. The agent prefers rewards sooner rather than later
  3. Uncertainty about the future is accounted for
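These points can be checked numerically. The sketch below discounts an assumed constant stream of +1 rewards: for γ < 1 the total converges to 1 / (1 - γ), and the same reward is worth less the later it arrives.

```python
# Illustrative only: discounting a constant stream of +1 rewards.
def discounted_sum(rewards, gamma):
    """Sum of gamma^k * R_{k+1} over a finite reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0] * 1000                    # a long stream of +1 rewards

for gamma in (0.0, 0.9, 0.99):
    total = discounted_sum(rewards, gamma)
    # For gamma < 1 the infinite sum converges to 1 / (1 - gamma).
    print(f"gamma={gamma}: return ~ {total:.2f} (limit {1 / (1 - gamma):.2f})")

# With gamma = 0.9, a +1 reward arriving 10 steps from now is worth only
# about a third as much as one arriving immediately.
print(0.9**10)                            # ~0.35
```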

Return

The return (G_t) is the total accumulated reward from time step t onwards. It represents what the agent is ultimately trying to maximise.

The discounted return is calculated as:

G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ⋯ = Σ_{k=0}^{∞} γ^k R_{t+k+1}

Where:

  • R_{t+1} is the immediate reward after time t
  • γ^k discounts rewards further in the future

The return can be written recursively:

G_t = R_{t+1} + γ G_{t+1}
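The two definitions agree, which a short sketch can verify on an assumed finite episode of rewards: computing G_t directly from the sum and backwards through the recursion gives the same number.

```python
# Illustrative check that the direct and recursive forms of the return agree.
rewards = [0.0, -0.01, -0.01, 1.0]    # assumed rewards R_{t+1}, R_{t+2}, ...
gamma = 0.9

# Direct definition: G_t = sum over k of gamma^k * R_{t+k+1}
g_direct = sum(gamma**k * r for k, r in enumerate(rewards))

# Recursive definition: G_t = R_{t+1} + gamma * G_{t+1}, computed backwards.
g_recursive = 0.0
for r in reversed(rewards):
    g_recursive = r + gamma * g_recursive

print(g_direct, g_recursive)          # both ~0.7119
assert abs(g_direct - g_recursive) < 1e-9
```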

Policy π

A policy (π) defines the agent’s behaviour—it maps states to actions. The policy is what the agent learns and improves over time.

Policies can be:

  • Deterministic: π(s) = a — always takes the same action in a given state
  • Stochastic: π(a|s) = P(A_t = a | S_t = s) — a probability distribution over actions

The goal of reinforcement learning is to find an optimal policy π* that maximises the expected return from any state.
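In code, a deterministic policy can be as simple as a lookup table from states to actions, while a stochastic policy stores a distribution per state and samples from it. The sketch below uses the assumed grid-world states and actions from earlier; none of these names come from a library.

```python
import random

# Deterministic policy pi(s) = a: a plain mapping from state to action.
deterministic_policy = {
    (0, 0): "right",
    (0, 1): "down",
}

# Stochastic policy pi(a|s): a probability distribution over actions per state.
stochastic_policy = {
    (0, 0): {"right": 0.7, "down": 0.3},
}

def sample_action(policy, state):
    """Draw an action from a stochastic policy pi(a|s)."""
    dist = policy[state]
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

print(deterministic_policy[(0, 0)])              # always "right"
print(sample_action(stochastic_policy, (0, 0)))  # "right" about 70% of the time
```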

Markov Decision Process

A Markov Decision Process (MDP) is the formal mathematical framework used to model reinforcement learning problems. An MDP brings together all the core concepts into a unified structure.

An MDP is defined by the tuple (S, A, P, R, γ):

  • S — the set of states
  • A — the set of actions
  • P(s'|s, a) — the state transition probability (probability of reaching state s' given state s and action a)
  • R(s, a, s') — the reward function
  • γ — the discount factor
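One way to see how the pieces fit is to bundle them into a single structure. The dataclass below is only an illustrative sketch for the assumed grid-world example; the field names and types are not a standard API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

State = Tuple[int, int]
Action = str

@dataclass
class MDP:
    """Illustrative container for the five ingredients of an MDP."""
    states: Set[State]                                           # S
    actions: Set[Action]                                         # A
    transition: Callable[[State, Action], Dict[State, float]]    # P(s'|s, a)
    reward: Callable[[State, Action, State], float]              # R(s, a, s')
    gamma: float                                                 # discount factor
```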

The Markov Property

The key assumption in an MDP is the Markov property: the future depends only on the current state, not on the history of how we got there.

P(S_{t+1} | S_t, A_t, S_{t-1}, A_{t-1}, …) = P(S_{t+1} | S_t, A_t)

This means the current state contains all the information needed to make optimal decisions—the past doesn’t matter once we know the present.

State Transition Dynamics

The transition function P(s'|s, a) defines the environment’s dynamics:

  • Deterministic: Taking action a in state s always leads to the same next state s'
  • Stochastic: Taking action a in state s leads to different states with certain probabilities

For example, in a grid world with slippery ice, moving “right” might have an 80% chance of going right but a 10% chance each of going up or down.
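A stochastic transition like this can be written as a distribution over next states. The sketch below hard-codes the assumed 80/10/10 split for the action "right" and ignores grid boundaries for brevity.

```python
import random

def slippery_transition(state, action="right"):
    """Return P(s'|s, a) as a dict of next-state probabilities (boundaries ignored)."""
    row, col = state
    return {
        (row, col + 1): 0.8,   # intended move: right
        (row - 1, col): 0.1,   # slip: up
        (row + 1, col): 0.1,   # slip: down
    }

def sample_next_state(state):
    dist = slippery_transition(state)
    states, probs = zip(*dist.items())
    return random.choices(states, weights=probs, k=1)[0]

print(slippery_transition((2, 2)))   # the full distribution P(s'|s, a)
print(sample_next_state((2, 2)))     # one stochastic outcome
```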

Solving an MDP

The goal is to find an optimal policy π* that maximises expected return. Common approaches include:

  • Value Iteration — iteratively compute optimal state values (sketched in code after this list)
  • Policy Iteration — iteratively improve the policy
  • Q-Learning — learn action-values through experience
  • Policy Gradient — directly optimise the policy using gradients
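As an example of the first approach, here is a minimal value-iteration sketch for the grid world assumed throughout (deterministic moves, +1 at the goal, -0.01 step cost, γ = 0.9). It is illustrative rather than optimised, and the convergence criterion is just a fixed number of sweeps.

```python
from itertools import product

# Assumed 4x4 grid world: deterministic moves, +1 at the goal, -0.01 step cost.
GRID, GOAL, GAMMA = 4, (3, 3), 0.9
STATES = list(product(range(GRID), range(GRID)))
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(s, a):
    """Deterministic dynamics: next state and reward for taking action a in state s."""
    r, c = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    s2 = (r, c) if 0 <= r < GRID and 0 <= c < GRID else s
    return s2, (1.0 if s2 == GOAL else -0.01)

V = {s: 0.0 for s in STATES}
for _ in range(100):                  # fixed number of sweeps for simplicity
    for s in STATES:
        if s == GOAL:
            continue                  # treat the goal as terminal (value 0)
        # Bellman optimality backup: V(s) = max_a [ R(s, a, s') + gamma * V(s') ]
        V[s] = max(r + GAMMA * V[s2] for s2, r in (step(s, a) for a in MOVES))

# Read off a greedy policy from the converged values.
policy = {s: max(MOVES, key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
          for s in STATES if s != GOAL}
print(policy[(0, 0)])                 # a move that heads toward the goal
```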

How It All Fits Together

```mermaid
graph LR
    A[Agent] -->|action a| E[Environment]
    E -->|state s| A
    E -->|reward r| A
```

  1. The agent observes the current state s
  2. Based on its policy π, it selects an action a
  3. The environment transitions to a new state s'
  4. The agent receives a reward r
  5. The agent updates its policy to maximise future returns (discounted by γ)
  6. Repeat

This cycle continues as the agent learns which actions lead to higher cumulative rewards.
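Putting it together, the sketch below runs this loop on the grid world assumed earlier with a purely random policy, just to show the state → action → reward cycle and the discounted return it produces; a learning agent would additionally update its policy at step 5.

```python
import random

# Assumed grid world (same as earlier sketches): 4x4, goal at (3, 3).
GRID, GOAL, GAMMA = 4, (3, 3), 0.9
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def env_step(state, action):
    """Environment side of the loop: next state s' and reward r for (s, a)."""
    r, c = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
    next_state = (r, c) if 0 <= r < GRID and 0 <= c < GRID else state
    return next_state, (1.0 if next_state == GOAL else -0.01)

state, g, discount = (0, 0), 0.0, 1.0
for t in range(200):                              # one episode, capped at 200 steps
    action = random.choice(list(MOVES))           # steps 1-2: observe state, pick action
    next_state, reward = env_step(state, action)  # steps 3-4: environment responds
    g += discount * reward                        # accumulate the discounted return
    discount *= GAMMA
    state = next_state
    if state == GOAL:
        break                                     # episode ends at the goal
print(f"episode finished at t={t}, discounted return G_0 = {g:.3f}")
```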
