Reinforcement Learning
Reinforcement learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. Unlike supervised learning, RL does not rely on labeled examples; instead, the agent learns from the consequences of its actions through trial and error.
States
A state ($s$) represents the current situation or configuration of the environment at a given time. It captures all the relevant information the agent needs to make a decision.
For example:
- In a chess game, the state is the current position of all pieces on the board
- In a self-driving car, the state includes position, velocity, and sensor readings
- In a robot arm, the state is the joint angles and positions
The set of all possible states is called the state space ($\mathcal{S}$).
Actions
An action ($a$) is a choice the agent can make that affects the environment. At each state, the agent selects an action from the set of available actions.
For example:
- In a game, actions might be move left, right, jump, or shoot
- In robotics, actions could be motor commands or joint movements
- In trading, actions might be buy, sell, or hold
The set of all possible actions is called the action space ($\mathcal{A}$).
Rewards
A reward ($r$) is a scalar feedback signal that tells the agent how good or bad its action was. The agent’s goal is to maximise cumulative reward over time.
Rewards can be:
- Positive — encourages the agent to repeat the action
- Negative — discourages the action (also called penalties)
- Sparse — only given at certain key moments (e.g., winning a game)
- Dense — given frequently to guide learning
The reward function $R(s, a, s')$ defines the reward received when taking action $a$ in state $s$ and transitioning to state $s'$.
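As a small illustration, a reward function can be written as an ordinary function of $(s, a, s')$. The sketch below assumes a made-up grid world whose goal cell, pit cell, and reward values are chosen purely for the example:

```python
# Toy reward function R(s, a, s') for a hypothetical grid world.
# States are (row, col) tuples; GOAL, PIT, and the reward values are invented.
GOAL = (3, 3)
PIT = (1, 3)

def reward(state, action, next_state):
    """Scalar reward for the transition (state, action, next_state)."""
    if next_state == GOAL:
        return 1.0    # sparse positive reward for reaching the goal
    if next_state == PIT:
        return -1.0   # penalty for falling into the pit
    return -0.01      # small dense step cost that encourages short paths
```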
Discount Factor
The discount factor ($\gamma$, gamma) is a value between 0 and 1 that determines how much the agent values future rewards compared to immediate rewards.
- $\gamma = 0$: Agent only cares about immediate rewards (myopic)
- $\gamma = 1$: Agent values future rewards equally to immediate rewards
- $\gamma = 0.9$ to $0.99$: Common values that balance present and future
A discount factor less than 1 ensures that:
- The total return remains finite in continuing tasks
- The agent prefers rewards sooner rather than later
- Uncertainty about the future is accounted for
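A quick numerical check makes the role of $\gamma$ concrete: the weight placed on a reward that arrives $k$ steps in the future is $\gamma^k$. The snippet below is a minimal sketch (the horizon of 5 steps is arbitrary):

```python
# Weight gamma**k given to a reward arriving k steps in the future.
for gamma in (0.0, 0.9, 0.99, 1.0):
    weights = [round(gamma ** k, 3) for k in range(5)]
    print(gamma, weights)
# gamma = 0.0 keeps only the immediate reward (myopic);
# gamma = 1.0 weights all future rewards equally, so the return can diverge
# in continuing tasks; intermediate values decay geometrically.
```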
Return
The return ($G_t$) is the total accumulated reward from time step $t$ onwards. It represents what the agent is ultimately trying to maximise.
The discounted return is calculated as:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

Where:
- $R_{t+1}$ is the immediate reward after time $t$
- $\gamma^k$ discounts rewards further in the future

The return can be written recursively:

$$G_t = R_{t+1} + \gamma G_{t+1}$$
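The direct sum and the recursive form give the same number, which a short sketch can confirm (the reward sequence and $\gamma$ below are invented for the example):

```python
# Discounted return computed two ways: direct sum vs. recursive unrolling.
rewards = [1.0, 0.0, 0.0, 5.0]   # R_{t+1}, R_{t+2}, ... (made-up values)
gamma = 0.9

# Direct definition: G_t = sum_k gamma^k * R_{t+k+1}
g_direct = sum(gamma ** k * r for k, r in enumerate(rewards))

# Recursive form: G_t = R_{t+1} + gamma * G_{t+1}, unrolled from the last step
g_recursive = 0.0
for r in reversed(rewards):
    g_recursive = r + gamma * g_recursive

print(g_direct, g_recursive)   # both print ~4.645 (up to floating-point rounding)
```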
Policy
A policy ($\pi$) defines the agent’s behaviour: it maps states to actions. The policy is what the agent learns and improves over time.
Policies can be:
- Deterministic: $a = \pi(s)$, which always takes the same action in a given state
- Stochastic: $\pi(a \mid s)$, a probability distribution over actions
The goal of reinforcement learning is to find an optimal policy $\pi^*$ that maximises the expected return from any state.
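Both kinds of policy can be sketched as small functions. The state names, action set, and probabilities below are placeholders invented for the illustration:

```python
import random

ACTIONS = ["left", "right", "jump"]   # hypothetical action space

def deterministic_policy(state):
    """a = pi(s): always returns the same action for a given state."""
    return "right" if state == "near_edge" else "left"

def stochastic_policy(state):
    """pi(a|s): samples an action from a probability distribution.
    A fixed distribution is used here; a real policy would vary it with the state."""
    probs = {"left": 0.2, "right": 0.7, "jump": 0.1}   # assumed distribution
    return random.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]

print(deterministic_policy("near_edge"), stochastic_policy("near_edge"))
```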
Markov Decision Process
A Markov Decision Process (MDP) is the formal mathematical framework used to model reinforcement learning problems. An MDP brings together all the core concepts into a unified structure.
An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
- $\mathcal{S}$: the set of states
- $\mathcal{A}$: the set of actions
- $P(s' \mid s, a)$: the state transition probability (the probability of reaching state $s'$ given state $s$ and action $a$)
- $R(s, a, s')$: the reward function
- $\gamma$: the discount factor
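To make the tuple concrete, a two-state MDP can be written out as plain dictionaries. Everything below (state names, probabilities, rewards) is a toy example, and for simplicity the reward is taken to depend only on $(s, a)$ rather than $(s, a, s')$:

```python
# A toy two-state, two-action MDP written out as plain dictionaries.
S = ["s0", "s1"]
A = ["stay", "go"]

# P[s][a] maps next states to probabilities: P(s' | s, a)
P = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 0.9, "s0": 0.1}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.9, "s1": 0.1}},
}

# R[s][a] is the expected immediate reward for taking action a in state s
R = {
    "s0": {"stay": 0.0, "go": 1.0},
    "s1": {"stay": 0.0, "go": 2.0},
}

gamma = 0.95   # discount factor
```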
The Markov Property
The key assumption in an MDP is the Markov property: the future depends only on the current state, not on the history of how we got there.
This means the current state contains all the information needed to make optimal decisions—the past doesn’t matter once we know the present.
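Formally, the property can be stated as an equality of conditional probabilities (standard notation, with $s_t$ and $a_t$ the state and action at step $t$):

$$P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_0, a_0, s_1, a_1, \ldots, s_t, a_t)$$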
State Transition Dynamics
The transition function $P(s' \mid s, a)$ defines the environment’s dynamics:
- Deterministic: taking action $a$ in state $s$ always leads to the same next state
- Stochastic: taking action $a$ in state $s$ leads to different states with certain probabilities
For example, in a grid world with slippery ice, moving “right” might have an 80% chance of going right but a 10% chance each of going up or down.
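That slippery-ice case can be encoded as an explicit transition table. The cell coordinates below are made up; the probabilities are the ones from the text:

```python
# P(s' | s=(2, 2), a="right") on slippery ice, as a plain dictionary.
state, action = (2, 2), "right"
transitions = {
    (2, 3): 0.8,   # intended outcome: move right
    (1, 2): 0.1,   # slip: move up instead
    (3, 2): 0.1,   # slip: move down instead
}
assert abs(sum(transitions.values()) - 1.0) < 1e-9   # probabilities sum to 1
```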
Solving an MDP
The goal is to find an optimal policy $\pi^*$ that maximises expected return. Common approaches include (value iteration is sketched after the list):
- Value Iteration — iteratively compute optimal state values
- Policy Iteration — iteratively improve the policy
- Q-Learning — learn action-values through experience
- Policy Gradient — directly optimise the policy using gradients
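As one concrete illustration, value iteration repeatedly applies the Bellman optimality backup until the state values stop changing, then reads off a greedy policy. The sketch below runs on the same toy two-state MDP dictionaries as earlier (all numbers invented):

```python
# Value iteration on a toy two-state MDP (rewards depend only on (s, a) here).
P = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 0.9, "s0": 0.1}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.9, "s1": 0.1}},
}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 0.0, "go": 2.0}}
gamma, theta = 0.95, 1e-6          # discount factor and convergence threshold

V = {s: 0.0 for s in P}            # initialise all state values to zero
while True:
    delta = 0.0
    for s in P:
        # Bellman optimality backup:
        # V(s) <- max_a [ R(s, a) + gamma * sum_{s'} P(s'|s, a) * V(s') ]
        new_v = max(
            R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
            for a in P[s]
        )
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < theta:              # stop once no value changes by more than theta
        break

# Greedy policy with respect to the converged values
policy = {
    s: max(P[s], key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
    for s in P
}
print(V, policy)
```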
How It All Fits Together
```mermaid
graph LR
    A[Agent] -->|action a| E[Environment]
    E -->|state s| A
    E -->|reward r| A
```
- The agent observes the current state $s$
- Based on its policy $\pi$, it selects an action $a$
- The environment transitions to a new state $s'$
- The agent receives a reward $r$
- The agent updates its policy to maximise future returns (discounted by $\gamma$)
- Repeat
This cycle continues as the agent learns which actions lead to higher cumulative rewards.
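The loop maps directly onto code. The sketch below uses a stub environment and a random policy so that only the structure of the interaction matters; `StubEnv`, its dynamics, and the action names are all invented for the example:

```python
import random

class StubEnv:
    """Placeholder environment with made-up dynamics, just to show the loop."""
    def reset(self):
        self.t = 0
        return 0                                   # initial state s

    def step(self, action):
        self.t += 1
        next_state = self.t                        # toy transition
        reward = 1.0 if action == "good" else 0.0  # toy reward
        done = self.t >= 10                        # episode ends after 10 steps
        return next_state, reward, done

env = StubEnv()
gamma = 0.99
state = env.reset()                               # observe the current state s
ret, discount, done = 0.0, 1.0, False

while not done:
    action = random.choice(["good", "bad"])       # policy pi picks an action a
    next_state, reward, done = env.step(action)   # environment returns s' and r
    ret += discount * reward                      # accumulate the discounted return
    discount *= gamma
    state = next_state                            # a learning agent would update pi here

print("episode return:", ret)
```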