ReLU

ReLU (Rectified Linear Unit) is the most popular activation function in neural networks. It’s remarkably simple: if the input is positive, keep it; if negative, make it zero.

The Formula

$$\text{ReLU}(x) = \max(0, x)$$

That’s it. The simplest activation function that actually works well.
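
In code it is a one-liner. A minimal NumPy sketch (the function name and sample inputs are just for illustration):

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x), applied element-wise
    return np.maximum(0, x)

print(relu(np.array([-5, -1, 0, 1, 5, 100])))  # [  0   0   0   1   5 100]
```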

How It Works

| Input | Output |
|-------|--------|
| -5    | 0      |
| -1    | 0      |
| 0     | 0      |
| 1     | 1      |
| 5     | 5      |
| 100   | 100    |

Negative → Zero. Positive → Unchanged.

The shape is like a hockey stick lying flat: horizontal at zero for all negative inputs, then rising diagonally for positive inputs.

Why ReLU?

Before ReLU, neural networks used sigmoid or tanh activation functions. These caused a serious problem called the vanishing gradient.

The Vanishing Gradient Problem

When training deep networks with sigmoid/tanh:

  • Gradients get multiplied together through layers
  • Since the sigmoid’s derivative is at most 0.25, each multiplication shrinks the gradient
  • By the time you reach early layers, gradients are nearly zero
  • Early layers stop learning → network doesn’t improve

How ReLU Fixes It

  • For positive inputs, the gradient is always 1 (not a tiny fraction)
  • Gradients don’t shrink as they flow backward
  • Deep networks can actually learn (see the sketch below)
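
A back-of-envelope comparison, assuming 20 layers and taking the best-case local gradient of each activation (the sigmoid derivative peaks at 0.25, ReLU’s is 1 for positive inputs). Weight matrices, which also scale the gradient, are ignored here:

```python
# Best-case gradient surviving 20 layers of backprop (activation derivatives only).
layers = 20
sigmoid_path = 0.25 ** layers   # ≈ 9.1e-13: vanishingly small by the early layers
relu_path = 1.0 ** layers       # 1.0: the gradient passes through unchanged
print(sigmoid_path, relu_path)
```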

ReLU vs Sigmoid vs Tanh

| Property | ReLU | Sigmoid | Tanh |
|----------|------|---------|------|
| Formula | max(0, x) | 1/(1+e⁻ˣ) | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) |
| Output range | [0, ∞) | (0, 1) | (-1, 1) |
| Gradient (max) | 1 | 0.25 | 1 |
| Vanishing gradient? | Less prone | Yes | Yes |
| Computation | Very fast | Slower | Slower |

Benefits of ReLU

  1. Simple — just a max operation, very fast to compute
  2. Sparse activation — many neurons output zero, which is efficient (see the sketch after this list)
  3. No vanishing gradient — gradient is 1 for positive values
  4. Biologically plausible — neurons either fire or don’t
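
A quick sketch of the sparsity point: with zero-mean pre-activations (a common situation after normalisation), roughly half the units land below zero and are silenced.

```python
import numpy as np

rng = np.random.default_rng(0)
pre_activations = rng.standard_normal(10_000)   # zero-mean inputs to a layer of neurons
activations = np.maximum(0, pre_activations)    # apply ReLU
print((activations == 0).mean())                # ≈ 0.5: about half the outputs are exactly zero
```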

The Dying ReLU Problem

ReLU has one weakness: neurons can “die.”

If a neuron always receives negative input, it always outputs zero. Since ReLU’s gradient is zero for negative inputs, no gradient flows back through the neuron and its weights never update. The neuron is stuck: permanently dead (a small numeric sketch follows the list below).

This can happen when:

  • Learning rate is too high
  • Bad weight initialisation
  • Unlucky data distribution
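
A minimal numeric sketch of a dead neuron (NumPy, with the backward pass written out by hand; the weights, bias, and inputs are made up to force the pre-activation negative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((100, 3))   # a batch of inputs
w = np.array([-5.0, -5.0, -5.0])    # weights after, say, one oversized update
b = -100.0                          # bias driven far negative

z = x @ w + b                       # pre-activation: negative for every sample
a = np.maximum(0, z)                # ReLU output: all zeros

# Backprop through ReLU: its derivative is 1 where z > 0, else 0.
upstream = np.ones_like(z)          # pretend the gradient arriving from the next layer is 1
grad_w = x.T @ (upstream * (z > 0)) # dReLU/dz is 0 everywhere here, so grad_w is 0
print(a.max(), grad_w)              # 0.0 [0. 0. 0.]  -> no output, no gradient, no recovery
```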

ReLU Variants

Several variants fix the dying ReLU problem by allowing small negative values:

Leaky ReLU

Instead of zero for negatives, use a small slope (typically 0.01):

$$\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ 0.01x & \text{if } x \leq 0 \end{cases}$$
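
A minimal NumPy sketch (0.01 is the conventional default slope; the function name is just illustrative):

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # x for positives, a small fraction of x for negatives
    return np.where(x > 0, x, negative_slope * x)

print(leaky_relu(np.array([-5.0, 0.0, 5.0])))  # [-0.05  0.    5.  ]
```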

Parametric ReLU (PReLU)

Like Leaky ReLU, but the slope is learned during training:

$$\text{PReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}$$
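
The forward pass has the same shape as Leaky ReLU; the difference is that α gets a gradient of its own and is updated along with the weights. A NumPy sketch with that gradient written out by hand (in practice, frameworks such as PyTorch provide this directly as nn.PReLU):

```python
import numpy as np

def prelu(x, alpha):
    return np.where(x > 0, x, alpha * x)

def prelu_grad_alpha(x):
    # d PReLU / d alpha: 0 for positive x, x for negative x.
    # This is the gradient that lets the slope be learned.
    return np.where(x > 0, 0.0, x)

x = np.array([-2.0, 3.0])
print(prelu(x, alpha=0.25))      # [-0.5  3. ]
print(prelu_grad_alpha(x))       # [-2.  0.]
```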

ELU (Exponential Linear Unit)

Smoothly curves toward a negative value instead of a sharp corner:

$$\text{ELU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases}$$
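
A NumPy sketch with the common default α = 1 (note that outputs approach -α for very negative inputs instead of being clipped to zero):

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for positives; a smooth exponential curve toward -alpha for negatives
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

print(elu(np.array([-5.0, -1.0, 0.0, 2.0])))  # ≈ [-0.993 -0.632  0.     2.   ]
```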

GELU (Gaussian Error Linear Unit)

Used in transformers (like GPT). Smoothly gates values based on how positive they are:

$$\text{GELU}(x) = x \cdot \Phi(x)$$

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution.
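
A sketch using SciPy’s standard normal CDF for $\Phi$ (many implementations instead use a tanh-based approximation, omitted here):

```python
import numpy as np
from scipy.stats import norm

def gelu(x):
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF
    return x * norm.cdf(x)

print(gelu(np.array([-2.0, 0.0, 2.0])))  # ≈ [-0.0455  0.      1.9545]
```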

Quick Comparison

| Variant | Negative Side | Use Case |
|---------|---------------|----------|
| ReLU | Zero | Default choice, CNNs |
| Leaky ReLU | Small slope (0.01x) | When ReLUs are dying |
| PReLU | Learned slope | When you have lots of data |
| ELU | Smooth curve | Faster convergence |
| GELU | Smooth gate | Transformers, NLP |

When to Use ReLU

  • Default choice for hidden layers in most networks
  • Especially good for convolutional neural networks (CNNs)
  • Use variants (Leaky, ELU) if you notice dying neurons

When NOT to Use ReLU

  • Output layer for classification → use Softmax instead (see the sketch after this list)
  • Output layer for regression → use linear (no activation)
  • Recurrent networks (RNNs) → tanh often works better
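
Putting the two lists above together, here is a minimal sketch of where ReLU typically sits, assuming PyTorch (the layer sizes are arbitrary):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),            # hidden layers: ReLU
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),   # output layer: raw logits, no ReLU
)

# Classification: softmax is usually folded into the loss rather than the model.
loss_fn = nn.CrossEntropyLoss()   # applies log-softmax to the logits internally
# Regression: the final Linear output is used as-is (no activation).
```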

Key Takeaways

  1. ReLU = max(0, x) — dead simple
  2. It largely alleviated the vanishing gradient problem that plagued early deep learning
  3. Fast to compute, works well in practice
  4. Watch out for dying neurons — use Leaky ReLU if needed
  5. Still the default activation function for most neural networks
