
The activation function

In the last lesson a neuron produced a raw number $z$ that could be any size. On its own that number is just a straight-line combination of the inputs. To build something that can bend and curve, each neuron passes $z$ through an activation function, a fixed nonlinear function applied to the output.

Without a nonlinear activation, a deep stack of neurons collapses into a single linear function, no matter how many layers you add.

Squashing the sum

The activation function takes the neuron's raw output $z$ and returns the neuron's final output $a$:

$$a = \sigma(z)$$

Here $z$ is the weighted sum from the previous lesson, $\sigma$ is the chosen activation function, and $a$ is the activated output that the neuron actually passes on. Two common choices:

$$\text{ReLU}(z) = \max(0, z)$$

ReLU returns $z$ when $z$ is positive and returns $0$ otherwise. It keeps positive signals and discards negative ones.

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This is the sigmoid. Here $e \approx 2.718$ is the base of the natural exponential, and $z$ is the raw input. The output is always between $0$ and $1$, so it never blows up no matter how large $z$ becomes.
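To make the two definitions concrete, here is a minimal sketch that evaluates both functions on a handful of inputs. The helper names `relu` and `sigmoid` and the sample values are chosen for illustration and are not part of the lesson.

```python
import math


def relu(z):
    """Return z for positive inputs and 0.0 otherwise."""
    return max(0.0, z)


def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative sample inputs (an assumption, not values from the lesson)
samples = [-10.0, -1.0, 0.0, 1.0, 10.0]

for z in samples:
    print(f"z = {z:6.1f}   relu = {relu(z):5.1f}   sigmoid = {sigmoid(z):.5f}")
```

In the printed table, ReLU grows without limit on the positive side, while the sigmoid values approach 0 and 1 at the extremes but stay inside that range.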

The same neuron, now with its output z passing through an activation function to produce the final output a.

A worked example

Take the value from the previous lesson, $z = -1$, and apply each function:

$$\text{ReLU}(-1) = \max(0, -1) = 0$$

$$\sigma(-1) = \frac{1}{1 + e^{1}} \approx \frac{1}{3.718} \approx 0.269$$

In code:

import math

# the neuron's raw weighted sum from the previous lesson
z = -1.0

# ReLU keeps positive values and replaces negatives with zero
relu = max(0.0, z)

# sigmoid squashes any number into the range between 0 and 1
sigmoid = 1 / (1 + math.exp(-z))

print(relu)     # 0.0
print(sigmoid)  # 0.2689414213699951
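The snippet above starts from a ready-made $z$. As a rough sketch of the whole neuron, the version below computes the weighted sum first and then applies the activation. The `neuron` helper, the inputs, and the weights are hypothetical: they are picked only so that the sum comes out to $-1$, and they are not the previous lesson's actual values.

```python
import math


def relu(z):
    """Return z for positive inputs and 0.0 otherwise."""
    return max(0.0, z)


def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, activation):
    """Hypothetical helper: weighted sum z, then a = activation(z)."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return activation(z)


# Hypothetical values chosen so that z = 1.0*1.0 + (-1.0)*2.0 = -1.0
inputs = [1.0, 2.0]
weights = [1.0, -1.0]

print(neuron(inputs, weights, relu))     # 0.0
print(neuron(inputs, weights, sigmoid))  # about 0.269
```

Passing the activation in as an argument keeps the weighted sum and the nonlinearity separate, which mirrors the two-step picture of $z$ followed by $a$.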

Why the nonlinearity matters

Suppose neurons had no activation, so each layer just computed a weighted sum. Stacking two such layers would give $\mathbf{W}_2(\mathbf{W}_1 \mathbf{x}) = (\mathbf{W}_2 \mathbf{W}_1)\mathbf{x}$, where $\mathbf{W}_1$ and $\mathbf{W}_2$ are the two layers' weights. The product $\mathbf{W}_2 \mathbf{W}_1$ is itself just one matrix, so the whole stack reduces to a single weighted sum. Adding more layers changes nothing about what the stack can compute. The nonlinear $\sigma$ placed between layers is exactly what breaks this collapse and lets depth add real power.
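The collapse can be checked numerically. The sketch below uses plain Python lists and small made-up weight matrices (the values and helper names are assumptions for illustration) to compare two stacked linear layers, the single merged layer, and the same stack with ReLU in between.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B (both given as lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu_vec(v):
    """Apply ReLU to every entry of a vector."""
    return [max(0.0, vi) for vi in v]


# Made-up weights and input, for illustration only
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

two_layers = matvec(W2, matvec(W1, x))           # layer 1, then layer 2
one_layer = matvec(matmul(W2, W1), x)            # single merged matrix
with_relu = matvec(W2, relu_vec(matvec(W1, x)))  # ReLU between the layers

print(two_layers)  # [-3.5, 10.5]
print(one_layer)   # [-3.5, 10.5]
print(with_relu)   # [2.5, 7.5]
```

The first two results match, so the two linear layers behave exactly like one. With ReLU between them the result differs, and the merged matrix `matmul(W2, W1)` no longer reproduces the stack's output.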

In the next lesson, we will line up many of these neurons side by side into a single layer and write the whole layer as one compact equation.