Neural networks are mathematical functions inspired loosely by biological neurons. They are the foundation of deep learning and modern AI. This lesson explains how they work and the major architectures.
The Artificial Neuron
A neuron computes a weighted sum of its inputs, adds a bias, and applies a non-linear activation function:
output = activation(w1·x1 + w2·x2 + ... + wn·xn + b)
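The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the function names `neuron` and `relu` and the example numbers are assumptions chosen for this sketch, and ReLU (max(0, z)) is used as the activation only because it is the simplest one to write.

```python
def relu(z):
    """ReLU activation: returns z when positive, otherwise 0."""
    return max(0.0, z)


def neuron(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Hypothetical example values, chosen only for illustration.
out = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.5)
print(out)  # 0.5*1.0 + (-0.25)*2.0 + 0.5 = 0.5, and relu(0.5) = 0.5
```

Without the activation, stacking neurons would only ever produce another linear function of the inputs; the non-linearity is what lets a network represent more complex relationships.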
Common activation functions:
- ReLU (Rectified Linear Unit): max(0, x). The default in modern networks — fast and effective.
- Sigmoid: Squashes output to (0, 1). Used for binary classification output layers.
- Softmax: Converts a vector into a probability distribution. Used for multi-class classification output layers.
- Tanh: Squashes to (−1, 1). Less common in modern networks.
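The four activations listed above might be written as follows, using only Python's `math` module. This is a sketch for building intuition; the function names are this example's own, and real frameworks ship optimised, vectorised versions.

```python
import math


def relu(x):
    """max(0, x): passes positive values through, zeroes out negatives."""
    return max(0.0, x)


def sigmoid(x):
    """Squashes x into (0, 1). Branching on the sign avoids overflow in exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x):
    """Squashes x into (-1, 1)."""
    return math.tanh(x)


def softmax(values):
    """Turns a list of scores into probabilities that sum to 1."""
    m = max(values)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


print(relu(-3.0), sigmoid(0.0), tanh(0.0))
print(softmax([1.0, 2.0, 3.0]))  # largest score gets the largest probability
```

Note that ReLU, sigmoid and tanh act on one number at a time, while softmax needs the whole output vector because each probability depends on all the scores.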
Layers and Networks
A layer is a collection of neurons all receiving the same inputs. A fully connected (dense) network stacks layers where every neuron in one layer connects to every neuron in the next:
Input layer → Hidden layer 1 (e.g., 128 neurons) → Hidden layer 2 (e.g., 64 neurons) → ... → Output layer (e.g., 10 neurons for 10-class)
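A forward pass through such a stack can be sketched in plain Python. The 128, 64 and 10 layer sizes come from the example above; the input size of 784 (a flattened 28×28 image), the random initialisation range, and all function names are assumptions made for this sketch.

```python
import random


def init_layer(n_in, n_out, rng):
    """Small random weights (one row per neuron) and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense_layer(inputs, weights, biases, use_relu=True):
    """Every neuron in the layer sees the same inputs."""
    outputs = []
    for row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, inputs)) + b
        outputs.append(max(0.0, z) if use_relu else z)
    return outputs


def forward(x, layers):
    """Feed x through each layer in turn; the output layer is left linear."""
    for i, (weights, biases) in enumerate(layers):
        is_output = i == len(layers) - 1
        x = dense_layer(x, weights, biases, use_relu=not is_output)
    return x


def count_parameters(layer_sizes):
    """Each layer has n_in * n_out weights plus n_out biases."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


rng = random.Random(0)
layer_sizes = [784, 128, 64, 10]  # 784 inputs is an assumed input size
layers = [init_layer(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

scores = forward([rng.random() for _ in range(784)], layers)
print(len(scores))                    # 10 raw scores, one per class
print(count_parameters(layer_sizes))  # 100480 + 8256 + 650 = 109386
```

Even this small network has over a hundred thousand parameters, almost all of them in the first layer, where the input is widest.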
"Deep" learning means networks with many hidden layers — modern image and language models can have hundreds of layers and billions of parameters.
How Networks Learn: Backpropagation
Training has two passes:
- Forward pass: Push inputs through the network, get a prediction. Compare to the true label using a loss function (cross-entropy for classification, mean squared error for regression).
- Backward pass (backpropagation): Compute the gradient of the loss with respect to each weight using the chain rule of calculus. Update each weight by a small step in the direction that reduces the loss: w := w − learning_rate × ∂loss/∂w.
This is gradient descent. Repeat over millions of examples for many epochs (passes through the dataset). Modern training uses variants like Adam that adapt the learning rate per parameter.
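As a concrete illustration of the update rule w := w − learning_rate × ∂loss/∂w, the sketch below trains a single linear neuron (no activation) with mean squared error. The data, learning rate and epoch count are hypothetical values chosen so the example converges; it uses plain full-batch gradient descent rather than Adam, and the gradients are derived by hand with the chain rule instead of by an automatic differentiation library.

```python
def train(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit pred = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w, grad_b = 0.0, 0.0
        for x, y in zip(xs, ys):
            pred = w * x + b   # forward pass
            error = pred - y   # loss for this example is error ** 2
            # Backward pass via the chain rule:
            #   d(error**2)/dw = 2 * error * x,   d(error**2)/db = 2 * error
            grad_w += 2 * error * x / n
            grad_b += 2 * error / n
        # Step against the gradient to reduce the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = train(xs, ys)
print(round(w, 4), round(b, 4))  # approaches 2.0 and 1.0
```

In a multi-layer network the same idea applies, except that the chain rule is applied layer by layer from the output back to the input, which is where the name backpropagation comes from.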
Convolutional Neural Networks (CNNs)
For images, fully connected networks would have absurd numbers of parameters (a 224×224 colour image has 150,528 input values: 224 × 224 pixels × 3 colour channels). CNNs address this with two kinds of layer:
- Convolutional layers apply small filters (e.g., 3×3 patches) that slide across the image, sharing weights across spatial locations.
- Pooling layers downsample the spatial dimensions, reducing computation and the number of parameters in later layers, and providing a degree of translation invariance.
The result: a network that learns hierarchical features (edges → shapes → object parts → objects). CNNs dominated computer vision from 2012 to roughly 2020 — landmark architectures include AlexNet, VGG, ResNet, EfficientNet. Vision transformers (ViT) are now competitive or superior for many tasks.
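The two building blocks can be sketched in plain Python on a tiny single-channel image. Everything here is illustrative: the function names, the hand-picked vertical-edge filter (real CNNs learn their filters during training), and the layer sizes in the parameter comparison (1,000 dense units, 64 filters) are assumptions for this example.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    The same kernel weights are reused at every position.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [[max(feature_map[i * size + di][j * size + dj]
                 for di in range(size) for dj in range(size))
             for j in range(out_w)]
            for i in range(out_h)]


# A 6x6 image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

# Hand-picked filter that responds to dark-to-bright vertical edges.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

features = conv2d(image, kernel)  # 4x4 map; each row is [0, 3, 3, 0]
pooled = max_pool2d(features)     # 2x2 map: [[3, 3], [3, 3]]
print(features)
print(pooled)

# Parameter comparison for a 224x224x3 input (layer sizes are assumptions).
dense_params = 224 * 224 * 3 * 1000 + 1000  # one dense layer, 1,000 units
conv_params = 64 * (3 * 3 * 3) + 64         # 64 filters of size 3x3x3
print(dense_params, conv_params)            # 150529000 vs 1792
```

The filter fires only where the brightness changes, and it does so with nine shared weights regardless of how large the image is.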
Recurrent Neural Networks (RNNs) and LSTMs
For sequential data (text, audio, time series), RNNs maintain a hidden state that carries information from one time step to the next. LSTMs (Long Short-Term Memory) and GRUs are RNN variants that handle long-range dependencies better.
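The hidden-state idea can be sketched with a plain (vanilla) RNN cell, which updates its state as h_t = tanh(W_xh·x_t + W_hh·h_{t−1} + b). This is not an LSTM or GRU, which add gating on top of the same loop; the function names and the small weight values below are hypothetical and chosen only for illustration.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, b_h):
    """One time step: combine the current input with the previous hidden state."""
    h_new = []
    for i in range(len(h_prev)):
        z = b_h[i]
        z += sum(w_xh[i][j] * x[j] for j in range(len(x)))
        z += sum(w_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(z))
    return h_new


def run_rnn(sequence, w_xh, w_hh, b_h):
    """Process the sequence one step at a time, carrying the state forward."""
    h = [0.0] * len(b_h)  # initial hidden state
    for x in sequence:
        h = rnn_step(x, h, w_xh, w_hh, b_h)
    return h


# Hypothetical weights: 1 input feature, 2 hidden units.
w_xh = [[0.5], [-0.5]]
w_hh = [[0.1, 0.2], [0.3, 0.4]]
b_h = [0.0, 0.0]

h_a = run_rnn([[1.0], [0.0]], w_xh, w_hh, b_h)
h_b = run_rnn([[0.0], [1.0]], w_xh, w_hh, b_h)
print(h_a)
print(h_b)  # same inputs in a different order give a different final state
```

Because each step needs the hidden state from the step before it, the loop in `run_rnn` cannot be run in parallel across time steps.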
RNNs were dominant for natural language processing until 2017, but they have a critical limitation: they process a sequence one step at a time, which makes them slow to train and limits how much context they can use in practice. Transformers replaced them.
Transformers
Transformers are the architecture that powers GPT, Claude, Gemini, and essentially every modern LLM. They replace recurrence with self-attention: each token in the sequence attends directly to every other token, in parallel.
The next lesson covers transformers in depth. For now, the key insight: transformers are massively parallelisable, scale beautifully with compute and data, and have replaced RNNs not just for language but increasingly for vision, audio, and even tabular tasks.
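As a preview, self-attention can be sketched in plain Python as scaled dot-product attention. This is a deliberately simplified version under stated assumptions: it uses the token vectors directly as queries, keys and values (real transformers apply learned projections first), has a single head, and applies no causal mask (GPT-style models mask out later tokens). The function names and the token vectors are hypothetical.

```python
import math


def softmax(values):
    """Turns a list of scores into weights that sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Each token's output is a weighted average of all token vectors."""
    d = len(tokens[0])
    outputs, all_weights = [], []
    for q in tokens:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three hypothetical 2-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])  # each row sums to 1
```

The loop over `q` is written sequentially here, but no iteration depends on another, which is why the whole computation can be expressed as a few matrix multiplications and run in parallel.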
Why GPUs?
Neural network training is dominated by matrix multiplications — exactly what GPUs were designed for. A high-end GPU performs tens of thousands of multiplications in parallel. Specialised AI chips (NVIDIA H100/B200, Google TPU, AWS Trainium) push this further with custom silicon for tensor operations.
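To see where the multiplications come from, consider pushing a batch of inputs through one dense layer: it is a single matrix product. The sketch below uses a naive pure-Python `matmul` (a name chosen for this example) and counts the scalar multiplications involved; the batch size of 32 is an assumption, while 128 and 64 echo the layer sizes used earlier in the lesson.

```python
def matmul(a, b):
    """Naive product of an n x k matrix and a k x m matrix."""
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions must match")
    # Every one of the n * m output entries is an independent dot product,
    # which is what makes the work easy to spread across many GPU cores.
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def multiply_count(n, k, m):
    """Scalar multiplications in a naive n x k by k x m product."""
    return n * k * m


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]

# Batch of 32 inputs (assumed) through a 128 -> 64 dense layer.
print(multiply_count(32, 128, 64))  # 262144
```

A CPU-bound loop like this computes the entries one after another; a GPU computes many of them at the same time.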
This is why training a frontier LLM costs tens of millions of dollars and consumes the energy of a small town — and why the AI boom has driven NVIDIA to a multi-trillion-dollar valuation.
Frameworks
| Framework | Maintainer | Strengths |
|---|---|---|
| PyTorch | Meta | Dominant in research; flexible; eager execution; the de facto standard in 2025 |
| TensorFlow / Keras | Google | Strong production tooling; TensorFlow Lite for mobile; declining research share |
| JAX | Google | Functional, composable; powers DeepMind research; tricky learning curve |