Transformers and Large Language Models — AI and ML Fundamentals | CertQnA

In June 2017, eight Google researchers published "Attention Is All You Need", the paper that introduced the transformer architecture. Within about five years, transformers had displaced recurrent models as the dominant approach to natural language processing and become the foundation of the current AI boom.

Why Transformers Won

Earlier sequence models (RNNs, LSTMs) processed text one word at a time, carrying information forward in a hidden state. This had two problems:

Slow training: You cannot parallelise across the sequence — each step depends on the previous one.
Long-range dependencies: Information from the start of a long passage gets diluted by the time you reach the end.

Transformers solve both with self-attention: every token in the input directly attends to every other token, in parallel.

Self-Attention in Plain English

Consider the sentence: "The cat sat on the mat because it was tired." What does "it" refer to?

Self-attention computes, for each word, a weighted blend of all the words in the sentence (itself included), where the weights reflect how relevant each word is. For the word "it", a trained model would ideally weight "cat" strongly (cats get tired) and "mat" weakly (mats do not). The weights are not stored per sentence: they are computed on the fly from projection matrices whose parameters are learned from billions of training examples.

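Mathematically, self-attention projects each token's representation into a Query, a Key, and a Value vector, then computes attention as softmax(Q · K^T / sqrt(d)) · V, where d is the dimension of the key vectors. Dividing by sqrt(d) keeps the dot products from growing with the dimension and saturating the softmax. The full architecture stacks many attention and feed-forward layers, typically from 12 to more than 100 in modern LLMs.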
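
The formula above can be sketched in a few lines of plain Python. This is a minimal, single-head illustration with no masking, not how any production model is implemented. The Q, K, and V vectors are hand-picked toy values; in a real transformer they would come from learned projection matrices applied to each token's representation.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Q, K, V are lists of vectors, one per token. Returns the output
    vectors and the attention weights (one row per query token).
    """
    d = len(K[0])  # dimension of the key vectors
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        # Weighted blend of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Three toy tokens with hand-picked 2-dimensional vectors (illustrative only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = self_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token in the sequence. Every row sums to 1, and each output vector is the corresponding blend of the rows of V.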

Tokens, Not Words

LLMs do not see whole words. They see tokens, which are sub-word units. A tokenizer, typically built with an algorithm such as byte-pair encoding (BPE) and often implemented with a library such as SentencePiece, breaks "unbelievable" into something like ["un", "believ", "able"]. As a rule of thumb in English: 1 token ≈ 4 characters ≈ 0.75 words.
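
To make sub-word tokenization concrete, here is a toy sketch of the core idea behind BPE: start from single characters and repeatedly merge the most frequent adjacent pair. The function names and the tiny corpus are made up for illustration. Real tokenizers differ in important ways (they usually work on bytes, pre-split the text, and learn tens of thousands of merges), so treat this as a sketch of the principle only.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    # Each word starts as a tuple of single characters, with its frequency.
    words = {tuple(w): f for w, f in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

# Tiny illustrative corpus; the learned merges depend entirely on it.
learned, vocab = train_bpe("unbelievable believable believer unable able", 6)
print(learned)
print(list(vocab))
```

Frequent character sequences end up as single tokens while rare words stay split into smaller pieces, which is why token counts do not line up one-to-one with word counts.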

Token-based pricing is now ubiquitous: OpenAI, Anthropic, and Google all charge by the token, with separate rates for input and output tokens that are usually quoted per million tokens. Understanding token counts is essential to controlling costs and staying within context limits.
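
As a rough illustration of how token counts turn into cost, the sketch below multiplies token counts by per-token rates. The function name is hypothetical, and the prices are placeholders expressed per million tokens, not any provider's actual rates. Check the provider's current price list before relying on numbers like these.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_million, output_price_per_million):
    """Estimate the cost of one request from token counts and per-million prices."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000

# Hypothetical prices in USD per million tokens (placeholders only).
cost = estimate_cost_usd(
    input_tokens=2_000,
    output_tokens=500,
    input_price_per_million=3.00,
    output_price_per_million=15.00,
)
print(f"Estimated cost per request: ${cost:.4f}")
```

Output tokens are often priced higher than input tokens, so a short prompt with a long response can cost more than the reverse.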

The Context Window

The context window is the maximum number of tokens the model can attend to at once — both your prompt and its response combined. Modern frontier models have very large windows:

Model (2024–2025)	Approximate context window
GPT-4o	128,000 tokens
Claude Sonnet	200,000 tokens
Gemini 1.5 Pro	1–2 million tokens

A 1-million-token window can hold roughly an entire mid-sized codebase or a 1,500-page book. But longer contexts are slower and more expensive to process, and accuracy can degrade for information placed in the middle of a very long context (the "lost in the middle" effect).
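
The book estimate follows from the 0.75 words-per-token rule of thumb if you assume about 500 words per page (that page density is an assumption of this sketch). The second helper, also hypothetical, checks whether a prompt plus the room reserved for the response fits in a given window.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English text
WORDS_PER_PAGE = 500    # assumption: a fairly dense printed page

def pages_that_fit(context_tokens):
    """Approximate number of book pages that fit in a context window."""
    return context_tokens * WORDS_PER_TOKEN / WORDS_PER_PAGE

def fits_in_context(prompt_tokens, max_output_tokens, context_window):
    """The prompt and the response share the same window."""
    return prompt_tokens + max_output_tokens <= context_window

print(pages_that_fit(1_000_000))                 # about 1,500 pages
print(fits_in_context(120_000, 4_000, 128_000))  # 124,000 <= 128,000
print(fits_in_context(126_000, 4_000, 128_000))  # 130,000 > 128,000
```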

How LLMs Are Trained

Training an LLM happens in three stages:

1. Pretraining

Train a transformer on enormous amounts of text, much of it from the internet (Common Crawl, books, code, Wikipedia), using next-token prediction. The model picks up grammar, facts, reasoning patterns, and code syntax, all from predicting the next token billions of times. This is the most expensive stage: GPT-4 reportedly cost more than $100 million to train.
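
Next-token prediction is usually trained with a cross-entropy loss: the model produces a score (logit) for every token in its vocabulary, and the loss is the negative log of the probability it assigned to the token that actually came next. The sketch below shows that calculation for a single position. The four-word vocabulary and the logits are invented for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy for one position: -log P(true next token)."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical scores a model might produce after the prefix "the cat".
vocab = ["the", "cat", "sat", "mat"]
logits = [0.1, 0.2, 2.5, 0.3]

loss_if_sat = next_token_loss(logits, vocab.index("sat"))
loss_if_mat = next_token_loss(logits, vocab.index("mat"))
print(round(loss_if_sat, 3), round(loss_if_mat, 3))
```

The loss is small when the true next token was given high probability and large otherwise. Pretraining adjusts the model's parameters to lower this loss, averaged over every position in the training text.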

2. Supervised Fine-Tuning (SFT)

Train the pretrained model on curated examples of prompts paired with high-quality, human-written responses. This teaches the model to follow instructions and be helpful, rather than just continue text.

3. Reinforcement Learning from Human Feedback (RLHF)

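Show humans pairs of model outputs and ask which is better. Train a reward model on these preferences. Then use reinforcement learning to fine-tune the LLM to produce responses the reward model rates highly. This stage does much of the work of making ChatGPT-style assistants more helpful, harmless, and honest. Related approaches include Direct Preference Optimization (DPO), which trains on the preference pairs directly without a separate reward model, and Constitutional AI (Anthropic's approach), which replaces much of the human feedback with AI feedback guided by a written set of principles.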
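
A common way to train the reward model on pairwise preferences is a Bradley-Terry style loss, which pushes the score of the preferred response above the score of the rejected one. The sketch below shows only that loss for one pair. The reward values are made-up scalars standing in for a reward model's outputs, and the reinforcement learning step itself is not shown.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Written as log(1 + exp(-diff)), which is the same quantity.
    """
    diff = reward_chosen - reward_rejected
    return math.log1p(math.exp(-diff))

# Hypothetical reward scores for a (preferred, rejected) pair of responses.
agrees_with_humans = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees_with_humans = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(round(agrees_with_humans, 3), round(disagrees_with_humans, 3))
```

The loss is low when the reward model already ranks the human-preferred response higher, and high when it ranks the pair the wrong way round.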

Foundation Models and Open Weights

"Foundation model" is the term coined by Stanford for large pretrained models that can be adapted to many downstream tasks. The major players in 2025:

Closed (API only): OpenAI GPT family, Anthropic Claude, Google Gemini
Open weights: Meta Llama, Mistral, Alibaba Qwen, DeepSeek — you can download the weights and run them yourself

Open weights enable on-premises deployment, custom fine-tuning, and lower per-token costs at scale, but they require GPU infrastructure. Many production teams take a hybrid approach: a closed API for frontier capability, open weights for cost-sensitive bulk workloads.

Multimodal Models

Modern frontier models are no longer text-only. They accept and produce a mix of text, images, audio, and video:

GPT-4o and Gemini natively process images and audio
Vision-language models can describe screenshots, diagrams, or photos
All of these use the same transformer architecture, with a different tokenizer or encoder for each modality
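
One common way to turn an image into a token sequence, used by Vision Transformer style models, is to cut it into fixed-size patches and treat each flattened patch as one token. The sketch below shows only that patching step on a tiny grid of numbers. It is an illustration of the general idea, not a description of how GPT-4o or Gemini handle images, and the function name is hypothetical.

```python
def image_to_patches(image, patch_size):
    """Split a 2-D grid of pixel values into flattened square patches.

    Patches are returned in row-major order (left to right, top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# A 4x4 "image" whose pixel values are just 0..15, for readability.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, 2)
for p in patches:
    print(p)
```

In a real model, each flattened patch would then be projected into an embedding vector, and the sequence of embeddings is fed to the transformer in the same way as a sequence of text tokens.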