You don't need to be able to derive backpropagation to build with LLMs — but you do need a working mental model. This lesson teaches just enough about how LLMs work that the rest of the course makes sense.
The Core Loop
Every LLM does one thing: given a sequence of tokens, predict a probability distribution over the next token. Generation works by sampling from that distribution repeatedly:
input → P(next token) → sample → append → P(next token) → sample → … → stop
That's it. Everything else — chat, code completion, function calling — is sugar on this single primitive.
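The loop above can be sketched in a few lines of Python. This is a toy illustration only: `TOY_MODEL`, `next_token_distribution`, and `generate` are hypothetical names, and the lookup table stands in for a real model, which would condition on the whole sequence rather than just the last token.

```python
import random

# Hypothetical stand-in for a model: last token -> distribution over next tokens.
TOY_MODEL = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "<stop>": 0.3},
    "dog": {"sat": 0.7, "<stop>": 0.3},
    "sat": {"<stop>": 1.0},
}


def next_token_distribution(tokens):
    """Return P(next token | tokens). The toy version only looks at the last token."""
    return TOY_MODEL[tokens[-1]]


def generate(prompt_tokens, max_tokens=10, seed=None):
    """Sample one token at a time and append it until <stop> or the token cap."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        dist = next_token_distribution(tokens)
        choices, weights = zip(*dist.items())
        token = rng.choices(choices, weights=weights, k=1)[0]
        if token == "<stop>":
            break
        tokens.append(token)
    return tokens


if __name__ == "__main__":
    print(generate(["<start>"], seed=0))
```

Swapping the lookup table for a neural network changes how the distribution is computed, not the shape of the loop.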
What's a Token?
Tokens are sub-word pieces. The phrase "ServiceNow rocks" tokenises (in the GPT-4 tokenizer) to roughly:
["Service", "Now", " rocks"] // 3 tokens
Rules of thumb (English):
- 1 token ≈ 4 characters ≈ 0.75 words
- 1,000 tokens ≈ 750 words ≈ 1.5 pages of prose
- Common words = 1 token; uncommon = 2-4; non-Latin scripts often 2-3× more
Costs are billed per token. A 10,000-word document is roughly 13,000 tokens of input — meaningful at scale.
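The rule of thumb turns into a quick back-of-the-envelope estimator. This is a rough sketch, not a tokenizer: `estimate_tokens` and `estimate_cost` are hypothetical helpers, and the price passed in is a placeholder you would replace with your provider's actual rate.

```python
def estimate_tokens(text: str) -> int:
    """Estimate token count for English text using 1 token ~= 0.75 words."""
    words = len(text.split())
    return round(words / 0.75)


def estimate_cost(tokens: int, price_per_million_tokens: float) -> float:
    """Cost for a given token count at an assumed per-million-token price."""
    return tokens / 1_000_000 * price_per_million_tokens


if __name__ == "__main__":
    document = " ".join(["word"] * 10_000)  # stand-in for a 10,000-word document
    tokens = estimate_tokens(document)
    # 3.0 is a made-up price per million input tokens, for illustration only.
    print(tokens, estimate_cost(tokens, price_per_million_tokens=3.0))
```

For billing-grade numbers, count with the provider's real tokenizer; the estimate is only good enough for capacity planning.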
The Context Window
The model has a fixed maximum context window — input + output combined cannot exceed it. By 2026:
| Model class | Typical context |
|---|---|
| Older / small | 4K - 32K tokens |
| Modern frontier | 200K tokens (Claude, GPT-5) |
| Long-context specialty | 1M+ tokens (Gemini 2.x, GPT-5 long) |
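
Because input and output share one window, it helps to check the budget before sending a request. A minimal sketch, assuming you already have a token count for the input; `fits_context` and `max_output_budget` are hypothetical helper names.

```python
def fits_context(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the input plus the requested output cap fits in the window."""
    return input_tokens + max_output_tokens <= context_window


def max_output_budget(input_tokens: int, context_window: int) -> int:
    """Tokens left for the response once the input is accounted for."""
    return max(0, context_window - input_tokens)


if __name__ == "__main__":
    window = 200_000
    print(fits_context(190_000, 8_000, window))  # input leaves room for the output cap
    print(fits_context(195_000, 8_000, window))  # input leaves too little room
    print(max_output_budget(195_000, window))
```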
"Effective" context is often smaller than nominal — models attend better to the start and end of long contexts than the middle. This is the "lost-in-the-middle" effect. Practical implication: put the most important instructions and information at the start or the end, not buried in the middle.
Inside the Box (Briefly)
Modern LLMs are decoder-only transformers. The key innovation, attention, lets the model dynamically weigh prior tokens when predicting the next. You don't need the math — you need the consequence: position matters and recency matters. Instructions late in the prompt typically influence the next output more than instructions early in a long context.
Modern improvements layered on transformers include:
- Mixture of Experts (MoE): Many expert subnetworks; the model routes each token to a few of them, so you get a large model's capacity while only a fraction of the parameters run per token
- Multi-head latent attention: Compresses the key-value cache, which makes long-context attention cheaper
- Flash attention: A speed optimisation, not a quality change
Sampling Parameters
The model outputs a probability distribution over tens of thousands of tokens. How you pick from it controls output behaviour.
| Parameter | What it does | Typical values |
|---|---|---|
| temperature | Flattens (high) or sharpens (low) the distribution before sampling. 0 = always pick the most likely token. | 0.0 for factual, 0.7 for creative |
| top_p (nucleus) | Only sample from the smallest set of tokens whose probabilities sum to at least p | 0.9 is common |
| top_k | Only consider the K most likely tokens | 40-100; less commonly tuned |
| frequency_penalty | Penalises a token in proportion to how many times it has already appeared; reduces verbatim repetition | 0-1 |
| presence_penalty | Penalises any token that has already appeared, regardless of count; nudges the model toward new tokens | 0-1 |
| max_tokens | Hard cap on generated tokens | Set to match your UI |
| stop | String(s) that, when generated, end the response | e.g., "\n\nHuman:" in old chat formats |
For near-deterministic output (testing, structured generation), set temperature=0; providers can still introduce small run-to-run differences. For exploration or creative writing, use 0.7-1.0.
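To make temperature, top_k, and top_p concrete, here is a small pure-Python sketch over a toy three-token vocabulary. The function names are hypothetical, and real inference engines apply these filters to tensors over the full vocabulary, possibly in a different order; treat this as an illustration of the idea rather than any provider's implementation.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def renormalise(probs):
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}


def filter_top_k(probs, k):
    """Keep only the k most likely tokens."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return renormalise(dict(ranked[:k]))


def filter_top_p(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalise(kept)


def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    if temperature == 0:
        # Greedy decoding: always take the single most likely token.
        return max(logits, key=logits.get)
    probs = softmax_with_temperature(logits, temperature)
    if top_k is not None:
        probs = filter_top_k(probs, top_k)
    if top_p is not None:
        probs = filter_top_p(probs, top_p)
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0}
    print(softmax_with_temperature(toy_logits, 0.5))
    print(softmax_with_temperature(toy_logits, 2.0))
    print(sample(toy_logits, temperature=0.7, top_p=0.9))
```

Comparing the two printed distributions shows the effect directly: at temperature 0.5 the top token takes most of the mass, while at 2.0 the mass spreads out.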
System, User, and Assistant Messages
Modern APIs use a structured message format:
{
  "messages": [
    { "role": "system", "content": "You are a strict JSON-only translator." },
    { "role": "user", "content": "Translate to French: Hello world" },
    { "role": "assistant", "content": "{\"text\": \"Bonjour le monde\"}" },
    { "role": "user", "content": "Now to German" }
  ]
}
- System messages set persistent behaviour and tone. Often the highest-leverage edit you can make.
- User messages are the human input.
- Assistant messages are prior model responses (for multi-turn) or examples (for few-shot).
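The three roles can be assembled programmatically. A minimal sketch, assuming the generic `role`/`content` shape shown above; `build_messages` is a hypothetical helper, and some providers take the system prompt as a separate field rather than as a message.

```python
import json


def build_messages(system, examples, user_input):
    """Build a message list: system prompt, few-shot pairs, then the real input."""
    messages = [{"role": "system", "content": system}]
    for example_input, example_output in examples:
        # Each few-shot example is a user turn followed by the ideal assistant turn.
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_input})
    return messages


if __name__ == "__main__":
    payload = {
        "messages": build_messages(
            system="You are a strict JSON-only translator.",
            examples=[("Translate to French: Hello world", '{"text": "Bonjour le monde"}')],
            user_input="Translate to German: Hello world",
        )
    }
    print(json.dumps(payload, indent=2, ensure_ascii=False))
```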
Streaming
Responses can be streamed token-by-token via Server-Sent Events. The first token typically arrives in 200-800ms; the rest stream as generated. Always stream user-facing chat — it dramatically improves perceived responsiveness.
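On the wire, a Server-Sent Events stream is a sequence of `data:` lines separated by blank lines. The sketch below parses such a stream from any iterable of text lines. The payload shape (`{"text": ...}`) and the `[DONE]` sentinel are assumptions for illustration; each provider defines its own event format, and official SDKs normally handle this parsing for you.

```python
import json


def iter_sse_data(lines):
    """Yield the data payload of each SSE event from an iterable of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # SSE strips one optional leading space
            buffer.append(value)
        elif line == "" and buffer:
            # A blank line terminates the current event.
            yield "\n".join(buffer)
            buffer = []
    if buffer:
        yield "\n".join(buffer)


def stream_text(lines):
    """Yield text fragments until the (assumed) [DONE] sentinel arrives."""
    for data in iter_sse_data(lines):
        if data == "[DONE]":
            break
        yield json.loads(data)["text"]


if __name__ == "__main__":
    fake_stream = [
        'data: {"text": "Bon"}\n', "\n",
        'data: {"text": "jour"}\n', "\n",
        "data: [DONE]\n", "\n",
    ]
    for fragment in stream_text(fake_stream):
        print(fragment, end="", flush=True)  # render each fragment as it arrives
    print()
```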
Function Calling / Tool Use
Modern models can emit structured output the caller interprets:
{
  "tools": [{
    "name": "get_weather",
    "description": "Get current weather for a city",
    "input_schema": {
      "type": "object",
      "properties": { "city": { "type": "string" } },
      "required": ["city"]
    }
  }],
  "messages": [{ "role": "user", "content": "What's the weather in Paris?" }]
}
The model returns a tool-use message, roughly of the form {"tool": "get_weather", "input": {"city": "Paris"}} (the exact shape varies by provider). Your code executes the function and sends the result back, and the model then produces the user-facing answer. This is how agents work, covered in detail in lesson 4.
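The "your code executes the function" step is usually a small dispatcher. A minimal sketch, assuming the simplified `{"tool": ..., "input": ...}` shape used in this lesson; `TOOLS`, `run_tool_call`, and the stubbed `get_weather` with its placeholder values are all hypothetical, and a real implementation would call an actual weather service.

```python
import json


def get_weather(city):
    """Stub tool: returns placeholder values instead of calling a real weather API."""
    return {"city": city, "temperature_c": 18, "conditions": "cloudy"}


# Registry mapping tool names (as declared to the model) to local functions.
TOOLS = {"get_weather": get_weather}


def run_tool_call(tool_call):
    """Execute the tool the model asked for and return a JSON-serialisable result."""
    name = tool_call["tool"]
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    try:
        return TOOLS[name](**tool_call["input"])
    except TypeError as exc:
        # The model supplied arguments that do not match the function signature.
        return {"error": f"bad arguments: {exc}"}


if __name__ == "__main__":
    model_output = '{"tool": "get_weather", "input": {"city": "Paris"}}'
    result = run_tool_call(json.loads(model_output))
    print(json.dumps(result))  # this result is what you send back to the model
```

Returning errors as data rather than raising lets the model see what went wrong and retry with corrected arguments.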
JSON Mode and Structured Outputs
For machine-readable outputs without tools, use JSON mode or structured outputs. Where a provider distinguishes the two, JSON mode constrains the model to emit syntactically valid JSON, while structured outputs go further and constrain it to a schema you supply. Either is far more reliable than prompt-only instructions like "respond with JSON".
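Even with constrained output, it is cheap to validate what comes back before acting on it. A minimal sketch using only the standard library; `parse_structured` is a hypothetical helper that checks required fields and their types, not a full JSON Schema validator.

```python
import json


def parse_structured(raw, required):
    """Parse model output and check required fields.

    `required` maps field name -> expected Python type, e.g. {"text": str}.
    Raises ValueError with a readable message on any mismatch.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in required.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"field {field!r} should be {expected_type.__name__}")
    return data


if __name__ == "__main__":
    print(parse_structured('{"text": "Bonjour le monde"}', {"text": str}))
    try:
        parse_structured("Sure! Here is your JSON: {...}", {"text": str})
    except ValueError as err:
        print("rejected:", err)
```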
The Practical Implications
Knowing the above, certain behaviours stop feeling magical:
- Why prompts work: They are the conditioning context for next-token prediction; better prompts produce better distributions to sample from
- Why repetition happens: Without frequency/presence penalty, the model can drift into "loops" where high-probability tokens reinforce themselves
- Why long inputs degrade: Attention is finite; lost-in-the-middle is a known phenomenon
- Why hallucinations happen: The model samples from a probability distribution; "I don't know" is just another token sequence, often less likely than a confident-sounding wrong answer
- Why determinism is hard: Sampling involves a random seed; even temperature=0 has provider-side non-determinism (batching, quantisation)
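The repetition point is easier to see with the penalties written out. The sketch below follows one common formulation, in which each token's logit is reduced by its count times the frequency penalty, plus a flat presence penalty if it has appeared at all. `apply_penalties` is a hypothetical name, and individual providers may implement the penalties differently.

```python
from collections import Counter


def apply_penalties(logits, generated_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logits of tokens that already appear in the generated text."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, logit in logits.items():
        count = counts.get(token, 0)
        penalty = count * frequency_penalty  # grows with every repeat
        if count > 0:
            penalty += presence_penalty  # flat, applied once a token has appeared
        adjusted[token] = logit - penalty
    return adjusted


if __name__ == "__main__":
    toy_logits = {"the": 2.0, "cat": 1.0}
    history = ["the", "the", "the"]
    print(apply_penalties(toy_logits, history, frequency_penalty=0.5, presence_penalty=0.5))
```

In this toy case the repeated token falls below the unused one, which is how a penalty breaks a loop that would otherwise keep reinforcing itself.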
Armed with this model of the model, the next lesson — actual prompt engineering — will feel less like incantation and more like engineering.