How LLMs Actually Work (Tokens, Context, Sampling) — Generative AI & Prompt Engineering | CertQnA

You don't need to be able to derive backpropagation to build with LLMs — but you do need a working mental model. This lesson teaches just enough about how LLMs work that the rest of the course makes sense.

The Core Loop

Every LLM does one thing: given a sequence of tokens, it predicts a probability distribution over the next token. Generation works by sampling from that distribution repeatedly:

input → P(next token) → sample → append → P(next token) → sample → … → stop

That's it. Everything else — chat, code completion, function calling — is sugar on this single primitive.
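The loop above can be sketched in a few lines of Python. This is a toy illustration, not how any real model is implemented: `TOY_MODEL`, `next_token_distribution`, and `generate` are hypothetical names, and the lookup table conditions only on the last token, whereas a real LLM conditions on the whole sequence.

```python
import random

# Hypothetical toy "model": maps the last token to a next-token distribution.
# A real LLM computes this distribution from the entire sequence.
TOY_MODEL = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "<stop>": 0.3},
    "dog": {"sat": 0.7, "<stop>": 0.3},
    "sat": {"<stop>": 1.0},
}


def next_token_distribution(tokens):
    """Return P(next token | tokens) for the toy model."""
    return TOY_MODEL[tokens[-1]]


def generate(prompt_tokens, max_tokens=10, seed=0):
    """Sample, append, repeat until a stop token or the token cap."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        dist = next_token_distribution(tokens)
        candidates = list(dist.keys())
        weights = list(dist.values())
        token = rng.choices(candidates, weights=weights, k=1)[0]
        if token == "<stop>":
            break
        tokens.append(token)
    return tokens


print(generate(["<start>"]))
```

Changing `seed` changes which continuation is sampled; the loop itself stays the same.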

What's a Token?

Tokens are sub-word pieces. The phrase "ServiceNow rocks" tokenises (in the GPT-4 tokenizer) to roughly:

["Service", "Now", " rocks"]   // 3 tokens

Rules of thumb (English):

1 token ≈ 4 characters ≈ 0.75 words
1,000 tokens ≈ 750 words ≈ 1.5 pages of prose
Common words = 1 token; uncommon = 2-4; non-Latin scripts often 2-3× more

Costs are billed per token. A 10,000-word document is roughly 13,000 tokens of input — meaningful at scale.
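The arithmetic behind that estimate can be written out as a small helper. This is a rough sketch based only on the rule of thumb above; the price used in the example is a made-up placeholder, not any provider's real rate, and exact counts always require the model's own tokenizer.

```python
def estimate_tokens(word_count, words_per_token=0.75):
    """Estimate token count from an English word count (rule of thumb only)."""
    return round(word_count / words_per_token)


def estimate_cost(tokens, price_per_million_tokens):
    """Cost of a given number of tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million_tokens


tokens = estimate_tokens(10_000)
print(tokens)  # roughly 13,000, matching the figure in the text

# Hypothetical price of 3.00 per million input tokens, for illustration only.
print(estimate_cost(tokens, price_per_million_tokens=3.00))
```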

The Context Window

The model has a fixed maximum context window — input + output combined cannot exceed it. By 2026:

Model class	Typical context
Older / small	4K - 32K tokens
Modern frontier	200K tokens (Claude, GPT-5)
Long-context specialty	1M+ tokens (Gemini 2.x, GPT-5 long)
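Because input and output share the same window, it helps to check the budget before sending a request. The helpers below are a minimal sketch; the function names are hypothetical, and the token counts are assumed to come from whatever tokenizer your provider exposes.

```python
def fits_context(input_tokens, max_output_tokens, context_window):
    """True if the prompt plus the reserved output fits in the window."""
    return input_tokens + max_output_tokens <= context_window


def remaining_output_budget(input_tokens, context_window):
    """Tokens left for the response once the prompt is accounted for."""
    return max(context_window - input_tokens, 0)


# Example with a 200K-token window and 8K tokens reserved for the answer.
print(fits_context(190_000, 8_000, 200_000))
print(fits_context(195_000, 8_000, 200_000))
print(remaining_output_budget(195_000, 200_000))
```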

"Effective" context is often smaller than nominal — models attend better to the start and end of long contexts than the middle. This is the "lost-in-the-middle" effect. Practical implication: put the most important instructions and information at the start or the end, not buried in the middle.

Inside the Box (Briefly)

Modern LLMs are decoder-only transformers. The key innovation, attention, lets the model dynamically weigh prior tokens when predicting the next. You don't need the math — you need the consequence: position matters and recency matters. Instructions late in the prompt typically influence the next output more than instructions early in a long context.

Modern improvements layered on transformers include:

Mixture of Experts (MoE): Many expert subnetworks; the model routes each token to only a few of them, giving the capacity of a large model at a fraction of the per-token compute
Multi-head latent attention: Cheaper long-context attention
Flash attention: A speed optimisation, not a quality change
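To make the Mixture of Experts idea concrete, here is a toy sketch of top-k routing. It is an illustration under simplifying assumptions, not any specific model's implementation: the experts are plain scalar functions, and the gate scores are passed in directly, whereas in a real model both are learned and operate on vectors.

```python
import math


def route(gate_scores, k=2):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    top = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)[:k]
    peak = max(gate_scores[i] for i in top)
    exps = {i: math.exp(gate_scores[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(x, experts, gate_scores, k=2):
    """Run only the selected experts and combine their outputs by weight."""
    weights = route(gate_scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Four hypothetical experts; only two are evaluated per token.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 4 * x, lambda x: 0.0]
print(moe_layer(3.0, experts, gate_scores=[0.1, 2.0, 2.0, -1.0], k=2))
```

The compute saving comes from the fact that the unselected experts are never called.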

Sampling Parameters

The model outputs a probability distribution over tens of thousands of tokens. How you pick from it controls output behaviour.

Parameter	What it does	Typical values
`temperature`	Flattens (high) or sharpens (low) the distribution before sampling. 0 = always pick most likely.	0.0 for factual, 0.7 for creative
`top_p` (nucleus)	Only sample from the smallest set of tokens whose probabilities sum to p	0.9 is common
`top_k`	Only consider the top K tokens	40-100; less commonly tuned
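`frequency_penalty`	Penalises tokens in proportion to how often they have already appeared; reduces verbatim repetition	0-1
`presence_penalty`	Applies a flat penalty to any token that has already appeared, regardless of count; nudges the model toward new topics	0-1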
`max_tokens`	Hard cap on generated tokens	Size to your UI and cost budget
`stop`	String(s) that, when generated, end the response	e.g., "\n\nHuman:" in old chat formats

For near-deterministic output (testing, structured generation), set temperature=0; providers may still introduce small run-to-run variation. For exploration or creative writing, use 0.7-1.0.
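The sketch below shows one way temperature, top_k, and top_p can be applied to a vector of raw scores (logits). It is a simplified illustration in pure Python; the function names are hypothetical, and real inference stacks may apply these filters in a different order or with different edge-case handling.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the result."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Return the index of the sampled token."""
    if temperature == 0:
        # Greedy decoding: always pick the most likely token.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    # Rank token ids from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept
    weights = [probs[i] for i in ranked]
    return rng.choices(ranked, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.1]
print(softmax(logits, temperature=0.5))  # sharper
print(softmax(logits, temperature=2.0))  # flatter
print(sample(logits, temperature=0))
```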

System, User, and Assistant Messages

Modern APIs use a structured message format:

{
  "messages": [
    { "role": "system", "content": "You are a strict JSON-only translator." },
    { "role": "user", "content": "Translate to French: Hello world" },
    { "role": "assistant", "content": "{\"text\": \"Bonjour le monde\"}" },
    { "role": "user", "content": "Now to German" }
  ]
}

System messages set persistent behaviour and tone; editing them is often the highest-leverage change you can make.
User messages are the human input.
Assistant messages are prior model responses (for multi-turn) or examples (for few-shot).
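A small helper makes the three roles explicit when assembling a request. This is a sketch only: `build_messages` is a hypothetical name, and the resulting list follows the generic message shape shown above rather than any one provider's exact schema.

```python
import json


def build_messages(system_prompt, examples, user_input):
    """Assemble a message list: system prompt, few-shot pairs, then the new input."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in examples:
        # Each few-shot example is a user turn followed by the ideal assistant turn.
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_input})
    return messages


messages = build_messages(
    system_prompt="You are a strict JSON-only translator.",
    examples=[("Translate to French: Hello world", '{"text": "Bonjour le monde"}')],
    user_input="Translate to German: Hello world",
)
print(json.dumps({"messages": messages}, indent=2, ensure_ascii=False))
```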

Streaming

Responses can be streamed token-by-token via Server-Sent Events. The first token typically arrives in 200-800ms; the rest stream as generated. Always stream user-facing chat — it dramatically improves perceived responsiveness.
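On the wire, a Server-Sent Events stream is a series of `data:` lines separated by blank lines. The parser below is a minimal sketch that works on an iterable of text lines; the `{"delta": ...}` payload shape in the example is an assumption for illustration, since each provider defines its own event format.

```python
import json


def iter_sse_data(lines):
    """Yield the data payload of each event from an iterable of SSE text lines."""
    buffer = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # SSE allows one optional space after the colon
            buffer.append(value)
        elif line == "" and buffer:
            # A blank line ends the current event.
            yield "\n".join(buffer)
            buffer = []
    if buffer:
        yield "\n".join(buffer)


# Hypothetical stream: each event carries a small piece of the response text.
stream = ['data: {"delta": "Bon"}', "", 'data: {"delta": "jour"}', ""]
text = ""
for payload in iter_sse_data(stream):
    text += json.loads(payload)["delta"]  # render each piece as it arrives
print(text)
```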

Function Calling / Tool Use

Modern models can emit structured output the caller interprets:

{
  "tools": [{
    "name": "get_weather",
    "description": "Get current weather for a city",
    "input_schema": {
      "type": "object",
      "properties": { "city": { "type": "string" } },
      "required": ["city"]
    }
  }],
  "messages": [{ "role": "user", "content": "What's the weather in Paris?" }]
}

The model returns a tool-use message, shaped roughly like {"tool": "get_weather", "input": {"city": "Paris"}} (exact field names vary by provider). Your code executes the function and sends the result back, and the model then produces the user-facing answer. This is how agents work; they are covered in detail in lesson 4.
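The caller's side of that exchange can be sketched as a simple dispatcher. This assumes the simplified tool-use shape used in this lesson (`"tool"` and `"input"` keys); `get_weather` here is a hypothetical stub that returns canned data rather than calling a real weather service.

```python
def get_weather(city):
    """Hypothetical stub; a real implementation would call a weather API."""
    return {"city": city, "forecast": "sunny", "temp_c": 21}


# Registry mapping tool names the model may emit to local functions.
TOOLS = {"get_weather": get_weather}


def run_tool_call(tool_call):
    """Execute the function named in a tool-use message and return its result."""
    name = tool_call["tool"]
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    try:
        return TOOLS[name](**tool_call["input"])
    except TypeError as exc:
        # The model supplied arguments that do not match the function signature.
        return {"error": str(exc)}


result = run_tool_call({"tool": "get_weather", "input": {"city": "Paris"}})
print(result)  # this result would be sent back to the model as the next message
```

Returning errors as data, rather than raising, lets the model see what went wrong and retry with corrected arguments.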

JSON Mode and Structured Outputs

For machine-readable outputs without tools, use JSON mode or structured outputs. JSON mode constrains the model to emit syntactically valid JSON; structured outputs go further and constrain it to a schema you supply. Both are far more reliable than prompt-only instructions like "respond with JSON".
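Even with constrained output, it is reasonable to validate what comes back before using it. The function below is a minimal standard-library sketch, not a full JSON Schema validator; `parse_structured` and its `required_fields` argument are hypothetical names chosen for this example.

```python
import json


def parse_structured(raw, required_fields):
    """Parse a model response and check required fields and their types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in required_fields.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    return data


parsed = parse_structured('{"text": "Bonjour le monde"}', {"text": str})
print(parsed["text"])
```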

The Practical Implications

Knowing the above, certain behaviours stop feeling magical:

Why prompts work: They are the conditioning context for next-token prediction; better prompts produce better distributions to sample from
Why repetition happens: Without frequency/presence penalty, the model can drift into "loops" where high-probability tokens reinforce themselves
Why long inputs degrade: Attention is spread across every token in the context, so material in the middle of a long input tends to get less weight (the lost-in-the-middle effect)
Why hallucinations happen: The model samples from a probability distribution; "I don't know" is just another token sequence, often less likely than a confident-sounding wrong answer
Why determinism is hard: Sampling involves a random seed; even temperature=0 has provider-side non-determinism (batching, quantisation)
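The repetition point can be made concrete with a sketch of how frequency and presence penalties adjust scores before sampling. This follows one commonly documented formulation (subtract count times the frequency penalty, plus a flat presence penalty for any token already seen); `apply_penalties` is a hypothetical name, and providers may differ in the details.

```python
from collections import Counter


def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logits of tokens that have already been generated."""
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        # Frequency term grows with each repeat; presence term is a one-off flat cost.
        adjusted[token_id] -= count * frequency_penalty + presence_penalty
    return adjusted


# Token 0 was generated twice, token 2 once, token 1 never.
print(apply_penalties([1.0, 1.0, 1.0], [0, 0, 2], frequency_penalty=0.5, presence_penalty=0.2))
```

Tokens that keep recurring are pushed further down each time, which is what breaks a self-reinforcing loop.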

Armed with this model of the model, the next lesson — actual prompt engineering — will feel less like incantation and more like engineering.