7 min read·Lesson 2 of 8

How LLMs Actually Work (Tokens, Context, Sampling)

A practical mental model: tokens, the transformer in one paragraph, context windows, sampling parameters, and why all of it shapes prompts.

You don't need to be able to derive backpropagation to build with LLMs — but you do need a working mental model. This lesson teaches just enough about how LLMs work that the rest of the course makes sense.

The Core Loop

Every LLM does one thing: given a sequence of tokens, it predicts a probability distribution over the next token. Generation works by sampling from that distribution, appending the result, and repeating:

input → P(next token) → sample → append → P(next token) → sample → … → stop

That's it. Everything else — chat, code completion, function calling — is sugar on this single primitive.
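The loop above can be sketched in a few lines. This is a toy illustration, not how any real model is implemented: `TOY_MODEL`, `next_token_distribution`, and `generate` are hypothetical names, and the hard-coded lookup table stands in for a neural network that would condition on the whole sequence.

```python
import random

# Hypothetical stand-in for a real model: maps the last token to a
# probability distribution over possible next tokens.
TOY_MODEL = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "<stop>": 0.3},
    "dog": {"sat": 0.7, "<stop>": 0.3},
    "sat": {"<stop>": 1.0},
}


def next_token_distribution(tokens):
    """Return P(next token | tokens). A real LLM conditions on the whole
    sequence; this toy only looks at the last token."""
    return TOY_MODEL[tokens[-1]]


def generate(prompt_tokens, max_tokens=10, seed=0):
    """Sample, append, repeat until a stop token or the token cap."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        dist = next_token_distribution(tokens)
        candidates = list(dist.keys())
        weights = list(dist.values())
        token = rng.choices(candidates, weights=weights, k=1)[0]  # sample
        if token == "<stop>":
            break
        tokens.append(token)  # append, then predict again
    return tokens


print(generate(["<start>"]))
```

Changing `seed` changes which path through the table gets sampled, which is the same reason two identical requests to a real model can return different text.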

What's a Token?

Tokens are sub-word pieces. The phrase "ServiceNow rocks" tokenises (with the GPT-4 tokenizer) to roughly:

["Service", "Now", " rocks"]   // 3 tokens

Rules of thumb (English):

  • 1 token ≈ 4 characters ≈ 0.75 words
  • 1,000 tokens ≈ 750 words ≈ 1.5 pages of prose
  • Common words are usually 1 token; uncommon words split into 2-4; non-Latin scripts often need 2-3× more tokens for the same text

Costs are billed per token. A 10,000-word document is roughly 13,000 tokens of input — meaningful at scale.
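The arithmetic behind that estimate is easy to script. This is a rough sketch built only on the 0.75 words-per-token rule of thumb; the function names are hypothetical and the price passed in below is a made-up placeholder, not a real provider rate.

```python
TOKENS_PER_WORD = 1 / 0.75  # from the rule of thumb: 1 token ≈ 0.75 words


def estimate_tokens(word_count):
    """Rough English-only estimate; real counts depend on the tokenizer."""
    return round(word_count * TOKENS_PER_WORD)


def estimate_cost(word_count, price_per_million_tokens):
    """Input cost in the same currency unit as the price you pass in."""
    return estimate_tokens(word_count) * price_per_million_tokens / 1_000_000


print(estimate_tokens(10_000))  # 13333
# Placeholder price, for illustration only.
print(estimate_cost(10_000, price_per_million_tokens=2.0))
```

For billing or hard limits, count tokens with the provider's own tokenizer rather than relying on this estimate.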

The Context Window

The model has a fixed maximum context window — input + output combined cannot exceed it. By 2026:

| Model class | Typical context |
| --- | --- |
| Older / small | 4K - 32K tokens |
| Modern frontier | 200K tokens (Claude, GPT-5) |
| Long-context specialty | 1M+ tokens (Gemini 2.x, GPT-5 long) |

"Effective" context is often smaller than nominal — models attend better to the start and end of long contexts than the middle. This is the "lost-in-the-middle" effect. Practical implication: put the most important instructions and information at the start or the end, not buried in the middle.

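Two consequences of this section can be sketched in code: input and output share one budget, and important instructions belong at the edges of the prompt. Both helpers below are hypothetical illustrations, and repeating the instructions as a "Reminder" is one possible layout, not a prescribed format.

```python
def fits_context(input_tokens, max_output_tokens, context_window):
    """Input and output share one window, so budget for both."""
    return input_tokens + max_output_tokens <= context_window


def build_prompt(instructions, documents, question):
    """Put instructions first and restate them near the end, so neither
    copy sits in the middle of a long context."""
    parts = [instructions]
    parts.extend(documents)  # bulky reference material goes in the middle
    parts.append(f"Reminder: {instructions}")
    parts.append(question)
    return "\n\n".join(parts)


print(fits_context(190_000, 8_000, 200_000))  # True
print(fits_context(195_000, 8_000, 200_000))  # False
print(build_prompt("Answer in one sentence.", ["doc A", "doc B"], "What is X?"))
```
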
Inside the Box (Briefly)

Modern LLMs are decoder-only transformers. The key innovation, attention, lets the model dynamically weigh prior tokens when predicting the next. You don't need the math; you need the consequence: position matters. In a long context, instructions near the end of the prompt typically influence the next output more than the same instructions buried earlier.

Modern improvements layered on transformers include:

  • Mixture of Experts (MoE): Many expert subnetworks; the model routes each token to only a few of them, so a large model runs at a fraction of the per-token compute
  • Multi-head latent attention: Compresses the attention key/value cache, which makes long contexts cheaper to serve
  • Flash attention: A speed optimisation, not a quality change

Sampling Parameters

The model outputs a probability distribution over tens of thousands of tokens. How you pick from it controls output behaviour.

| Parameter | What it does | Typical values |
| --- | --- | --- |
| temperature | Flattens (high) or sharpens (low) the distribution before sampling. 0 = always pick the most likely token. | 0.0 for factual, 0.7 for creative |
| top_p (nucleus) | Only sample from the smallest set of tokens whose probabilities sum to at least p | 0.9 is common |
| top_k | Only consider the K most likely tokens | 40-100; less commonly tuned |
| frequency_penalty | Penalises a token in proportion to how often it has already appeared; reduces verbatim repetition | 0-1 |
| presence_penalty | Applies a flat penalty to any token that has already appeared, regardless of count | 0-1 |
| max_tokens | Hard cap on generated tokens | Set to the longest response your UI needs |
| stop | String(s) that, when generated, end the response | e.g., "\n\nHuman:" in old chat formats |

For near-deterministic output (testing, structured generation), set temperature=0, which always picks the most likely token. For exploration or creative writing, use 0.7-1.0.
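To make temperature, top_k, and top_p concrete, here is a minimal pure-Python sketch of how they might be applied to a list of raw scores (logits). The function names are hypothetical, and the order of operations (temperature, then top_k, then top_p) is an assumption; real inference stacks differ in the details.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; lower temperature sharpens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Return the index of the chosen token."""
    if temperature == 0:
        # Greedy decoding: always take the most likely token.
        return max(range(len(logits)), key=lambda i: logits[i])

    probs = softmax(logits, temperature)
    # Rank token indices from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    if top_k is not None:
        ranked = ranked[:top_k]

    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break  # smallest set whose probability mass reaches top_p
        ranked = kept

    weights = [probs[i] for i in ranked]  # choices() renormalises the weights
    return rng.choices(ranked, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.1]
print(sample(logits, temperature=0))  # 0
print(sample(logits, temperature=0.7, top_p=0.9, rng=random.Random(42)))
```

Passing a seeded `random.Random` makes the sketch repeatable, which is handy in tests.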

System, User, and Assistant Messages

Modern APIs use a structured message format:

{
  "messages": [
    { "role": "system", "content": "You are a strict JSON-only translator." },
    { "role": "user", "content": "Translate to French: Hello world" },
    { "role": "assistant", "content": "{\"text\": \"Bonjour le monde\"}" },
    { "role": "user", "content": "Now to German" }
  ]
}

  • System messages set persistent behaviour and tone. Often the highest-leverage edit you can make.
  • User messages are the human input.
  • Assistant messages are prior model responses (for multi-turn) or examples (for few-shot).
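As a sketch of how the three roles combine for few-shot prompting, the hypothetical helper below assembles a payload in the same shape as the example above. The exact request format varies by provider, so treat the field names as following the example rather than any specific API.

```python
import json


def build_messages(system, examples, user_input):
    """Assemble a chat payload: system prompt, few-shot pairs, then the
    real user input. `examples` is a list of (user, assistant) tuples."""
    messages = [{"role": "system", "content": system}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_input})
    return messages


payload = {
    "messages": build_messages(
        system="You are a strict JSON-only translator.",
        examples=[("Translate to French: Hello world",
                   '{"text": "Bonjour le monde"}')],
        user_input="Translate to French: Good night",
    )
}
print(json.dumps(payload, indent=2, ensure_ascii=False))
```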

Streaming

Responses can be streamed token-by-token via Server-Sent Events. The first token typically arrives in 200-800ms; the rest stream as generated. Always stream user-facing chat — it dramatically improves perceived responsiveness.
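On the wire, a Server-Sent Events stream is a series of `data:` lines separated by blank lines. The sketch below parses that framing from any iterable of text lines. `iter_sse_data` is a hypothetical helper, and the plain-text payloads in the simulated stream are an assumption; real providers usually send JSON chunks whose shape is provider-specific.

```python
def iter_sse_data(lines):
    """Yield the data payload of each Server-Sent Event.

    `lines` is any iterable of decoded text lines, for example the body
    of a streaming HTTP response.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # the SSE format strips one leading space
            buffer.append(value)
        elif line == "" and buffer:
            yield "\n".join(buffer)  # a blank line ends the event
            buffer = []
        # Other lines (comments starting with ":", other fields) are ignored.
    if buffer:
        # Lenient choice for this sketch: flush a final event even if the
        # stream ended without the trailing blank line.
        yield "\n".join(buffer)


# Simulated stream of text fragments.
fake_stream = ["data: Hel\n", "\n", ": keep-alive comment\n", "data: lo\n", "\n"]
for fragment in iter_sse_data(fake_stream):
    print(fragment, end="", flush=True)  # render each piece as it arrives
print()
```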

Function Calling / Tool Use

Modern models can emit structured output the caller interprets:

{
  "tools": [{
    "name": "get_weather",
    "description": "Get current weather for a city",
    "input_schema": {
      "type": "object",
      "properties": { "city": { "type": "string" } },
      "required": ["city"]
    }
  }],
  "messages": [{ "role": "user", "content": "What's the weather in Paris?" }]
}

The model returns a tool-use message, shown here in simplified form: {"tool": "get_weather", "input": {"city": "Paris"}}. Your code executes the function and sends the result back, and the model then produces the user-facing answer. This is how agents work, covered in detail in lesson 4.
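The caller's side of that exchange might look like the sketch below. It assumes the simplified `{"tool": ..., "input": ...}` shape from the text; real provider responses carry extra fields such as call IDs. `get_weather`, `TOOLS`, and `run_tool_call` are hypothetical, and the weather data is a placeholder.

```python
import json


def get_weather(city):
    """Hypothetical tool; a real one would call a weather service."""
    return {"city": city, "forecast": "placeholder"}


# Registry mapping tool names the model may emit to local functions.
TOOLS = {"get_weather": get_weather}


def run_tool_call(tool_call):
    """Execute one tool-use message and return a result to send back."""
    name = tool_call.get("tool")
    if name not in TOOLS:
        return {"tool": name, "error": f"unknown tool: {name}"}
    try:
        result = TOOLS[name](**tool_call.get("input", {}))
    except TypeError as exc:  # the model supplied wrong or missing arguments
        return {"tool": name, "error": str(exc)}
    return {"tool": name, "result": result}


model_output = '{"tool": "get_weather", "input": {"city": "Paris"}}'
print(run_tool_call(json.loads(model_output)))
```

Returning errors as data rather than raising lets the model see what went wrong and retry with corrected arguments.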

JSON Mode and Structured Outputs

For machine-readable outputs without tools, use JSON mode or structured outputs. JSON mode constrains the model to emit syntactically valid JSON; structured outputs go further and constrain it to a schema you supply. Both are far more reliable than prompt-only instructions like "respond with JSON".
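Even with these modes enabled, it is reasonable to validate what comes back before using it. The sketch below is a hand-rolled check using only the standard library, not a full JSON Schema validator; `parse_model_json` and its `required_fields` argument are hypothetical.

```python
import json


def parse_model_json(raw, required_fields):
    """Parse model output and check required fields and their types.

    `required_fields` maps field name -> expected Python type.
    Raises ValueError if the output is unusable.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in required_fields.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    return data


print(parse_model_json('{"text": "Bonjour le monde"}', {"text": str}))
```

A failed check is a natural point to retry the request or fall back to a default.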

The Practical Implications

Knowing the above, certain behaviours stop feeling magical:

  • Why prompts work: They are the conditioning context for next-token prediction; better prompts produce better distributions to sample from
  • Why repetition happens: Without frequency/presence penalty, the model can drift into "loops" where high-probability tokens reinforce themselves
  • Why long inputs degrade: Attention is spread across more tokens, and content in the middle of a long context is recalled less reliably (the lost-in-the-middle effect)
  • Why hallucinations happen: The model samples from a probability distribution; "I don't know" is just another token sequence, often less likely than a confident-sounding wrong answer
  • Why determinism is hard: Sampling is random unless seeded; even at temperature=0, providers can introduce non-determinism (for example through batching and quantisation)
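The repetition bullet above mentions frequency and presence penalties. One common formulation subtracts them from the raw scores before sampling, as sketched below; `apply_penalties` is a hypothetical helper, and providers may implement the details differently.

```python
from collections import Counter


def apply_penalties(logits, generated_ids, frequency_penalty=0.0,
                    presence_penalty=0.0):
    """Lower the scores of tokens that have already been generated.

    `logits` is a list of raw scores indexed by token id and
    `generated_ids` is the list of token ids produced so far.
    """
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= count * frequency_penalty  # grows with each repeat
        adjusted[token_id] -= presence_penalty           # flat, applied once
    return adjusted


# Token 0 appeared twice, token 2 once, token 1 never.
print(apply_penalties([1.0, 1.0, 1.0], [0, 0, 2],
                      frequency_penalty=0.5, presence_penalty=0.2))
```

Token 0 is penalised most because it has been used twice, which is what breaks a loop where the same high-probability token keeps reinforcing itself.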

Armed with this model of the model, the next lesson — actual prompt engineering — will feel less like incantation and more like engineering.

Key Takeaways

  • LLMs generate one token at a time; each token is a sub-word piece, not a character or word.
  • The context window is the maximum input + output length — measured in tokens, varies by model.
  • Temperature, top_p, and top_k control sampling randomness — lower = more deterministic.
  • The "attention" mechanism lets the model selectively weigh prior tokens when predicting the next.
  • Function calling / tool use is the model emitting structured output the caller interprets.
