Out-of-the-box, LLMs can plausibly answer most questions — but accuracy on multi-step reasoning, math, and tasks requiring external information depends heavily on how you ask. The techniques in this lesson are the tools production teams use to get reliable results on hard problems.
Chain-of-Thought (CoT)
The classic technique. Instead of "What is the answer?" you ask "Think step by step, then give the answer."
Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the
golf balls are blue. How many blue golf balls are there?
A: Let's work through this step by step.
1. Total balls: 16
2. Half are golf balls: 16 / 2 = 8 golf balls
3. Half of golf balls are blue: 8 / 2 = 4 blue golf balls
Answer: 4
For 2023-era GPT-3.5, CoT roughly doubled accuracy on grade-school math word problems. Modern models still benefit on harder problems, but the effect is smaller because they have learned to CoT implicitly.
Zero-shot CoT
The phrase "Let's think step by step" added to the prompt is enough — no examples needed.
Few-shot CoT
Even stronger: include 2-3 examples showing the reasoning style you want:
Q: Anna had 12 apples. She gave 4 to Ben and ate 2. How many remain?
A: Anna started with 12. Gave 4 → 12 - 4 = 8. Ate 2 → 8 - 2 = 6. Answer: 6.
Q: A train travels 60 mph for 2.5 hours. How far does it go?
A: Distance = speed × time = 60 × 2.5 = 150 miles. Answer: 150 miles.
Q: {{ new question }}
A:
Self-Consistency
Sample N different reasoning paths (temperature > 0), parse the final answer from each, and majority-vote:
prompt → sample 7 responses → extract answer from each → most common = output
Self-consistency catches the case where the model occasionally reasons incorrectly but most paths reach the right answer. Expensive (7× tokens) but reliable. Use selectively on high-stakes queries.
Reasoning Models
Starting with OpenAI o1 (late 2024), a new model class emerged: reasoning models that perform extended chain-of-thought internally before responding. They expose a "thinking" mode that runs the model autoregressively on its own scratchpad, often for tens of thousands of tokens, before emitting the final answer.
- OpenAI: o1, o1-mini, o3, GPT-5 reasoning variants
- Anthropic: Claude Opus 4 / Sonnet 4 extended thinking
- Google: Gemini 2.x Deep Think
- Open-weight: DeepSeek-R1, QwQ
Practical implication: for math, code, scientific reasoning, complex planning — a reasoning model with a simple prompt often beats a non-reasoning model with elaborate CoT prompting. Cost-wise, reasoning models charge for "thinking tokens" you never see.
ReAct: Reasoning + Acting
For problems that need external information (web search, database lookup, calculation), pure reasoning isn't enough — the model needs to act. ReAct (Reason + Act) interleaves the two:
Thought: I need to find the population of Tokyo and compare it to NYC.
Action: search("population of Tokyo 2024")
Observation: 13.96 million
Thought: Now NYC.
Action: search("population of New York City 2024")
Observation: 8.34 million
Thought: Tokyo has about 1.67x as many people as NYC.
Final Answer: Tokyo has approximately 5.6 million more people than NYC (13.96M vs 8.34M).
In production, the "Action" lines are tool calls the runtime executes; "Observation" is the result fed back into the model's context. The loop continues until the model emits "Final Answer".
Function Calling / Tool Use
ReAct is implemented via function calling (OpenAI) / tool use (Anthropic). You declare tools as JSON schemas; the model emits structured calls; your code executes them.
{
"tools": [{
"name": "search_docs",
"description": "Search internal documentation",
"input_schema": {
"type": "object",
"properties": {
"query": { "type": "string" },
"max_results": { "type": "integer", "default": 5 }
},
"required": ["query"]
}
}, {
"name": "create_ticket",
"description": "Create a support ticket",
"input_schema": { ... }
}]
}
The runtime loop:
- Send user message + tool definitions
- Model returns either text OR a tool_use block
- If tool_use, run the function, append result as a tool_result, loop
- Eventually model returns final text
Designing Tools
Good tool design matters more than prompt wording:
- Few, well-named tools — the model decides which to call based on the name and description
- Clear input schema — required fields, types, enums
- Descriptive errors — return "no results found, try a broader query" rather than empty array; the model uses your error text to course-correct
- Idempotent when possible — agents retry; idempotency prevents duplicate side effects
- Granularity that matches the work — a "create_jira_ticket" tool is better than separate "set_title" / "set_description" / "submit" tools that an agent must orchestrate
The Model Context Protocol (MCP)
Introduced by Anthropic in late 2024, MCP is an open standard for connecting LLMs to data sources and tools. Instead of every app reimplementing tool-execution logic, MCP servers expose tools via a standard protocol; any MCP-capable client (Claude Desktop, IDEs, agents) can use them. Adoption is growing fast across providers — worth following.
Agent Frameworks
For complex multi-tool workflows, agent frameworks help:
- LangGraph: Graph-based agent orchestration; production-grade
- CrewAI: Multi-agent collaboration with role-based agents
- AutoGen (Microsoft): Conversation-based multi-agent patterns
- OpenAI Agents SDK: Lightweight, opinionated, official
Choose the simplest one that solves your problem. Most production agents are 200-500 lines of code on top of the underlying SDK; frameworks help when you have many agents or complex routing.
When NOT to Use Agents
Agents are powerful but unpredictable, slow, and expensive. Heuristics:
| Workload | Better fit |
|---|---|
| Single-turn Q&A from a knowledge base | Plain RAG (next lesson) |
| Form filling / extraction | Structured output, no agent |
| Multi-step research across the web | Agent with search + browse tools |
| Complex multi-system workflows (Jira + Slack + email) | Agent with one tool per system |
| Deterministic workflows you've written before | Plain code calling LLM where needed |
The Quality Ladder
For a hard task, work up this ladder until quality is sufficient — stop at the lowest-effort tier that works:
- Plain prompt
- Few-shot examples
- Chain-of-thought
- Self-consistency / sampling
- Switch to a reasoning model
- Add retrieval (next lesson)
- Add tools (this lesson)
- Build an agent loop
- Fine-tune (lesson 6)
Each step costs more and adds complexity. Climb only as needed.