Skip to content
7 min read·Lesson 4 of 8

Advanced Prompting: Chain-of-Thought, ReAct, and Tools

Unlock LLM reasoning with chain-of-thought, self-consistency, ReAct loops, and function-calling-based tool use.

Out-of-the-box, LLMs can plausibly answer most questions — but accuracy on multi-step reasoning, math, and tasks requiring external information depends heavily on how you ask. The techniques in this lesson are the tools production teams use to get reliable results on hard problems.

Chain-of-Thought (CoT)

The classic technique. Instead of "What is the answer?" you ask "Think step by step, then give the answer."

Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the
golf balls are blue. How many blue golf balls are there?

A: Let's work through this step by step.
1. Total balls: 16
2. Half are golf balls: 16 / 2 = 8 golf balls
3. Half of golf balls are blue: 8 / 2 = 4 blue golf balls
Answer: 4

For 2023-era GPT-3.5, CoT roughly doubled accuracy on grade-school math word problems. Modern models still benefit on harder problems, but the effect is smaller because they have learned to CoT implicitly.

Zero-shot CoT

The phrase "Let's think step by step" added to the prompt is enough — no examples needed.

Few-shot CoT

Even stronger: include 2-3 examples showing the reasoning style you want:

Q: Anna had 12 apples. She gave 4 to Ben and ate 2. How many remain?
A: Anna started with 12. Gave 4 → 12 - 4 = 8. Ate 2 → 8 - 2 = 6. Answer: 6.

Q: A train travels 60 mph for 2.5 hours. How far does it go?
A: Distance = speed × time = 60 × 2.5 = 150 miles. Answer: 150 miles.

Q: {{ new question }}
A:

Self-Consistency

Sample N different reasoning paths (temperature > 0), parse the final answer from each, and majority-vote:

prompt → sample 7 responses → extract answer from each → most common = output

Self-consistency catches the case where the model occasionally reasons incorrectly but most paths reach the right answer. Expensive (7× tokens) but reliable. Use selectively on high-stakes queries.

Reasoning Models

Starting with OpenAI o1 (late 2024), a new model class emerged: reasoning models that perform extended chain-of-thought internally before responding. They expose a "thinking" mode that runs the model autoregressively on its own scratchpad, often for tens of thousands of tokens, before emitting the final answer.

  • OpenAI: o1, o1-mini, o3, GPT-5 reasoning variants
  • Anthropic: Claude Opus 4 / Sonnet 4 extended thinking
  • Google: Gemini 2.x Deep Think
  • Open-weight: DeepSeek-R1, QwQ

Practical implication: for math, code, scientific reasoning, complex planning — a reasoning model with a simple prompt often beats a non-reasoning model with elaborate CoT prompting. Cost-wise, reasoning models charge for "thinking tokens" you never see.

ReAct: Reasoning + Acting

For problems that need external information (web search, database lookup, calculation), pure reasoning isn't enough — the model needs to act. ReAct (Reason + Act) interleaves the two:

Thought: I need to find the population of Tokyo and compare it to NYC.
Action: search("population of Tokyo 2024")
Observation: 13.96 million
Thought: Now NYC.
Action: search("population of New York City 2024")
Observation: 8.34 million
Thought: Tokyo has about 1.67x as many people as NYC.
Final Answer: Tokyo has approximately 5.6 million more people than NYC (13.96M vs 8.34M).

In production, the "Action" lines are tool calls the runtime executes; "Observation" is the result fed back into the model's context. The loop continues until the model emits "Final Answer".

Function Calling / Tool Use

ReAct is implemented via function calling (OpenAI) / tool use (Anthropic). You declare tools as JSON schemas; the model emits structured calls; your code executes them.

{
  "tools": [{
    "name": "search_docs",
    "description": "Search internal documentation",
    "input_schema": {
      "type": "object",
      "properties": {
        "query": { "type": "string" },
        "max_results": { "type": "integer", "default": 5 }
      },
      "required": ["query"]
    }
  }, {
    "name": "create_ticket",
    "description": "Create a support ticket",
    "input_schema": { ... }
  }]
}

The runtime loop:

  1. Send user message + tool definitions
  2. Model returns either text OR a tool_use block
  3. If tool_use, run the function, append result as a tool_result, loop
  4. Eventually model returns final text

Designing Tools

Good tool design matters more than prompt wording:

  • Few, well-named tools — the model decides which to call based on the name and description
  • Clear input schema — required fields, types, enums
  • Descriptive errors — return "no results found, try a broader query" rather than empty array; the model uses your error text to course-correct
  • Idempotent when possible — agents retry; idempotency prevents duplicate side effects
  • Granularity that matches the work — a "create_jira_ticket" tool is better than separate "set_title" / "set_description" / "submit" tools that an agent must orchestrate

The Model Context Protocol (MCP)

Introduced by Anthropic in late 2024, MCP is an open standard for connecting LLMs to data sources and tools. Instead of every app reimplementing tool-execution logic, MCP servers expose tools via a standard protocol; any MCP-capable client (Claude Desktop, IDEs, agents) can use them. Adoption is growing fast across providers — worth following.

Agent Frameworks

For complex multi-tool workflows, agent frameworks help:

  • LangGraph: Graph-based agent orchestration; production-grade
  • CrewAI: Multi-agent collaboration with role-based agents
  • AutoGen (Microsoft): Conversation-based multi-agent patterns
  • OpenAI Agents SDK: Lightweight, opinionated, official

Choose the simplest one that solves your problem. Most production agents are 200-500 lines of code on top of the underlying SDK; frameworks help when you have many agents or complex routing.

When NOT to Use Agents

Agents are powerful but unpredictable, slow, and expensive. Heuristics:

WorkloadBetter fit
Single-turn Q&A from a knowledge basePlain RAG (next lesson)
Form filling / extractionStructured output, no agent
Multi-step research across the webAgent with search + browse tools
Complex multi-system workflows (Jira + Slack + email)Agent with one tool per system
Deterministic workflows you've written beforePlain code calling LLM where needed

The Quality Ladder

For a hard task, work up this ladder until quality is sufficient — stop at the lowest-effort tier that works:

  1. Plain prompt
  2. Few-shot examples
  3. Chain-of-thought
  4. Self-consistency / sampling
  5. Switch to a reasoning model
  6. Add retrieval (next lesson)
  7. Add tools (this lesson)
  8. Build an agent loop
  9. Fine-tune (lesson 6)

Each step costs more and adds complexity. Climb only as needed.

Key Takeaways

  • Chain-of-thought asks the model to reason step-by-step before answering — large quality lift on math/logic.
  • Self-consistency samples multiple reasoning paths and votes — robust against noisy single answers.
  • ReAct interleaves Reasoning and Acting: the model thinks, then calls a tool, observes, then thinks again.
  • Function calling / tool use is the API primitive that makes ReAct loops practical.
  • Reasoning models (o1, Claude Sonnet 4 thinking, Gemini Deep Think) bake chain-of-thought into the model itself.

Test your knowledge

Try exam-style practice questions to reinforce what you've learned.

Practice Questions →