Advanced Prompting: Chain-of-Thought, ReAct, and Tools — Generative AI & Prompt Engineering | CertQnA

Out-of-the-box, LLMs can plausibly answer most questions — but accuracy on multi-step reasoning, math, and tasks requiring external information depends heavily on how you ask. The techniques in this lesson are the tools production teams use to get reliable results on hard problems.

Chain-of-Thought (CoT)

The classic technique. Instead of "What is the answer?" you ask "Think step by step, then give the answer."

Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the
golf balls are blue. How many blue golf balls are there?

A: Let's work through this step by step.
1. Total balls: 16
2. Half are golf balls: 16 / 2 = 8 golf balls
3. Half of golf balls are blue: 8 / 2 = 4 blue golf balls
Answer: 4

For 2023-era GPT-3.5, CoT roughly doubled accuracy on grade-school math word problems. Modern models still benefit on harder problems, but the effect is smaller because they have learned to CoT implicitly.

Zero-shot CoT

The phrase "Let's think step by step" added to the prompt is enough — no examples needed.

Few-shot CoT

Even stronger: include 2-3 examples showing the reasoning style you want:

Q: Anna had 12 apples. She gave 4 to Ben and ate 2. How many remain?
A: Anna started with 12. Gave 4 → 12 - 4 = 8. Ate 2 → 8 - 2 = 6. Answer: 6.

Q: A train travels 60 mph for 2.5 hours. How far does it go?
A: Distance = speed × time = 60 × 2.5 = 150 miles. Answer: 150 miles.

Q: {{ new question }}
A:

Self-Consistency

Sample N different reasoning paths (temperature > 0), parse the final answer from each, and majority-vote:

prompt → sample 7 responses → extract answer from each → most common = output

Self-consistency catches the case where the model occasionally reasons incorrectly but most paths reach the right answer. Expensive (7× tokens) but reliable. Use selectively on high-stakes queries.

Reasoning Models

Starting with OpenAI o1 (late 2024), a new model class emerged: reasoning models that perform extended chain-of-thought internally before responding. They expose a "thinking" mode that runs the model autoregressively on its own scratchpad, often for tens of thousands of tokens, before emitting the final answer.

OpenAI: o1, o1-mini, o3, GPT-5 reasoning variants
Anthropic: Claude Opus 4 / Sonnet 4 extended thinking
Google: Gemini 2.x Deep Think
Open-weight: DeepSeek-R1, QwQ

Practical implication: for math, code, scientific reasoning, complex planning — a reasoning model with a simple prompt often beats a non-reasoning model with elaborate CoT prompting. Cost-wise, reasoning models charge for "thinking tokens" you never see.

ReAct: Reasoning + Acting

For problems that need external information (web search, database lookup, calculation), pure reasoning isn't enough — the model needs to act. ReAct (Reason + Act) interleaves the two:

Thought: I need to find the population of Tokyo and compare it to NYC.
Action: search("population of Tokyo 2024")
Observation: 13.96 million
Thought: Now NYC.
Action: search("population of New York City 2024")
Observation: 8.34 million
Thought: Tokyo has about 1.67x as many people as NYC.
Final Answer: Tokyo has approximately 5.6 million more people than NYC (13.96M vs 8.34M).

In production, the "Action" lines are tool calls the runtime executes; "Observation" is the result fed back into the model's context. The loop continues until the model emits "Final Answer".

Function Calling / Tool Use

ReAct is implemented via function calling (OpenAI) / tool use (Anthropic). You declare tools as JSON schemas; the model emits structured calls; your code executes them.

{
  "tools": [{
    "name": "search_docs",
    "description": "Search internal documentation",
    "input_schema": {
      "type": "object",
      "properties": {
        "query": { "type": "string" },
        "max_results": { "type": "integer", "default": 5 }
      },
      "required": ["query"]
    }
  }, {
    "name": "create_ticket",
    "description": "Create a support ticket",
    "input_schema": { ... }
  }]
}

The runtime loop:

Send user message + tool definitions
Model returns either text OR a tool_use block
If tool_use, run the function, append result as a tool_result, loop
Eventually model returns final text

Designing Tools

Good tool design matters more than prompt wording:

Few, well-named tools — the model decides which to call based on the name and description
Clear input schema — required fields, types, enums
Descriptive errors — return "no results found, try a broader query" rather than empty array; the model uses your error text to course-correct
Idempotent when possible — agents retry; idempotency prevents duplicate side effects
Granularity that matches the work — a "create_jira_ticket" tool is better than separate "set_title" / "set_description" / "submit" tools that an agent must orchestrate

The Model Context Protocol (MCP)

Introduced by Anthropic in late 2024, MCP is an open standard for connecting LLMs to data sources and tools. Instead of every app reimplementing tool-execution logic, MCP servers expose tools via a standard protocol; any MCP-capable client (Claude Desktop, IDEs, agents) can use them. Adoption is growing fast across providers — worth following.

Agent Frameworks

For complex multi-tool workflows, agent frameworks help:

LangGraph: Graph-based agent orchestration; production-grade
CrewAI: Multi-agent collaboration with role-based agents
AutoGen (Microsoft): Conversation-based multi-agent patterns
OpenAI Agents SDK: Lightweight, opinionated, official

Choose the simplest one that solves your problem. Most production agents are 200-500 lines of code on top of the underlying SDK; frameworks help when you have many agents or complex routing.

When NOT to Use Agents

Agents are powerful but unpredictable, slow, and expensive. Heuristics:

Workload	Better fit
Single-turn Q&A from a knowledge base	Plain RAG (next lesson)
Form filling / extraction	Structured output, no agent
Multi-step research across the web	Agent with search + browse tools
Complex multi-system workflows (Jira + Slack + email)	Agent with one tool per system
Deterministic workflows you've written before	Plain code calling LLM where needed

The Quality Ladder

For a hard task, work up this ladder until quality is sufficient — stop at the lowest-effort tier that works:

Plain prompt
Few-shot examples
Chain-of-thought
Self-consistency / sampling
Switch to a reasoning model
Add retrieval (next lesson)
Add tools (this lesson)
Build an agent loop
Fine-tune (lesson 6)

Each step costs more and adds complexity. Climb only as needed.