Prompt Engineering and Retrieval-Augmented Generation — AI and ML Fundamentals | CertQnA

You can ask an LLM the same question in five different ways and get five different qualities of answer. Prompt engineering is the discipline of writing prompts that reliably produce the output you want — and RAG is the architecture that lets LLMs answer questions about data they were never trained on.

Anatomy of a Prompt

Modern LLMs accept a structured conversation with multiple roles:

Role	Purpose
system	Sets the model's persona, behavioural rules, and output format. Sent first.
user	The actual question or instruction from the human.
assistant	The model's previous responses — included to maintain conversation history.
tool (or function)	Output from a tool the model called — covered in the next lesson.

Example system prompt: "You are a senior SQL engineer. When the user asks for a query, return only the SQL with no explanation. Use PostgreSQL syntax. If the request is ambiguous, ask one clarifying question."

Zero-Shot, Few-Shot, and Many-Shot

Zero-shot

Just ask. The model has not been shown any examples for this specific task.

Classify the sentiment of this review as positive or negative:
"The food was cold and the service was rude."

Few-shot

Include a few input/output examples in the prompt to demonstrate the format.

Classify sentiment as positive or negative.

Review: "Best meal I've had this year." → positive
Review: "Waited 45 minutes and the food was lukewarm." → negative
Review: "The service was friendly and the dessert was outstanding." → positive

Review: "The food was cold and the service was rude." →

Few-shot examples often dramatically improve quality and consistency, especially for structured outputs.

Chain-of-Thought (CoT) Prompting

For tasks that require reasoning (math, multi-step logic, analysis), explicitly asking the model to think step-by-step before answering improves accuracy substantially:

A train leaves station A at 60 mph. Another train leaves station B
(120 miles away) towards A at 40 mph at the same time. When do they meet?

Think step by step, then give the final answer.

Modern frontier models (OpenAI o-series, Claude with extended thinking, Gemini) do this internally — they have a "reasoning" mode that produces a hidden chain of thought before the final answer. This is one of the main reasons modern models outperform earlier ones on math, coding, and logic tasks.

Other Useful Techniques

Role prompting: "You are an expert tax attorney..." — frames the response style.
Output format specification: "Respond in JSON with keys 'name' and 'score'" — many LLMs support strict JSON schema mode.
Self-critique / reflexion: Ask the model to critique its first answer and produce a revised version.
Decomposition: Break a complex task into a sequence of simpler prompts; chain the outputs together.
Negative instructions: "Do not include any markdown formatting" — often more effective than positive instructions for forbidden behaviour.

The Limits of Prompting Alone

LLMs have two fundamental limitations:

Their training data has a cutoff — they don't know about events after a certain date.
They have never seen your private data — your company's documents, your customer records, your internal wiki.

Hallucination — the model fabricating plausible-sounding but false answers — is most acute when asked about information outside its training data. The fix: RAG.

Retrieval-Augmented Generation (RAG)

RAG injects relevant context into the prompt at query time:

Index: Split your documents into chunks, compute embeddings (using an embedding model), store in a vector database (Pinecone, Weaviate, pgvector, Qdrant).
Retrieve: When a user asks a question, embed the query and find the top-k most similar chunks via cosine similarity.
Augment: Construct a prompt that includes those chunks as context.
Generate: Send the augmented prompt to the LLM, which now has the relevant facts to ground its answer.

Use the following context to answer the question. If the context
does not contain the answer, say "I don't know."

Context:
[chunk 1: company policy paragraph]
[chunk 2: another policy paragraph]
[chunk 3: relevant FAQ entry]

Question: What is the company's policy on remote work?

Answer:

RAG is the dominant pattern for production LLM applications: customer-support bots over a knowledge base, internal Q&A over company wikis, search over codebases. It avoids the cost of fine-tuning and lets you update knowledge instantly by re-indexing.

Frameworks for Prompting and RAG

LangChain and LlamaIndex: Python frameworks for chaining prompts, RAG, and tool use. The most widely used.
Vercel AI SDK: TypeScript-first framework for streaming chat UIs and structured generation.
Semantic Kernel: Microsoft's equivalent, integrates with .NET and Azure AI.
Native APIs: OpenAI, Anthropic, and Google now offer first-party tooling for structured outputs, function calling, and (increasingly) built-in RAG.