Snippet: Prompt Engineering¶

Domain Context¶

Designing, testing, and iterating on prompts for LLM-powered features. Prompts are code — they deserve the same versioning, testing, and review rigor.

Prompt Design Principles¶

Start with the simplest prompt that could work — add complexity only when eval shows it's needed
Structure: [System context] → [Task instruction] → [Format spec] → [Examples] → [Input]
Be explicit about what NOT to do — LLMs follow positive instructions better but need negative guardrails
Specify output format precisely: JSON schema, bullet list, or exact template
One task per prompt — if you need multiple steps, chain separate prompts

Prompt Patterns (use by name in code comments)¶

Zero-shot: direct instruction, no examples — baseline for every task
Few-shot: 3-5 diverse examples covering edge cases — most reliable improvement
Chain-of-Thought (CoT): "Think step by step" — effective for reasoning tasks
Self-Consistency: run CoT multiple times, majority vote — improves accuracy at higher cost
ReAct: interleave reasoning + tool actions — for tasks requiring external information
Structured Output: force JSON/XML schema — essential for downstream parsing

Version Control¶

Every prompt must have a version identifier: {task}_v{N}_{model}
Store prompts in dedicated files (YAML, Jinja2, or .txt) — never inline in application code
Track prompt changes in git with meaningful commit messages explaining the change reason
Maintain a prompt changelog: what changed, why, and eval results before/after

Testing & Evaluation¶

Every prompt change must be evaluated on a held-out test set (minimum 50 examples)
Define pass/fail criteria BEFORE writing the prompt — not after seeing results
Test edge cases explicitly: empty input, very long input, adversarial input, multilingual
A/B test prompts in production with traffic splitting — offline eval is necessary but insufficient
Log prompt performance over time — model updates can silently degrade prompt effectiveness

Optimization Techniques¶

Token efficiency: remove unnecessary words, use concise instructions — cost scales with tokens
Temperature tuning: 0 for deterministic tasks, 0.3-0.7 for creative tasks, document the choice
Few-shot example selection: choose examples similar to expected input distribution
Dynamic few-shot: retrieve relevant examples per query rather than fixed examples
Prompt compression: for long contexts, summarize or extract key information before feeding to model

System Prompt Best Practices¶

Define role, tone, constraints, and output format in the system prompt
Keep system prompts stable — change task prompts, not system prompts, for iteration
Include explicit safety constraints: what topics to refuse, what data not to expose
Test system prompt robustness against prompt injection attempts

Common Pitfalls¶

Prompt sensitivity: minor wording changes cause major output differences — always eval
Example bias: few-shot examples that are too similar cause the model to parrot patterns
Instruction following degrades with context length — put critical instructions at start AND end
Model-specific prompts: a prompt optimized for GPT-4 may perform poorly on Claude or Llama
Forgetting to update prompts when the model version changes