Skip to content

Snippet: Prompt Engineering

Domain Context

Designing, testing, and iterating on prompts for LLM-powered features. Prompts are code — they deserve the same versioning, testing, and review rigor.

Prompt Design Principles

  • Start with the simplest prompt that could work — add complexity only when eval shows it's needed
  • Structure: [System context] → [Task instruction] → [Format spec] → [Examples] → [Input]
  • Be explicit about what NOT to do — LLMs follow positive instructions better but need negative guardrails
  • Specify output format precisely: JSON schema, bullet list, or exact template
  • One task per prompt — if you need multiple steps, chain separate prompts

Prompt Patterns (use by name in code comments)

  • Zero-shot: direct instruction, no examples — baseline for every task
  • Few-shot: 3-5 diverse examples covering edge cases — most reliable improvement
  • Chain-of-Thought (CoT): "Think step by step" — effective for reasoning tasks
  • Self-Consistency: run CoT multiple times, majority vote — improves accuracy at higher cost
  • ReAct: interleave reasoning + tool actions — for tasks requiring external information
  • Structured Output: force JSON/XML schema — essential for downstream parsing

Version Control

  • Every prompt must have a version identifier: {task}_v{N}_{model}
  • Store prompts in dedicated files (YAML, Jinja2, or .txt) — never inline in application code
  • Track prompt changes in git with meaningful commit messages explaining the change reason
  • Maintain a prompt changelog: what changed, why, and eval results before/after

Testing & Evaluation

  • Every prompt change must be evaluated on a held-out test set (minimum 50 examples)
  • Define pass/fail criteria BEFORE writing the prompt — not after seeing results
  • Test edge cases explicitly: empty input, very long input, adversarial input, multilingual
  • A/B test prompts in production with traffic splitting — offline eval is necessary but insufficient
  • Log prompt performance over time — model updates can silently degrade prompt effectiveness

Optimization Techniques

  • Token efficiency: remove unnecessary words, use concise instructions — cost scales with tokens
  • Temperature tuning: 0 for deterministic tasks, 0.3-0.7 for creative tasks, document the choice
  • Few-shot example selection: choose examples similar to expected input distribution
  • Dynamic few-shot: retrieve relevant examples per query rather than fixed examples
  • Prompt compression: for long contexts, summarize or extract key information before feeding to model

System Prompt Best Practices

  • Define role, tone, constraints, and output format in the system prompt
  • Keep system prompts stable — change task prompts, not system prompts, for iteration
  • Include explicit safety constraints: what topics to refuse, what data not to expose
  • Test system prompt robustness against prompt injection attempts

Common Pitfalls

  • Prompt sensitivity: minor wording changes cause major output differences — always eval
  • Example bias: few-shot examples that are too similar cause the model to parrot patterns
  • Instruction following degrades with context length — put critical instructions at start AND end
  • Model-specific prompts: a prompt optimized for GPT-4 may perform poorly on Claude or Llama
  • Forgetting to update prompts when the model version changes