LLM / GenAI Engineering Overlay
Appended on top of core.md for LLM-centric projects (fine-tuning, serving, agents, RAG).
Project Mindset
You are building with Large Language Models — outputs are non-deterministic by design. Prioritize: evaluation rigor, prompt versioning, cost awareness, and safety guardrails. Never assume a model is "working correctly" without systematic evaluation.
Prompt Engineering Standards
- All prompts must be stored in version-controlled files — never inline in application code
- Use a structured prompt template system (Jinja2, YAML, or dedicated prompt library)
- Every prompt change must be accompanied by an eval run against a held-out test set
- Document the model name, temperature, and system prompt for every deployment config
- Prompt naming convention: `{task}_{version}_{model}.txt`
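The naming convention above can be checked mechanically when prompts are loaded. The following is a minimal sketch, not a prescribed loader: `PromptId` and `parse_prompt_filename` are hypothetical names, and it assumes that the version and model parts contain no underscores (the task part may), so the filename is split from the right.

```python
from dataclasses import dataclass
from pathlib import PurePath


@dataclass(frozen=True)
class PromptId:
    """Parts of a prompt filename (hypothetical helper type)."""
    task: str
    version: str
    model: str


def parse_prompt_filename(filename: str) -> PromptId:
    """Split '{task}_{version}_{model}.txt' into its three parts.

    Assumption: version and model contain no underscores, so the stem is
    split from the right; the task name itself may contain underscores.
    """
    path = PurePath(filename)
    if path.suffix != ".txt":
        raise ValueError(f"expected a .txt prompt file, got {filename!r}")
    parts = path.stem.rsplit("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"{filename!r} does not match task_version_model.txt")
    return PromptId(*parts)


if __name__ == "__main__":
    print(parse_prompt_filename("ticket_summary_v3_gpt-4o-2024-08-06.txt"))
```

A check like this could run in CI over the prompt directory so that misnamed files are caught before an eval run.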
Model Selection & Management
- Always start with the smallest feasible model — upgrade only when eval proves it necessary
- Maintain a model comparison table: model name, param count, latency, cost/1K tokens, eval score
- Pin model versions in config (e.g., `gpt-4o-2024-08-06`, not just `gpt-4o`)
- For self-hosted models, track: quantization method, serving framework, GPU memory usage
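One way to enforce version pinning is a small lint over the deployment config. This sketch assumes that a pinned identifier ends in a snapshot date, either `YYYY-MM-DD` (as in `gpt-4o-2024-08-06`) or `YYYYMMDD`. Providers differ in how they name snapshots, so the pattern is an assumption to adapt, and `is_pinned` and `unpinned_models` are hypothetical helper names.

```python
import re

# Assumption: pinned model names end in a snapshot date, either
# YYYY-MM-DD (as in gpt-4o-2024-08-06) or YYYYMMDD.
_SNAPSHOT_SUFFIX = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8})$")


def is_pinned(model_name: str) -> bool:
    """Return True if the model name carries a snapshot-date suffix."""
    return _SNAPSHOT_SUFFIX.search(model_name) is not None


def unpinned_models(config: dict[str, str]) -> list[str]:
    """Return the config keys whose model name is a floating alias."""
    return sorted(key for key, model in config.items() if not is_pinned(model))


if __name__ == "__main__":
    # Hypothetical per-environment config.
    config = {"prod": "gpt-4o-2024-08-06", "dev": "gpt-4o"}
    print(unpinned_models(config))
```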
Evaluation Framework
- Define eval metrics before building the pipeline, not after
- Every task must have: automated metrics + human evaluation criteria
- Minimum eval dimensions: correctness, relevance, safety, latency, cost
- Use an eval dataset of ≥100 samples; track eval results over time in a leaderboard
- Red-team testing is mandatory before any user-facing deployment
- LLM-as-judge requires calibration against human ratings — report agreement rate
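Calibrating an LLM judge against human ratings can be reduced to comparing two label lists over the same samples. The sketch below computes the raw agreement rate and also Cohen's kappa, which corrects for agreement expected by chance. Kappa is an addition here and not something the guideline requires. The function names and the `"pass"`/`"fail"` labels are illustrative assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def agreement_rate(judge: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Fraction of samples where the judge label equals the human label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between the two raters."""
    p_observed = agreement_rate(judge, human)
    n = len(judge)
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    # Expected agreement if both raters labelled independently
    # with their own observed label frequencies.
    p_expected = sum(judge_counts[c] * human_counts[c] for c in judge_counts) / (n * n)
    if p_expected == 1.0:
        # Kappa is undefined when both raters use one identical label;
        # this sketch returns 1.0 because agreement is perfect.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    judge = ["pass", "pass", "fail", "fail"]
    human = ["pass", "fail", "fail", "fail"]
    print(agreement_rate(judge, human), cohens_kappa(judge, human))
```

Reporting both numbers helps when one label dominates the eval set, since raw agreement can look high there even if the judge is uninformative.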
Cost & Latency Awareness
- Log token usage (input + output) for every API call
- Set budget alerts and hard limits per environment (dev / staging / prod)
- Estimate cost per query at design time — include in architecture docs
- Prefer caching (semantic or exact) for repeated or similar queries
- Streaming responses: measure Time-to-First-Token (TTFT) separately from total latency
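Per-query cost and TTFT can both be captured with a few lines around the model call. This is a sketch under stated assumptions: `Price`, `query_cost`, `StreamTiming`, and `timed_stream` are hypothetical names, the prices in the example are placeholders and not real rates, and the stream is modelled as any iterable of text chunks, not a specific SDK's response object.

```python
import time
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Price:
    """USD per 1K tokens; values are supplied by the caller."""
    input_per_1k: float
    output_per_1k: float


def query_cost(input_tokens: int, output_tokens: int, price: Price) -> float:
    """Cost of one call from logged input and output token counts."""
    return (input_tokens / 1000) * price.input_per_1k + (
        output_tokens / 1000
    ) * price.output_per_1k


@dataclass
class StreamTiming:
    ttft_s: Optional[float] = None   # time until the first chunk arrived
    total_s: Optional[float] = None  # time until the stream was exhausted


def timed_stream(chunks: Iterable[str], timing: StreamTiming) -> Iterator[str]:
    """Yield chunks unchanged while recording TTFT and total latency.

    The clock starts when the consumer first pulls from this generator,
    so the request should be issued lazily by `chunks` for TTFT to
    include the network round trip.
    """
    start = time.perf_counter()
    for chunk in chunks:
        if timing.ttft_s is None:
            timing.ttft_s = time.perf_counter() - start
        yield chunk
    timing.total_s = time.perf_counter() - start


if __name__ == "__main__":
    timing = StreamTiming()
    text = "".join(timed_stream(iter(["Hel", "lo"]), timing))
    # Placeholder prices, for illustration only.
    cost = query_cost(1200, 300, Price(input_per_1k=0.5, output_per_1k=1.5))
    print(text, timing, cost)
```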
Safety & Guardrails
- All user-facing LLM outputs must pass through content filtering
- Implement input validation: detect and reject prompt injection attempts and prompts that contain PII
- Define and enforce output schema — use structured output / JSON mode where supported
- Maintain a "known failure" test suite — adversarial inputs that previously caused issues
- Log all model inputs/outputs for audit (with PII redaction)
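Two of the guardrails above, schema enforcement and PII redaction before logging, can be sketched with the standard library alone. The schema, the field names, and the helper names below are hypothetical. The email regex is a deliberately simple illustration and does not cover other PII types, so a production system would likely rely on a dedicated validation library and PII detector.

```python
import json
import re

# Illustrative pattern only: matches common email shapes, nothing else.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Hypothetical output schema: field name -> required Python type.
SCHEMA = {"label": str, "reason": str}


def redact_pii(text: str) -> str:
    """Replace email addresses before the text is written to audit logs."""
    return _EMAIL.sub("[REDACTED_EMAIL]", text)


def parse_model_output(raw: str, schema: dict = SCHEMA) -> dict:
    """Parse a model response and reject anything that violates the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    missing = schema.keys() - data.keys()
    extra = data.keys() - schema.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for field, expected in schema.items():
        if not isinstance(data[field], expected):
            raise ValueError(f"{field!r} must be of type {expected.__name__}")
    return data


if __name__ == "__main__":
    raw = '{"label": "refund", "reason": "user jane.doe@example.com asked"}'
    parsed = parse_model_output(raw)
    print(redact_pii(json.dumps(parsed)))
```

Inputs that fail `parse_model_output` are natural candidates for the "known failure" test suite.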
Fine-Tuning Checklist (remind me if I skip any)
- [ ] Base model selected with justification
- [ ] Training data cleaned and deduplicated
- [ ] Eval dataset created (separate from training data)
- [ ] Baseline established (base model zero-shot / few-shot)
- [ ] Hyperparameters logged to experiment tracker
- [ ] Output quality evaluated (automated + human)
- [ ] Cost analysis: training cost + projected inference cost