LLM / GenAI Engineering Overlay

Appended on top of core.md for LLM-centric projects (fine-tuning, serving, agents, RAG).

Project Mindset

You are building with Large Language Models — outputs are non-deterministic by design. Prioritize: evaluation rigor, prompt versioning, cost awareness, and safety guardrails. Never assume a model is "working correctly" without systematic evaluation.

Prompt Engineering Standards

All prompts must be stored in version-controlled files — never inline in application code
Use a structured prompt template system (Jinja2, YAML, or dedicated prompt library)
Every prompt change must be accompanied by an eval run against a held-out test set
Document the model name, temperature, and system prompt for every deployment config
Prompt naming convention: {task}_{version}_{model}.txt
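
A minimal sketch of how the naming convention and template rule above might be enforced when a prompt is loaded. It uses the standard library's `string.Template` as a stand-in for Jinja2, and it assumes versions are written as `v` plus a number and that the model segment contains no underscores; `parse_prompt_name` and `render_prompt` are hypothetical helper names.

```python
import re
from pathlib import Path
from string import Template

# Assumed layout: {task}_{version}_{model}.txt, where version looks like "v3"
# and the model segment contains no underscores.
PROMPT_NAME = re.compile(r"^(?P<task>.+)_(?P<version>v\d+)_(?P<model>[^_]+)\.txt$")


def parse_prompt_name(filename: str) -> dict:
    """Split a prompt filename into its task, version, and model parts."""
    match = PROMPT_NAME.match(filename)
    if match is None:
        raise ValueError(f"prompt file name does not follow the convention: {filename}")
    return match.groupdict()


def render_prompt(path: Path, **variables: str) -> str:
    """Load a version-controlled prompt file and fill in its $placeholders.

    Template.substitute raises KeyError when a placeholder has no value,
    so a missing variable fails loudly instead of reaching the model.
    """
    parse_prompt_name(path.name)  # enforce the naming convention on load
    return Template(path.read_text(encoding="utf-8")).substitute(**variables)
```

Under these assumptions, `parse_prompt_name("summarize_v2_gpt-4o-2024-08-06.txt")` yields the task `summarize`, version `v2`, and model `gpt-4o-2024-08-06`. A prompt that needs a literal dollar sign has to write it as `$$` with this template syntax.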

Model Selection & Management

Always start with the smallest feasible model — upgrade only when eval proves it necessary
Maintain a model comparison table: model name, param count, latency, cost/1K tokens, eval score
Pin model versions in config (e.g., gpt-4o-2024-08-06, not just gpt-4o)
For self-hosted models, track: quantization method, serving framework, GPU memory usage
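
The pinning and comparison-table rules above could be checked in code along these lines. This is a sketch under two assumptions: pinned snapshots end in a `YYYY-MM-DD` date (as in `gpt-4o-2024-08-06`), and cost per 1K tokens is an acceptable proxy for "smallest". `ModelRow`, `is_pinned`, and `smallest_passing` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Assumption: a pinned snapshot ends in a YYYY-MM-DD date, e.g. gpt-4o-2024-08-06.
# Providers that version their models differently need their own rule.
DATED_SNAPSHOT = re.compile(r"-\d{4}-\d{2}-\d{2}$")


def is_pinned(model_name: str) -> bool:
    """Return True when the model name carries a dated snapshot suffix."""
    return DATED_SNAPSHOT.search(model_name) is not None


@dataclass(frozen=True)
class ModelRow:
    """One row of the model comparison table."""
    name: str
    param_count: str  # free text, since closed models may not publish this
    p50_latency_s: float
    cost_per_1k_tokens_usd: float
    eval_score: float


def smallest_passing(rows: list[ModelRow], min_score: float) -> ModelRow:
    """Return the cheapest model whose eval score meets the threshold.

    Cost is used here as a stand-in for model size.
    """
    passing = [row for row in rows if row.eval_score >= min_score]
    if not passing:
        raise ValueError("no model meets the eval threshold")
    return min(passing, key=lambda row: row.cost_per_1k_tokens_usd)
```

A config loader could call `is_pinned` at startup and refuse to run with a floating alias such as `gpt-4o`.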

Evaluation Framework

Define eval metrics before building the pipeline, not after
Every task must have both automated metrics and human evaluation criteria
Minimum eval dimensions: correctness, relevance, safety, latency, cost
Use an eval dataset of ≥100 samples; track eval results over time in a leaderboard
Red-team testing is mandatory before any user-facing deployment
LLM-as-judge requires calibration against human ratings — report agreement rate
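
One way to report judge calibration is raw agreement together with Cohen's kappa, which corrects for agreement expected by chance. The sketch below assumes human and judge ratings are categorical labels aligned by index; using kappa is an illustrative choice, not something this overlay prescribes.

```python
from collections import Counter


def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of samples where the LLM judge matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Kappa is worth reporting next to the raw rate because a judge that always answers "pass" can show high raw agreement on a mostly passing dataset while carrying no signal.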

Cost & Latency Awareness

Log token usage (input + output) for every API call
Set budget alerts and hard limits per environment (dev / staging / prod)
Estimate cost per query at design time — include in architecture docs
Prefer caching (semantic or exact) for repeated or similar queries
Streaming responses: measure Time-to-First-Token (TTFT) separately from total latency
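
The cost estimate and the TTFT measurement above can be sketched as follows. Prices are parameters, not real rates, and `timed_stream` assumes the stream is a lazy iterable that starts producing only when iterated; if the request is sent earlier, capture the start time before that call. Both function names are hypothetical.

```python
import time
from typing import Iterable, Iterator


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Cost of one call, with input and output tokens priced separately."""
    return (input_tokens / 1000) * input_price_per_1k + (output_tokens / 1000) * output_price_per_1k


def timed_stream(chunks: Iterable[str], timings: dict) -> Iterator[str]:
    """Yield chunks unchanged while recording TTFT and total latency in seconds.

    The clock starts when iteration begins (assumption: the stream is lazy).
    "ttft_s" is only set if at least one chunk arrives.
    """
    start = time.perf_counter()
    first_seen = False
    for chunk in chunks:
        if not first_seen:
            timings["ttft_s"] = time.perf_counter() - start
            first_seen = True
        yield chunk
    timings["total_s"] = time.perf_counter() - start
```

Wrapping the provider's stream in `timed_stream` leaves the consuming code unchanged, and the `timings` dict can be logged next to the token counts for the same call.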

Safety & Guardrails

All user-facing LLM outputs must pass through content filtering
Implement input validation: reject prompt injection attempts and prompts that contain PII
Define and enforce output schema — use structured output / JSON mode where supported
Maintain a "known failure" test suite — adversarial inputs that previously caused issues
Log all model inputs/outputs for audit (with PII redaction)
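
A rough sketch of two of the guardrails above: redacting PII before logging and enforcing an output schema. The regex patterns are illustrative only (simple emails and US-style phone numbers) and are not a complete PII detector; the `answer` and `sources` fields are a hypothetical schema chosen for the example.

```python
import json
import re

# Illustrative patterns only: simple emails and US-style phone numbers.
# Real deployments usually need a dedicated PII detection tool.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_pii(text: str) -> str:
    """Replace matched emails and phone numbers before the text is logged."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


# Hypothetical output schema: required field name -> expected Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list}


def parse_model_output(raw: str) -> dict:
    """Parse model output as JSON and check it against REQUIRED_FIELDS."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("model output must be a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"field {field!r} is missing or not {expected_type.__name__}")
    return data
```

Raw outputs that fail `parse_model_output` are natural candidates for the "known failure" test suite once the inputs that triggered them have been redacted.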

Fine-Tuning Checklist (remind me if I skip any)

[ ] Base model selected with justification
[ ] Training data cleaned and deduplicated
[ ] Eval dataset created (separate from training data)
[ ] Baseline established (base model zero-shot / few-shot)
[ ] Hyperparameters logged to experiment tracker
[ ] Output quality evaluated (automated + human)
[ ] Cost analysis: training cost + projected inference cost
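
The deduplication and train/eval separation items in the checklist can be partly automated. The sketch below only catches exact duplicates after lowercasing and whitespace normalization; near-duplicates would need fuzzy matching, which is out of scope here. The function names are hypothetical.

```python
import hashlib


def _fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(examples: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized example, preserving order."""
    seen = set()
    unique = []
    for example in examples:
        key = _fingerprint(example)
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


def find_leakage(train: list[str], eval_set: list[str]) -> list[str]:
    """Return eval examples whose normalized text also appears in training data."""
    train_keys = {_fingerprint(example) for example in train}
    return [example for example in eval_set if _fingerprint(example) in train_keys]
```

A non-empty result from `find_leakage` means the eval dataset is not separate from the training data, so that checklist item should stay unchecked.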