Snippet: Responsible AI & Safety

Domain Context

These rules aim to keep AI systems fair, safe, transparent, and accountable. They apply to any model that affects people, which is nearly all of them.

Define protected attributes upfront (age, gender, ethnicity, etc.) — document which ones are relevant and why
Measure fairness metrics across protected groups using AIF360 or Fairlearn:
Demographic parity difference (< 0.1 for low-risk, < 0.05 for high-risk applications)
Equalized odds difference (< 0.1)
Calibration: mean predicted probability should differ by no more than 5% across groups
Disparate impact check: any metric gap >20% across groups (comparable to the four-fifths rule of thumb) requires investigation and documentation
Training data audit: check for representation imbalance and historical bias before training — log group counts
If debiasing is applied, document the method (resampling, reweighting, adversarial) and its impact on overall performance
Intersectional analysis: check bias across combinations of protected attributes (e.g., age × gender), not just individual axes
Run bias evaluation on a held-out slice analysis set — never reuse the general test set for fairness audits
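
The two difference metrics above can be sketched without any fairness library. This is a minimal, dependency-free illustration, assuming binary 0/1 labels and predictions; the function names, the `DP_LIMIT` keys, and the toy data are hypothetical, and in practice AIF360 or Fairlearn would supply the metrics. Equalized odds difference is taken here as the larger of the TPR gap and the FPR gap, which is one common definition. Passing tuples as group labels gives an intersectional view (e.g., gender × age).

```python
from collections import defaultdict

# Thresholds taken from the rules above; the risk-tier keys are illustrative.
DP_LIMIT = {"low_risk": 0.1, "high_risk": 0.05}
EO_LIMIT = 0.1


def _rate(numerator, denominator):
    """Return numerator / denominator, or None if the denominator is zero."""
    return numerator / denominator if denominator else None


def group_rates(y_true, y_pred, groups):
    """Compute selection rate, TPR, and FPR per group for binary 0/1 labels."""
    counts = defaultdict(
        lambda: {"n": 0, "pred_pos": 0, "pos": 0, "tp": 0, "neg": 0, "fp": 0}
    )
    for t, p, g in zip(y_true, y_pred, groups):
        c = counts[g]
        c["n"] += 1          # group count, worth logging for the data audit
        c["pred_pos"] += p
        if t == 1:
            c["pos"] += 1
            c["tp"] += p
        else:
            c["neg"] += 1
            c["fp"] += p
    return {
        g: {
            "n": c["n"],
            "selection_rate": _rate(c["pred_pos"], c["n"]),
            "tpr": _rate(c["tp"], c["pos"]),
            "fpr": _rate(c["fp"], c["neg"]),
        }
        for g, c in counts.items()
    }


def _gap(rates, key):
    """Largest minus smallest value of one rate across groups."""
    values = [r[key] for r in rates.values() if r[key] is not None]
    return max(values) - min(values) if values else 0.0


def demographic_parity_difference(rates):
    return _gap(rates, "selection_rate")


def equalized_odds_difference(rates):
    return max(_gap(rates, "tpr"), _gap(rates, "fpr"))


# Toy fairness-audit slice (made-up data, not the general test set).
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 0]
gender = ["f", "f", "f", "f", "m", "m", "m", "m"]
age = ["<40", "<40", "40+", "40+", "<40", "<40", "40+", "40+"]

by_gender = group_rates(y_true, y_pred, gender)
dp_gender = demographic_parity_difference(by_gender)   # 0.75 - 0.25 = 0.5
eo_gender = equalized_odds_difference(by_gender)       # max(0.5, 0.5) = 0.5

# Intersectional analysis: use (gender, age) tuples as the group label.
by_cell = group_rates(y_true, y_pred, list(zip(gender, age)))
dp_cell = demographic_parity_difference(by_cell)       # 1.0 - 0.0 = 1.0

passes_low_risk = dp_gender < DP_LIMIT["low_risk"] and eo_gender < EO_LIMIT
```

On this toy slice the single-axis gap is 0.5 while the intersectional gap is 1.0, which illustrates why the combined cells need their own check.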

All user-facing models must have output filtering for toxicity (Perspective API / Detoxify; block at scores > 0.7), PII (Presidio), and harmful content
Input validation: detect and reject adversarial inputs, prompt injections (rebuff / LLM-Guard), out-of-distribution queries
Define a "refuse to answer" policy — model should abstain rather than produce harmful output; document the refusal categories
Maintain a safety test suite: ≥100 adversarial prompts covering toxicity, bias, PII leakage, jailbreaks — version-controlled
Regular red-team testing: quarterly at minimum for production systems, with documented findings and remediation timeline
Toxicity threshold: block outputs scoring >0.7 on Perspective API; flag for human review at >0.5
Rate limiting: cap per-user requests to prevent abuse (e.g., 60 req/min for general use, stricter for generation endpoints)
Content classifiers must be updated quarterly — new attack patterns emerge faster than model update cycles
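
The toxicity thresholds and the per-user rate cap above can be sketched as follows. This is an illustration only: `route_output` and `SlidingWindowRateLimiter` are hypothetical names, the toxicity score is assumed to arrive as a float in [0, 1] from whichever classifier is in use, and a sliding window is just one of several reasonable rate-limiting strategies.

```python
import time
from collections import defaultdict, deque

BLOCK_THRESHOLD = 0.7   # block above this score (from the rules above)
REVIEW_THRESHOLD = 0.5  # flag for human review above this score


def route_output(toxicity_score):
    """Map a toxicity score in [0, 1] to 'block', 'review', or 'allow'."""
    if toxicity_score > BLOCK_THRESHOLD:
        return "block"
    if toxicity_score > REVIEW_THRESHOLD:
        return "review"
    return "allow"


class SlidingWindowRateLimiter:
    """Allow at most max_requests per user within the trailing window."""

    def __init__(self, max_requests=60, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock                 # injectable so tests can fake time
        self._hits = defaultdict(deque)     # user_id -> timestamps of accepted requests

    def allow(self, user_id):
        now = self._clock()
        hits = self._hits[user_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Usage: the general-use cap from the rules above (60 req/min).
limiter = SlidingWindowRateLimiter(max_requests=60, window_seconds=60.0)
if limiter.allow("user-123"):
    action = route_output(0.62)   # "review"
```

A stricter limiter for generation endpoints would be a second instance with a lower `max_requests`.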

Document model limitations explicitly — what the model CANNOT do is as important as what it can
Provide explanations appropriate to the audience: SHAP/LIME for data scientists, plain language for end users
Model cards: every production model must have one — include intended use, limitations, performance by demographic group, training data summary
Decision audit trail: for high-stakes decisions (credit, hiring, medical), log inputs, model version, outputs, and explanations — retain for ≥2 years
Confidence/uncertainty: expose prediction confidence to downstream consumers; define a minimum confidence threshold (e.g., 0.7) below which the model defers to human judgment
Watermarking: for generative models, consider output watermarking for traceability (e.g., text watermarking via green/red token lists)
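
The confidence threshold and the decision audit trail above might be combined along these lines. This is a sketch under stated assumptions: `decide`, `audit_record`, and all field names are hypothetical, and the record is serialized as JSON only for illustration. Storage, access control, and the ≥2-year retention are out of scope here.

```python
import json
from datetime import datetime, timezone

MIN_CONFIDENCE = 0.7  # example threshold from the rules above


def decide(label, confidence, min_confidence=MIN_CONFIDENCE):
    """Return the model's label, or defer to a human when confidence is too low."""
    deferred = confidence < min_confidence
    return {
        "decision": None if deferred else label,
        "confidence": confidence,
        "deferred_to_human": deferred,
    }


def audit_record(inputs, model_version, outcome, explanation, now=None):
    """Build one JSON audit-trail entry for a high-stakes decision."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return json.dumps(
        {
            "timestamp": timestamp,
            "model_version": model_version,
            "inputs": inputs,
            "outcome": outcome,
            "explanation": explanation,
        },
        sort_keys=True,
    )


# Usage with made-up values: a low-confidence credit decision is deferred.
outcome = decide("approve", 0.65)
entry = audit_record(
    inputs={"income": 52000},
    model_version="credit-v3",          # hypothetical version tag
    outcome=outcome,
    explanation="top feature: income",  # e.g., a plain-language SHAP summary
)
```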

PII detection and redaction in training data — automated scan (Presidio or custom regex) before any training run
Data retention policy: define how long model inputs/outputs are stored; default to 90 days unless compliance requires longer
Right to deletion: ensure the ability to remove an individual's data and retrain if requested; document the data subject access request (DSAR) process
Synthetic data: consider for sensitive domains — document the generation method, privacy guarantees, and fidelity metrics
Federated learning: evaluate when data cannot leave its origin for regulatory reasons
Differential privacy: for sensitive training data, use DP-SGD with ε ≤ 10 and document the privacy budget
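
For the "custom regex" option in the PII scan above, a minimal sketch could look like this. The patterns and the `redact` helper are illustrative assumptions that cover only simple email, US-style phone, and US SSN formats; they will miss many real-world variants, so a dedicated tool such as Presidio is likely the better default.

```python
import re

# Illustrative patterns only; order matters because SSNs are replaced before phones.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text):
    """Replace each PII match with a <LABEL> placeholder and count the hits."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"<{label}>", text)
        counts[label] = n
    return text, counts


sample = "Contact jane.doe@example.com or 555-123-4567, SSN 123-45-6789."
clean, hits = redact(sample)
# clean == "Contact <EMAIL> or <PHONE>, SSN <US_SSN>."
```

Logging the per-label counts before each training run gives a simple record that the scan actually ran.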

Risk assessment for every new model: who is affected, what can go wrong, mitigations, residual risk acceptance
Incident response plan: what happens when the model produces harmful output in production — define SLA (e.g., <4hr for severity-1)
Regular model reviews: scheduled re-evaluation of fairness metrics and performance drift — at minimum every 6 months
Regulatory compliance: identify applicable regulations (GDPR, EU AI Act, CCPA, sector-specific) at project start
Stakeholder communication: non-technical summary of model behavior, risks, and limitations — update with each release
AI system inventory: maintain a registry of all deployed models with ownership, risk level, and review schedule
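
The inventory and the review schedule above could be tracked with something as small as the sketch below. `ModelRecord`, its fields, and the 182-day approximation of "every 6 months" are assumptions made for illustration; a real registry would more likely live in a database or a model-management tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=182)  # rough stand-in for the 6-month minimum


@dataclass
class ModelRecord:
    name: str
    owner: str
    risk_level: str      # hypothetical tiers, e.g. "low" or "high"
    last_review: date

    def review_due(self, today):
        """True once the review interval has elapsed since the last review."""
        return today - self.last_review >= REVIEW_INTERVAL


def overdue(registry, today):
    """Names of models whose scheduled review is due or overdue."""
    return [record.name for record in registry if record.review_due(today)]


# Usage with made-up entries.
registry = [
    ModelRecord("credit-scoring", "risk-team", "high", date(2024, 10, 1)),
    ModelRecord("search-ranking", "search-team", "low", date(2025, 5, 1)),
]
due = overdue(registry, today=date(2025, 7, 1))  # ["credit-scoring"]
```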

A/B testing: include fairness metrics in experiment dashboards, not just engagement/revenue
Production monitoring: alert on fairness metric drift — set threshold at 1.5× the training-time gap
Shadow scoring: run bias assessments on live traffic samples weekly, not just at deploy time
Regression testing: every model update must re-run the safety test suite — automate in CI; gate deployment on pass
User feedback loop: provide a mechanism for users to report biased or harmful outputs; track resolution rate (target: 100% triaged within 48hr)
Hallucination monitoring (generative models): sample ≥100 production outputs/week for factual accuracy audit
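
Two of the monitoring rules above reduce to small checks: the 1.5× drift alert and the 48-hour triage target. The sketch below assumes the fairness gap is a single non-negative number (for example, a demographic parity difference) and that each user report is a `(reported_at, triaged_at)` pair; both function names are hypothetical.

```python
from datetime import datetime, timedelta


def fairness_drift_alert(training_gap, live_gap, multiplier=1.5):
    """Alert when the live fairness gap exceeds multiplier x the training-time gap.

    Note: with a training gap of 0, any positive live gap triggers the alert.
    """
    return live_gap > multiplier * training_gap


def triage_rate(reports, sla=timedelta(hours=48)):
    """Share of reports triaged within the SLA; untriaged reports count as misses.

    Returns None when there are no reports.
    """
    if not reports:
        return None
    on_time = sum(
        1
        for reported_at, triaged_at in reports
        if triaged_at is not None and triaged_at - reported_at <= sla
    )
    return on_time / len(reports)


# Usage with made-up numbers.
alert = fairness_drift_alert(training_gap=0.04, live_gap=0.07)  # True: 0.07 > 0.06

start = datetime(2025, 1, 1)
reports = [
    (start, start + timedelta(hours=24)),   # on time
    (start, start + timedelta(hours=72)),   # late
    (start, None),                          # never triaged
    (start, start + timedelta(hours=48)),   # exactly at the SLA, counted as on time
]
rate = triage_rate(reports)  # 0.5, below the 100% target
```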

Model outputs must be compatible with screen readers — avoid emoji-heavy or ASCII art responses
For multi-language models, test outputs across all supported languages; quality often varies by 20–40% between languages
Color-dependent visualizations (e.g., SHAP plots) must be colorblind-safe — use viridis or cividis palettes
Provide alternative text for any model-generated images or charts

"The model is unbiased because we removed protected attributes" — proxy variables (zip code → race) still exist
Fairness metrics can conflict with each other (demographic parity vs. equalized odds) — document which ones you prioritize and why
Safety testing only at launch, never revisited — models face new attack vectors over time; schedule recurring audits
Explainability as afterthought — design for interpretability from the start; retrofitting is 5× more expensive
Assuming open-source model = safe model — independent safety evaluation is still required
Over-reliance on automated toxicity classifiers — they miss context-dependent harm and cultural nuance; human review is mandatory for edge cases
Treating responsible AI as a one-time checklist instead of a continuous process — embed it in sprint workflows
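
The proxy-variable pitfall at the top of this list can be probed with a rough check: how much better can a remaining feature guess the protected attribute than a constant guess? The sketch below is one simple heuristic among many; `proxy_strength` and the toy data are hypothetical, and a high score is a prompt for investigation, not proof of bias.

```python
from collections import Counter, defaultdict


def proxy_strength(feature_values, protected_values):
    """Compare two accuracies for predicting the protected attribute.

    baseline: always guess the overall majority group.
    with_feature: guess the majority group within each feature value.
    A large lift over the baseline suggests the feature acts as a proxy.
    """
    total = len(protected_values)
    baseline = Counter(protected_values).most_common(1)[0][1] / total

    by_value = defaultdict(Counter)
    for feature, group in zip(feature_values, protected_values):
        by_value[feature][group] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return baseline, correct / total


# Toy data: zip code partly reveals group membership.
zip_code = ["A", "A", "A", "B", "B", "B"]
group = ["x", "x", "y", "y", "y", "y"]
baseline, with_feature = proxy_strength(zip_code, group)
# baseline = 4/6, with_feature = 5/6
```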