Data Engineering Overlay

Appended on top of core.md for data-centric projects (ETL, feature engineering, data platforms).

Project Mindset

Data is the product. Data quality > pipeline speed > code elegance. Every pipeline must be idempotent, observable, and recoverable. Treat data contracts like API contracts — breaking changes require versioning and migration.

Data Pipeline Principles

Pipelines must be idempotent: running the same pipeline twice on the same input leaves the destination in the same state as running it once
Every pipeline step must be independently re-runnable without side effects beyond its declared outputs
Use a DAG orchestrator (Airflow, Dagster, Prefect) — never cron jobs for multi-step pipelines
Pipeline naming convention: {source}_{transform}_{destination}_{frequency}
Pipeline ownership: every pipeline must have an owner documented in config
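
A minimal sketch of an idempotent load step, assuming a date-partitioned table: the step deletes and rewrites its own partition inside one transaction, so a re-run leaves the table in the same state. The table `daily_orders`, its columns, and the helper names are hypothetical, and SQLite only stands in for whatever warehouse the project uses.

```python
import sqlite3


def pipeline_name(source, transform, destination, frequency):
    """Build a name following {source}_{transform}_{destination}_{frequency}."""
    return f"{source}_{transform}_{destination}_{frequency}"


def load_partition(conn, run_date, rows):
    """Replace one date partition atomically so re-runs do not duplicate rows."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM daily_orders WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO daily_orders (run_date, order_id, amount) VALUES (?, ?, ?)",
            [(run_date, order_id, amount) for order_id, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_orders (run_date TEXT, order_id INTEGER, amount REAL)"
)

rows = [(1, 9.5), (2, 20.0)]
load_partition(conn, "2024-01-01", rows)
load_partition(conn, "2024-01-01", rows)  # second run rewrites the same partition

row_count = conn.execute("SELECT COUNT(*) FROM daily_orders").fetchone()[0]
name = pipeline_name("crm", "dedupe", "warehouse", "daily")
```

An append-only insert would have doubled the rows on the second run; scoping the delete to the partition being loaded is what makes the step safe to retry.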

Data Quality

Schema validation at every boundary: source → raw → staging → production
Data quality checks (mandatory per table): row count, null percentage, key uniqueness, value range
Implement data contracts: agreed schema + freshness + quality SLA between producer and consumer
Anomaly detection on key metrics: alert on volume changes >20%, null rate spikes, distribution shifts
Quarantine bad data: route failing records to a dead-letter table rather than silently dropping them
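
The checks and the quarantine rule above could be sketched roughly as follows. The column names (`user_id`, `age`), the 0 to 120 range, and the choice to compare volume against the previous run are all assumptions for illustration; a real project would more likely express these rules in its validation framework of choice.

```python
def check_record(record):
    """Return a list of rule violations for one record (empty means valid)."""
    errors = []
    if record.get("user_id") is None:
        errors.append("user_id is null")
    age = record.get("age")
    if age is None or not 0 <= age <= 120:
        errors.append("age out of range")
    return errors


def split_records(records):
    """Send valid records onward and failing records to a dead-letter list."""
    valid, dead_letter, seen_ids = [], [], set()
    for record in records:
        errors = check_record(record)
        if record.get("user_id") in seen_ids:
            errors.append("duplicate user_id")
        if errors:
            # Keep the record and the reasons so it can be inspected and replayed.
            dead_letter.append({"record": record, "errors": errors})
        else:
            seen_ids.add(record["user_id"])
            valid.append(record)
    return valid, dead_letter


def volume_changed(previous_count, current_count, threshold=0.20):
    """True when the row count moved by more than `threshold` versus the previous run."""
    if previous_count == 0:
        return current_count > 0
    return abs(current_count - previous_count) / previous_count > threshold


records = [
    {"user_id": 1, "age": 34},
    {"user_id": None, "age": 20},   # fails the null check
    {"user_id": 1, "age": 40},      # fails the uniqueness check
    {"user_id": 2, "age": 150},     # fails the range check
]
valid, dead_letter = split_records(records)
```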

Feature Store & Feature Engineering

Feature definitions must be centralized — one source of truth for each feature
Feature computation must be identical between training (offline) and serving (online)
Track feature lineage: which raw data sources contribute to each feature
Feature documentation: name, type, description, update frequency, owner, downstream consumers
Deprecation policy: mark features as deprecated before removing them; track usage
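
One way to keep offline and online computation identical is to have both paths call the same registered function. The sketch below assumes a hypothetical in-process registry; the `Feature` class, the `avg_order_value` feature, and its source columns are illustrative and not the API of any particular feature store.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Feature:
    name: str
    dtype: str
    description: str
    owner: str
    sources: tuple          # lineage: raw columns this feature is derived from
    compute: Callable[[dict], float]
    deprecated: bool = False


REGISTRY = {}


def register(feature):
    """Add a feature definition; a second definition under the same name is rejected."""
    if feature.name in REGISTRY:
        raise ValueError(f"feature {feature.name!r} is already defined")
    REGISTRY[feature.name] = feature
    return feature


register(Feature(
    name="avg_order_value",
    dtype="float",
    description="Total spend divided by order count",
    owner="growth-team",
    sources=("orders.total_spend", "orders.order_count"),
    compute=lambda row: (
        row["total_spend"] / row["order_count"] if row["order_count"] else 0.0
    ),
))


def offline_features(rows, name):
    """Training path: apply the registered function to a batch of rows."""
    return [REGISTRY[name].compute(row) for row in rows]


def online_feature(row, name):
    """Serving path: apply the same registered function to a single row."""
    return REGISTRY[name].compute(row)


batch = offline_features([{"total_spend": 100.0, "order_count": 4}], "avg_order_value")
single = online_feature({"total_spend": 100.0, "order_count": 4}, "avg_order_value")
```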

Storage & Format Standards

Columnar format preferred for analytics: Parquet files, or a table format built on Parquet such as Delta Lake
Partition strategy: define by access pattern (date, region, etc.) — document the choice
File size target: 128 MB to 1 GB per file for Spark; avoid the small-file problem
Schema evolution: use formats that support it (Parquet, Avro) — never break downstream readers
Data retention policy: define TTL per dataset; implement automated cleanup
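
The sizing, partitioning, and retention rules above reduce to a few small calculations. This sketch assumes a 256 MiB target (one point inside the stated range), Hive-style `key=value` partition directories, and a hypothetical bucket path; none of these come from the guideline itself.

```python
import math
from datetime import date, timedelta

TARGET_FILE_BYTES = 256 * 1024**2  # assumed target within the 128 MB to 1 GB range


def target_file_count(partition_bytes, target_bytes=TARGET_FILE_BYTES):
    """Number of output files that keeps each file near the target size."""
    return max(1, math.ceil(partition_bytes / target_bytes))


def partition_path(base, event_date, region):
    """Hive-style partition directory for a date and region access pattern."""
    return f"{base}/date={event_date.isoformat()}/region={region}"


def expired_partitions(partition_dates, today, ttl_days):
    """Partitions strictly older than the TTL cutoff, i.e. candidates for cleanup."""
    cutoff = today - timedelta(days=ttl_days)
    return [d for d in partition_dates if d < cutoff]


files = target_file_count(10 * 1024**3)  # a 10 GiB partition
path = partition_path("s3://example-bucket/events", date(2024, 1, 15), "eu")
to_delete = expired_partitions(
    [date(2024, 1, 1), date(2024, 3, 1)], today=date(2024, 3, 31), ttl_days=30
)
```

In Spark, the result of `target_file_count` would typically be passed to a repartition before the write, so that a partition is not split into many undersized files.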

Observability

Every pipeline run must log: start time, end time, records processed, records failed, duration
Data freshness monitoring: alert when data is older than SLA threshold
Cost tracking: monitor compute and storage costs per pipeline — optimize the expensive ones
Lineage tracking: maintain a graph of data dependencies (source → transform → destination)
Runbook: every pipeline must have a documented troubleshooting guide
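
A rough sketch of the mandatory run log and the freshness check, using only the standard library. The field names, the JSON log shape, and the pipeline name are assumptions; the log lines are collected in a list here, where a real pipeline would send them to its logger or metrics backend.

```python
import json
import time
from contextlib import contextmanager
from datetime import datetime, timedelta, timezone


@contextmanager
def pipeline_run(name, emit=print):
    """Record start, end, duration, and record counts, even if the run fails."""
    stats = {"pipeline": name, "records_processed": 0, "records_failed": 0}
    started = datetime.now(timezone.utc)
    t0 = time.monotonic()
    try:
        yield stats
    finally:
        stats["start_time"] = started.isoformat()
        stats["end_time"] = datetime.now(timezone.utc).isoformat()
        stats["duration_seconds"] = round(time.monotonic() - t0, 3)
        emit(json.dumps(stats))


def is_stale(last_loaded_at, now, sla):
    """True when the newest data is older than the freshness SLA allows."""
    return now - last_loaded_at > sla


logs = []
with pipeline_run("orders_clean_warehouse_daily", emit=logs.append) as run:
    for raw in ["1", "2", "x"]:
        try:
            int(raw)
            run["records_processed"] += 1
        except ValueError:
            run["records_failed"] += 1

entry = json.loads(logs[0])
stale = is_stale(
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 2, tzinfo=timezone.utc),
    sla=timedelta(hours=6),
)
```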

Security & Governance

PII classification: tag columns containing personal data at ingestion time
Access control: implement column-level or row-level security for sensitive data
Audit logging: track who accessed what data, when, and why
Data masking: provide anonymized views for development and testing environments
Compliance: document which regulations apply to each dataset (GDPR, HIPAA, etc.)
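
As an illustration of PII tagging feeding a masked view, the sketch below tags columns with a masking strategy and applies it row by row. The column tags, the strategy names, and the truncated keyed hash are assumptions. Note that a keyed hash is pseudonymization rather than full anonymization, so whether it satisfies a given regulation needs a separate review.

```python
import hashlib
import hmac

# Hypothetical classification, assigned at ingestion time: column -> masking strategy.
PII_COLUMNS = {"email": "hash", "full_name": "redact"}


def mask_value(value, strategy, secret):
    """Mask one value according to the strategy tagged on its column."""
    if value is None:
        return None
    if strategy == "redact":
        return "***"
    if strategy == "hash":
        # Keyed hash: stable across rows, so joins still work on masked data.
        digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256)
        return digest.hexdigest()[:16]
    raise ValueError(f"unknown masking strategy: {strategy}")


def masked_view(rows, secret, tags=PII_COLUMNS):
    """Return copies of the rows with tagged columns masked and the rest untouched."""
    return [
        {
            col: mask_value(val, tags[col], secret) if col in tags else val
            for col, val in row.items()
        }
        for row in rows
    ]


rows = [{"email": "ann@example.com", "full_name": "Ann Lee", "country": "DE"}]
masked = masked_view(rows, secret=b"dev-only-secret")
```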