Snippet: Graph ML & Knowledge Graphs¶
Domain Context¶
Graph Neural Networks (GNN), knowledge graph reasoning, network analysis, and graph-based recommendations. Graph structures add relational context that flat features miss — but also add significant complexity.
Data Representation¶
- Define node types, edge types, and their attributes clearly before any modeling
- For heterogeneous graphs: document the full schema (node types × edge types × attribute types)
- Graph statistics to log before training: node count, edge count, degree distribution, connected components
- Handle isolated nodes explicitly — they receive no message passing information
- Temporal graphs: maintain edge timestamps; use temporal-aware splitting (no future edges in training)
Model Selection¶
- Baseline: node/edge features + classical ML (without graph structure) establishes the GNN value-add
- GCN/GraphSAGE: start here for homogeneous node classification
- R-GCN/HGT: for heterogeneous graphs with multiple node/edge types
- GAT/GATv2: when attention-weighted neighborhood aggregation is expected to help
- Number of layers: 2-3 is usually optimal — more layers cause over-smoothing; document layer count choice
Training Practices¶
- Neighbor sampling (mini-batch): required for large graphs (>100K nodes) — full-batch won't fit in memory
- Sampling strategy: document fan-out per layer (e.g., [25, 10] for 2-layer)
- Inductive vs. transductive: be explicit about which setting applies to your problem
- Negative sampling for link prediction: random negatives as baseline, hard negatives for better training
- Feature normalization: normalize numerical node features; use embeddings for high-cardinality categoricals
Knowledge Graph Specific¶
- Triple format: (head, relation, tail) — validate for duplicate/contradictory triples
- Embedding models: TransE for simple relations, RotatE for complex patterns, ComplEx for symmetric
- Evaluation: Mean Reciprocal Rank (MRR), Hits@K (K=1,3,10) — report both filtered and raw
- Type constraints: enforce domain/range constraints on predicted triples
- Temporal KGs: use time-aware models; evaluate on future time slices only
Evaluation¶
- Node classification: accuracy, macro-F1; split must respect graph structure (not random node split)
- Link prediction: MRR, Hits@K with proper negative sampling (not trivially easy negatives)
- Avoid data leakage: for link prediction, remove test edges from the training graph
- Evaluate by node degree bucket (low/medium/high) — models often fail on low-degree nodes
- Over-smoothing test: compare performance across different numbers of GNN layers
Common Pitfalls¶
- Over-smoothing: too many GNN layers make all node representations converge — limits depth
- Information leakage in link prediction: test edges visible during training message passing
- Scalability: full-graph operations don't scale — always design for mini-batch from the start
- Ignoring graph structure changes: real-world graphs evolve — evaluate on updated graph snapshots
- Feature leakage: target label information encoded in neighbor features via message passing