Snippet: Graph ML & Knowledge Graphs

Domain Context

Covers graph neural networks (GNNs), knowledge graph reasoning, network analysis, and graph-based recommendations. Graph structure adds relational context that flat features miss, but it also adds significant complexity.

Data Representation

  • Define node types, edge types, and their attributes clearly before any modeling
  • For heterogeneous graphs: document the full schema (node types × edge types × attribute types)
  • Graph statistics to log before training: node count, edge count, degree distribution, connected components
  • Handle isolated nodes explicitly: they receive no messages from neighbors, so their representations depend only on their own features
  • Temporal graphs: maintain edge timestamps; use temporal-aware splitting (no future edges in training)
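
The statistics listed above can be computed before any modeling starts. The following is a minimal sketch in plain Python, assuming an undirected graph whose nodes are numbered 0..num_nodes-1 and whose edges arrive as pairs; the function name `graph_stats` and the returned keys are illustrative, not taken from any library.

```python
from collections import Counter

def graph_stats(num_nodes, edges):
    """Summarize an undirected graph given as a node count and an edge list."""
    # Deduplicate edges and ignore direction: (u, v) and (v, u) count once.
    unique_edges = {tuple(sorted(e)) for e in edges}

    adj = {n: set() for n in range(num_nodes)}
    for u, v in unique_edges:
        adj[u].add(v)
        adj[v].add(u)

    degrees = [len(adj[n]) for n in range(num_nodes)]

    # Count connected components with an iterative depth-first search.
    seen = set()
    num_components = 0
    for start in range(num_nodes):
        if start in seen:
            continue
        num_components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)

    return {
        "num_nodes": num_nodes,
        "num_edges": len(unique_edges),
        "degree_histogram": dict(Counter(degrees)),
        "isolated_nodes": [n for n, d in enumerate(degrees) if d == 0],
        "num_components": num_components,
    }

# Toy graph: a path 0-1-2, a separate edge 3-4, and an isolated node 5.
stats = graph_stats(6, [(0, 1), (1, 2), (2, 1), (3, 4)])
print(stats)
```

Listing the isolated nodes by ID, rather than only counting them, makes it easier to decide how to handle them (for example, falling back to a feature-only model for those nodes).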

Model Selection

  • Baseline: node/edge features + classical ML (without graph structure) establishes the GNN value-add
  • GCN/GraphSAGE: start here for homogeneous node classification
  • R-GCN/HGT: for heterogeneous graphs with multiple node/edge types
  • GAT/GATv2: when attention-weighted neighborhood aggregation is expected to help
  • Number of layers: 2-3 is usually a good default; deeper stacks tend to over-smooth node representations. Document the layer count and why it was chosen
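
To see why depth is limited, it can help to watch what repeated neighborhood averaging does to node features. The sketch below is a deliberate simplification: it applies parameter-free mean aggregation (each node averages itself and its neighbors) with no learned weights or nonlinearities, which is only a rough stand-in for a real GCN-style layer. The helper names `mean_aggregate` and `feature_spread` are hypothetical.

```python
def mean_aggregate(features, adj):
    """One propagation step: each node averages itself and its neighbors."""
    out = []
    for node, feat in enumerate(features):
        group = [node] + sorted(adj[node])
        out.append([
            sum(features[n][d] for n in group) / len(group)
            for d in range(len(feat))
        ])
    return out

def feature_spread(features):
    """Largest per-dimension gap between any two nodes (0 means all identical)."""
    dims = range(len(features[0]))
    return max(
        max(f[d] for f in features) - min(f[d] for f in features)
        for d in dims
    )

# Path graph 0-1-2-3-4 with one scalar feature per node.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
features = [[0.0], [0.0], [0.0], [0.0], [1.0]]

spreads = []
h = features
for layer in range(1, 9):
    h = mean_aggregate(h, adj)
    spreads.append(feature_spread(h))
    print(f"after {layer} layer(s): spread = {spreads[-1]:.4f}")
```

The spread between nodes shrinks with every step, so after enough layers the nodes become hard to tell apart. Trained GNNs are not this simple, but the same tendency is what the over-smoothing warning refers to.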

Training Practices

  • Neighbor sampling (mini-batch): usually needed for large graphs (roughly >100K nodes, depending on feature size and hardware), where full-batch training may not fit in GPU memory
  • Sampling strategy: document fan-out per layer (e.g., [25, 10] for 2-layer)
  • Inductive vs. transductive: be explicit about which setting applies to your problem
  • Negative sampling for link prediction: random negatives as baseline, hard negatives for better training
  • Feature normalization: normalize numerical node features; use embeddings for high-cardinality categoricals
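
The fan-out notation above can be made concrete with a small layer-wise sampler. This is a sketch in plain Python in the style of GraphSAGE sampling, not the API of any particular library; `sample_neighborhood` and its arguments are hypothetical, and the assumed convention is that `fanouts[0]` applies to the seed nodes' direct neighbors.

```python
import random

def sample_neighborhood(adj, seed_nodes, fanouts, rng):
    """Layer-wise neighbor sampling.

    fanouts[i] is the maximum number of neighbors sampled per node at hop i + 1.
    Returns one list of (src, dst) edges per hop; messages flow src -> dst.
    """
    frontier = list(seed_nodes)
    blocks = []
    for fanout in fanouts:
        sampled_edges = []
        next_frontier = set()
        for dst in frontier:
            neighbors = sorted(adj.get(dst, ()))
            k = min(fanout, len(neighbors))  # low-degree nodes keep all neighbors
            for src in rng.sample(neighbors, k):
                sampled_edges.append((src, dst))
                next_frontier.add(src)
        blocks.append(sampled_edges)
        frontier = sorted(next_frontier)
    return blocks

# Toy graph: 20 nodes on a ring with extra chords, 4 neighbors per node.
adj = {
    i: {(i + 1) % 20, (i - 1) % 20, (i + 5) % 20, (i - 5) % 20}
    for i in range(20)
}

rng = random.Random(0)  # fixed seed so the sampled subgraph is reproducible
blocks = sample_neighborhood(adj, seed_nodes=[0, 1], fanouts=[3, 2], rng=rng)
for hop, edges in enumerate(blocks, start=1):
    print(f"hop {hop}: {len(edges)} sampled edges")
```

With fan-outs [25, 10] the same loop would cap each mini-batch at 25 first-hop and 250 second-hop sampled edges per seed node, which is what keeps memory bounded regardless of graph size.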

Knowledge Graph Specific

  • Triple format: (head, relation, tail) — validate for duplicate/contradictory triples
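  • Embedding models: TransE for simple one-to-one relations (it cannot represent symmetric relations), RotatE for symmetry, antisymmetry, inversion, and composition patterns, ComplEx for symmetric and antisymmetric relations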
  • Evaluation: Mean Reciprocal Rank (MRR), Hits@K (K=1,3,10) — report both filtered and raw
  • Type constraints: enforce domain/range constraints on predicted triples
  • Temporal KGs: use time-aware models; evaluate on future time slices only
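
The difference between raw and filtered metrics is easiest to see on a tiny example. The sketch below assumes a model has already produced one score per candidate tail (higher is better) and breaks ties optimistically; the function names, entities, and scores are all made up for illustration.

```python
def rank_of_true_tail(scores, true_tail, known_tails=()):
    """Rank (1 = best) of the true tail among candidate entities.

    `known_tails` are other tails known to be correct for the same
    (head, relation). Passing them yields the filtered rank; leaving
    the argument empty yields the raw rank.
    """
    filtered_out = set(known_tails) - {true_tail}
    target = scores[true_tail]
    better = sum(
        1 for entity, s in scores.items()
        if entity not in filtered_out and s > target
    )
    return better + 1

def mrr(ranks):
    """Mean Reciprocal Rank."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of queries whose true answer is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Query 1: (alice, speaks, ?), test answer "english"; "french" is a known training fact.
q1 = {"french": 0.9, "english": 0.8, "german": 0.3, "spanish": 0.1}
# Query 2: (bob, speaks, ?), test answer "german"; no other known answers.
q2 = {"french": 0.5, "english": 0.6, "german": 0.4, "spanish": 0.1}

raw_ranks = [rank_of_true_tail(q1, "english"), rank_of_true_tail(q2, "german")]
filtered_ranks = [
    rank_of_true_tail(q1, "english", known_tails=["french"]),
    rank_of_true_tail(q2, "german"),
]

print("raw:     ", raw_ranks, mrr(raw_ranks), hits_at_k(raw_ranks, 1))
print("filtered:", filtered_ranks, mrr(filtered_ranks), hits_at_k(filtered_ranks, 1))
```

In the first query the raw rank penalizes the model for scoring another correct answer ("french") above the test answer, while the filtered rank does not. Reporting both shows how much of the gap comes from such known facts.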

Evaluation

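  • Node classification: accuracy, macro-F1; choose the split to match deployment (e.g., temporal or inductive splits) rather than defaulting to a random node split, which can leak information through shared neighborhoods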
  • Link prediction: MRR, Hits@K with proper negative sampling (not trivially easy negatives)
  • Avoid data leakage: for link prediction, remove test edges from the training graph
  • Evaluate by node degree bucket (low/medium/high) — models often fail on low-degree nodes
  • Over-smoothing test: compare performance across different numbers of GNN layers
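
Degree-bucketed evaluation needs only node degrees, labels, and predictions. The following sketch uses hypothetical bucket thresholds (degree ≤ 2 is "low", ≤ 10 is "medium"); sensible cut-offs depend on the degree distribution of the actual graph, for example its quantiles.

```python
def degree_bucket(degree, low_max=2, mid_max=10):
    """Map a node degree to a coarse bucket; thresholds are illustrative."""
    if degree <= low_max:
        return "low"
    if degree <= mid_max:
        return "medium"
    return "high"

def accuracy_by_degree(degrees, y_true, y_pred, low_max=2, mid_max=10):
    """Per-bucket accuracy; a bucket with no nodes maps to None."""
    totals = {"low": 0, "medium": 0, "high": 0}
    correct = {"low": 0, "medium": 0, "high": 0}
    for d, t, p in zip(degrees, y_true, y_pred):
        bucket = degree_bucket(d, low_max, mid_max)
        totals[bucket] += 1
        correct[bucket] += int(t == p)
    return {
        b: (correct[b] / totals[b] if totals[b] else None)
        for b in totals
    }

# Toy evaluation set: one entry per test node.
degrees = [1, 2, 1, 5, 8, 30, 50]
y_true = [0, 1, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0]

result = accuracy_by_degree(degrees, y_true, y_pred)
print(result)
```

An overall accuracy of 5/7 here would hide that every error falls in the low-degree bucket, which is the pattern this breakdown is meant to expose.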

Common Pitfalls

  • Over-smoothing: with too many GNN layers, node representations converge toward one another, which limits the useful depth of the model
  • Information leakage in link prediction: test edges visible during training message passing
  • Scalability: full-graph operations don't scale — always design for mini-batch from the start
  • Ignoring graph structure changes: real-world graphs evolve — evaluate on updated graph snapshots
  • Feature leakage: target label information encoded in neighbor features via message passing
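
The link prediction leakage pitfall can be guarded against with an explicit check at split time. The sketch below, in plain Python with hypothetical helper names, holds out a fraction of undirected edges and builds the message-passing adjacency from the remaining edges only, then verifies that no held-out edge is reachable in one hop.

```python
import random

def split_edges(edges, test_fraction, rng):
    """Hold out a fraction of undirected edges for testing."""
    # Normalize direction and deduplicate so (u, v) and (v, u) are one edge.
    unique = sorted({tuple(sorted(e)) for e in edges})
    rng.shuffle(unique)
    n_test = max(1, int(len(unique) * test_fraction))
    return unique[n_test:], unique[:n_test]  # (train_edges, test_edges)

def build_adjacency(edges):
    """Undirected adjacency sets built from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

# Toy graph: a ring of 10 nodes, with one edge listed in both directions.
edges = [(i, (i + 1) % 10) for i in range(10)] + [(1, 0)]

rng = random.Random(42)
train_edges, test_edges = split_edges(edges, test_fraction=0.2, rng=rng)

# The graph used for message passing contains training edges only.
message_passing_adj = build_adjacency(train_edges)

leaked = [
    (u, v) for u, v in test_edges
    if v in message_passing_adj.get(u, set())
]
print(f"{len(train_edges)} train edges, {len(test_edges)} test edges, {len(leaked)} leaked")
```

Deduplicating before the split matters: if the reverse direction of a test edge stayed in the training list, the model could see the answer during message passing even though the forward direction was held out.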