Even many "generalization" benchmarks for AI miss the point. They often test parametric novelty, not structural novelty.
Parametric Novelty: "Here's a new level of the same game." (e.g., procedurally generated levels in OpenAI's Procgen benchmark). The rules, objects, and physics are identical. The AI doesn't need to induce a new world model.
Structural Novelty: "Here's a new game with new rules you've never seen." This is what humans are great at. It requires building a new world model from scratch, or adapting an old one.
So we are not testing for world model induction. We're just testing how well a fixed world model generalizes to slight variations of the world it was trained on.
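To make the distinction concrete, here is a minimal toy sketch, not a real benchmark. Everything in it is a hypothetical illustration: the grid world, the names `walk_rule`, `mirror_rule`, and `prediction_accuracy`, and the choice of "mirrored controls" as the structural change. The agent's world model is frozen to the rule it was trained on, and we measure how often it predicts the next state correctly.

```python
import random

# Hypothetical toy setup: actions and their grid offsets.
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def clamp(v, size):
    """Keep a coordinate inside the grid."""
    return min(max(v, 0), size - 1)

def walk_rule(pos, action, size):
    """Training-time rule: move one cell in the named direction."""
    dx, dy = MOVES[action]
    return (clamp(pos[0] + dx, size), clamp(pos[1] + dy, size))

def mirror_rule(pos, action, size):
    """Structurally new rule (assumed for illustration): every action moves the opposite way."""
    dx, dy = MOVES[action]
    return (clamp(pos[0] - dx, size), clamp(pos[1] - dy, size))

def prediction_accuracy(world_model, true_rule, size, seed, steps=200):
    """Fraction of random transitions the fixed world model predicts correctly."""
    rng = random.Random(seed)
    pos = (rng.randrange(size), rng.randrange(size))
    correct = 0
    for _ in range(steps):
        action = rng.choice(list(MOVES))
        predicted = world_model(pos, action, size)
        pos = true_rule(pos, action, size)  # the environment's actual next state
        correct += predicted == pos
    return correct / steps

# The agent's world model is frozen to the rule it was trained on.
frozen_model = walk_rule

# Parametric novelty: new grid size and start position, same rule.
parametric = prediction_accuracy(frozen_model, walk_rule, size=9, seed=1)
# Structural novelty: same grid, different rule.
structural = prediction_accuracy(frozen_model, mirror_rule, size=9, seed=1)
print(f"parametric: {parametric:.2f}, structural: {structural:.2f}")
```

In this toy setup the frozen model predicts every transition under the parametric change and none under the structural one. A benchmark that only varies grid size and start position would score this agent as generalizing perfectly, even though it has no way to discover the new rule.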