Competing Speeds of Memorisation and Generalisation Predict Grokking

Feb 4, 2026 • Yiding Song & Hanming Ye (Preprint)

Abstract

Deep networks trained with gradient descent have been observed to exhibit 'grokking', or delayed generalisation, on small algorithmic datasets. It has been conjectured that grokking is the result of different pattern learning speeds, where gradient descent first learns fast patterns that may overfit, and only later learns slower patterns that generalise better. In this work, we formalise this conjecture by establishing information-theoretic estimates of model capacity and dataset complexity. We demonstrate that the onset of grokking correlates strongly with the intersection of memorisation and generalisation speeds, where the time taken by a model to find an algorithmic solution equals that required to memorise a dataset of equivalent complexity. Surprisingly, we find that smaller models do not grok even if they have enough capacity to memorise the training set. We argue this is because smaller models have slower memorisation speeds, biasing gradient descent towards first discovering the faster, generalising solution. Our experiments provide evidence that memorisation and generalisation speeds are sufficient to quantitatively model grokking, and may be useful for understanding the generalisation behaviour of larger models on natural tasks.
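As a rough illustration of the crossover condition described above (the symbols T_gen, T_mem, f, and C(·) are shorthand introduced here, not notation from the paper), the onset of grokking is associated with the point where

  T_{\text{gen}}(f) \approx T_{\text{mem}}\big(f,\, C(\mathcal{D}_{\text{train}})\big)

where T_gen is the training time a model f needs to find the algorithmic (generalising) solution and T_mem is the time it needs to memorise a dataset whose complexity C matches that of the training set. On this reading, delayed generalisation is expected when memorisation is the faster of the two processes, whereas smaller models, whose memorisation is slower, find the generalising solution first and do not grok.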

Citation

Use the following BibTeX entry to cite this work:

@article{song2026competing,
  title={Competing speeds of memorisation and generalisation predict grokking},
  author={Song, Yiding and Ye, Hanming},
  year={2026}
}