They are not trained for only 1 epoch. High-quality data is trained on for multiple epochs. Also, the Meta team showed with Llama that simply training on more tokens continues to reduce loss.
If you divide the number of sentences a model was trained on by the total number of sentences in its training corpus, the result for most of the top LLMs will be far closer to 1 than to any other integer.
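To make that ratio concrete, here is a rough sketch. It counts tokens rather than sentences, and the mixture below is entirely made up for illustration (the subset names and sizes are hypothetical, not taken from any paper). It only shows how a small subset can be repeated while the overall ratio stays near 1.

```python
def effective_epochs(tokens_seen: float, unique_tokens: float) -> float:
    """Average number of passes over the corpus: tokens seen / unique tokens."""
    if unique_tokens <= 0:
        raise ValueError("unique_tokens must be positive")
    return tokens_seen / unique_tokens


# Hypothetical mixture: subset -> (unique tokens, passes over that subset).
# These numbers are invented for illustration only.
mixture = {
    "web_crawl": (900e9, 1.0),  # large subset, seen once
    "curated": (50e9, 2.0),     # small high-quality subset, seen twice
}

unique_total = sum(unique for unique, _ in mixture.values())
seen_total = sum(unique * passes for unique, passes in mixture.values())

ratio = effective_epochs(seen_total, unique_total)
print(f"effective epochs over the whole corpus: {ratio:.3f}")
```

With numbers like these, both statements hold at once: part of the data is seen more than once, yet the corpus-wide ratio still rounds to 1.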
> Also, the Meta team showed with Llama that simply training on more tokens continues to reduce loss.
Can you source the specific claim you are talking about? To me, "more tokens" generally means new tokens unless you specify otherwise.
From the paper: "We train for one epoch over the training data. In earlier experiments, we found that training longer can lead to over-fitting."
I could be wrong, but I thought the Llama 2 paper explicitly called out 1 epoch and said that more than that caused over-fitting in their other experiments.