
They are not being trained for only 1 epoch. High-quality data is trained on for multiple epochs. Also Meta team with llama show that simply training more, more tokens, continues to reduce loss.


If you divide the number of sentences a model was trained on by the total number of sentences in its corpus, the result for most of the top LLMs will be far closer to 1 than to any other integer.
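To make that ratio concrete, here is a minimal sketch. The function name `effective_epochs` and the numbers are hypothetical, chosen only for illustration and not taken from any paper; the same arithmetic works whether you count sentences or tokens.

```python
def effective_epochs(units_seen: float, unique_units: float) -> float:
    """Average number of passes over the corpus: units seen / unique units."""
    if unique_units <= 0:
        raise ValueError("unique_units must be positive")
    return units_seen / unique_units


# Hypothetical run: 1.1T tokens seen, drawn from a 1.0T-token corpus.
ratio = effective_epochs(1.1e12, 1.0e12)
nearest_integer = round(ratio)  # the "closest integer" the comment refers to
print(ratio, nearest_integer)
```

A ratio slightly above 1 still rounds to a single epoch, which is the sense in which "roughly one epoch" is meant here.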

> Also Meta team with llama show that simply training more, more tokens, continues to reduce loss.

Can you source the specific claim you are talking about? To me, "more tokens" generally means new tokens unless you specify otherwise.

From the paper: "We train for one epoch over the training data. In earlier experiments, we found that training longer can lead to over-fitting"


Yes. Surely "more tokens" doesn't mean "more epochs".


I could be wrong, but I thought the llama 2 paper explicitly called out 1 epoch and that more than that caused over-fitting in their other experiments.


You're not at all wrong :) I think a lot of people confuse the pre-training and fine-tuning runs because these are all novel concepts.


Llama: https://arxiv.org/pdf/2302.13971.pdf

You can clearly see in the table on the second page that higher-quality data is trained on for more than 1 epoch. Most open LLMs do this.
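As a rough sketch of how per-source epochs fall out of a sampling mix: epochs for a source = (its sampling proportion x total training tokens) / its size. The source names, proportions, and sizes below are hypothetical and are not the figures from the paper's table.

```python
def per_source_epochs(total_tokens, mix, sizes):
    """Epochs per source = sampling proportion * total tokens / source size."""
    return {name: mix[name] * total_tokens / sizes[name] for name in mix}


# Hypothetical mix: a large web crawl plus a small, upsampled curated source.
mix = {"web": 0.90, "wiki": 0.10}      # sampling proportions (sum to 1)
sizes = {"web": 900e9, "wiki": 40e9}   # unique tokens available per source
total = 1000e9                         # total tokens seen during training

epochs = per_source_epochs(total, mix, sizes)
overall = total / sum(sizes.values())  # corpus-wide average number of passes
print(epochs, overall)
```

Under these made-up numbers the small curated source is seen about 2.5 times while the corpus-wide average stays close to 1, so "more than 1 epoch on high-quality data" and "roughly 1 epoch overall" can both hold.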


Llama2 doesn’t and outclasses llama. I believe GPT was trained in same manner


> I believe GPT was trained in same manner

If you are talking about GPT-4, unless you are an insider (doubt it), you'd have no way of proving it either way, because that info is not public.

> Llama2 doesn’t and outclasses llama.

My point still stands: Llama 2 is just one LLM, and we still don't know the distribution of its training set.



