Curios to try it someday on a set of specialized documents. Though as I understand the cost of running this is whatever GPU you can rent with 80GB of VRAM. Which kind of leaves hobbyists and students out. Unless some cloud is donating gpu compute capacity.
A GPU with 80GB VRAM costs around $1-3 USD an hour on commodity clouds (i.e. the non-Big 3 bare metal providers e.g. https://getdeploying.com/reference/cloud-gpu/nvidia-h100). I think it's accessible to most middle class users in first world countries.
The 80 GB are for training with a batch size of 32 times 2048 tokens each. Since the model has only about 560M parameters, you could probably run it on CPU, if a bit slow.
It will work great with 40GB GPU, probably a bit less than twice slower. These are micro models of a few B param at most and fit easily during both training and inference.
Set nproc_per_node-1 instead of 8 (or run the training script directly instead of using torchrun) and set device_batch_size=4 instead of 32. You may be able to use 8 with a 5090, but it didn't work on my 4090. However it's way slower than expected, one H100 isn't 250x the 4090, so I'm not sure it's training correctly. I'll let it run overnight and see if the outputs make any sense, maybe the metrics are not accurate in this config.
Curios to try it someday on a set of specialized documents. Though as I understand the cost of running this is whatever GPU you can rent with 80GB of VRAM. Which kind of leaves hobbyists and students out. Unless some cloud is donating gpu compute capacity.