Should be "that you can train for $100". Curious to try it someday on a set of spe...

Onavo · 2025-10-13T17:56:36 1760378196

A GPU with 80GB of VRAM costs around $1-3 USD an hour on commodity clouds, i.e. the bare metal providers outside the Big 3 (see e.g. https://getdeploying.com/reference/cloud-gpu/nvidia-h100). I think that's accessible to most middle-class users in first-world countries.
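To put the hourly price next to the $100 figure from the top of the thread, here is a rough back-of-the-envelope sketch. The 8-GPU node size is an assumption (it matches the default of 8 processes per node mentioned further down), and `hours_for_budget` is a hypothetical helper, not part of any project.

```python
def hours_for_budget(budget_usd: float, usd_per_gpu_hour: float, num_gpus: int) -> float:
    """Wall-clock hours a budget buys when renting num_gpus GPUs at once."""
    return budget_usd / (usd_per_gpu_hour * num_gpus)

BUDGET_USD = 100.0  # the "$100" figure quoted at the top of the thread
NUM_GPUS = 8        # assumption: one 8-GPU node

# Sweep the $1-3 per GPU-hour range quoted above.
for rate in (1.0, 2.0, 3.0):
    hours = hours_for_budget(BUDGET_USD, rate, NUM_GPUS)
    print(f"${rate:.2f}/GPU-hour: about {hours:.1f} hours on {NUM_GPUS} GPUs")
```

Under these assumptions the budget covers roughly 4 to 12.5 hours of an 8-GPU node, depending on where in the price range the provider falls.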

antinomicus · 2025-10-13T18:49:15 1760381355

Isn’t the whole point to run your model locally?

theptip · 2025-10-13T18:56:47 1760381807

No, that’s clearly not a goal of this project.

This is a learning tool. If you want a local model, you are almost certainly better off using something trained on far more compute (DeepSeek, Qwen, etc.).

yorwba · 2025-10-13T18:56:47 1760381807

The 80 GB is for training with a batch of 32 sequences of 2048 tokens each. Since the model has only about 560M parameters, you could probably run it on CPU, if a bit slowly.
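A quick sketch of the arithmetic behind that: the weights of a ~560M-parameter model are small, and it is the training batch that drives memory use. The bytes-per-parameter values are the standard sizes of fp32 and bf16; the estimate covers weights only and ignores activations, gradients, and optimizer state, so it is a lower bound rather than a measurement.

```python
PARAMS = 560_000_000  # "about 560M parameters", from the comment above

# Standard storage sizes per parameter for two common dtypes.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}

def weight_gb(num_params: int, bytes_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype}: about {weight_gb(PARAMS, nbytes):.2f} GB of weights")

# Tokens processed per device in one training batch: 32 sequences x 2048 tokens.
tokens_per_device_batch = 32 * 2048
print(f"tokens per device batch: {tokens_per_device_batch}")
```

The weights come to a couple of GB at most, which is why inference fits on modest hardware while a training batch of 65,536 tokens per device does not.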

simonw · 2025-10-13T20:14:33 1760386473

You can run a model locally on much less expensive hardware. It's training that requires the really big GPUs.

jsight · 2025-10-13T19:39:13 1760384353

I'd guess that this will generate output faster than the average reader can read, even with CPU-only inference on a modern-ish CPU.

The param count is small enough that even cheap (<$500) GPUs would work too.

portaouflop · 2025-10-13T17:49:14 1760377754

If I have, let’s say, 40GB of RAM, does it not work at all or just take twice as long to train?

typpilol · 2025-10-13T18:02:27 1760378547

Won't work at all. Or if it does, it'll be so slow that it won't ever finish, since it'll have to go to disk for every single calculation.

karpathy · 2025-10-13T19:51:17 1760385077

It will work great with a 40GB GPU, probably a bit less than twice as slow. These are micro models of a few B params at most and fit easily during both training and inference.

utopcell · 2025-10-14T02:34:38 1760409278

How low can this go? Can this run on a 5090 card (32GiB)?

JonathanFly · 2025-10-14T10:33:10 1760437990

Set nproc_per_node=1 instead of 8 (or run the training script directly instead of using torchrun) and set device_batch_size=4 instead of 32. You may be able to use 8 with a 5090, but it didn't work on my 4090. However, it's way slower than expected. One H100 isn't 250x the 4090, so I'm not sure it's training correctly. I'll let it run overnight and see if the outputs make any sense; maybe the metrics are not accurate in this config.
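For a sense of how much sequential work that config change implies, here is a minimal sketch. It assumes the total batch per optimizer step is held constant and the difference is made up with gradient accumulation; whether the training script actually does this is an assumption, and `accumulation_steps` is a hypothetical helper.

```python
def accumulation_steps(ref_batch: int, ref_gpus: int, new_batch: int, new_gpus: int) -> int:
    """Micro-batches needed per optimizer step to match the reference total batch."""
    ref_total = ref_batch * ref_gpus  # sequences per step in the reference setup
    new_total = new_batch * new_gpus  # sequences per forward/backward pass in the new setup
    if ref_total % new_total != 0:
        raise ValueError("reference batch is not a multiple of the new per-pass batch")
    return ref_total // new_total

# Reference: device_batch_size=32 on 8 GPUs. New: device_batch_size=4 on 1 GPU.
print(accumulation_steps(32, 8, 4, 1))  # sequential passes per optimizer step
# The same comparison with device_batch_size=8 on a single GPU.
print(accumulation_steps(32, 8, 8, 1))
```

Under that assumption, a single GPU with device_batch_size=4 runs 64 sequential passes for every step the 8-GPU setup does in one, before any difference in per-GPU speed is counted.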