- successful FP8 quantized training of a SOTA model (block-wise scaling sketch after this list)
- multi-token prediction, mainly to improve training quality, but also to enable speculative decoding at inference (loss sketch below)
- very high sparsity per token (37B activated params out of 671B total; top-k routing sketch below)
- using reasoning data (from DeepSeek-R1) to fine-tune the model and improve results on math & coding
- manual balancing of compute and communication in their infrastructure, down to the SM level (stream-overlap analogue below)
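
A minimal sketch of the block-wise scaling idea behind FP8 training: each small tile of a GEMM input gets its own fp32 scale, so an outlier in one tile doesn't crush the precision of the rest. This is my own illustration (plain NumPy, no actual E4M3 rounding), not DeepSeek's kernel code; names and block sizes are assumptions.

```python
# Hypothetical block-wise FP8 (E4M3) scaling sketch; not DeepSeek's kernels.
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_blockwise_fp8(x: np.ndarray, block: int = 128):
    """Scale each 1 x `block` tile so its max maps to E4M3_MAX.

    Returns the scaled payload plus one fp32 scale per tile, which the
    GEMM epilogue multiplies back in.
    """
    rows, cols = x.shape
    assert cols % block == 0
    tiles = x.reshape(rows, cols // block, block)
    scales = np.abs(tiles).max(axis=-1, keepdims=True) / E4M3_MAX
    scales = np.where(scales == 0, 1.0, scales)          # avoid div-by-zero
    q = np.clip(tiles / scales, -E4M3_MAX, E4M3_MAX)     # values now fit the E4M3 range
    # Real code would cast `q` to an FP8 dtype here; kept as fp32 to stay portable.
    return q.reshape(rows, cols), scales.squeeze(-1)

def dequantize_blockwise(q: np.ndarray, scales: np.ndarray, block: int = 128):
    rows, cols = q.shape
    tiles = q.reshape(rows, cols // block, block) * scales[..., None]
    return tiles.reshape(rows, cols)

x = np.random.randn(4, 256).astype(np.float32)
q, s = quantize_blockwise_fp8(x)
err = np.abs(dequantize_blockwise(q, s) - x).max()
print(f"max round-trip error (before FP8 rounding): {err:.2e}")
```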
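
A toy version of a multi-token-prediction loss: extra heads predict tokens several positions ahead, and their cross-entropies are averaged into the training objective. DeepSeek-V3's actual MTP module chains additional transformer blocks to preserve the causal chain; the independent heads and GRU trunk here are simplifications so the example stays small and runnable.

```python
# Illustrative multi-token-prediction loss; not DeepSeek-V3's exact MTP module.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMTPModel(nn.Module):
    def __init__(self, vocab=1000, dim=64, n_predict=2):
        super().__init__()
        self.n_predict = n_predict                        # future tokens predicted per position
        self.embed = nn.Embedding(vocab, dim)
        self.trunk = nn.GRU(dim, dim, batch_first=True)   # stand-in for the transformer trunk
        self.heads = nn.ModuleList(nn.Linear(dim, vocab) for _ in range(n_predict))

    def forward(self, tokens):                            # tokens: (batch, seq)
        h, _ = self.trunk(self.embed(tokens))             # (batch, seq, dim)
        return [head(h) for head in self.heads]           # one logit tensor per offset

def mtp_loss(logits_per_offset, tokens):
    """Average cross-entropy over offsets: head k predicts the token at position t + k + 1."""
    loss = 0.0
    for k, logits in enumerate(logits_per_offset):
        shift = k + 1
        pred = logits[:, :-shift, :]                      # positions that still have a target
        target = tokens[:, shift:]
        loss = loss + F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
    return loss / len(logits_per_offset)

model = TinyMTPModel()
tokens = torch.randint(0, 1000, (2, 16))
loss = mtp_loss(model(tokens), tokens)
loss.backward()
print(float(loss))
```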
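
A toy top-k router showing where the 37B-of-671B sparsity comes from: each token is dispatched to only k experts, so the rest of the expert parameters stay idle for that token. This uses a generic softmax top-k gate, not DeepSeek-V3's sigmoid routing with bias-based load balancing, and all sizes are toy values.

```python
# Generic top-k MoE routing sketch; sizes and gating are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, dim=32, hidden=64, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                                 # x: (tokens, dim)
        scores = F.softmax(self.router(x), dim=-1)        # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)    # pick top-k experts per token
        weights = weights / weights.sum(-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                    # each token only touches k experts
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

moe = TinyMoE()
x = torch.randn(8, 32)
print(moe(x).shape)                                       # torch.Size([8, 32])
print(f"active expert fraction per token: {moe.top_k}/{len(moe.experts)}")
```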
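
A rough analogue of the compute/communication overlap, using CUDA streams: the "communication" (here a stand-in copy) runs on its own stream while the matmul proceeds on the default stream. DeepSeek does this at a much lower level, carving out SMs for communication kernels; this only illustrates the overlap pattern, and everything in it is illustrative.

```python
# High-level compute/comm overlap sketch with CUDA streams (not SM-level scheduling).
import torch

def overlapped_step(x, w, payload):
    if not torch.cuda.is_available():
        print("CUDA not available; skipping overlap demo")
        return None
    comm_stream = torch.cuda.Stream()
    x, w, payload = x.cuda(), w.cuda(), payload.cuda()
    with torch.cuda.stream(comm_stream):
        # stand-in for an all-to-all dispatch: a device-to-host copy on its own stream
        payload_cpu = payload.to("cpu", non_blocking=True)
    y = x @ w                                             # compute runs on the default stream meanwhile
    torch.cuda.current_stream().wait_stream(comm_stream)  # sync before consuming the comm result
    return y, payload_cpu

overlapped_step(torch.randn(1024, 1024), torch.randn(1024, 1024), torch.randn(4096))
```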