I referred to this paper a lot when it was hyped, back when people cared about architectural decisions in neural networks. That was also the year I started studying neural networks.

I think the idea still holds. Although interest has shifted towards test-time scaling and thinking, researchers still care about architectures, as the recently published Nemotron 3 shows.

Can anyone share updates on this direction of research, such as more recent papers?