Wait, so are you telling me all of the hardware/ISA is actually fully accessible and functional, and it's just an artificial PTX -> SASS compiler limitation?
Because the official NVidia stance is definitely that TMEM, etc. is not supported and doesn't work.
...I don't suppose you have a link to a repo with code that can trigger any of this officially forbidden functionality?
Very interesting! Thanks! I'll definitely keep a close eye on that repo.
Anyhow, be that as it may, I was talking about the PTX mnemonics and such because I'd like to use this functionality from my own, custom kernels, and not necessarily only indirectly by triggering whatever lies at the bottom of NVidia's abstraction stack.
So what's your endgame with your proofs? You wrote "the breaking point was implementing an NVFP4 matmul" - so do you actually intend to implement an NVFP4 matmul? (: If so I'd be very much interested; personally I'm definitely still in the "cargo-cults from CUTLASS examples" camp, but would love something more principled.
Because the official NVidia stance is definitely that TMEM, etc. is not supported and doesn't work.
...I don't suppose you have a link to a repo with code that can trigger any of this officially forbidden functionality?