What is NGG and shader culling on AMD RDNA GPUs?

iforgotpassword · on July 15, 2022

> All of these HW stages (except GS, of course)

...Of course...

The intro was pretty rough, it expects a lot of knowledge about the topic. I almost stopped reading there. The second part that explains the possibilities and talks about real world performance impact was pretty interesting though, so, glad I kept reading.

tpxl · on July 15, 2022

It'd be nice if it explained the acronyms the first time they are used.

striking · on July 15, 2022

They link to a doc that has a glossary[1] but you're not wrong, it would be helpful.

1: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/amd...

cylon13 · on July 15, 2022

Even knowing what they all are it’s annoying to read compared to having them spelled out. I’m constantly pausing to translate. If you’re talking to someone IRL you just say “hardware vertex shader,” not “HW VS”.

Jasper_ · on July 15, 2022

Yeah, I know all these terms, but I only learned them from working on the PS4 for a long time. Even those quite familiar with graphics might struggle to read this article.

GS = geometry shader, and it's main gimmick is allowing you to generate triangle topology on the GPU. So that's why it should be "obvious" that a GS lane can control more than one output vertex. GS was also a terrible idea, because it's slow.

jayd16 · on July 15, 2022

Hey, geometry shaders can be awesome but they do have limited use. I guess it might fall out of favor if we end up with Nanite taking over.

_eojb · on July 15, 2022

Them falling out of favor has nothing to do with nanite FYI, and you would not want to use GS for any of the benefits a tech like nanite gives you because it's far too slow. We've all moved to CS for that sort of things if you can't rely on AS/MS

dragontamer · on July 15, 2022

A few notes.

1. Rasterization is, conceptually at least, the painter's algorithm. You conceptually paint the "backmost" triangle, then paint the triangles "on top" afterwards. As long as you paint from furthest back, to front, you will get the right image. The textbook algorithm sorts all triangles from furthest back to front, but O(n*log(n)) sort is obviously too slow for 100-million triangles 60-frames-per-second that video-gamers want.

2. Culling modifies the #1 algorithm by removing triangles from consideration. Traditionally, this is done by ASIC parts of a GPU (I mean, traditionally, the whole GPU was ASIC and non-programmable. But even just a few years ago, hardware culling was not yet done in shaders). If Triangle#5 is completely covered by Triangle#200, then you can "optimize" by never drawing #5 to begin with, and instead just drawing Triangle#200. The GPU's hardware can detect cases like this and automatically skip the "Drawing of Triangle#5)

3. Primitive Shaders / NGG and other features from Vega onwards of AMD allow for data to be passed between the rendering pipeline in new ways. This seems to enable *software* culling.

4. Its not too hard to do software culling per se. What's hard is to do software culling that its worthwhile (aka: faster than the hardware ASIC culling). The claims here are suggesting that software culling is finally worthwhile thanks to these new shading units, new ways of passing data back-and-forth between stages of the GPU. With these new datapaths, it is possible to implement a software culler that matches (or exceeds) the speed of the hardware culler.

5. That's what a "shader culling" is. Software culling that's faster than the ASIC-paths of the GPU. Fully defined in software, so you can make them more flexible / tuned for your specific video game than the hardware.

6. Video games do often have CPU-side culling before sending the data to the GPU side. Its also culling, but in a different context. I believe that "shader culling" implies the low-level, fine-grained culling that the GPU-hardware was expected to do, rather than the coarse-grained "X character is on the wrong side of the camera so don't draw X or the 200,000 triangles associated with X" that CPUs have always done.

7. On #6's note, there's lots of kinds of culling done at many stages of any video game engine today, from software/CPU side all the way to the low level GPU stuff. Since this seems to be a low-level GPU driver post, you can assume that they're talking about low-level GPU hardware culling specifically.

8. I'm not actually a video game programmer. I just like researching / learning about this field.

faragon · on July 15, 2022

Great explanation, thank you.

phkahler · on July 15, 2022

When he says the main performance benefit is when games over tesselate, I'd say this is just an invitation for everyone to increase tesselation and bump mapping for more detail.

rasz · on July 15, 2022

>invitation for everyone to increase tesselation

that ship sailed with the introduction of Gameworks - an Nvidia program that paid game studios to deoptimize games on AMD hardware.

https://techreport.com/review/21404/crysis-2-tessellation-to...

https://wccftech.com/fight-nvidias-gameworks-continues-amd-c...

"Number one: Nvidia Gameworks typically damages the performance on Nvidia hardware as well, which is a bit tragic really. It certainly feels like it’s about reducing the performance, even on high-end graphics cards, so that people have to buy something new.

"That’s the consequence of it, whether it’s intended or not - and I guess I can’t read anyone’s minds so I can’t tell you what their intention is. But the consequence of it is it brings PCs to their knees when it’s unnecessary. And if you look at Crysis 2 in particular, you see that they’re tessellating water that’s not visible to millions of triangles every frame, and they’re tessellating blocks of concrete – essentially large rectangular objects – and generating millions of triangles per frame which are useless."

Jasper_ · on July 15, 2022

Well bump mapping happens in the pixel shader so that wouldn't affect tricounts. The traditional to way fight "too many triangles" is with LODs, but LODs aren't the perfect solution since triangle-screen density changes with camera settings too.

I wouldn't expect massive, massive gains from the new culling, as suggested. The instancing demo gets gains only because it didn't have any CPU-side frustum culling to begin with. But real games have frustum culling nowadays.

sylware · on July 15, 2022

Anybody wanna make a simple C-coded SPIR-V assembler in mesa, that to remove that horrible glsl radix sort?

Jasper_ · on July 15, 2022

SPIRV-Tools has an assembler...

sylware · on July 15, 2022

not simple C coded, as far as I know it is horrible c++.