
There's a bit more context rumbling under the surface.

Not too long ago, Qualcomm bought NUVIA, a designer of high-performance arm64 cores that can theoretically compete with Apple cores on perf. Arm pretty much immediately sued, saying that the specifics of the licenses Qualcomm and NUVIA hold mean that cores developed under NUVIA's license can't be transferred to Qualcomm's license.[0] Qualcomm obviously disagrees. Whatever happens, those cores as they exist today are going to be stuck in litigation for longer than they're relevant.

Qualcomm's proposal smells strongly like they're doing the minimum to strap a RISC-V decoder to the front of these cores. For whatever reason they seem hell-bent on only changing the part of the front end that's the 'pure function that converts bit patterns of ops to bit patterns of micro-ops'. AArch64 has only 32-bit-aligned, fixed-length instructions, so they don't want to support anything else.

At the end of the day, the C extension really isn't that bad to support in a high-perf core if you go in wanting to support it. The canonical design (not just for RISC-V, but for high-end designs like Intel's and AMD's too) is to have I$ lines fill into a shift register, have some hardware at whatever period your alignment boundary is that reports 'if an instruction started here, how long is it', and a second stage (logically; it doesn't have to be an actual clock stage) that looks at all of those reports, generates the instruction boundaries, and feeds them into the decoders. At this point everything is also marked for validity (i.e. did an I$ line not come in because of a TLB permissions failure or something).
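
A minimal software sketch of that two-step scheme, assuming a fetch window of 16-bit parcels and the RISC-V length rule (the helper names are made up, and real hardware does this with parallel logic rather than a loop):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* One entry per 16-bit parcel in the fetch window (the shift register). */
    typedef struct {
        uint16_t bits;   /* the parcel itself */
        bool     valid;  /* did the I$ line actually arrive, permissions OK? */
    } parcel_t;

    /* Step 1: per-parcel length report -- "if an instruction started here,
     * how many parcels long would it be?"  RISC-V rule: low two bits == 0b11
     * means a 32-bit instruction, anything else is a 16-bit (C) instruction.
     * Longer reserved encodings are ignored here for simplicity. */
    static size_t length_if_start_here(uint16_t parcel) {
        return ((parcel & 0x3) == 0x3) ? 2 : 1;
    }

    /* Step 2 (logically a second stage, not necessarily a clock stage):
     * walk the reports, mark instruction boundaries, and hand the starts
     * to the decoders.  Stops at the first invalid parcel. */
    static size_t mark_boundaries(const parcel_t *win, size_t n_parcels,
                                  size_t *starts, size_t max_insns) {
        size_t n = 0, i = 0;
        while (i < n_parcels && n < max_insns && win[i].valid) {
            size_t len = length_if_start_here(win[i].bits);
            if (len == 2 && (i + 1 >= n_parcels || !win[i + 1].valid))
                break;            /* second half of the instruction missing */
            starts[n++] = i;      /* decoder n gets the parcel(s) at starts[n] */
            i += len;
        }
        return n;                 /* instructions fed to the decoders this cycle */
    }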

[0] - https://www.reuters.com/legal/chips-tech-firm-arm-sues-qualc...



> Qualcomm's proposal smells strongly like they're doing the minimum to strap a RISC-V decoder to the front of these cores.

Hmm.. At about the same time as the proposal to drop C from RVA, Qualcomm also proposed an instruction-set extension [1] that smells very much of ARM's ISA (at least to my nose). It also has several issues to criticise, IMHO.

[1] https://lists.riscv.org/g/tech-profiles/attachment/332/0/cod...


The 32-bit-aligned-instruction assumption is probably baked into their low-level caches, branch predictors, etc. That might mean much more significant work to switch to 16-bit-aligned instructions than they are willing to do.


I don't think anyone has baked instruction alignment into their caches since the early 2000s, and adding an extra bit to the branch predictors isn't that big of a deal. It's got to be the first or second stage of their front end, right before the decoders.


Why not bake instruction alignment into the cache? When you can assume instructions will always be 32-bit aligned, you can simplify the icache read port and the data path from the read port to the instruction decoder. Seems like it would be an oversight not to optimise for that.

Though, I suspect that's an easy problem to fix. The more pressing issue is what happens after the decoders. I understand this is a very wide design, decoding say 10 instructions per cycle.

There might be a single 16-bit instruction in the middle of that 40-byte block, changing the alignment halfway through. To keep the same throughput, Qualcomm now needs 20 decoders, one attempting to decode at every 16-bit boundary. The extra decoders waste power and die space.

Even worse, they somehow need to collect the first 10 valid instructions from those 20 decoders. I really doubt they have enough slack to do that inside the decode stage, or the next stage, so Qualcomm might find themselves adding an entire extra pipeline stage (probably before decode, so they can have 20 simpler length decoders feeding into 10 full decoders in the next stage) just to deal with possibly misaligned instructions.

I don't know how flexible their design is; it's quite possible adding an entire extra pipeline stage is a big deal. Much bigger than just rewriting the instruction decoders for 32-bit RISC-V.


Because RISC-V was designed to make length decoding trivial: you simply need to look at the low two bits of each 16-bit parcel to tell whether it's a 32-bit or 16-bit instruction. At that point, spending the extra I$ budget isn't worth it. Those 20 'simple decoders' are literally just one 2-input NAND gate each. Adding complexity to the I$ hasn't even made sense for x86 in two decades, because of the extra area needed for the I$ versus the extra decode logic. And that's an ISA where the extra length decode legitimately is an extra pipeline stage.

> I don't know how flexible their design is; it's quite possible adding an entire extra pipeline stage is a big deal. Much bigger than just rewriting the instruction decoders for 32-bit RISC-V.

I'm sure it is legitimately simpler for them. I'm not sure we should bend over backwards and bring down the rest of the industry because they don't want to do it. Ventana's Veyron and Tenstorrent have been showing off high-perf designs with RV-C.


It doesn't matter how optimised the length decoding is. Not doing it is still faster.

For an 8-wide or 10-wide design, the propagation delays are getting too long to do it all in a single cycle. So you need the extra pipeline stage. The longer pipeline translates to more cycles wasted on branch mispredicts.

RISC-V code is only about 6-14% denser than AArch64 [1]; I'm really not sure the extra complexity is worth it. Especially since AArch64 still ends up with a lower instruction count, so it will be faster whenever you are decode-limited instead of icache-limited.

> Adding complexity to the I$ hasn't even made sense for x86 in two decades

Hang on. Limiting the Icache to only 32-bit-aligned access actually simplifies it.

And since the NUVIA core was originally an AArch64 core, why wouldn't they optimise for hardcoded 32-bit alignment and get a slightly smaller Icache?

[1] https://www.bitsnbites.eu/cisc-vs-risc-code-density/


> Hang on. Limiting the Icache to only 32-bit-aligned access actually simplifies it.

Even x86 only reads 16- or 32-byte-aligned fields out of the I$, then shifts them. There's no extra I$ complexity. You still have to do that shift at some point, in case you jump to an address that isn't 32-byte aligned. You also ideally don't want to hit peak decode bandwidth only when starting on 32-byte-aligned program counters, so that whole shift-register thing is pretty much a requirement. And that's where most of the propagation delays are.
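
A sketch of what that looks like, assuming hypothetical 32-byte fetch blocks (the helper is made up; a real front end does the shift with a rotator/mux network rather than a memcpy):

    #include <stdint.h>
    #include <string.h>

    #define FETCH_BLOCK 32   /* bytes handed out per aligned I$ read */

    /* The I$ read port only ever supplies naturally aligned blocks.
     * A PC that lands mid-block is handled by reading this block and
     * the next one, then shifting -- the cache itself stays simple. */
    static void fetch_window(const uint8_t blocks[2 * FETCH_BLOCK],
                             uint64_t pc,
                             uint8_t window[FETCH_BLOCK]) {
        uint64_t offset = pc & (FETCH_BLOCK - 1);      /* byte offset within block */
        memcpy(window, blocks + offset, FETCH_BLOCK);  /* the shift/rotate step */
    }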

> RISC-V code is only about 6-14% denser than AArch64 [1]; I'm really not sure the extra complexity is worth it. Especially since AArch64 still ends up with a lower instruction count, so it will be faster whenever you are decode-limited instead of icache-limited.

There's heavy use of fusion, and FWIW the M1 also heavily fuses into micro-ops (and I'm sure the AArch64 version of NUVIA's cores does too).


Under classic RISC architectures you can't jump to non-aligned addresses. That lets you specify jumps that are 4 times longer for the same number of bits in your jump instruction. Here's MIPS as an example:

https://en.wikibooks.org/wiki/MIPS_Assembly/Instruction_Form...
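
For instance, a sketch of the classic MIPS J-type target calculation: because every instruction is 32-bit aligned, the 26-bit target field is stored as a word index and shifted left by two, quadrupling the reachable range to a 256 MiB region:

    #include <stdint.h>

    /* MIPS J/JAL: 6-bit opcode + 26-bit word index.  The low two bits of
     * the target are implicitly zero, so 26 encoded bits cover 28 bits
     * (256 MiB) of address space. */
    static uint32_t mips_jump_target(uint32_t pc_of_next_insn, uint32_t instr) {
        uint32_t index = instr & 0x03FFFFFFu;   /* 26-bit target field */
        return (pc_of_next_insn & 0xF0000000u) | (index << 2);
    }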


Classic RISC was targeting about 20k gates and isn't really applicable here.


AArch64 does the same thing.

https://valsamaras.medium.com/arm-64-assembly-series-branch-...

And it's not only a way of decreasing code size. It helps with security too. If you can have an innocuous-looking bit of binary starting at address X that turns into a piece of malware if you jump to address X+1, that's a serious problem.

https://mainisusuallyafunction.blogspot.com/2012/11/attackin...

RISC-V, I'm pretty sure, enforces 16-bit alignment and is self-synchronizing, so it doesn't suffer from this despite being variable-length. But if it allowed the PC to be pointed at an instruction with a 1-byte offset, then it might.

As far as I'm aware every RISC ISA that's had any commercial success does this: HP PA-RISC, SPARC, POWER, MIPS, Arm, RISC-V, etc.


> And it's not only a way of decreasing code size.

And RISC-V has better code density than AArch64.

> It helps with security too. If you can have an innocuous-looking bit of binary starting at address X that turns into a piece of malware if you jump to address X+1, that's a serious problem.

JIT spraying attacks work just fine on aligned architectures too, which is why Linux hardened the AArch64 BPF JIT as well: https://linux-kernel.vger.kernel.narkive.com/M0Qk08uz/patch-...

Additionally, MIPS these days has a compressed extension to its ISA too, heavily inspired by RV-C. https://mips.com/products/architectures/nanomips/


Not all JIT spraying relies on byte offsets to get past JIT filters; the attack I gave is just an example.

And nanoMIPS requires instructions to be aligned just like everybody else; it just requires 16-bit alignment rather than 32-bit. Attempting to access an odd PC address will result in an access error, according to this:

https://s3-eu-west-1.amazonaws.com/downloads-mips/I7200/I720...


> And nanoMIPS requires instructions to be aligned just like everybody else; it just requires 16-bit alignment rather than 32-bit. Attempting to access an odd PC address will result in an access error, according to this:

That's the same as RV-C.


Right, and I mentioned RISC-V as yet another sane RISC architecture that requires word alignment in instruction access. But the fact that it requires alignment means that the word size has implications for the instruction cache design and the complexity of the piping there.

I don't have a strong opinion on whether the C extension is a net good or bad for high performance designs, but I do strongly believe that it comes with costs as well as benefits.


Back in 2019, RISC-V code was 15-20% smaller than x86 code (up to 85% smaller in some cases) and 20-30% smaller than ARM64 code (up to 50% smaller in some cases).

https://project-archive.inf.ed.ac.uk/ug4/20191424/ug4_proj.p...

Since then, RISC-V has added a bunch more instructions that Arm/x86 already had, which has made RISC-V code even smaller relative to them.


No idea if this is true for Qualcomm, but people from Rivos have also been in that meeting arguing against the C extension, and as far as I know Rivos has no in-house Arm cores they're trying to reuse.


Rivos was formed from a bunch of ex-Apple CPU engineers. I'm sure they would feel more comfortable with a design closer to an AArch64-derived one as well.


They might also know a bunch of techniques for getting high performance that only work if you've got nice, fixed 32-bit-aligned instructions!


Haha, that'd be a little counterintuitive, given that all 32-bit aligned is the trivial case for decoding variable-length instructions, unless you're thinking about prefetching/branch prediction, etc.


> all 32-bit aligned is the trivial case for decoding variable-length instructions

That's the point? You can go faster if everything is 32-bit aligned, i.e. you don't have variable-length instructions.


The shift-register design sounds quite expensive. You're essentially constructing <issue width> crossbars, each 32 bits times <comparator width>, connected to a bunch of comparators to determine instruction boundaries. In a wide design you also need to do this across multiple 32-bit lines.


Well, half that, because the instructions are 16-bit aligned. And approaching half of even that, because not every decoder needs access to every offset. Decoder zero doesn't need any muxing, decoder one only needs two inputs, etc.

But you need most of that anyway because you need to handle program counters that aren't 32-byte aligned, so you either do it before hitting the decoders, or afterwards when you're throwing the micro-ops into the issue queues (which are probably much wider and therefore more expensive).
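
A back-of-the-envelope sketch of that pruning (my own illustrative numbers for a hypothetical 10-wide decode, not anyone's real design): if instruction k can only start at half-word offsets k through 2k, then decoder k needs k+1 mux inputs rather than a view of the whole window:

    #include <stdio.h>

    int main(void) {
        enum { WIDTH = 10 };          /* decoders per cycle (hypothetical) */
        int total = 0;
        for (int k = 0; k < WIDTH; k++) {
            int inputs = k + 1;       /* decoder k can start at half-words k..2k */
            total += inputs;
            printf("decoder %2d: %2d possible start offsets\n", k, inputs);
        }
        printf("total mux inputs across all decoders: %d\n", total);
        return 0;
    }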



