Why would you need a 64-bit instruction; what kinds of things is it going to be used for?
What does 'rare' mean here: does it mean rare in execution, or rarely appearing in code? (The difference being that something might only appear once in your code but be part of your hot loop, and so be executed any number of times.)
If they are rare in execution, what is their value over composing them out of 32-bit instructions, where the (rare) overhead of doing so would typically be amortised away?
(The only thing I can think of that 64-bit instructions seem suited to is some kind of internal CPU-management instructions, but context switches etc. are relatively rare & very expensive anyway so... I don't know)
From the RVI thread on 48-bit instructions; 64-bit ones would probably look similar:
> There are several 48-bit instruction possibilities.
> 1. PC-relative long jump
> 2. GP-relative addressing to support large small data area, effectively giving GP-relative access to entire data address space of most programs
> 3. Load upper 32-bits of 64-bit constants or addresses
> 4. Or lower 32-bits of 64-bit constants or addresses
> 5. And with 32-bit mask
> 6. More effective ins/ext of 64-bit bit fields
Another thing that's often discussed is moving the vtype and setvl into each vector instruction; I'm not sure if that requires 48- or 64-bit instructions.
I was really asking about 64-bit instructions specifically, but going with what you've put, if you don't mind...
> 1. PC-relative long jump
My understanding is that these are rare
> 2. GP-relative addressing to support large small data area, effectively giving GP-relative access to entire data address space of most programs
What is 'GP' here? But as for "...access to entire data address space of most programs": in this case you are just going to be bouncing all over the address space, missing every level of cache much of the time, surely? Maybe you get a little extra code density, but you aren't going to get any extra speed to speak of.
> 3. Load upper 32-bits of 64-bit constants or addresses
> 4. Or lower 32-bits of 64-bit constants or addresses
> 5. And with 32-bit mask
Well yeah, but how common is this? I understand the Alpha architecture team looked at this and found it uncommon, which is why they were okay with less-than-32-bit constants. If it really sped things up you might build a specific cache to store constants (a kind of larger, stupider register set). It would seem a simpler solution.
I'm not sure what you mean with 6, and I'm not familiar with vtype/setvl
On vtype/setvl: in the RISC-V V extension (aka RVV / Vector (≈SIMD)), due to the 32-bit instruction length, there's a separate instruction that sets some configuration (operated-on element size, register group size, masked-off element behaviour, target element count), which subsequent arithmetic/etc. operations then obey. So e.g. if you wanted to add vectors of int32_t-s, you'd need something like "vsetvli x0,x0,e32,m1,ta,ma; vadd.vv dst,src1,src2"
Often one vsetvl stays valid for multiple/most/all instructions, but sometimes there's a need to toggle it for a single instruction and then toggle it back. With 48-bit or 64-bit instructions, such temporary changes could be encoded in the operation instruction itself.
Additionally, masked instructions always mask by v0, which could be expanded to allow any register (and perhaps built-in negation) by more instruction bits too.
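For concreteness, the toggle-and-toggle-back pattern looks something like this (a sketch; the register choices and the e8 operation in the middle are made up for illustration):

```
# process int32 elements, with one op that needs 8-bit elements in the middle
vsetvli t0, a0, e32, m1, ta, ma   # configure: 32-bit elements, group size 1
vadd.vv v1, v2, v3                # runs under e32
vsetvli x0, x0, e8, m1, ta, ma    # temporary switch to 8-bit elements (keeps vl)
vadd.vv v4, v5, v6                # runs under e8
vsetvli x0, x0, e32, m1, ta, ma   # switch back
vadd.vv v7, v8, v9                # back under e32
```

With 48- or 64-bit instructions, the two extra vsetvlis could in principle fold into the e8 operation itself.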
Depends on how many bits you had to start with. On Power ISA they aren't common either, but when they happen you need up to seven instructions (lis, ori, rldicl, oris, ori, then for branches mtctr/b(c)ctr) to specify the new address or larger value. Most other RISCs are similar when full 64-bit values must be specified. This is a significant savings.
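For example, materialising an arbitrary 64-bit constant without such instructions takes a sequence along these lines (a sketch; exact instruction choice varies by compiler, and the rotate trick shown only works because the upper half here is non-negative):

```
# build 0x1122334455667788 in r4 on 64-bit Power
lis    r4, 0x1122          # r4 = 0x0000000011220000 (sign-extended)
ori    r4, r4, 0x3344      # r4 = 0x0000000011223344
rldicl r4, r4, 32, 0       # rotate left 32: r4 = 0x1122334400000000
oris   r4, r4, 0x5566      # r4 = 0x1122334455660000
ori    r4, r4, 0x7788      # r4 = 0x1122334455667788
```

Branching to such an address then costs the mtctr/bctr pair on top.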
Well you can embed longer immediates directly in the opcode.
You could have a lot more registers.
The first example, I'm not sure you'd want a full 64-bit encoding space. You still aren't going to be able to load a 64-bit immediate directly, so I'd rather see an instruction that uses the next instruction slot as the immediate. But then 50% of the time you're still going to be padding this to 64-bit alignment, so it's unclear to me that this is a benefit over two lots of the same but with 32-bit immediates.
The second option is interesting. But if you've got 256 addressable registers, say, what use are the 32- and 16-bit instructions that can only address a tiny proportion of those registers?
How do you even use all those registers? Serious question. I've toyed with a couple of 256-register ISAs, and the moment you hit function calls/parameter passing you realize that to utilize those efficiently, you really need some way to indirectly refer to registers, be it register windows, or MMIX's register slide, or Am29k's IPA/IPB/IPC registers; the only other option seems to be to perform global register allocation but that hardly works in scenarios with separate compilation/dynamic code loading.
Off the top of my head I don't really know. But then if you had asked me 20 years ago if we'd need multi core multi GHz multi GB computers to display a web page I'd probably have said no.
I suppose the OS could reserve registers for itself, to save swapping them in and out quite so often.
Register windows for applications/functions/threads.
Or maybe something radically different, like get rid of the stack, and treat them conceptually like a list?
The sweet spot for scalar code is about 24 registers, but that leads to weird offset-bits (there's an ISA that does this, but I forget what it's called), so 32 registers is easier to implement and provides a mild improvement in the long tail of atypical functions.
On the flip side, the ability to have more registers is very good for SIMD/GPU applications.
Absolutely, I'm not saying a 64-bit instruction length with 5/6/7/8 bits of register fields would be bad per se. In fact I'd be interested to see where it leads.
But if you have a processor that also uses 16 bit instructions those extra registers become unusable. Thumb can't encode all registers in all instructions so you have the high registers that are significantly less useful than the low registers.
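To illustrate with Thumb-1 (a sketch): most 16-bit data-processing encodings only have 3-bit register fields, so they can reach r0–r7, while r8–r12 are limited to a few special forms:

```
adds r0, r1, r2    @ fine: three low registers, 3-bit fields each
@ adds r0, r1, r8  @ not encodable as a 16-bit Thumb-1 instruction
mov  r0, r8        @ high registers: mostly just MOV/ADD/CMP/BX forms
add  r8, r0        @ the hi-register ADD variant, doesn't set flags
```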
x86 is the same; I've never really done 64-bit ASM so I don't know if they improved that.
So then you may as well just divide up the registers so you've got 16 general-purpose registers and 16 registers for SIMD or whatever.
Power10 added "prefixed" instructions, which are effectively 64-bit instructions in two 32-bit halves (the nominal instruction size). They are primarily used for larger immediates and branch displacements.
MIPS could load a constant to the high or low half. More than 40 years ago the Transputer built larger constants with shift-and-load prefixes on its 8-bit instructions. Lots of ancient precedents for rare big constants.
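The Transputer scheme, roughly: every instruction byte is a 4-bit opcode plus a 4-bit operand nibble, and `pfix` shifts the accumulated operand left by 4 before the final instruction consumes it, so larger constants are built a nibble at a time (a sketch):

```
; load the constant 0x123 (three instruction bytes instead of one)
pfix 1      ; operand := 0x1
pfix 2      ; operand := (0x1 << 4) | 0x2 = 0x12
ldc  3      ; load constant (0x12 << 4) | 0x3 = 0x123
```

A complementary `nfix` handles negative values; small common constants stay one byte.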
So do classic PowerPC, SPARC, and many other ISAs. It's the most common way to handle it on RISC. The Power10 prefixed instruction idea just expands on it.
Personally, I like the idea of doubling the instruction length every time -- 16, 32, 64, 128, etc. There's a big use case on the longer instruction end for VLIW/DSP/GPU applications.