
This is mostly going to be the same. The main difference with tensors is that your vectors are not contiguous in memory, so you lose some of the speedup. The typical way to optimize this is to flatten your tensor: then your data is contiguous and all the same rules apply.
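
For example, here's a minimal sketch of the flattening idea (host-side C++, compiles with nvcc; the 3x4 shape and the values are made up for illustration):

    #include <cstdio>
    #include <vector>

    int main() {
        // A hypothetical 3x4 tensor stored flat in row-major order:
        // element (r, c) lives at index r * cols + c, so a full row
        // is one contiguous run of memory the prefetcher can stream.
        const int rows = 3, cols = 4;
        std::vector<float> t(rows * cols);

        for (int r = 0; r < rows; ++r)
            for (int c = 0; c < cols; ++c)
                t[r * cols + c] = static_cast<float>(r * cols + c);

        // Strided access (walking a column) jumps `cols` floats per
        // step and wastes most of each cache line; streaming a row
        // through the flat buffer does not.
        float col_sum = 0.0f;
        for (int r = 0; r < rows; ++r)
            col_sum += t[r * cols + 1];  // column 1, stride = cols

        printf("col_sum = %f\n", col_sum);
    }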

This is overly simplified though. Things get different when we start talking about {S,M}I{S,M}D (the page addresses SIMD) or GPUs. Parallel computing is a whole other game, and CUDA takes it to the next level: lots of people can write CUDA kernels, but not a lot of people can write GOOD CUDA kernels (I'm definitely not a pro, but I know some). Parallelism adds another layer of complexity because many threads can access different memory locations simultaneously. If you think concurrent optimization is magic, parallel optimization is wizardry (I still believe this is true).
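
To make the kernel point concrete, here's a minimal sketch (the classic SAXPY teaching example, not anything from this page). Spawning one thread per element is the easy part; the part that separates a good kernel from a merely working one is details like the access pattern below, where adjacent threads touch adjacent addresses so the memory accesses coalesce:

    #include <cstdio>
    #include <cuda_runtime.h>

    // One thread per element. Because thread i touches element i,
    // adjacent threads in a warp access adjacent addresses and the
    // hardware coalesces them into wide memory transactions.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        int block = 256;
        int grid = (n + block - 1) / block;  // ceil(n / block)
        saxpy<<<grid, block>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        printf("y[0] = %f\n", y[0]);  // expect 4.0
        cudaFree(x);
        cudaFree(y);
    }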


