well, yes ... that's the point of occam[1]: if a program can hang, it will hang deterministically
we have to zoom out from the 1980s, when 4 CPUs were a lot ... now you can build 40,000 CPUs (i.e. a 200 x 200 array) within the single reticle limit (i.e. the same limit as a big NVIDIA chip), so a big MIMD machine has to be coded with algorithmic patterns like map-reduce, pipelining, etc.
but the general-purpose nature of the CPUs and the HLL coding mean that it is far easier than with CUDA to get close to theoretical max performance
[1] or any CSP implementation where both input and output deschedule the process until the other side is ready, i.e. no queueing
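
To make the pipelining pattern concrete, here is a rough sketch in Go rather than occam. That is my substitution: Go's unbuffered channels give the same "sender and receiver both wait for each other, no queue" behaviour as the footnote describes, but Go's scheduler and channel closing are not occam semantics. The names stage, source and runPipeline are made up for illustration.

```go
package main

import "fmt"

// source feeds vals into the pipeline over an unbuffered channel.
// Each send blocks until the next stage is ready to receive (no queue).
func source(vals []int) <-chan int {
	out := make(chan int) // unbuffered: rendezvous on every value
	go func() {
		defer close(out)
		for _, v := range vals {
			out <- v
		}
	}()
	return out
}

// stage is one pipeline worker (hypothetical helper): it reads from in,
// applies f, and passes the result on over its own unbuffered channel.
func stage(in <-chan int, f func(int) int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for v := range in {
			out <- f(v)
		}
	}()
	return out
}

// runPipeline wires source -> square -> add one, then reduces by summing.
func runPipeline(vals []int) int {
	squared := stage(source(vals), func(x int) int { return x * x })
	shifted := stage(squared, func(x int) int { return x + 1 })
	total := 0
	for v := range shifted {
		total += v
	}
	return total
}

func main() {
	fmt.Println(runPipeline([]int{1, 2, 3, 4})) // prints 34
}
```

Because nothing is buffered, a stage that is wired up wrongly blocks straight away instead of being hidden by a queue that happens to be large enough, which is roughly the property being claimed for occam above.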