Stable Diffusion (SD) has a very primitive conceptual model: basically a "bag of words nudging pixels around for a while." Words near each other in the prompt influence each other, but there's almost no understanding of grammar.
Midjourney (MJ) is similar with text prompts, but with image prompts it can separate content from style: give it a photo of two people and it can return many images of recognizable approximations of those people in different poses.
SD can only start from pixels, blurring and deblurring those pixels in place.
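To make the "nudging pixels in place" idea concrete, here is a toy sketch. In a real diffusion model the denoiser is a learned U-Net conditioned on the prompt; here it is a hypothetical stand-in that just pulls noisy pixels a fraction of the way back toward a target, which is enough to show the start-from-noise, refine-in-place loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": diffusion starts from pure pixel noise and
# iteratively denoises in place -- no scene graph, no grammar.
image = np.zeros((8, 8))                          # the "clean" target
noisy = image + rng.normal(0.0, 1.0, image.shape)  # fully noised start

def denoise_step(x, target, strength=0.3):
    # Stand-in denoiser (assumption): a real model predicts the noise
    # with a neural net; here we just nudge pixels toward the target.
    return x + strength * (target - x)

for _ in range(20):
    noisy = denoise_step(noisy, image)

# After enough steps the pixels have been nudged close to the target.
print(float(np.abs(noisy - image).mean()))
```

The point of the sketch is the shape of the process, not the math: every step operates on the same pixel grid, which is why SD struggles with anything that requires reasoning about objects rather than textures.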
MJ's image prompts probably work via an image-to-tokens step whose output is added on to the usual text-to-tokens-to-image pipeline.
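That guess can be sketched as follows. Everything here is an assumption: the encoders are hypothetical stand-ins (a real system would use learned text and image encoders), and the only idea being illustrated is that image-derived tokens get appended to text-derived tokens into one conditioning sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # embedding width shared by both modalities (assumed)

def text_to_tokens(prompt):
    # Hypothetical text encoder: one deterministic pseudo-token per word.
    return np.stack(
        [np.full(d, sum(map(ord, w)) % 97 / 97.0) for w in prompt.split()]
    )

def image_to_tokens(image, n_tokens=4):
    # Hypothetical image encoder: pool pixels into a few summary tokens.
    flat = image.reshape(n_tokens, -1)            # (n_tokens, pixels_per_token)
    return flat.mean(axis=1, keepdims=True) * np.ones((n_tokens, d))

text_tokens = text_to_tokens("portrait oil painting")   # (3, d)
img_tokens = image_to_tokens(rng.normal(size=(8, 8)))   # (4, d)

# The conjecture: image tokens are simply appended to the text tokens,
# so the generator conditions on one combined token sequence.
conditioning = np.concatenate([text_tokens, img_tokens], axis=0)
print(conditioning.shape)  # (7, 16): 3 text tokens + 4 image tokens
```

If something like this is what MJ does, it would explain why image prompts carry content (the image tokens summarize subjects) without locking in pixels the way SD's start-from-pixels approach does.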