Only so long as (a) the gains are real, not overfitting to the test dataset, and (b) complexity doesn't balloon to the point where the stack of approaches becomes impossible to manage.
Point (a) is extremely hard to discern, especially when people are chasing third-significant-digit gains on common benchmarks; it's essentially multiple-testing false discovery in action. I've seen whole families of methods fail to transfer to new domains...
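To make the multiple-testing point concrete, here's a toy simulation (numbers are hypothetical, just for illustration): 50 "methods" that are all truly identical to a 90%-accurate baseline, each scored once on the same shared test set. The best of the 50 almost always appears to beat the baseline by a few tenths of a percent purely by chance.

```python
import random

random.seed(0)

# Toy model of benchmark chasing: 50 "methods" that are all
# actually no better than the baseline (true accuracy 0.90),
# each evaluated once on the same 10,000-example test set.
N_METHODS = 50
TEST_SIZE = 10_000
TRUE_ACC = 0.90

def observed_accuracy():
    # Each test example is an independent coin flip at the true rate.
    return sum(random.random() < TRUE_ACC for _ in range(TEST_SIZE)) / TEST_SIZE

scores = [observed_accuracy() for _ in range(N_METHODS)]
best = max(scores)

# The "winner" typically edges out the true rate by a few tenths of
# a percent -- the third-significant-digit gain -- even though every
# method is identical by construction.
print(f"true accuracy: {TRUE_ACC:.4f}")
print(f"best observed: {best:.4f}")
print(f"apparent gain: {best - TRUE_ACC:+.4f}")
```

The chance that none of the 50 beats the baseline is roughly 0.5^50, so a spurious "state of the art" is essentially guaranteed.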
Point (b) is also a real issue. As you pile on bells and whistles, each with its own hyperparameters that interact non-linearly with model quality, it becomes impossible to say what's actually working and what isn't.
In practice, I think we see cycles of baroque incremental improvements, followed by someone spending a year stripping away the bullshit and getting something simple that outperforms the pack, essentially because it's easier to do hyperparameter search over a simple model once you've figured out which bits actually matter.
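The search-space arithmetic behind that last point is easy to sketch. Assuming, hypothetically, a stack of 5 add-on tricks with 3 hyperparameters each and 4 candidate values per hyperparameter, a full grid over the interacting stack dwarfs the grid for one simple model:

```python
# Back-of-the-envelope cost of tuning a stacked model versus a
# simple one. All counts below are hypothetical illustrations.
tricks = 5
hparams_per_trick = 3
values_per_hparam = 4

# Full grid over the combined stack: the hyperparameters interact,
# so you can't tune each trick independently.
full_grid = values_per_hparam ** (tricks * hparams_per_trick)

# Tuning one simple model with just 3 hyperparameters instead.
simple_grid = values_per_hparam ** hparams_per_trick

print(f"stacked search space: {full_grid:,} configurations")
print(f"simple search space:  {simple_grid:,} configurations")
```

That's 4^15, over a billion configurations, versus 64: the simple model can be searched exhaustively while the stacked one can only ever be sampled, which is exactly why stripping it down pays off.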