Well, for example, in our systems all API calls only move from a known state to an...

Negitivefrags · on Feb 27, 2019

To me, it sounds like you are saying "the in-flight requests fail."

I really don't like the idea of saying that it's simply okay to give random users a bad user experience like that when you are actually killing servers yourself all the time.

dbaggerman · on Feb 27, 2019

It's a different approach to managing risk -- minimizing impact of failure rather than minimizing the likelihood of failure.

It's nice to know that you can kill a process and the only impact is that in-flight requests fail, rather than having a more significant outage if a process crashes and the failover doesn't work, or the process doesn't automatically restart, etc.

If you accept that requests will fail, you can build retries into the system. It's a lot harder to make a system more resilient if you avoid testing the failure scenarios.
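
As a rough illustration of building retries in, here is a minimal sketch with exponential backoff. `TransientError` and `call_with_retries` are hypothetical names made up for this example, not part of any library mentioned in the thread, and the sketch assumes the request is safe to repeat (idempotent).

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, e.g. a dropped in-flight request."""


def call_with_retries(request, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call request() up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return request()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is only there so the backoff can be swapped out in tests. In a real system you would likely also add jitter and a cap on the total time spent retrying.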

lklig · on Feb 28, 2019

Exactly! Chaos engineering is all about thoughtfully planned experiments that let you observe what the user experience will be when something fails. Doing this on your own terms allows you to improve the experience so that your customers aren't affected.

You can decide what happens when an in-flight request is dropped: either you hold onto the state somehow and retry, or the client fails gracefully with a relevant error message.

dkersten · on Feb 28, 2019

Another thing that's not often caught by "normal" testing but that chaos engineering can capture is when multiple things fail together in random ways. It can be surprising how otherwise robust services can fail badly when multiple things go wrong at once.
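
One way to picture this is a sketch that injects faults in combination instead of one at a time. It enumerates the combinations systematically rather than picking them at random, and `check_system` is a hypothetical callback standing in for whatever health check your experiment uses. The fault names are made up.

```python
import itertools


def failure_combinations(faults, max_simultaneous=2):
    """Yield every group of 1..max_simultaneous faults to inject together."""
    for size in range(1, max_simultaneous + 1):
        yield from itertools.combinations(faults, size)


def run_experiments(faults, check_system, max_simultaneous=2):
    """Return the fault combinations under which check_system reports unhealthy."""
    return [
        combo
        for combo in failure_combinations(faults, max_simultaneous)
        if not check_system(set(combo))
    ]


# Hypothetical system that survives any single fault
# but not losing the database and the cache together.
def survives(active_faults):
    return not {"db", "cache"} <= active_faults


bad = run_experiments(["db", "cache", "queue"], survives)
```

In this toy model every single-fault experiment passes, and only the combined experiment exposes the weakness.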

zeckalpha · on Feb 28, 2019

When you have nested service calls, a single downstream failure shouldn’t, in most cases, propagate all the way back and fail the root request.
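
A minimal sketch of that idea, assuming a page that depends on one essential call and one optional call. All of the names here (`with_fallback`, `render_product_page`, and the two fetch functions) are hypothetical.

```python
def with_fallback(call, fallback):
    """Run call(); if it raises, return the fallback value instead."""
    try:
        return call()
    except Exception:
        return fallback


def render_product_page(fetch_product, fetch_recommendations):
    """Build the page, degrading gracefully if the optional call fails."""
    # Essential dependency: a failure here is allowed to reach the root request.
    product = fetch_product()
    # Optional dependency: a failure here is contained and replaced by a default.
    recommendations = with_fallback(fetch_recommendations, fallback=[])
    return {"product": product, "recommendations": recommendations}
```

Which calls count as optional is a product decision. The "in most cases" above matters: some downstream failures should still fail the whole request.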