A Short Note on Resilient Systems

I interpret resilient systems thus:

Resilient systems are difficult to build. Resilience is the capacity to withstand difficulties or to recover from them quickly. The temptation to build systems without expecting difficulty is strong. But once you build software that will be used by more than a few hundred people, difficulty becomes a natural part of the system. There will be surges in requests, services will go down, memory leaks will happen, and so on. You want your systems to keep working despite the worst-case scenarios you haven’t anticipated. And when they do stop working, you want to know about it. You want them to handle the failure without making the problem worse. You want them to deal properly with transactions that were in progress when they stopped working.

The list of things to keep in mind when building a resilient system is not exhaustive, but the following have stood out for me:

Happy resilience.