A Short Note on Resilient Systems

I consider a system resilient if it does the following:

It handles a large number of requests with a relatively low error rate. The system error rate, as distinct from human error, should ideally stay below 5% during normal operation. The lower the error rate, the better, though the right target depends on context.
It handles requests with low latency, ideally within a few seconds and preferably in no more than a minute. Like the error rate, this depends on context.
It fails gracefully. When something goes wrong, the system communicates the failure and handles the failure in a way that prevents unexpected behaviour.
When the system resumes operations, it cleans up all pending transactions.

Resilient systems are difficult to build. Resilience is the capacity to withstand or to recover quickly from difficulties. The temptation to build systems without expectation of difficulty is strong. But once you build software that will be used by more than a few hundred people, difficulty becomes a natural part of the system. There will be surges in requests, services will go down, memory leaks will happen, etc. You want your systems to keep working despite all the worst case scenarios you don’t have in mind. And even when they do stop working, you want to know. You want them to handle the failure without making the problem worse. You want them to handle transactions that were in progress when they stopped working.

The list of things to keep in mind when building a resilient system is inexhaustible, but the following have stood out to me:

Redundancy: Third-party systems will eventually go down. Even your own system will eventually go down. Apple Pay was down for over an hour a couple of days ago. Instagram is down every once in a while. The same goes for Twitter, Google, and all the other software companies with the best engineers. This is why you want to build redundancy into your systems. If a service is down, you want your system to try another service that can carry out the same action. What redundancy looks like depends on the situation, but the most important part is that you have a backup. This backup should ideally kick in automatically, such as through an error-handling mechanism or a circuit breaker; failing that, it can be switched on manually.
Timeouts: You want your system to time out. If an action will not be carried out, you want to communicate that. And sometimes, an action that takes too long is no different from a failed action. Your timeouts should ideally be set at a minute, although you could shorten or lengthen this based on the specific circumstances. Some factors to consider in determining the appropriate timeout include customer experience, which action is taking too long and what that means, and the alternative means your system can use to continue the action once it becomes aware that a path has failed or is taking too long.
Retries: Your system should support retries. If your system crashes while there are pending transactions, retries are likely how you will resolve those transactions: either by moving them to a success state, if you can, or by moving them to a failed state. Even if your system has not failed, retries can be a great way to pick up one-off transactions that couldn’t be processed. You want your retries to be automatic, through a queue or a cron job. In the absence of this, they can be manual, with someone familiar with the system monitoring it closely and triggering the retry when needed. Retries also mean your systems have to be idempotent. “Idempotency is a property of certain operations or API requests, which guarantees that performing the operation multiple times will yield the same result as if it was executed only once.” In essence, if your system tries to carry out the same operation 10 times, the operation should only take effect once. This ensures that even when a retry is wrongly triggered, your system simply does nothing.
Manual operations: Sometimes, you will need to carry out support operations manually. These include running DML and DDL statements directly against the database. Sometimes, you’ve not automated certain actions, or maybe you’ve discovered a new bug. Manual operations are how you correct these.
Monitoring: You need to monitor your systems if you have hopes of resilience. Monitoring is how you know when something is wrong and how you investigate it. Monitoring includes logs, dashboards, and maybe notifications when something is wrong. Your logs should be searchable by various parameters. Your dashboards should be easy to understand at a glance. Your notifications should be actionable by the average engineer working on the system. Monitoring looks different from team to team. You can read on observability at Paystack to see what monitoring in action looks like.
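
To make the redundancy point concrete, here is a minimal sketch in Python of a backup that kicks in automatically through a small circuit breaker. The names CircuitBreaker and charge_with_fallback, the thresholds, and the idea of a "primary" and "backup" provider are all assumptions made for illustration; a real system would likely lean on an existing library and its own provider clients.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops calling a failing provider for a cool-down period (illustrative)."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures  # failures before the circuit opens
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allows_call(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial call through again.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def charge_with_fallback(amount, primary, backup, breaker):
    """Try the primary provider; use the backup if it fails or its circuit is open."""
    if breaker.allows_call():
        try:
            result = primary(amount)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    return backup(amount)
```

Once the primary has failed max_failures times in a row, requests go straight to the backup until the cool-down passes, which avoids piling more traffic onto a service that is already struggling.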
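
The timeout point can be sketched in a similar way. This is one possible approach using Python's standard concurrent.futures module, not the only one; call_with_timeout and slow_action are hypothetical names, and the 60-second default simply mirrors the one-minute suggestion above.

```python
import concurrent.futures
import time

DEFAULT_TIMEOUT_SECONDS = 60.0  # the one-minute default suggested in the text

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(action, timeout=DEFAULT_TIMEOUT_SECONDS):
    """Run action() and return ("ok", result) or ("timed_out", None)."""
    future = _executor.submit(action)
    try:
        return ("ok", future.result(timeout=timeout))
    except concurrent.futures.TimeoutError:
        # The caller stops waiting here; the worker thread is not killed.
        future.cancel()  # only has an effect if the action has not started yet
        return ("timed_out", None)


def slow_action():
    """Hypothetical stand-in for a dependency that responds too slowly."""
    time.sleep(0.5)
    return "done"
```

Calling call_with_timeout(slow_action, timeout=0.05) gives the caller a "timed_out" status it can communicate or act on, for example by switching to an alternative path. Note that the slow call keeps running in the background, so the underlying client should also have its own timeout where possible.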
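
Retries and idempotency go together, and a small sketch may help show why. Everything named here (TransientError, PaymentLedger, apply_once, retry, and the idempotency key) is an assumption for illustration, with an in-memory dictionary standing in for a real datastore.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


class PaymentLedger:
    """In-memory stand-in for a datastore that applies each operation once."""

    def __init__(self):
        self._results = {}

    def apply_once(self, idempotency_key, operation):
        # A repeated key returns the stored result instead of running again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = operation()
        self._results[idempotency_key] = result
        return result


def retry(action, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call action() up to `attempts` times with exponential backoff."""
    for attempt in range(attempts):
        try:
            return action()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: the caller can mark the transaction failed
            sleep(base_delay * 2 ** attempt)
```

If a credit is wrapped in ledger.apply_once("txn-1", ...) and the surrounding call is retried because a response was lost, the credit still takes effect only once, and later attempts get the stored result back.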
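
For logs that are searchable by various parameters, one common approach is to emit each log line as JSON with the relevant fields attached. The sketch below uses Python's standard logging module; JsonFormatter, build_logger, the "context" key, and the field names are assumptions for illustration, not a description of any particular team's setup.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Formats each record as one JSON object so fields can be searched."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra fields passed via extra={"context": {...}} are merged in.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name="payments", stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A call such as logger.error("charge failed", extra={"context": {"transaction_id": "txn-1", "provider": "primary"}}) then produces a line that can be filtered by transaction or provider in whatever log search tool the team uses.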

Happy resilience.