Caching: Candy in a Drawer

Something I’ve spent an unhealthy amount of time thinking about recently is how the number of my Twitter followers fluctuates. More because I’m curious about Twitter’s software technology, since one of my favourite technical writers worked there, than because I care about follower count. When I speak of fluctuations, I don’t mean unfollows, because sometimes the count rises without any new follows. Twitter says your follower count may fluctuate when they’ve asked users to confirm their passwords or phone numbers: those users will not be counted as part of your followers. They do this regularly to prevent spam. I highly doubt that is what I keep experiencing. Regardless, my curiosity led me down the caching rabbit hole. It also coincided with the period when I decided to learn more about system design, moving from primarily frontend web development to the backend. Caching kept coming up (along with some other concepts).

The first analogy I thought of for caching is a jacket by the door. In fact, that was the first title for this, before I began writing. But as I read up on the concept, I found a better analogy: candy in a drawer. Picture this. You often work at a desk with a drawer, and as you work, you like to have candy. To get candy, you need to buy it. So you leave your work and spend time and effort to reach the candy store. Every time you go, you risk an accident that stops you from getting candy, the store might be out of candy, or, worse, something could stop you from returning home. Doing this would be inefficient, waste time, and risk failure every time you need a piece of candy. Instead, you buy a lot of candy at once. It could be a week’s worth or a month’s worth, depending on how much candy you take, how you keep it, and how quickly it spoils. This way, you save time and effort and reduce your likelihood of failure. This is a rough conceptualisation of what caching is.

Many kinds of software use caching: browsers, servers, mobile applications. It’s something software engineers eventually have to consider once they start thinking about scaling. You can buy candy only when you need it if you take it very infrequently. However, once you start consuming a lot of it, it only makes sense to buy larger quantities and keep them around to save yourself time. Understanding caching in computer systems requires understanding how data is stored.

Most databases store data on hard drives. Drives are comparatively cheap, offer more space, and persist data even after the system reboots. However, databases used for caching (like Redis) often keep data in RAM (random-access memory). RAM is the part of your computer that fetches information quickly but also loses it easily. When you open many tabs in your browser (like the RAM-eating Chrome), they use your RAM. When you reboot your computer because it has been on for too long or has too many processes running and is now lagging, you are freeing up space in your RAM. So, while storing data in RAM lets you retrieve it faster and avoid rerunning expensive computations, the space is significantly limited, comparatively expensive, and the data does not persist (which makes it easier to lose). These are only some of the reasons caching databases can’t act as a substitute for larger databases that take longer to retrieve information.
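To make that concrete, here is a minimal sketch of what writing to and reading from a RAM-backed cache might look like. I’m assuming a Node/TypeScript backend with the ioredis client, and the key names are made up; a real setup will likely differ.

```ts
import Redis from "ioredis";

// Redis holds this data in RAM, so reads are fast, but space is limited
// and nothing here is guaranteed to survive a restart.
const redis = new Redis(); // assumes a Redis server running locally on the default port

async function cacheFollowerCount(userId: string, count: number): Promise<void> {
  // "EX", 3600 gives the entry a time-to-live of an hour; after that it simply disappears.
  await redis.set(`followers:${userId}`, String(count), "EX", 3600);
}

async function getCachedFollowerCount(userId: string): Promise<number | null> {
  const value = await redis.get(`followers:${userId}`);
  return value === null ? null : Number(value);
}
```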

Most caching databases also do not support common database operations like filtering. With a regular database, I could fetch users whose usernames include a particular set of letters, or filter by age, by gender, or by any other field I wish to use. With a caching database, there is a single unique key under which you store your value, and it is only with that key that you can get your result. The result is a single value (one user, say), instead of a list of matching users as with regular databases.
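As a rough illustration (the query and the key names are invented for the example), the difference looks something like this:

```ts
import Redis from "ioredis";

const redis = new Redis();

// A relational database lets you filter on arbitrary fields and returns a list:
//   SELECT * FROM users WHERE username LIKE '%ore%' AND age >= 18;

// A key-value cache only answers for the exact key you stored:
async function getCachedUser(id: string): Promise<unknown> {
  const raw = await redis.get(`user:${id}`); // one entry back, or nothing
  return raw === null ? null : JSON.parse(raw);
}
```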

Hopefully, you now have a broad idea of what caching is: taking data and storing it somewhere it is easier to retrieve from. It is a great aid to scaling. However, this brings us to another significant concept in computer science: cache invalidation. Some people have referred to computer science as the science of making trade-offs. Whenever you get something, you are likely sacrificing something else. A deal-with-the-devil kind of affair. And when what you’re getting is as significant as faster data retrieval, you know your sacrifice has to be major. An arm, a leg, and more kind of major, if you were truly dealing with the devil. Because scaling to the extent most big tech companies have, and processing that kind of data, would be significantly more difficult and expensive without caching. This problem is also why caching came up as a possible reason for my fluctuating Twitter follower count. Cache invalidation is significant enough that a fella named Phil Karlton said, “There are only two hard things in Computer Science: cache invalidation and naming things.” That saying now has a million and one spin-offs.

Essentially, when you cache data, you’re increasing the availability of the data and reducing how much work your servers have to do. In exchange, you take on the risk of retrieving stale data. Just as in the candy example, you take on the risk of getting spoilt candy, ant-infested candy, or candy that will make your stomach ache. Stale data is most like the candy that makes your stomach ache while looking completely benign. Cache invalidation refers to finding all of your stale data every time there is an update and replacing it with the recent data.

If the data does not exist in the cache or is obviously bad, most servers are configured to fetch the data from the primary database and store it in the cache. However, if the data has merely not been updated, there is no easy way to find out without defeating the point of caching.
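That configuration is often described as the cache-aside pattern. Here is a sketch of it, again assuming ioredis, with fetchUserFromDb standing in for whatever query your application really runs:

```ts
import Redis from "ioredis";

const redis = new Redis();

// Hypothetical primary-database lookup; imagine a slow SQL query or ORM call here.
async function fetchUserFromDb(id: string): Promise<{ id: string; name: string }> {
  return { id, name: "someone" };
}

// Cache-aside: try the cache first, fall back to the database, then refill the cache.
async function getUser(id: string): Promise<{ id: string; name: string }> {
  const cached = await redis.get(`user:${id}`);
  if (cached !== null) return JSON.parse(cached);

  const user = await fetchUserFromDb(id);
  await redis.set(`user:${id}`, JSON.stringify(user), "EX", 300); // keep it around for five minutes
  return user;
}
```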

Two solutions I have floating in my head are to update your cache every time the database is updated, and to refresh from the primary database periodically. Refreshing regularly only means your stale data will eventually stop being stale; it does not give you fresh data all the time. The first method, updating your cache every time your database is updated, can easily become very complex. If you cache aggressively and the data is used in many places, you need to remember everywhere the data is used and then rebuild it in all of those places. Wikipedia has a page on cache invalidation, and several algorithms have been developed for this; the best choice depends on your system and what is important to you. Also, some systems are fine with slight variance in data. Whether my Twitter follower count is 860 or 866 matters very little. However, if your system has to do with (sending and receiving) money, you might be more concerned about having the most recent data.
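Sketched out (ioredis again, with made-up key names and a hypothetical database write), the two ideas look something like this. Notice how the first one forces you to remember every key the data might be hiding under:

```ts
import Redis from "ioredis";

const redis = new Redis();

// Hypothetical database write; stands in for your real update query.
async function updateUsernameInDb(id: string, username: string): Promise<void> {
  // UPDATE users SET username = ... WHERE id = ...
}

// Idea 1: every time the database changes, drop (or rebuild) the cached copies.
// The hard part is knowing every place the data ended up.
async function renameUser(id: string, username: string): Promise<void> {
  await updateUsernameInDb(id, username);
  await redis.del(`user:${id}`); // the obvious key...
  await redis.del(`profile-page:${id}`); // ...and every derived one you can think of
}

// Idea 2: don't chase stale copies at all; give every entry a short time-to-live
// so stale data eventually refreshes itself (but stays stale until then).
async function cacheUser(id: string, user: object): Promise<void> {
  await redis.set(`user:${id}`, JSON.stringify(user), "EX", 60); // at most a minute out of date
}
```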

Artur Ejsmont, in Web Scalability for Startup Engineers, mentions partial caching (this is what I’m choosing to call what he described). You could cache data that is expensive to fetch or unlikely to change frequently, and fetch sensitive data from your standard database every time you need it. In some instances, this can be a solution to cache invalidation problems. In our candy analogy, you would store only the kinds of candy that are likely to last, while running to the store for easily spoilt candy when you need it.
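A sketch of that idea, with a user profile standing in for the slow-changing data and an account balance for the sensitive kind (both lookups are hypothetical stand-ins for real queries):

```ts
import Redis from "ioredis";

const redis = new Redis();

// Hypothetical queries; stand-ins for whatever your application actually runs.
async function fetchProfileFromDb(id: string): Promise<{ id: string; bio: string }> {
  return { id, bio: "rarely changes" };
}
async function fetchBalanceFromDb(id: string): Promise<number> {
  return 866;
}

// The expensive, slow-changing data goes through the cache...
async function getProfile(id: string): Promise<{ id: string; bio: string }> {
  const cached = await redis.get(`profile:${id}`);
  if (cached !== null) return JSON.parse(cached);
  const profile = await fetchProfileFromDb(id);
  await redis.set(`profile:${id}`, JSON.stringify(profile), "EX", 3600);
  return profile;
}

// ...while the sensitive data is always read fresh from the primary database.
async function getBalance(id: string): Promise<number> {
  return fetchBalanceFromDb(id);
}
```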