What is a cache stampede and how we solved it by writing our own gem
Cache stampede is a fancy name for a cascading failure that will occur when caching mechanisms and underlying application layers come under high load both at the same time in a very short time.
Cache stampede is not something that will be considered a problem for most of the systems out there; it is an issue that one will encounter when reaching a certain scale and specific data access patterns.
To give you an example imagine you are doing an expensive SQL query that takes 3 seconds to complete and spikes CPU usage of your database server. You want to cache that query as running multiple of those in parallel can cause database performance problems from which your database will not recover bringing your entire app down.
Let’s say you have a traffic of 100 requests a second. With our 3-second query example, if you get 100 requests in one second and your cache is cold, you’ll end up with 300 processes all running the same uncached query.
That becomes a real problem with high load systems that will receive hundreds or even thousands requests a second to perform an initial population of the cache (or re-populating it) causing massive amounts of simultaneous writes to cache and a massive amount of traffic going to underlying systems. This has the potential to cause major problems for your infrastructure.
Avoiding Cache Stampede
In general, there are three approaches to avoid cache stampede:
- external recomputation,
- probabilistic early expiration,
This involves keeping an external process or a system alive that will re-compute your values and re-populate cache either on predefined intervals or when a cache miss happens.
The first issue is working with a cold cache. As I mentioned above, a flood of cache misses in a short period of time has the potential to run many expensive operations in parallel.
Continually re-populating cache also comes with the drawback that if there is no uniform access pattern to your data you will end up with cache bloated with useless information.
Probabilistic early expiration
When an expiration of cache nears, one of the processes (or actors) in your system will volunteer to re-calculate its value and to re-populate cache with a new value.
Rails has this implemented as
:race_condition_ttl setting on the cache itself.
For this approach to be effective you have to have at least one request per TTL; else your cache will go cold pushing you back to square one and facing a stampede.
If you are not setting some sort of lock to re-calculate cache while using probabilistic early expiration then you will end up with multiple processes hammering your cache and underlying systems computing and re-writing the same value.
This one is the simplest but also the most versatile out of the three. As soon as a process (or an actor) encounters a cache miss, it will set a global lock on re-computing the requested value. This will prevent other processes from attempting the same recompute. Once the first request computes and caches this value, it will write to the cache, and release the lock.
It will be up to you to decide what to do when another process encounters a cache miss and there is a global lock on a requested key. You can make it wait, make it return nil, raise an error, etc.
Locking has other benefits as well. You can keep your cache TTLs low and not have to worry about storing or serving stale data, in an effort to avoid a cache stampede.
You can also combine locking to avoid issues with cold cache while doing external recomputation of probabilistic expiration.
What were we trying to solve
Our problem was rather simple to define: we are facing high volume access to random data in random intervals with no pre-defined access patterns or no way to predict which data should be cached.
When stampede is happening it is also happening all at once: there is no ramp-up of volume or no signs indicating that a stampede will happen.
Once over, there is no value in keeping data in cache since we can not predict when data will be accessed next time - it can happen within next the 5 minutes or never.
To summarize: we don’t know when to pre-populate cache, we don’t know when to reliably say that we aren’t going to need cached data anymore and we don’t know which data are we going to need; we simply can not put our entire database into the cache.
To solve this we wrote our gem - stockpile_cache.
The first question that comes to mind is - why not reuse Rails cache and just use
:race_condition_ttl setting for probabilistic expiration? That solves cache stampede, right? Well, not quiet. The problem is that we need to populate (or re-populate) cache reliably without bringing our systems down.
:race_condition_ttl will not guarantee that when you are doing cold cache population or re-computation of your cached value there will be no race conditions or cache stampede as it is not setting a lock of any sort between expiration checks and writes to caching backend (Redis in our case) allowing multiple parallel executions of the same request to happen.
Another criterion is that we want this system to be able to work with multiple databases as we do want to separate our low traffic cache that is used to cache parts of views from high traffic cache that is a subject to cache stampede so they do not affect each other and can be scaled independently.
Finally, it would be nice if the caching solution would not be dependant on any specific framework or system and will just work in any Ruby environment.
How stockpile_cache solve that
Separation of caching backends is done naively now; we use stockpile_cache with its instance of Redis database for high volume caching and we use Rails’ cache for everything else for now.
For protection against cache stampede itself, we’ve gone with locking. When a request for a key returns a miss, stockpile sets a global lock for that key and computes its value. Once the value is computed it will be written to the cache and lock will be released. There is also a safeguard in place that will make sure that all locks will decay over time allowing other processes to try and re-populate cached value for a given key (this value is configurable).
If another request comes in and encounters a cache miss with a global lock for a key it will wait for 2 seconds (this value is configurable as well) trying to fetch a value from cache and then will bypass cache completely retrieving the value from underlying applications and will return a response.
So how are things looking?
We’ve been running stockpile in production with no issues. Does this make it production-ready? It does for us!
Currently, there are two items that we want to add to stockpile that we know of:
- first-class support for multiple Redis servers
- optional compression of cached value configurable per database
Once those changes are in place we will be able to move away from Rails’s cache completely and start using stockpile for caching exclusively.
Want to give it a try? Go ahead; we will be excited if you do! Or even if you want to help us (which would be awesome) just open a PR!