Let's say you have:
- a big site with 20 front ends running
- coursemodinfo cached on a localized store like local disk
- rebuilding coursemodinfo takes 5 seconds
None of these numbers is unreasonable on a very large site under high load. Let's also assume that if the cache is warm then loading it is effectively instant.
First, some thought experiments for comparison:
- If caching is off, then every course page load takes 5 seconds (in the context of this cache alone).
- If caching is on with a shared cache, then the first request is slow at 5 seconds, all other concurrent requests block behind it and also take roughly 5 seconds, and all subsequent requests are near instant.
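For concreteness, the two baselines above can be written out as a tiny sketch (all numbers are the assumptions stated earlier; the variable names are just illustrative):

```python
REBUILD = 5    # seconds to rebuild coursemodinfo (assumption above)
REQUESTS = 20  # concurrent course page loads

# Caching off: every single request pays the full rebuild cost.
caching_off = [REBUILD] * REQUESTS

# Shared cache with locking: the first request rebuilds, the other
# concurrent requests block behind it for roughly the same time,
# then every subsequent request is a warm hit.
first_wave = [REBUILD] * REQUESTS
subsequent = 0  # near instant once the shared cache is warm

print(sum(caching_off), max(first_wave), subsequent)
```

The key property of the shared-cache case is that the 5-second cost is paid once in total, not once per node.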
OK, so now for the failure scenario:
- we purge this cache for some reason
- say 20 concurrent requests come in, and the load balancer has split them perfectly across the 20 front ends
- they all try to grab a global lock, and each will wait up to 60 seconds for it:
- Only one of them will get it; it builds the cache, which takes 5 seconds, and then saves it to its local cache.
- A second one, at random, now gets the lock. It has already waited 5 seconds, but its own local cache is still empty, so it rebuilds, which takes another 5 seconds.
- The third has waited 10 seconds and adds another 5.
- The fourth has waited 15 seconds and adds another 5.
- Keep going: after the 12th request, which has now taken 60 seconds in total, requests start hitting lock timeouts, and all concurrent requests after that fail.
- New requests coming in will sometimes work and sometimes fail depending on which front end they hit; eventually all the local caches will be warm and things will be good again.
- But if autoscaling is on, the load balancer can keep adding nodes because it sees high CPU load and requests failing. Each new front end starts with a cold cache, so adding more just makes it worse until everything is warm, and then we've ended up with more front ends than we need.
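The arithmetic above can be checked with a quick simulation (the 5-second rebuild and 60-second lock timeout are the assumptions from this scenario; each front end has its own empty local cache, so every lock holder pays the rebuild cost again):

```python
# Simulate 20 concurrent requests on 20 front ends, all contending
# for one *global* lock while each rebuilds its own *local* cache.

REBUILD = 5        # seconds to rebuild coursemodinfo
LOCK_TIMEOUT = 60  # seconds a request will wait for the lock
FRONT_ENDS = 20

results = []
for i in range(FRONT_ENDS):
    waited = i * REBUILD  # time spent queued behind earlier lock holders
    if waited >= LOCK_TIMEOUT:
        results.append((i + 1, 'lock timeout'))
    else:
        # Got the lock, but this node's cache is still empty,
        # so it pays the full rebuild cost again.
        results.append((i + 1, waited + REBUILD))

succeeded = [r for r in results if r[1] != 'lock timeout']
failed = [r for r in results if r[1] == 'lock timeout']
print(len(succeeded), 'served, slowest took', succeeded[-1][1], 's;',
      len(failed), 'timed out')
```

This reproduces the numbers above: 12 requests are served, the slowest taking the full 60 seconds, and the remaining 8 hit the lock timeout and fail.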
So the root cause, fundamentally, is that the 'scope' of the cache stores is local, but the scope of the locks is global. When we tested this with a shared cache store it worked fine. We also tested it with a local primary store and a shared final store, and this worked fine too.
- The coursemodinfo locking should be moved from the Lock API to the Cache Lock API, and it should declare requirelockingwrite = true.
- We should also ship in core a cache lock type which can do local locking. The file lock is the obvious candidate; all it needs is to be configurable so you can point an instance at some local disk, in the same way as a file store instance can be. Locking files on local disk is very fast and reliable.
- Bonus points: because we will now ship at least one cache item which has requirelockingwrite as true, and because file locking on shared disk is horrible on most file systems at scale, I think core should also have a cache lock type which leverages the normal Lock API, so there is a correct lock implementation for localizable stores which are not configured to be local.
- Also, if requirelockingwrite is false, which it currently is for all core cache items, I don't think you should be given the option to configure a lock instance, as it makes no sense. And if your cache item does implement cache_loader_with_locking but requirelockingwrite is false, that should throw a coding_exception, either at boot time or when you attempt to call acquire_lock().
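To illustrate why the local file lock suggested above is cheap and correct, here is a minimal sketch of node-local file locking using POSIX flock(), the mechanism such a lock type could sit on. This is not Moodle code; the `LocalFileLock` class and its method names are purely illustrative, and it assumes a Linux/POSIX host:

```python
import fcntl
import os
import tempfile

class LocalFileLock:
    """Exclusive advisory lock backed by a file on local disk (illustrative)."""

    def __init__(self, path):
        # 'a+' creates the lock file if it doesn't exist yet.
        self.fh = open(path, 'a+')

    def acquire(self, blocking=True):
        flags = fcntl.LOCK_EX | (0 if blocking else fcntl.LOCK_NB)
        try:
            fcntl.flock(self.fh, flags)
            return True
        except BlockingIOError:
            # Another holder has the lock and we asked not to wait.
            return False

    def release(self):
        fcntl.flock(self.fh, fcntl.LOCK_UN)

lockfile = os.path.join(tempfile.gettempdir(), 'coursemodinfo.lock')
a = LocalFileLock(lockfile)
b = LocalFileLock(lockfile)

got_a = a.acquire()                    # first holder gets the lock
contended = b.acquire(blocking=False)  # second attempt is refused
a.release()
got_b = b.acquire(blocking=False)      # succeeds once released
b.release()
print(got_a, contended, got_b)
```

The point is that the lock's scope matches the store's scope: the lock file lives on the same local disk as the cache data, so contention only exists between processes on the same node, and acquiring it is a single fast syscall.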
I think the first item in that list will be this tracker, and I'll split the others out into new trackers.