Affects Version/s: 3.5.3
Fix Version/s: None
Every now and then we have to purge caches on our production server to flush out something. We try not to purge caches during business hours but will when it's absolutely necessary, such as an emergency fix. The other day one of our admins purged the caches in an attempt to try and fix an audio issue for students writing tests in our Test Centre (not sure what the audio issue was, that's an issue for another day). When he purged the caches (using "Purge All Caches" in Site Administration) everyone got slammed with these mustache "Failed to write cache file" error's. I was notified and hit the page and saw the same error. I then purged our Memcached, Moodle Cache and OPCache. Hit refresh a few times, still saw the error and after about the 5th refresh it went away.
I can now replicate this issue.
We typically have an average load of 200 up to 500+ users. So I ran a load test on our test stack, which is identical to our prod stack, I was simulating 200 users that login via CAS, write 1 of 3 quizzes, post to a forum and view random files. This test was done with JMeter for 10 minutes and within that time I had hit "purge all caches" once with the UI button and then twice via CLI, 2-3 minutes apart. I had added various extra error_log statements so I could try to track the order of events. When I purged the Moodle caches, while running the test some interesting things were revealed.
Here's my setup:
- Moodle Version 3.5.3
- Load balanced to 2 Debian 9.6 VM's
- PHP 7.0.33-0+deb9u1 fpm-fcgi Linux with Opcache
- Both head ends point to one Dell Compellent NFS file system
- Memcached in place using Twemproxy pointing to a Couchbase server with multiple buckets.
The extra logging was added to the following functions:
What seems to happen here is as soon as I hit "Purge All Caches" the get_mustache() function immediately creates a new cache directory based on the $themerev number changing. My initial $themerev is 1549486915 and it then creates a new folder, 1549556523. In that same function immediately right after the new folder is created it then tries to perform a cleanup and delete all older folders, in this case the 1549486915 folder. With other users hitting templated pages the log's show in the FileSystemCache.php -> writeFile() function that files are now being created in both folders and it continues back and forth several times, creating and deleting. Eventually the 1549486915 get's purged and both app server's show the Error:
"Default exception handler: Exception - Failed to write cache file"
This is all before the purge_all_caches() function is finally triggered which then deletes everything......as it should.....followed by one more:
"Default exception handler: Exception - Failed to write cache file" error.
Then I see my log message: moodlelib.php -> purge_all_caches() -> localcachedir has been created, all mustache cache files are now being created in the new 1549556523 folder.
Right now I'm thinking is it critical to check and remove any old localcache/mustache/[$themerev] folder for every single mustache request or as soon as a new $themerev folder is created? During those few seconds users may be hitting the new or old localcache/mustache/[themerev] folder......so leave it there. Obviously in this case the purge_all_caches() function obliterated all folders, including the newly created 1549556523 folder (which was already in the process of populating the cached files). There are various ways to tackle this, maybe have a scheduled task to remove the old localcache/mustache/[themerev] folders every night or x days or whatever.