Uploaded image for project: 'Moodle'
  1. Moodle
  2. MDL-67020

The coursemodinfo cache item doesn't scale when localized due to global locking

    XMLWordPrintable

Details

    • MOODLE_36_STABLE, MOODLE_37_STABLE
    • MOODLE_401_STABLE
    • MDL-67020_master
    • Hide

      Either don't localize coursemodinfo, or have a stacked cache with a local primary store and a final store which is shared.

      Show
      Either don't localize coursemodinfo, or have a stacked cache with a local primary store and a final store which is shared.
    • Hide

      First, run test script 1 with the default cache settings. This will confirm that locking during modinfo cache rebuilds continues to work as intended by MDL-61305 when modinfocache is shared.

      Run test script 2. This uses a patch to simulate 10 nodes each with their own local cache, and an artificial 1000ms delay added to rebuilding the cache. The timings show that one node is not blocked by locks on another.

      Test script 1:

      1. As admin, go to any course page on your test system.
      2. Enrol a test user account in the course (or use an existing one).
      3. Open a second browser, go to the same course page, and log in using the test account.
      4. Make a code change (just for testing) to lib/modinfolib.php:
        1. Search for the comment: // Retrieve all information about activities and sections.
        2. Immediately above this line, add the following line: sleep(10); debugging('Actually building course cache');
      5. Using the admin browser, edit course settings. Without making a change, click Save and display
      6. Using the other (user) browser, load the course page again. (You need to do this step within ten seconds - ideally a couple of seconds after hitting save in the admin browser.)
        • EXPECTED: The admin browser takes 10 seconds then displays debugging info about rebuilding the course cache.
        • EXPECTED: The other browser finishes slightly faster - it should load immediately after the admin browser does.There should be no debugging info.
      7. In the admin browser, open a new tab on the Site administration page, and go to Plugins / Caching / Configuration.
      8. Find the coursemodinfo cache area, and click Purge next to it (if you can't see that link, try making your browser ginormously wide).
      9. Wait for this to finish.
      10. In the user browser, reload the course page.
      11. In the admin browser, go back to the course page tab and reload it.
        • EXPECTED: The user browser takes 10 seconds then displays the page, including debugging info about rebuilding the course cache.
        • EXPECTED: The admin browser finishes immediately after the user browser does, and there should be no debugging info.

      Test script 2:

      1. Remove the code change made in step 4 above.
      2. Apply the attached cachetest.patch with `patch -p1 < cachetest.patch` from dirroot.
      3. Visit /cache/admin.php, and create a new instance of the file cache store. Name it exactly "local", set the path to "/tmp/filecache" and check "Auto create directory". The name "local" is used by cachetest.patch to simulate multiple local caches.
      4. Map the coursemodinfo cache definition primary store to use the "local" cache store.
      5. Use /admin/tool/generator/maketestcourse.php to generate an "M" size course.
      6. Visit the course and make a note of the URL.
      7. Visit the "Participants" screen, select "Enrolment methods" from the top-left select list.
      8. Enable the "Guest Access" enrolment method.
      9. Under Site Administration > Users > Permissions > User policies, enable "Auto-login guests".
      10. Log out, and check that you can access the course URL you recorded.
      11. Install the ab benchmarking tool (part of apache2-utils package on Ubuntu).
        • sudo apt-get install apache2-utils
      12. Visit /tmp/filecache on your server, and delete all the files and folders within. This has the effect of resetting all the instances of the coursemodinfo cache.
      13. Run the following command, where COURSE-URL is the url of course from step 6:
        ab -n 30 -c 10 http://COURSE-URL
        This will make 30 requests, 10 at a time. This means each cache will need to be rebuilt, so we can test that each rebuild does not block the others.
        EXPECTED: You will see a list of percentiles within about 1000ms of one another, as all caches can rebuild simultaneously.
        BEFORE FIX: The percentiles range from about 3000ms at 50% to about 12000ms at 100%, as each concurrent run has to wait for previous ones to finish, adding about 1000ms for each previous run.
      14. If you wish to repeat step 13, also repeat step 12 to clear the caches first.
      Show
      First, run test script 1 with the default cache settings. This will confirm that locking during modinfo cache rebuilds continues to work as intended by MDL-61305 when modinfocache is shared. Run test script 2. This uses a patch to simulate 10 nodes each with their own local cache, and an artificial 1000ms delay added to rebuilding the cache. The timings show that one node is not blocked by locks on another. Test script 1: As admin, go to any course page on your test system. Enrol a test user account in the course (or use an existing one). Open a second browser, go to the same course page, and log in using the test account. Make a code change (just for testing) to lib/modinfolib.php: Search for the comment: // Retrieve all information about activities and sections. Immediately above this line, add the following line: sleep(10); debugging('Actually building course cache'); Using the admin browser, edit course settings. Without making a change, click Save and display Using the other (user) browser, load the course page again. (You need to do this step within ten seconds - ideally a couple of seconds after hitting save in the admin browser.) EXPECTED: The admin browser takes 10 seconds then displays debugging info about rebuilding the course cache. EXPECTED: The other browser finishes slightly faster - it should load immediately after the admin browser does.There should be no debugging info. In the admin browser, open a new tab on the Site administration page, and go to Plugins / Caching / Configuration . Find the coursemodinfo cache area, and click Purge next to it (if you can't see that link, try making your browser ginormously wide). Wait for this to finish. In the user browser, reload the course page. In the admin browser, go back to the course page tab and reload it. EXPECTED: The user browser takes 10 seconds then displays the page, including debugging info about rebuilding the course cache. EXPECTED: The admin browser finishes immediately after the user browser does, and there should be no debugging info. Test script 2: Remove the code change made in step 4 above. Apply the attached cachetest.patch with `patch -p1 < cachetest.patch` from dirroot. Visit /cache/admin.php, and create a new instance of the file cache store. Name it exactly "local", set the path to "/tmp/filecache" and check "Auto create directory". The name "local" is used by cachetest.patch to simulate multiple local caches. Map the coursemodinfo cache definition primary store to use the "local" cache store. Use /admin/tool/generator/maketestcourse.php to generate an "M" size course. Visit the course and make a note of the URL. Visit the "Participants" screen, select "Enrolment methods" from the top-left select list. Enable the "Guest Access" enrolment method. Under Site Administration > Users > Permissions > User policies, enable "Auto-login guests". Log out, and check that you can access the course URL you recorded. Install the ab benchmarking tool (part of apache2-utils package on Ubuntu). sudo apt-get install apache2-utils Visit /tmp/filecache on your server, and delete all the files and folders within. This has the effect of resetting all the instances of the coursemodinfo cache. Run the following command, where COURSE-URL is the url of course from step 6: ab -n 30 -c 10 http://COURSE-URL This will make 30 requests, 10 at a time. This means each cache will need to be rebuilt, so we can test that each rebuild does not block the others. EXPECTED: You will see a list of percentiles within about 1000ms of one another, as all caches can rebuild simultaneously. BEFORE FIX: The percentiles range from about 3000ms at 50% to about 12000ms at 100%, as each concurrent run has to wait for previous ones to finish, adding about 1000ms for each previous run. If you wish to repeat step 13, also repeat step 12 to clear the caches first.

    Description

      Lets say you have:

      • a big site with 20 front ends running
      • coursemodinfo is cached on a localized store like local disk
      • Lets also say that rebuilding the coursemodinfo take 5 seconds.

      None of those numbers are unreasonable numbers on a very large site under high load. Lets assume that if the cache is warm then loading it is instant.

      First some thought experiments for comparison:

      • If caching is off: then loading a course page should take 5 seconds in the context of this cache.
      • If caching is on with a shared cache, then the first one is slow at 5 seconds, all other concurrent ones are blocked and also take 5 seconds, then all subsequent ones are near instant.

      Ok so now for the failure scenario:

      1. we purge this case for some reason
      2. we have say 20 concurrent requests coming in, and the load balancer has split them perfectly across the 20 front ends
      3. they all try to grab a global lock and they will each wait 60 seconds for it:
        https://github.com/moodle/moodle/blob/master/lib/modinfolib.php#L456 
        https://github.com/moodle/moodle/blob/master/lib/modinfolib.php#L61
      4. Only one of them will get it, it will then build the cache which takes 5 seconds, and then save it to local cache
      5. The second one at random will now get the lock, it's already waited 5 seconds but after waiting it's cache it still empty so it rebuilds it which takes another 5 seconds.
      6. The third has waited 10 seconds and adds another 5
      7. The fourth has waited 15 seconds and adds another 5
      8. Keep going, after the 12th request which has now taken 60 seconds they will start getting lock timeouts and all concurrent requests after that will fail.
      9. New requests coming in will sometimes work and sometimes fail depending on what frontend they hit, eventually all the caches will be warm and things will be good.
      10. But if autoscaling is on the load balancer can keep trying to add nodes as it can see high cpu load and things failing but adding more front ends just makes it worse until everything is warm and then we've ended up with more front ends that we need.

      So the root cause fundamentally is that the 'scope' of the cache stores is local, but the scope of the locks are global. When we tested this with a shared cache store it worked fine. We also tested it with a local primary store and a shared final store and this worked fine too.

      Proposal:

      1. The coursemodinfo locking should be moved from using the Lock API to the Cache Lock API and declare requirelockingwrite = true.
      2. We should also ship in core a Cache Lock type which can do local locking. The file lock is the obvious candidate and all it needs is to be configurable so you can point an instance at some local disk in the same way as a file store instance can be. Locking files on local disk is very fast and reliable.
      3. Bonus points: Because we will now ship at least one cache item which has requirelockingwrite as true, and because file locking on shared disk is horrible for most file systems at scale, I think core should also have a cache lock type which leverages the normal Lock API so have a correct lock implementation for localizable stores which are not configured to be local.
      4. Also if requirelockingwrite is false, which it currently is for all core cache items, I don't think you should be given the option to configure a lock instance as it makes no sense. Also if your cache item does implement cache_loader_with_locking but requirelockingwrite is false then that should throw a coding_exception, either at boot time or if you attempt to call aquire_lock()

      I think the first in that list will be this tracker and I'll split the others into new trackers.

       

      Attachments

        1. cachetest.patch
          1 kB
        2. MDL-67020_result.png
          MDL-67020_result.png
          37 kB
        3. MDL-67020_script 1_1.webm
          829 kB
        4. MDL-67020_script 1_2.webm
          463 kB

        Issue Links

          Activity

            People

              marxjohnson Mark Johnson
              brendanheywood Brendan Heywood
              Brendan Heywood Brendan Heywood
              Victor Déniz Falcón Victor Déniz Falcón
              Angelia Dela Cruz Angelia Dela Cruz
              Votes:
              11 Vote for this issue
              Watchers:
              29 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0 minutes
                  0m
                  Logged:
                  Time Spent - 1 day, 32 minutes
                  1d 32m

                  Clockify

                    Error rendering 'clockify-timesheets-time-tracking-reports:timer-sidebar'. Please contact your Jira administrators.