Moodle / MDL-67486

Minimize how long we hold the global cron lock for


    • MOODLE_37_STABLE, MOODLE_38_STABLE, MOODLE_39_STABLE
    • MOODLE_37_STABLE, MOODLE_38_STABLE
    • MDL-67486-swap-cron-lock

      This is a performance issue, and the numbers are relative to the box you are on. In particular, performance is extremely sensitive to the latency of the lock implementation you are using, so repeat this before and after the patch to get relative numbers. The higher the lock latency, the earlier it 'chokes': processes hit the 10 second cron lock timeout and die.

       

      1) Install testtasks to make testing easier:

      https://github.com/catalyst/moodle-tool_testtasks

      2) Install lockstats to make testing easier:

      https://github.com/catalyst/moodle-tool_lockstats

      Add this to config.php

      $CFG->lock_factory = '\tool_lockstats\proxy_lock_factory';

      3) Allow cron to scale right up, at least in theory:

      $CFG->task_adhoc_concurrency_limit = 1000;
      $CFG->task_scheduled_concurrency_limit = 1000;
      $CFG->task_adhoc_max_runtime = 30;

      4) In terminal 1, watch for open locks (the list should be empty):

      watch -n 1 php admin/tool/lockstats/cli/list_locks.php

      5) In terminal 2 queue up a ton of adhoc tasks:

      php admin/tool/testtasks/cli/queue_adhoc_tasks.php -d=1 -n=2000 

      9) Spawn 10 runners, then let them run until the 30 second max runtime:

      for i in {1..10}; do php admin/tool/task/cli/adhoc_task.php --execute & done

       
      10) Then kill all the background processes:

      php admin/cli/cron.php --stop

      11) Now see how many tasks are left in the queue (tasks processed = 2000 minus this count):

      $ select count(*) from mdl_task_adhoc;
       count 
      -------
        1838
      (1 row)
      

      12) This should really also be repeated using the main admin/cli/cron.php, but the impact there is much lower and it is also harder to test.
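The queue, spawn, stop and count steps above can be wrapped in one function so that repeated before/after runs are less error-prone. This is a minimal sketch and not part of the patch. It assumes the following:

- It runs from the Moodle root.
- The ad hoc queue starts empty.
- The site uses PostgreSQL with the default mdl_ prefix, matching the psql-style output in step 11.
- MOODLE_DB is a hypothetical environment variable naming the database.

The name run_adhoc_benchmark is also illustrative.

```bash
#!/usr/bin/env bash
# Sketch only: automates the queue / spawn / stop / count steps for one runner count.
# Assumptions (not from the issue): run from the Moodle root, empty ad hoc queue,
# PostgreSQL with the default mdl_ prefix, and MOODLE_DB (hypothetical) naming the DB.
run_adhoc_benchmark() {
  local runners=${1:-10} seconds=${2:-30} queued=${3:-2000}
  local i remaining

  # Queue the test tasks exactly as in step 5.
  php admin/tool/testtasks/cli/queue_adhoc_tasks.php -d=1 -n="$queued"

  # Start the runners in the background, as in step 9.
  for ((i = 0; i < runners; i++)); do
    php admin/tool/task/cli/adhoc_task.php --execute &
  done

  # seconds should match $CFG->task_adhoc_max_runtime; then stop anything left (step 10).
  sleep "$seconds"
  php admin/cli/cron.php --stop
  wait  # wait for the background runners to exit

  # Count what is left in the queue (psql: -A unaligned, -t tuples only).
  remaining=$(psql -At -d "$MOODLE_DB" -c 'select count(*) from mdl_task_adhoc;')
  echo "runners=$runners seconds=$seconds processed=$((queued - remaining))"
}

# Example: run_adhoc_benchmark 15 30 2000
```

Running it once per runner count, before and after the patch, gives the rows of the results table, subject to the same machine and lock-latency caveats.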

      On my box, with 10 runners for 30 seconds, it processed 162 tasks (2000 queued, 1838 left). In theory it should max out at 30 x 10 = 300 tasks, since each test task takes about a second. After this patch it was left with 1740 tasks, so it processed 260. That is about 1.6 times as many and much closer to the theoretical maximum.
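The arithmetic in the paragraph above can be written out as a small helper. It assumes each test task takes about one second, as the 30 x 10 = 300 estimate implies. The names adhoc_throughput and task_seconds are hypothetical.

```bash
#!/usr/bin/env bash
# Worked version of the throughput arithmetic for one run.
# Assumption: each test task takes about task_seconds (default 1 s, as with -d=1).
adhoc_throughput() {
  local queued=$1 remaining=$2 runners=$3 seconds=$4 task_seconds=${5:-1}
  local processed=$((queued - remaining))
  # Upper bound: every runner busy for the whole window, tasks back to back.
  local max=$((runners * seconds / task_seconds))
  # Integer percentage of the theoretical max (truncated, not rounded).
  echo "processed=$processed max=$max pct=$((100 * processed / max))"
}
```

Before the patch, `adhoc_throughput 2000 1838 10 30` prints `processed=162 max=300 pct=54`. After it, `adhoc_throughput 2000 1740 10 30` prints `processed=260 max=300 pct=86`.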

       

      Runners  Seconds  Tasks before  Tasks after  Theoretical max
      10       30       162           260          300
      15       30       120           363          450
      20       30       127           523          600
      25       30       120           559          750
      30       30       112 (choked)  649          900
      40       30       118 (choked)  784          1200
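Expressing each row of the table as a percentage of the theoretical max makes the trend easier to see. Before the patch, efficiency falls steeply as runners are added; after it, the drop is much gentler. The sketch below uses only the numbers from the table, with the "choked" markers dropped. The name efficiency_table is illustrative.

```bash
#!/usr/bin/env bash
# Efficiency per row: tasks processed as a percentage of the theoretical max,
# before and after the patch, plus the after/before speedup.
efficiency_table() {
  # Input columns: runners, tasks before, tasks after, theoretical max.
  awk '
    BEGIN { printf "%-8s %-10s %-10s %s\n", "Runners", "Before %", "After %", "Speedup" }
    {
      pct_before = 100 * $2 / $4
      pct_after = 100 * $3 / $4
      printf "%-8d %-10.0f %-10.0f %.1fx\n", $1, pct_before, pct_after, $3 / $2
    }
  ' <<'EOF'
10 162 260 300
15 120 363 450
20 127 523 600
25 120 559 750
30 112 649 900
40 118 784 1200
EOF
}
```

For 10 runners it reports 54% before and 87% after (1.6x). For 40 runners it reports 10% before and 65% after (6.6x).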

       

       

       


      This issue has evolved, and the original description was a bit of a red herring. The symptom was an emergent property of a natural ceiling on how far cron can scale.

       

      (old description)

      The core_cron lock is held by the task manager to guarantee that each scheduled task, and each ad hoc task, is allocated to only one running cron process. In a very highly scaled environment, e.g. with 30 cron processes, every process must wait for the global core_cron lock, which times out after 10 seconds. There can be a lot of contention for this lock, and once a process hits the 10 second timeout it exits, which lowers the overall throughput. It can also cause an emergent behavior of cascading exits, so you end up with fewer running processes than if a lower number of processes had started in the first place. A simple approach is just to increase the timeout from 10 seconds to something larger. That stops processes shutting down, but it won't increase the maximum concurrency level.
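The failure mode described above can be reproduced in miniature with a shared lock file. This is a toy model, not Moodle's lock API, and it works as follows:

- The util-linux `flock` command stands in for the core_cron lock.
- A worker holds the lock for `hold` seconds to stand in for task allocation.
- A worker exits when it cannot get the lock within `timeout` seconds, the way a cron process gives up after 10 seconds.

The name simulate_global_lock is hypothetical.

```bash
#!/usr/bin/env bash
# Toy model (not Moodle code): every worker must take one shared lock to
# "allocate" a task, and gives up if it cannot get it within $timeout seconds.
# Requires the util-linux flock command.
simulate_global_lock() {
  local workers=$1 iterations=$2 hold=$3 timeout=$4
  local dir w
  dir=$(mktemp -d)
  : > "$dir/timeouts"

  worker() {
    local i
    for ((i = 0; i < iterations; i++)); do
      # Hold the lock for $hold seconds to stand in for allocation;
      # flock -w fails (non-zero exit) if the lock is not acquired in time.
      if ! flock -w "$timeout" "$dir/core_cron.lock" sleep "$hold"; then
        echo timeout >> "$dir/timeouts"
        return  # this runner exits, lowering total throughput
      fi
      # The allocated task would run here, outside the lock.
    done
  }

  for ((w = 0; w < workers; w++)); do
    worker &
  done
  wait
  echo "timeouts=$(($(wc -l < "$dir/timeouts")))"
  rm -rf "$dir"
}
```

With light contention (`simulate_global_lock 3 2 0.05 5`), no worker times out. With four workers, each holding the lock for a second and a 0.3 second timeout (`simulate_global_lock 4 1 1 0.3`), the workers that lose the race give up. Raising the timeout stops those exits. However, the single lock still serializes allocation, so it does not raise the ceiling on throughput.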

      Out of 30 processes, a typical balance might be 10 running scheduled tasks and 20 running ad hoc tasks. The task manager only needs to guarantee atomic allocation of scheduled tasks as a group and of ad hoc tasks as a group; the two groups don't need to share a lock. Splitting them gives each group a whole lock instead of a share of one. In this simple example, scheduled tasks go from a third of the lock to all of it (roughly +200% concurrency). Ad hoc tasks go from two thirds to all of it (roughly +50%).
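The estimate above can be checked with a simple proportional-share model. As a simplification, assume a lock hands out allocations at a fixed rate, shared evenly by the processes contending for it. With one shared lock, each group gets a share proportional to its size; with split locks, each group gets a whole lock. The name split_lock_gain is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Proportional-share model (an assumption): a lock grants allocations at a fixed
# rate, split evenly among the processes contending for it.
split_lock_gain() {
  local scheduled=$1 adhoc=$2
  awk -v s="$scheduled" -v a="$adhoc" 'BEGIN {
    total = s + a
    # One shared lock: a group of size n gets n/total of it.
    # Split locks: each group gets a whole lock, so the gain is total/n - 1.
    printf "scheduled: +%.0f%%\n", (total / s - 1) * 100
    printf "adhoc: +%.0f%%\n", (total / a - 1) * 100
  }'
}
```

For 10 scheduled and 20 ad hoc processes, `split_lock_gain 10 20` prints +200% for scheduled and +50% for ad hoc. The model ignores lock latency and task runtime, so treat it as a rough upper bound on the benefit of the split rather than a prediction.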

      These two resources can be split:

      https://github.com/moodle/moodle/blob/master/lib/classes/task/manager.php#L611

      https://github.com/moodle/moodle/blob/master/lib/classes/task/manager.php#L555

      Proposing to leave 'core_cron' for scheduled tasks and create a new lock resource key 'core_adhoc' for the adhoc task queue.

            Brendan Heywood
            Misha Golenkov
            Matt Porritt
            Andrew Lyons
            Anna Carissa Sadia
            Votes: 1
            Watchers: 10


                Original estimate: 0m
                Remaining estimate: 0m
                Time spent: 1d 2h 30m
