  1. Moodle
  2. MDL-67483

Improve adhoc tasks quality of service at very high scale

Details

    • MOODLE_39_STABLE
    • MOODLE_39_STABLE
    • MDL-67483-qos-perf

      0) Install these in admin tools to make the testing easier:

      https://github.com/catalyst/moodle-tool_testtasks

      https://github.com/catalyst/moodle-tool_lockstats

       

      Testing scenario 1. Test the case with a single task runner

      1) Queue up 1000 one-second tasks:

      php admin/tool/testtasks/cli/queue_adhoc_tasks.php -d=1 -n=1000
      

      2) Behind this, queue up a single 'another' type of adhoc task:

      php admin/tool/testtasks/cli/queue_adhoc_tasks.php -d=1 -n=1 --class='tool_testtasks\task\another_timed_adhoc_task'

      3) Peek into the task queue by class:

      $ select count(*),classname from mdl_task_adhoc group by classname;
       count | classname 
      -------+-----------------------------------------------
       1000 | \tool_testtasks\task\timed_adhoc_task
       1 | \tool_testtasks\task\another_timed_adhoc_task
      (2 rows)

      4) Now start processing the queue:

      php admin/tool/task/cli/adhoc_task.php --execute 
      

      5) You should see that the first two tasks it picked up cycled through each type, so the single 'another' task is already gone and very quickly you are left with fewer than 1000 tasks of the first type (e.g. 997):

      $ select count(*),classname from mdl_task_adhoc group by classname;
       count | classname 
      -------+---------------------------------------
       997 | \tool_testtasks\task\timed_adhoc_task
      (1 row)

       

      Testing scenario 2. A bunch of runners in parallel

      1) First, allow this to work in config.php, and also set up the lock stats tool, which we'll use later:

      $CFG->task_adhoc_concurrency_limit = 1000;
      $CFG->lock_factory = '\tool_lockstats\proxy_lock_factory';
      $CFG->proxied_lock_factory = "auto";
      

      2) Queue up 22000 of type A and then another 22000 of type B behind it:

      $ php admin/tool/testtasks/cli/clear_adhoc_task_queue.php
      $ php admin/tool/testtasks/cli/queue_adhoc_tasks.php -d=1 -n=22000
      $ php admin/tool/testtasks/cli/queue_adhoc_tasks.php -d=1 -n=22000 --class='tool_testtasks\task\another_timed_adhoc_task' 
      

      3) Confirm the queues: 

      $ select count(*),classname from mdl_task_adhoc group by classname;
       count | classname 
      -------+-----------------------------------------------
       22000 | \tool_testtasks\task\timed_adhoc_task
       22000 | \tool_testtasks\task\another_timed_adhoc_task
      (2 rows)
      

      6) Now let's fire up several task runners. Do this, say, 4 times (you may need to open several terminals to execute them):

      php admin/tool/task/cli/adhoc_task.php --execute &
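If opening several terminals is inconvenient, the same step can be sketched as a small shell helper. This is only an illustration: the function name `start_runners` and the `RUNNER_CMD` variable are made up for this example and are not part of Moodle; the default command is the runner invocation shown above, and it assumes you run it from the Moodle root directory.

```shell
# Hypothetical helper, not part of Moodle: start N background copies of a command.
# RUNNER_CMD defaults to the adhoc task runner used in this test plan.
RUNNER_CMD="${RUNNER_CMD:-php admin/tool/task/cli/adhoc_task.php --execute}"

start_runners() {
    count="$1"
    i=0
    while [ "$i" -lt "$count" ]; do
        # Left unquoted on purpose so the command string splits into words.
        $RUNNER_CMD &
        i=$((i + 1))
    done
    echo "started $count runners"
}
```

Calling `start_runners 4` would then cover this step. The runners stay in the background, so the queue can be rechecked from the same terminal.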
      

      7) Recheck the queues and confirm they are being processed evenly (so each number in the count column is lower than in step #3):

      $ select count(*),classname from mdl_task_adhoc group by classname;
       count | classname 
      -------+-----------------------------------------------
       21944 | \tool_testtasks\task\timed_adhoc_task
       21944 | \tool_testtasks\task\another_timed_adhoc_task
      (2 rows) 
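To make "evenly" a little more concrete, a rough check is to look at the gap between the largest and smallest per-class count. The sketch below assumes the query output has the `count | classname` layout shown above and is piped in on standard input; the function name `queue_gap` is hypothetical.

```shell
# Hypothetical helper: read "count | classname" rows on stdin and print the
# difference between the largest and smallest per-class count.
# Header, separator and "(N rows)" lines are skipped because they carry no
# positive number in the first column.
queue_gap() {
    awk -F'|' '
        NF == 2 && $1 + 0 > 0 {
            n = $1 + 0
            if (min == "" || n < min) min = n
            if (n > max) max = n
        }
        END { print max - min }
    '
}
```

A gap near zero suggests both classes are draining at the same rate. Note that a class which has fully drained no longer appears in the query output, so it is not counted.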
      

       

      Testing scenario 3.  NOT optional

      8) Lastly, and most importantly, keep throwing more and more runners at it and confirm that the overall system still scales up linearly as you add processes.

      i.e. fire up a number of processes and count how many processes you have running:

      php admin/tool/task/cli/adhoc_task.php --execute &

      Then, using the lock stats tool, you can see what is running right now in the GUI:

      /admin/tool/lockstats/ 

      or from a SQL shell:

      $ select count(*) from mdl_tool_lockstats_locks;
      count
      -------
      6
      (1 row) 
      

       Ideally you want to run this until it breaks to find the total maximum practical level of concurrency the system can handle. On my local box I saw something like:

      Cron processes | Tasks being processed
      ---------------+----------------------
      5              | 5
      10             | 10
      20             | 19
      30             | 28
      35             | 32
      40             | 18

      When it hits the maximum threshold, each process may get a lock timeout and exit, so you'll see a sharp drop-off from linear back to something much smaller. This maximum concurrency is an issue with or without this patch, but we need to make sure it doesn't go backwards under similar conditions.
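One way to read the numbers above is as an efficiency ratio (tasks being processed divided by cron processes), plus the first process count at which the number of tasks in flight drops. The sketch below does that arithmetic with awk on the figures from the local run; the function names are hypothetical, and your own measurements would replace the `observations` data.

```shell
# Observed pairs from the local run above: cron processes, tasks being processed.
observations="5 5
10 10
20 19
30 28
35 32
40 18"

# Print each row with its efficiency (tasks / processes) as a percentage.
efficiency_report() {
    printf '%s\n' "$observations" |
        awk '{ printf "%d %d %.0f%%\n", $1, $2, 100 * $2 / $1 }'
}

# Print the first process count where tasks in flight fall below the previous row.
first_regression() {
    printf '%s\n' "$observations" |
        awk 'NR > 1 && $2 < prev { print $1; exit } { prev = $2 }'
}
```

With these figures, `first_regression` prints 40, the point where the run falls off from near-linear scaling. Comparing that number before and after the patch is a simple way to check that the ceiling has not moved backwards.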

       

       

       


    Description

      When I tested MDL-67363 I didn't scale it up high enough. The algorithm works, but at very high scale it scales at O(n^2) and starts to choke.

      This is a tweak to get the performance back to as close to linear as we can get.

            People

              brendanheywood Brendan Heywood
              brendanheywood Brendan Heywood
              Dmitrii Metelkin Dmitrii Metelkin
              Jake Dallimore Jake Dallimore
              Jake Dallimore Jake Dallimore
              David Woloszyn, Huong Nguyen, Jake Dallimore, Meirza, Michael Hawkins, Raquel Ortega, Safat Shahin, Stevani Andolo
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved:
                15/Jun/20

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0m
                  Logged: 4h