  1. Moodle
  2. MDL-67648

Cron task manager quality of service (version 3)

Details

    • MOODLE_401_STABLE
    • MOODLE_401_STABLE
    • MDL-67648-master

      Prerequisites

      1. Install tool_testtasks:

      git clone git@github.com:catalyst/moodle-tool_testtasks.git admin/tool/testtasks 
      php admin/cli/upgrade.php
      

      Testing

      1. Install tool_testtasks as described in the prerequisites above
      2. Queue up 1000 tasks:

        php admin/tool/testtasks/cli/queue_multiple_adhoc_tasks.php -n=1000

      3. Run

        php admin/cli/adhoc_task.php --keep-alive=1000 --execute
        

        for 30 seconds or so and ensure it processes some tasks

      4. In 3 separate terminals run

        php admin/cli/adhoc_task.php --keep-alive=1000 --execute
        

        in each of them. Ensure that a mix of tasks is being run

      5. Stop processing tasks (ctrl+c each terminal)
      6. In 4 separate terminals run

        php admin/cli/adhoc_task.php --keep-alive=1000 --execute
        

        and ensure that no more than 3 of the 4 terminals process tasks

      7. Stop processing tasks (ctrl+c each terminal)
      8. Navigate to "Site administration" > "Server" > "Task processing"
      9. Set "Ad hoc task concurrency limit" to 5
      10. In 5 separate terminals run

        php admin/cli/adhoc_task.php --keep-alive=1000 --execute
        

        and ensure a mix of each type of task is running.
        Observe the tasks being run in each terminal (just watch it for a while, no need to make sure they all complete), and verify:

        1. That there is never a situation where the 1000 second task is being run by every runner
        2. A good mix of tasks appear to be getting run (i.e., one specific task does not appear to get priority)
      11. Stop running tasks and clear the adhoc task queue:

        php admin/tool/testtasks/cli/clear_adhoc_task_queue.php
        

      12. Queue 5000x 1000 second tasks:

        php admin/tool/testtasks/cli/queue_multiple_adhoc_tasks.php -n=5000 -c="tool_testtasks\task\one_thousand_second_task"
        

        followed by 5000x 2 second tasks:

        php admin/tool/testtasks/cli/queue_multiple_adhoc_tasks.php -n=5000 -c="tool_testtasks\task\two_second_task"
        

      13. In 5 separate terminals run

        php admin/cli/adhoc_task.php --keep-alive=1000 --execute
        

        and ensure the 2 second tasks are getting processed

      14. Stop processing tasks and clear the queue:

        php admin/tool/testtasks/cli/clear_adhoc_task_queue.php
        

      15. Queue up 5000x 100 second tasks:

        php admin/tool/testtasks/cli/queue_multiple_adhoc_tasks.php -n=5000 -c="tool_testtasks\task\one_hundred_second_task"
        

      16. In 5 separate terminals run

        php admin/cli/adhoc_task.php --keep-alive=1000 --execute
        

        and ensure all the runners are running 100 second tasks

      17. While the runners are still processing tasks, in a separate terminal queue up some 2 second tasks:

        php admin/tool/testtasks/cli/queue_multiple_adhoc_tasks.php -n=1000 -c="tool_testtasks\task\two_second_task"
        

      18. Watch the other 5 terminals; when the 100 second tasks finish, ensure some 2 second tasks are able to start running
      19. With the tasks still processing, in another terminal, queue up 5, 10, and 1000 second tasks:

        php admin/tool/testtasks/cli/queue_multiple_adhoc_tasks.php -n=1000 -c="tool_testtasks\task\five_second_task,tool_testtasks\task\ten_second_task,tool_testtasks\task\one_thousand_second_task"
        

      20. Watch the other 5 terminals and ensure all types of task are being run
      21. Stop processing tasks and clear the queue:

        php admin/tool/testtasks/cli/clear_adhoc_task_queue.php
        

      22. Navigate to "Site administration" > "Server" > "Task processing"
      23. Set "Ad hoc task concurrency limit" to 10
      24. Queue up 1000x all tasks:

        php admin/tool/testtasks/cli/queue_multiple_adhoc_tasks.php -n=1000
        

      25. In 10 terminals run:

        php admin/cli/adhoc_task.php --keep-alive=1000 --execute
        

        and ensure each task is being (roughly) run in 2 terminals (it can vary a bit depending on how things go when the tasks start, but overall it should balance to each task getting 2 runners)

      26. Stop processing tasks (don't clear the queue)
      27. Set per-task concurrency limits in config.php:

        $CFG->task_concurrency_limit['tool_testtasks\task\one_thousand_second_task'] = 1;
        

        (repeat for the other 4 task types as well)

      28. In 10 terminals run:

        php admin/cli/adhoc_task.php --keep-alive=1000 --execute
        

        and ensure each task type is only run one at a time (i.e., half the runners should be doing nothing)
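
Several of the steps above ask for the same runner command in 3, 4, 5 or 10 terminals. As an optional convenience, the runners could be started from one script instead. This is a rough Python sketch and not part of tool_testtasks or Moodle: `start_runners`, `tag_lines` and `RUNNER_CMD` are hypothetical names, and it assumes it is run from the Moodle root with `php` on the PATH. Each output line is prefixed with the runner number so you can still see which runner is processing which task.

```python
import subprocess
import sys
import threading

# The runner command used throughout the testing instructions.
RUNNER_CMD = ["php", "admin/cli/adhoc_task.php", "--keep-alive=1000", "--execute"]


def tag_lines(runner_id, lines, out=sys.stdout):
    """Copy each line to `out`, prefixed with the runner number."""
    for line in lines:
        out.write(f"[runner {runner_id}] {line.rstrip()}\n")


def start_runners(count, cmd=RUNNER_CMD):
    """Start `count` copies of `cmd` and echo their tagged output.

    Returns the list of subprocess.Popen objects so the caller can
    wait on them or terminate them.
    """
    procs = []
    for runner_id in range(1, count + 1):
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        # One reader thread per runner so no pipe fills up and blocks.
        threading.Thread(
            target=tag_lines, args=(runner_id, proc.stdout), daemon=True
        ).start()
        procs.append(proc)
    return procs
```

For example, `procs = start_runners(5)` stands in for "in 5 separate terminals run", and `for p in procs: p.terminate()` stands in for ctrl+c in each terminal.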


    Description

      This is more of a placeholder to collect ideas on a more holistic and performant approach, following on from MDL-67486, MDL-67211, MDL-67483 and MDL-67363.

      Things are now much better, but at high scale with very unequally sized adhoc tasks you can still end up with some tasks hogging cron and blocking processing of other things. MDL-64610 will help a lot here, but in an ideal world the task manager would dynamically adjust the priorities of tasks based on as much info as it has and not need manual tuning by either the developer or the admin.

      Some example scenarios:

      1) A queue of very slow tasks, e.g. async backups that take 10 mins each, is followed by some small tasks like sending emails, which we generally want to be done fairly fast. Even with QoS the slow tasks end up pegging all of the available cron runners, because QoS only considers what should start next based on what is in the queue, not what is already running.

      2) You have, say, 2 or 3 types of heavy task and nothing else. We end up splitting the load evenly between them and cron is pegged on heavy tasks. There are no 'spare' runners to start on a random new type of task which comes along.
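
Scenario 1 can be illustrated with a small simulation. This is a minimal Python sketch and not Moodle code: `first_start`, its parameters, the `reserve` knob and the 1 second email duration are assumptions made for illustration. Runners interleave task types from the queue (least recently started type first), and with `reserve=0` nothing stops a single type from holding every runner.

```python
import heapq


def first_start(target, arrivals, durations, runners, reserve=0):
    """Return the time (seconds) at which the first `target` task starts.

    arrivals:  list of (arrival_time, task_type)
    durations: dict of task_type -> run time in seconds
    reserve:   runners that a single task type may never occupy
               (0 means a type can take every runner)
    """
    cap = runners - reserve          # most runners one type may hold
    pending = sorted(arrivals)
    running = []                     # min-heap of (finish_time, task_type)
    last_started = {}                # task_type -> sequence number of last start
    seq = 0
    now = 0
    while pending:
        # Release runners whose task has finished.
        while running and running[0][0] <= now:
            heapq.heappop(running)
        choice = None
        if len(running) < runners:
            # Waiting tasks, least recently started type first, then oldest.
            waiting = sorted(
                (last_started.get(t, -1), a, i, t)
                for i, (a, t) in enumerate(pending) if a <= now
            )
            for _, _, i, t in waiting:
                if sum(1 for _, rt in running if rt == t) < cap:
                    choice = (i, t)
                    break
        if choice is None:
            # Nothing can start now: jump to the next finish or arrival.
            later = [f for f, _ in running] + [a for a, _ in pending if a > now]
            if not later:
                return None
            now = min(later)
            continue
        i, t = choice
        if t == target:
            return now
        heapq.heappush(running, (now + durations[t], t))
        last_started[t] = seq
        seq += 1
        del pending[i]
    return None


# 10 backups queued at t=0, one email arriving 5 seconds later, 3 runners.
arrivals = [(0, "backup")] * 10 + [(5, "email")]
durations = {"backup": 600, "email": 1}
print(first_start("email", arrivals, durations, runners=3))             # 600
print(first_start("email", arrivals, durations, runners=3, reserve=1))  # 5
```

With every runner taken by backups the email waits the full 10 minutes for a runner to free up; keeping one runner out of reach of any single type lets it start as soon as it arrives.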

       

      The concept I'm thinking about is roughly:

      1) after MDL-67211 lands we have metadata on what is running and for how long total, grouped by type

      2) when we look at what should be picked up next, we weight the priorities by the totals above, so task types that are already running get progressively lower priorities

      3) We tune this so that if there are, say, 10 runners, then no one task can ever hog more than, say, 2/3 of the runners, so we always have something spare to start on new tasks, but we don't have an explicit limit on any one type of task

      4) If new types of task appear in the queue then we want to balance the runners across all of them. So if there are 5 types of task and 10 runners then each type should get roughly 2 processes regardless of how long each specific task takes.

      5) The current QoS layer slows down at scale, so try to rebuild it in SQL as much as possible
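
The weighting idea in points 2 to 4 could look roughly like the following. This is a hedged Python sketch of one possible reading, not the implementation: `pick_next`, `fill_runners` and `max_share` are hypothetical names, the weighting uses only the number of runners busy per task type (not total run time), and the 2/3 share is enforced here as a simple global cutoff.

```python
import math
from collections import Counter


def pick_next(queued, running, runners, max_share=2 / 3):
    """Choose the task type an idle runner should start next.

    queued:  task types with work waiting, in queue order
    running: one entry per busy runner, naming the type it is running
    Returns a task type, or None if the runner should stay idle.
    """
    busy = Counter(running)
    # Assumed global share: no type may hold more than this many runners.
    cap = max(1, math.floor(runners * max_share))
    candidates = [t for t in dict.fromkeys(queued) if busy[t] < cap]
    if not candidates:
        return None
    # Types with fewer busy runners win; ties fall back to queue order.
    return min(candidates, key=lambda t: busy[t])


def fill_runners(queued, runners, max_share=2 / 3):
    """Start idle runners one by one and report how many each type gets."""
    running = []
    while len(running) < runners:
        task_type = pick_next(queued, running, runners, max_share)
        if task_type is None:
            break
        running.append(task_type)
    return Counter(running)


# 5 task types and 10 runners: each type ends up with 2 runners.
print(fill_runners(["2s", "5s", "10s", "100s", "1000s"], runners=10))
# A single heavy type cannot take more than 6 of the 10 runners.
print(fill_runners(["backup"], runners=10))
```

In this reading the share is a property of the runner pool rather than a per-task setting, so no task type needs its own configured limit.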

       

       

      Attachments

        1. fixed-forever.mp4
          1.49 MB
        2. MDL-67648_1.webm
          1.01 MB
        3. MDL-67648_2.webm
          1.01 MB
        4. MDL-67648_3.webm
          1.11 MB
        5. MDL-67648_4.webm
          3.50 MB
        6. MDL-67648_5.webm
          1.64 MB
        7. MDL-67648_6.webm
          2.41 MB

            People

              cameron1729
              brendanheywood Brendan Heywood
              Brendan Heywood
              Ilya Tregubov
              Angelia Dela Cruz
              Matteo Scaramuccia, Andrew Lyons, Huong Nguyen, Jun Pataleta, Michael Hawkins, Shamim Rezaie, Simey Lameze, Stevani Andolo, Amaia Anabitarte, Carlos Escobedo, Ferran Recio, Ilya Tregubov, Laurent David, Raquel Ortega, Sara Arjona (@sarjona)
              Votes: 4
              Watchers: 15

              Dates

                Created:
                Updated:
                Resolved:
                14/Nov/22

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0m
                  Logged: 2h 10m