MDL-37431: Cleaning up old question previews cron slow for large sites

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3, 2.4.1, 2.5
    • Fix Version/s: 2.3.5, 2.4.2
    • Component/s: Quiz
    • Labels:
    • Testing Instructions:

      1. Run cron.

      2. Verify that there are no errors relating to the "Cleaning up old question previews...done." part of the output.

      3. If you really want to: Preview a question in the question bank. Close the pop-up. Verify there is a new row with component = 'core_question_preview' in the

      {question_usages}

      table. Wait more than 24 hours. Run cron again. Make sure that row is automatically deleted.
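
      If it helps, the manual check in step 3 can be done with a query along these lines (a sketch, assuming the default mdl_ table prefix):

      SELECT id, contextid, component
      FROM mdl_question_usages
      WHERE component = 'core_question_preview';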

    • Affected Branches:
      MOODLE_23_STABLE, MOODLE_24_STABLE, MOODLE_25_STABLE
    • Fixed Branches:
      MOODLE_23_STABLE, MOODLE_24_STABLE
    • Pull from Repository:
    • Pull 2.4 Branch:
    • Pull Master Branch:
    • Rank:
      47064

      Description

      We have a large database on PostgreSQL. To give an idea of how large, the relevant tables have the following row counts:

      Table | rows
      mdl_question_attempt_steps | 27,892,867
      mdl_question_attempts | 9,519,112
      mdl_question_usages | 918,954
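
      For anyone wanting to check comparable sizes on their own site, approximate live row counts can be read from PostgreSQL's statistics views (a sketch, assuming the standard mdl_ prefix):

      SELECT relname, n_live_tup
      FROM pg_stat_user_tables
      WHERE relname IN ('mdl_question_usages',
                        'mdl_question_attempts',
                        'mdl_question_attempt_steps')
      ORDER BY n_live_tup DESC;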

      As a result, the "Cleaning up old question previews..." part of cron can run for over an hour due to the amount of record scanning. MDL-29847 implemented this part of the functionality and chose the NOT EXISTS method for searching for previews. This, however, causes problems with large tables because the NOT EXISTS section does not constrain the join of attempts and attempt_steps, which results in very large joins at these data volumes.

      We end up with one of two plans depending on whether PostgreSQL does a "Nested Loop Anti Join" or a "Merge Anti Join".

      Query plans:

      Nested Loop Anti Join
      ---------------------
      HashAggregate (cost=2158531.26..2158531.27 rows=1 width=8) (actual time=696339.439..696339.439 rows=0 loops=1)
      -> Nested Loop (cost=2158522.77..2158531.26 rows=1 width=8) (actual time=696339.435..696339.435 rows=0 loops=1)
      -> HashAggregate (cost=2158522.77..2158522.78 rows=1 width=8) (actual time=696339.435..696339.435 rows=0 loops=1)
      -> Nested Loop Anti Join (cost=69.69..2158522.77 rows=1 width=8) (actual time=696339.433..696339.433 rows=0 loops=1)
      Join Filter: (qa.questionusageid = quba.id)
      -> Seq Scan on mdl_question_usages quba (cost=0.00..21615.64 rows=1 width=8) (actual time=93.213..139.845 rows=37 loops=1)
      Filter: ((component)::text = 'core_question_preview'::text)
      -> Merge Join (cost=69.69..2136839.09 rows=5443 width=8) (actual time=18732.730..18815.950 rows=2514 loops=37)
      Merge Cond: (qa.id = qas.questionattemptid)
      Join Filter: ((qa.timemodified > 1357563742) OR (qas.timecreated > 1357563742))
      -> Index Scan using mdl_quesatte_id2_pk on mdl_question_attempts qa (cost=0.00..673526.49 rows=9381639 width=24) (actual time=0.012..2541.247 rows=9518472 loops=37)
      -> Index Scan using mdl_quesattestep_que_ix on mdl_question_attempt_steps qas (cost=0.00..1005946.39 rows=27514781 width=16) (actual time=0.010..5562.191 rows=27891139 loops=37)
      -> Index Scan using mdl_quesatte_id_pk on mdl_question_usages (cost=0.00..8.47 rows=1 width=16) (never executed)
      Index Cond: (mdl_question_usages.id = quba.id)
      Total runtime: 696339.628 ms

      Merge Anti Join
      ---------------
      HashAggregate (cost=2219091.59..2219091.90 rows=31 width=8) (actual time=19052.529..19052.529 rows=0 loops=1)
      -> Nested Loop (cost=2218831.32..2219091.44 rows=31 width=8) (actual time=19052.527..19052.527 rows=0 loops=1)
      -> HashAggregate (cost=2218831.32..2218831.63 rows=31 width=8) (actual time=19052.527..19052.527 rows=0 loops=1)
      -> Merge Anti Join (cost=2218789.57..2218831.24 rows=31 width=8) (actual time=19052.525..19052.525 rows=0 loops=1)
      Merge Cond: (quba.id = qa.questionusageid)
      -> Sort (cost=20866.69..20866.77 rows=31 width=8) (actual time=110.707..110.709 rows=37 loops=1)
      Sort Key: quba.id
      Sort Method: quicksort Memory: 26kB
      -> Seq Scan on mdl_question_usages quba (cost=0.00..20865.93 rows=31 width=8) (actual time=70.771..110.688 rows=37 loops=1)
      Filter: ((component)::text = 'core_question_preview'::text)
      -> Sort (cost=2197922.88..2197943.63 rows=8301 width=8) (actual time=18941.455..18941.588 rows=2560 loops=1)
      Sort Key: qa.questionusageid
      Sort Method: quicksort Memory: 391kB
      -> Merge Join (cost=69.80..2197382.52 rows=8301 width=8) (actual time=18858.397..18940.720 rows=4241 loops=1)
      Merge Cond: (qa.id = qas.questionattemptid)
      Join Filter: ((qa.timemodified > 1357563742) OR (qas.timecreated > 1357563742))
      -> Index Scan using mdl_quesatte_id2_pk on mdl_question_attempts qa (cost=0.00..674159.33 rows=9519112 width=24) (actual time=0.016..2411.477 rows=9519112 loops=1)
      -> Index Scan using mdl_quesattestep_que_ix on mdl_question_attempt_steps qas (cost=0.00..1011370.03 rows=27892868 width=16) (actual time=0.011..6115.337 rows=27892867 loops=1)
      -> Index Scan using mdl_quesatte_id_pk on mdl_question_usages (cost=0.00..8.37 rows=1 width=16) (never executed)
      Index Cond: (mdl_question_usages.id = quba.id)
      Total runtime: 19052.673 ms

      Either way, both plans are slow, but the "Nested Loop Anti Join" runs for over 10 minutes, and multiple queries in cron appear to use the same NOT EXISTS structure as this one.

      The real performance bottleneck comes from the following section of the plan, and the corresponding part of the query:

      -> Merge Join (cost=69.69..2136839.09 rows=5443 width=8) (actual time=18732.730..18815.950 rows=2514 loops=37)
      Merge Cond: (qa.id = qas.questionattemptid)
      Join Filter: ((qa.timemodified > 1357563742) OR (qas.timecreated > 1357563742))
      -> Index Scan using mdl_quesatte_id2_pk on mdl_question_attempts qa (cost=0.00..673526.49 rows=9381639 width=24) (actual time=0.012..2541.247 rows=9518472 loops=37)
      -> Index Scan using mdl_quesattestep_que_ix on mdl_question_attempt_steps qas (cost=0.00..1005946.39 rows=27514781 width=16) (actual time=0.010..5562.191 rows=27891139 loops=37)

      SELECT 1
      FROM mdl_question_attempts qa
      JOIN mdl_question_attempt_steps qas ON qas.questionattemptid = qa.id
      WHERE qa.questionusageid = quba.id
      AND (qa.timemodified > 1357563742
      OR qas.timecreated > 1357563742)

      This creates an unconstrained join of all the question attempt data to work out which usages are still current. It is a correlated subquery that lends itself to a nested loop join, but the insufficient constraint on the search results in full index scans on every loop, each taking around 5 seconds.
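
      For context, the overall shape of the slow cleanup query is roughly the following (reconstructed from the plan and the fragment above; the actual code in core may differ slightly):

      SELECT DISTINCT contextid, 1
      FROM mdl_question_usages
      WHERE id IN (
          SELECT quba.id FROM mdl_question_usages quba
          WHERE quba.component = 'core_question_preview'
          AND NOT EXISTS (
              SELECT 1
              FROM mdl_question_attempts qa
              JOIN mdl_question_attempt_steps qas ON qas.questionattemptid = qa.id
              WHERE qa.questionusageid = quba.id
              AND (qa.timemodified > 1357563742 OR qas.timecreated > 1357563742)
          )
      );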

      In reality, all we care about here are question usages that really are previews, yet we scan all the attempt data above regardless of the usage type. This produces the slow performance seen.

      There are two ways to address this.

      1. Revert to using the rejected method from MDL-29847 which runs in 108ms on our environment.
      2. Improve the join constraint in the NOT EXISTS query so that it only considers core_question_preview usages, which are far fewer in number than the entire set of question attempts.

      Example:
      SELECT DISTINCT contextid, 1
      FROM mdl_question_usages
      WHERE id IN (
          SELECT quba.id FROM mdl_question_usages quba
          WHERE quba.component = 'core_question_preview'
          AND quba.id NOT IN (
              SELECT qa.questionusageid
              FROM mdl_question_attempts qa
              JOIN mdl_question_attempt_steps qas ON qas.questionattemptid = qa.id
              -- Additional join to constrain to previews
              JOIN mdl_question_usages l_quba ON qa.questionusageid = l_quba.id
              WHERE l_quba.component = 'core_question_preview'
              AND (qa.timemodified > 1357563742 OR qas.timecreated > 1357563742)
          )
      );

      This query runs in 300ms.

      Either of these options is an acceptable solution, as both provide a fast index scan on the question_attempts and question_attempt_steps tables. This means that as the row counts grow, performance will not degrade.

      I don't want to pre-empt the outcome, but the extra join constraint seems the safer option with respect to testing on various environments. I am not able to test on other database platforms, and I have not looked at where this query lives in the code. If my comments can be reviewed, I would look at producing a patch, provided one of the proposed solutions is acceptable.

        Issue Links

          Activity

          Russell Smith added a comment -

          This issue introduced cron code to clean up question previews; however, the query does not scale. This bug reports the non-scaling issue.

          Tim Hunt added a comment -

          Thank you very much for finding this problem and suggesting the solutions.

          Because this also has to work on MySQL, I think I will go for the simpler option of adding the redundant join to the subquery. Since this code only runs on cron, it does not have to be really highly performant, just good enough.
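
          For reference, the redundant join described here would make the correlated subquery look roughly like the following (a sketch only; the integrated patch may differ):

          SELECT 1
          FROM mdl_question_attempts qa
          JOIN mdl_question_attempt_steps qas ON qas.questionattemptid = qa.id
          -- Redundant join that restricts the anti-join to preview usages only.
          JOIN mdl_question_usages l_quba ON l_quba.id = qa.questionusageid
              AND l_quba.component = 'core_question_preview'
          WHERE qa.questionusageid = quba.id
          AND (qa.timemodified > 1357563742 OR qas.timecreated > 1357563742)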

          Tim Hunt added a comment -

          I think this fixes it. Submitting for integration review.

          Eloy Lafuente (stronk7) added a comment -

          The main moodle.git repository has just been updated with latest weekly modifications. You may wish to rebase your PULL branches to simplify history and avoid any possible merge conflicts. This would also make integrator's life easier next week.

          TIA and ciao

          Eloy Lafuente (stronk7) added a comment -

          Integrated (23, 24 & master), thanks!

          Adrian Greeve added a comment -

          Tested on 2.3, 2.4 and master integration branches.
          I didn't get any errors relating to cleaning up old question previews.
          Test passed.

          Dan Poltawski added a comment -

          Hurray! We did it! Thanks to all the reporters, testers, users and watchers for a bumper week of Moodling!


            People

            • Votes: 0
            • Watchers: 5
