Moodle
  1. Moodle
  2. MDL-37577

Deleting users and very slow quiz_get_attempt_usertime_sql

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 2.3.4, 2.4, 2.5
    • Fix Version/s: 2.3.5, 2.4.2
    • Component/s: Quiz
    • Labels:
    • Testing Instructions:
      Hide

      -1. Run unit tests on all DBS [Dan is volunteering]

      0. You need a course with at least one quiz Q and one group G1.

      1. Add student S1 to group G1, then start a quiz attempt as S1. Also start a quiz attempt as student S2 who is not currently in group G1.

      2. Add an time-limit or close data override for Group G1 in Quiz Q. Verify that there are no errors.

      3. Add student S2 to G1, and remove student S1. Verify that there are no errors.

      4. Run cron, verify that the quiz overdue attempts processing cron runs, and that there are no errors.

      5. Edit the quiz settings and change the time-limit and/or close date. Verify that there are no errors when you save the form.

      6. Delete group G1. Verify that there are no errors.

      Show
      -1. Run unit tests on all DBS [Dan is volunteering] 0. You need a course with at least one quiz Q and one group G1. 1. Add student S1 to group G1, then start a quiz attempt as S1. Also start a quiz attempt as student S2 who is not currently in group G1. 2. Add an time-limit or close data override for Group G1 in Quiz Q. Verify that there are no errors. 3. Add student S2 to G1, and remove student S1. Verify that there are no errors. 4. Run cron, verify that the quiz overdue attempts processing cron runs, and that there are no errors. 5. Edit the quiz settings and change the time-limit and/or close date. Verify that there are no errors when you save the form. 6. Delete group G1. Verify that there are no errors.
    • Affected Branches:
      MOODLE_23_STABLE, MOODLE_24_STABLE, MOODLE_25_STABLE
    • Fixed Branches:
      MOODLE_23_STABLE, MOODLE_24_STABLE
    • Pull from Repository:
    • Pull 2.4 Branch:
    • Pull Master Branch:
    • Rank:
      47231

      Description

      We have a large database and are attempting to delete users from it. That coupled with a large number if quiz attempts results in being only able to delete about 8 users per hour.
      What we are seeing is multiple database calls involving the sql query in quiz_get_attempt_usertime_sql(). These take approximately 30 seconds each. See below for further details.
      It appears that open quizes are recalculated for each course when a user is deleted. This means can get a large number of 30 seconds blocks for deleting of a single user.

      Having read all the related materials, this is not going to be a fun fix and producing the current situation was complex enough. We are running PostgreSQL so all information
      here is developed in that context and not tested against other databases.

      There are a couple of options to improve this;
      1. Update the delete users to not continually call quiz_get_attempt_usertime_sql(). It's not known if this is possible and wasn't the focus of investigation
      2. Improve the SQL query to run faster in large environments.

      Attempts have been made to improve the SQL performance, but haven't been tested against other databases.

      Tables sizes

      mdl_groups_members: 294686
      mdl_quiz_overrides: 558
      mdl_quiz: 14336
      mdl_quiz_attempts: 836894

      Commit introducing this SQL: 8e771aed93eea08cc3e9410283f5354e02311281

      Related Tracker Items:
      https://tracker.moodle.org/browse/MDL-35717
      https://tracker.moodle.org/browse/MDL-36842

      See attached: Initial Plan.txt

      With this plan we end up producing about 9 million rows before completing an aggregate which is nearly 500MB of data to process. This is slow and completed on disk.

      In an attempt to limit the number of rows produced at any nodes, the results need to be summarized where possible to ensure the smallest set of rows bubble up.
      The groups_members joins to mdl_quiz_overrides creates a large number of output rows to build a set of group overrides for some quizes. By only outputting the one
      set of results for each quiz and user in the system with a group override, the result set at this point is much smaller. This is then joined back into the original
      main query and re-aggregates the user overrides and default quiz limits. The results are a much smaller set of rows bubbling up at the expense of a potentially greater subset
      of rows produced at the qogroup level.

      In our large environment the changes resulted in a 10x speed increase in the results. The executed per course for user deletes make them complete substantially faster.

      This is the adjusted query that operates on PostgreSQL

      SELECT iquiza.id,
      COALESCE(MAX(quo.timeclose), MAX(qogroup.timeclose), iquiz.timeclose) AS usertimeclose,
      COALESCE(MAX(quo.timelimit), MAX(qogroup.timelimit), iquiz.timelimit) AS usertimelimit
      FROM mdl_quiz_attempts iquiza
      JOIN mdl_quiz iquiz ON iquiz.id = iquiza.quiz

      – Process user quiz overrides
      LEFT JOIN mdl_quiz_overrides quo ON quo.quiz = iquiza.quiz AND quo.userid = iquiza.userid
      LEFT JOIN
      – Building the aggregate of overrides considerably constrains the number of rows processed at the top aggregate.
      – In our environment it's a factor of 11 small at the top aggregate processing
      (SELECT gm.userid, qgo1.quiz,
      COALESCE(MAX(qgo1.timeclose), MAX(qgo2.timeclose)) AS timeclose,
      COALESCE(MAX(qgo3.timelimit), MAX(qgo4.timelimit)) AS timelimit
      FROM mdl_groups_members gm
      LEFT JOIN mdl_quiz_overrides qgo1 ON qgo1.groupid = gm.groupid AND qgo1.timeclose = 0
      LEFT JOIN mdl_quiz_overrides qgo2 ON qgo2.groupid = gm.groupid AND qgo2.timeclose > 0
      LEFT JOIN mdl_quiz_overrides qgo3 ON qgo3.groupid = gm.groupid AND qgo3.timelimit = 0
      LEFT JOIN mdl_quiz_overrides qgo4 ON qgo4.groupid = gm.groupid AND qgo4.timelimit > 0
      GROUP BY gm.userid, qgo1.quiz
      ) qogroup
      ON qogroup.quiz = iquiza.quiz AND qogroup.userid = iquiza.userid
      GROUP BY iquiza.id, iquiz.id, iquiz.timeclose, iquiz.timelimit;

      For the resulting Query plan for our dataset see Updated Plan.txt

      Please review and confirm these queries are equivilent and whether my approach to this problem is appropriate.
      It will also need to be confirmed this format of subquery works on all databases. I expect it does as the query itself is a subquery itself and I've just created another nested level.

      1. Initial Plan.txt
        5 kB
        Russell Smith
      2. Updated Plan.txt
        5 kB
        Russell Smith

        Issue Links

          Activity

          Hide
          Russell Smith added a comment -

          The above issues were created and the result introduced the query we are talking about here. There was a long thread to get to the state where it is now. This issue extends the performance impacts discussed to real world situations.

          Show
          Russell Smith added a comment - The above issues were created and the result introduced the query we are talking about here. There was a long thread to get to the state where it is now. This issue extends the performance impacts discussed to real world situations.
          Hide
          Tim Hunt added a comment -

          Thanks for finding this problem, and suggesting a fix.

          If you could make a patch of you change, it would be easier for me to work with.

          If the query is that pathologically bad, I am disappointed we can only get a 10x speed-up, although that is not to be sniffed at. It is normally possible to do several orders of magnitude.

          It would be better to take approach 1. of the two you suggest. There is not need to do this calculation at all when deleting users. I think it is worth digging further into the code.

          Show
          Tim Hunt added a comment - Thanks for finding this problem, and suggesting a fix. If you could make a patch of you change, it would be easier for me to work with. If the query is that pathologically bad, I am disappointed we can only get a 10x speed-up, although that is not to be sniffed at. It is normally possible to do several orders of magnitude. It would be better to take approach 1. of the two you suggest. There is not need to do this calculation at all when deleting users. I think it is worth digging further into the code.
          Hide
          Russell Smith added a comment -

          Looking into the why does delete user call the quiz_get_attempt_usertime_sql, I've discovered that the quiz module registers and event for any group change in a course that results in quiz_update_open_atempts being called. This explains and one of our other slow jobs as well. We do daily group updates on changed enrolments and student enrolments. Any changes to ggroups call this slow function for each user in each course. If I'm understanding how the events systems works correctly. Deleting a user removes the user from all groups in all their courses and generates aan open quiz recalculation for each course they are being removed from. That can be 20+ courses in our case. I don't see a way for delete user to ignore the event even though it doesn't seem relevant. I have no experience at all with with the moodle events system to understand how I might interact with it in this way to improve the situation there. Any pointers or ideas on this front?

          Show
          Russell Smith added a comment - Looking into the why does delete user call the quiz_get_attempt_usertime_sql, I've discovered that the quiz module registers and event for any group change in a course that results in quiz_update_open_atempts being called. This explains and one of our other slow jobs as well. We do daily group updates on changed enrolments and student enrolments. Any changes to ggroups call this slow function for each user in each course. If I'm understanding how the events systems works correctly. Deleting a user removes the user from all groups in all their courses and generates aan open quiz recalculation for each course they are being removed from. That can be 20+ courses in our case. I don't see a way for delete user to ignore the event even though it doesn't seem relevant. I have no experience at all with with the moodle events system to understand how I might interact with it in this way to improve the situation there. Any pointers or ideas on this front?
          Hide
          Russell Smith added a comment -

          Revisiting options for SQL optimization;

          The slow query function is called from 2 places:

          mod/quiz/locallib.php: $quizausersql = quiz_get_attempt_usertime_sql();
          mod/quiz/cronlib.php: $quizausersql = quiz_get_attempt_usertime_sql();

          The cronlib call site has a legitimate case for wanting to see all attempts as it's looking for all overdue attempts putting the cron in a position where it needs all the data. I expect it needs the optimized version for looking at all attempts. In that case, 4 seconds is not bad for cron.

          Inspecting the locallib.php call site leads me to believe the function is too generic an implementation of what's required in a number of cases. I believe the query needs to be pushed up one function to allow more constraints to be put into the query. In the course and user case we are seeing, only a very small subset of that data is interesting and putting more constraints brings the query into the milliseconds range which is more the performance improvement you were hoping for. What is the best way to go about coding that? Pass the possible conditions into that function or removed the function completely?

          Show
          Russell Smith added a comment - Revisiting options for SQL optimization; The slow query function is called from 2 places: mod/quiz/locallib.php: $quizausersql = quiz_get_attempt_usertime_sql(); mod/quiz/cronlib.php: $quizausersql = quiz_get_attempt_usertime_sql(); The cronlib call site has a legitimate case for wanting to see all attempts as it's looking for all overdue attempts putting the cron in a position where it needs all the data. I expect it needs the optimized version for looking at all attempts. In that case, 4 seconds is not bad for cron. Inspecting the locallib.php call site leads me to believe the function is too generic an implementation of what's required in a number of cases. I believe the query needs to be pushed up one function to allow more constraints to be put into the query. In the course and user case we are seeing, only a very small subset of that data is interesting and putting more constraints brings the query into the milliseconds range which is more the performance improvement you were hoping for. What is the best way to go about coding that? Pass the possible conditions into that function or removed the function completely?
          Hide
          Tim Hunt added a comment -

          I just added some debugging, and then deleted a user, and quiz_get_attempt_usertime_sql was never called.

          So, please can you find out why it is getting called on your system.

          Show
          Tim Hunt added a comment - I just added some debugging, and then deleted a user, and quiz_get_attempt_usertime_sql was never called. So, please can you find out why it is getting called on your system.
          Hide
          Tim Hunt added a comment -

          Russell, Sorry, I did not see your comments before I made mine. (Grrr! Ajax UIs.)

          I just discussed this with a colleague, and we can't disagree with the fact that Moodle raises the groups_members_removed events while deleting a user, even though it is a pain in this case.

          quiz_get_attempt_usertime_sql is really horrible complex code, and it is really good that we don't duplicate that logic.

          OK, I have done a patch that adds redundant WHERE conditions in the subquery. That should solve the performance issue. Are you able to test it?

          Matt, would you be able to give this a quiz peer-review? Thank.

          Show
          Tim Hunt added a comment - Russell, Sorry, I did not see your comments before I made mine. (Grrr! Ajax UIs.) I just discussed this with a colleague, and we can't disagree with the fact that Moodle raises the groups_members_removed events while deleting a user, even though it is a pain in this case. quiz_get_attempt_usertime_sql is really horrible complex code, and it is really good that we don't duplicate that logic. OK, I have done a patch that adds redundant WHERE conditions in the subquery. That should solve the performance issue. Are you able to test it? Matt, would you be able to give this a quiz peer-review? Thank.
          Hide
          Andrew Nicols added a comment -

          Stealing the Peer Review off Matt after discussion with Tim.

          Show
          Andrew Nicols added a comment - Stealing the Peer Review off Matt after discussion with Tim.
          Hide
          Andrew Nicols added a comment -

          Thanks Tim,

          I've tested this patch and all seems to be working as expected. Our sync time has gone from 8+ hours back to it's normal 2 minutes.

          On the code review front I just have a couple of minor points.

          [N] Syntax
          [Y] Output
          [Y] Whitespace
          [-] Language
          [Y] Databases
          [N] Testing
          [Y] Security
          [-] Documentation
          [Y] Git
          [Y] Sanity check

          Syntax

          • Typo in the the new @param string for quiz_get_attempt_usertime_sql (attmepts -> attempts)
          • Missing trailing . on the @return string
          • Imploding with "\n " for the where clauses. Is this intentionally left in, or was it from debugging? There isn't a precedent for this in Moodle and, although it makes debugging the statement easier, I'm not sure we should be doing this. I'll leave it to you to decide and/or integrators to consider.

          Testing
          Just missing the testing instructions.

          Otherwise everything looks good. Feel free to submit for integration when ready.

          Andrew

          Show
          Andrew Nicols added a comment - Thanks Tim, I've tested this patch and all seems to be working as expected. Our sync time has gone from 8+ hours back to it's normal 2 minutes. On the code review front I just have a couple of minor points. [N] Syntax [Y] Output [Y] Whitespace [-] Language [Y] Databases [N] Testing [Y] Security [-] Documentation [Y] Git [Y] Sanity check Syntax Typo in the the new @param string for quiz_get_attempt_usertime_sql (attmepts -> attempts) Missing trailing . on the @return string Imploding with "\n " for the where clauses. Is this intentionally left in, or was it from debugging? There isn't a precedent for this in Moodle and, although it makes debugging the statement easier, I'm not sure we should be doing this. I'll leave it to you to decide and/or integrators to consider. Testing Just missing the testing instructions. Otherwise everything looks good. Feel free to submit for integration when ready. Andrew
          Hide
          Tim Hunt added a comment -

          Thanks for your review Andrew. I have fixed the comment. I think we should leave the nice SQL formatting. I did indeed add it while debugging, and it make things easier. So, while I would not bother to do that in other situations, since I did do it here, we should leave it in.

          Submitting for integration now.

          Show
          Tim Hunt added a comment - Thanks for your review Andrew. I have fixed the comment. I think we should leave the nice SQL formatting. I did indeed add it while debugging, and it make things easier. So, while I would not bother to do that in other situations, since I did do it here, we should leave it in. Submitting for integration now.
          Hide
          Michael Aherne added a comment -

          In case it's of interest, we applied this patch on our live server after a 2.3.4 upgrade and it has sped up group add / remove from over 30s per user to about 3s. We have 124977 quiz attempts but only 16 quiz overrides.

          Show
          Michael Aherne added a comment - In case it's of interest, we applied this patch on our live server after a 2.3.4 upgrade and it has sped up group add / remove from over 30s per user to about 3s. We have 124977 quiz attempts but only 16 quiz overrides.
          Hide
          Matt Petro added a comment -

          Thanks for fixing and peer reviewing this. I didn't get any emails from the ticket until this morning, not sure why.

          The changes seem good, but I'm wondering if we can't simplify this even more:

          In quiz_update_open_attempts(), just set quiza.timecheckstate = 0 for all affected attempts, which eliminates the monster query all together. Cron will later fix it up.

          Show
          Matt Petro added a comment - Thanks for fixing and peer reviewing this. I didn't get any emails from the ticket until this morning, not sure why. The changes seem good, but I'm wondering if we can't simplify this even more: In quiz_update_open_attempts(), just set quiza.timecheckstate = 0 for all affected attempts, which eliminates the monster query all together. Cron will later fix it up.
          Hide
          Tim Hunt added a comment -

          @Michael, thank you for the additional testing.

          @Matt, thanks for looking. You are be right, but I think with this patch, that has been testing, it is working fine. The new query seems to be very fast. So, let us leave it until someone can find a situation where the new query takes more than a few hundredths of a second.

          Show
          Tim Hunt added a comment - @Michael, thank you for the additional testing. @Matt, thanks for looking. You are be right, but I think with this patch, that has been testing, it is working fine. The new query seems to be very fast. So, let us leave it until someone can find a situation where the new query takes more than a few hundredths of a second.
          Hide
          Matt Petro added a comment -

          Excellent! Russell and Tim, thanks for fixing up my code! When I wrote those routines I didn't know about the need to include redundant WHERE clauses in subqueries for situations like this, as I figured that the SQL engine would figure that out on its own. We just started seeing problems on our own site due to this problem, and I'll push out the fix now. Cheers!

          Show
          Matt Petro added a comment - Excellent! Russell and Tim, thanks for fixing up my code! When I wrote those routines I didn't know about the need to include redundant WHERE clauses in subqueries for situations like this, as I figured that the SQL engine would figure that out on its own. We just started seeing problems on our own site due to this problem, and I'll push out the fix now. Cheers!
          Hide
          Eloy Lafuente (stronk7) added a comment -

          Integrated (23, 24 & master), thanks!

          PS: Each time I look to dynamic-sql-generation like that (passing as function params, INs everywhere, imploding for formatting...) I think someone is really crazy or, alternatively, the world sucks and it's tremendously complex.

          PS2: But if it works, it works, thanks for the export everybody!

          PS3: I'd suggest to try it under all DBs, I cannot imagine any SQL problem but...

          Show
          Eloy Lafuente (stronk7) added a comment - Integrated (23, 24 & master), thanks! PS: Each time I look to dynamic-sql-generation like that (passing as function params, INs everywhere, imploding for formatting...) I think someone is really crazy or, alternatively, the world sucks and it's tremendously complex. PS2: But if it works, it works, thanks for the export everybody! PS3: I'd suggest to try it under all DBs, I cannot imagine any SQL problem but...
          Hide
          Tim Hunt added a comment -

          Eloy, there are unit tests which should include testing this key query. Hopefully cibot will run those under all DBs.

          Show
          Tim Hunt added a comment - Eloy, there are unit tests which should include testing this key query. Hopefully cibot will run those under all DBs.
          Hide
          Russell Smith added a comment -

          Just returning to confirm we have seen the performance improved we expected from these changes. The 30seconds of queries are now running in < 10ms most of the time. The worst we have seen for delete user is about 11ms. Thanks for the fixes and reviews everybody.

          Sorry for the delay in responding this time, we had so other things to attend to and it took some time to get back to following this up.

          Show
          Russell Smith added a comment - Just returning to confirm we have seen the performance improved we expected from these changes. The 30seconds of queries are now running in < 10ms most of the time. The worst we have seen for delete user is about 11ms. Thanks for the fixes and reviews everybody. Sorry for the delay in responding this time, we had so other things to attend to and it took some time to get back to following this up.
          Hide
          Dan Poltawski added a comment -

          Unfortunately the CI servers do not yet run the tests on all dbs, so I added a note to the testing instructions.

          Show
          Dan Poltawski added a comment - Unfortunately the CI servers do not yet run the tests on all dbs, so I added a note to the testing instructions.
          Hide
          Tim Hunt added a comment -

          While you are at it, you should get the CI servers to run the test on Windows too. I have been noticing failures on Windows recently, and I don't think it should always be my job to fix them

          Show
          Tim Hunt added a comment - While you are at it, you should get the CI servers to run the test on Windows too. I have been noticing failures on Windows recently, and I don't think it should always be my job to fix them
          Hide
          Tim Hunt added a comment -

          @Russell, Thanks, that is useful feedback, and not too late to be useful.

          Show
          Tim Hunt added a comment - @Russell, Thanks, that is useful feedback, and not too late to be useful.
          Hide
          Andrew Davis added a comment -

          I'm going through the manual component of the testing. I am however unable to run the unit tests on the various DBs.

          Show
          Andrew Davis added a comment - I'm going through the manual component of the testing. I am however unable to run the unit tests on the various DBs.
          Hide
          Andrew Davis added a comment -

          The manual testing all went off without a hitch. Just waiting on the unit tests now.

          Show
          Andrew Davis added a comment - The manual testing all went off without a hitch. Just waiting on the unit tests now.
          Hide
          Dan Poltawski added a comment -

          Looks good on oracle/pg/mysql.

          Show
          Dan Poltawski added a comment - Looks good on oracle/pg/mysql.
          Hide
          Dan Poltawski added a comment -

          Damyon tested on mssql and looks good there too.

          Show
          Dan Poltawski added a comment - Damyon tested on mssql and looks good there too.
          Hide
          Eloy Lafuente (stronk7) added a comment -

          Surely you will be happy to know that your code is now part of Moodle upstream. Thanks, thanks!

          Closing as fixed, ciao

          Show
          Eloy Lafuente (stronk7) added a comment - Surely you will be happy to know that your code is now part of Moodle upstream. Thanks, thanks! Closing as fixed, ciao

            People

            • Votes:
              1 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: