Moodle
  1. Moodle
  2. MDL-32462

Cron performance problems after upgrading to Moodle 2.2.x

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2.2
    • Fix Version/s: 2.2.3
    • Component/s: Roles / Access
    • Labels:
      None
    • Testing Instructions:
      Hide

      1. Make a test script like

      <?php
      require_once('config.php');
      context_helper::build_all_paths(true);
      echo 'done';
      

      run it, and verify that there are no errors.

      2. Change the 'true' to 'false', and run the script again.

      3. Go in to the context table, find two rows with contextlevel = 30 (user). In one set depth to 0, and in the other set path to NULL. Run the 'false' version of the test script again. Make sure it sets the data back to the correct values.

      4. Again, edit rows in the user table, but set depth to something silly, like 100, and path to a random string. Run the 'true' version of the test script again. Make sure it sets the data back to the correct values.

      Show
      1. Make a test script like <?php require_once('config.php'); context_helper::build_all_paths( true ); echo 'done'; run it, and verify that there are no errors. 2. Change the 'true' to 'false', and run the script again. 3. Go in to the context table, find two rows with contextlevel = 30 (user). In one set depth to 0, and in the other set path to NULL. Run the 'false' version of the test script again. Make sure it sets the data back to the correct values. 4. Again, edit rows in the user table, but set depth to something silly, like 100, and path to a random string. Run the 'true' version of the test script again. Make sure it sets the data back to the correct values.
    • Affected Branches:
      MOODLE_22_STABLE
    • Fixed Branches:
      MOODLE_22_STABLE
    • Pull from Repository:
    • Pull Master Branch:
    • Rank:
      39346

      Description

      Cron and accesslib.php were both changed substantially in Moodle 2.2.

      One bit in particular has been causing us problems since we upgraded from 2.1.x this morning. In cron-lib, there are now calls to

      context_helper::cleanup_instances(); // Line 149 of lib/cronlib.php
      context_helper::build_all_paths(false); // Line 151 of lib/cronlib.php
      context_helper::create_instances(); // Line 183 of lib/cronlib.php

      At least one of thase calls was bringing down our server, but we are not 100% sure which. We commented out all three lines of code, and now running cron does not bring down the server, and we don't really want to experiment enabling them one at a time, and see which crashes our live systemwink

      See http://moodle.org/mod/forum/discuss.php?d=199830 for more info.

        Activity

        Hide
        Petr Škoda added a comment -

        How can I reproduce this on my test server? Could you tell us at least how many contexts of each level you have in database and what database are you using?

        Show
        Petr Škoda added a comment - How can I reproduce this on my test server? Could you tell us at least how many contexts of each level you have in database and what database are you using?
        Hide
        Tim Hunt added a comment -

        Sorry Petr, I am not expecting you to work on this. I only created this bug so I could submit a patch for peer review.

        Show
        Tim Hunt added a comment - Sorry Petr, I am not expecting you to work on this. I only created this bug so I could submit a patch for peer review.
        Hide
        Tim Hunt added a comment -

        We think the problem only affects Postgres, and is to do with how Postgres handles concurrency. Basically, in pg, when you alter a row, it actually writes a new row, then when the transaction is committed, marks the old row as deleted, and the new row as live. Later, the deleted rows get cleaned up by a VACUUM.

        At the moment, context_user::build_paths is doing an update that will alter every row (even though the data in almost all the rows will not change). We think that this, or rather the auto-vacuum that it triggers soon after wards, is killing our Moodle by locking the the context table for some time.

        Therefore, this patch (by Derek Woolhead) changes the query add a where clause so that we only rewrite rows where the data will actually change.

        Show
        Tim Hunt added a comment - We think the problem only affects Postgres, and is to do with how Postgres handles concurrency. Basically, in pg, when you alter a row, it actually writes a new row, then when the transaction is committed, marks the old row as deleted, and the new row as live. Later, the deleted rows get cleaned up by a VACUUM. At the moment, context_user::build_paths is doing an update that will alter every row (even though the data in almost all the rows will not change). We think that this, or rather the auto-vacuum that it triggers soon after wards, is killing our Moodle by locking the the context table for some time. Therefore, this patch (by Derek Woolhead) changes the query add a where clause so that we only rewrite rows where the data will actually change.
        Hide
        Petr Škoda added a comment -

        I think it could be build paths for user context level, I can try adding $force condition similar to other context and test it with a few tens of thousands of users.

        Show
        Petr Škoda added a comment - I think it could be build paths for user context level, I can try adding $force condition similar to other context and test it with a few tens of thousands of users.
        Hide
        Tim Hunt added a comment -

        OK, here is our proposed change. Since the issue is Postgres only, this is not actually going to help Dan P with moodle.org.

        Anyway, comments please.

        Show
        Tim Hunt added a comment - OK, here is our proposed change. Since the issue is Postgres only, this is not actually going to help Dan P with moodle.org. Anyway, comments please.
        Hide
        Tim Hunt added a comment -

        If we like this change, obviously it needs to be backported to 2.2 stable.

        Show
        Tim Hunt added a comment - If we like this change, obviously it needs to be backported to 2.2 stable.
        Hide
        Petr Škoda added a comment -

        I guess it might be even faster to do only "depth = 0 OR path IS NULL" if !$force and the query in patch if $force.

        Show
        Petr Škoda added a comment - I guess it might be even faster to do only "depth = 0 OR path IS NULL" if !$force and the query in patch if $force.
        Hide
        Dan Poltawski added a comment -

        Do we only need to care about the user? Should we consider the same thing happeneing in other contextlevels?

        Show
        Dan Poltawski added a comment - Do we only need to care about the user? Should we consider the same thing happeneing in other contextlevels?
        Hide
        Tim Hunt added a comment -

        I think that the code for most other context levels already has this sort of check built in, or is just not significant. At the OU we have half a million users, a few thousand courses, and about a hundred categories.

        Show
        Tim Hunt added a comment - I think that the code for most other context levels already has this sort of check built in, or is just not significant. At the OU we have half a million users, a few thousand courses, and about a hundred categories.
        Hide
        Dan Poltawski added a comment - - edited

        +1 from me on this, but I reffer you to Petrs' comments since hes the expert in this area.

        Show
        Dan Poltawski added a comment - - edited +1 from me on this, but I reffer you to Petrs' comments since hes the expert in this area.
        Hide
        Tim Hunt added a comment -

        So, shall I do a version of the patch the respects the $force parameter?

        Show
        Tim Hunt added a comment - So, shall I do a version of the patch the respects the $force parameter?
        Hide
        Petr Škoda added a comment -

        yes please, the IS NULL should be a lot faster

        Show
        Petr Škoda added a comment - yes please, the IS NULL should be a lot faster
        Hide
        Tim Hunt added a comment -

        OK. Patch revised. Please can you look at https://github.com/timhunt/moodle/compare/master...MDL-32462 again.

        I note that the corresponding code for courses and modules is much more complex for some reason.

        Show
        Tim Hunt added a comment - OK. Patch revised. Please can you look at https://github.com/timhunt/moodle/compare/master...MDL-32462 again. I note that the corresponding code for courses and modules is much more complex for some reason.
        Hide
        Petr Škoda added a comment -

        looks ok, but I did not tests the code, I guess we could improve the accesslib unittests a bit to test both $force values
        anyway +1 for integration

        Show
        Petr Škoda added a comment - looks ok, but I did not tests the code, I guess we could improve the accesslib unittests a bit to test both $force values anyway +1 for integration
        Hide
        Tim Hunt added a comment -

        I tested it to the extent of making a test script that called context_helper::build_all_paths(false); and context_helper::build_all_paths(true); and ensuring there were no fatal errors.

        Show
        Tim Hunt added a comment - I tested it to the extent of making a test script that called context_helper::build_all_paths(false); and context_helper::build_all_paths(true); and ensuring there were no fatal errors.
        Hide
        Tim Hunt added a comment -

        Testing instructions written.

        I will rebase and cherry-pick once the weeklies are out.

        Show
        Tim Hunt added a comment - Testing instructions written. I will rebase and cherry-pick once the weeklies are out.
        Hide
        Tim Hunt added a comment -

        Submitting for integration now. Thanks to everyone who looked at this.

        Note that we are not yet running this fix on our live servers. We don't normally change core code locally if we can help it. Instead, we wait until the commit is integrated into the stable branch and then cherry-pick it. However, in this case we are considering patching it to the live servers next week (which means next Tuesday morning UK time) which means that we will know if it solves our cron performance problems, or not, before the weeklies have to be rolled.

        Show
        Tim Hunt added a comment - Submitting for integration now. Thanks to everyone who looked at this. Note that we are not yet running this fix on our live servers. We don't normally change core code locally if we can help it. Instead, we wait until the commit is integrated into the stable branch and then cherry-pick it. However, in this case we are considering patching it to the live servers next week (which means next Tuesday morning UK time) which means that we will know if it solves our cron performance problems, or not, before the weeklies have to be rolled.
        Hide
        Petr Škoda added a comment -

        it would be great to know if we should look for more problems, thanks a lot!

        Show
        Petr Škoda added a comment - it would be great to know if we should look for more problems, thanks a lot!
        Hide
        Dan Poltawski added a comment -

        Thanks, i've integrated this now.

        Tim: it'll be really great if you can report on the results of your run in production. thanks.

        Show
        Dan Poltawski added a comment - Thanks, i've integrated this now. Tim: it'll be really great if you can report on the results of your run in production. thanks.
        Hide
        Jason Fowler added a comment -

        Works fine

        Show
        Jason Fowler added a comment - Works fine
        Hide
        Eloy Lafuente (stronk7) added a comment -

        This has been near becoming rejected, because it's not the best code you are able to produce.

        But, luckily, at the end, it has landed and has been spread to all repos out there.

        Many thanks and, don't forget it, keep improving your skills, you can!

        Closing, ciao

        Show
        Eloy Lafuente (stronk7) added a comment - This has been near becoming rejected, because it's not the best code you are able to produce. But, luckily, at the end, it has landed and has been spread to all repos out there. Many thanks and, don't forget it, keep improving your skills, you can! Closing, ciao
        Hide
        Derek Woolhead added a comment -

        Eloy, please explain your last comment because as far as I am concerned no thought whatsoever has been given to code quality/performance/scalability with regard to the context changes for Moodle 2.2

        Show
        Derek Woolhead added a comment - Eloy, please explain your last comment because as far as I am concerned no thought whatsoever has been given to code quality/performance/scalability with regard to the context changes for Moodle 2.2
        Hide
        Petr Škoda added a comment -

        Hi Derek, I think Eloy meant it as a joke. You are right that the performance/scalability was not my major concern when coding the accesslib changes, the changes in 2.2 actually fixed bugs, cleaned up the internal implementation and part of it was also introduction of unit tests + inline documentation. This should allow us to work on new features (such as cohort contexts, multitenant if we ever agree to implement it, etc.) and implement new performance improvements (such as memcahe caches, deduplication of caches in cron, etc.). The biggest problem is that majority of core developers do not have access to any meaningful large test datasets, you can not expect us to actually test scalability when we do not have access to test servers with configuration and data similar to real production sites, right? So for now we can only make sure that the code returns correct data, carefully review code and wait till somebody tests it with data from/on production servers and if any problems are reported we try to fix them as soon as possible. But then again we have to rely on somebody else to actually prove the improvements work as expected for large installs.

        Show
        Petr Škoda added a comment - Hi Derek, I think Eloy meant it as a joke. You are right that the performance/scalability was not my major concern when coding the accesslib changes, the changes in 2.2 actually fixed bugs, cleaned up the internal implementation and part of it was also introduction of unit tests + inline documentation. This should allow us to work on new features (such as cohort contexts, multitenant if we ever agree to implement it, etc.) and implement new performance improvements (such as memcahe caches, deduplication of caches in cron, etc.). The biggest problem is that majority of core developers do not have access to any meaningful large test datasets, you can not expect us to actually test scalability when we do not have access to test servers with configuration and data similar to real production sites, right? So for now we can only make sure that the code returns correct data, carefully review code and wait till somebody tests it with data from/on production servers and if any problems are reported we try to fix them as soon as possible. But then again we have to rely on somebody else to actually prove the improvements work as expected for large installs.

          People

          • Votes:
            1 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: