Moodle
  1. Moodle
  2. MDL-19702

lib/accesslib.php causes php to crash while executing: cache_context() - [patch]

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.9.5, 1.9.6, 1.9.7, 1.9.8, 1.9.9, 2.0
    • Fix Version/s: 1.9.11
    • Component/s: Libraries, Performance
    • Environment:
      moodle 1.9.5+ (Version: May 27th 2009), PHP Version 5.2.9, Apache Version 2.2.11, eAccelerator Version 0.9.5.3
    • Database:
      PostgreSQL
    • Difficulty:
      Moderate
    • Affected Branches:
      MOODLE_19_STABLE, MOODLE_20_STABLE
    • Fixed Branches:
      MOODLE_19_STABLE
    • Rank:
      406

      Description

      Since updating from m194+ to m1.9.5+ (Version: May 27th 2009), under heavy traffic, we observed (initially) php crashes while executing: "cache_context($context);" (see logs below) leading to continuously dying apache instances right after birth. As a result pages aren't fully served any more, logged in useres can't perform any course/context related actions any more (well, at least nothing)....and moodle is completely messed-up. A graceful apache restart is sufficient to come back to a defined operation level.

      To Reproduce:
      On our development server i tried hard to reproduce such a state.... and therefore this seems to be a little bit "unrealistic" usecase:
      0. in php.ini set "max_execution_time" to some useless short priod of time, e.g. 10 sec.
      1. Set the myMoodle redirect after login as default ( admin/settings.php?section=mymoodle ) // mymoodleredirect
      2. Set the maximum number of courses for the myMoodle site to.... something huge... e.g. 300 // mycoursesperpage
      3. Check out an user having many, many courses.... e.g. 200 (or create one: simply by assigning an user e.g. as teacher within a category with 200+ courses)
      4. Login in as this particular user... "start using moodle", and my be two or three more...
      5. watch your apache und php logs......

      Obervation:
      lib/accesslib.php will be terminated due to the exceed of the maximum execution time.... and voila, here's the mess....
      ... in best cases logged in useres accessing their myMoodle Site will see "No course information to show"

      Workaround:
      1. Limit the maximum number of courses for the myMoodle site to: 21 (default) - 50 ? // mycoursesperpage
      2. raise the "max_execution_time" in php.ini to 60+++ sec.
      ------------------------------------------------------------------------------------------------------------------------------------------------
      Logs:
      php_log
      [25-Jun-2009 11:27:11] PHP Fatal error: Maximum execution time of 60 seconds exceeded in <docroot>/moodle/lib/accesslib.php on line 2538
      [25-Jun-2009 11:29:34] PHP Fatal error: Maximum execution time of 60 seconds exceeded in<docroot>/moodle/lib/accesslib.php on line 182
      [25-Jun-2009 11:31:01] PHP Fatal error: Maximum execution time of 60 seconds exceeded in <docroot>/moodle/lib/accesslib.php on line 2538

      error_log
      [8814] EACCELERATOR: PHP crashed on opline 17 of round() at <docroot>/moodle/mod/forum/lib.php:2264
      [Thu Jun 25 11:31:31 2009] [notice] child pid 8814 exit signal Floating point exception (8)

      access_log
      "nothing uncommon"...... normal usage......
      ------------------------------------------------------------------------------------------------------------------------------------------------
      This issue may be related to:
      http://tracker.moodle.org/browse/MDL-18768
      http://tracker.moodle.org/browse/MDL-19288

      1. accesslib.cache.2.patch
        14 kB
        Sam Marshall
      2. accesslib.php.patch
        5 kB
        Jens Eremie
      3. lib.php.patch
        3 kB
        Jens Eremie
      4. MDL-19702-demo.php
        2 kB
        David Mudrak
      1. myCourses.png
        9 kB

        Issue Links

          Activity

          Hide
          Petr Škoda added a comment -

          lowering the severity, unfortunately our current roles related code uses a lot of memory - but in any case you EACCELERATOR should not be crashing like this - maybe there is a problem in it

          Show
          Petr Škoda added a comment - lowering the severity, unfortunately our current roles related code uses a lot of memory - but in any case you EACCELERATOR should not be crashing like this - maybe there is a problem in it
          Hide
          Jens Eremie added a comment -

          Part I:
          --------
          In lib/accesslib.php the "caching" of the context table (cache_context()) is done by simply using an array with the limit of 5000 tuples ('MAX_CONTEXT_CACHE_SIZE').
          When this limit is reached and there is need to cache more tuples the size of this array is calculated and one single tuple is deleted:

          while (count($context_cache_id) >= MAX_CONTEXT_CACHE_SIZE)

          { $first = reset($context_cache_id); unset($context_cache_id[$first->id]); unset($context_cache[$first->contextlevel][$first->instanceid]); }

          The observation here is:
          when 5000 isn't enough by now, its very, very likely that there is much more to cache 6000, 6500,....,7000
          and in this case the calculation of the array size and deleting one single tuple for 1000, 2000, .... 3000 times is far, far too expensive.

          Solution:
          1. get rid of "count($context_cache_id)" by simply counting
          2. remove bigger chunks (e.g. 500 tuples at once) of the array which is much faster since we even don't know which tuples will be read first/next.....

          Implication:
          1. never allow direct write access to the cache array, other ways the cache count will be incorrect
          2. when tuples are removed from the cache array, also remove them from $preloadedcourses
          3. when the cache array is unset, also unset $preloadedcourses

          Observation:
          MAX_CONTEXT_CACHE_SIZE could be reduced to about (+/.- 500) 2500. Because above 2500 there is no performance plus and the cache is flushed serveral time anyway....

          Part II:
          ---------
          In course/lib.php the courses modules "activity" is cached ( get_fast_modinfo() ) in the same way.... and the array is limited via MAX_MODINFO_CACHE_SIZE=10.
          get_fast_modinfo() is called when e.g. an admin assignes some particular user as teacher within a category -> as often as many courses there are within this category...
          or on every visit of the myMoodle site for every course she/he is subscribed in....

          Oberservation:
          Since the tuples come in order to the cache and are read in the same order and if the first tuple is the one which is removed from cache when:
          MAX_MODINFO_CACHE_SIZE1=10.
          Its obvious that here's a cache miss.... and the cache holds for 10+ tuples never the desired one.... and every tuple will be read at least 4 times, if present.....

          Soution:
          1. get rid of "count($context_cache_id)" by simply counting
          2. raise MAX_MODINFO_CACHE_SIZE to at least 20+++ // i choose 250 which leads the apache to approx. 100MB which should be ok for the most.....
          3. remove randomly tuples which have no cache hit by now....

          ====
          The performance plus of this two little changes is drastic!
          E.g. myMoodle Pages with more than 200 courses which took 70+ secs to be served are now served within 2 1/2 secs.....

          Please test & check !

          Show
          Jens Eremie added a comment - Part I: -------- In lib/accesslib.php the "caching" of the context table (cache_context()) is done by simply using an array with the limit of 5000 tuples ('MAX_CONTEXT_CACHE_SIZE'). When this limit is reached and there is need to cache more tuples the size of this array is calculated and one single tuple is deleted: while (count($context_cache_id) >= MAX_CONTEXT_CACHE_SIZE) { $first = reset($context_cache_id); unset($context_cache_id[$first->id]); unset($context_cache[$first->contextlevel][$first->instanceid]); } The observation here is: when 5000 isn't enough by now, its very, very likely that there is much more to cache 6000, 6500,....,7000 and in this case the calculation of the array size and deleting one single tuple for 1000, 2000, .... 3000 times is far, far too expensive. Solution: 1. get rid of "count($context_cache_id)" by simply counting 2. remove bigger chunks (e.g. 500 tuples at once) of the array which is much faster since we even don't know which tuples will be read first/next..... Implication: 1. never allow direct write access to the cache array, other ways the cache count will be incorrect 2. when tuples are removed from the cache array, also remove them from $preloadedcourses 3. when the cache array is unset, also unset $preloadedcourses Observation: MAX_CONTEXT_CACHE_SIZE could be reduced to about (+/.- 500) 2500. Because above 2500 there is no performance plus and the cache is flushed serveral time anyway.... Part II: --------- In course/lib.php the courses modules "activity" is cached ( get_fast_modinfo() ) in the same way.... and the array is limited via MAX_MODINFO_CACHE_SIZE=10. get_fast_modinfo() is called when e.g. an admin assignes some particular user as teacher within a category -> as often as many courses there are within this category... or on every visit of the myMoodle site for every course she/he is subscribed in.... Oberservation: Since the tuples come in order to the cache and are read in the same order and if the first tuple is the one which is removed from cache when: MAX_MODINFO_CACHE_SIZE1=10. Its obvious that here's a cache miss.... and the cache holds for 10+ tuples never the desired one.... and every tuple will be read at least 4 times, if present..... Soution: 1. get rid of "count($context_cache_id)" by simply counting 2. raise MAX_MODINFO_CACHE_SIZE to at least 20+++ // i choose 250 which leads the apache to approx. 100MB which should be ok for the most..... 3. remove randomly tuples which have no cache hit by now.... ==== The performance plus of this two little changes is drastic! E.g. myMoodle Pages with more than 200 courses which took 70+ secs to be served are now served within 2 1/2 secs..... Please test & check !
          Hide
          Jens Eremie added a comment -

          The Limit of MAX_MODINFO_CACHE_SIZE=10may also cause: MDL-20061

          Show
          Jens Eremie added a comment - The Limit of MAX_MODINFO_CACHE_SIZE=10may also cause: MDL-20061
          Hide
          Wen Hao Chuang added a comment -

          Hi correct me if I'm wrong. I got the impression that this only happen when you upgrade from 1.9.4 to 1.9.5. If you start with a "fresh" installation (of 1.9.5 or 1.9.6), then you won't have this issue. Is this correct? Thanks!

          Show
          Wen Hao Chuang added a comment - Hi correct me if I'm wrong. I got the impression that this only happen when you upgrade from 1.9.4 to 1.9.5. If you start with a "fresh" installation (of 1.9.5 or 1.9.6), then you won't have this issue. Is this correct? Thanks!
          Hide
          Jens Eremie added a comment -

          I'm sure this problem is independent from from upgrade or vanilla install. In fact you'll need enough data, courses, useres, producing a sufficient amount of contexts which is necessary to "fill" the context cache and this is probably the only the case when updating an existing system.....

          Show
          Jens Eremie added a comment - I'm sure this problem is independent from from upgrade or vanilla install. In fact you'll need enough data, courses, useres, producing a sufficient amount of contexts which is necessary to "fill" the context cache and this is probably the only the case when updating an existing system.....
          Hide
          Jens Eremie added a comment -

          Problem also/still exists in m196+....

          Show
          Jens Eremie added a comment - Problem also/still exists in m196+....
          Hide
          David Mudrak added a comment -

          Jens, you are probably right. I am able to reproduce this PHP behaviour. I looks that count() is actually much slower than one might expect. See the attached demostration PHP file. There are two functions - cache_context() with our current implementation and the second one cache_context2() which just uses an external global counter to achieve the same functionality.

          Anybody interested, please compare the script running time on your machines. At my PHP 5.2.12-pl0-gentoo it is like 7 secs versus 0.05 sec

          Show
          David Mudrak added a comment - Jens, you are probably right. I am able to reproduce this PHP behaviour. I looks that count() is actually much slower than one might expect. See the attached demostration PHP file. There are two functions - cache_context() with our current implementation and the second one cache_context2() which just uses an external global counter to achieve the same functionality. Anybody interested, please compare the script running time on your machines. At my PHP 5.2.12-pl0-gentoo it is like 7 secs versus 0.05 sec
          Hide
          Sam Marshall added a comment -

          Confirmed here (assuming the test script is correct, I just ran it didn't actually look at it):

          4.8820159435272
          0.051156044006348

          not sure why we didn't reproduce this before. this is a duplicate of MDL-18768 probably?

          Show
          Sam Marshall added a comment - Confirmed here (assuming the test script is correct, I just ran it didn't actually look at it): 4.8820159435272 0.051156044006348 not sure why we didn't reproduce this before. this is a duplicate of MDL-18768 probably?
          Hide
          Penny Leach added a comment -

          on my tiny netbook:

          76.920532941818
          vs
          0.21796298027039

          WOW!!

          Show
          Penny Leach added a comment - on my tiny netbook: 76.920532941818 vs 0.21796298027039 WOW!!
          Hide
          Penny Leach added a comment -

          Does php version make any difference? I have 5.3.1-5.

          I am amazed that php doesn't handle this for us. Amazed.

          Show
          Penny Leach added a comment - Does php version make any difference? I have 5.3.1-5. I am amazed that php doesn't handle this for us. Amazed.
          Hide
          Mike Churchward added a comment -

          9.71382904053 secs. vs. 0.0749158859253 secs. on my laptop running Ubuntu.

          Show
          Mike Churchward added a comment - 9.71382904053 secs. vs. 0.0749158859253 secs. on my laptop running Ubuntu.
          Hide
          Dan Poltawski added a comment -

          8.9504210948944 vs 0.097811937332153
          php 5.3.1-5

          Show
          Dan Poltawski added a comment - 8.9504210948944 vs 0.097811937332153 php 5.3.1-5
          Hide
          Eloy Lafuente (stronk7) added a comment -

          Well,

          I'm not sure the test is a correct (real) one. As I commented in MDL-18768, before any change in this area we need to revise all uses of the cache, because some times we are accessing/modifying it directly, and that would break any static counter in the function (getting out of sync easily).

          So, also as commented in that bug, all we can do is to move the count() out from the loop. That will save time when we are deleting multiples, but won't help too much when deleting in 1-1 basis.

          In my tests:

          1) current: 6.02
          2) 100% variable alternative (global/static): 0.15
          3) count moved out from the loop: 3.8

          I think we can only achieve 3, as commented.

          Ciao

          Show
          Eloy Lafuente (stronk7) added a comment - Well, I'm not sure the test is a correct (real) one. As I commented in MDL-18768 , before any change in this area we need to revise all uses of the cache, because some times we are accessing/modifying it directly, and that would break any static counter in the function (getting out of sync easily). So, also as commented in that bug, all we can do is to move the count() out from the loop. That will save time when we are deleting multiples, but won't help too much when deleting in 1-1 basis. In my tests: 1) current: 6.02 2) 100% variable alternative (global/static): 0.15 3) count moved out from the loop: 3.8 I think we can only achieve 3, as commented. Ciao
          Hide
          Mike Churchward added a comment - - edited

          This really isn't a fair test of count() though... It is being called each time through the loop on an array that has been decreased by one element.

          nm... Didn't read it close enough...

          Show
          Mike Churchward added a comment - - edited This really isn't a fair test of count() though... It is being called each time through the loop on an array that has been decreased by one element. nm... Didn't read it close enough...
          Hide
          David Mudrak added a comment -

          Well, IMHO the test _is_pretty fair as it demonstrates how we really use the function at the moment - see

              while($context = rs_fetch_next_record($rs)) {
                  cache_context($context);
              }
          

          in lib/accesslib.php and the function definition there as well - I basically copied it into the demo script.

          What I do not think is that we can generalize this problem too much. I just wrote some simple loops having count() involved in the comparison and the figures are not that bad. I think we are just facing a malicious combination of several factors in this particular case because there are two arrays linking to the same objects, resetting the array too much often and unsetting the arrays' first elements too often...

          What if we just wrap the cache into a class with an internal static counter and all the cache manipulation would be done via public methods of this class only. There would be no direct access to the array so the counter can't get out of sync and we will not call count never. If my greps are correct, we do not use the global array elsewhere but in lib/accesslib.php so it should be quite easy refactoring fun...

          Show
          David Mudrak added a comment - Well, IMHO the test _is_pretty fair as it demonstrates how we really use the function at the moment - see while ($context = rs_fetch_next_record($rs)) { cache_context($context); } in lib/accesslib.php and the function definition there as well - I basically copied it into the demo script. What I do not think is that we can generalize this problem too much. I just wrote some simple loops having count() involved in the comparison and the figures are not that bad. I think we are just facing a malicious combination of several factors in this particular case because there are two arrays linking to the same objects, resetting the array too much often and unsetting the arrays' first elements too often... What if we just wrap the cache into a class with an internal static counter and all the cache manipulation would be done via public methods of this class only. There would be no direct access to the array so the counter can't get out of sync and we will not call count never. If my greps are correct, we do not use the global array elsewhere but in lib/accesslib.php so it should be quite easy refactoring fun...
          Hide
          Eloy Lafuente (stronk7) added a comment -

          Yeah, David, with better words that's 100% what I said (need to encapsulate cache and counter into bullet-proof class).

          But first we need to review all current uses of the cache to see if there is someplace where it's used in an strange way (like add XXX contexts, or array merge the cache with another... or whatever).

          Ciao

          Show
          Eloy Lafuente (stronk7) added a comment - Yeah, David, with better words that's 100% what I said (need to encapsulate cache and counter into bullet-proof class). But first we need to review all current uses of the cache to see if there is someplace where it's used in an strange way (like add XXX contexts, or array merge the cache with another... or whatever). Ciao
          Hide
          Martin Dougiamas added a comment -

          ....and so what happened? Is there still some simple changes we can make to improve performance here?

          Show
          Martin Dougiamas added a comment - ....and so what happened? Is there still some simple changes we can make to improve performance here?
          Hide
          Jens Eremie added a comment -

          Sorry Martin,

          due to the unchanged usage of "count($context_cache_id)" inside of "while" in the nearly unchanged functions:
          cache_context(), get_fast_modinfo() (compared m19 to 20) , I'm pretty sure the problem still present.... i may be wrong
          because i didn't do any further (real data) mesasurement on this.

          Compared:
          MOODLE_19_WEEKLY: $version = 2007101590.00;
          HEAD (m20): $version = 2010090501;

          lib/accesslib.php: cache_context()
          MOODLE_19_WEEKLY: accesslib.php,v 1.421.2.109
          HEAD (m20): -

          course/lib.php, lib/moodlelib.php : get_fast_modinfo() // function moved
          MOODLE_19_WEEKLY: lib.php,v 1.538.2.80
          HEAD (m20) - moodlelib.php: -

          Best regards, Jens

          Show
          Jens Eremie added a comment - Sorry Martin, due to the unchanged usage of "count($context_cache_id)" inside of "while" in the nearly unchanged functions: cache_context(), get_fast_modinfo() (compared m19 to 20) , I'm pretty sure the problem still present.... i may be wrong because i didn't do any further (real data) mesasurement on this. Compared: MOODLE_19_WEEKLY: $version = 2007101590.00; HEAD (m20): $version = 2010090501; lib/accesslib.php: cache_context() MOODLE_19_WEEKLY: accesslib.php,v 1.421.2.109 HEAD (m20): - course/lib.php, lib/moodlelib.php : get_fast_modinfo() // function moved MOODLE_19_WEEKLY: lib.php,v 1.538.2.80 HEAD (m20) - moodlelib.php: - Best regards, Jens
          Hide
          Sam Marshall added a comment -

          For info: Just during today, I'm looking at this bug as part of the OU's Moodle Monday exercise. If I figure it out by the end of today, I'll attach a patch here. (If it turns out to be too difficult, I'll just leave it - sorry.)

          note: I have some experience with this area of code, so hopefully can come up with working solution, presuming one is still needed. As agreed by others on this bug, I plan to check all current uses and, if possible, encapsulate the cache(s) into a class.

          Show
          Sam Marshall added a comment - For info: Just during today, I'm looking at this bug as part of the OU's Moodle Monday exercise. If I figure it out by the end of today, I'll attach a patch here. (If it turns out to be too difficult, I'll just leave it - sorry.) note: I have some experience with this area of code, so hopefully can come up with working solution, presuming one is still needed. As agreed by others on this bug, I plan to check all current uses and, if possible, encapsulate the cache(s) into a class.
          Hide
          Sam Marshall added a comment - - edited

          Ok, I've addressed this issue. Here's what I did:

          0. Neither $ACCESSLIB_PRIVATE nor the function cache_context( are referenced anywhere outside accesslib.

          1. Created new class context_cache to replace function cache_context.

          2. Reduce MAX_CONTEXT_CACHE_SIZE to 2500 as per original request.

          3. Added REDUCE_SIZE constant (1000) so that it does the reductions in large batch as per original request.

          4. Made it count approx cache size and use this rather than counting array.

          5. Added unit test for new class which tests 2,500 contexts in cache (passes).

          6. Verified existing unit tests (still pass except one unrelated one which was failing already).

          7. Verified basic usage by clicking around the website (still works).

          The patch is attached - accesslib.cache.patch (14 kB)

          I did NOT make any changes regarding modinfo. This is because:

          i. I don't think the 100MB assumption will always stand up; modinfo can get much bigger depending on what features you use and how large courses are.

          ii. 100MB might still cause it to run out of RAM.

          iii. The count() thing is surely not a performance issue here because it doesn't take long to count 10 elements. Similarly, obtaining modinfo is a quite heavyweight task anyway so messing with a cache is probably going to be a small proportion of this. (Compared to context where getting context is easy.)

          I have got Tim to review this patch. I think it's fairly important that it is committed and it is quality unit-tested code, so could I get a +1 to commit it please (or somebody else commit it).

          We're fairly sure it is the right approach so if nobody has time to review it, I'll commit in a few days.

          Show
          Sam Marshall added a comment - - edited Ok, I've addressed this issue. Here's what I did: 0. Neither $ACCESSLIB_PRIVATE nor the function cache_context( are referenced anywhere outside accesslib. 1. Created new class context_cache to replace function cache_context. 2. Reduce MAX_CONTEXT_CACHE_SIZE to 2500 as per original request. 3. Added REDUCE_SIZE constant (1000) so that it does the reductions in large batch as per original request. 4. Made it count approx cache size and use this rather than counting array. 5. Added unit test for new class which tests 2,500 contexts in cache (passes). 6. Verified existing unit tests (still pass except one unrelated one which was failing already). 7. Verified basic usage by clicking around the website (still works). The patch is attached - accesslib.cache.patch (14 kB) I did NOT make any changes regarding modinfo. This is because: i. I don't think the 100MB assumption will always stand up; modinfo can get much bigger depending on what features you use and how large courses are. ii. 100MB might still cause it to run out of RAM. iii. The count() thing is surely not a performance issue here because it doesn't take long to count 10 elements. Similarly, obtaining modinfo is a quite heavyweight task anyway so messing with a cache is probably going to be a small proportion of this. (Compared to context where getting context is easy.) I have got Tim to review this patch. I think it's fairly important that it is committed and it is quality unit-tested code, so could I get a +1 to commit it please (or somebody else commit it). We're fairly sure it is the right approach so if nobody has time to review it, I'll commit in a few days.
          Hide
          Sam Marshall added a comment -

          (fix minor bug in previous patch)

          Show
          Sam Marshall added a comment - (fix minor bug in previous patch)
          Hide
          Petr Škoda added a comment -

          I have committed sam's patch + renamed $ACCESSLIB_PRIVATE->contexcache and added some tweaks for potential drift problems

          keeping this open for now because some other parts could use similar caching, thanks sam!

          Show
          Petr Škoda added a comment - I have committed sam's patch + renamed $ACCESSLIB_PRIVATE->contexcache and added some tweaks for potential drift problems keeping this open for now because some other parts could use similar caching, thanks sam!
          Hide
          Eloy Lafuente (stronk7) added a comment -

          Yay, love it! Thanks Petr and Sam!

          Show
          Eloy Lafuente (stronk7) added a comment - Yay, love it! Thanks Petr and Sam!
          Hide
          David Mudrak added a comment -

          Very nice. Thanks Sam for the patch and Jens for the report and suggestions. Reassigning back to Petr.

          Show
          David Mudrak added a comment - Very nice. Thanks Sam for the patch and Jens for the report and suggestions. Reassigning back to Petr.
          Hide
          Petr Škoda added a comment -

          Hello, this can be probably closed, I did not find the exact version so I am going to use some older 1.9.x, thanks everybody!

          Show
          Petr Škoda added a comment - Hello, this can be probably closed, I did not find the exact version so I am going to use some older 1.9.x, thanks everybody!

            People

            • Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: