Moodle
  1. Moodle
  2. MDL-40947

Segmentation fault caused by get_fast_modinfo()

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 2.5.1
    • Fix Version/s: 2.4.6, 2.5.2
    • Component/s: Course
    • Labels:
    • Testing Instructions:
      Hide

      1. Go to /my page, having logged in as the user who enrolled in more than 80 courses.
      2. Make sure there are no errors on the page.
      3. Reverse patch for MDL-34785 and do the same (those two issues address different problems but both affect /my/ page)

      Show
      1. Go to /my page, having logged in as the user who enrolled in more than 80 courses. 2. Make sure there are no errors on the page. 3. Reverse patch for MDL-34785 and do the same (those two issues address different problems but both affect /my/ page)
    • Affected Branches:
      MOODLE_25_STABLE
    • Fixed Branches:
      MOODLE_24_STABLE, MOODLE_25_STABLE
    • Pull Master Branch:
      MDL-40947-master
    • Rank:
      51839

      Description

      Some users reported 502 error on attempt to open /my page in Moodle. Apache error log only indicated:

      [Tue Jul 30 10:33:43 2013] [notice] child pid 8183 exit signal Segmentation fault (11)
      

      which was not useful enough. Seems like apache process just dies without reporting anything else in the log. The closer look indicated that problem appear with users who is enrolled to too many courses (above 80 in our case).

      The next logical step was to make Apache produce a core dump for analysis of the problem. As it was impossible to replicated issue on our test system, for this purpose we removed of the frontends from the live pool used it for testing. After spending some time playing with gdb and learning how it works, I moved further in analysis and got backtrace of events from the memory dump:

      [0x7f215e3e84b8] get_fast_modinfo() /usr/share/moodle-lu/lib/modinfolib.php:1399 
      [0x7f215e3e7c90] get_fast_modinfo(object[0x7f2160df0680], NULL) /usr/share/moodle-lu/lib/datalib.php:1424 
      [0x7f215e3e7478] get_all_instances_in_courses("chat", array(132)[0x7f216147c040]) /usr/share/moodle-lu/mod/chat/lib.php:1093 
      [0x7f215e3e7290] chat_print_overview(array(132)[0x7f216147c040], array(77)[0x7f2161478a10]) /usr/share/moodle-lu/blocks/course_overview/locallib.php:35 
      [0x7f215e3e67d0] block_course_overview_get_overviews(array(132)[0x7f216147c040]) /usr/share/moodle-lu/blocks/course_overview/block_course_overview.php:68 
      [0x7f215e3e6608] block_course_overview->get_content() /usr/share/moodle-lu/blocks/moodleblock.class.php:292 
      [0x7f215e3e5c48] block_base->formatted_contents(object[0x7f2160d86be8]) /usr/share/moodle-lu/blocks/moodleblock.class.php:238 
      [0x7f215e3e5258] block_base->get_content_for_output(object[0x7f2160d86be8]) /usr/share/moodle-lu/lib/blocklib.php:951 
      [0x7f215e3e4de0] block_manager->create_block_contents(array(1)[0x7f2160ad92d8], object[0x7f2160d86be8], "content") /usr/share/moodle-lu/lib/blocklib.php:1003 
      [0x7f215e3e4c58] block_manager->ensure_content_created("content", object[0x7f2160d86be8]) /usr/share/moodle-lu/lib/blocklib.php:315 
      [0x7f215e3e4560] block_manager->get_content_for_region("content", object[0x7f2160d86be8]) /usr/share/moodle-lu/lib/outputrenderers.php:1232 
      [0x7f215e3e30a8] core_renderer->blocks_for_region("content") /usr/share/moodle-lu/my/index.php:155
      

      The top line gave a clue what causes the problem:

       if (count($cache) > MAX_MODINFO_CACHE_SIZE) {
              reset($cache);
              $key = key($cache);
              unset($cache[$key]->instances);  // line 1399
              unset($cache[$key]->cms);
              unset($cache[$key]);
          }
      

      This has been introduced in MDL-25191 and had a good reason for being there, as we know that unsetting static variables in PHP is not trivial. They sort of remain set in the memory, just a reference being removed.

      The first thought, what Andrew suggested was to introduce gc_collect_cycles(); before doing unsets. This solved the problem resulting in script execution for particular user lasts 53 seconds:

      [Tue Jul 30 12:26:50 2013] [error] [client 148.88.22.10] PERF: /my/ time: 53.323946s memory_total: 28194552B (26.9MB) memory_growth: 27846176B (26.6MB) memory_peak: 31064032B (29.6MB) includecount: 448 contextswithfilters: 0 filterscreated: 0 textsfiltered: 0 stringsfiltered: 0 langcountgetstring: 908 includedyuimodules: 0 includedjsmodules: 3 db reads/writes: 20103/0 ticks: 5333 user: 2997 sys: 78 cuser: 0 csys: 0 serverload: 0.37 Session: 62.6KB Caches used (hits/misses/sets): core/databasemeta {** static persist **(18907/15/0) cachestore_memcached(6/9/9) } core/string {** static persist **(1028/30/0) cachestore_memcached(4/26/26) } core/plugintypes {** static persist **(463/3/0) cachestore_memcached(3/0/0) } core/pluginlist {** static persist **(2/4/0) cachestore_memcached(4/0/0) } core/yuimodules {cachestore_memcached(1/0/0) } core/htmlpurifier {cachestore_memcached(3/30/30) } core/groupdata {** static persist **(6/1/0) cachestore_memcached(0/1/1) }
      

      The second approach I tried was to remove what has been added in MDL-25191, leaving just single unset (unset($cache[$key]). Surprisingly, this worked as well taking less time:

      [Tue Jul 30 12:23:24 2013] [error] [client 148.88.22.10] PERF: /my/ time: 36.408919s memory_total: 29366872B (28MB) memory_growth: 29018496B (27.7MB) memory_peak: 37594232B (35.9MB) includecount: 415 contextswithfilters: 0 filterscreated: 0 textsfiltered: 0 stringsfiltered: 0 langcountgetstring: 908 includedyuimodules: 0 includedjsmodules: 3 db reads/writes: 20093/0 ticks: 3641 user: 1380 sys: 86 cuser: 0 csys: 0 serverload: 0.30 Session: 62.6KB Caches used (hits/misses/sets): core/databasemeta {** static persist **(18903/15/0) cachestore_memcached(12/3/3) } core/string {** static persist **(1028/30/0) cachestore_memcached(28/2/2) } core/plugintypes {** static persist **(435/3/0) cachestore_memcached(3/0/0) } core/pluginlist {** static persist **(2/4/0) cachestore_memcached(4/0/0) } core/yuimodules {cachestore_memcached(1/0/0) } core/htmlpurifier {cachestore_memcached(3/30/30) } core/groupdata {** static persist **(6/1/0) cachestore_memcached(0/1/1) }
      

      This seems like a problem related to particular system and Moodle configuration, which is difficult to replicate. We realised that if we reduce MAX_MODINFO_CACHE_SIZE (we had it set to 62) the problem also disappears. Having looked further into unsetting static variable, I found a solution that does both - killing the variable in memory and not causing problems in our installation:

           if (count($cache) > MAX_MODINFO_CACHE_SIZE) {
               reset($cache);
               $key = key($cache);
      -        unset($cache[$key]->instances);
      -        unset($cache[$key]->cms);
      +        $cache[$key] = '';
               unset($cache[$key]);
           }
      

      The performance remain the same as in the previous case where MDL-25191 changes are reverted. I suggest to implement it, though, given that problem is weird and not applicable to everyone, I am not 100% sure we need to change existing code.

      Comments are welcome.

        Issue Links

          Activity

          Hide
          Ruslan Kabalin added a comment -

          Added Sam and Andrew to watchers list. Sam, your comments are very welcome

          Show
          Ruslan Kabalin added a comment - Added Sam and Andrew to watchers list. Sam, your comments are very welcome
          Hide
          Andrew Nicols added a comment -

          If we're seeing segfaults, others could be too.

          Show
          Andrew Nicols added a comment - If we're seeing segfaults, others could be too.
          Hide
          Paul Vaughan added a comment -

          We've been experiencing just this problem, so far with just a handful of users but due to it being the Summer holidays here, Moodle's load is currently nothing compared to it's term-time load. I would have to assume this affects many more, but we're not seeing that yet because they're all on annual leave.

          A user would log in, would be automatically redirected to /my due to course settings, would wait approximately one minute then would receive a 'No data received' error (Chrome). Apache's logs would show at least one "[notice] child pid 3263 exit signal Segmentation fault (11)".

          Using Andrew's patch, above, I was able to stop Apache segfaulting. While I'm glad Moodle works, the /my page takes between 1m50s and 2m15s to load for that user, compared to <2s for myself and other users. This user is enrolled onto 48 courses.

          Pertinent details of our server set up:

          • Debian 7.1 'wheezy'
          • Apache 2.2.22
          • PHP 5.4.4-14+deb7u3

          I hope this is useful.

          Show
          Paul Vaughan added a comment - We've been experiencing just this problem, so far with just a handful of users but due to it being the Summer holidays here, Moodle's load is currently nothing compared to it's term-time load. I would have to assume this affects many more, but we're not seeing that yet because they're all on annual leave. A user would log in, would be automatically redirected to /my due to course settings, would wait approximately one minute then would receive a 'No data received' error (Chrome). Apache's logs would show at least one " [notice] child pid 3263 exit signal Segmentation fault (11)" . Using Andrew's patch, above, I was able to stop Apache segfaulting. While I'm glad Moodle works, the /my page takes between 1m50s and 2m15s to load for that user, compared to <2s for myself and other users. This user is enrolled onto 48 courses. Pertinent details of our server set up: Debian 7.1 'wheezy' Apache 2.2.22 PHP 5.4.4-14+deb7u3 I hope this is useful.
          Hide
          Andrew Nicols added a comment -

          FTR, we're also running Debian Wheezy. We're only seeing this for a small number of users who are enrolled on lots of courses.

          Show
          Andrew Nicols added a comment - FTR, we're also running Debian Wheezy. We're only seeing this for a small number of users who are enrolled on lots of courses.
          Hide
          Sam Marshall added a comment -

          I have asked our system team. We did see a few strange instances where Apache crashed for unexplained reasons, but I can't confirm whether they're the same as this one - they didn't manage to get stack traces.

          The change suggested here looks like voodoo but I would be OK with it happening, I think it does needs a comment to explain why it's working that way.

          Basically I'd be in favour of this change since it seems to fix this problem and I don't see why it should make anything worse.

          Show
          Sam Marshall added a comment - I have asked our system team. We did see a few strange instances where Apache crashed for unexplained reasons, but I can't confirm whether they're the same as this one - they didn't manage to get stack traces. The change suggested here looks like voodoo but I would be OK with it happening, I think it does needs a comment to explain why it's working that way. Basically I'd be in favour of this change since it seems to fix this problem and I don't see why it should make anything worse.
          Hide
          Sam Marshall added a comment -

          Note: We don't use My Moodle, but there are a few other cases where Moodle might load a bunch of courses, so it might have been that for us in a different context (admin stuff).

          Show
          Sam Marshall added a comment - Note: We don't use My Moodle, but there are a few other cases where Moodle might load a bunch of courses, so it might have been that for us in a different context (admin stuff).
          Hide
          Sam Marshall added a comment -

          Update: "We currently get a very high number of apache segfaults on our systems."

          The 'few' relates to actual crashes.

          Show
          Sam Marshall added a comment - Update: "We currently get a very high number of apache segfaults on our systems." The 'few' relates to actual crashes.
          Hide
          Sam Marshall added a comment -

          More update: they've traced it and it looks like our faults are caused by some error in the shutdown handler (may be OU-specific), so they're not necessarily related to this issue, unfortunately.

          Show
          Sam Marshall added a comment - More update: they've traced it and it looks like our faults are caused by some error in the shutdown handler (may be OU-specific), so they're not necessarily related to this issue, unfortunately.
          Hide
          Ruslan Kabalin added a comment -

          Paul, by Andrew's patch you mean Andrew's suggestion to use gc_collect_cycles()? Notice that the final solution is 31% faster than gc_collect_cycles approach, you may reduce run time even further.

          Show
          Ruslan Kabalin added a comment - Paul, by Andrew's patch you mean Andrew's suggestion to use gc_collect_cycles()? Notice that the final solution is 31% faster than gc_collect_cycles approach, you may reduce run time even further.
          Hide
          Paul Vaughan added a comment -

          Sorry, Ruslan, I meant your patch but mentioned Andrew as I'd been exchanging tweets at the same time and has his name in my head when writing the above comment. I did exactly what your diff suggests.

          Show
          Paul Vaughan added a comment - Sorry, Ruslan, I meant your patch but mentioned Andrew as I'd been exchanging tweets at the same time and has his name in my head when writing the above comment. I did exactly what your diff suggests.
          Hide
          Ruslan Kabalin added a comment -

          Thanks everyone for comments. I guess now is the time for peer review.

          Show
          Ruslan Kabalin added a comment - Thanks everyone for comments. I guess now is the time for peer review.
          Hide
          Paul Vaughan added a comment -

          For info, and following on from a previous comment: the user who had a /my page loading time of approximately two minutes when enrolled onto 48 courses, has unenrolled themselves from some unnecessary courses and has seen the page load time drop to around 15 seconds with only 22 course enrolments.

          Show
          Paul Vaughan added a comment - For info, and following on from a previous comment: the user who had a /my page loading time of approximately two minutes when enrolled onto 48 courses, has unenrolled themselves from some unnecessary courses and has seen the page load time drop to around 15 seconds with only 22 course enrolments.
          Hide
          Marina Glancy added a comment -

          awesome guys.
          I did a search on net and found this:
          http://stackoverflow.com/questions/584960/whats-better-at-freeing-memory-with-php-unset-or-var-null

          What we have in course_modinfo is the object with circular reference - course_modinfo has array of cm_info and they link back to parent. There is a reported bug in php https://bugs.php.net/bug.php?id=33595
          even though it seems to be fixed in PHP 5.3 it's still better to use assignment instead of unset()

          Show
          Marina Glancy added a comment - awesome guys. I did a search on net and found this: http://stackoverflow.com/questions/584960/whats-better-at-freeing-memory-with-php-unset-or-var-null What we have in course_modinfo is the object with circular reference - course_modinfo has array of cm_info and they link back to parent. There is a reported bug in php https://bugs.php.net/bug.php?id=33595 even though it seems to be fixed in PHP 5.3 it's still better to use assignment instead of unset()
          Hide
          Marina Glancy added a comment -

          TO INTEGRATORS: Please cherry-pick in 2.4 and 2.5 as well

          Show
          Marina Glancy added a comment - TO INTEGRATORS: Please cherry-pick in 2.4 and 2.5 as well
          Hide
          Marina Glancy added a comment -

          Paul re your page loading time, it may be resolved with MDL-34785 (also submitted for integration now)

          Show
          Marina Glancy added a comment - Paul re your page loading time, it may be resolved with MDL-34785 (also submitted for integration now)
          Hide
          Ruslan Kabalin added a comment -

          Thanks Marina for revision.

          Show
          Ruslan Kabalin added a comment - Thanks Marina for revision.
          Hide
          Sam Hemelryk added a comment -

          Thanks guys - this has been integrated now.

          Show
          Sam Hemelryk added a comment - Thanks guys - this has been integrated now.
          Hide
          Petr Škoda added a comment -

          works fine or me, thanks

          Show
          Petr Škoda added a comment - works fine or me, thanks
          Hide
          Switch Systems added a comment -

          I thought i'd just add, we're seeing memory corruption errors in deb7u3 that were fine in previous versions deb7u2 and below. Stumbled upon this post when googling for others who are experiencing this issue.

          Show
          Switch Systems added a comment - I thought i'd just add, we're seeing memory corruption errors in deb7u3 that were fine in previous versions deb7u2 and below. Stumbled upon this post when googling for others who are experiencing this issue.
          Hide
          Sam Hemelryk added a comment -

          Huzzah, your code made it into Moodle. Perhaps now things are ever so slightly better!

          "The ship can't take this much pressure. Sometimes it falls apart just sitting in the hangar."
          ~ Professor Farnsworth

          Show
          Sam Hemelryk added a comment - Huzzah, your code made it into Moodle. Perhaps now things are ever so slightly better! "The ship can't take this much pressure. Sometimes it falls apart just sitting in the hangar." ~ Professor Farnsworth
          Hide
          Dan Poltawski added a comment -

          Nice work Ruslan!

          Show
          Dan Poltawski added a comment - Nice work Ruslan!

            People

            • Votes:
              2 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: