Moodle
  1. Moodle
  2. MDL-30026

Prevent session lock waits from stalling server processes forever

    Details

    • Testing Instructions:
      Hide

      For developers:
      1/ run the functional DB tests under sqlsrv. Verify that no error related with "test_session_locks" happens.

      2/ no need to perform the tests under other DBs, nothing has changed there (last week we introduced and tested that).

      Show
      For developers: 1/ run the functional DB tests under sqlsrv. Verify that no error related with "test_session_locks" happens. 2/ no need to perform the tests under other DBs, nothing has changed there (last week we introduced and tested that).
    • Affected Branches:
      MOODLE_20_STABLE, MOODLE_21_STABLE, MOODLE_22_STABLE
    • Fixed Branches:
      MOODLE_22_STABLE
    • Pull from Repository:
    • Pull Master Branch:
      MDL-30026_sqlsrv
    • Rank:
      19568

      Description

      On many of our hosted sites we sometimes get periods of unavailability or out of memory conditions which trace back to a single user.

      Most often in these cases, the user will be performing some long-running process (e.g backup, users upload) and attempts to continue opening things from another browser tab. The new requests get backed up behind the long request waiting to grab a session lock. This can also be triggered by bug, especially infinite loops in code sections with increased time limit.

      This results in stuck server processes which may not be freed for hours, and can cause out-of-memory issues if more processes are started to handle other requests.

      Over time, it should be a goal to migrate these long processes out of web requests and into background processing but it would be nice to have some kind of catchall to prevent these and bugs from causing site unavailability.

      The suggestion is to, where possible, place a time limit on how long Moodle will wait to acquire a session lock before giving up. While this does cause an error to the user, they do not get hung browser windows forever, and the site remains available.

      Please see the linked patch for an attempt at doing this for Postgres.

        Issue Links

          Activity

          Show
          Tony Levi added a comment - https://github.com/tlevi/moodle/tree/mdl30026
          Hide
          Tony Levi added a comment -

          Obviously to do this in core will require additional work for the other databases, if they even support such a mechanism...

          Show
          Tony Levi added a comment - Obviously to do this in core will require additional work for the other databases, if they even support such a mechanism...
          Hide
          Petr Škoda added a comment - - edited

          Thanks a lot for the report and patch, I have cleaned up the session handler a bit more and fixed other drivers.

          To integrators: make sure it works in mssql, sqlsrv and oracle and fix it if not, I do not have a test server for these (my MSDN license expired, sorry). I suppose it is too big for backporting.

          ciao

          Show
          Petr Škoda added a comment - - edited Thanks a lot for the report and patch, I have cleaned up the session handler a bit more and fixed other drivers. To integrators: make sure it works in mssql, sqlsrv and oracle and fix it if not, I do not have a test server for these (my MSDN license expired, sorry). I suppose it is too big for backporting. ciao
          Hide
          Petr Škoda added a comment -

          oh, I had some problems with Chrome on OSX when testing this, it seemed to do some extra page access logging - weird. It worked 100% for me in Safari for both MySQL and PostgreSQL...

          Show
          Petr Škoda added a comment - oh, I had some problems with Chrome on OSX when testing this, it seemed to do some extra page access logging - weird. It worked 100% for me in Safari for both MySQL and PostgreSQL...
          Hide
          Tony Levi added a comment -

          Was that with your additional changes or just my patch?

          Show
          Tony Levi added a comment - Was that with your additional changes or just my patch?
          Hide
          Petr Škoda added a comment -

          I did not test your patch much because I discovered major issues there while reviewing it - such as different timeout handling in each driver, double errors caused by exceptions in session driver inits, incorrect error reporting, saving of invalid session content, etc. So I fixed all these first and then tested the new code. I always need some time away to see my own bugs, sorry for the trouble...

          Show
          Petr Škoda added a comment - I did not test your patch much because I discovered major issues there while reviewing it - such as different timeout handling in each driver, double errors caused by exceptions in session driver inits, incorrect error reporting, saving of invalid session content, etc. So I fixed all these first and then tested the new code. I always need some time away to see my own bugs, sorry for the trouble...
          Hide
          Tony Levi added a comment - - edited

          Hang on a minute, I only touched one driver; that doesn't even make any sense?

          edit:
          oh, I think you mean the rest of the existing code, OK
          does this mean this issue has to wait for 2.3 now?

          Show
          Tony Levi added a comment - - edited Hang on a minute, I only touched one driver; that doesn't even make any sense? edit: oh, I think you mean the rest of the existing code, OK does this mean this issue has to wait for 2.3 now?
          Hide
          Petr Škoda added a comment -

          You touched only one driver which was not enough to make it work properly, we must support all drivers and there were other bugs which I had to fix. The Google Chrome trouble is a separate issue, I just wanted to warn the testers that it may behave in a different way than other browsers.

          Making changes in session code is always risky, this may be backported to stable branches later, I think it is better to keep it in dev branch until more people test it on real sites with real browsers.

          In any case the integrators have to decide if it gets backported and when, luckily it is not my job to do that...

          Show
          Petr Škoda added a comment - You touched only one driver which was not enough to make it work properly, we must support all drivers and there were other bugs which I had to fix. The Google Chrome trouble is a separate issue, I just wanted to warn the testers that it may behave in a different way than other browsers. Making changes in session code is always risky, this may be backported to stable branches later, I think it is better to keep it in dev branch until more people test it on real sites with real browsers. In any case the integrators have to decide if it gets backported and when, luckily it is not my job to do that...
          Hide
          Eloy Lafuente (stronk7) added a comment -

          Some comments:

          Ciao

          Show
          Eloy Lafuente (stronk7) added a comment - Some comments: tiny: https://github.com/skodak/moodle/compare/master...w44_MDL-30026_m22_sessionconnect#L8R27 (say 3, implement 2) I'm trying some (1/2 seconds) unittests to check lowlevel get_session_lock() and release_session_lock() work as expected across all DBs, will add it once tested here. The Oracle impl seems a bit rough, I think I implemented proper error throwing... in the moodle_locks.get_lock() method, will recheck it. Ciao
          Hide
          Eloy Lafuente (stronk7) added a comment -

          Hi, I've added 3 commits on top of your branch, doing:

          • Added unit tests for get/release_session_lock() + the 2/3mins typo.
          • Fixed behavior for mssql, sadly there is one bug somewhere in freetds/php causing scalar results from stored procedures not being returned ok. The implemented workaround solves that.
          • Fixed behavior for oracle, the execute call returns false if there is any problem in the execution of get_lock(), so now it's checking for that instead of the timing way.

          Source: https://github.com/stronk7/moodle/compare/master...MDL-30026

          Right now it is working under mysql/pgsql/mssql/oci. Only sqlsrv requires testing and feedback here. So I'm going to send this to integration, asking specially for sqlsrv testing... thanks and ciao

          Show
          Eloy Lafuente (stronk7) added a comment - Hi, I've added 3 commits on top of your branch, doing: Added unit tests for get/release_session_lock() + the 2/3mins typo. Fixed behavior for mssql, sadly there is one bug somewhere in freetds/php causing scalar results from stored procedures not being returned ok. The implemented workaround solves that. Fixed behavior for oracle, the execute call returns false if there is any problem in the execution of get_lock(), so now it's checking for that instead of the timing way. Source: https://github.com/stronk7/moodle/compare/master...MDL-30026 Right now it is working under mysql/pgsql/mssql/oci. Only sqlsrv requires testing and feedback here. So I'm going to send this to integration, asking specially for sqlsrv testing... thanks and ciao
          Hide
          Eloy Lafuente (stronk7) added a comment -

          Integrated, thanks!

          Updated test instructions to run unit tests and, also, I've created MDL-30242 about to backport this to 21_STABLE if everything goes ok along the next weeks.

          Show
          Eloy Lafuente (stronk7) added a comment - Integrated, thanks! Updated test instructions to run unit tests and, also, I've created MDL-30242 about to backport this to 21_STABLE if everything goes ok along the next weeks.
          Hide
          Tony Levi added a comment -

          Thanks for the effort on this one.

          Show
          Tony Levi added a comment - Thanks for the effort on this one.
          Hide
          Eloy Lafuente (stronk7) added a comment -

          ty, Tony, you ignited the whole thing, thanks!

          Show
          Eloy Lafuente (stronk7) added a comment - ty, Tony, you ignited the whole thing, thanks!
          Hide
          Sam Hemelryk added a comment -

          Hi guys,
          I've just tested it now using sqlsrv and have had the following error occur consistently. (Using Server 2008)

          Fail: lib/dml/simpletest/testdml.php / ► dml_test / ► test_session_locks
          at [C:\www\integration\lib\dml\simpletest\testdml.php line 4164]

          line ... of ...
          line 4164 of \lib\dml\simpletest\testdml.php: call to UnitTestCase->assertTrue()
          line ... of ...
          line 53 of \admin\tool\unittest\ex_simple_test.php: call to TestSuite->run()
          line ... of ...
          line 113 of \admin\tool\unittest\dbtest.php: call to autogroup_test_coverage->run_with_external_coverage()

          Let me know if there is anything more I can help with otherwise I'll wait for this to come up for testing again.

          Cheers
          Sam

          Show
          Sam Hemelryk added a comment - Hi guys, I've just tested it now using sqlsrv and have had the following error occur consistently. (Using Server 2008) Fail: lib/dml/simpletest/testdml.php / ► dml_test / ► test_session_locks at [C:\www\integration\lib\dml\simpletest\testdml.php line 4164] line ... of ... line 4164 of \lib\dml\simpletest\testdml.php: call to UnitTestCase->assertTrue() line ... of ... line 53 of \admin\tool\unittest\ex_simple_test.php: call to TestSuite->run() line ... of ... line 113 of \admin\tool\unittest\dbtest.php: call to autogroup_test_coverage->run_with_external_coverage() Let me know if there is anything more I can help with otherwise I'll wait for this to come up for testing again. Cheers Sam
          Hide
          Eloy Lafuente (stronk7) added a comment -

          I'm getting this to me. Has been agreed by Sam and me to keep this integrated (working under 4DBs) and immediately after running upstream, we'll work in a fix for sqlsrv.

          Ciao

          Show
          Eloy Lafuente (stronk7) added a comment - I'm getting this to me. Has been agreed by Sam and me to keep this integrated (working under 4DBs) and immediately after running upstream, we'll work in a fix for sqlsrv. Ciao
          Hide
          Eloy Lafuente (stronk7) added a comment -

          As commented, passing this knowing there is some work to do in the sqlsrv side. I'll reopen this in some hours.

          Show
          Eloy Lafuente (stronk7) added a comment - As commented, passing this knowing there is some work to do in the sqlsrv side. I'll reopen this in some hours.
          Hide
          Eloy Lafuente (stronk7) added a comment -

          Yes, you got this finally upstream, just in time for Moodle 2.2beta. Congrats and thanks!

          Ciao

          Show
          Eloy Lafuente (stronk7) added a comment - Yes, you got this finally upstream, just in time for Moodle 2.2beta. Congrats and thanks! Ciao
          Hide
          Eloy Lafuente (stronk7) added a comment -

          Reopening this, as commented, to get the sqlsrv implementation fixed, now that the rest have landed upstream.

          Show
          Eloy Lafuente (stronk7) added a comment - Reopening this, as commented, to get the sqlsrv implementation fixed, now that the rest have landed upstream.
          Hide
          Eloy Lafuente (stronk7) added a comment -

          Sending to integration

          Show
          Eloy Lafuente (stronk7) added a comment - Sending to integration
          Hide
          Aparup Banerjee added a comment -

          This has been integrated into master only. Open to more testing!

          my tests:
          For some reason i couldn't get past test_concurent_transactions() so i skipped that for my setup.

          The test_session_locks() is working fine for me on sqlsrv on MSSQL server 2008.

          Show
          Aparup Banerjee added a comment - This has been integrated into master only. Open to more testing! my tests: For some reason i couldn't get past test_concurent_transactions() so i skipped that for my setup. The test_session_locks() is working fine for me on sqlsrv on MSSQL server 2008.
          Hide
          Aparup Banerjee added a comment -

          passing this as far as no session/locks issues seen from Eloy's test.

          note:
          (11:12:58) stronk7: 100% pass mysql and postgresql
          (11:13:13) stronk7: 7 problems in mssql (none related with sessions/locks)
          (11:13:46) stronk7: 1 problem in oracle (the famous concat one)

          (Sam, i saw no problems with sessions/locks too but your (3rd pov) verification would be useful here when you have time.)

          Show
          Aparup Banerjee added a comment - passing this as far as no session/locks issues seen from Eloy's test. note: (11:12:58) stronk7: 100% pass mysql and postgresql (11:13:13) stronk7: 7 problems in mssql (none related with sessions/locks) (11:13:46) stronk7: 1 problem in oracle (the famous concat one) (Sam, i saw no problems with sessions/locks too but your (3rd pov) verification would be useful here when you have time.)
          Hide
          Eloy Lafuente (stronk7) added a comment -

          And this has landed upstream, just on time for the upcoming new releases week. Thanks for it!

          Ciao

          Show
          Eloy Lafuente (stronk7) added a comment - And this has landed upstream, just on time for the upcoming new releases week. Thanks for it! Ciao
          Hide
          Martin Dougiamas added a comment - - edited

          Is session locking on all the time? Since 2.2 I'm seeing "Timed out while waiting for session lock." on all kinds of operations, after everything hangs for some minutes. And this is on a site with one user.

          Are these regressions? MDL-31870

          I'm going to try turning off database sessions (which are on by default).

          Show
          Martin Dougiamas added a comment - - edited Is session locking on all the time? Since 2.2 I'm seeing "Timed out while waiting for session lock." on all kinds of operations, after everything hangs for some minutes. And this is on a site with one user. Are these regressions? MDL-31870 I'm going to try turning off database sessions (which are on by default).
          Hide
          Petr Škoda added a comment - - edited

          That is the new expected behaviour, the session locks have timeout defined in SESSION_ACQUIRE_LOCK_TIMEOUT (by default 2 minutes). If you want longer timeout please define it in your config.php. Alternatively fix the script that executes for long time by closing session manually before starting the slow operations.

          Show
          Petr Škoda added a comment - - edited That is the new expected behaviour, the session locks have timeout defined in SESSION_ACQUIRE_LOCK_TIMEOUT (by default 2 minutes). If you want longer timeout please define it in your config.php. Alternatively fix the script that executes for long time by closing session manually before starting the slow operations.
          Hide
          Tony Levi added a comment -

          Yep, this is the rather unfortunate but intended behavior, otherwise one user brings the system down by holding refresh (fills all server processes).

          The only other way I think is to eventually lose the session locks - certainly not safe until everything uses transactions properly and so on. More like a "Moodle 3" feature.

          Show
          Tony Levi added a comment - Yep, this is the rather unfortunate but intended behavior, otherwise one user brings the system down by holding refresh (fills all server processes). The only other way I think is to eventually lose the session locks - certainly not safe until everything uses transactions properly and so on. More like a "Moodle 3" feature.
          Hide
          Petr Škoda added a comment -

          Transaction would not help at all here, the problem is we cache critical information in session, there is no way to remove the locks imo, sorry.

          Show
          Petr Škoda added a comment - Transaction would not help at all here, the problem is we cache critical information in session, there is no way to remove the locks imo, sorry.
          Hide
          Tony Levi added a comment -

          well yeah, there is that too; it should live in a cache, not session of course.

          Show
          Tony Levi added a comment - well yeah, there is that too; it should live in a cache, not session of course.
          Hide
          Martin Dougiamas added a comment - - edited

          From a user perspective (remember them?) there is something SERIOUSLY wrong if one little test site using MySQL is hanging completely just because one user is switching quickly between pages that use AJAX (such as the course page) on Moodle 2.2.

          This was never such a problem before. Cure worse than the disease?

          Perhaps database sessions should be default to being off.

          Show
          Martin Dougiamas added a comment - - edited From a user perspective (remember them?) there is something SERIOUSLY wrong if one little test site using MySQL is hanging completely just because one user is switching quickly between pages that use AJAX (such as the course page) on Moodle 2.2. This was never such a problem before. Cure worse than the disease? Perhaps database sessions should be default to being off.
          Hide
          Petr Škoda added a comment -

          When users starts clicking like crazy the http requests should be interrupted and the session locks released immediately. It would be great if you described the exact steps to reproduce this. Why do you think this is related to database sessions? Did you try file bases sessions? (it uses locking too) Did you try Postgresql to rule out mysql drivers problems?

          Show
          Petr Škoda added a comment - When users starts clicking like crazy the http requests should be interrupted and the session locks released immediately. It would be great if you described the exact steps to reproduce this. Why do you think this is related to database sessions? Did you try file bases sessions? (it uses locking too) Did you try Postgresql to rule out mysql drivers problems?
          Hide
          Martin Dougiamas added a comment -

          It was pretty random, but one I remember was that I was in the course page, and it was partially loaded and I clicked a link to go to an activity page. Bang. It does seem better now on file-based sessions, problem has not happened again. I don't have an option to use Postgres for this site (and nor would many users).

          Show
          Martin Dougiamas added a comment - It was pretty random, but one I remember was that I was in the course page, and it was partially loaded and I clicked a link to go to an activity page. Bang. It does seem better now on file-based sessions, problem has not happened again. I don't have an option to use Postgres for this site (and nor would many users).
          Hide
          Petr Škoda added a comment -

          Is your moodle using persistent connections? Is PHP cgi or apache module?

          Show
          Petr Škoda added a comment - Is your moodle using persistent connections? Is PHP cgi or apache module?
          Hide
          Petr Škoda added a comment -

          I will try to work around mysql problems by trying to create new full featured file-based session driver and memcached session driver (so far it looks that MUC abstraction would be unnecessarily complicated by the session locking logic, so it might be better to be separate).

          Show
          Petr Škoda added a comment - I will try to work around mysql problems by trying to create new full featured file-based session driver and memcached session driver (so far it looks that MUC abstraction would be unnecessarily complicated by the session locking logic, so it might be better to be separate).
          Hide
          Alberto Lorenzo Pulido added a comment -

          Hi all guys, using moodle 2.3.2+ and PostgreSQL here.

          If im logged in and put one of my fingers over the F5 for a few seconds, while im at "moodlesite/my" path. Then database is filled with conections and the request may be interrupted every "F5" pulse, but the database still making the queries necesary for close the session and then unlocking it. This causes a lot of apache threads waiting for the answer of the database and ends on filling memory on front end servers and connection slots on database server.

          Even with the acquire_lock_timeout at 5 seconds, connections and request are filled faster speed that released...

          Now scale this to a 300 concurrent and 21.000 potential users... and i think it becames on a big: God Luck my friend!

          ¿Its really necesary to lock the session just for browsing? I mean, i understand to do this on quiz attempts or whatever but just opening "my moodle" page or the course list...

          Show
          Alberto Lorenzo Pulido added a comment - Hi all guys, using moodle 2.3.2+ and PostgreSQL here. If im logged in and put one of my fingers over the F5 for a few seconds, while im at "moodlesite/my" path. Then database is filled with conections and the request may be interrupted every "F5" pulse, but the database still making the queries necesary for close the session and then unlocking it. This causes a lot of apache threads waiting for the answer of the database and ends on filling memory on front end servers and connection slots on database server. Even with the acquire_lock_timeout at 5 seconds, connections and request are filled faster speed that released... Now scale this to a 300 concurrent and 21.000 potential users... and i think it becames on a big: God Luck my friend! ¿Its really necesary to lock the session just for browsing? I mean, i understand to do this on quiz attempts or whatever but just opening "my moodle" page or the course list...
          Hide
          Tony Levi added a comment -

          That's exactly right. It also isn't simple to fix at all...

          The basic problem is that pretty much all write activity in Moodle does not use proper transactional methods and so is prone to terrible race conditions. This is reduced only by the session lock though some cases can still cause hideous effects visible to other users for example when moving categories it is possible for other users to temporarily see broken context information...

          There also is absolutely zero way to identify if a page will be performing this race-prone activity, especially at the time the session is locked so adding special mitigation to certain common pages is error prone.

          Moodle really needs to grow up with proper transactional semantics where appropriate to ultimately remove the session lock. Unfortunately an apparent requirement to support 120% broken mysql configurations will likely block this forever.

          Show
          Tony Levi added a comment - That's exactly right. It also isn't simple to fix at all... The basic problem is that pretty much all write activity in Moodle does not use proper transactional methods and so is prone to terrible race conditions. This is reduced only by the session lock though some cases can still cause hideous effects visible to other users for example when moving categories it is possible for other users to temporarily see broken context information... There also is absolutely zero way to identify if a page will be performing this race-prone activity, especially at the time the session is locked so adding special mitigation to certain common pages is error prone. Moodle really needs to grow up with proper transactional semantics where appropriate to ultimately remove the session lock. Unfortunately an apparent requirement to support 120% broken mysql configurations will likely block this forever.
          Hide
          James McLean added a comment - - edited

          I have managed to replicate the issue Martin mentioned above reliably.

          Moodle 2.6.1+
          Chrome Version 33.0.1750.154 m
          Firefox 28.0
          IE (9, old version intentionally installed)

          Replicating:
          1) Access course with large amount/number of files in the course, and preferably a huge number of files in the Moodle instance in other courses.
          2) Edit a section (type of content is not a problem)
          3) Select 'Insert Image' -> 'Find and Upload an Image'
          4) Select 'Server Files'. Display files as Tree (IMPORTANT)
          5) Before 'server files' loads, select another item such as 'Recent Files' and then 'Upload a File' - clicking either.
          6) Depending on the session time out, this will throw the "Session Timeout" error when the timeout is reached.
          7) The window is then "frozen", with no ability to select anything on the page - such as close the upload window.
          8) Reloading the window then also returns the "Session Timeout" error and the system is not usable until the window is closed and session cookie is lost (as happens when the window closes).

          I have video of the issue being replicated if it will help the resolution process.

          Show
          James McLean added a comment - - edited I have managed to replicate the issue Martin mentioned above reliably. Moodle 2.6.1+ Chrome Version 33.0.1750.154 m Firefox 28.0 IE (9, old version intentionally installed) Replicating: 1) Access course with large amount/number of files in the course, and preferably a huge number of files in the Moodle instance in other courses. 2) Edit a section (type of content is not a problem) 3) Select 'Insert Image' -> 'Find and Upload an Image' 4) Select 'Server Files'. Display files as Tree (IMPORTANT) 5) Before 'server files' loads, select another item such as 'Recent Files' and then 'Upload a File' - clicking either. 6) Depending on the session time out, this will throw the "Session Timeout" error when the timeout is reached. 7) The window is then "frozen", with no ability to select anything on the page - such as close the upload window. 8) Reloading the window then also returns the "Session Timeout" error and the system is not usable until the window is closed and session cookie is lost (as happens when the window closes). I have video of the issue being replicated if it will help the resolution process.
          Hide
          Aparup Banerjee added a comment -

          Hi James,
          imo, you may have stumbled upon one of many ways to replicate the issue. Can i suggest the creation of a new issue and within that you copy paste the replication steps. It may turn out to be one of a number of fixes needed to grow moodle towards proper transactions, this way we could create a meta for a better overview plan.

          Show
          Aparup Banerjee added a comment - Hi James, imo, you may have stumbled upon one of many ways to replicate the issue. Can i suggest the creation of a new issue and within that you copy paste the replication steps. It may turn out to be one of a number of fixes needed to grow moodle towards proper transactions, this way we could create a meta for a better overview plan.

            People

            • Votes:
              3 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: