Moodle

Large number of role_capabilities (overrides) affect accesslib performance

Details

  • Type: Bug Bug
  • Status: Closed Closed
  • Priority: Minor Minor
  • Resolution: Fixed
  • Affects Version/s: 1.9.2
  • Fix Version/s: 1.9.5
  • Component/s: Roles / Access
  • Labels:
    None

Description

Site with 7 standard roles, normal assignments and 45000 role_capabilities. Running MySQL:

Running this query from get_user_access_sitewide():

SELECT sctx.path,
ra.roleid,
ctx.path AS parentpath,
rco.capability,
rco.permission
FROM mdl_role_assignments ra
JOIN mdl_context ctx ON ra.contextid=ctx.id
JOIN mdl_context sctx ON (sctx.path LIKE CONCAT(ctx.path,'/%') )
JOIN mdl_role_capabilities rco ON (rco.roleid=ra.roleid AND rco.contextid=sctx.id)
WHERE ra.userid = 18797
AND sctx.contextlevel <= 50
ORDER BY sctx.depth, sctx.path, ra.roleid;

is 8.2 secs in my blazing fast dev server

Changing it to:

SELECT inn.path,
inn.roleid,
inn.parentpath,
rco.capability,
rco.permission
FROM mdl_role_capabilities rco
JOIN (SELECT sctx.id,
sctx.path,
sctx.depth,
ra.roleid,
ctx.path AS parentpath
FROM mdl_role_assignments ra
JOIN mdl_context ctx ON ra.contextid=ctx.id
JOIN mdl_context sctx ON (sctx.path LIKE CONCAT(ctx.path,'/%') )
WHERE ra.userid = 18797
AND sctx.contextlevel <= 50
) inn
ON (rco.roleid=inn.roleid AND rco.contextid=inn.id)
ORDER BY inn.depth, inn.path, inn.roleid;

only spends 0.9 secs

The cause to make that explicit subquery is that it seems that MySQL "delays" queries from tables being searched with LIKE/CONCAT to the end of the execution plan (to save some process). While that's good in general terms... that strategy kills completely our query. We need to get the target contexts first and then, their role_capabilities. And right now MySQL is doing exactly the opposite (with an high cost for sites with thousands of role_capabilities).

With the subquery... our preferred order is forced. Hence the better times.

Ciao

Issue Links

Activity

Hide
Eloy Lafuente (stronk7) added a comment -

Also, I've compared both queries under one site with less overrides (moodle.org), and differences are none. I got 0.1-0.3 execution times with both alternatives. So it shouldn't affect small (in term of overrides) sites.

Ciao

Show
Eloy Lafuente (stronk7) added a comment - Also, I've compared both queries under one site with less overrides (moodle.org), and differences are none. I got 0.1-0.3 execution times with both alternatives. So it shouldn't affect small (in term of overrides) sites. Ciao
Hide
Samuli Karevaara added a comment -

Is this going in as a patch?

Show
Samuli Karevaara added a comment - Is this going in as a patch?
Hide
Samuli Karevaara added a comment -

Further: I got slow times for both queries...

Show
Samuli Karevaara added a comment - Further: I got slow times for both queries...
Hide
Samuli Karevaara added a comment -

Running EXPLAIN for both queries: the first one hits "using temporary; using filesort;" but reports go through 19 x 338 lines on my test query. The same query, as version 2, gave "Impossible WHERE noticed after reading const tables" and says to go through 19 x 17343 rows.

Show
Samuli Karevaara added a comment - Running EXPLAIN for both queries: the first one hits "using temporary; using filesort;" but reports go through 19 x 338 lines on my test query. The same query, as version 2, gave "Impossible WHERE noticed after reading const tables" and says to go through 19 x 17343 rows.
Hide
Eloy Lafuente (stronk7) added a comment -

The "Impossible WHERE noticed after reading const tables" should happen if the "inn" view doesn't return any record, so no final join is performed, so cost is minimum.

In my explains, I get the alternative query using unique indexes /primary keys for the ra, ctx and sctx tables and they being executed before the final join (so less records have to be retrieved in order to perform the LIKE operation. The difference are more noticeable as role_capabilities grows and grows (i.e. overrides mainly).

Show
Eloy Lafuente (stronk7) added a comment - The "Impossible WHERE noticed after reading const tables" should happen if the "inn" view doesn't return any record, so no final join is performed, so cost is minimum. In my explains, I get the alternative query using unique indexes /primary keys for the ra, ctx and sctx tables and they being executed before the final join (so less records have to be retrieved in order to perform the LIKE operation. The difference are more noticeable as role_capabilities grows and grows (i.e. overrides mainly).
Hide
Samuli Karevaara added a comment -

Eloy, is it so that courses are currently never sub-contexts of courses? At least in 1.9.3+ meta courses are stored in another table, but not as sub-contexts. If so, the query can be made much much faster in MySQL by changing it to:

SELECT sctx.path, ra.roleid, rco.capability, rco.permission FROM mdl5_role_assignments ra
JOIN mdl5_context ctx ON ra.contextid=ctx.id
JOIN mdl5_context sctx ON (sctx.path LIKE CONCAT(ctx.path,'/%') )
JOIN mdl5_role_capabilities rco ON (rco.roleid=ra.roleid AND rco.contextid=sctx.id)
WHERE ra.userid = {a_variable} AND ctx.contextlevel <= 40 AND sctx.contextlevel <= 50;

I added the "AND ctx.contextlevel <= 40" there. Minor improvements can be made by ruling out other contexts also, but this was the main one. For example, if CONTEXT_USERID can not be a parent context, it can be ruled out from the query also.

(I posted this also in the forums, added here also because this is a more appropriate place for this suggestion.)

Show
Samuli Karevaara added a comment - Eloy, is it so that courses are currently never sub-contexts of courses? At least in 1.9.3+ meta courses are stored in another table, but not as sub-contexts. If so, the query can be made much much faster in MySQL by changing it to: SELECT sctx.path, ra.roleid, rco.capability, rco.permission FROM mdl5_role_assignments ra JOIN mdl5_context ctx ON ra.contextid=ctx.id JOIN mdl5_context sctx ON (sctx.path LIKE CONCAT(ctx.path,'/%') ) JOIN mdl5_role_capabilities rco ON (rco.roleid=ra.roleid AND rco.contextid=sctx.id) WHERE ra.userid = {a_variable} AND ctx.contextlevel <= 40 AND sctx.contextlevel <= 50; I added the "AND ctx.contextlevel <= 40" there. Minor improvements can be made by ruling out other contexts also, but this was the main one. For example, if CONTEXT_USERID can not be a parent context, it can be ruled out from the query also. (I posted this also in the forums, added here also because this is a more appropriate place for this suggestion.)
Hide
Samuli Karevaara added a comment -

In MySQL 5.0.67 on my Ubuntu dev box I get 4.4s on the first query and 5.2s for the second query on our test data (150K contexts, 77K role assignments, 34K role capabilities). Could the query cache cause the speedup? I'm running my dev box without the query cache.

Show
Samuli Karevaara added a comment - In MySQL 5.0.67 on my Ubuntu dev box I get 4.4s on the first query and 5.2s for the second query on our test data (150K contexts, 77K role assignments, 34K role capabilities). Could the query cache cause the speedup? I'm running my dev box without the query cache.
Hide
Samuli Karevaara added a comment -

My previous comment was referring to Eloys original two queries. The query with my additional WHERE clause still runs in less than 0.3s.

Show
Samuli Karevaara added a comment - My previous comment was referring to Eloys original two queries. The query with my additional WHERE clause still runs in less than 0.3s.
Hide
Eloy Lafuente (stronk7) added a comment -

Hi,

I've created one test, to be able to measure the get_user_access_sitewide() performance. It performs up to XX (configurable) calls to the function with different users, timing the sum of all the calls and calculating one hash over the whole results returned (to see if there is any difference).

Here there are some data of execution of the test in moodle.org, with the host being noticeable busy while the tests were ran:

First execution:

With added condition: Real users processed: 8263 Time spent in get_user_access_sitewide(): 11.52703332901 HASH of the whole operation: f3d57aad6fdb5da34d868cfb85aafca0

Real users processed: 8263 Time spent in get_user_access_sitewide(): 13.991519451141 HASH of the whole operation: f3d57aad6fdb5da34d868cfb85aafca0

Second execution:

With added condition: Real users processed: 8263 Time spent in get_user_access_sitewide(): 11.575879335403 HASH of the whole operation: f3d57aad6fdb5da34d868cfb85aafca0

Real users processed: 8263 Time spent in get_user_access_sitewide(): 12.882212638855 HASH of the whole operation: f3d57aad6fdb5da34d868cfb85aafca0

Third execution:

With added condition: Real users processed: 8263 Time spent in get_user_access_sitewide(): 12.480083227158 HASH of the whole operation: f3d57aad6fdb5da34d868cfb85aafca0

Real users processed: 8263 Time spent in get_user_access_sitewide(): 14.296385526657 HASH of the whole operation: f3d57aad6fdb5da34d868cfb85aafca0

Some conclusions:

1) The test was executed for 8263 users (from a configured maximum of 10000)
2) The hash is consistent in all the execution (hence, there aren't changes in the returned sets).
3) The execution using the extra condition proposed by Samuli are slightly faster than the current one.
4) moodle.org hasn't thousands of courses nor millions of activities, so it's possible that the number of overrides doesn't affect too much.

In any case, it would be great to execute the tests in another server, having bigger numbers of courses/enrolments/overrides. Ideally that test should be executed by:

  • optimize the database
  • shutdown the database
  • startup the database
  • run the test

In any case, I think Samuli suggestion has perfect sense, as far as the optimizer can try different techniques depending of the number of records. This cause moodle.org to not being affected (access by PK because of small number of contexts) but other servers suffering a lot (full scan over context table). So adding one extra condition can help to reduce that, more if it seems that it never hurts any info.

After all, the parent of one course only can be one category, so it haven't sense to look for parent contexts "below" the category level.

Ciao

Show
Eloy Lafuente (stronk7) added a comment - Hi, I've created one test, to be able to measure the get_user_access_sitewide() performance. It performs up to XX (configurable) calls to the function with different users, timing the sum of all the calls and calculating one hash over the whole results returned (to see if there is any difference). Here there are some data of execution of the test in moodle.org, with the host being noticeable busy while the tests were ran: First execution: With added condition: Real users processed: 8263 Time spent in get_user_access_sitewide(): 11.52703332901 HASH of the whole operation: f3d57aad6fdb5da34d868cfb85aafca0 Real users processed: 8263 Time spent in get_user_access_sitewide(): 13.991519451141 HASH of the whole operation: f3d57aad6fdb5da34d868cfb85aafca0 Second execution: With added condition: Real users processed: 8263 Time spent in get_user_access_sitewide(): 11.575879335403 HASH of the whole operation: f3d57aad6fdb5da34d868cfb85aafca0 Real users processed: 8263 Time spent in get_user_access_sitewide(): 12.882212638855 HASH of the whole operation: f3d57aad6fdb5da34d868cfb85aafca0 Third execution: With added condition: Real users processed: 8263 Time spent in get_user_access_sitewide(): 12.480083227158 HASH of the whole operation: f3d57aad6fdb5da34d868cfb85aafca0 Real users processed: 8263 Time spent in get_user_access_sitewide(): 14.296385526657 HASH of the whole operation: f3d57aad6fdb5da34d868cfb85aafca0 Some conclusions: 1) The test was executed for 8263 users (from a configured maximum of 10000) 2) The hash is consistent in all the execution (hence, there aren't changes in the returned sets). 3) The execution using the extra condition proposed by Samuli are slightly faster than the current one. 4) moodle.org hasn't thousands of courses nor millions of activities, so it's possible that the number of overrides doesn't affect too much. In any case, it would be great to execute the tests in another server, having bigger numbers of courses/enrolments/overrides. Ideally that test should be executed by:
  • optimize the database
  • shutdown the database
  • startup the database
  • run the test
In any case, I think Samuli suggestion has perfect sense, as far as the optimizer can try different techniques depending of the number of records. This cause moodle.org to not being affected (access by PK because of small number of contexts) but other servers suffering a lot (full scan over context table). So adding one extra condition can help to reduce that, more if it seems that it never hurts any info. After all, the parent of one course only can be one category, so it haven't sense to look for parent contexts "below" the category level. Ciao
Hide
Eloy Lafuente (stronk7) added a comment -

ping!

Show
Eloy Lafuente (stronk7) added a comment - ping!
Hide
Jonathan Cockrell added a comment -

I was also having this issue on 1.9.4+ (Build: 20090311) and have fixed it with the above code. In dev we upgraded from 1.5.4 to 1.9.4+, and the role_capabilities table was populated with 45214 rows (42884 of which are "mod/forum:%" entries). The role_assignments table was populated with 98872 rows.

The original query was taking up to 48 seconds to run on a virtual machine delegated a single 2GHz proc and 2GB of memory. Since this machine is not designed as a production server, I think it exacerbates the problem but also brings attention to it. In production this would probably still take 20-30 seconds.
Some of the log entries:

  1. Query_time: 8 Lock_time: 0 Rows_sent: 0 Rows_examined: 637580
  2. Query_time: 29 Lock_time: 0 Rows_sent: 0 Rows_examined: 2359046

The 8 second time occured on a new student (id in the 10000 range). The 29 second time occured on an older student (id in the 2000 range). Old classes are not purged from the system and are left for auditing purposed.

The following test results were gathered under no load and "ideal" (as specified by Eloy) conditions

Results of Eloy's Script
------------------------
//Old $sql, $users=10000, $step=10
Old code timed out around id 700.

//New $sql, $users=10000, $step=10
Real users processed: 1186
Time spent in get_user_access_sitewide(): 96.489293336868
Time spent in get_user_access_sitewide(): 100.22150158882
Time spent in get_user_access_sitewide(): 96.221356868744
HASH of the whole operation: f61fd5dcd1d82d6413e929d64daa200b
------------------------
//Old $sql, $users=1000, $step=100
Real users processed: 118
Time spent in get_user_access_sitewide(): 822.34694337845
HASH of the whole operation: 15dd8a6e63446885c7ed881bce5e0391

//New $sql, $users=1000, $step=100
Real users processed: 118
Time spent in get_user_access_sitewide(): 10.581704854965
HASH of the whole operation: 15dd8a6e63446885c7ed881bce5e0391
------------------------
//Old $sql, $users=100, $step=10
Real users processed: 98
Time spent in get_user_access_sitewide(): 68.873546361923
HASH of the whole operation: 590337dee470437f0b5bf22ee2ed3902

//New $sql, $users=100, $step=10
Real users processed: 98
Time spent in get_user_access_sitewide(): 5.2130753993988
HASH of the whole operation: 590337dee470437f0b5bf22ee2ed3902

Other Info
----------
mdl_context: 27047 rows
Ubuntu 8.04
MySQL 5.0.51

Show
Jonathan Cockrell added a comment - I was also having this issue on 1.9.4+ (Build: 20090311) and have fixed it with the above code. In dev we upgraded from 1.5.4 to 1.9.4+, and the role_capabilities table was populated with 45214 rows (42884 of which are "mod/forum:%" entries). The role_assignments table was populated with 98872 rows. The original query was taking up to 48 seconds to run on a virtual machine delegated a single 2GHz proc and 2GB of memory. Since this machine is not designed as a production server, I think it exacerbates the problem but also brings attention to it. In production this would probably still take 20-30 seconds. Some of the log entries:
  1. Query_time: 8 Lock_time: 0 Rows_sent: 0 Rows_examined: 637580
  2. Query_time: 29 Lock_time: 0 Rows_sent: 0 Rows_examined: 2359046
The 8 second time occured on a new student (id in the 10000 range). The 29 second time occured on an older student (id in the 2000 range). Old classes are not purged from the system and are left for auditing purposed. The following test results were gathered under no load and "ideal" (as specified by Eloy) conditions Results of Eloy's Script ------------------------ //Old $sql, $users=10000, $step=10 Old code timed out around id 700. //New $sql, $users=10000, $step=10 Real users processed: 1186 Time spent in get_user_access_sitewide(): 96.489293336868 Time spent in get_user_access_sitewide(): 100.22150158882 Time spent in get_user_access_sitewide(): 96.221356868744 HASH of the whole operation: f61fd5dcd1d82d6413e929d64daa200b ------------------------ //Old $sql, $users=1000, $step=100 Real users processed: 118 Time spent in get_user_access_sitewide(): 822.34694337845 HASH of the whole operation: 15dd8a6e63446885c7ed881bce5e0391 //New $sql, $users=1000, $step=100 Real users processed: 118 Time spent in get_user_access_sitewide(): 10.581704854965 HASH of the whole operation: 15dd8a6e63446885c7ed881bce5e0391 ------------------------ //Old $sql, $users=100, $step=10 Real users processed: 98 Time spent in get_user_access_sitewide(): 68.873546361923 HASH of the whole operation: 590337dee470437f0b5bf22ee2ed3902 //New $sql, $users=100, $step=10 Real users processed: 98 Time spent in get_user_access_sitewide(): 5.2130753993988 HASH of the whole operation: 590337dee470437f0b5bf22ee2ed3902 Other Info ---------- mdl_context: 27047 rows Ubuntu 8.04 MySQL 5.0.51
Hide
Eloy Lafuente (stronk7) added a comment -

re-ping!

Jonathan's (thanks!) result show the DIFFERENCE (yes, uppercase, ) provided by reducing the number of contexts with that simple change (one extra condition limiting the search of contexts). Please, vote +1 ASAP !

Ciao

Show
Eloy Lafuente (stronk7) added a comment - re-ping! Jonathan's (thanks!) result show the DIFFERENCE (yes, uppercase, ) provided by reducing the number of contexts with that simple change (one extra condition limiting the search of contexts). Please, vote +1 ASAP ! Ciao
Hide
Martin Dougiamas added a comment -

Holy crap why isn't this in stable? +1

Show
Martin Dougiamas added a comment - Holy crap why isn't this in stable? +1
Hide
Tim Hunt added a comment -

I agree that extra condition will not change the result of the query, and can only help performance, so it should be added, so +1 to commit.

However, I do not understand what the query id trying to do. Normally it is irrelevant whether the context of an override is above of below the context where the role is assigned.

Ah!, but this may be to do with optimising how much data is loaded from the database. I think I see.

Show
Tim Hunt added a comment - I agree that extra condition will not change the result of the query, and can only help performance, so it should be added, so +1 to commit. However, I do not understand what the query id trying to do. Normally it is irrelevant whether the context of an override is above of below the context where the role is assigned. Ah!, but this may be to do with optimising how much data is loaded from the database. I think I see.
Hide
Eloy Lafuente (stronk7) added a comment -

Thanks for feedback... going to apply it NOW!

Show
Eloy Lafuente (stronk7) added a comment - Thanks for feedback... going to apply it NOW!
Hide
Eloy Lafuente (stronk7) added a comment -

Change has been committed both to 19_STABLE and HEAD. Now we only search for child contexts in context being category level or above. That reduces the number of canditate contexts in a huge way.

Thanks Samuli and everybody else. Resolving as fixed...ciao

Show
Eloy Lafuente (stronk7) added a comment - Change has been committed both to 19_STABLE and HEAD. Now we only search for child contexts in context being category level or above. That reduces the number of canditate contexts in a huge way. Thanks Samuli and everybody else. Resolving as fixed...ciao
Hide
Petr Škoda (skodak) added a comment -

nice, thanks!

Show
Petr Škoda (skodak) added a comment - nice, thanks!
Hide
Eloy Lafuente (stronk7) added a comment -

Linking with MDL-19465, where a (different) attempt to improve speed is being researched

Show
Eloy Lafuente (stronk7) added a comment - Linking with MDL-19465, where a (different) attempt to improve speed is being researched

Dates

  • Created:
    Updated:
    Resolved: