This is a natural follow on from
To set the scene this is an the symptoms we are seeing in a large site running on postgres and tracing it back to the root cause
1) get_role_definitions_uncached takes around 500ms as a best case. When things are bad we've seen it spike up to double digit seconds. There are 33000 records for some role definitions.
2) We are seeing the _uncached function regularly being run by different seemingly unrelated and read-only requests for random requests. Often it is various ajax calls but this is purely because they are is called very often and are just the victims here, eg:
3) When this happens, we see a cluster of requests together racing to rebuild the roledef cache and then it comes good again. All these requests which would normally be ~100ms are now ~600ms
4) The roledef cache item is being purged which in this particular case was traced back to an academic who was editing their course. Over the course of a couple hours they made many small changes. Lots of ad hoc tasks created 'course_delete_modules' which eventually calls accesslib_reset_role_cache(). But there will be a bunch of code paths to that cache reset.
5) In debugging this I also found that each of the roledefs cache items were roughly 1mb when uncompressed, so every request looks at 1 but often 2 roles and so it is loading a crazy amount of io from muc / redis.
So I think just about every step in the chain could be improved here:
a) get_role_definitions_uncached the sql could possibly be further tuned / indexed
b) In various places instead of simply resetting the roledef cache we should rebuild it once. Especially in the places like the ad hoc tasks. This solves 3)
c) But the real root cause here seems to be the design the of this cache key which holds information on capabilities and access across the entire system. So when any change at all happens anywhere then the whole thing is throw away. I'm thinking a better approach might be to split this into smaller cache items, eg 1 for the main trunk of the tree starting at the site context and then smaller cache items for each course. When we need it we would load both and graft just the parts of the context tree that we need. When we invalidate something we can just purge the cache for a single course. This solves 4) an 5)
d) To support c) get_role_definitions_uncached can have a second optional $course or $context argument so we can query just the part of the sub tree we need instead of the whole site. Maybe passing the context path is the best way to filter.
e) Possibly making the role cache localizable. Maybe this is redundant if c) & d) are implemented.