Bug
Resolution: Fixed
Minor
4.1.6, 4.2.3, 4.3, 4.4
Summary
The way the task manager's ad-hoc task queue is set up, combined with the way PostgreSQL advisory lock usage is implemented, is guaranteed to make the entire cron script fail and never complete any task past that point (without manual intervention). This failure will occur at predetermined "times" in the future due to a currently unavoidable partial hash collision.
Explanation
- The task manager's manager::get_next_adhoc_task method loops over every ad-hoc task in the queue and acquires two locks in every iteration: first a lock for the current task itself, with the resource name set to "adhoc_<id>" (<id> being the primary key of the record in the task_adhoc table); then, towards the end of the iteration, the so-called global cron lock (resource name "core_cron"), while the first lock is still held.
- The postgres_lock_factory::get_lock method called by the task manager attempts to acquire an advisory lock via the Postgres function pg_try_advisory_lock, passing two 32-bit keys: the first appears to serve only to distinguish different Moodle instances/table prefixes and/or to separate these locks from session locks; the second is the actual key identifying the resource the lock is for.
- That key is generated by the postgres_lock_factory::get_index_from_key method, which takes the resource name (prefixed with "cron_"), hashes that string with SHA1, keeps only the first 7 hex digits of the digest and converts them to a decimal integer.
- Since the global cron lock is always acquired while exactly one ad-hoc task lock is still held, and since the hash of its name (and thus the key) is always the same, we can easily calculate the probability of a random resource name hashing to the same first 7 hex digits. 7 hex digits are equivalent to 28 bits, so there are exactly 2^28, i.e. approximately 268 million, possible 28-bit keys.
- The SHA1 digest of the string "cron_core_cron" happens to start with the digits 3aefa76, and so does the digest of the string "cron_adhoc_82795762". This means that as soon as about 83 million ad-hoc tasks have been created, a partial collision occurs that prevents the global cron lock from being acquired, causing a locktimeout exception that interrupts the entire cron script (so no progress is made on the remaining ad-hoc tasks).
- Since the cron script just picks up where it left off and the "offending" ad-hoc task is still enqueued, the same error occurs immediately as soon as cron_run_adhoc_tasks is called again. Therefore no ad-hoc task after that point will ever be executed by cron.
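The key derivation described above can be sketched in a few lines of Python (Moodle's actual implementation lives in PHP's postgres_lock_factory; this is only an illustration of the algorithm as laid out here):

```python
import hashlib

def get_index_from_key(resource: str) -> int:
    """Sketch of postgres_lock_factory::get_index_from_key: SHA1 the
    (already "cron_"-prefixed) resource name, keep only the first 7 hex
    digits of the digest and read them as an integer (28 bits)."""
    return int(hashlib.sha1(resource.encode()).hexdigest()[:7], 16)

# The global cron lock and the offending ad-hoc task lock collide:
print(f"{get_index_from_key('cron_core_cron'):07x}")       # 3aefa76
print(f"{get_index_from_key('cron_adhoc_82795762'):07x}")  # 3aefa76
```

Both resource names map to the same 28-bit key, which is exactly the partial collision that deadlocks the cron.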
Implications
This is obviously a sub-optimal situation once it occurs. 83 million ad-hoc tasks may seem like a lot, but a sufficiently large Moodle instance can get there in a matter of years, and it has in fact happened to us recently. By the time we - bonczek (@bonczek), mgauk (@mgauk) and I (@daniil-berg) - found this bug and forced our way out of it, about 150,000 unfinished ad-hoc tasks had accumulated.
Assuming a generous 100,000 new ad-hoc tasks on an average day, it would take a newly installed (admittedly very large) instance a little over 27 months to reach the first collision. With a more modest 50,000 ad-hoc tasks per day, it would take about 4.5 years. You get the idea. It is important to stress that I am not talking about probabilities here. The collision is certain to happen at specific, predetermined IDs; it is just a matter of when.
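The timelines above follow from simple division by the (hypothetical) daily task rates:

```python
COLLISION_ID = 82_795_762  # first colliding ad-hoc task ID

days_at_100k = COLLISION_ID / 100_000          # ~828 days until collision
months_at_100k = days_at_100k / 30.44          # ~27 months
years_at_50k = COLLISION_ID / 50_000 / 365.25  # ~4.5 years
print(round(months_at_100k, 1), round(years_at_50k, 1))
```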
Our (very inelegant) quick fix was to simply delete the ad-hoc task with the ID 82795762 (which happened to be a user notification about a forum post) from the DB, to unblock the cron and allow it to continue processing ad-hoc tasks.
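For reference, that quick fix boils down to a single statement along these lines (adjust prefix/schema and ID to your installation, and obviously first make sure the task in question is actually expendable):

```sql
DELETE FROM mdl_task_adhoc WHERE id = 82795762;
```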
What's nice is that the deterministic nature of the problem allows us to easily calculate the next occurrences of such partial hash collisions. Tapping into my inner Bitcoin miner, I calculated that this will happen again at ID 89080789 (for us probably in about 5 weeks) and then again at ID 162524415 (likely in a few years). I am honestly surprised this has not come up before; perhaps we are just the lucky first.
To reproduce
Not that it is really necessary, since the logic of the algorithms involved (as laid out above) is pretty clear, but if you want to:
- Install any Moodle 4 version with a PostgreSQL database.
- Force the ad-hoc task table to assign the colliding ID as the next primary key by setting the sequence value. The following should work: ALTER SEQUENCE mdl_task_adhoc_id_seq RESTART WITH 82795762; (Obviously adjust schema/prefix as needed.)
- Enqueue any ad-hoc task (e.g. post in a forum with a subscribed user).
- Run admin/cli/cron.php.
Possible solutions
The first thing that comes to mind is using a single 64-bit key with pg_try_advisory_lock instead of two 32-bit keys. Instead of wasting the first 32 bits on a "namespace" that merely distinguishes these locks from session locks or between multiple instances of Moodle, concatenate the entire set of strings (DB name, prefix, lock type and resource), hash that, and use the first 16 hex digits of the digest. That gives us 2^64 possible keys, on the order of 10^19 or over 10 billion billion. I would venture to claim that the probability of a collision between a random session lock and a task lock would be negligible. Also, the first ad-hoc task ID causing a collision with the first 16 hex digits of the SHA1 digest of "cron_core_cron" ("3aefa76ca66760eb") is greater than 18.5 billion. (I stopped trying to brute-force a collision after that number, so I don't know the actual ID.) Considering that this is a couple orders of magnitude greater than the current first collision ID, this seems unlikely to become an issue within a couple hundred years, even for the most ambitious Moodle installation.
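A sketch of what that single-key derivation could look like, in Python for illustration (the real fix would live in the PHP lock factory; the function name and the exact concatenation scheme are my assumptions, not a proposed patch). One subtlety: the one-argument form of pg_try_advisory_lock takes a signed bigint, so the 64-bit value has to be mapped into the signed range:

```python
import hashlib

def lock_key_64(dbname: str, prefix: str, locktype: str, resource: str) -> int:
    """Derive one signed 64-bit advisory lock key from all distinguishing
    strings at once, instead of spending 32 bits on a namespace key."""
    material = "/".join((dbname, prefix, locktype, resource))
    unsigned = int(hashlib.sha1(material.encode()).hexdigest()[:16], 16)
    # Map into the signed 64-bit range expected by pg_try_advisory_lock(bigint):
    return unsigned - 2**64 if unsigned >= 2**63 else unsigned
```

The derivation is deterministic per (instance, lock type, resource) tuple, so the same resource always maps to the same key, just as today, but with 64 instead of 28 bits of collision resistance.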
Maybe a totally different approach is possible, modifying the logic by which tasks acquire locks so as to decouple the lock for the task itself from the global cron lock. I have not looked into the code enough, but as far as I can tell, while the current master branch has somewhat different logic and get_next_adhoc_task was refactored (relative to our 4.1 version), it still seems susceptible to the same problem, just via the manager::set_locks method.
- has a non-specific relationship to
-
MDL-67667 Deprecate is_blocking() for adhoc and scheduled tasks in Task API
- Closed