Uploaded image for project: 'Moodle'
  1. Moodle
  2. MDL-24355

Computation of tag correlations is incorrect and very inefficient

    Details

    • Database:
      Any
    • Testing Instructions:
      Hide

      check that /admin/cron.php runs without error. Note that the tag correlation calculation doesn't run every time. To ensure it has been run you may wish to edit /lib/cronlib.php at line 197 to be:

      if (true) {

      1) Add 3 blog entries. List 2 tags with each entry. The same 2 for each entry.
      2) run the modified version of cron (see above).
      3) Go to the manage tags page. Site administration -> appearance -> manage tags. Check that your two tags are listed.
      4) Click on one of the tags you added. On the tags page your other tag should be listed in the "related tags" area.
      5) Click "Manage tags" to go back. This link can be hard to find. Its in the top left of the centre page area.
      6) Tick the box next to one of your tags and select "delete" from the "with selected tags" drop down. Click ok.
      7) Click your remaining tag and check that the related tags area is gone.
      8) Click through to one of your blog posts and check that the deleted tag is gone from the tags list for that blog post.

      This testing does need to be repeated for 1.9, 2 stable and 2.1 :|

      Show
      check that /admin/cron.php runs without error. Note that the tag correlation calculation doesn't run every time. To ensure it has been run you may wish to edit /lib/cronlib.php at line 197 to be: if (true) { 1) Add 3 blog entries. List 2 tags with each entry. The same 2 for each entry. 2) run the modified version of cron (see above). 3) Go to the manage tags page. Site administration -> appearance -> manage tags. Check that your two tags are listed. 4) Click on one of the tags you added. On the tags page your other tag should be listed in the "related tags" area. 5) Click "Manage tags" to go back. This link can be hard to find. Its in the top left of the centre page area. 6) Tick the box next to one of your tags and select "delete" from the "with selected tags" drop down. Click ok. 7) Click your remaining tag and check that the related tags area is gone. 8) Click through to one of your blog posts and check that the deleted tag is gone from the tags list for that blog post. This testing does need to be repeated for 1.9, 2 stable and 2.1 :|
    • Difficulty:
      Moderate
    • Affected Branches:
      MOODLE_19_STABLE, MOODLE_20_STABLE, MOODLE_21_STABLE
    • Fixed Branches:
      MOODLE_19_STABLE, MOODLE_20_STABLE
    • Pull Master Branch:
      wip-MDL-24355-master

      Description

      The tag_compute_correlations() function in "tag/lib.php" is not computing tags correctly. This is the query it's currently using to compute the correlated tags for a specified tag:

      $query = "SELECT tb.tagid ".
          "FROM {$CFG->prefix}tag_instance ta INNER JOIN {$CFG->prefix}tag_instance tb ON ta.itemid = tb.itemid ".
          "WHERE ta.tagid = {$tag->id} AND tb.tagid != {$tag->id} ".
          "GROUP BY tb.tagid ".
          "HAVING COUNT(*) > $min_correlation ".
          "ORDER BY COUNT(*) DESC";

      However, in the 'tag_instance' table, different items can have the same 'itemid' but different 'itemtype'. The query above does not account for that. To fix only this query, you could change it to this:

      $query = "SELECT tb.tagid ".
          "FROM {$CFG->prefix}tag_instance ta INNER JOIN {$CFG->prefix}tag_instance tb ON (ta.itemtype = tb.itemtype AND ta.itemid = tb.itemid) ".
          "WHERE ta.tagid = {$tag->id} AND tb.tagid != {$tag->id} ".
          "GROUP BY tb.tagid ".
          "HAVING COUNT(*) > $min_correlation ".
          "ORDER BY COUNT(*) DESC";

      The only difference is the join condition (the part after "ON" in the 2nd line).

      However, there's another problem with the tag_compute_correlations() function. It runs a minimum of 2*N+1 queries where N is the number of rows in the 'tag' table. 1 query to get a list of all the tags, 1 query per tag to get a list of correlated tags, and 1 query per tag to get the cached correlated tags from the 'tag_correlation' table.

      I work for Remote-Learner and many of our clients have at least hundreds of rows in that table, and some have thousands. The particular client that caused me to discover this currently has 9649 rows in their 'tag' table. So every time that function gets called for their site, it runs a minimum of 19299 queries. I know that it's only supposed to run 20% of the times that 'admin/cron.php' runs, but still, that's a lot of queries! Especially when the equivalent can be done with 1 single query. I've attached a rewrite of the tag_compute_correlations() function that accomplishes this. I've thoroughly tested it on a few different sites. Let me know if you have any questions about it.

      Sorry I didn't upload a patch file (if that's what you prefer). I thought it wouldn't really make sense as a patch since I just rewrote the whole function.

        Gliffy Diagrams

          Attachments

            Issue Links

              Activity

                People

                • Votes:
                  4 Vote for this issue
                  Watchers:
                  7 Start watching this issue

                  Dates

                  • Created:
                    Updated:
                    Resolved:
                    Fix Release Date:
                    1/Aug/11