Issue Details (XML | Word | Printable)

Key: MDL-13424
Type: Sub-task Sub-task
Status: Resolved Resolved
Resolution: Fixed
Priority: Minor Minor
Assignee: Mathieu Petit-Clair
Reporter: Matt Gibson
Votes: 0
Watchers: 2
Operations

Add/Edit UI Mockup to this issue
If you were logged in you would be able to see more operations.
Moodle
MDL-13404

Make a one-time upgrade path to remove old (non-official) tags that are not in use anymore

Created: 12/Feb/08 02:55 AM   Updated: 05/Jul/08 09:31 PM
Return to search
Component/s: Tags
Affects Version/s: 1.9
Fix Version/s: 1.9.2

File Attachments: 1. Text File tag_cleanup_faster_untested.patch (4 kB)

Issue Links:
Dependency
 

Participants: Eloy Lafuente (stronk7), Mathieu Petit-Clair, Matt Gibson and Petr Skoda
Security Level: None
QA Assignee: Eloy Lafuente (stronk7)
Resolved date: 05/Jul/08
Affected Branches: MOODLE_19_STABLE
Fixed Branches: MOODLE_19_STABLE


 Description  « Hide
My tag list is cluttered up with poorly spelled mistakes and experiments created by students whilst playing about with the interests section of their profile. Many have deleted said tags from their profile now, leaving those tags with no associated data or reason to exist, but they still hang about in the manage tags list.

A cron job to clean these out would be a good solution.

 All   Comments   Change History   Version Control      Sort Order: Ascending order - Click to sort in descending order
Mathieu Petit-Clair added a comment - 12/Feb/08 12:28 PM
This cron job is actually in the code, but there was a problem in the query, along with the fact that blog posts have an itemtype of "blog" while they are saved in table "post".

Mathieu Petit-Clair added a comment - 22/Feb/08 07:38 PM
The code now remove old tags as they become "unused", which will prevent this problem from happening.

I'm leaving this ticket open though, because the current code doesn't clean up what was left before.


Mathieu Petit-Clair added a comment - 30/Apr/08 03:50 PM
I just committed a new tag_cleanup() function that cleans up the tag database. This should fix both this issue, MDL-1544 and probably MDL-14569 too.

I didn't make it automatic now: this is potentially destructive. I tested this extensively, but I would like other people to review this patch and post comments, before calling this function from the cron.

To test:
1. have a look at the code : http://cvs.moodle.org/moodle/tag/lib.php?r1=1.43.2.33&r2=1.43.2.34 or in your local copy of a recent checkout of MOODLE_19_STABLE
2. make a backup of your database, add a line "tag_cleanup();" to function tag_cron() (around line 804 in Moodle 1.9), call the cron until you see "Executed tag cron".
3. have a look at your tags, see if everything is cleaner than before


Mathieu Petit-Clair added a comment - 30/Apr/08 04:55 PM
Eloy, can you have a look at the code and tell me if there's a horrible error in there? Otherwise, it's going in (and in the cron) before 1.9.1, hopefully.

Eloy Lafuente (stronk7) added a comment - 01/May/08 03:50 AM
Hi Mathieu,

the tag_cleanup() looks nearly perfect IMO, some comments:

1) The first $tags = get_records('tag'): I will delete (can become really BIG) it and perform record_exists() while iterating.
2) The TWO foreach loops. Both are based in some, potentially big too, get_records() call. I will change them to their get_recordset() alternatives, iterating with rs_fetch_next_record() and closing the record set at the end. In general, I think always something can grown over, say, 1000 (perhaps less) records... should be handled by recordsets instead of big arrays.
3) Agree seems to be the type of task not needed to be executed foreach cron invocation. I think you can put that call within the 20% section in cron.php (if ($random100 < 20) )
4) The second foreach loop. I think it can be easily transformed into one simple OUTER join returning all the tags not having instances, something like:

SELECT tg.id
FROM {$CFG->prefix}tag tg
LEFT JOIN{$CFG->prefix}tag_instance ti ON ti.tagid = tg.id
WHERE tg.tagtype = 'default'
AND ti.tagid IS NULL

or, alternatively:

SELECT tg.id
FROM {$CFG->prefix}tag tg
WHERE tg.tagtype = 'default'
AND NOT EXISTS (
SELECT 'x'
FROM {$CFG->prefix}tag_instance ti
WHERE ti.tagid = tg.id)

Not tested! But any of them will avoid the iteration and executing one query for each tag.

Hope this helps, ciao


Mathieu Petit-Clair added a comment - 01/May/08 02:19 PM
Muchas gracias Eloy! New diff: http://cvs.moodle.org/moodle/tag/lib.php?r1=1.43.2.34&r2=1.43.2.35 ... I think everything's there, except the cron activation. The function tag_cron() is already in the "20%" section of cron. Maybe I should add a comment there, to make this more obvious. Thanks for the query (good stuff!), worked perfectly (in MySQL at least).

Eloy Lafuente (stronk7) added a comment - 02/May/08 07:36 AM
Looks perfect now! Great job. Ciao

Petr Skoda added a comment - 06/May/08 05:48 PM
What is the status, can we resolve this as fixed?

Petr Skoda added a comment - 06/May/08 05:52 PM
oh, the performance of tag_cleanup() still seems less than optimal, looking at it...

Petr Skoda added a comment - 06/May/08 06:32 PM
Sending untested patch, the idea is that if no cleanup done the number of queries should be constant (does not depend on number of tags or instances).
Also we should always try to use LEFT JOIN instead of "AND NOT EXISTS (SELECT..)" - the time difference can be in orders of magnitudes when dealing with large tables.

I think that with this type of changes the tag cleanup could be done in each cron invocation.

I did not test the patch at all because my tag* tables are empty and I am no tag expert


Mathieu Petit-Clair added a comment - 06/May/08 06:42 PM
Thanks Petr, I'll do tests and commit this.

Eloy Lafuente (stronk7) added a comment - 06/May/08 07:25 PM
Uhm... when was reviewing this bug... initially I thought about to change it in a similar way than the proposed by Petr's patch, but finally decided to keep it that way for better readability. But agree that, in the long term, could be better to have it working as suggested. So +1 for that.

IMO, to keep it working in the 20% section is ok, because with current (solid) API is difficult to have those "orphan" instances anymore (in fact the original bug was about to make the clean as part of the upgrade process, only once, and not in cron).

And then this:

  • Also we should always try to use LEFT JOIN instead of "AND NOT EXISTS (SELECT..)" - the time difference can be in orders of magnitudes when dealing with large tables.

Don't agree at all!

I think they are 99.9% the same under all DBs, in fact NOT EXISTS will work better under some circumstances (not indexes in join column, for example). And it's far more readable.

Real differences happen with "AND NOT IN (SELECT)" because the IN clause is executed per-each record!

BUT NOT EXISTS is, as commented, perfect, IMO.

Ciao


Petr Skoda added a comment - 06/May/08 07:40 PM - edited
Thanks ELoy, I guess I hit the 0.1% accidentally with some weird subquery in stats and also mixed it up with "NOT IN" which was causing more trouble (or more likely I did some wrong measuring).

Do we have this "NOT IN" and "NOT EXISTS" in coding guide somewhere? Maybe we should add a new wiki page with tips for writing of efficient sql queries


Mathieu Petit-Clair added a comment - 07/May/08 10:31 AM
Moodle-round-the-clock did it again. Thanks Petr and Eloy, I think we can close this issue now. I don't think this cleanup should be called very often (if it was up to me, it would be called montly), as the tags are actually cleaned on deletion. Problem is, some modules might forget to do the right thing, hence this procedure.

It would be amazingly great to have a guide to writing efficent sql in Moodle, especially with all the JOINs that are in the code. I guess you two should be the ones writing it..


Eloy Lafuente (stronk7) added a comment - 08/May/08 07:41 AM
Closing... I'm really empty of ideas about what to write there. Perhaps we should start annotating them as they happen, hehe.

Petr Skoda added a comment - 05/Jul/08 09:31 PM
this can be closed, right?