Moodle

Add support for TEXT indexes within XMLDB

Details

  • Type: New Feature New Feature
  • Status: Open Open
  • Priority: Critical Critical
  • Resolution: Unresolved
  • Affects Version/s: DEV backlog
  • Fix Version/s: DEV backlog
  • Component/s: Database SQL/XMLDB
  • Labels:
  • Database:
    Any

Description

There are some places where relational searches aren't good at all, specially where looking against large text sets (forums, glossaries...).

To fix this we must use documental (fulltext) indexes. They allow to perform "instant" searches against text.

So, For 1.9 something like this must be done:

1- Analyze and document each RDBMS documental index support.
2- Implement creation and drop of documental indexes from XMLDB generators (XMLDB schema, XML files and, finally, generators)
3- Extend XMLDB tests to check this functionality (creation, searching and drop).
4- Make text_index_exists() to work when looking for this type of indexes.
5- Add support for doc. indexes within the XMLDB Editor (parser, editor...)
6- Modify searchlib (search_generate_text_SQL() function) to be able to generate the correct SQL statements under each RDBMS.
7- Create a bunch of settings ($CFG->forum_usetextsearches....) to tell Moodle where to perform documental searches, code should be able to determinate when the underlying index exists.
8- Modify cron by calling one database_cron() function to be able to perform indexes maintenance (sync, optimisation...).

That's all, for now.

For Moodle 1.8 a basic documental indexes functionality has been implemented to work exclusively under MySQL and forum searches. It requires manual creation of the index before enabling it. Look to forum_search_posts() for some more basic info.

Activity

Hide
Martin Dougiamas added a comment -

Well ... remember we have the global search subproject which uses it's own index.

The Lucene engine is pretty powerful, allows consistent searching across all parts of Moodle, and most of the work is done (in moodle/search), however it's going to be tricky bringing roles into the picture (but possible I think).

Shouldn't we focus on that instead? Or should we change direction and use the native database for searching?

Show
Martin Dougiamas added a comment - Well ... remember we have the global search subproject which uses it's own index. The Lucene engine is pretty powerful, allows consistent searching across all parts of Moodle, and most of the work is done (in moodle/search), however it's going to be tricky bringing roles into the picture (but possible I think). Shouldn't we focus on that instead? Or should we change direction and use the native database for searching?
Hide
Eloy Lafuente (stronk7) added a comment -

+1 for lucene if possible to adapt if for roles (I hope it'll be cross-db compatible, at least it's in other languages where I've played a bit with).

I thought that (don't ask why) that such project was semi-dead and we could start supporting our own implementation, specially if we want to index uploaded files (perhaps this is the KEY question) and if lucene works well with non-spaced languages (like japanese).

Those (upload files + non-spaced languages) are the two major drawbacks of implementing our own indexes. Does Lucene uses its own index format (file based) or does it use DB specific indexes... uhm.

Too much questions! Let's halt this until 1.8 release and then examine a bit more about how lucene is working.... oki?

Show
Eloy Lafuente (stronk7) added a comment - +1 for lucene if possible to adapt if for roles (I hope it'll be cross-db compatible, at least it's in other languages where I've played a bit with). I thought that (don't ask why) that such project was semi-dead and we could start supporting our own implementation, specially if we want to index uploaded files (perhaps this is the KEY question) and if lucene works well with non-spaced languages (like japanese). Those (upload files + non-spaced languages) are the two major drawbacks of implementing our own indexes. Does Lucene uses its own index format (file based) or does it use DB specific indexes... uhm. Too much questions! Let's halt this until 1.8 release and then examine a bit more about how lucene is working.... oki?
Hide
Martin Dougiamas added a comment -

I've turned this on for moodle.org and WOW! That is heaps faster now! :-D

Times for some searches I tried have dropped from 11 seconds to about 1 second !

+1000 for better handling of this in 1.9.

Show
Martin Dougiamas added a comment - I've turned this on for moodle.org and WOW! That is heaps faster now! :-D Times for some searches I tried have dropped from 11 seconds to about 1 second ! +1000 for better handling of this in 1.9.
Hide
Martin Dougiamas added a comment -

More info about Lucene vs MySQL for the global search: http://jayant7k.blogspot.com/2006/05/mysql-fulltext-search-versus-lucene.html

Show
Martin Dougiamas added a comment - More info about Lucene vs MySQL for the global search: http://jayant7k.blogspot.com/2006/05/mysql-fulltext-search-versus-lucene.html
Hide
Eloy Lafuente (stronk7) added a comment -

Please,

note that current experimental implementation is far from perfect because, to mimic current behaviour and semantics of relational searching we are executing all the searches by appending the "*" wildchar to all the terms. It reduces real benefit a lot (compared with the execution without wildchars).

That's because I proposed an alternative search form + help file to be showed describing the new semantics used when the documental search is enabled. While the search with wildchars can be a good idea in sites with fewer results (more records are found), sites like moodle.org (+250.000 posts) will return enough information just searching by "exact" words, that is, in fact, the default search mode worldwide!

Anyway, I'll start documenting about the lucene-php-mysql thing after release... that seems to be the preferred way (if we don't want to apply such changes in semantics for now).

Show
Eloy Lafuente (stronk7) added a comment - Please, note that current experimental implementation is far from perfect because, to mimic current behaviour and semantics of relational searching we are executing all the searches by appending the "*" wildchar to all the terms. It reduces real benefit a lot (compared with the execution without wildchars). That's because I proposed an alternative search form + help file to be showed describing the new semantics used when the documental search is enabled. While the search with wildchars can be a good idea in sites with fewer results (more records are found), sites like moodle.org (+250.000 posts) will return enough information just searching by "exact" words, that is, in fact, the default search mode worldwide! Anyway, I'll start documenting about the lucene-php-mysql thing after release... that seems to be the preferred way (if we don't want to apply such changes in semantics for now).
Hide
Eloy Lafuente (stronk7) added a comment -

Quick note:

last week I've been playing with 300.000 posts and Zend Lucene. Some numbers:

  • got 1000+ indexed documents/minute.
  • most searches resolved in less than 1 sec, with 1-3 more secs needed to fetch and render one page of results
  • absolutely easy to customize for our needs (UTF-8, keeping numbers, versions... in the index) all this working with < 100 lines code.
  • easy to keep synchronised with one intemediate "lucene_documents" relational table (where all the modules will write the actions to be performed).
  • relatively easy to support groups/users/allowed forums in the documental searches.

but I've found some problems with current Zend Ludene implementation:

  • relies on old lucene 1.4 code.
  • index format has been dramatically improved for Lucene 2.0-2.1
  • tons of memory needed (> 120MB) to fully optimize the index (and this should be executed, at least, weekly, to avoid fragmentation and reduce number of segments).
  • after looking bugs and activity... not sure if it's

Uhm... just to point to another alternative... could we "reinvent the wheel" ? Drupal seems to have one interesting "hand-made" documental index for all its contents... perhaps it would be enough for our "token based searches"....

Ciao

Show
Eloy Lafuente (stronk7) added a comment - Quick note: last week I've been playing with 300.000 posts and Zend Lucene. Some numbers:
  • got 1000+ indexed documents/minute.
  • most searches resolved in less than 1 sec, with 1-3 more secs needed to fetch and render one page of results
  • absolutely easy to customize for our needs (UTF-8, keeping numbers, versions... in the index) all this working with < 100 lines code.
  • easy to keep synchronised with one intemediate "lucene_documents" relational table (where all the modules will write the actions to be performed).
  • relatively easy to support groups/users/allowed forums in the documental searches.
but I've found some problems with current Zend Ludene implementation:
  • relies on old lucene 1.4 code.
  • index format has been dramatically improved for Lucene 2.0-2.1
  • tons of memory needed (> 120MB) to fully optimize the index (and this should be executed, at least, weekly, to avoid fragmentation and reduce number of segments).
  • after looking bugs and activity... not sure if it's
Uhm... just to point to another alternative... could we "reinvent the wheel" ? Drupal seems to have one interesting "hand-made" documental index for all its contents... perhaps it would be enough for our "token based searches".... Ciao
Hide
Eloy Lafuente (stronk7) added a comment - - edited

Moving this to 2.0.

Will Lucene be the final solution...uhm... ciao

Show
Eloy Lafuente (stronk7) added a comment - - edited Moving this to 2.0. Will Lucene be the final solution...uhm... ciao
Hide
Eloy Lafuente (stronk7) added a comment -

Just updating this to comment that there are new downloads of the Zend Framework. And it also contains a very-much improved version of the Lucene engine (using the long awaited 2.1 index formats + big memory improvements).

Link: http://framework.zend.com/download

Ciao

Show
Eloy Lafuente (stronk7) added a comment - Just updating this to comment that there are new downloads of the Zend Framework. And it also contains a very-much improved version of the Lucene engine (using the long awaited 2.1 index formats + big memory improvements). Link: http://framework.zend.com/download Ciao
Hide
Valery Fremaux added a comment -

Thanks Eloy, I will try it soon in my dev instance.

Show
Valery Fremaux added a comment - Thanks Eloy, I will try it soon in my dev instance.
Hide
Eloy Lafuente (stronk7) added a comment -

Moving this to DEV backlog... someday we'll need to support some indexer properly providing the API for different parts of Moodle (not only global search), but that is something for the future...

Ciao

Show
Eloy Lafuente (stronk7) added a comment - Moving this to DEV backlog... someday we'll need to support some indexer properly providing the API for different parts of Moodle (not only global search), but that is something for the future... Ciao

People

Dates

  • Created:
    Updated: