Moodle / MDL-68690

Search: Allow Solr to add documents in batches

Details

    • MOODLE_310_STABLE
    • MOODLE_310_STABLE
    • MDL-68690-master
      • In order to test this change, you will need a Moodle site that is configured using the Solr search engine.
      1. Go to a forum and create 3 new forum posts. In each post, type whatever text you like, but include the special word MARIEZWOOP.
      2. Run the search indexing task (php admin/cli/scheduled_task.php --execute='\core\task\search_index_task'), then check the result under Server > Tasks > Scheduled tasks using the "view logs" option. Look specifically at the part of the log where it indexes the 'Forum - posts' area.
        • EXPECTED: You should see something like the text below, containing the note '(1 batch)'.
      3. Using the global search icon in the Moodle header, search for 'MARIEZWOOP'.
        • EXPECTED: You should get all 3 results, proving they were indexed correctly.

      Processing area: Forum - posts
        Processed 3 records containing 3 documents (1 batch), in 0.1 seconds.

    Description

      Search reindexing with Solr is slow when there are a large number of documents. It can take on the order of weeks, which is annoying if you want to (for example) upgrade to a new Solr version.

      The current engine code adds documents one at a time. It is possible to add multiple documents in one request, which would at least save on network round trips.
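
As a rough illustration of the round-trip saving (not Moodle's actual engine code, which is PHP), the sketch below serialises several documents into one JSON array, which is a body format Solr's update handler accepts for adding multiple documents in a single request. The field names and document ids are hypothetical, and no network call is made.

```python
import json


def build_update_body(docs):
    """Serialise a list of documents as one JSON array for a single update request."""
    return json.dumps(docs)


# Hypothetical documents; real Moodle search documents carry different fields.
posts = [
    {"id": "forum-post-1", "content": "first post MARIEZWOOP"},
    {"id": "forum-post-2", "content": "second post MARIEZWOOP"},
    {"id": "forum-post-3", "content": "third post MARIEZWOOP"},
]

body = build_update_body(posts)

# Adding one document per request costs one round trip per document;
# sending the array above costs a single round trip.
requests_one_at_a_time = len(posts)
requests_batched = 1
```

The saving scales with network latency, which is consistent with the larger gain reported for the remote server.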

      In my testing, this change improves indexing performance:

      • By 80% when using a remote (cloud hosted) server running Solr 6.6.2, indexing small text entries.
      • By 30% when using a local server running Solr 8.5.1, for the same condition.
        (See attached performance.png for full test results.)

      The performance increase is expected to be larger for a remote Solr instance than for a locally hosted one, and larger when indexing mainly small text entries such as forum posts. (This change doesn't affect how files are indexed, and if you have a large number of files then those may be the most significant part of indexing.)

      I don't have any real-life test results, but the improvement on a cloud hosted server with a mixture of small text entries and files could plausibly be around 50%. That is significant when the time to reindex an entire site (e.g. for a search engine update) can sometimes be measured in weeks.

      One concern could be the potential size of batch updates. I have implemented a limit of 100 documents per batch. A document is only batched if its content is at most 1MB of text; larger documents are sent individually. There is a unit test to make sure it works with the worst allowed case (100 x 1MB).
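
The batching rule described above could be sketched roughly as follows. This is a language-neutral illustration in Python, not the patch itself: the function and constant names are hypothetical, the 1MB limit is assumed to be measured in UTF-8 bytes, and sending an oversized document immediately (ahead of any partly filled batch) is also an assumption.

```python
MAX_BATCH_DOCS = 100          # limit stated in the description
MAX_DOC_BYTES = 1024 * 1024   # 1MB; measuring in UTF-8 bytes is an assumption


def plan_requests(docs):
    """Group documents into requests: batches of up to 100, oversized ones alone."""
    requests = []
    batch = []
    for doc in docs:
        if len(doc["content"].encode("utf-8")) > MAX_DOC_BYTES:
            # Too large to batch: this document gets a request of its own.
            requests.append([doc])
            continue
        batch.append(doc)
        if len(batch) == MAX_BATCH_DOCS:
            requests.append(batch)
            batch = []
    if batch:
        # Flush the final, partly filled batch.
        requests.append(batch)
    return requests


# 250 small documents plus one oversized document at position 10.
docs = [{"id": i, "content": "small text"} for i in range(250)]
docs.insert(10, {"id": "big", "content": "x" * (MAX_DOC_BYTES + 1)})
sizes = [len(r) for r in plan_requests(docs)]
```

Under these assumptions the worst case for a single request is 100 documents of 1MB each, which matches the case the unit test is said to cover.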

            People

              Sam Marshall (quen)
              Mark Johnson
              Eloy Lafuente (stronk7)
              Gladys Basiana
              Dates

                Resolved: 9/Nov/20

                Time Tracking

                  Estimated: 0m
                  Remaining: 0m
                  Logged: 7h 10m