  1. Moodle
  2. MDL-70446

Solr: File indexing fails on certain files due to multipart upload


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.9.6, 3.10, 3.11, 4.0
    • Fix Version/s: 3.9.7, 3.10.4
    • Component/s: Global search
    • Labels:
    • Testing Instructions:

      Please test this in all branches, specifically with 39_STABLE because it has different code changes.

      You need a server configured to use Solr for Moodle global search. The test process is as follows:

      1. Log in as admin.
      2. Ensure that the "searchallavailablecourses" admin setting is set to "Search within ALL courses the user can access" (to get search results from any course in the site).
      3. Go to a forum and click 'Add a discussion', then 'Advanced'.
      4. Type in some text for the subject and message.
      5. Drag a file into the 'Attachments' area (see "Option 1" and "Option 2" below). Wait for files to fully upload.
      6. Click 'Post to forum'.
      7. Run the 'global search indexing' scheduled task. If it is scheduled to run automatically on your system, you can wait for it; otherwise, trigger it in the main site administration under Server / Tasks / Scheduled Tasks: find the task and click the 'Run now' option next to it.
      8. Click the global search feature (magnifying glass in header).
      9. Type in a word that is contained in your attachment (see the suggested terms in options 1 & 2 below).
      10. If there are too many results across your system, either use a more obscure word or phrase, or use the 'Filter' section to restrict results to the specific test course.
        • EXPECTED: The search results should show a result for the attachment that you posted. It should have the title and some of the text of the forum post, and 'Matched from file xxxx' (where xxxx is the name of the attachment file).

      Option 1: I suggest using the 'problem' attachment which you can find in the Solr issue https://issues.apache.org/jira/browse/SOLR-15039 (it may help to rename it to something.pptx before uploading). Search for "drones" for example.

      Option 2: there are 3 simple documents attached (PDF, RTF, Word) that you can also use. Search for the terms "Triceratops", "Stegosaurus" and "Velociraptor"; each one should show the post with one of the files as the source of the match. You can also search for "ñam, ñam", which should return the post with all 3 files as matches.

    • Affected Branches:
      MOODLE_310_STABLE, MOODLE_311_STABLE, MOODLE_39_STABLE, MOODLE_400_STABLE
    • Fixed Branches:
      MOODLE_310_STABLE, MOODLE_39_STABLE
    • Pull 3.9 Branch:
      MDL-70446-m39
    • Pull 3.10 Branch:
      MDL-70446-m310
    • Pull 3.11 Branch:
      MDL-70446-m311
    • Pull Master Branch:
      MDL-70446-master

      Description

      In Solr you can upload a file for indexing in two ways: as direct binary data in the POST, or in MIME multipart format. Moodle uses the multipart approach.
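
      The difference between the two upload styles can be sketched as follows. This is only an illustration in Python, not Moodle's PHP code: the core name `moodle`, the form field name `myfile`, and both helper functions are hypothetical, and the requests are built but never sent.

```python
import uuid
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint for illustration; adjust host and core name to your setup.
EXTRACT_URL = "http://localhost:8983/solr/moodle/update/extract"


def build_multipart_request(filename, content, params):
    """Wrap the file bytes in a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="myfile"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return Request(
        f"{EXTRACT_URL}?{urlencode(params)}",
        data=head + content + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def build_raw_request(content, params):
    """Send the file bytes as the entire POST body, with no multipart framing."""
    return Request(
        f"{EXTRACT_URL}?{urlencode(params)}",
        data=content,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
```

      In the raw form the request body is exactly the file content, whereas the multipart form surrounds it with boundary lines and part headers that the server has to parse before it reaches the file.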

      I have reported a Solr bug (tested in many versions e.g. 6.6, latest 8.7) where certain files cause serious failures when uploaded using multipart.

      https://issues.apache.org/jira/browse/SOLR-15039

      In my testing with one specific file (attached to the Solr issue), Solr running on Windows tended to just give an error and not index the file, but Solr on Unix would output a very long string of Chinese characters which occupies several megabytes in the index. I suspect this extreme case is rare, but I also suspect that there may be many files which are not indexed correctly due to this problem. So even if you don't run out of disk space in your search index, you might have things not being indexed that should be.

      Even if Solr fixes this problem, Moodle supports current and older Solr versions, so it would be good to fix it at our end too, which we can do fairly easily by not using multipart upload.

      The only disadvantage is that when uploading in this way, we have to load the file into memory (Curl only supports the 'curl_file' objects that we use to upload files without loading them into memory when you put them in the multipart array). To avoid this causing problems, I've set an additional limit on file size (the memory limit less 100MB, which on most 64-bit systems will be 284MB); files larger than this won't be indexed. Hopefully, most sane people have the 'biggest file to index' setting smaller than this already.
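
      The arithmetic behind that limit can be sketched as below. This is a Python illustration under stated assumptions, not the actual patch: the function names are hypothetical, the memory limit is assumed to be a PHP-style string such as "384M", and treating an unlimited setting ("-1") as "use only the configured maximum" is my assumption rather than something the issue specifies.

```python
MB = 1024 * 1024
HEADROOM = 100 * MB  # the "less 100MB" margin described in the issue

_UNITS = {"K": 1024, "M": MB, "G": 1024 * MB}


def parse_memory_limit(value):
    """Convert a PHP-style memory limit such as '384M' to bytes; None means unlimited."""
    value = value.strip().upper()
    if value == "-1":
        return None
    if value[-1] in _UNITS:
        return int(value[:-1]) * _UNITS[value[-1]]
    return int(value)


def max_indexable_size(memory_limit, configured_max):
    """Return the largest file size in bytes that would still be indexed."""
    limit = parse_memory_limit(memory_limit)
    if limit is None:
        # Assumption: with no memory limit, only the configured maximum applies.
        return configured_max
    return min(configured_max, max(0, limit - HEADROOM))


def should_index(file_size, memory_limit, configured_max):
    """True when the file fits under both the configured and memory-derived limits."""
    return file_size <= max_indexable_size(memory_limit, configured_max)
```

      With a 384M memory limit and a larger configured maximum, `max_indexable_size` yields 284MB, matching the figure quoted above; a smaller 'biggest file to index' setting simply wins.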

        Attachments

        1. MDL-70446.jpg
          37 kB
        2. 70446.rtf
          0.7 kB
        3. 70446.pdf
          14 kB
        4. 70446.docx
          8 kB

          Activity

            People

            Assignee:
            Sam Marshall (quen)
            Reporter:
            Sam Marshall (quen)
            Peer reviewer:
            Tim Hunt
            Integrator:
            Eloy Lafuente (stronk7)
            Tester:
            Anna Carissa Sadia
            Participants:
            Component watchers:
            Amaia Anabitarte, Carlos Escobedo, Ferran Recio, Ilya Tregubov, Sara Arjona (@sarjona)
            Votes:
            0
            Watchers:
            3

              Dates

              Created:
              Updated:
              Resolved:
              Fix Release Date:
              10/May/21

                Time Tracking

                Estimated:
                0m
                Remaining:
                0m
                Logged:
                5h 30m