MDL-62042

core_search: Unicode non-characters cause indexing problems

    Details

    • Testing Instructions:
      1. As this problem occurred when using Solr, you need a Moodle setup with global search using Solr.
      2. Go to the scheduled tasks page and click 'Run now' against the 'Global search indexing' scheduled task. Repeat if necessary until all existing content is indexed.
      3. Open the attached badcharacter.txt in a web browser or a text editor that does not itself strip out the U+FFEF character. (You should see a weird character in the middle of the word 'cutoff', i.e. between the t and the o.) I used Firefox for this. Copy the full text of the file to the clipboard.
      4. Go to a Moodle course and create a new Label, copying and pasting in the contents of badcharacter.txt.
      5. Go to the scheduled tasks page and click 'Run now' against the 'Global search indexing' scheduled task. When it finishes, look in the results under 'Processing area: Label'.
        • EXPECTED: You should see 'Processed 1 records containing 1 documents'
        • BEFORE FIX: You got a PHP warning and 'No new documents to index'.
      6. Using the global search facility (e.g. the magnifying glass icon in the header), search for the special word klooblesnee.
        • EXPECTED: It should find the label 'The character in the middle of the following word...'
        • BEFORE FIX: The label was not found because it was not indexed
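      For step 3, one way to confirm that the copied text still contains the problem character is to scan it programmatically rather than trusting how an editor renders it. The following is a minimal Python sketch and not part of the Moodle fix; the function name `find_unassigned` and the sample string are hypothetical. It flags every code point whose Unicode general category is Cn (unassigned), which covers the noncharacters U+FFFE and U+FFFF discussed in the description as well as the U+FFEF named in step 3.

```python
import unicodedata


def find_unassigned(text):
    """Return (index, 'U+XXXX') for each code point with general category Cn."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cn"
    ]


# Hypothetical stand-in for the text copied from badcharacter.txt.
sample = "cut\ufffeoff klooblesnee"
print(find_unassigned(sample))
```

      To check the real attachment, the same function could be applied to the result of reading the file as UTF-8, for example `open(path, encoding="utf-8").read()`.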
    • Affected Branches:
      MOODLE_33_STABLE
    • Fixed Branches:
      MOODLE_33_STABLE, MOODLE_34_STABLE
    • Pull Master Branch:
      MDL-62042-master

      Description

      We have seen serious indexing errors in our live system, which can be reproduced as debugging warnings in test systems, when text to be indexed contains Unicode non-characters. Solr objects to some of these characters.

      These non-characters can be entered into Moodle, for example in forum posts. It might arguably be a good idea to stop people entering them, and that could be considered separately, but I think we can address the problem in a single location by making the search system strip these characters.

      Only two of the characters, U+FFFE and U+FFFF, cause repeatable problems on my PHP 7 test system, so it may be sufficient to remove those. Our PHP 5.x system also had problems with the other noncharacters.

      To confirm the list of noncharacters, here is a quote allegedly from the Unicode standard:

      'The Unicode Standard sets aside 66 noncharacter code points. The last two code points of each plane are noncharacters: U+FFFE and U+FFFF on the BMP, U+1FFFE and U+1FFFF on Plane 1, and so on, up to U+10FFFE and U+10FFFF on Plane 16, for a total of 34 code points. In addition, there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0..U+FDEF. For historical reasons, the range U+FDD0..U+FDEF is contained within the Arabic Presentation Forms-A block, but those noncharacters are not “Arabic noncharacters” or “right-to-left noncharacters,” and are not distinguished in any other way from the other noncharacters, except in their code point values.'
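      The set described in the quote is small enough to test for directly. The following Python sketch is an illustration of the code point arithmetic only, not the change made in Moodle (which is PHP); the names `is_noncharacter` and `strip_noncharacters` are hypothetical. It assumes the text is already decoded into Unicode code points.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacter code points."""
    # Contiguous BMP range U+FDD0..U+FDEF (32 code points).
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # Last two code points of each of the 17 planes (34 code points):
    # the low 16 bits are FFFE or FFFF.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def strip_noncharacters(text: str) -> str:
    """Return text with every noncharacter code point removed."""
    return "".join(ch for ch in text if not is_noncharacter(ord(ch)))


# Hypothetical sample with U+FFFE in the middle of 'cutoff'.
print(strip_noncharacters("cut\ufffeoff klooblesnee"))
```

      Removing all 66 code points, rather than only U+FFFE and U+FFFF, would also cover the additional failures reported on the PHP 5.x system.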

            People

            • Assignee:
              Sam Marshall (quen)
              Reporter:
              Sam Marshall (quen)
              Peer reviewer:
              Tim Hunt
              Integrator:
              David Monllaó
              Tester:
              David Monllaó
              Participants:
              Component watchers:
              Amaia Anabitarte, Carlos Escobedo, Ferran Recio, Sara Arjona (@sarjona), Víctor Déniz Falcón

              Dates

              • Created:
                Updated:
                Resolved:
                Fix Release Date:
                17/May/18