Bug
Resolution: Fixed
Minor
3.3.5
MOODLE_33_STABLE
MOODLE_33_STABLE, MOODLE_34_STABLE
MDL-62042-master
We have seen serious indexing errors on our live system when the text to be indexed contains Unicode noncharacters. On test systems the same errors can be reproduced as debugging warnings. Solr objects to some of these characters.
These noncharacters can be entered into Moodle, for example in forum posts. Stopping people from entering them might arguably be a good idea and could be considered separately, but I think we can address the problem in a single location by making the search system strip these characters.
Only two of the characters, U+FFFE and U+FFFF, cause repeatable problems on my PHP 7 test system, so it may be sufficient to remove those. However, our PHP 5.x system also had problems with the other noncharacters.
To confirm the list of noncharacters, here is a quote allegedly from the Unicode standard:
'The Unicode Standard sets aside 66 noncharacter code points. The last two code points of each plane are noncharacters: U+FFFE and U+FFFF on the BMP, U+1FFFE and U+1FFFF on Plane 1, and so on, up to U+10FFFE and U+10FFFF on Plane 16, for a total of 34 code points. In addition, there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0..U+FDEF. For historical reasons, the range U+FDD0..U+FDEF is contained within the Arabic Presentation Forms-A block, but those noncharacters are not “Arabic noncharacters” or “right-to-left noncharacters,” and are not distinguished in any other way from the other noncharacters, except in their code point values.'
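The list in the quote can be checked mechanically. The following is a minimal Python sketch, not the Moodle fix itself (which would live in the PHP search code). It enumerates the noncharacters as described in the quote, confirms the total of 66, and strips them from a string. The function names `noncharacter_code_points` and `strip_noncharacters` are hypothetical and chosen only for illustration.

```python
def noncharacter_code_points():
    """Return the Unicode noncharacter code points as a sorted list."""
    # Contiguous BMP range U+FDD0..U+FDEF (32 code points).
    points = list(range(0xFDD0, 0xFDF0))
    # Last two code points of each of the 17 planes (34 code points):
    # U+FFFE/U+FFFF, U+1FFFE/U+1FFFF, ..., U+10FFFE/U+10FFFF.
    for plane in range(17):
        base = plane << 16
        points.extend((base | 0xFFFE, base | 0xFFFF))
    return sorted(points)


# Translation table mapping each noncharacter to None, which makes
# str.translate() delete it.
_STRIP_TABLE = dict.fromkeys(noncharacter_code_points())


def strip_noncharacters(text):
    """Return text with all Unicode noncharacters removed."""
    return text.translate(_STRIP_TABLE)


if __name__ == "__main__":
    print(len(noncharacter_code_points()))
    print(strip_noncharacters("forum\uFFFE post\uFFFF text"))
```

If only U+FFFE and U+FFFF turn out to matter in practice, the table could be reduced to those two entries. Removing the full set is the more conservative choice given the PHP 5.x behaviour described above.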