Moodle
  1. Moodle
  2. MDL-24565

Cleaning does not prevent invalid XML unicode characters

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 1.9.9, 2.0
    • Fix Version/s: 1.9.10, 2.0
    • Component/s: Libraries
    • Labels:
      None
    • Affected Branches:
      MOODLE_19_STABLE, MOODLE_20_STABLE
    • Fixed Branches:
      MOODLE_19_STABLE, MOODLE_20_STABLE
    • Rank:
      27387

      Description

      It is possible to enter, e.g. into a forum post, Unicode control characters such as U+0001.

      Within XML output, only the control characters 9, 10, and 13 are permitted. Presumably for this reason, the RSS feed output for the forum does not work if somebody enters those characters.

      A suitable fix would be to make the Moodle clean_param function capable of stripping out these characters (any control character other than 9, 10, 13).

        Activity

        Hide
        Jenny Gray added a comment -

        I'm beginning to wish I'd left this one to Sam!! I decided to see if there was still an issue in Moodle 2.0, and at first I thought there wasn't, so I distracted myself for several hours tracking down exactly what was different between the two.

        It turns out that in both 2.0 and 1.9 the dodgy character is stored in the database, and cleaned in the output - both the on-screen print_post and the rss.

        In 2.0 the output cleaning was working to take out the dodgy character, but on 1.9 it wasn't.

        Ultimately the difference is that a default install of 1.9 has $CFG->enablehtmlpurifier as No and 2.0 has it as Yes. htmlpurifier handles these control characters perfectly, but the other stripping functions we have, and specifically the regular expression below does not.

        $text = preg_replace('/&#0*([0-9]+);?/', '&#$1;', $string);
        $text = preg_replace('/&#x0*([0-9a-fA-F]+);?/', '&#x$1;', $text);

        The regular expression can't work where the control character has been entered fully because its not been encoded like this. I'm sure there ought to be a function to swap the control character to &# format, but I can't seem to find it.

        Question: to enable htmlpurifier, or decode the character?

        Show
        Jenny Gray added a comment - I'm beginning to wish I'd left this one to Sam!! I decided to see if there was still an issue in Moodle 2.0, and at first I thought there wasn't, so I distracted myself for several hours tracking down exactly what was different between the two. It turns out that in both 2.0 and 1.9 the dodgy character is stored in the database, and cleaned in the output - both the on-screen print_post and the rss. In 2.0 the output cleaning was working to take out the dodgy character, but on 1.9 it wasn't. Ultimately the difference is that a default install of 1.9 has $CFG->enablehtmlpurifier as No and 2.0 has it as Yes. htmlpurifier handles these control characters perfectly, but the other stripping functions we have, and specifically the regular expression below does not. $text = preg_replace('/&#0*( [0-9] +);?/', '&#$1;', $string); $text = preg_replace('/&#x0*( [0-9a-fA-F] +);?/', '&#x$1;', $text); The regular expression can't work where the control character has been entered fully because its not been encoded like this. I'm sure there ought to be a function to swap the control character to &# format, but I can't seem to find it. Question: to enable htmlpurifier, or decode the character?
        Hide
        Jenny Gray added a comment - - edited

        I've identified a regular expression \p

        {C}

        which will strip the unicode control characters out, but leave the majority of unicode in.

        Show
        Jenny Gray added a comment - - edited I've identified a regular expression \p {C} which will strip the unicode control characters out, but leave the majority of unicode in.
        Hide
        Petr Škoda added a comment -

        Fix committed into fix_non_standard_entities() in 2.0, HTML purifier is already fixing this.

        Show
        Petr Škoda added a comment - Fix committed into fix_non_standard_entities() in 2.0, HTML purifier is already fixing this.
        Hide
        Petr Škoda added a comment -

        The problem with 1.9.x is we do not know if it works in all supported versions and configurations of PHP...

        Show
        Petr Škoda added a comment - The problem with 1.9.x is we do not know if it works in all supported versions and configurations of PHP...
        Hide
        Jenny Gray added a comment -

        We established that the \p option is available from php 5.1.0 http://php.net/manual/en/regexp.reference.unicode.php but that PCRE can be compiled without unicode support.

        I think the following environment test will do the trick for moodle 1.9. Its using the moodle version_compare function for the PHP function. preg_replace returns null if the \p option isn't recognised, so I'm using a default string and testing whether \p

        {C} is recognised. @ does not appear to be required as no error is thrown.

        if ((version_compare(phpversion(),'5.2.0') > 0) &&
        !is_null(preg_replace( '/\p{C}

        /u', '', 'test prce compile setting' ))) {

        Note - I don't have PHP4 installed anywhere to test on, so I've tested this by using an unrecognised \p option.

        Show
        Jenny Gray added a comment - We established that the \p option is available from php 5.1.0 http://php.net/manual/en/regexp.reference.unicode.php but that PCRE can be compiled without unicode support. I think the following environment test will do the trick for moodle 1.9. Its using the moodle version_compare function for the PHP function. preg_replace returns null if the \p option isn't recognised, so I'm using a default string and testing whether \p {C} is recognised. @ does not appear to be required as no error is thrown. if ((version_compare(phpversion(),'5.2.0') > 0) && !is_null(preg_replace( '/\p{C} /u', '', 'test prce compile setting' ))) { Note - I don't have PHP4 installed anywhere to test on, so I've tested this by using an unrecognised \p option.
        Hide
        Jenny Gray added a comment -

        I did some searching in tracker and found another issue which may be related to this MDL-24193. I'm not sure whether I should link it or not?

        Show
        Jenny Gray added a comment - I did some searching in tracker and found another issue which may be related to this MDL-24193 . I'm not sure whether I should link it or not?
        Hide
        Jordan Tomkinson added a comment - - edited

        PCRE packages from redhat / centos / others were not compiled with unicode support, this patch has now broken moodle.org and others (all moodle.org and moodle.com subdomains)
        every page generates the below entry in the apache error_log:

        PHP Warning:  preg_replace() [<a href='function.preg-replace'>function.preg-replace</a>]: Compilation failed: support for \\P, \\p, and \\X has not been compiled at offset 1 in /html/lib/weblib.php on line 1301, referer: http://moodle.org/login/index.php

        I imagine this affects quite a lot of installations.

        Show
        Jordan Tomkinson added a comment - - edited PCRE packages from redhat / centos / others were not compiled with unicode support, this patch has now broken moodle.org and others (all moodle.org and moodle.com subdomains) every page generates the below entry in the apache error_log: PHP Warning: preg_replace() [<a href='function.preg-replace'>function.preg-replace</a>]: Compilation failed: support for \\P, \\p, and \\X has not been compiled at offset 1 in /html/lib/weblib.php on line 1301, referer: http: //moodle.org/login/index.php I imagine this affects quite a lot of installations.
        Hide
        Jenny Gray added a comment -

        The issue Jordan describes is exactly what Petr was worried about, but we thought Moodle 2.0 would be OK. Obviously not.

        Can you try the test I suggested above (8/10/10 6:24pm) to see if it resolves the issue?

        Show
        Jenny Gray added a comment - The issue Jordan describes is exactly what Petr was worried about, but we thought Moodle 2.0 would be OK. Obviously not. Can you try the test I suggested above (8/10/10 6:24pm) to see if it resolves the issue?
        Hide
        Sam Marshall added a comment - - edited

        Just a thought (and may be a bit late) but am I not right in thinking that this code doesn't need to use Unicode, because it is about hiding control characters which are all* in the <=127 ASCII range.

        This code might work (untested):

        $text = preg_replace('~[x00-\x08\x0b-\x0c\x0e-\x1f]~', '', $text);
        

        i.e. replace byte values 0-8, 11-12, and 14-31 with ''

        • There might be some unicode control characters outside that range, I dunno - but it's the ones 0-31 (except 9, 10, 13) which are not permitted in xml/rss which is causing the problem.
        Show
        Sam Marshall added a comment - - edited Just a thought (and may be a bit late) but am I not right in thinking that this code doesn't need to use Unicode, because it is about hiding control characters which are all* in the <=127 ASCII range. This code might work (untested): $text = preg_replace('~[x00-\x08\x0b-\x0c\x0e-\x1f]~', '', $text); i.e. replace byte values 0-8, 11-12, and 14-31 with '' There might be some unicode control characters outside that range, I dunno - but it's the ones 0-31 (except 9, 10, 13) which are not permitted in xml/rss which is causing the problem.
        Hide
        Jenny Gray added a comment -

        Character set stuff makes my brain ache. Sam is right. I've tested with

        $text = preg_replace('[\x00-\x08\x0b-\x0c\x0e-\x1f]', '', $text);

        Added to the appropriate point in weblib and it works on our setup.

        Jordan, can you try this on moodle.org in fix_non_standard_entities($string) in weblib.php instead of current line 1301 which we added last week?

        Show
        Jenny Gray added a comment - Character set stuff makes my brain ache. Sam is right. I've tested with $text = preg_replace(' [\x00-\x08\x0b-\x0c\x0e-\x1f] ', '', $text); Added to the appropriate point in weblib and it works on our setup. Jordan, can you try this on moodle.org in fix_non_standard_entities($string) in weblib.php instead of current line 1301 which we added last week?

          People

          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: