Uploaded image for project: 'Moodle'
  1. Moodle
  2. MDL-39378

Fix s() hex entity handling, and improve performance

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.3.6, 2.4.3
    • Fix Version/s: 2.3.7, 2.4.3
    • Component/s: Libraries
    • Labels:

      Description

      The regex is currently "/&#(\d+|x[0-7a-fA-F]+);/i", "&#$1;"

      • we have /i, but a-fA-F
      • why do we have 0-7, not 0-9?

        Gliffy Diagrams

          Issue Links

            Activity

            Hide
            timhunt Tim Hunt added a comment -

            Right, here is my proposed fix.

            Show
            timhunt Tim Hunt added a comment - Right, here is my proposed fix.
            Hide
            timhunt Tim Hunt added a comment -

            This test script times the varous options I considered. The only things faster than what I went for fail some unit tests.

            Show
            timhunt Tim Hunt added a comment - This test script times the varous options I considered. The only things faster than what I went for fail some unit tests.
            Hide
            timhunt Tim Hunt added a comment -

            I will also add, we were seeing s() showing up in our load-testing. It is about 3-4% of page-load time. My changes make it about 10% faster.

            (For comparison, get string is about 16% of page-load time.)

            Show
            timhunt Tim Hunt added a comment - I will also add, we were seeing s() showing up in our load-testing. It is about 3-4% of page-load time. My changes make it about 10% faster. (For comparison, get string is about 16% of page-load time.)
            Hide
            skodak Petr Skoda added a comment -

            pardon me, but it looks very crazy to remove the encoding in htmlspecialchars()...

            Show
            skodak Petr Skoda added a comment - pardon me, but it looks very crazy to remove the encoding in htmlspecialchars()...
            Hide
            dobedobedoh Andrew Nicols added a comment -

            From http://php.net/manual/en/function.htmlspecialchars.php:

            If omitted, the default value for this argument is ISO-8859-1 in versions of PHP prior to 5.4.0, and UTF-8 from PHP 5.4.0 onwards.

            Show
            dobedobedoh Andrew Nicols added a comment - From http://php.net/manual/en/function.htmlspecialchars.php: If omitted, the default value for this argument is ISO-8859-1 in versions of PHP prior to 5.4.0, and UTF-8 from PHP 5.4.0 onwards.
            Hide
            skodak Petr Skoda added a comment -

            I would be also great if you started separating actual bug fixes from your cleanup commits.

            Show
            skodak Petr Skoda added a comment - I would be also great if you started separating actual bug fixes from your cleanup commits.
            Hide
            timhunt Tim Hunt added a comment -

            Sorry, Petr, which part of the commit was clean-up, and not part of the actual improvement?

            Show
            timhunt Tim Hunt added a comment - Sorry, Petr, which part of the commit was clean-up, and not part of the actual improvement?
            Hide
            skodak Petr Skoda added a comment -

            The minimal bug fix was 1 character if I read it correctly, you modified 8 lines and you managed to introduce a regressions, that can not be ok.

            Show
            skodak Petr Skoda added a comment - The minimal bug fix was 1 character if I read it correctly, you modified 8 lines and you managed to introduce a regressions, that can not be ok.
            Hide
            timhunt Tim Hunt added a comment -

            Removing the encoding is not crazy. Specifying the string UTF-8 there makes things noticably slower (and I am testing this on a server that using PHP 5.4, so that is the option being used).

            I am running the unit tests on PHP 5.3, and have added some unit tests using UTF-8 characters, and they still pass.

            Think about what htmlspecialchars does. It finds all occurrences of 5 speific bytes, and replaces them with several bytes. The bytes it needs to search for are the same in UTF-8 and ISO-whatever. The bytes it searces for can never be part of a multi-byte character in UTF-8.

            Show
            timhunt Tim Hunt added a comment - Removing the encoding is not crazy. Specifying the string UTF-8 there makes things noticably slower (and I am testing this on a server that using PHP 5.4, so that is the option being used). I am running the unit tests on PHP 5.3, and have added some unit tests using UTF-8 characters, and they still pass. Think about what htmlspecialchars does. It finds all occurrences of 5 speific bytes, and replaces them with several bytes. The bytes it needs to search for are the same in UTF-8 and ISO-whatever. The bytes it searces for can never be part of a multi-byte character in UTF-8.
            Hide
            timhunt Tim Hunt added a comment -

            I just pushed an amended patch with the UTF-8 unit tests.

            Show
            timhunt Tim Hunt added a comment - I just pushed an amended patch with the UTF-8 unit tests.
            Hide
            skodak Petr Skoda added a comment - - edited

            Could you please make a separate commit with just the one character bugfix, please? Then commit unit tests and only after that your cleanup.

            I believe that upstream should be dealing with performance, intentionally using invalid encoding parameters is wrong. I believe this might be a regresion and even a security issue because:

            If the input string contains an invalid code unit sequence within the given encoding an empty string will be returned, unless either the ENT_IGNORE or ENT_SUBSTITUTE flags are set.

            My -2 because
            1/ invalid UTF-8 chars might get through which might theoretically result in XSS
            2/ cleanup should not be mixed with bug fixes in one commit

            Show
            skodak Petr Skoda added a comment - - edited Could you please make a separate commit with just the one character bugfix, please? Then commit unit tests and only after that your cleanup. I believe that upstream should be dealing with performance, intentionally using invalid encoding parameters is wrong. I believe this might be a regresion and even a security issue because: If the input string contains an invalid code unit sequence within the given encoding an empty string will be returned, unless either the ENT_IGNORE or ENT_SUBSTITUTE flags are set. My -2 because 1/ invalid UTF-8 chars might get through which might theoretically result in XSS 2/ cleanup should not be mixed with bug fixes in one commit
            Hide
            skodak Petr Skoda added a comment -
            Show
            skodak Petr Skoda added a comment - http://blog.astrumfutura.com/2012/03/a-hitchhikers-guide-to-cross-site-scripting-xss-in-php-part-1-how-not-to-use-htmlspecialchars-for-output-escaping/ Tim: look for "Great ASCII Delusion (GAD)" in that article or even better read it all!
            Hide
            timhunt Tim Hunt added a comment -

            Thanks. Interesting read.

            Branch amended into two commits. One that is just the bug fix (+ unit test to verify it). The other adds the performance improvements + the extra unit tests.

            Performance is a shared responsibility. The PHP / PCRE folks should optimise preg_replace as much as possible. But we should also do our part. For example, do we really need to call preg_replace 5000 times when viewing a course page?

            Show
            timhunt Tim Hunt added a comment - Thanks. Interesting read. Branch amended into two commits. One that is just the bug fix (+ unit test to verify it). The other adds the performance improvements + the extra unit tests. Performance is a shared responsibility. The PHP / PCRE folks should optimise preg_replace as much as possible. But we should also do our part. For example, do we really need to call preg_replace 5000 times when viewing a course page?
            Hide
            timhunt Tim Hunt added a comment -

            Sorry, one more thing: Is it worth making a meta-bug for "Things to do when we move to PHP 5.4 as a minium supported version? Think like updating the htmlspecialchars to use the new ENT_SUBSTITUTE option?

            Show
            timhunt Tim Hunt added a comment - Sorry, one more thing: Is it worth making a meta-bug for "Things to do when we move to PHP 5.4 as a minium supported version? Think like updating the htmlspecialchars to use the new ENT_SUBSTITUTE option?
            Hide
            skodak Petr Skoda added a comment -

            Some performance ideas:
            1/ if there is no "&" in original string preg_replace() is not necessary, just htmlspecialchars()
            2/ numbers do not need to be escaped at all

            Show
            skodak Petr Skoda added a comment - Some performance ideas: 1/ if there is no "&" in original string preg_replace() is not necessary, just htmlspecialchars() 2/ numbers do not need to be escaped at all
            Hide
            timhunt Tim Hunt added a comment -

            From the measurements I did today, I am pretty sure that checking those conditions will be slower than just doing the preg-repace.

            From the numbers output by the test script (basically time in seconds to call s() 40000 times) I think the scores were very roughly:

            1. Any PHP function call: 0.4
            2. htmlspecialchars: 0.6
            3. preg_replace: 1.0
            4. simplifying the if: 0.1
            5. removing the default arguments to htmlspecialchars: 0.1 each.

            There was actually a lot of noise on top of that, but I ran the test script many times.

            I wish we had documented why the preg_replace was necessary at all. I would actually be inclined to make things slower, by adding debugging logic so that any time the preg_replace did something, it displayed a warning. I can't think of a good reason why we want s('&#10') to be '&#10' and not '&#10'. Do you remember.

            Anyway, I have spent all the time I can afford on this. I think I have made things better, so I should submit this for integration. If you want to try and squeeze more performance out here, then we can create a new issue.

            (But, if you want to optimise something, then the load testing results I saw today were, for a course view page:
            get_string: 16%
            s: 3-4%
            mostly it was DB and MUC, of course. I hope the people doing our load-testing will summarise what they learned in a forum post.)

            Show
            timhunt Tim Hunt added a comment - From the measurements I did today, I am pretty sure that checking those conditions will be slower than just doing the preg-repace. From the numbers output by the test script (basically time in seconds to call s() 40000 times) I think the scores were very roughly: 1. Any PHP function call: 0.4 2. htmlspecialchars: 0.6 3. preg_replace: 1.0 4. simplifying the if: 0.1 5. removing the default arguments to htmlspecialchars: 0.1 each. There was actually a lot of noise on top of that, but I ran the test script many times. I wish we had documented why the preg_replace was necessary at all. I would actually be inclined to make things slower, by adding debugging logic so that any time the preg_replace did something, it displayed a warning. I can't think of a good reason why we want s('&#10') to be '&#10' and not '&#10'. Do you remember. Anyway, I have spent all the time I can afford on this. I think I have made things better, so I should submit this for integration. If you want to try and squeeze more performance out here, then we can create a new issue. (But, if you want to optimise something, then the load testing results I saw today were, for a course view page: get_string: 16% s: 3-4% mostly it was DB and MUC, of course. I hope the people doing our load-testing will summarise what they learned in a forum post.)
            Hide
            skodak Petr Skoda added a comment -

            It is there because in the old days some browsers were converting all non-ascii chars to html entities.

            Show
            skodak Petr Skoda added a comment - It is there because in the old days some browsers were converting all non-ascii chars to html entities.
            Hide
            skodak Petr Skoda added a comment -

            I suppose large number of s() is with numbers, this might make it faster:

            if ($var === false) {
                return '0';
            }
            $var = (string)$var;
            if ($var === (string)((int)$var)) {
               return $var;
            }

            no?

            Show
            skodak Petr Skoda added a comment - I suppose large number of s() is with numbers, this might make it faster: if ($var === false) { return '0'; } $var = (string)$var; if ($var === (string)((int)$var)) { return $var; } no?
            Hide
            timhunt Tim Hunt added a comment -

            As I say, if you want to experiment, please can we create a separate issue.

            Or, you can take over this issue.

            You raise a good point: if a large number of s() is with numbers ...

            It would be interesting to hack S() to write all the inputs to a file. Then browser round some typical Moodle pages. That way, we would know what sort of input we really have to deal with.

            Show
            timhunt Tim Hunt added a comment - As I say, if you want to experiment, please can we create a separate issue. Or, you can take over this issue. You raise a good point: if a large number of s() is with numbers ... It would be interesting to hack S() to write all the inputs to a file. Then browser round some typical Moodle pages. That way, we would know what sort of input we really have to deal with.
            Hide
            skodak Petr Skoda added a comment -

            +1

            Show
            skodak Petr Skoda added a comment - +1
            Hide
            skodak Petr Skoda added a comment -

            to integrators: this should be safe for backport, feel free to cherry pick...

            Show
            skodak Petr Skoda added a comment - to integrators: this should be safe for backport, feel free to cherry pick...
            Hide
            timhunt Tim Hunt added a comment -

            Thanks Petr. I will make stable branches tomorrow, providing that the integrators have not already cherry-picked this by then.

            Show
            timhunt Tim Hunt added a comment - Thanks Petr. I will make stable branches tomorrow, providing that the integrators have not already cherry-picked this by then.
            Hide
            timhunt Tim Hunt added a comment -

            Right, stable branches made.

            Show
            timhunt Tim Hunt added a comment - Right, stable branches made.
            Hide
            poltawski Dan Poltawski added a comment -

            I wasn't keen on seeing the removal of the argument on the stable branches, but Tim points out that it actually has no effect, not even a E_STRICT warning. The argument was removed a long time ago (before my time in Moodle, perhaps, I don't recall seeing code using it).

            And of course, Tim says they've found a measured performance improvement by removing this, so the argument is compelling. Finally, getting this into the stables helps the OU, who are kind enough to contribute this change to us, so - thanks OU! This will land in the stables and i'll blame you for any problems

            Show
            poltawski Dan Poltawski added a comment - I wasn't keen on seeing the removal of the argument on the stable branches, but Tim points out that it actually has no effect, not even a E_STRICT warning. The argument was removed a long time ago (before my time in Moodle, perhaps, I don't recall seeing code using it). And of course, Tim says they've found a measured performance improvement by removing this, so the argument is compelling. Finally, getting this into the stables helps the OU, who are kind enough to contribute this change to us, so - thanks OU! This will land in the stables and i'll blame you for any problems
            Hide
            poltawski Dan Poltawski added a comment -

            Integrated to master, 24 and 23 - thanks Tim.

            Show
            poltawski Dan Poltawski added a comment - Integrated to master, 24 and 23 - thanks Tim.
            Hide
            poltawski Dan Poltawski added a comment -

            Looks good on all branches, thanks Tim.

            Show
            poltawski Dan Poltawski added a comment - Looks good on all branches, thanks Tim.
            Hide
            poltawski Dan Poltawski added a comment -

            Thanks! You're changes are now spread to the world through this git and our source control repositories.

            No time to rest though, we've got days to make 2.5 the best yet!

            ciao

            Show
            poltawski Dan Poltawski added a comment - Thanks! You're changes are now spread to the world through this git and our source control repositories. No time to rest though, we've got days to make 2.5 the best yet! ciao

              People

              • Votes:
                0 Vote for this issue
                Watchers:
                5 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:
                  Fix Release Date:
                  18/Mar/13