Moodle
  1. Moodle
  2. MDL-39378

Fix s() hex entity handling, and improve performance

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 2.3.6, 2.4.3
    • Fix Version/s: 2.3.7, 2.4.3
    • Component/s: Libraries
    • Labels:
    • Rank:
      50014

      Description

      The regex is currently "/&#(\d+|x[0-7a-fA-F]+);/i", "&#$1;"

      • we have /i, but a-fA-F
      • why do we have 0-7, not 0-9?

        Issue Links

          Activity

          Hide
          Tim Hunt added a comment -

          Right, here is my proposed fix.

          Show
          Tim Hunt added a comment - Right, here is my proposed fix.
          Hide
          Tim Hunt added a comment -

          This test script times the varous options I considered. The only things faster than what I went for fail some unit tests.

          Show
          Tim Hunt added a comment - This test script times the varous options I considered. The only things faster than what I went for fail some unit tests.
          Hide
          Tim Hunt added a comment -

          I will also add, we were seeing s() showing up in our load-testing. It is about 3-4% of page-load time. My changes make it about 10% faster.

          (For comparison, get string is about 16% of page-load time.)

          Show
          Tim Hunt added a comment - I will also add, we were seeing s() showing up in our load-testing. It is about 3-4% of page-load time. My changes make it about 10% faster. (For comparison, get string is about 16% of page-load time.)
          Hide
          Petr Škoda added a comment -

          pardon me, but it looks very crazy to remove the encoding in htmlspecialchars()...

          Show
          Petr Škoda added a comment - pardon me, but it looks very crazy to remove the encoding in htmlspecialchars()...
          Hide
          Andrew Nicols added a comment -

          From http://php.net/manual/en/function.htmlspecialchars.php:

          If omitted, the default value for this argument is ISO-8859-1 in versions of PHP prior to 5.4.0, and UTF-8 from PHP 5.4.0 onwards.

          Show
          Andrew Nicols added a comment - From http://php.net/manual/en/function.htmlspecialchars.php: If omitted, the default value for this argument is ISO-8859-1 in versions of PHP prior to 5.4.0, and UTF-8 from PHP 5.4.0 onwards.
          Hide
          Petr Škoda added a comment -

          I would be also great if you started separating actual bug fixes from your cleanup commits.

          Show
          Petr Škoda added a comment - I would be also great if you started separating actual bug fixes from your cleanup commits.
          Hide
          Tim Hunt added a comment -

          Sorry, Petr, which part of the commit was clean-up, and not part of the actual improvement?

          Show
          Tim Hunt added a comment - Sorry, Petr, which part of the commit was clean-up, and not part of the actual improvement?
          Hide
          Petr Škoda added a comment -

          The minimal bug fix was 1 character if I read it correctly, you modified 8 lines and you managed to introduce a regressions, that can not be ok.

          Show
          Petr Škoda added a comment - The minimal bug fix was 1 character if I read it correctly, you modified 8 lines and you managed to introduce a regressions, that can not be ok.
          Hide
          Tim Hunt added a comment -

          Removing the encoding is not crazy. Specifying the string UTF-8 there makes things noticably slower (and I am testing this on a server that using PHP 5.4, so that is the option being used).

          I am running the unit tests on PHP 5.3, and have added some unit tests using UTF-8 characters, and they still pass.

          Think about what htmlspecialchars does. It finds all occurrences of 5 speific bytes, and replaces them with several bytes. The bytes it needs to search for are the same in UTF-8 and ISO-whatever. The bytes it searces for can never be part of a multi-byte character in UTF-8.

          Show
          Tim Hunt added a comment - Removing the encoding is not crazy. Specifying the string UTF-8 there makes things noticably slower (and I am testing this on a server that using PHP 5.4, so that is the option being used). I am running the unit tests on PHP 5.3, and have added some unit tests using UTF-8 characters, and they still pass. Think about what htmlspecialchars does. It finds all occurrences of 5 speific bytes, and replaces them with several bytes. The bytes it needs to search for are the same in UTF-8 and ISO-whatever. The bytes it searces for can never be part of a multi-byte character in UTF-8.
          Hide
          Tim Hunt added a comment -

          I just pushed an amended patch with the UTF-8 unit tests.

          Show
          Tim Hunt added a comment - I just pushed an amended patch with the UTF-8 unit tests.
          Hide
          Petr Škoda added a comment - - edited

          Could you please make a separate commit with just the one character bugfix, please? Then commit unit tests and only after that your cleanup.

          I believe that upstream should be dealing with performance, intentionally using invalid encoding parameters is wrong. I believe this might be a regresion and even a security issue because:

          If the input string contains an invalid code unit sequence within the given encoding an empty string will be returned, unless either the ENT_IGNORE or ENT_SUBSTITUTE flags are set.

          My -2 because
          1/ invalid UTF-8 chars might get through which might theoretically result in XSS
          2/ cleanup should not be mixed with bug fixes in one commit

          Show
          Petr Škoda added a comment - - edited Could you please make a separate commit with just the one character bugfix, please? Then commit unit tests and only after that your cleanup. I believe that upstream should be dealing with performance, intentionally using invalid encoding parameters is wrong. I believe this might be a regresion and even a security issue because: If the input string contains an invalid code unit sequence within the given encoding an empty string will be returned, unless either the ENT_IGNORE or ENT_SUBSTITUTE flags are set. My -2 because 1/ invalid UTF-8 chars might get through which might theoretically result in XSS 2/ cleanup should not be mixed with bug fixes in one commit
          Hide
          Petr Škoda added a comment -
          Show
          Petr Škoda added a comment - http://blog.astrumfutura.com/2012/03/a-hitchhikers-guide-to-cross-site-scripting-xss-in-php-part-1-how-not-to-use-htmlspecialchars-for-output-escaping/ Tim: look for "Great ASCII Delusion (GAD)" in that article or even better read it all!
          Hide
          Tim Hunt added a comment -

          Thanks. Interesting read.

          Branch amended into two commits. One that is just the bug fix (+ unit test to verify it). The other adds the performance improvements + the extra unit tests.

          Performance is a shared responsibility. The PHP / PCRE folks should optimise preg_replace as much as possible. But we should also do our part. For example, do we really need to call preg_replace 5000 times when viewing a course page?

          Show
          Tim Hunt added a comment - Thanks. Interesting read. Branch amended into two commits. One that is just the bug fix (+ unit test to verify it). The other adds the performance improvements + the extra unit tests. Performance is a shared responsibility. The PHP / PCRE folks should optimise preg_replace as much as possible. But we should also do our part. For example, do we really need to call preg_replace 5000 times when viewing a course page?
          Hide
          Tim Hunt added a comment -

          Sorry, one more thing: Is it worth making a meta-bug for "Things to do when we move to PHP 5.4 as a minium supported version? Think like updating the htmlspecialchars to use the new ENT_SUBSTITUTE option?

          Show
          Tim Hunt added a comment - Sorry, one more thing: Is it worth making a meta-bug for "Things to do when we move to PHP 5.4 as a minium supported version? Think like updating the htmlspecialchars to use the new ENT_SUBSTITUTE option?
          Hide
          Petr Škoda added a comment -

          Some performance ideas:
          1/ if there is no "&" in original string preg_replace() is not necessary, just htmlspecialchars()
          2/ numbers do not need to be escaped at all

          Show
          Petr Škoda added a comment - Some performance ideas: 1/ if there is no "&" in original string preg_replace() is not necessary, just htmlspecialchars() 2/ numbers do not need to be escaped at all
          Hide
          Tim Hunt added a comment -

          From the measurements I did today, I am pretty sure that checking those conditions will be slower than just doing the preg-repace.

          From the numbers output by the test script (basically time in seconds to call s() 40000 times) I think the scores were very roughly:

          1. Any PHP function call: 0.4
          2. htmlspecialchars: 0.6
          3. preg_replace: 1.0
          4. simplifying the if: 0.1
          5. removing the default arguments to htmlspecialchars: 0.1 each.

          There was actually a lot of noise on top of that, but I ran the test script many times.

          I wish we had documented why the preg_replace was necessary at all. I would actually be inclined to make things slower, by adding debugging logic so that any time the preg_replace did something, it displayed a warning. I can't think of a good reason why we want s('&#10') to be '&#10' and not '&#10'. Do you remember.

          Anyway, I have spent all the time I can afford on this. I think I have made things better, so I should submit this for integration. If you want to try and squeeze more performance out here, then we can create a new issue.

          (But, if you want to optimise something, then the load testing results I saw today were, for a course view page:
          get_string: 16%
          s: 3-4%
          mostly it was DB and MUC, of course. I hope the people doing our load-testing will summarise what they learned in a forum post.)

          Show
          Tim Hunt added a comment - From the measurements I did today, I am pretty sure that checking those conditions will be slower than just doing the preg-repace. From the numbers output by the test script (basically time in seconds to call s() 40000 times) I think the scores were very roughly: 1. Any PHP function call: 0.4 2. htmlspecialchars: 0.6 3. preg_replace: 1.0 4. simplifying the if: 0.1 5. removing the default arguments to htmlspecialchars: 0.1 each. There was actually a lot of noise on top of that, but I ran the test script many times. I wish we had documented why the preg_replace was necessary at all. I would actually be inclined to make things slower, by adding debugging logic so that any time the preg_replace did something, it displayed a warning. I can't think of a good reason why we want s('&#10') to be '&#10' and not '&#10'. Do you remember. Anyway, I have spent all the time I can afford on this. I think I have made things better, so I should submit this for integration. If you want to try and squeeze more performance out here, then we can create a new issue. (But, if you want to optimise something, then the load testing results I saw today were, for a course view page: get_string: 16% s: 3-4% mostly it was DB and MUC, of course. I hope the people doing our load-testing will summarise what they learned in a forum post.)
          Hide
          Petr Škoda added a comment -

          It is there because in the old days some browsers were converting all non-ascii chars to html entities.

          Show
          Petr Škoda added a comment - It is there because in the old days some browsers were converting all non-ascii chars to html entities.
          Hide
          Petr Škoda added a comment -

          I suppose large number of s() is with numbers, this might make it faster:

          if ($var === false) {
              return '0';
          }
          $var = (string)$var;
          if ($var === (string)((int)$var)) {
             return $var;
          }
          

          no?

          Show
          Petr Škoda added a comment - I suppose large number of s() is with numbers, this might make it faster: if ($ var === false ) { return '0'; } $ var = (string)$ var ; if ($ var === (string)(( int )$ var )) { return $ var ; } no?
          Hide
          Tim Hunt added a comment -

          As I say, if you want to experiment, please can we create a separate issue.

          Or, you can take over this issue.

          You raise a good point: if a large number of s() is with numbers ...

          It would be interesting to hack S() to write all the inputs to a file. Then browser round some typical Moodle pages. That way, we would know what sort of input we really have to deal with.

          Show
          Tim Hunt added a comment - As I say, if you want to experiment, please can we create a separate issue. Or, you can take over this issue. You raise a good point: if a large number of s() is with numbers ... It would be interesting to hack S() to write all the inputs to a file. Then browser round some typical Moodle pages. That way, we would know what sort of input we really have to deal with.
          Hide
          Petr Škoda added a comment -

          +1

          Show
          Petr Škoda added a comment - +1
          Hide
          Petr Škoda added a comment -

          to integrators: this should be safe for backport, feel free to cherry pick...

          Show
          Petr Škoda added a comment - to integrators: this should be safe for backport, feel free to cherry pick...
          Hide
          Tim Hunt added a comment -

          Thanks Petr. I will make stable branches tomorrow, providing that the integrators have not already cherry-picked this by then.

          Show
          Tim Hunt added a comment - Thanks Petr. I will make stable branches tomorrow, providing that the integrators have not already cherry-picked this by then.
          Hide
          Tim Hunt added a comment -

          Right, stable branches made.

          Show
          Tim Hunt added a comment - Right, stable branches made.
          Hide
          Dan Poltawski added a comment -

          I wasn't keen on seeing the removal of the argument on the stable branches, but Tim points out that it actually has no effect, not even a E_STRICT warning. The argument was removed a long time ago (before my time in Moodle, perhaps, I don't recall seeing code using it).

          And of course, Tim says they've found a measured performance improvement by removing this, so the argument is compelling. Finally, getting this into the stables helps the OU, who are kind enough to contribute this change to us, so - thanks OU! This will land in the stables and i'll blame you for any problems

          Show
          Dan Poltawski added a comment - I wasn't keen on seeing the removal of the argument on the stable branches, but Tim points out that it actually has no effect, not even a E_STRICT warning. The argument was removed a long time ago (before my time in Moodle, perhaps, I don't recall seeing code using it). And of course, Tim says they've found a measured performance improvement by removing this, so the argument is compelling. Finally, getting this into the stables helps the OU, who are kind enough to contribute this change to us, so - thanks OU! This will land in the stables and i'll blame you for any problems
          Hide
          Dan Poltawski added a comment -

          Integrated to master, 24 and 23 - thanks Tim.

          Show
          Dan Poltawski added a comment - Integrated to master, 24 and 23 - thanks Tim.
          Hide
          Dan Poltawski added a comment -

          Looks good on all branches, thanks Tim.

          Show
          Dan Poltawski added a comment - Looks good on all branches, thanks Tim.
          Hide
          Dan Poltawski added a comment -

          Thanks! You're changes are now spread to the world through this git and our source control repositories.

          No time to rest though, we've got days to make 2.5 the best yet!

          ciao

          Show
          Dan Poltawski added a comment - Thanks! You're changes are now spread to the world through this git and our source control repositories. No time to rest though, we've got days to make 2.5 the best yet! ciao

            People

            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: