Moodle
  1. Moodle
  2. MDL-31142

slow textlib::substr() and invalid result if length not specified

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.1, 2.3
    • Fix Version/s: 2.2.2
    • Component/s: Libraries
    • Labels:
    • Environment:
      64bit Linux - Debian Squeeze
      Note, Debian uses glibc for iconv_substr; MacOS uses libiconv which is apparently far more efficient
    • Rank:
      37574

      Description

      We noticed this when looking at an issue with timeouts in the rss_client block.
      The failure we were seeing was caused by the rss_client block calling lib/web.php->break_up_long_words($description, 30) where $description was a string of around 4,600 characters.
      break_up_long_words() instantiates a new textlib class and calls $textlib->substr on each character in turn; which in turn performs an iconv_substr on the supplied text.

      Effectively, the code is calling:

      $strlen = $textlib->strlen($text);
      for ($i = 0; $i < $strlen; $i++) {
        $foo += $textlib->substr($text, $i, 1);
      }
      

      I'll attach some code which demonstrates the performance compared with mb_string()

        Issue Links

          Activity

          Hide
          Eloy Lafuente (stronk7) added a comment - - edited

          Pasted from HQ:

          [17:10:08] <Cindereloy> I still don't get why we changed the fallback order in last iteration (MDL-29027) of textlib.
          [17:10:33] <Cindereloy> anyway, it's a matter of changing it again, just that.
          [17:11:51] <Cindereloy> by default the Typo3 libraries that we use as backend do always encode/mb/iconv/custom code fallback. (correct). We just introduced one iconv shortcut in our wrapper, nor sure why.

          Show
          Eloy Lafuente (stronk7) added a comment - - edited Pasted from HQ: [17:10:08] <Cindereloy> I still don't get why we changed the fallback order in last iteration ( MDL-29027 ) of textlib. [17:10:33] <Cindereloy> anyway, it's a matter of changing it again, just that. [17:11:51] <Cindereloy> by default the Typo3 libraries that we use as backend do always encode/mb/iconv/custom code fallback. (correct). We just introduced one iconv shortcut in our wrapper, nor sure why.
          Hide
          Andrew Nicols added a comment -

          Some stats from my box.

          Quad-Core i7 with HT (8 thread capable):

          Checking mb_substr with a length of 2500

          Took 0.01346492767334 seconds to process 2500 characters. That's 5.3859710693359E-6 seconds per character

          Checking iconv_substr with a length of 2500

          Took 11.092577934265 seconds to process 2500 characters. That's 0.0044370311737061 seconds per character

          Difference of 11.079113006592 seconds - 0.0044316452026367 seconds per character

          Show
          Andrew Nicols added a comment - Some stats from my box. Quad-Core i7 with HT (8 thread capable): Checking mb_substr with a length of 2500 Took 0.01346492767334 seconds to process 2500 characters. That's 5.3859710693359E-6 seconds per character Checking iconv_substr with a length of 2500 Took 11.092577934265 seconds to process 2500 characters. That's 0.0044370311737061 seconds per character Difference of 11.079113006592 seconds - 0.0044316452026367 seconds per character
          Hide
          Eloy Lafuente (stronk7) added a comment -

          Pasted from HQ:

          [17:34:43] <sam marshall> btw I believe nowadays all browsers support the CSS option to wrap long words (instead of breaking the layout)...
          [17:35:08] <Cindereloy> really? wow, didn't know.
          [17:35:26] <Cindereloy> +1 for that if all uses are web-based
          [17:40:01] <Cindereloy> http://w3schools.com/cssref/css3_pr_word-wrap.asp
          [17:40:27] <Cindereloy> gtg, ciaoooo
          [17:48:19] <sam marshall> We added this to our body tag in theme:
          [17:48:21] <sam marshall> word-wrap: break-word;
          -ms-word-wrap: break-word;

          [17:48:45] <sam marshall> I think I did this some weeks back, so ask me later if we have had any problems, if considering adding it to standard moodle themes So far haven't seen any.

          Ciao

          Show
          Eloy Lafuente (stronk7) added a comment - Pasted from HQ: [17:34:43] <sam marshall> btw I believe nowadays all browsers support the CSS option to wrap long words (instead of breaking the layout)... [17:35:08] <Cindereloy> really? wow, didn't know. [17:35:26] <Cindereloy> +1 for that if all uses are web-based [17:40:01] <Cindereloy> http://w3schools.com/cssref/css3_pr_word-wrap.asp [17:40:27] <Cindereloy> gtg, ciaoooo [17:48:19] <sam marshall> We added this to our body tag in theme: [17:48:21] <sam marshall> word-wrap: break-word; -ms-word-wrap: break-word; [17:48:45] <sam marshall> I think I did this some weeks back, so ask me later if we have had any problems, if considering adding it to standard moodle themes So far haven't seen any. Ciao
          Hide
          Andrew Nicols added a comment -

          Are we now saying that we should ditch the break_up_long_words() function (or at least in the long-run) and apply CSS instead?

          The CSS3 word-wrap tag seems to have had support in browsers for some time:

          I'm not sure what the official line is for support of Opera and Firefox, but both now support Auto-Updating so I'd expect most users to be on later versions by now.

          Show
          Andrew Nicols added a comment - Are we now saying that we should ditch the break_up_long_words() function (or at least in the long-run) and apply CSS instead? The CSS3 word-wrap tag seems to have had support in browsers for some time: Firefox - Gecko 1.9.1/Firefox3.5+ - June 2009 ( https://developer.mozilla.org/en/CSS/word-wrap https://developer.mozilla.org/En/Firefox_3.5_for_developers ) IE - IE5+ - March 1999 ( http://msdn.microsoft.com/en-us/library/cc351024(v=vs.85).aspx ) Chrome/Safari - WebKit since birth (used in Chrome and Safari) (can't find a link to verify but multiple sources suggest this) Opera - Presto 2.5 Partial - March 2010; Presto 2.6+ Full Support - July 2010 ( http://www.opera.com/docs/specs/presto29/css/text/ http://www.opera.com/docs/history/#o95 ) I'm not sure what the official line is for support of Opera and Firefox, but both now support Auto-Updating so I'd expect most users to be on later versions by now.
          Hide
          Petr Škoda added a comment -

          While testing the patch I have discovered that our textlib::substr() did not work if length was not specified of was null and mbstring was not available.

          I have changed all methods to try mb_string first.

          Thanks a lot for the report.

          Show
          Petr Škoda added a comment - While testing the patch I have discovered that our textlib::substr() did not work if length was not specified of was null and mbstring was not available. I have changed all methods to try mb_string first. Thanks a lot for the report.
          Hide
          Eloy Lafuente (stronk7) added a comment -

          Sorry Petr, but I think that mb_substr() behaves somehow like iconv and, if you have to pass the 4th param ('UTF-8') then you cannot pass null length because it means 0 (and not all).

          In fact your tests are failing here with:

          Fail: lib/simpletest/testtextlib.php / ▶ textlib_test / ▶ test_substr
          Identical expectation [String: ] fails with [String: Žluťoučký koníček] at character 0 with [] and [Žluťoučký koníček] at [lib/simpletest/testtextlib.php line 86]
          
          Fail: lib/simpletest/testtextlib.php / ▶ textlib_test / ▶ test_substr
          Identical expectation [String: ] fails with [String: luťoučký koníček] at character 0 with [] and [luťoučký koníček] at [lib/simpletest/testtextlib.php line 87]
          

          Possible workarounds (alternative) are:

          1) Check for null lengths, and do the same than in the iconv case (calculate len).
          2) Get current mb encoding, change it to UTF-8, execute mb_substr() with only 2 params and reset mb encoding back to original.

          Reopening, sorry, ciao

          Show
          Eloy Lafuente (stronk7) added a comment - Sorry Petr, but I think that mb_substr() behaves somehow like iconv and, if you have to pass the 4th param ('UTF-8') then you cannot pass null length because it means 0 (and not all). In fact your tests are failing here with: Fail: lib/simpletest/testtextlib.php / ▶ textlib_test / ▶ test_substr Identical expectation [ String : ] fails with [ String : Žluťoučký koníček] at character 0 with [] and [Žluťoučký koníček] at [lib/simpletest/testtextlib.php line 86] Fail: lib/simpletest/testtextlib.php / ▶ textlib_test / ▶ test_substr Identical expectation [ String : ] fails with [ String : luťoučký koníček] at character 0 with [] and [luťoučký koníček] at [lib/simpletest/testtextlib.php line 87] Possible workarounds (alternative) are: 1) Check for null lengths, and do the same than in the iconv case (calculate len). 2) Get current mb encoding, change it to UTF-8, execute mb_substr() with only 2 params and reset mb encoding back to original. Reopening, sorry, ciao
          Hide
          Petr Škoda added a comment -

          arrggh, at least the tests detected it, I was a bit surprised it worked here, fixing

          thanks a lot

          Show
          Petr Škoda added a comment - arrggh, at least the tests detected it, I was a bit surprised it worked here, fixing thanks a lot
          Hide
          Petr Škoda added a comment -

          fixed, it was failing or me now - that means I forgot to restart apache after messing with PHP extensions, sorry

          I have used the trick with default encoding because it looked faster, I did not bother in case of iconv though...

          Show
          Petr Škoda added a comment - fixed, it was failing or me now - that means I forgot to restart apache after messing with PHP extensions, sorry I have used the trick with default encoding because it looked faster, I did not bother in case of iconv though...
          Hide
          Eloy Lafuente (stronk7) added a comment -

          Integrated (22 and master), thanks!

          Show
          Eloy Lafuente (stronk7) added a comment - Integrated (22 and master), thanks!
          Hide
          Rajesh Taneja added a comment -

          Works Great
          Thanks for fixing this, Petr.

          One question:

          1. Why do we force user to use charset to be small case utf-8 ?
          2. If we hard code this to small case utf-8 then why description in dockblock is change to reflect capital case UTF-8
          3. While passing charset to mb_substr and iconv_substr, it's hard coded. There is no harm in this, but it might be nice to use passed $charset after checking it in if statement as utf-8
          Show
          Rajesh Taneja added a comment - Works Great Thanks for fixing this, Petr. One question: Why do we force user to use charset to be small case utf-8 ? If we hard code this to small case utf-8 then why description in dockblock is change to reflect capital case UTF-8 While passing charset to mb_substr and iconv_substr, it's hard coded. There is no harm in this, but it might be nice to use passed $charset after checking it in if statement as utf-8
          Hide
          Petr Škoda added a comment -

          typo3 is using lowercase, the rest of the world usually uppercase, our API allows both

          Show
          Petr Škoda added a comment - typo3 is using lowercase, the rest of the world usually uppercase, our API allows both
          Hide
          Rajesh Taneja added a comment -

          Thanks for clarifying this Petr,
          Should we use strtolower or strtoupper to deal with this? or put more description in docblock to make it more clear?

          Show
          Rajesh Taneja added a comment - Thanks for clarifying this Petr, Should we use strtolower or strtoupper to deal with this? or put more description in docblock to make it more clear?
          Hide
          Petr Škoda added a comment -

          there is already a normalisation call in each textlib function that needs encodings, devs can use anything there

          Show
          Petr Škoda added a comment - there is already a normalisation call in each textlib function that needs encodings, devs can use anything there
          Hide
          Eloy Lafuente (stronk7) added a comment -

          This is now available in the git and cvs repositories.

          Consider the responsibility of your fingerprints engraved there for future generations!

          Thanks for the work, closing, ciao

          Show
          Eloy Lafuente (stronk7) added a comment - This is now available in the git and cvs repositories. Consider the responsibility of your fingerprints engraved there for future generations! Thanks for the work, closing, ciao

            People

            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: