Uploaded image for project: 'Moodle'
  1. Moodle
  2. MDL-31142

slow textlib::substr() and invalid result if length not specified

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.1, 2.3
    • Fix Version/s: 2.2.2
    • Component/s: Libraries
    • Labels:
    • Environment:
      64bit Linux - Debian Squeeze
      Note, Debian uses glibc for iconv_substr; MacOS uses libiconv which is apparently far more efficient

      Description

      We noticed this when looking at an issue with timeouts in the rss_client block.
      The failure we were seeing was caused by the rss_client block calling lib/web.php->break_up_long_words($description, 30) where $description was a string of around 4,600 characters.
      break_up_long_words() instantiates a new textlib class and calls $textlib->substr on each character in turn; which in turn performs an iconv_substr on the supplied text.

      Effectively, the code is calling:

      $strlen = $textlib->strlen($text);
      for ($i = 0; $i < $strlen; $i++) {
        $foo += $textlib->substr($text, $i, 1);
      }

      I'll attach some code which demonstrates the performance compared with mb_string()

        Gliffy Diagrams

          Attachments

            Issue Links

              Activity

              Hide
              stronk7 Eloy Lafuente (stronk7) added a comment - - edited

              Pasted from HQ:

              [17:10:08] <Cindereloy> I still don't get why we changed the fallback order in last iteration (MDL-29027) of textlib.
              [17:10:33] <Cindereloy> anyway, it's a matter of changing it again, just that.
              [17:11:51] <Cindereloy> by default the Typo3 libraries that we use as backend do always encode/mb/iconv/custom code fallback. (correct). We just introduced one iconv shortcut in our wrapper, nor sure why.

              Show
              stronk7 Eloy Lafuente (stronk7) added a comment - - edited Pasted from HQ: [17:10:08] <Cindereloy> I still don't get why we changed the fallback order in last iteration ( MDL-29027 ) of textlib. [17:10:33] <Cindereloy> anyway, it's a matter of changing it again, just that. [17:11:51] <Cindereloy> by default the Typo3 libraries that we use as backend do always encode/mb/iconv/custom code fallback. (correct). We just introduced one iconv shortcut in our wrapper, nor sure why.
              Hide
              dobedobedoh Andrew Nicols added a comment -

              Some stats from my box.

              Quad-Core i7 with HT (8 thread capable):

              Checking mb_substr with a length of 2500

              Took 0.01346492767334 seconds to process 2500 characters. That's 5.3859710693359E-6 seconds per character

              Checking iconv_substr with a length of 2500

              Took 11.092577934265 seconds to process 2500 characters. That's 0.0044370311737061 seconds per character

              Difference of 11.079113006592 seconds - 0.0044316452026367 seconds per character

              Show
              dobedobedoh Andrew Nicols added a comment - Some stats from my box. Quad-Core i7 with HT (8 thread capable): Checking mb_substr with a length of 2500 Took 0.01346492767334 seconds to process 2500 characters. That's 5.3859710693359E-6 seconds per character Checking iconv_substr with a length of 2500 Took 11.092577934265 seconds to process 2500 characters. That's 0.0044370311737061 seconds per character Difference of 11.079113006592 seconds - 0.0044316452026367 seconds per character
              Hide
              stronk7 Eloy Lafuente (stronk7) added a comment -

              Pasted from HQ:

              [17:34:43] <sam marshall> btw I believe nowadays all browsers support the CSS option to wrap long words (instead of breaking the layout)...
              [17:35:08] <Cindereloy> really? wow, didn't know.
              [17:35:26] <Cindereloy> +1 for that if all uses are web-based
              [17:40:01] <Cindereloy> http://w3schools.com/cssref/css3_pr_word-wrap.asp
              [17:40:27] <Cindereloy> gtg, ciaoooo
              [17:48:19] <sam marshall> We added this to our body tag in theme:
              [17:48:21] <sam marshall> word-wrap: break-word;
              -ms-word-wrap: break-word;

              [17:48:45] <sam marshall> I think I did this some weeks back, so ask me later if we have had any problems, if considering adding it to standard moodle themes So far haven't seen any.

              Ciao

              Show
              stronk7 Eloy Lafuente (stronk7) added a comment - Pasted from HQ: [17:34:43] <sam marshall> btw I believe nowadays all browsers support the CSS option to wrap long words (instead of breaking the layout)... [17:35:08] <Cindereloy> really? wow, didn't know. [17:35:26] <Cindereloy> +1 for that if all uses are web-based [17:40:01] <Cindereloy> http://w3schools.com/cssref/css3_pr_word-wrap.asp [17:40:27] <Cindereloy> gtg, ciaoooo [17:48:19] <sam marshall> We added this to our body tag in theme: [17:48:21] <sam marshall> word-wrap: break-word; -ms-word-wrap: break-word; [17:48:45] <sam marshall> I think I did this some weeks back, so ask me later if we have had any problems, if considering adding it to standard moodle themes So far haven't seen any. Ciao
              Hide
              dobedobedoh Andrew Nicols added a comment -

              Are we now saying that we should ditch the break_up_long_words() function (or at least in the long-run) and apply CSS instead?

              The CSS3 word-wrap tag seems to have had support in browsers for some time:

              I'm not sure what the official line is for support of Opera and Firefox, but both now support Auto-Updating so I'd expect most users to be on later versions by now.

              Show
              dobedobedoh Andrew Nicols added a comment - Are we now saying that we should ditch the break_up_long_words() function (or at least in the long-run) and apply CSS instead? The CSS3 word-wrap tag seems to have had support in browsers for some time: Firefox - Gecko 1.9.1/Firefox3.5+ - June 2009 ( https://developer.mozilla.org/en/CSS/word-wrap https://developer.mozilla.org/En/Firefox_3.5_for_developers ) IE - IE5+ - March 1999 ( http://msdn.microsoft.com/en-us/library/cc351024(v=vs.85).aspx ) Chrome/Safari - WebKit since birth (used in Chrome and Safari) (can't find a link to verify but multiple sources suggest this) Opera - Presto 2.5 Partial - March 2010; Presto 2.6+ Full Support - July 2010 ( http://www.opera.com/docs/specs/presto29/css/text/ http://www.opera.com/docs/history/#o95 ) I'm not sure what the official line is for support of Opera and Firefox, but both now support Auto-Updating so I'd expect most users to be on later versions by now.
              Hide
              skodak Petr Skoda added a comment -

              While testing the patch I have discovered that our textlib::substr() did not work if length was not specified of was null and mbstring was not available.

              I have changed all methods to try mb_string first.

              Thanks a lot for the report.

              Show
              skodak Petr Skoda added a comment - While testing the patch I have discovered that our textlib::substr() did not work if length was not specified of was null and mbstring was not available. I have changed all methods to try mb_string first. Thanks a lot for the report.
              Hide
              stronk7 Eloy Lafuente (stronk7) added a comment -

              Sorry Petr, but I think that mb_substr() behaves somehow like iconv and, if you have to pass the 4th param ('UTF-8') then you cannot pass null length because it means 0 (and not all).

              In fact your tests are failing here with:

              Fail: lib/simpletest/testtextlib.php / ▶ textlib_test / ▶ test_substr
              Identical expectation [String: ] fails with [String: Žluťoučký koníček] at character 0 with [] and [Žluťoučký koníček] at [lib/simpletest/testtextlib.php line 86]
               
              Fail: lib/simpletest/testtextlib.php / ▶ textlib_test / ▶ test_substr
              Identical expectation [String: ] fails with [String: luťoučký koníček] at character 0 with [] and [luťoučký koníček] at [lib/simpletest/testtextlib.php line 87]

              Possible workarounds (alternative) are:

              1) Check for null lengths, and do the same than in the iconv case (calculate len).
              2) Get current mb encoding, change it to UTF-8, execute mb_substr() with only 2 params and reset mb encoding back to original.

              Reopening, sorry, ciao

              Show
              stronk7 Eloy Lafuente (stronk7) added a comment - Sorry Petr, but I think that mb_substr() behaves somehow like iconv and, if you have to pass the 4th param ('UTF-8') then you cannot pass null length because it means 0 (and not all). In fact your tests are failing here with: Fail: lib/simpletest/testtextlib.php / ▶ textlib_test / ▶ test_substr Identical expectation [String: ] fails with [String: Žluťoučký koníček] at character 0 with [] and [Žluťoučký koníček] at [lib/simpletest/testtextlib.php line 86]   Fail: lib/simpletest/testtextlib.php / ▶ textlib_test / ▶ test_substr Identical expectation [String: ] fails with [String: luťoučký koníček] at character 0 with [] and [luťoučký koníček] at [lib/simpletest/testtextlib.php line 87] Possible workarounds (alternative) are: 1) Check for null lengths, and do the same than in the iconv case (calculate len). 2) Get current mb encoding, change it to UTF-8, execute mb_substr() with only 2 params and reset mb encoding back to original. Reopening, sorry, ciao
              Hide
              skodak Petr Skoda added a comment -

              arrggh, at least the tests detected it, I was a bit surprised it worked here, fixing

              thanks a lot

              Show
              skodak Petr Skoda added a comment - arrggh, at least the tests detected it, I was a bit surprised it worked here, fixing thanks a lot
              Hide
              skodak Petr Skoda added a comment -

              fixed, it was failing or me now - that means I forgot to restart apache after messing with PHP extensions, sorry

              I have used the trick with default encoding because it looked faster, I did not bother in case of iconv though...

              Show
              skodak Petr Skoda added a comment - fixed, it was failing or me now - that means I forgot to restart apache after messing with PHP extensions, sorry I have used the trick with default encoding because it looked faster, I did not bother in case of iconv though...
              Hide
              stronk7 Eloy Lafuente (stronk7) added a comment -

              Integrated (22 and master), thanks!

              Show
              stronk7 Eloy Lafuente (stronk7) added a comment - Integrated (22 and master), thanks!
              Hide
              rajeshtaneja Rajesh Taneja added a comment -

              Works Great
              Thanks for fixing this, Petr.

              One question:

              1. Why do we force user to use charset to be small case utf-8 ?
              2. If we hard code this to small case utf-8 then why description in dockblock is change to reflect capital case UTF-8
              3. While passing charset to mb_substr and iconv_substr, it's hard coded. There is no harm in this, but it might be nice to use passed $charset after checking it in if statement as utf-8
              Show
              rajeshtaneja Rajesh Taneja added a comment - Works Great Thanks for fixing this, Petr. One question: Why do we force user to use charset to be small case utf-8 ? If we hard code this to small case utf-8 then why description in dockblock is change to reflect capital case UTF-8 While passing charset to mb_substr and iconv_substr, it's hard coded. There is no harm in this, but it might be nice to use passed $charset after checking it in if statement as utf-8
              Hide
              skodak Petr Skoda added a comment -

              typo3 is using lowercase, the rest of the world usually uppercase, our API allows both

              Show
              skodak Petr Skoda added a comment - typo3 is using lowercase, the rest of the world usually uppercase, our API allows both
              Hide
              rajeshtaneja Rajesh Taneja added a comment -

              Thanks for clarifying this Petr,
              Should we use strtolower or strtoupper to deal with this? or put more description in docblock to make it more clear?

              Show
              rajeshtaneja Rajesh Taneja added a comment - Thanks for clarifying this Petr, Should we use strtolower or strtoupper to deal with this? or put more description in docblock to make it more clear?
              Hide
              skodak Petr Skoda added a comment -

              there is already a normalisation call in each textlib function that needs encodings, devs can use anything there

              Show
              skodak Petr Skoda added a comment - there is already a normalisation call in each textlib function that needs encodings, devs can use anything there
              Hide
              stronk7 Eloy Lafuente (stronk7) added a comment -

              This is now available in the git and cvs repositories.

              Consider the responsibility of your fingerprints engraved there for future generations!

              Thanks for the work, closing, ciao

              Show
              stronk7 Eloy Lafuente (stronk7) added a comment - This is now available in the git and cvs repositories. Consider the responsibility of your fingerprints engraved there for future generations! Thanks for the work, closing, ciao

                People

                • Votes:
                  0 Vote for this issue
                  Watchers:
                  1 Start watching this issue

                  Dates

                  • Created:
                    Updated:
                    Resolved:
                    Fix Release Date:
                    12/Mar/12