Moodle
  1. Moodle
  2. MDL-22896

bad regular expression in html2text library causes text to go missing from forum emails

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.9.9, 2.1.4, 2.2.1, 2.3
    • Fix Version/s: 2.1.5, 2.2.2
    • Component/s: Libraries
    • Labels:
    • Testing Instructions:
      Hide

      Note: To test this, you should have email working for forum.

      1. set "Email format" as plain text in your profile.
      2. Add a forum post with following text with "Mail now" checked
        Gin & Tonic
        - 2oz gin;
        - 5oz tonic water;
        - 5 cubes of ice;
        - 1 lime wedge.
        
      3. Run cron /admin/cron.php after 1 min.
      4. Make sure no text is lost.
      Show
      Note: To test this, you should have email working for forum. set "Email format" as plain text in your profile. Add a forum post with following text with "Mail now" checked Gin & Tonic - 2oz gin; - 5oz tonic water; - 5 cubes of ice; - 1 lime wedge. Run cron /admin/cron.php after 1 min. Make sure no text is lost.
    • Difficulty:
      Easy
    • Affected Branches:
      MOODLE_19_STABLE, MOODLE_21_STABLE, MOODLE_22_STABLE, MOODLE_23_STABLE
    • Fixed Branches:
      MOODLE_21_STABLE, MOODLE_22_STABLE
    • Pull Master Branch:
      wip-mdl-22896
    • Rank:
      6092

      Description

      Greetings.. I believe I've found and fixed a bug in the html2text library.

      In /lib/html2text.php...
      ---------------------------
      478 // Remove unknown/unhandled entities (this cannot be done in search-and-replace block)
      479 $text = preg_replace('/&[^&;]+;/i', '', $text);
      ---------------------------

      That regular expression is too greedy... it matches any sequence of characters that starts with an ampersand and ends with a semicolon.

      We've had numerous reports from users that huge chunks of forum posts are missing from the plain-text emails they receive by subscription.

      The problem occurs when someone happens to include an ampersand in their text, and also a semicolon somewhere. Anything between those two characters is filtered out.

      Here's an example...

      Gin & Tonic

      • 2oz gin;
      • 5oz tonic water;
      • 5 cubes of ice;
      • 1 lime wedge.

      if you ran that through html2text, it would output this..

      Gin

      • 5oz tonic water;
      • 5 cubes of ice;
      • 1 lime wedge.

      The simple fix I am testing now is this:
      479 $text = preg_replace('/&[^&;\s]+;/i', '', $text);

      The additional \s makes sure the match stops on whitespace.

      Best regards,
      -Garret

      1. Help Sessions and general admin.20100827140840.txt.withoutfix
        3 kB
        Troy Williams
      2. Help Sessions and general admin.20100827140938.txt.withfix
        3 kB
        Troy Williams
      3. Help Sessions and general admin.html
        4 kB
        Troy Williams

        Issue Links

          Activity

          Hide
          Garret Gengler added a comment -

          I also reported this to the Roundcube Webmail developers (they are the upstream maintainer of html2text.php)...

          They very quickly patched the bug this way:

          479 $text = preg_replace('/&#?[a-z0-9]

          {2,7}

          ;/i', '', $text);

          http://trac.roundcube.net/changeset/3777

          -Garret

          Show
          Garret Gengler added a comment - I also reported this to the Roundcube Webmail developers (they are the upstream maintainer of html2text.php)... They very quickly patched the bug this way: 479 $text = preg_replace('/&#? [a-z0-9] {2,7} ;/i', '', $text); http://trac.roundcube.net/changeset/3777 -Garret
          Hide
          Teresa Gibbison added a comment -

          Hi Garret
          Would this also affect the display of a hyperlink in the email?

          We have a forum post like this
          "As of this week (Friday 30 July 2010), the Friday tutorials will be held in Computing Laboratory 6 (R G.18)."

          here is the html as the words 'Computing Lab 6' and 'RG.18' are hyperlinks to maps on the website

          "As of this week (Friday 30 July 2010), the Friday tutorials will be held in <strong><a href="http://www.scms.waikato.ac.nz/genquery.php?linklevel=2&linklist=SCMS_Rightbar&linkname=Computer_Laboratories&linktype=report&listby=Lab_Title&lwhere=unique_record_id=10&children=">Computing Laboratory 6</a></strong> (<a href="http://www.waikato.ac.nz/contacts/map/?term=R">R G.18</a>)."

          A user is reporting that his plain text email output is:
          "As of this week (Friday 30 July 2010), the Friday tutorials will be held in ()."

          Show
          Teresa Gibbison added a comment - Hi Garret Would this also affect the display of a hyperlink in the email? We have a forum post like this "As of this week (Friday 30 July 2010), the Friday tutorials will be held in Computing Laboratory 6 (R G.18)." here is the html as the words 'Computing Lab 6' and 'RG.18' are hyperlinks to maps on the website "As of this week (Friday 30 July 2010), the Friday tutorials will be held in <strong><a href="http://www.scms.waikato.ac.nz/genquery.php?linklevel=2&linklist=SCMS_Rightbar&linkname=Computer_Laboratories&linktype=report&listby=Lab_Title&lwhere=unique_record_id=10&children=">Computing Laboratory 6</a></strong> (<a href="http://www.waikato.ac.nz/contacts/map/?term=R">R G.18</a>)." A user is reporting that his plain text email output is: "As of this week (Friday 30 July 2010), the Friday tutorials will be held in ()."
          Hide
          Troy Williams added a comment -

          Hi,

          Have attached three files:

          Help Sessions and general admin.html
          -> html we were having problems with.

          Help Sessions and general admin.20100827140840.txt.withoutfix
          -> using $text = preg_replace('/&[^&;]+;/i', '', $text);

          Help Sessions and general admin.20100827140938.txt.withfix
          -> Help Sessions and general admin.20100827140938.txt.withfix

          Works for this example.

          Cheers,

          Troy

          Show
          Troy Williams added a comment - Hi, Have attached three files: Help Sessions and general admin.html -> html we were having problems with. Help Sessions and general admin.20100827140840.txt.withoutfix -> using $text = preg_replace('/& [^&;] +;/i', '', $text); Help Sessions and general admin.20100827140938.txt.withfix -> Help Sessions and general admin.20100827140938.txt.withfix Works for this example. Cheers, Troy
          Hide
          Troy Williams added a comment -

          Script used to check: html2texttest.php

          just put in admin/utils or change path to config.php

          Troy

          Show
          Troy Williams added a comment - Script used to check: html2texttest.php just put in admin/utils or change path to config.php Troy
          Hide
          Steve Bond added a comment -

          The greedy regex is still there in html2text.php in Moodle 2.0

          Show
          Steve Bond added a comment - The greedy regex is still there in html2text.php in Moodle 2.0
          Hide
          Huy Hoang added a comment -

          the bug is still there in Moodle 2.0 (and latest 2.2dev as of today). A quick search shows about 10 references of the function html_to_text():

          admin/cli/upgrade.php
          grade/lib.php
          lib/environmentlib.php
          lib/weblib.php
          mod/lesson/format.php
          question/format.php
          question/format/multianswer/format.php
          question/type/questiontype.php
          question/type/essay/questiontype.php

          Show
          Huy Hoang added a comment - the bug is still there in Moodle 2.0 (and latest 2.2dev as of today). A quick search shows about 10 references of the function html_to_text(): admin/cli/upgrade.php grade/lib.php lib/environmentlib.php lib/weblib.php mod/lesson/format.php question/format.php question/format/multianswer/format.php question/type/questiontype.php question/type/essay/questiontype.php
          Hide
          Garret Gengler added a comment -

          Edited to make it clear that this affects 2.x as well.

          It's an easy fix.. was reported almost 2 years ago.

          The RoundCubeMail developers have since updated their code again... you can see the latest here:
          http://trac.roundcube.net/browser/trunk/roundcubemail/program/lib/html2text.php

          Show
          Garret Gengler added a comment - Edited to make it clear that this affects 2.x as well. It's an easy fix.. was reported almost 2 years ago. The RoundCubeMail developers have since updated their code again... you can see the latest here: http://trac.roundcube.net/browser/trunk/roundcubemail/program/lib/html2text.php
          Hide
          Tim Hunt added a comment -

          Michael, this one slipped through the cracks. Probably worth adding to a sprint soon.

          Show
          Tim Hunt added a comment - Michael, this one slipped through the cracks. Probably worth adding to a sprint soon.
          Hide
          Michael de Raadt added a comment -

          Bumping this issue.

          Show
          Michael de Raadt added a comment - Bumping this issue.
          Hide
          Gerard Caulfield added a comment -

          Looks perfect so long as their are no entities with more than 6 characters and never will be?

          Show
          Gerard Caulfield added a comment - Looks perfect so long as their are no entities with more than 6 characters and never will be?
          Hide
          Aparup Banerjee added a comment -

          Hi Raj,
          i've found the relevant change that was made upstream here :

          http://trac.roundcube.net/changeset?reponame=&new=4223%40trunk%2Froundcubemail%2Fprogram%2Flib%2Fhtml2text.php&old=4013%40trunk%2Froundcubemail%2Fprogram%2Flib%2Fhtml2text.php

          The change was made with a couple of other changes too so this fix may be out of sync/context.

          Perhaps we should consider updating the entire 'library' file here to the latest maintained copy rather than fixing the one line ?

          cheers,
          Aparup

          Show
          Aparup Banerjee added a comment - Hi Raj, i've found the relevant change that was made upstream here : http://trac.roundcube.net/changeset?reponame=&new=4223%40trunk%2Froundcubemail%2Fprogram%2Flib%2Fhtml2text.php&old=4013%40trunk%2Froundcubemail%2Fprogram%2Flib%2Fhtml2text.php The change was made with a couple of other changes too so this fix may be out of sync/context. Perhaps we should consider updating the entire 'library' file here to the latest maintained copy rather than fixing the one line ? cheers, Aparup
          Hide
          Rajesh Taneja added a comment -

          Thanks Aparup,

          I saw this change, but was not sure if whole file should be updated. If it's fine to do it as part of stable team then I would love to do that.

          My only concern, in doing this is:

          1. Is it ok to update library as part of stable work (As asked above)
          2. Can this be backported to 22 and 21, I feel if not then please have this fix integrated and I will open another bug for master only.
          Show
          Rajesh Taneja added a comment - Thanks Aparup, I saw this change, but was not sure if whole file should be updated. If it's fine to do it as part of stable team then I would love to do that. My only concern, in doing this is: Is it ok to update library as part of stable work (As asked above) Can this be backported to 22 and 21, I feel if not then please have this fix integrated and I will open another bug for master only.
          Hide
          Aparup Banerjee added a comment -

          Thanks for the report and work here guys, this has been integrated and is up for testing.

          Raj: went with (2) (don't see any regressions but the changes are out of context, still - i've created MDL-31805)

          Please ensure that there are no regressions, the change here does seem to fix the reported issue but the fix seems to be out of context. Note the parentheses introduced here are there because the regex was pulled from the latest version of this file. Although harmless, this simply means there are other changes to the library relevant to this latest regex.

          This library is not up to date, i've created MDL-31805 to look at updating it.

          Show
          Aparup Banerjee added a comment - Thanks for the report and work here guys, this has been integrated and is up for testing. Raj: went with (2) (don't see any regressions but the changes are out of context, still - i've created MDL-31805 ) Please ensure that there are no regressions, the change here does seem to fix the reported issue but the fix seems to be out of context. Note the parentheses introduced here are there because the regex was pulled from the latest version of this file. Although harmless, this simply means there are other changes to the library relevant to this latest regex. This library is not up to date, i've created MDL-31805 to look at updating it.
          Hide
          Rajesh Taneja added a comment -

          Thanks Aparup

          Show
          Rajesh Taneja added a comment - Thanks Aparup
          Hide
          Ankit Agarwal added a comment -

          All good.
          Passing!
          Thanks

          Show
          Ankit Agarwal added a comment - All good. Passing! Thanks
          Hide
          Eloy Lafuente (stronk7) added a comment -

          Well,

          I wish I said it every time
          you do the things you do.
          You always lend a helping hand,
          and I'm filled with gratitude.

          You are strong and generous
          for each and everyone one of us.
          I am eternally grateful,
          I cannot say thanks enough.

          Sorry for the (un)cool bit above, lol. Closing this as fixed. Ciao

          Show
          Eloy Lafuente (stronk7) added a comment - Well, I wish I said it every time you do the things you do. You always lend a helping hand, and I'm filled with gratitude. You are strong and generous for each and everyone one of us. I am eternally grateful, I cannot say thanks enough. Sorry for the (un)cool bit above, lol. Closing this as fixed. Ciao

            People

            • Votes:
              9 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: