Uploaded image for project: 'Moodle'
  1. Moodle
  2. MDL-22896

bad regular expression in html2text library causes text to go missing from forum emails

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.9.9, 2.1.4, 2.2.1, 2.3
    • Fix Version/s: 2.1.5, 2.2.2
    • Component/s: Libraries
    • Labels:
    • Testing Instructions:
      Hide

      Note: To test this, you should have email working for forum.

      1. set "Email format" as plain text in your profile.
      2. Add a forum post with following text with "Mail now" checked

        Gin & Tonic
        - 2oz gin;
        - 5oz tonic water;
        - 5 cubes of ice;
        - 1 lime wedge.

      3. Run cron /admin/cron.php after 1 min.
      4. Make sure no text is lost.
      Show
      Note: To test this, you should have email working for forum. set "Email format" as plain text in your profile. Add a forum post with following text with "Mail now" checked Gin & Tonic - 2oz gin; - 5oz tonic water; - 5 cubes of ice; - 1 lime wedge. Run cron /admin/cron.php after 1 min. Make sure no text is lost.
    • Difficulty:
      Easy
    • Affected Branches:
      MOODLE_19_STABLE, MOODLE_21_STABLE, MOODLE_22_STABLE, MOODLE_23_STABLE
    • Fixed Branches:
      MOODLE_21_STABLE, MOODLE_22_STABLE
    • Pull Master Branch:
      wip-mdl-22896

      Description

      Greetings.. I believe I've found and fixed a bug in the html2text library.

      In /lib/html2text.php...
      ---------------------------
      478 // Remove unknown/unhandled entities (this cannot be done in search-and-replace block)
      479 $text = preg_replace('/&[^&;]+;/i', '', $text);
      ---------------------------

      That regular expression is too greedy... it matches any sequence of characters that starts with an ampersand and ends with a semicolon.

      We've had numerous reports from users that huge chunks of forum posts are missing from the plain-text emails they receive by subscription.

      The problem occurs when someone happens to include an ampersand in their text, and also a semicolon somewhere. Anything between those two characters is filtered out.

      Here's an example...

      Gin & Tonic

      • 2oz gin;
      • 5oz tonic water;
      • 5 cubes of ice;
      • 1 lime wedge.

      if you ran that through html2text, it would output this..

      Gin

      • 5oz tonic water;
      • 5 cubes of ice;
      • 1 lime wedge.

      The simple fix I am testing now is this:
      479 $text = preg_replace('/&[^&;\s]+;/i', '', $text);

      The additional \s makes sure the match stops on whitespace.

      Best regards,
      -Garret

        Gliffy Diagrams

        1. Help Sessions and general admin.20100827140840.txt.withoutfix
          3 kB
          Troy Williams
        2. Help Sessions and general admin.20100827140938.txt.withfix
          3 kB
          Troy Williams
        3. Help Sessions and general admin.html
          4 kB
          Troy Williams

          Issue Links

            Activity

            Hide
            garretg Garret Gengler added a comment -

            I also reported this to the Roundcube Webmail developers (they are the upstream maintainer of html2text.php)...

            They very quickly patched the bug this way:

            479 $text = preg_replace('/&#?[a-z0-9]

            {2,7}

            ;/i', '', $text);

            http://trac.roundcube.net/changeset/3777

            -Garret

            Show
            garretg Garret Gengler added a comment - I also reported this to the Roundcube Webmail developers (they are the upstream maintainer of html2text.php)... They very quickly patched the bug this way: 479 $text = preg_replace('/&#? [a-z0-9] {2,7} ;/i', '', $text); http://trac.roundcube.net/changeset/3777 -Garret
            Hide
            cttxg Teresa Gibbison added a comment -

            Hi Garret
            Would this also affect the display of a hyperlink in the email?

            We have a forum post like this
            "As of this week (Friday 30 July 2010), the Friday tutorials will be held in Computing Laboratory 6 (R G.18)."

            here is the html as the words 'Computing Lab 6' and 'RG.18' are hyperlinks to maps on the website

            "As of this week (Friday 30 July 2010), the Friday tutorials will be held in <strong><a href="http://www.scms.waikato.ac.nz/genquery.php?linklevel=2&linklist=SCMS_Rightbar&linkname=Computer_Laboratories&linktype=report&listby=Lab_Title&lwhere=unique_record_id=10&children=">Computing Laboratory 6</a></strong> (<a href="http://www.waikato.ac.nz/contacts/map/?term=R">R G.18</a>)."

            A user is reporting that his plain text email output is:
            "As of this week (Friday 30 July 2010), the Friday tutorials will be held in ()."

            Show
            cttxg Teresa Gibbison added a comment - Hi Garret Would this also affect the display of a hyperlink in the email? We have a forum post like this "As of this week (Friday 30 July 2010), the Friday tutorials will be held in Computing Laboratory 6 (R G.18)." here is the html as the words 'Computing Lab 6' and 'RG.18' are hyperlinks to maps on the website "As of this week (Friday 30 July 2010), the Friday tutorials will be held in <strong><a href="http://www.scms.waikato.ac.nz/genquery.php?linklevel=2&linklist=SCMS_Rightbar&linkname=Computer_Laboratories&linktype=report&listby=Lab_Title&lwhere=unique_record_id=10&children=">Computing Laboratory 6</a></strong> (<a href="http://www.waikato.ac.nz/contacts/map/?term=R">R G.18</a>)." A user is reporting that his plain text email output is: "As of this week (Friday 30 July 2010), the Friday tutorials will be held in ()."
            Hide
            twilliams Troy Williams added a comment -

            Hi,

            Have attached three files:

            Help Sessions and general admin.html
            -> html we were having problems with.

            Help Sessions and general admin.20100827140840.txt.withoutfix
            -> using $text = preg_replace('/&[^&;]+;/i', '', $text);

            Help Sessions and general admin.20100827140938.txt.withfix
            -> Help Sessions and general admin.20100827140938.txt.withfix

            Works for this example.

            Cheers,

            Troy

            Show
            twilliams Troy Williams added a comment - Hi, Have attached three files: Help Sessions and general admin.html -> html we were having problems with. Help Sessions and general admin.20100827140840.txt.withoutfix -> using $text = preg_replace('/& [^&;] +;/i', '', $text); Help Sessions and general admin.20100827140938.txt.withfix -> Help Sessions and general admin.20100827140938.txt.withfix Works for this example. Cheers, Troy
            Hide
            twilliams Troy Williams added a comment -

            Script used to check: html2texttest.php

            just put in admin/utils or change path to config.php

            Troy

            Show
            twilliams Troy Williams added a comment - Script used to check: html2texttest.php just put in admin/utils or change path to config.php Troy
            Hide
            daveyboond Steve Bond added a comment -

            The greedy regex is still there in html2text.php in Moodle 2.0

            Show
            daveyboond Steve Bond added a comment - The greedy regex is still there in html2text.php in Moodle 2.0
            Hide
            sillyxone Huy Hoang added a comment -

            the bug is still there in Moodle 2.0 (and latest 2.2dev as of today). A quick search shows about 10 references of the function html_to_text():

            admin/cli/upgrade.php
            grade/lib.php
            lib/environmentlib.php
            lib/weblib.php
            mod/lesson/format.php
            question/format.php
            question/format/multianswer/format.php
            question/type/questiontype.php
            question/type/essay/questiontype.php

            Show
            sillyxone Huy Hoang added a comment - the bug is still there in Moodle 2.0 (and latest 2.2dev as of today). A quick search shows about 10 references of the function html_to_text(): admin/cli/upgrade.php grade/lib.php lib/environmentlib.php lib/weblib.php mod/lesson/format.php question/format.php question/format/multianswer/format.php question/type/questiontype.php question/type/essay/questiontype.php
            Hide
            garretg Garret Gengler added a comment -

            Edited to make it clear that this affects 2.x as well.

            It's an easy fix.. was reported almost 2 years ago.

            The RoundCubeMail developers have since updated their code again... you can see the latest here:
            http://trac.roundcube.net/browser/trunk/roundcubemail/program/lib/html2text.php

            Show
            garretg Garret Gengler added a comment - Edited to make it clear that this affects 2.x as well. It's an easy fix.. was reported almost 2 years ago. The RoundCubeMail developers have since updated their code again... you can see the latest here: http://trac.roundcube.net/browser/trunk/roundcubemail/program/lib/html2text.php
            Hide
            timhunt Tim Hunt added a comment -

            Michael, this one slipped through the cracks. Probably worth adding to a sprint soon.

            Show
            timhunt Tim Hunt added a comment - Michael, this one slipped through the cracks. Probably worth adding to a sprint soon.
            Hide
            salvetore Michael de Raadt added a comment -

            Bumping this issue.

            Show
            salvetore Michael de Raadt added a comment - Bumping this issue.
            Hide
            gerry Gerard Caulfield added a comment -

            Looks perfect so long as their are no entities with more than 6 characters and never will be?

            Show
            gerry Gerard Caulfield added a comment - Looks perfect so long as their are no entities with more than 6 characters and never will be?
            Hide
            nebgor Aparup Banerjee added a comment -

            Hi Raj,
            i've found the relevant change that was made upstream here :

            http://trac.roundcube.net/changeset?reponame=&new=4223%40trunk%2Froundcubemail%2Fprogram%2Flib%2Fhtml2text.php&old=4013%40trunk%2Froundcubemail%2Fprogram%2Flib%2Fhtml2text.php

            The change was made with a couple of other changes too so this fix may be out of sync/context.

            Perhaps we should consider updating the entire 'library' file here to the latest maintained copy rather than fixing the one line ?

            cheers,
            Aparup

            Show
            nebgor Aparup Banerjee added a comment - Hi Raj, i've found the relevant change that was made upstream here : http://trac.roundcube.net/changeset?reponame=&new=4223%40trunk%2Froundcubemail%2Fprogram%2Flib%2Fhtml2text.php&old=4013%40trunk%2Froundcubemail%2Fprogram%2Flib%2Fhtml2text.php The change was made with a couple of other changes too so this fix may be out of sync/context. Perhaps we should consider updating the entire 'library' file here to the latest maintained copy rather than fixing the one line ? cheers, Aparup
            Hide
            rajeshtaneja Rajesh Taneja added a comment -

            Thanks Aparup,

            I saw this change, but was not sure if whole file should be updated. If it's fine to do it as part of stable team then I would love to do that.

            My only concern, in doing this is:

            1. Is it ok to update library as part of stable work (As asked above)
            2. Can this be backported to 22 and 21, I feel if not then please have this fix integrated and I will open another bug for master only.
            Show
            rajeshtaneja Rajesh Taneja added a comment - Thanks Aparup, I saw this change, but was not sure if whole file should be updated. If it's fine to do it as part of stable team then I would love to do that. My only concern, in doing this is: Is it ok to update library as part of stable work (As asked above) Can this be backported to 22 and 21, I feel if not then please have this fix integrated and I will open another bug for master only.
            Hide
            nebgor Aparup Banerjee added a comment -

            Thanks for the report and work here guys, this has been integrated and is up for testing.

            Raj: went with (2) (don't see any regressions but the changes are out of context, still - i've created MDL-31805)

            Please ensure that there are no regressions, the change here does seem to fix the reported issue but the fix seems to be out of context. Note the parentheses introduced here are there because the regex was pulled from the latest version of this file. Although harmless, this simply means there are other changes to the library relevant to this latest regex.

            This library is not up to date, i've created MDL-31805 to look at updating it.

            Show
            nebgor Aparup Banerjee added a comment - Thanks for the report and work here guys, this has been integrated and is up for testing. Raj: went with (2) (don't see any regressions but the changes are out of context, still - i've created MDL-31805 ) Please ensure that there are no regressions, the change here does seem to fix the reported issue but the fix seems to be out of context. Note the parentheses introduced here are there because the regex was pulled from the latest version of this file. Although harmless, this simply means there are other changes to the library relevant to this latest regex. This library is not up to date, i've created MDL-31805 to look at updating it.
            Hide
            rajeshtaneja Rajesh Taneja added a comment -

            Thanks Aparup

            Show
            rajeshtaneja Rajesh Taneja added a comment - Thanks Aparup
            Hide
            ankit_frenz Ankit Agarwal added a comment -

            All good.
            Passing!
            Thanks

            Show
            ankit_frenz Ankit Agarwal added a comment - All good. Passing! Thanks
            Hide
            stronk7 Eloy Lafuente (stronk7) added a comment -

            Well,

            I wish I said it every time
            you do the things you do.
            You always lend a helping hand,
            and I'm filled with gratitude.

            You are strong and generous
            for each and everyone one of us.
            I am eternally grateful,
            I cannot say thanks enough.

            Sorry for the (un)cool bit above, lol. Closing this as fixed. Ciao

            Show
            stronk7 Eloy Lafuente (stronk7) added a comment - Well, I wish I said it every time you do the things you do. You always lend a helping hand, and I'm filled with gratitude. You are strong and generous for each and everyone one of us. I am eternally grateful, I cannot say thanks enough. Sorry for the (un)cool bit above, lol. Closing this as fixed. Ciao

              People

              • Votes:
                9 Vote for this issue
                Watchers:
                9 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:
                  Fix Release Date:
                  12/Mar/12