Moodle / MDL-22896

bad regular expression in html2text library causes text to go missing from forum emails

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.9.9, 2.1.4, 2.2.1, 2.3
    • Fix Version/s: 2.1.5, 2.2.2
    • Component/s: Libraries
    • Labels:
    • Testing Instructions:

      Note: To test this, email must be working for forums.

      1. Set "Email format" to plain text in your profile.
      2. Add a forum post containing the following text, with "Mail now" checked:

        Gin & Tonic
        - 2oz gin;
        - 5oz tonic water;
        - 5 cubes of ice;
        - 1 lime wedge.

      3. After 1 minute, run cron (/admin/cron.php).
      4. Make sure no text is lost.
    • Difficulty:
      Easy
    • Affected Branches:
      MOODLE_19_STABLE, MOODLE_21_STABLE, MOODLE_22_STABLE, MOODLE_23_STABLE
    • Fixed Branches:
      MOODLE_21_STABLE, MOODLE_22_STABLE
    • Pull Master Branch:
      wip-mdl-22896

      Description

      Greetings.. I believe I've found and fixed a bug in the html2text library.

      In /lib/html2text.php...
      ---------------------------
      478 // Remove unknown/unhandled entities (this cannot be done in search-and-replace block)
      479 $text = preg_replace('/&[^&;]+;/i', '', $text);
      ---------------------------

      That regular expression is too greedy... it matches any sequence of characters that starts with an ampersand and ends with a semicolon, including sequences that contain spaces and line breaks.

      We've had numerous reports from users that huge chunks of forum posts are missing from the plain-text emails they receive by subscription.

      The problem occurs when someone includes an ampersand in their text and a semicolon somewhere after it. Everything from the ampersand through the next semicolon is filtered out, provided no other ampersand comes in between.

      Here's an example...

      Gin & Tonic

      • 2oz gin;
      • 5oz tonic water;
      • 5 cubes of ice;
      • 1 lime wedge.

      If you ran that through html2text, it would output this:

      Gin

      • 5oz tonic water;
      • 5 cubes of ice;
      • 1 lime wedge.
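
The loss can be reproduced outside Moodle with a few lines of PHP. This is only a minimal sketch: it applies the single pattern quoted from line 479 to a plain-text stand-in for the post (with "-" as bullets) rather than running the full html2text conversion, and the variable names are illustrative.

```php
<?php
// Pattern copied from line 479 of /lib/html2text.php as quoted above.
$pattern = '/&[^&;]+;/i';

// Assumed plain-text stand-in for the forum post, not real html2text output.
$text = "Gin & Tonic\n- 2oz gin;\n- 5oz tonic water;\n- 5 cubes of ice;\n- 1 lime wedge.\n";

// [^&;] also matches spaces and newlines, so the match runs from "&"
// through the first ";" on the following line.
$result = preg_replace($pattern, '', $text);

echo $result;
```

The "& Tonic" text and the whole "2oz gin;" line fall inside one match, which is why both disappear from the email.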

      The simple fix I am testing now is this:
      479 $text = preg_replace('/&[^&;\s]+;/i', '', $text);

      The additional \s excludes whitespace from the character class, so a match can no longer span a space or line break.
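
To see what the extra \s changes, the two patterns can be compared on the same input. Again this is a sketch under assumptions: the sample string is shortened, `&bogus;` is a made-up entity standing in for an unknown one, and only this one replacement is exercised.

```php
<?php
$old = '/&[^&;]+;/i';    // original pattern from line 479
$new = '/&[^&;\s]+;/i';  // proposed pattern with \s added

// "&bogus;" is a hypothetical unknown entity that should still be stripped.
$text = "Gin & Tonic\n- 2oz gin;\n- 1 lime wedge&bogus;.\n";

$withOld = preg_replace($old, '', $text);  // loses "& Tonic" and the gin line
$withNew = preg_replace($new, '', $text);  // strips only "&bogus;"

echo $withOld, "---\n", $withNew;
```

The stricter pattern would still remove a sequence such as "&D;" in "R&D; budget", since no whitespace separates the ampersand from the semicolon there.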

      Best regards,
      -Garret

        Attachments

        1. Help Sessions and general admin.20100827140840.txt.withoutfix
          3 kB
          Troy Williams
        2. Help Sessions and general admin.20100827140938.txt.withfix
          3 kB
          Troy Williams
        3. Help Sessions and general admin.html
          4 kB
          Troy Williams

              People

              • Votes: 9
              • Watchers: 9
