Moodle

html2text not compatible with utf-8

Details

  • Type: Bug
  • Status: Closed
  • Priority: Critical
  • Resolution: Fixed
  • Affects Version/s: 1.5, 1.8.9, 1.9.5
  • Fix Version/s: 1.8.10, 1.9.6
  • Component/s: General
  • Labels:
    None
  • Environment:
    Both PHP4 and PHP5
  • Affected Branches:
    MOODLE_15_STABLE, MOODLE_18_STABLE, MOODLE_19_STABLE
  • Fixed Branches:
    MOODLE_18_STABLE, MOODLE_19_STABLE

Description

The html2text function, which is called indirectly from format_text_email, is not compatible with UTF-8. At the end of the function it replaces every chr(160) byte with ' ', but in UTF-8 the byte 0xA0 is not a whitespace character on its own: it can be one byte of a multi-byte character. As a result, characters whose UTF-8 encoding contains that byte, such as 'da' (U+3060) in ja_utf8, are garbled in plain-text email.

--- ../20050325/moodle/lib/html2text.php Sun Jan 23 11:18:50 2005
+++ html2text.php Sat Mar 26 16:56:06 2005
@@ -157,12 +157,12 @@
 $goodStr = wordwrap( $goodStr, 78 );
 //make sure there are no more than 3 linebreaks in a row and trim whitespace
-$goodStr = str_replace(chr(160), ' ', $goodStr );
+// $goodStr = str_replace(chr(160), ' ', $goodStr );
 $goodStr = preg_replace("/\r\n?|\f/", "\n", $goodStr);
 $goodStr = preg_replace("/\n(\s*\n){2}/", "\n\n\n", $goodStr);
 $goodStr = preg_replace("/[ \t]+(\n|$)/", "$1", $goodStr);
 $goodStr = preg_replace("/^\n*|\n*$/", '', $goodStr);
-$goodStr = str_replace(chr(160), ' ', $goodStr );
+// $goodStr = str_replace(chr(160), ' ', $goodStr );
 return $goodStr;
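
The byte-level effect described above can be reproduced in isolation. The following is a minimal sketch, not code from html2text.php: the variable names are illustrative, and the "safe" variant (replacing only the two-byte UTF-8 sequence for U+00A0) is an assumption about one possible alternative, not the fix that was committed.

```php
<?php
// U+3060 (HIRAGANA LETTER DA) is the three bytes E3 81 A0 in UTF-8.
$da = "\xE3\x81\xA0";

// The byte-wise replacement from the report: every 0xA0 byte becomes a space,
// including the one that is the last byte of the character above.
$broken = str_replace(chr(160), ' ', $da);

// Illustrative alternative (an assumption, not the committed fix): replace only
// a real no-break space, which is the two-byte sequence C2 A0 in UTF-8.
$nbsp = "\xC2\xA0";
$safe = str_replace($nbsp, ' ', "a" . $nbsp . $da);

echo bin2hex($da), "\n";     // e381a0
echo bin2hex($broken), "\n"; // e38120 (no longer a valid UTF-8 sequence)
echo bin2hex($safe), "\n";   // 6120e381a0 (character left intact)
```

Because 0xC2 only ever occurs as a lead byte in valid UTF-8, matching the full two-byte sequence cannot split another character.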

  1. html_entity_decode_utf8.patch
    10/Jun/09 11:08 AM
    2 kB
    Francois Marier
  2. html2text_utf8_fixes.patch
    15/Jun/09 1:35 PM
    2 kB
    Francois Marier
  3. html2text.20090608.patch
    08/Jun/09 6:51 PM
    0.8 kB
    Juan Segarra Montesinos
  4. html2text.20090610.patch
    10/Jun/09 3:34 PM
    0.8 kB
    Juan Segarra Montesinos

Activity

Martin Dougiamas added a comment -

From Martin Dougiamas (martin at moodle.com) Sunday, 27 March 2005, 01:50 PM:

Thanks! Fixed in 1.5 CVS

Michael Blake added a comment -

assign to a valid user

Joseph Rézeau added a comment -

I have just one question: what is the use of the html_to_text($html) function in Moodle's weblib? That function uses the html2text.php library, which mangles UTF-8 text.

I found this by accident, because html_to_text() IS used in the Questionnaire plugin (where it causes a problem in non-ASCII languages), but nowhere else.

In MDL-17542, François Marier reported "I have committed an updated version of this file to CVS, along with a readme describing its origin." but again, what's the point of it all if that library is not being used anywhere in Moodle core files?

I remain puzzled,

Joseph

Francois Marier added a comment -

It's also used for example when emailing out forum posts.

Joseph Rézeau added a comment -

François, sorry but in moodle 1.9 I cannot find a reference to the html_to_text($html) function being used anywhere. In which script is it actually used?

Juan Segarra Montesinos added a comment -

Hi

html2text is not working correctly in 1.9.5. To reproduce the problem:

1. Write a forum email with non-ASCII characters
2. Send the email

Look at the text/plain part: part of the body is incorrectly encoded.

The problem is in the _convert() method in html2text.php. html_entity_decode() assumes Latin-1 by default, so either the text should first be converted to Latin-1 or the input encoding should be specified explicitly.

The patch attached seems to solve the problem.

Thanks in advance
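
To illustrate the encoding point in the comment above, here is a small sketch of how the charset argument of html_entity_decode() changes the bytes produced for an entity. It is not the attached patch; the sample string is made up, and the Latin-1 call is written out explicitly to mimic the old default.

```php
<?php
// Sample input (illustrative): an HTML entity next to raw UTF-8 text (U+3060).
$html = "caf&eacute; \xE3\x81\xA0";

// Explicit UTF-8: &eacute; becomes the two-byte sequence C3 A9.
$utf8 = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

// Explicit ISO-8859-1, mimicking the old default: &eacute; becomes the single
// byte E9, which does not form valid UTF-8 alongside the untouched text.
$latin1 = html_entity_decode($html, ENT_QUOTES, 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // 636166c3a920e381a0
echo bin2hex($latin1), "\n"; // 636166e920e381a0
```

The second result mixes a Latin-1 byte into an otherwise UTF-8 string, which is consistent with the partly mis-encoded text/plain body described in the reproduction steps.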

Francois Marier added a comment -

Joseph, if you look in lib/weblib.php, the html_to_text() function is used once within the format_text_email() function:

function format_text_email($text, $format) {
    switch ($format) {
        ...
        case FORMAT_HTML:
            return html_to_text($text);
            break;
        ...
    }
}

see: http://git.moodle.org/gw?p=moodle.git;a=blob;f=lib/weblib.php;h=0a6dec42bcb2ea5288aacad9822b292a6e9a3460;hb=MOODLE_19_STABLE#l1775

Francois Marier added a comment -

Alright, I've got a patch which seems to work both on PHP5 and PHP4.

Can people please test it and confirm whether or not it fixes their issues? I'm particularly interested to hear whether it works on non-latin locales (for example, Japanese).

Cheers,
Francois

Juan Segarra Montesinos added a comment -

Sorry for the noise guys, but I submitted a wrong patch the other day... bad day

This solves our issues with plain text email...

I'll try to provide feedback for the other patch too.

bye

Francois Marier added a comment -

Hi Juan,

I'm not sure about the conversion to latin1. What happens if there are characters (e.g. Japanese characters) which fall outside of that range?

This is why I'd like someone using a non-latin1 locale to confirm that the html_entity_decode_utf8 patch works.

Cheers,
Francois

Juan Segarra Montesinos added a comment -

Well done Francois

Jens Eremie added a comment -

Thanks for the fix, Francois and Juan!

Works on m1.9.5+ (php 5.2.9)

Cheers,
Jens

Eloy Lafuente (stronk7) added a comment -

Hi,

as commented in MDL-19499, +1 to use current textlib->entities_to_utf8(), plus tests in lib/simpletest/testweblib.php to see if it works ok. (-1 to add new function into html2text).

Ciao
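
For readers unfamiliar with what an entities-to-UTF-8 conversion involves, the following is a rough, self-contained sketch of decoding numeric entities without relying on html_entity_decode(). It is not Moodle's textlib implementation: both function names are hypothetical, only numeric entities are handled, no range validation is done, and the closure syntax needs PHP 5.3 or later (so it would not run on PHP4 as written).

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
function sketch_codepoint_to_utf8($cp) {
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: replace decimal (&#12384;) and hex (&#x3060;) entities.
function sketch_numeric_entities_to_utf8($text) {
    return preg_replace_callback(
        '/&#(x[0-9a-fA-F]+|[0-9]+);/',
        function ($m) {
            $cp = ($m[1][0] === 'x') ? hexdec(substr($m[1], 1)) : (int) $m[1];
            return sketch_codepoint_to_utf8($cp);
        },
        $text
    );
}

$decoded = sketch_numeric_entities_to_utf8('&#12384; &#x3060; &#160;');
echo bin2hex($decoded), "\n"; // e381a020e381a020c2a0
```

Note that &#160; decodes to the two bytes C2 A0 here, so a later byte-wise search for chr(160) alone would still hit the second byte of that sequence.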

Francois Marier added a comment -

Thanks for that Eloy, I'm going to have a look at textlib and test it on PHP4.

Francois Marier added a comment -

New patch based on Eloy's suggestion.

Francois Marier added a comment -

Fixed in 1.8 and 1.9 (HEAD was not affected by this problem).

I have updated the unit tests to match the output of this new library. They all pass now.

Eloy Lafuente (stronk7) added a comment -

I've backported tests to 18_STABLE and they are passing ok under all branches.

So closing as reviewed. Thanks, Francois B)

Ciao

