Moodle

Non-breaking space in a lang string after $a will not work due to PHP wiredness

Details

  • Type: Bug Bug
  • Status: Closed Closed
  • Priority: Major Major
  • Resolution: Not a bug
  • Affects Version/s: 1.9.3
  • Fix Version/s: None
  • Component/s: Language, Libraries
  • Labels:
    None
  • Affected Branches:
    MOODLE_19_STABLE

Description

The sprintf() function used in get_string_from_file() (moodlelib.php) doesn't parse correctly all UTF-8 chars, notably multibyte ones.

As a consequence, lang strings cannot use UTF-8 chars like the non-breaking space.

  1. test.php
    26/Jan/09 4:02 AM
    0.7 kB
    Eloy Lafuente (stronk7)
  2. test2.php
    26/Jan/09 6:33 AM
    2 kB
    Nicolas Martignoni
  3. test3.php
    26/Jan/09 7:54 AM
    2 kB
    Eloy Lafuente (stronk7)
  4. test4.php
    10/Feb/09 8:01 PM
    1 kB
    Eloy Lafuente (stronk7)
  5. test5.php
    10/Apr/09 2:21 AM
    3 kB
    Nicolas Martignoni

Issue Links

Activity

Hide
Helen Foster added a comment -

VIP watchers added.

Show
Helen Foster added a comment - VIP watchers added.
Hide
Nicolas Martignoni added a comment -

See http://www.php.net/manual/en/function.sprintf.php#55837 for a possible solution. But what about performance?

Show
Nicolas Martignoni added a comment - See http://www.php.net/manual/en/function.sprintf.php#55837 for a possible solution. But what about performance?
Hide
Nicolas Martignoni added a comment -
Show
Nicolas Martignoni added a comment - Further reading (sorry for the spam): http://www.phpwact.org/php/i18n/utf-8 http://www.scribd.com/doc/940142/utf8-survival
Hide
Eloy Lafuente (stronk7) added a comment -

Hi Nicolas... can you execute the attached test in your server to see what it returns? It's a simple string, containing one utf-8 no-break space (C2A0) and here, being processed by sprintf() doesn't cause apparently any problem.

Ciao

Show
Eloy Lafuente (stronk7) added a comment - Hi Nicolas... can you execute the attached test in your server to see what it returns? It's a simple string, containing one utf-8 no-break space (C2A0) and here, being processed by sprintf() doesn't cause apparently any problem. Ciao
Hide
David Mudrak added a comment -

Hi Eloy. Your test.php passes at my web server as well. However there are some more examples at the pages referenced by Nicolas where this issue should appear. If I understood it correctly (haven't spent much time on it), the problem is with the percentage sign contained in multibyte strings code.

Show
David Mudrak added a comment - Hi Eloy. Your test.php passes at my web server as well. However there are some more examples at the pages referenced by Nicolas where this issue should appear. If I understood it correctly (haven't spent much time on it), the problem is with the percentage sign contained in multibyte strings code.
Hide
Nicolas Martignoni added a comment -

Hi Eloy and thanks for looking at this. Just like David, your test file runs correctly on my servers.

However the problem occurs with embedded vars ($a) in strings. Please have a look at the test2.php file I attach here now. It describes the problem I experience. Thanks in advance for your feedback.

Show
Nicolas Martignoni added a comment - Hi Eloy and thanks for looking at this. Just like David, your test file runs correctly on my servers. However the problem occurs with embedded vars ($a) in strings. Please have a look at the test2.php file I attach here now. It describes the problem I experience. Thanks in advance for your feedback.
Hide
Eloy Lafuente (stronk7) added a comment -

Bah!

Forgot my last comment!!! It's sprintf(). Bloody sprintf(). Deleting previous comment!

Show
Eloy Lafuente (stronk7) added a comment - Bah! Forgot my last comment!!! It's sprintf(). Bloody sprintf(). Deleting previous comment!
Hide
Eloy Lafuente (stronk7) added a comment -

aha!

I think I've found a workaround:

This is current string to evaluate, that is broken:

$resultstring= sprintf("Connecté sous le nom « $a »");

But this seems to be properly evaluated:

$resultstring= sprintf("Connecté sous le nom « ". $a. " »");

Silly trick but seems that we could, simply, replace all the $a (or $a->whatever) by " . $a . ", and that way, forcing the concatenation... results seem to be properly processed by sprintf/eval. Stupid PHP after all.

My only concern is about the cost of applying that replace (regular expression) to all the strings all the time. Could be too much.

Show
Eloy Lafuente (stronk7) added a comment - aha! I think I've found a workaround: This is current string to evaluate, that is broken: $resultstring= sprintf("Connecté sous le nom « $a »"); But this seems to be properly evaluated: $resultstring= sprintf("Connecté sous le nom « ". $a. " »"); Silly trick but seems that we could, simply, replace all the $a (or $a->whatever) by " . $a . ", and that way, forcing the concatenation... results seem to be properly processed by sprintf/eval. Stupid PHP after all. My only concern is about the cost of applying that replace (regular expression) to all the strings all the time. Could be too much.
Hide
Eloy Lafuente (stronk7) added a comment -

Hehe, another workaround that is properly processed too:

$resultstring= sprintf("Connecté sous le nom « {$a} »");

Idiot PHP. :-D

Show
Eloy Lafuente (stronk7) added a comment - Hehe, another workaround that is properly processed too: $resultstring= sprintf("Connecté sous le nom « {$a} »"); Idiot PHP. :-D
Hide
Eloy Lafuente (stronk7) added a comment -

Attached the test3.php example that uses the {$a} trick to allow sprintf()/eval() to work properly in the example given.

Ciao

Show
Eloy Lafuente (stronk7) added a comment - Attached the test3.php example that uses the {$a} trick to allow sprintf()/eval() to work properly in the example given. Ciao
Hide
Eloy Lafuente (stronk7) added a comment -

Just for reference, some execution times, to compare:

  • 1000 strings processed, without the regular expression: 0.032 seconds
  • 1000 strings processed, with the regular expression: 0.039 seconds

Difference: 0.007 seconds (21% slower)

  • 100000 strings processed, without the regular expression: 3.365 seconds
  • 100000 strings processed, with the regular expression: 4.114 seconds

Difference: 0.749 seconds (22% slower)

Note that this is a comparison between the execution of test2.php and test3.php above, processing the same "in-memory" string 1000 and 100000 times.

I guess differences in real world (where strings are "in-file", hence slower access) will be really smaller. More yet, 0.007 seconds for 1000 strings seems to be a viable solution if finally implemented.

Ciao

Show
Eloy Lafuente (stronk7) added a comment - Just for reference, some execution times, to compare:
  • 1000 strings processed, without the regular expression: 0.032 seconds
  • 1000 strings processed, with the regular expression: 0.039 seconds
Difference: 0.007 seconds (21% slower)
  • 100000 strings processed, without the regular expression: 3.365 seconds
  • 100000 strings processed, with the regular expression: 4.114 seconds
Difference: 0.749 seconds (22% slower) Note that this is a comparison between the execution of test2.php and test3.php above, processing the same "in-memory" string 1000 and 100000 times. I guess differences in real world (where strings are "in-file", hence slower access) will be really smaller. More yet, 0.007 seconds for 1000 strings seems to be a viable solution if finally implemented. Ciao
Hide
David Mudrak added a comment -

Or, we can make sure that $a parameters are properly quoted by the curly brackets at the time of writing the string via the translation tool. Let us check this condition into the syntax checker, recently published by Petr (was it he?). No need to to search and replace on the fly.
Translators not-using the translation UI should run the checker before they commit anyway.

Show
David Mudrak added a comment - Or, we can make sure that $a parameters are properly quoted by the curly brackets at the time of writing the string via the translation tool. Let us check this condition into the syntax checker, recently published by Petr (was it he?). No need to to search and replace on the fly. Translators not-using the translation UI should run the checker before they commit anyway.
Hide
Nicolas Martignoni added a comment -

After running test3.php, I confirm that the solution with {$a} works correctly for me too Bloody PHP indeed

The grep solution would work out of the box for a small performance cost. With the checker solcution, there's a risk for the translator to forget the check before committing (I know myself ).

In any case, both solution will be better than actual situation.

Show
Nicolas Martignoni added a comment - After running test3.php, I confirm that the solution with {$a} works correctly for me too Bloody PHP indeed The grep solution would work out of the box for a small performance cost. With the checker solcution, there's a risk for the translator to forget the check before committing (I know myself ). In any case, both solution will be better than actual situation.
Hide
Nicolas Martignoni added a comment -

For the grep solution, I'll suggest to take potential {} in the lang strings:

return preg_replace('/{?(\$a(->\w+)?)}?/', '{\\1}', $temp);

or even

return preg_replace('/{(\$a(->\w+)?)}/', '{\1}}', $temp);

Show
Nicolas Martignoni added a comment - For the grep solution, I'll suggest to take potential {} in the lang strings: return preg_replace('/{?(\$a(->\w+)?)}?/', '{\\1}', $temp); or even return preg_replace('/{(\$a(->\w+)?)}/', '{\1}}', $temp);
Hide
Koen Roggemans added a comment -

Petr's script is in MDL-14152

Please include it in moodle/admin with all the other nice testscripts (might be the time to create a folder for them )

Show
Koen Roggemans added a comment - Petr's script is in MDL-14152 Please include it in moodle/admin with all the other nice testscripts (might be the time to create a folder for them )
Hide
Eloy Lafuente (stronk7) added a comment -

Hi,

I think that we shouldn't make translators life harder here, by forcing them to know about those {} internals at all. IMO, if the times are aceptable, we should grep in code and leave them free from complexities.

Only if times aren't aceptable I'd convert/produce that in the editor.

Just my opinion.

Show
Eloy Lafuente (stronk7) added a comment - Hi, I think that we shouldn't make translators life harder here, by forcing them to know about those {} internals at all. IMO, if the times are aceptable, we should grep in code and leave them free from complexities. Only if times aren't aceptable I'd convert/produce that in the editor. Just my opinion.
Hide
David Mudrak added a comment -

What about a pre-commit CVS hook script to search and replace all language files before they are actually saved into the repository. All I am leading towards is to do this thing once instead of doing it everytime we load the strings.

Show
David Mudrak added a comment - What about a pre-commit CVS hook script to search and replace all language files before they are actually saved into the repository. All I am leading towards is to do this thing once instead of doing it everytime we load the strings.
Hide
Eloy Lafuente (stronk7) added a comment -

But that doesn't work for local changes and so on, plus the idea of modifying files in the commit process, wow. -10

IMO it should be handled from core. The only thing is to measure carefully the cost before doing that. Hopefully, some day, PHP won't bug there, so the workaround will be erased safely.

Ciao

Show
Eloy Lafuente (stronk7) added a comment - But that doesn't work for local changes and so on, plus the idea of modifying files in the commit process, wow. -10 IMO it should be handled from core. The only thing is to measure carefully the cost before doing that. Hopefully, some day, PHP won't bug there, so the workaround will be erased safely. Ciao
Hide
David Mudrak added a comment -

Yeah, make sense... So what about sanity this by cron before generating the lang pack zip archives? Local modifications should be OK because master pack will already contain {$a} form so admin will preserve such format. Just a final idea...

Show
David Mudrak added a comment - Yeah, make sense... So what about sanity this by cron before generating the lang pack zip archives? Local modifications should be OK because master pack will already contain {$a} form so admin will preserve such format. Just a final idea...
Hide
Nicolas Martignoni added a comment -

Yes Eloy, I agree with you. +1 from me (if it counts) to handle this in core, if the cost is not too high.

And re PHP bug: it seems that php6 will be UTF-8 compatible, so there's some hope here.

Ciao

Show
Nicolas Martignoni added a comment - Yes Eloy, I agree with you. +1 from me (if it counts) to handle this in core, if the cost is not too high. And re PHP bug: it seems that php6 will be UTF-8 compatible, so there's some hope here. Ciao
Hide
Nicolas Martignoni added a comment -

Eloy, do you have an estimate date for pushing your fix into CVS, and if it's a candidate for backporting in 1.8.x and 1.9.x?

Ciao!

Show
Nicolas Martignoni added a comment - Eloy, do you have an estimate date for pushing your fix into CVS, and if it's a candidate for backporting in 1.8.x and 1.9.x? Ciao!
Hide
Eloy Lafuente (stronk7) added a comment -

Well, the commit is a matter of a few seconds... but it seems that there are two alternatives:

1)- apply the patch and leave moodle to process it, leaving languages unmodified.
2)- don't apply the patch, updating all languages and editor to use {} plus sanity checks, leaving moodle unmodified.

My +1 is about to implement 1), apply the patch and forget modifications in langs.

But I'd like to hear some more votes here before doing that.

Ciao

Show
Eloy Lafuente (stronk7) added a comment - Well, the commit is a matter of a few seconds... but it seems that there are two alternatives: 1)- apply the patch and leave moodle to process it, leaving languages unmodified. 2)- don't apply the patch, updating all languages and editor to use {} plus sanity checks, leaving moodle unmodified. My +1 is about to implement 1), apply the patch and forget modifications in langs. But I'd like to hear some more votes here before doing that. Ciao
Hide
Eloy Lafuente (stronk7) added a comment -

Adding Martin and Tim here.

Martin, after some HQ chat about this, Tim detected that that sprintf() has been there since the beginning. See:

http://cvs.moodle.org/moodle/lib/moodlelib.php?view=diff&pathrev=MOODLE_19_STABLE&r1=1.25&r2=1.26

Do you remember any case where it is necessary? It seems that directly eval() the string is enough to perform all the $a(whatever) parsing.

Also, Tim comments about to use one manual solution for translators having the problem as far as it doesn't affect too many strings, instead of preg_replace always the evaled sentence.

Show
Eloy Lafuente (stronk7) added a comment - Adding Martin and Tim here. Martin, after some HQ chat about this, Tim detected that that sprintf() has been there since the beginning. See: http://cvs.moodle.org/moodle/lib/moodlelib.php?view=diff&pathrev=MOODLE_19_STABLE&r1=1.25&r2=1.26 Do you remember any case where it is necessary? It seems that directly eval() the string is enough to perform all the $a(whatever) parsing. Also, Tim comments about to use one manual solution for translators having the problem as far as it doesn't affect too many strings, instead of preg_replace always the evaled sentence.
Hide
Eloy Lafuente (stronk7) added a comment -

Added test 4 that:

  • Shows no differences between using sprintf() and not using it.
  • Shows that the original preg_replace() isn't valid for strings like «$a» (without spaces).
  • Shows that adding the /u (unicode) modifier to the preg_replace() it processes the strings properly.

I any case, things to decide:

1) Erase sprintf() use if we don't need it. HEAD only for now + unit tests. Backported later.
2) Automated vs. manual {} parsing.

Ciao

Show
Eloy Lafuente (stronk7) added a comment - Added test 4 that:
  • Shows no differences between using sprintf() and not using it.
  • Shows that the original preg_replace() isn't valid for strings like «$a» (without spaces).
  • Shows that adding the /u (unicode) modifier to the preg_replace() it processes the strings properly.
I any case, things to decide: 1) Erase sprintf() use if we don't need it. HEAD only for now + unit tests. Backported later. 2) Automated vs. manual {} parsing. Ciao
Hide
Tim Hunt added a comment -

FYI, I have removed the sprintf in HEAD. Would be good if you could re-test there.

Show
Tim Hunt added a comment - FYI, I have removed the sprintf in HEAD. Would be good if you could re-test there.
Hide
Nicolas Martignoni added a comment -

New file test5.php shows that the problem is not sprintf() function, but eval(). Silly PHP

{$a} workaround still applies.

Is there any alternative to eval() for detecting parse errors?

Show
Nicolas Martignoni added a comment - New file test5.php shows that the problem is not sprintf() function, but eval(). Silly PHP {$a} workaround still applies. Is there any alternative to eval() for detecting parse errors?
Hide
Tim Hunt added a comment -

I'll look into this, since I have been messing with get_string.

If eval is really not Unicode-safe, then we may have to resort to writing our own parser to replace $a and $a->xxx in lang strings. Actually, that should be an easy preg_replace. The only thing to be careful about is to know whether we are relying on eval to do anything else for us. I guess it also replaces \" and \$ with " and $ respectively.

Show
Tim Hunt added a comment - I'll look into this, since I have been messing with get_string. If eval is really not Unicode-safe, then we may have to resort to writing our own parser to replace $a and $a->xxx in lang strings. Actually, that should be an easy preg_replace. The only thing to be careful about is to know whether we are relying on eval to do anything else for us. I guess it also replaces \" and \$ with " and $ respectively.
Hide
Tim Hunt added a comment -

OK, you have to RTFM (http://au2.php.net/manual/en/language.variables.basics.php) and then bang your head against the desk at the stupidity of PHP.

Non-breaking space (character \xa0) is a legal character in a variable name, so, if you are insane you could write code like

$user name = "Nicolas Martignoni";
echo 'This is the output of Moodle:<br />';
echo get_string('loggedinas', 'moodle', $user name);

Anyway, that explains why '$a »' does not work as expected. It is looking for a variable called '$a ' (a followed by non-breaking-space) which obviously does not exist.

The correct work-around is to define

$string['loggedinas'] = 'Connecté sous le nom « {$a} »';

That is, use curly braces to separate the variable from the following character, just like you would say

$string['distancerun'] = 'You have run {$a}km.';

to avoid referring to the variable $akm.

So, this is resolve not-a-bug, but translators need to be aware. I'll post in the Languages forum.

Show
Tim Hunt added a comment - OK, you have to RTFM (http://au2.php.net/manual/en/language.variables.basics.php) and then bang your head against the desk at the stupidity of PHP. Non-breaking space (character \xa0) is a legal character in a variable name, so, if you are insane you could write code like $user name = "Nicolas Martignoni"; echo 'This is the output of Moodle:<br />'; echo get_string('loggedinas', 'moodle', $user name); Anyway, that explains why '$a »' does not work as expected. It is looking for a variable called '$a ' (a followed by non-breaking-space) which obviously does not exist. The correct work-around is to define $string['loggedinas'] = 'Connecté sous le nom « {$a} »'; That is, use curly braces to separate the variable from the following character, just like you would say $string['distancerun'] = 'You have run {$a}km.'; to avoid referring to the variable $akm. So, this is resolve not-a-bug, but translators need to be aware. I'll post in the Languages forum.
Hide
Nicolas Martignoni added a comment -

Thanks Tim, I was getting mad, as I couldn't understand what was going here. What a stupidity...

Show
Nicolas Martignoni added a comment - Thanks Tim, I was getting mad, as I couldn't understand what was going here. What a stupidity...
Hide
Nicolas Martignoni added a comment -

Indeed not a bug, just a silly PHP convention.

Show
Nicolas Martignoni added a comment - Indeed not a bug, just a silly PHP convention.
Hide
Dan Poltawski added a comment -

Wow, that is ridiculous - is it worth greping through all the language packs and ensuring that we use braces everywhere in the lang packs?

Show
Dan Poltawski added a comment - Wow, that is ridiculous - is it worth greping through all the language packs and ensuring that we use braces everywhere in the lang packs?
Hide
Koen Roggemans added a comment -

Is it possible to add to the language pack editor to add curly brackets arround $a if they are not already there? That way, opening and saving a language file would fix all $a strings in one go. New errors are not possible anymore, old ones, if they exist, are easy to fix and will be fixed over time automatically.

Show
Koen Roggemans added a comment - Is it possible to add to the language pack editor to add curly brackets arround $a if they are not already there? That way, opening and saving a language file would fix all $a strings in one go. New errors are not possible anymore, old ones, if they exist, are easy to fix and will be fixed over time automatically.
Hide
Tim Hunt added a comment -

Sounds like a good idea, but I think that changing the editing tool is separate from the underlying issue discussed here, so I created MDL-18841.

Show
Tim Hunt added a comment - Sounds like a good idea, but I think that changing the editing tool is separate from the underlying issue discussed here, so I created MDL-18841.

Dates

  • Created:
    Updated:
    Resolved: