Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.9.6
    • Fix Version/s: 1.9.8, 2.0
    • Component/s: Filters
    • Labels:
      None
    • Difficulty:
      Easy
    • Affected Branches:
      MOODLE_19_STABLE
    • Fixed Branches:
      MOODLE_19_STABLE, MOODLE_20_STABLE
    • Rank:
      31873

      Description

      The convert_urls_into_links function in lib/weblib.php has bad regular expressions.
      The regex used sucks up anything after the domain until it hits a space.

      Something like
      <p>text www.example.com</p> text
      gets completely trashed, and produces
      <p>text <a href="http://www.example.com</p" target="_blank">www.example.com</p</a>> text
      which in effect makes the rest of the content vanish.

      I rewrote the regexes to fix it. The result isn't perfect, but it's much better than what was there: I only touch http(s):// or www URLs with at least one period in the host string.
      It also no longer uses the deprecated ereg functions. See MDL-20821.

      function convert_urls_into_links(&$text) {
      /// Make lone URLs into links. eg http://moodle.com/
          $text = preg_replace("!([[:space:]]|^|\(|\[)(https?)://([[:alnum:]]+\.[[:alnum:].;_%#?/&=]*[[:alnum:]])!i",
                               "\\1<a href=\"\\2://\\3\" target=\"_blank\">\\2://\\3</a>", $text);

      /// eg www.moodle.com
          $text = preg_replace("!([[:space:]]|^|\(|\[)www\.([[:alnum:]]+\.[[:alnum:].;_%#?/&=]*[[:alnum:]])!i",
                               "\\1<a href=\"http://www.\\2\" target=\"_blank\">www.\\2</a>", $text);
      }
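
      As a quick sanity check (the exact output depends on the character classes above), the problem example now comes through intact:

      $text = '<p>text www.example.com</p> text';
      convert_urls_into_links($text);
      echo $text;
      // prints: <p>text <a href="http://www.example.com" target="_blank">www.example.com</a></p> text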

      1. bug20826.php
        2 kB
        Alan Trick
      2. crazy_url_parser.patch.txt
        3 kB
        Eloy Lafuente (stronk7)
      3. testweblib.patch
        8 kB
        Andrew Davis
      4. testweblib2.patch
        9 kB
        Andrew Davis
      5. testweblib3.patch
        10 kB
        Andrew Davis
      6. weblib.patch
        4 kB
        Andrew Davis
      7. weblib2.patch
        4 kB
        Andrew Davis
      8. weblib3.patch
        4 kB
        Andrew Davis

        Issue Links

          Activity

          Alan Trick added a comment -

          Here's another case that's really bugging us here:

          http://google.com)<br />

          becomes

          <a href="http://google.com)<br" target="_blank">http://google.com)<br</a> />
          Alan Trick added a comment -

          Here's an alternate implementation of convert_urls_into_links. I hacked it together rather quickly, so it can probably be cleaned up a bit. It has the benefit of working correctly, though.

          It also changes some other behaviour. For example, it won't include punctuation at the end of a link (so "www.abc.com/." doesn't include the final "."). I think this behaviour is what is wanted 99% of the time, so it's an improvement. If someone really needs that punctuation, they can just manually specify the URL.
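
          Just to illustrate the idea (a rough sketch using a modern anonymous function, not the attached bug20826.php): match a candidate URL and trim the trailing punctuation before linking it.

          $text = preg_replace_callback(
              '!(?:https?://|www\.)[^\s<]+!i',
              function ($matches) {
                  $url  = rtrim($matches[0], '.,;:!?');   // drop trailing punctuation
                  $href = (stripos($url, 'www.') === 0) ? "http://$url" : $url;
                  return '<a href="' . $href . '" target="_blank">' . $url . '</a>';
              },
              $text);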

          N.b. There's still one potential bug, but the old version had it too. It doesn't actually check for nested links, so

          <a href='http://google.com'> http://google.com</a>

          still trips it up.

          Eloy Lafuente (stronk7) added a comment -

          Hi,

          Extracting URLs with just one regexp is really a complex problem, hehe. Here we need to find something balanced, able to handle 99% of them, leaving aside the really rare edge cases.

          In any case, I've added some tests both to 1.9 and HEAD in order to be able to test and compare different alternatives. I've added everything I've been able to imagine, including the 2 problems you detected above.

          Right now, the current implementation is failing 4 tests (your 2 problems and the URLs within <a> tags that have one leading space). It would be great to know how your alternative performs under those tests (run the tests restricted to "lib/simpletest/testweblib.php").

          Knowing that, we'll be nearer to choosing the best solution (once more, knowing that it's impossible to achieve 100%).

          Ciao

          Alan Trick added a comment -

          Actually, I think for the task of this function, it's possible to get something that at least produces valid code 100% of the time. The bug I mentioned is fixable; it just requires more code. Plus it was late Friday night and I had gotten 3 hours of sleep the night before, struggling to finish a lab. I'm certain I can do better, given time.

          However, in the end, I think a correct solution will probably be slower (though I haven't done tests). Is this acceptable? I personally think it is, because performance issues can be offloaded by caching, but I'm not in charge of Moodle's design.

          Andrew Davis added a comment -

          Thanks for those tests Eloy. They're very helpful.

          For the record, the existing regexes pass 20 of the unit tests and fail these 4:
          URL: <a href="http://moodle.org"> http://moodle.org</a>
          URL: <a href="http://moodle.org"> www.moodle.org</a>
          URL: http://moodle.org)<br />
          URL: <p>text www.moodle.org</p> text

          Alan's suggested code also passes 20 and fails 4, with 3 failures in common with the existing code:
          URL: http://moodle.org:8080/s/i=1
          URL: <a href="http://moodle.org"> http://moodle.org</a>
          URL: <a href="http://moodle.org"> www.moodle.org</a>
          URL: <p>text www.moodle.org</p> text

          I'm working on my own version but I'm not sure I'm going to do much better. I'm looking at making several passes rather than trying to do everything in one go. These large regular expressions are going to be hard to modify if they ever need maintenance in the future.

          Eloy Lafuente (stronk7) added a comment -

          Assigning this to you, Andrew...

          Note: One possible strategy could be the one used by filters (see filterlib), sketched below, where:

          • All the <a>...</a> tags are extracted from the text and replaced by placeholders.
          • The main regexp is executed.
          • The placeholders are replaced by the original <a>...</a> tags.
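
          Roughly like this (the function name and token format are only illustrative, not the actual filterlib code):

          function convert_with_protected_links($text) {
              $placeholders = array();
              // 1. Pull existing <a>...</a> tags out and replace them with tokens.
              $text = preg_replace_callback('!<a(\s[^>]*)?>.*?</a>!is',
                  function ($m) use (&$placeholders) {
                      $token = '<!--LINKTOKEN' . count($placeholders) . '-->';
                      $placeholders[$token] = $m[0];
                      return $token;
                  }, $text);
              // 2. Run the URL-detecting regexps on what is left.
              convert_urls_into_links($text);
              // 3. Put the original anchors back.
              return str_replace(array_keys($placeholders), array_values($placeholders), $text);
          }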

          That approach has worked pretty well there for ages.

          Also note that convert_urls_into_links() is mainly executed by format_text() (like filters), so its results are cached for n minutes in the DB (if the cache is enabled), so I don't think switching from a raw regexp to the approach above will hurt too much. In fact, if it helps to reduce the regexp complexity it could end up being faster.

          I think it would be interesting to compare execution times in any case. For any proposed solution, with caches disabled, process the same (big) text string n times and report the differences against the current one, just to be sure we don't suddenly spend too much processing time here.

          Finally, feel free to add more and more tests, containing utf-8 texts, multiple links... whatever. Any final solution must be 100% tested before adding it upstream, since this is a core function.

          Ciao

          Andrew Davis added a comment -

          Ok, I've added a couple of tests including one timing test. I moved a copy of the original function into the test code so the performance of the new code can be measured against the original code. If the new code is more than 20% slower than the old code, the unit test fails. This means the unit tests take 2 or 3 seconds more to run, but anyone tinkering in that area will get a unit test failure if their changes result in a slowdown. Let me know if there is a better, more "moodly" way to do this.
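
          In outline the timing test looks something like this (simplified, assuming SimpleTest's UnitTestCase; the class name and old_convert_urls_into_links(), standing for the copied original function, are only illustrative and differ from the attached testweblib patches):

          class web_test extends UnitTestCase {
              function test_convert_urls_into_links_speed() {
                  $text = str_repeat('visit http://moodle.org/ and www.example.com today ', 500);

                  $start = microtime(true);
                  for ($i = 0; $i < 50; $i++) { $t = $text; old_convert_urls_into_links($t); }
                  $oldtime = microtime(true) - $start;

                  $start = microtime(true);
                  for ($i = 0; $i < 50; $i++) { $t = $text; convert_urls_into_links($t); }
                  $newtime = microtime(true) - $start;

                  // Fail if the new code is more than 20% slower than the old code.
                  $this->assertTrue($newtime < $oldtime * 1.2, 'convert_urls_into_links() got slower');
              }
          }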

          I'll attach patches containing the new tests plus the updated function for discussion. I've brought in code similar to that used within the filters, as suggested by Eloy. This means <a> tags now make it through undamaged. I've also largely given up on my own code, as it was mostly attempting to achieve what the filter's <a>-protecting code achieves.

          Right now, there are 2 problems that I am, as yet, unable to solve.

          1) Chopping off a trailing &lt;, bearing in mind that &'s are valid in URLs (so the URL can't simply be cut at the first &). There will be a way to do this, but thus far my attempts to modify the regexes have not been successful.

          2) Brackets are allowed mid-URL but should be cut off the end. Simply allowing brackets means this happens (first line is the original, second line is after conversion):
          http://moodle.org)<br />
          <a href="http://moodle.org)" target="_blank">http://moodle.org)</a><br />
          Note the closing bracket is included in the link.

          Denying brackets means this happens:
          http://cc.org/url_(withpar)_go/?i=2
          <a href="http://cc.org/url_" target="_blank">http://cc.org/url_</a>(withpar)_go/?i=2
          Note the link ends when a bracket is encountered, cutting off part of the URL.

          I haven't come up with a good way to get the correct output in both situations.

          Andrew Davis added a comment -

          Patches attached. With the regexes I've gone full circle, from building up very complex ones back to simpler ones, largely thanks to the filter code.

          Eloy Lafuente (stronk7) added a comment - edited

          Hi Andrew.

          Some comments:

          1a) About tests... do we have any test involving utf-8 chars? I'd add some having those chars in the server part, the path part and the parameters part.
          1b) Some more tests having parameters with contents like param=string%2f&with%slashes (i.e. encoded chars like slashes or spaces %20).
          2) As we are moving from POSIX to PCRE regular expressions... can we change those eregi_replaces too?
          3) +1 to committing testweblib.php straight away, both to 1.9 and HEAD, so it's there warning us from now on (and until we decide the final implementation of the function).

          Nice work, thanks!

          About those brackets (parentheses), it seems that only Wikipedia really uses them, grrr, but it's important enough to support them. Based on what I've seen in various regexp URL extractors, it's generally accepted that they are only included when balanced, i.e. when there is a matching open parenthesis; otherwise they are handled as punctuation (left out of the URL).
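
          Something along these lines (illustrative only, with a simplified character class — not the regexp in the attached crazy_url_parser.patch.txt):

          // Path chars, or a (...) group containing only path chars; a lone ")" is left out.
          $pathchar = '[a-z0-9._~:/?#@!$&\'*+,;=%-]';
          $pattern  = '!https?://[a-z0-9.-]+(?:' . $pathchar . '|\(' . $pathchar . '*\))*!i';

          preg_match($pattern, 'see http://cc.org/url_(withpar)_go/?i=2 please', $m);
          echo $m[0], "\n";   // http://cc.org/url_(withpar)_go/?i=2

          preg_match($pattern, 'see http://moodle.org)<br />', $m);
          echo $m[0], "\n";   // http://moodle.org  (the unbalanced ")" is treated as punctuation)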

          About the &lt; problem, honestly I have no real idea of the best way to handle it. Perhaps if the URL has a "?", then we can infer it's going to have parameters, so we handle it as part of the URL, but if the URL is missing the "?" then the &lt; should be left out. Just one idea... but if it implies too much "cost" for the function I think I would leave it out. Not sure anyway.

          Ciao

          Edited: I'm looking for a simple regexp that handled nearly everything OK and that I was testing some days ago... grrr. I can't find it now. Will post it here later...

          Eloy Lafuente (stronk7) added a comment -

          LoL, simple, LoL!!

          I haven't been able to resist. Attaching a patch for Moodle 1.9 (crazy_url_parser.patch.txt) that I ended up with after reading the RFC a bit/a lot, using some examples found here and there, and mixing them with some similar stuff already present in backuplib.php.

          It should support the RFC more or less OK, with domain and IPv4 URLs, ports, all sorts of encoded params, parentheses when allowed, unicode chars (if the PHP server supports them) and so on.

          It also assumes that a dot, comma or semicolon at the end of the expression, while legal in URLs, marks the end of the URL. This is done via a negative assertion that can be extended with other "end" characters/combinations.
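
          In essence (a simplified pattern, not the exact regexp in the patch): the URL body matches greedily, then the negative lookbehind forces backtracking so the match cannot end in ".", "," or ";".

          $pattern = '!(?:https?://|www\.)[^\s<]+(?<![.,;])!i';

          preg_match($pattern, 'see www.abc.com/. for details', $m);
          echo $m[0], "\n";   // www.abc.com/  (the final "." is treated as punctuation)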

          It doesn't support IPv6 URLs (though they could be added; I just don't know enough about them).

          It's passing all current tests (except the <a>...</a> ones), so the trick of extracting all the <a> tags continues to be necessary. And in any case it needs more tests:

          • utf8 chars in all the possible places (domain, path, param names and param values).
          • Some more (real) Wikipedia parenthesis examples.
          • Encoded param values (%xx).
          • Spaces specially encoded with the + char.
          • Compare speed.
          • Everything else you can imagine.

          Ciao

          Andrew Davis added a comment - edited

          I've attached updated patches for /lib/simpletest/testweblib.php and /lib/weblib.php

          testweblib2.patch
          weblib2.patch

          The regexes from Eloy, combined with the code to extract existing <a> tags, were passing all of the tests. However, I have added some new ones and one of them fails; it's related to non-ASCII characters. I haven't figured out the handling of utf8 characters just yet. I did some experimenting with utf8_encode/decode without success. I'll come back to this.

          The new version appears to run in about 20% of the time of the original. I'm guessing the switch from ereg to preg is responsible.

          Haven't committed any of this yet.

          Andrew Davis added a comment -

          Attached new patches. weblib3 and testweblib3

          All unit tests passing. There actually wasn't any problem with utf8 strings. After hours of fiddling around and reading about PHP's handling of character sets, it turns out my test was failing because the test itself was broken. Sigh.

          I'll wait for this to be reviewed before committing.

          Eloy Lafuente (stronk7) added a comment -

          Great Andrew!

          I think we are near the end. I've just added Petr and Martin here as watchers to take a look at the tests and so on.

          In any case, if somebody finds anything else to be tested, now is the time; you are welcome!

          I think we could commit this stuff really soon if nobody is against it, as it seems to fix the bug and improve the URL detection a lot.

          Ciao

          Martin Dougiamas added a comment -

          +1 for 1.9 and head! Great work guys!

          Andrew Davis added a comment -

          Committed to both 1.9 and trunk.

          Eloy Lafuente (stronk7) added a comment -

          Great Andrew,

          just one minor problem...

          The original function in HEAD didn't include the target="_blank" part at all (it breaks XHTML), and you have re-introduced it. And exactly the same for the tests... so if you can fix that, great!

          Note: perhaps it's a good idea in HEAD to change that target="_blank" to something like class="_blanktarget". For now it won't have any (practical) effect, but later it may be possible to introduce some JS transforming links having the "_blanktarget" class into links opening in a "_blank" target, without breaking XHTML. Just one idea, please discuss it with MD.
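
          In other words, just swap the attribute in the replacement string, something like (illustrative only, with $urlpattern standing for whatever regexp we end up with):

          $text = preg_replace($urlpattern,
                               '\\1<a href="\\2" class="_blanktarget">\\2</a>', $text);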

          Ciao

          Andrew Davis added a comment -

          Good catch, Eloy. I've got uncommitted changes on my machine replacing target="_blank" with class="_blanktarget". For future reference, here is an example of some JS that does something like what you are talking about:

          http://articles.sitepoint.com/article/standards-compliant-world/3

          I'll bring this to Martin's attention and get his opinion on committing the code including the class on the links.

          Martin Dougiamas added a comment -

          Yep, that's the way to do it! +1

          Apart from these links, are there other places in Moodle that need fixing? I'm thinking of things like help buttons and messaging.

          Andrew Davis added a comment -

          Committed the fix. Also opened http://tracker.moodle.org/browse/MDL-21152 to look at the link target issue.


            People

            • Votes: 2
            • Watchers: 7

              Dates

              • Created:
                Updated:
                Resolved: