Moodle
  1. Moodle
  2. MDL-35364

Shorten text can cause invalid HTML

    Details

    • Rank:
      44039

      Description

      Add this unit test to lib/tests/moodlelib_test.php.

          function test_shorten_text_multilang() {
              $text = '<span lang="en" class="multilang">A</span>' .
                      '<span lang="fr" class="multilang">B</span>';
              $this->assertEquals('<span lang="en" class="multilang">A</span>...', shorten_text($text, 1));
          }
      

      The actual result, right now, is '<span lang="en" ...</span>', so the test fails. Clearly, shorten_text should never return invalid HTML, so this test should pass.

        Activity

        Hide
        Tim Hunt added a comment - - edited

        If the use of mutilang confuses you, shorten_text('<span id="x" class="y">A</span> B', 1) also demonstrates the problem.

        Show
        Tim Hunt added a comment - - edited If the use of mutilang confuses you, shorten_text('<span id="x" class="y">A</span> B', 1) also demonstrates the problem.
        Hide
        David Mudrak added a comment -

        So, if a student wraps all their 2MB long forum post into a pair of <div> tags, the function would be unable to do anything, right? I am not sure that's what we want to achieve here ...

        Show
        David Mudrak added a comment - So, if a student wraps all their 2MB long forum post into a pair of <div> tags, the function would be unable to do anything, right? I am not sure that's what we want to achieve here ...
        Hide
        David Mudrak added a comment -

        Please note I agree with Tim's point that the function should not invalidate the input. For me, the conclusion is that this function can't cope with HTML conceptually.

        Show
        David Mudrak added a comment - Please note I agree with Tim's point that the function should not invalidate the input. For me, the conclusion is that this function can't cope with HTML conceptually.
        Hide
        Tim Hunt added a comment -

        <p>A very long text use <b>your imagination</b>, I can't be bothered to type that much.</p>

        Should be shortened to something like

        <p>A very long text use <b>your ...</b></p>

        and that works correctly.

        Show
        Tim Hunt added a comment - <p>A very long text use <b>your imagination</b>, I can't be bothered to type that much.</p> Should be shortened to something like <p>A very long text use <b>your ...</b></p> and that works correctly.
        Hide
        David Mudrak added a comment -

        That works correctly if you use examples like that. Please define the what the output should be for an input text like the following, given it is defined to shorten it to 100 characters.

        <h1>Lorem ipsum</h1>
        <p>Dolor sit amet, consectetur adipiscing elit. Proin pharetra placerat tempor. Sed ac risus sed felis tempor posuere condimentum ut sem. Maecenas mi velit, blandit vitae vulputate eu, ornare eget nulla. Lorem ipsum dolor sit amet, consectetur adipiscing elit. In eget erat quam, eu blandit dolor. Sed ac velit ut sem mattis blandit. Integer sed sem non est hendrerit lacinia a sed libero. Sed adipiscing luctus leo eget porttitor. Morbi leo dui, congue ut sodales ac, posuere sit amet magna. Donec viverra lacus vel ligula mattis tincidunt. Aenean eget magna est. Aliquam erat volutpat. Curabitur id dolor erat. Vivamus porta mi sit amet lorem varius malesuada. Mauris sit amet arcu vitae nunc hendrerit mattis at at sem. Nullam vulputate arcu a nibh scelerisque aliquam. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Curabitur ut nunc a mauris pretium viverra quis vel libero. Aenean scelerisque scelerisque posuere. Proin euismod bibendum congue. Aenean egestas, erat at mollis semper, sapien odio scelerisque mauris, at facilisis mauris ipsum quis sapien. Aliquam semper, diam sed pellentesque sollicitudin, sem augue scelerisque mi, eget faucibus nunc urna quis felis. Pellentesque vel feugiat massa. Sed porttitor vestibulum rhoncus. Vivamus nulla dolor, scelerisque et vestibulum ut, facilisis sed velit. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Fusce quis neque enim, ut eleifend nisi. Nunc non eros nulla. Proin sed sapien ipsum. Maecenas nec diam felis. Duis tristique lacinia dignissim. Curabitur velit purus, sodales eu feugiat id, pharetra non velit. Ut fermentum convallis nisl nec egestas. Vestibulum et facilisis tortor. Nunc egestas felis et mauris posuere imperdiet. Nunc id condimentum metus. Sed quis dolor sed lorem vehicula sollicitudin ac ut turpis. Nulla facilisi. Suspendisse potenti. Sed suscipit porta nisl at egestas. Praesent purus massa, laoreet sed euismod a, viverra in est. Aenean vitae sapien vitae leo condimentum posuere sit amet eu neque. Sed lacus dui, viverra sit amet pellentesque eget, iaculis porttitor ante. Integer nisl eros, vulputate at rutrum at, molestie lacinia nisi. Duis orci urna, rutrum eu imperdiet non, consequat sed metus. Suspendisse commodo vestibulum mauris, vitae tincidunt nibh egestas interdum. In hendrerit fringilla adipiscing. Integer <strong>vel iaculis augue</strong>.
        </p>

        Show
        David Mudrak added a comment - That works correctly if you use examples like that. Please define the what the output should be for an input text like the following, given it is defined to shorten it to 100 characters. <h1>Lorem ipsum</h1> <p>Dolor sit amet, consectetur adipiscing elit. Proin pharetra placerat tempor. Sed ac risus sed felis tempor posuere condimentum ut sem. Maecenas mi velit, blandit vitae vulputate eu, ornare eget nulla. Lorem ipsum dolor sit amet, consectetur adipiscing elit. In eget erat quam, eu blandit dolor. Sed ac velit ut sem mattis blandit. Integer sed sem non est hendrerit lacinia a sed libero. Sed adipiscing luctus leo eget porttitor. Morbi leo dui, congue ut sodales ac, posuere sit amet magna. Donec viverra lacus vel ligula mattis tincidunt. Aenean eget magna est. Aliquam erat volutpat. Curabitur id dolor erat. Vivamus porta mi sit amet lorem varius malesuada. Mauris sit amet arcu vitae nunc hendrerit mattis at at sem. Nullam vulputate arcu a nibh scelerisque aliquam. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Curabitur ut nunc a mauris pretium viverra quis vel libero. Aenean scelerisque scelerisque posuere. Proin euismod bibendum congue. Aenean egestas, erat at mollis semper, sapien odio scelerisque mauris, at facilisis mauris ipsum quis sapien. Aliquam semper, diam sed pellentesque sollicitudin, sem augue scelerisque mi, eget faucibus nunc urna quis felis. Pellentesque vel feugiat massa. Sed porttitor vestibulum rhoncus. Vivamus nulla dolor, scelerisque et vestibulum ut, facilisis sed velit. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Fusce quis neque enim, ut eleifend nisi. Nunc non eros nulla. Proin sed sapien ipsum. Maecenas nec diam felis. Duis tristique lacinia dignissim. Curabitur velit purus, sodales eu feugiat id, pharetra non velit. Ut fermentum convallis nisl nec egestas. Vestibulum et facilisis tortor. Nunc egestas felis et mauris posuere imperdiet. Nunc id condimentum metus. Sed quis dolor sed lorem vehicula sollicitudin ac ut turpis. Nulla facilisi. Suspendisse potenti. Sed suscipit porta nisl at egestas. Praesent purus massa, laoreet sed euismod a, viverra in est. Aenean vitae sapien vitae leo condimentum posuere sit amet eu neque. Sed lacus dui, viverra sit amet pellentesque eget, iaculis porttitor ante. Integer nisl eros, vulputate at rutrum at, molestie lacinia nisi. Duis orci urna, rutrum eu imperdiet non, consequat sed metus. Suspendisse commodo vestibulum mauris, vitae tincidunt nibh egestas interdum. In hendrerit fringilla adipiscing. Integer <strong>vel iaculis augue</strong>. </p>
        Hide
        David Mudrak added a comment -

        Ah, right. We would take just the first X chars from the <p> and then close the all open tags. So we need some stack of open tags as we parse the text...

        Show
        David Mudrak added a comment - Ah, right. We would take just the first X chars from the <p> and then close the all open tags. So we need some stack of open tags as we parse the text...
        Hide
        Tim Hunt added a comment -

        David, have you looked at the existing implementation of shorten_text? That is how it does work!

        Show
        Tim Hunt added a comment - David, have you looked at the existing implementation of shorten_text? That is how it does work!
        Hide
        Chris Follin added a comment -

        We're having a problem that I think fits with this ticket.

        When text is shortened, open tags get closed but open quotes may not be closed. This is from a forum description that has been shortened:

        <p><span style="font-size: ...</span></p></div></td>
        <td class="cell c3 lastcol" style="text-align:center;"><a href="view.php?f=5" >0</a></td>
        </tr>
        

        It breaks in the middle of the span tag and the span tag is closed as expected, but the break happened in the middle of the style attribute and the open quote wasn't closed. So the browser sees everything from style=" to the next quote in class="cell to still be inside the quoted style attribute. The end result in our case is that content shows up in the wrong column. See attached screenshot for a fairly simple example on the mod/forum/index.php page, which is where the code above was generated.

        Show
        Chris Follin added a comment - We're having a problem that I think fits with this ticket. When text is shortened, open tags get closed but open quotes may not be closed. This is from a forum description that has been shortened: <p><span style="font-size: ...</span></p></div></td> <td class= "cell c3 lastcol" style= "text-align:center;" ><a href= "view.php?f=5" >0</a></td> </tr> It breaks in the middle of the span tag and the span tag is closed as expected, but the break happened in the middle of the style attribute and the open quote wasn't closed. So the browser sees everything from style=" to the next quote in class="cell to still be inside the quoted style attribute. The end result in our case is that content shows up in the wrong column. See attached screenshot for a fairly simple example on the mod/forum/index.php page, which is where the code above was generated.
        Hide
        Petr Škoda added a comment -

        HTML Purifier is designed to "fix" any html fragment, the only problem might be that is strips JS, Flash, etc. But I suppose that is not a big deal when shortening random html, right?

        YOu can test it at http://htmlpurifier.org/demo.php

        Show
        Petr Škoda added a comment - HTML Purifier is designed to "fix" any html fragment, the only problem might be that is strips JS, Flash, etc. But I suppose that is not a big deal when shortening random html, right? YOu can test it at http://htmlpurifier.org/demo.php
        Hide
        Tim Hunt added a comment -

        I'm taking this. The bug is in the if (!$exact) { bit towards the end.

        Show
        Tim Hunt added a comment - I'm taking this. The bug is in the if (!$exact) { bit towards the end.
        Hide
        Tim Hunt added a comment -

        Right, I think this does it.

        I am afraid that I fixed codechecker issues as part of the same commit, which makes it harder to peer review. The main change is moving the if (!$exact) { bit inside the if ($total_length + $content_length > $ideal) { case.

        I broke the unit tests into more smaller tests, because that makes it easier to see what is going on when some tests are failing.

        In some tests, I changed the $ideal value by +/- 1. I added comments in the tests to make it easier to count characters in the input strings. I think with careful counting, my new values for $idea are better.

        Comments please.

        Show
        Tim Hunt added a comment - Right, I think this does it. I am afraid that I fixed codechecker issues as part of the same commit, which makes it harder to peer review. The main change is moving the if (!$exact) { bit inside the if ($total_length + $content_length > $ideal) { case. I broke the unit tests into more smaller tests, because that makes it easier to see what is going on when some tests are failing. In some tests, I changed the $ideal value by +/- 1. I added comments in the tests to make it easier to count characters in the input strings. I think with careful counting, my new values for $idea are better. Comments please.
        Hide
        Tim Hunt added a comment -

        We had a bug at the OU where somoene typed the wrong thing into an input box, and the whole course page vanished because shorten text produced invalid HTML. (Oops!) hence our sudden desire to get this fixed and integrated.

        Show
        Tim Hunt added a comment - We had a bug at the OU where somoene typed the wrong thing into an input box, and the whole course page vanished because shorten text produced invalid HTML. (Oops!) hence our sudden desire to get this fixed and integrated.
        Hide
        Jason Fowler added a comment -

        Hit the wrong button

        Show
        Jason Fowler added a comment - Hit the wrong button
        Hide
        Dan Poltawski added a comment -

        Hi Tim,

        I think this is issue is a great example of why coding style issues should be separated in their own commit because it really makes it hard to work out what is going on in your patch. You are changing code which you are not doing any other change with and I think this is really bad for the code history of Moodle - its not like you are changing all the associated code anyway.

        Otherwise I agree with your unit test changes and while i've looked over the code a few times, the unit tests are really the things which make me confident to say it makes sense for integration.

        I assume that test_shorten_text_multilang() was what was breaking for you in the OU and now is fixed by this change.

        Show
        Dan Poltawski added a comment - Hi Tim, I think this is issue is a great example of why coding style issues should be separated in their own commit because it really makes it hard to work out what is going on in your patch. You are changing code which you are not doing any other change with and I think this is really bad for the code history of Moodle - its not like you are changing all the associated code anyway. Otherwise I agree with your unit test changes and while i've looked over the code a few times, the unit tests are really the things which make me confident to say it makes sense for integration. I assume that test_shorten_text_multilang() was what was breaking for you in the OU and now is fixed by this change.
        Hide
        Jean-Michel Vedrine added a comment -

        Hello Dan,
        I understand the problem about fixing coding issues on unchanged code and git history.
        But what I find really difficult when you don't do that is that it is really hard to distinguish in the CodeChecker output between "old" errors you don't want to fix and "new" errors you added in your commit.
        I had a hard time doing that on several issues, the only trick I found was to copy codechecker output with and without the fix and searching diffs between both, not really friendly.
        Do you see a better way or an improvement to CodeChecker that would simplify that task ?

        Show
        Jean-Michel Vedrine added a comment - Hello Dan, I understand the problem about fixing coding issues on unchanged code and git history. But what I find really difficult when you don't do that is that it is really hard to distinguish in the CodeChecker output between "old" errors you don't want to fix and "new" errors you added in your commit. I had a hard time doing that on several issues, the only trick I found was to copy codechecker output with and without the fix and searching diffs between both, not really friendly. Do you see a better way or an improvement to CodeChecker that would simplify that task ?
        Hide
        Tim Hunt added a comment -

        Submitting for integration. Thanks Dan.

        Show
        Tim Hunt added a comment - Submitting for integration. Thanks Dan.
        Hide
        Jean-Michel Vedrine added a comment -

        Oops, sorry, reading your comment again I see that you don't say to not fix coding style issues, just to put them in a separate commit ! So my comment is irrelevant, please just ignore it

        Show
        Jean-Michel Vedrine added a comment - Oops, sorry, reading your comment again I see that you don't say to not fix coding style issues, just to put them in a separate commit ! So my comment is irrelevant, please just ignore it
        Hide
        Tim Hunt added a comment -

        Jean-Michel, there is a feature request for that: CONTRIB-4163

        The simple approach, which I should know to do, is to first make some commit that fixes the codechecker issues, but does not change any functionality, then make a second commit with the actual change.

        Show
        Tim Hunt added a comment - Jean-Michel, there is a feature request for that: CONTRIB-4163 The simple approach, which I should know to do, is to first make some commit that fixes the codechecker issues, but does not change any functionality, then make a second commit with the actual change.
        Hide
        Jean-Michel Vedrine added a comment -

        Thanks Tim, this is my fault, I read Dan's comment to quickly.
        That said, CONTRIB-44163 would be a nice feature to have (like when recently I worked on the lesson code if you see what I mean!)

        Show
        Jean-Michel Vedrine added a comment - Thanks Tim, this is my fault, I read Dan's comment to quickly. That said, CONTRIB-44163 would be a nice feature to have (like when recently I worked on the lesson code if you see what I mean!)
        Hide
        Dan Poltawski added a comment -

        The other problem is that only fixing some coding style issues makes it inconsistent sometimes. I would advocate making the code match the surrounding code when doing basic bugfixes (so long as its not completely crap) in those cases - especially if targeting the stable branches.

        Show
        Dan Poltawski added a comment - The other problem is that only fixing some coding style issues makes it inconsistent sometimes. I would advocate making the code match the surrounding code when doing basic bugfixes (so long as its not completely crap) in those cases - especially if targeting the stable branches.
        Hide
        Tim Hunt added a comment -

        Right. That is why I ended up fixing all the issues in shorten_text, which is just one function alone in the hodge-podge that is moodlelib.

        Show
        Tim Hunt added a comment - Right. That is why I ended up fixing all the issues in shorten_text, which is just one function alone in the hodge-podge that is moodlelib.
        Hide
        Eloy Lafuente (stronk7) added a comment -

        Integrated (23, 24 & master), thanks!

        Show
        Eloy Lafuente (stronk7) added a comment - Integrated (23, 24 & master), thanks!
        Hide
        Eloy Lafuente (stronk7) added a comment -

        Passed.

        Show
        Eloy Lafuente (stronk7) added a comment - Passed.
        Hide
        Eloy Lafuente (stronk7) added a comment -

        I feel myself really alone tonight! So was time to push your fixes upstream!

        "Lest we forget. We will remember them."

        Thanks and ciao!

        Show
        Eloy Lafuente (stronk7) added a comment - I feel myself really alone tonight! So was time to push your fixes upstream! "Lest we forget. We will remember them." Thanks and ciao!

          People

          • Votes:
            3 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: