|
[
Permalink
| « Hide
]
Petr Skoda added a comment - 07/Dec/08 05:47 AM
thanks, going to review it later this week
Assigning to Petr
Here's a patch which replaces the html2text library with the attached GPL one.
The output is slightly different and we could help with more testing with different HTML inputs. However, I will need to commit this to the Debian package today as it is considered a release-critical issue for Debian. I'm cool with the patch going into Lenny.
It can also go into HEAD for further testing (but not 1.9.x yet). Passing to Dongsheng to run some performance comparisons (that's the main thing I'm worried about).
The new one is faster to process <p></p>
New: 0.0113739967346 Old: 0.183360099792 (500 p tags) but slower to process <img and <table> > 500 a tags
There is a setting to disable the processing of links (create a list of URLs at the end I think). Perhaps we should disable that to make it faster? Hi, Francois
Another test on links only: New: 0.126650094986 Old: 0.449314832687 Actually, the new lib faster in everything except <img, the funnest thing is the new lib directly ignore img tags, should we make it support process img tags? I modified this file, now, the new lib wins on all tests
Hi Dongsheng,
Here's a new version of the file. There are a few code changes, but the most important one is a security fix for possible code execution (see bottom of http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=508909 Can you please run your script against this one as well? Cheers, both OOP html2text versions are fater than the proceduring html2text,
new version is a bit slower the the old version, I guess it use more regex in code. Old OOP: 0.0414950847626 New OOP: 0.055447101593 proceduring: 0.39603304863 the testing text includes 500 links, images, tables and p tags each. New patch against 1.9 with the fixed version of html2text.php
New patch against 1.8 with the new version of html2text.php
New patch against HEAD with the fixed version of html2text.php
Francois - Since Martin's concern was with performance and the new function was shown to be faster and because of the licensing issues, should we not also apply the changes to 1.9 and 1.8? I came across this file when looking at another issue (
I am re-opening in the hopes that we can apply this to 1.8 and 1.9 and thus resolve
One last thought/question, if the licensing is an issue for Debian will they not want this fixed in the 1.8 Moodle package?
I don't mind applying it to 1.8 and 1.9. There are a few differences in the conversions from HTML to text but as far as I remember, it was a matter of the output being "different" (as opposed to being "wrong").
The Debian packages all use the new code already. Francois Committed the fix to 1.8 and 1.9 stable.
looks like we have a serious regression, please review the linked bug report, thanks
I have committed an updated version of this file to CVS, along with a readme describing its origin.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||