  1. Moodle
  2. MDL-70038

Implement Poppler pdftoppm compatibility for faster assignment submission PDF to PNG conversion


    • MOODLE_311_STABLE, MOODLE_38_STABLE, MOODLE_400_STABLE
    • MOODLE_311_STABLE
    • MDL-70038-master

      Prerequisites

      This test requires Ghostscript and Poppler to be installed on the server.

      You need a PDF file; the larger it is, the more noticeable the difference in conversion time will be.

      Test

      1. Go to the Site administration > Server > System paths page (admin/settings.php?section=systempaths).
      2. Verify that the Path to ghostscript setting points to your local gs path (make sure it is set correctly).
      3. Verify that the Path to pdftoppm setting is empty.
      4. Verify the queue of conversions is empty (table assignfeedback_editpdf_queue is empty).
      5. Create a course.
      6. Add an assignment with default options:
        • Feedback types: Annotate PDF.
      7. As a student, submit a PDF to the assignment.
      8. Verify the database table assignfeedback_editpdf_queue has 1 record for your submission.
      9. Run the scheduled task:

        php admin/cli/scheduled_task.php --execute='\assignfeedback_editpdf\task\convert_submissions'


        This execution will use gs to generate PNG files from your PDF file. Record the total execution time.

      10. As admin or teacher, go to grade the submission.
      11. Confirm you see the submission content.
      12. Set up the path for the pdftoppm tool:
        • Go to the Site administration > Server > System paths page (admin/settings.php?section=systempaths).
        • Add your local pdftoppm path to the Path to pdftoppm setting (make sure it is set correctly).
      13. Repeat steps 7-11, using the same file (remove and add it again) and the same assignment.
      14. Confirm the new time (pdftoppm) is lower than the first one (gs).
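Steps 9 and 14 compare two execution times without saying how to record them. One possible way is sketched here; `elapsed_seconds` is a hypothetical helper, not part of Moodle, and it assumes a `date` that supports `+%s`. It only has one-second resolution, which is enough for conversions that take minutes.

```shell
#!/bin/sh
# Hypothetical helper: run a command and print the elapsed wall-clock
# time in whole seconds on stdout.
# The command's own stdout is redirected to stderr, so stdout carries
# only the number of seconds.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >&2
    end=$(date +%s)
    echo $((end - start))
}

# Demo with a stand-in command (about 1 second):
elapsed_seconds sleep 1

# Intended usage, from the Moodle root, once without and once with the
# pdftoppm path configured:
#   elapsed_seconds php admin/cli/scheduled_task.php \
#       --execute='\assignfeedback_editpdf\task\convert_submissions'
```

The shell's built-in `time`, as used in the measurements further down this issue, gives the same information with finer resolution.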

      This issue is somewhat related to MDL-57202.

      Everything I describe here relates to the file mod/assign/feedback/editpdf/classes/pdf.php and the gs command it builds to extract a single page as a PNG image.

      Context:

      Linux distributions provide a package called poppler-utils or poppler, depending on the distribution, which includes a tool named pdftoppm. This tool can convert single pages (or all pages at once) much faster than ghostscript: roughly two orders of magnitude faster in my measurements below.

      Why is this process slow using ghostscript (gs)? From what I have read, it is because gs converts the whole document to PDF (again) and then extracts the requested pages. The reason is that a PDF can be complex enough that content on different pages may affect the final rendering of a specific page.

      Why is pdftoppm faster? I honestly don't know. I searched for an explanation, but without success. However, pdftoppm and the other tools in the Poppler project are open source too; see the Poppler project (https://poppler.freedesktop.org/).

      Proposal:

      Here is my proposal, for which I attach a patch:

      1. Add a setting for the pdftoppm path, in the same way we do for the gs path (for example, /usr/bin/gs).
      2. If the pdftoppm path is set, use pdftoppm to convert PDF to PNG. If the setting is empty, Moodle will convert using gs as it does today.
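The fallback described in point 2 can be illustrated with a small shell sketch. This is not the patch itself (which lives in PHP in pdf.php); `build_page_command` is a hypothetical function that only prints the command that would be run, using the flags from the two measurements shown further down in this issue. It assumes paths without spaces; real code would have to escape every argument.

```shell
#!/bin/sh
# Hypothetical helper: print the command that converts one page to PNG.
#   $1 = pdftoppm path (empty string means "setting not configured")
#   $2 = gs path
#   $3 = input PDF
#   $4 = page number (1-based for both tools)
#   $5 = output name without the .png extension
build_page_command() {
    pdftoppm_path=$1
    gs_path=$2
    pdf=$3
    page=$4
    out=$5
    if [ -n "$pdftoppm_path" ]; then
        # pdftoppm appends the .png extension to the output name itself.
        printf '%s -q -f %s -l %s -png -singlefile %s %s\n' \
            "$pdftoppm_path" "$page" "$page" "$pdf" "$out"
    else
        # Fallback: same gs flags as the gs measurement in this issue.
        printf '%s -q -sDEVICE=png16m -dSAFER -dBATCH -dNOPAUSE -r100 -dFirstPage=%s -dLastPage=%s -dDOINTERPOLATE -dGraphicsAlphaBits=4 -dTextAlphaBits=4 -sOutputFile=%s.png %s\n' \
            "$gs_path" "$page" "$page" "$out" "$pdf"
    fi
}

# Setting empty: the gs command is printed.
build_page_command "" /usr/bin/gs submission.pdf 46 pag46
# Setting filled in: the pdftoppm command is printed.
build_page_command /usr/bin/pdftoppm /usr/bin/gs submission.pdf 46 pag46
```

Keeping gs as the default means sites that never configure pdftoppm see no change in behaviour.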

      Performance analysis:

      In our Moodle we have Architecture studies, where the students' final works are very large, with high-quality images, resulting in PDF documents of more than 250 MB. On our site this makes the queue of submissions waiting to be converted to PNG images easily grow beyond 2K within a couple of days, even in normal periods without exams or deadlines.

      Here is an example comparing the performance of both tools on my local computer when extracting a single page, page 46:

       

      jordi@jpax360:~/pdf_conversion $ time pdftoppm -q -f 46 -l 46 -png -singlefile 20200623_combined.pdf pag46
      real 0m1,400s
      user 0m1,343s
      sys 0m0,024s

      jordi@jpax360:~/SREd/pdf_conversion $ time gs -q -sDEVICE=png16m -dSAFER -dBATCH -dNOPAUSE -r100 -dFirstPage=46 -dLastPage=46 -dDOINTERPOLATE -dGraphicsAlphaBits=4 -dTextAlphaBits=4 -sOutputFile=pag46_gs.png 20200623_combined.pdf
      real 3m59,952s
      user 3m57,231s
      sys 0m2,287s

      In this example, pdftoppm is about 171 times faster than gs for this single page.
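The factor of 171 follows from the two "real" times above (3m59,952s for gs against 0m1,400s for pdftoppm). A minimal sketch of the arithmetic, using two hypothetical helpers, `to_seconds` and `speedup`, that accept the decimal-comma format printed by `time` in this locale:

```shell
#!/bin/sh
# Hypothetical helper: convert a `time` reading such as "3m59,952s"
# (decimal comma) into seconds with three decimals.
to_seconds() {
    echo "$1" | tr ',' '.' | LC_ALL=C awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Hypothetical helper: how many times faster the second reading is
# than the first, rounded to the nearest integer.
speedup() {
    slow=$(to_seconds "$1")
    fast=$(to_seconds "$2")
    LC_ALL=C awk -v s="$slow" -v f="$fast" 'BEGIN { printf "%.0f\n", s / f }'
}

to_seconds 3m59,952s            # prints 239.952
speedup 3m59,952s 0m1,400s      # prints 171
```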

      Another result from our patch:

      Running the scheduled task by hand ("php admin/tool/task/cli/schedule_task.php --execute='\assignfeedback_editpdf\task\convert_submissions'") with pdftoppm, it converts a 207 MB, 43-page document with high-quality images and details in 28m17,105s. That is, all 43 pages as PNG images in approximately 28 minutes.

      By contrast, the same scheduled task using the gs command took 19m1,642s to convert just the first page of the document. Just the first. To test this, I simply left the pdftoppm path setting empty in my test Moodle.
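To put both task runs on the same scale, the figures above can be reduced to seconds per page. The sketch below wraps the arithmetic in a hypothetical function, `per_page_summary`. It compares the pdftoppm average over 43 pages with the single gs page that was measured, so the ratio is only indicative: the time gs needs for the remaining pages was not measured.

```shell
#!/bin/sh
# Hypothetical helper: reduce the two task timings quoted in this issue
# to seconds per page and print a rough ratio.
per_page_summary() {
    LC_ALL=C awk 'BEGIN {
        pdftoppm_total = 28 * 60 + 17.105   # 28m17,105s for all 43 pages
        gs_first_page  = 19 * 60 + 1.642    # 19m1,642s for the first page only
        per_page = pdftoppm_total / 43
        printf "pdftoppm: %.1f s per page on average\n", per_page
        printf "gs: %.1f s for the first page (about %.0f times slower)\n", gs_first_page, gs_first_page / per_page
    }'
}

per_page_summary
# pdftoppm: 39.5 s per page on average
# gs: 1141.6 s for the first page (about 29 times slower)
```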

      The resulting PNG files are the same in both cases, whether using pdftoppm or gs.

      We have it in our production site already.

      I hope this helps improve this part of Moodle. For us it is a critical part, and a headache at the same time, since every week we have to check alerts for long-running cron.php processes.

       

            jpahullo Jordi Pujol-Ahulló
            jpahullo Jordi Pujol-Ahulló
            Noel De Martin Noel De Martin
            Victor Déniz Falcón Victor Déniz Falcón
            Bas Brands Bas Brands
