  1. Moodle
  2. MDL-68166

Write backup files to localcachedir first, then move in place

    Details

    • Testing Instructions:
      Setup

      1. Create a medium or large sized test course (Site administration => Development => Create test course)
      2. View the course
      3. Open its question bank
      4. Import the attached questions.xml
        1. Ignore any warnings - these are unrelated and I'll raise these as a separate issue when I get a chance
      5. Take a note of the courseid

      Baseline

      1. Checkout the latest weekly:

        git log version.php
        # Find the most recent weekly and grab the commit hash
        git checkout [hash]
        

      2. Open the console
      3. Run the following:

        time php admin/cli/backup.php --courseid=[courseid] --destination=/tmp
        

      4. Run it 5 times and take a note of the times
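
      Steps 3 and 4 can be scripted. The following is a minimal sketch rather than part of the patch: the function name time_runs and the output file name are made up for illustration, and it assumes GNU date (for the %N nanosecond format).

```shell
# Sketch only: time_runs and its arguments are illustrative names.
# Assumes GNU date (%N gives nanoseconds); BSD/macOS date does not support it.
time_runs() {
    runs=$1      # number of repetitions
    outfile=$2   # file that collects one elapsed time (in ms) per line
    shift 2      # the remaining arguments are the command to time
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s%N)
        "$@" >/dev/null 2>&1
        end=$(date +%s%N)
        # Convert nanoseconds to whole milliseconds and record the run.
        echo $(( (end - start) / 1000000 )) >> "$outfile"
        i=$((i + 1))
    done
}

# Example (replace [courseid] as in step 3):
# time_runs 5 before.txt php admin/cli/backup.php --courseid=[courseid] --destination=/tmp
```

      Each run appends one line to the output file, so the five timings are kept for the comparison later.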

      Patch and rerun

      1. Checkout the main branch again, e.g. for master:

        git checkout master
        

      2. Run another 5 backups

      Comparison

      1. Take an average of the before runs
      2. Average the after runs
      3. Compare the two
        1. Hopefully the numbers are about the same.
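
      The averaging can also be done on the command line. This is a sketch that assumes the elapsed time of each run was saved one value per line in two files; before.txt and after.txt are hypothetical names, as is the mean_ms helper.

```shell
# Sketch only: mean_ms is an illustrative helper, not part of Moodle.
# Prints the arithmetic mean of the numbers in the given file (one per line),
# or "n/a" if the file holds no values. Blank lines are skipped.
mean_ms() {
    awk 'NF { sum += $1; n++ }
         END { if (n > 0) printf "%.1f\n", sum / n; else print "n/a" }' "$1"
}

# Example:
# echo "before: $(mean_ms before.txt)  after: $(mean_ms after.txt)"
```

      With only 5 runs per side, a small difference between the two means is within the noise described in the notes below.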

      Notes

      1. The speed of this is difficult to compare as it depends highly on any other operations going on at the same time
      2. Technically there is an additional operation so there is a potential for it to be slower, but this is unlikely for most
      3. Whilst this test looks at time to complete, this isn't the whole story.
        On a networked file system this has the potential to reduce load on the storage system as replication is only required once, and not for each write
    • Affected Branches:
      MOODLE_39_STABLE
    • Pull Master Branch:
      MDL-68166-master-2

      Description

      Whilst looking at some performance stats I noticed that, on some of our systems, Moodle backups are the single biggest hit to IOPS.

      The reason behind this is that when we write a larger file, we write it in chunks.
      We have a buffer size (default 4096), and we write the file tag by tag.
      Once the buffer fills up with enough data, we push the current buffer to the file, and then continue filling the buffer.

      The same also happens with the mbz file itself. The initial file is opened, and then each part of it is appended to it.

      What this means is that, for large files in backup, the backups are really write heavy.

      For clustered environments this matters a lot. The files are written to the Moodle tempdir, which must be shared. Typically that sharing is over a system such as NFS. NFS does not handle those kinds of operations well.
      Where that remote system is a clustered file system such as GlusterFS, Ceph, etc. then a replication step is also required.

      The ideal solution would be to stop writing backups to the shared file system, but that is not something that we can currently achieve.

      However, we are able to make use of our separation between local cache and shared cache very easily.

      Rather than writing the XML files in small (4k) chunks straight to the tempdir, we can easily write them in exactly the same way to a per-request directory, and when complete move them to the final destination.
      Likewise we can make the same type of change for the mbz backup itself, writing it to localcachedir and then moving it into place once complete.
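
      The write-locally-then-move pattern can be illustrated outside Moodle with a few lines of shell. This is a sketch of the general idea only, not the code in the patch: stage_then_move and LOCAL_DIR are hypothetical names, and LOCAL_DIR merely stands in for a node-local directory such as localcachedir.

```shell
# Sketch only: stage_then_move and LOCAL_DIR are illustrative, not Moodle APIs.
# Reads data from stdin, stages it in a local directory, then moves it to $1.
stage_then_move() {
    dest=$1                        # final path, e.g. somewhere on the shared tempdir
    local_dir=${LOCAL_DIR:-/tmp}   # stand-in for a node-local cache directory
    staging=$(mktemp -d "$local_dir/staging.XXXXXX") || return 1
    # All the small incremental writes land on the local file system only.
    cat > "$staging/payload" || { rm -rf "$staging"; return 1; }
    # One move at the end: a rename when both paths share a file system,
    # a single copy of the finished file when they do not.
    mv "$staging/payload" "$dest" || { rm -rf "$staging"; return 1; }
    rmdir "$staging"
}

# Example (hypothetical producer and destination):
# generate_backup_xml | stage_then_move /path/to/shared/tempdir/course.xml
```

      The destination only ever receives the finished file, so the shared storage sees one write of the complete file rather than a long series of small appends.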

      On non-clustered systems where localcachedir and tempdir are on the same filesystem this will simply be an atomic move and incur no penalty.

      On clustered systems, or those where a different filesystem is in use, this is an additional step; however, moving a single complete file takes far fewer IOPS than writing it in small chunks.

      Right now we cannot just move backups out of tempdir - it's simply too big a change.

        Attachments

        1. 68166-post-1.png (2.23 MB)
        2. 68166-post-2.png (2.40 MB)
        3. 68166-pre.png (2.14 MB)
        4. questions.xml (4.04 MB)

              People

              Assignee:
              dobedobedoh Andrew Nicols
              Reporter:
              dobedobedoh Andrew Nicols
              Peer reviewer:
              Simey Lameze
              Participants:
              Component watchers:
              Adrian Greeve, Jake Dallimore, Mathew May, Mihail Geshoski, Peter Dias
              Votes:
              3
              Watchers:
              10

                Dates

                Created:
                Updated:

                  Time Tracking

                  Estimated:
                  Not Specified
                  Remaining:
                  0m
                  Logged:
                  1d 15m