Saturday, 20 February 2010

Linux How-to: Incremental Backups using Tar (re-post)

Original Article: 27/03/2008
I won't go through the catalogue of computer disasters I've had to recover, but let's just say there's no substitute for a sound backup strategy. Let's assume you've created a full system backup as your baseline, using a tool such as Partimage to create a snapshot of your Ubuntu installation. You need to go incremental...
Wait -

You have created a baseline backup, haven't you? Good. Just checking.

You verified it as okay, didn't you? You wouldn't want to try a system restore and find your backup corrupted. Yes? Excellent.

The trouble with full backups is the time they take to run and the storage space they use up. What do you do next week and the week after, when all that's changed are your household accounts, the pictures folder and that really useful sticky notes program you installed after reading a review in Full Circle? You don't need another full backup; that's a waste of time and storage. You need to go incremental.

Backup Types
Unlike a full backup, which enables you to restore your entire system to a point in time, an incremental backup only includes the subset of files changed since a chosen point in time; incremental backups typically follow the pattern 'back up all files changed since <date>'. You can also create a type of differential backup which compares the current state against that listed in a stored snapshot file, backing up only the files changed since then.

Not only do incremental backups minimize the time and storage taken, a series of incremental backups can also be used to rebuild your data through a number of steps in time.

That's enough of the theory, let's look at the practical side.

Practical Magic
The tar program (short for Tape ARchive, but you can write to almost any media these days) is an archiving application designed to store files in, and extract them from, a single file known as a tarfile. From the outside it looks like any other file; on the inside the tar program records the folder structures and properties of the files being archived. Tar doesn't compress anything by itself, but it can pass the archive through a compressor such as gzip (the z option used below), squashing the contents to a smaller size for storage. When you extract a tarfile, like one of those self-inflating rubber dinghies, the archived files are extracted back to their original size and structure. If you know Stuffit on the Mac or PKZip on Windows, tar does the same job, only better. There's a graphical front-end called File Roller (also known as Archive Manager in the Ubuntu menus) which uses the tar program as its engine. This only supports the basics of tar, but we're going to use more advanced options from the command line in a terminal. The command tar --help will display a summary of the options; man tar displays the full manual page.

Low Tar
We'd better start with the basics. I could use tar to back up the whole filesystem from the root down, using:
sudo tar cvpzf /media/usbdrive/myPC_ubuntu710_bu010408.tar.gz /

Here's how that breaks down:
sudo    I need to run tar with root permissions to access everything in the filesystem. I wouldn't need it just to back up my own home folder as plain old robin@myPC.
tar    the backup command itself.
cvpzf    the tar options - we're using the short forms to cut down the typing:
  c    create a new archive.
  v    verbose mode - list the files on-screen as they are written into the archive.
  p    preserve-permissions - maintains the original file attributes in the archive.
  z    compress the archive file through the gzip algorithm.
  f    use the archive file specified next.
/media/usbdrive/    the folder in which to write the archive. This is an example mount point for my external USB drive (the device /dev/sdb2); yours will differ. Tar writes the file to the mounted filesystem, not to the device name itself. The destination could equally be a DVD writer, tape drive or network drive.
myPC_ubuntu710_bu010408.tar.gz    the file name to use for the archive.
/    the source of the files to back up, namely the root of my filesystem. Tar will include everything under it.

Tip: do this often and you get a long list of archives, so I've added a bit of intelligence in my naming convention by including some identifiers for the machine, operating system, version and extras like 'bu' for backup and 010408 representing the date of my backup (just don't use periods or slashes, more below).
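If you'd rather try the basic options on something smaller than the whole filesystem first, here is a minimal sketch. It assumes GNU tar and a POSIX shell; the scratch folder, the file names and the 'myPC_scratch' prefix are all made up for illustration, and the date stamp is generated with the date command rather than typed by hand.

```shell
#!/bin/sh
# Sketch only: back up a throwaway folder using the c, p, z and f options.
set -eu

work=$(mktemp -d)                     # scratch area, nothing real is touched
mkdir -p "$work/src/docs"
echo "household accounts" > "$work/src/docs/accounts.txt"

stamp=$(date +%d%m%y)                 # day, month, year, e.g. 010408
archive="$work/myPC_scratch_bu${stamp}.tar.gz"   # hypothetical naming scheme

# -C changes into $work first, so the archive stores the relative path src/...
tar cpzf "$archive" -C "$work" src

# List what went in, to confirm the archive is readable.
tar tzf "$archive"

# The scratch area is left in place for inspection; remove it with: rm -rf "$work"
```

No sudo is needed here because everything lives in a folder you own.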

High Tar
However, I don't want to back up the whole thing, I want an increment, so I'm going to use the --newer option.

The --newer 'date', or -N 'date', option specifies “only work on files whose modification or status change date are newer than the date given.” You can also use --after-date 'date'. A file is considered to have changed if its contents, owner or permissions have been changed after the date given. This is the modification-date, not create-date. The entire 'date' parameter must be in quotes if it contains any spaces. So in a terminal I type:
sudo tar cvpzf myPC_ubuntu710_bu010408.tar.gz --newer '1 Apr 2008' /

Tar is good at parsing the parameter for dates, so I can also specify:
--newer-mtime '2 days ago'
--after-date='10 days ago'

In verbose-mode, tar will respond telling you how it's interpreting your dates:
tar: Option --after-date: Treating date `10 days ago' as 2008-05-11 13:19:37.232434
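To see the date filter at work without waiting ten days, you can fake an old file. This is a rough sketch assuming GNU tar and GNU touch (for the -d option); the folder and file names are invented for the test.

```shell
#!/bin/sh
# Sketch only: show that --newer-mtime skips files older than the given date.
set -eu

work=$(mktemp -d)
mkdir "$work/src"
echo "old notes" > "$work/src/old.txt"
echo "new notes" > "$work/src/new.txt"

# Pretend old.txt was last modified ten days ago (GNU touch syntax).
touch -d '10 days ago' "$work/src/old.txt"

# Only files modified within the last two days should be archived.
tar czf "$work/increment.tar.gz" --newer-mtime='2 days ago' -C "$work" src

listing=$(tar tzf "$work/increment.tar.gz")
echo "$listing"                       # new.txt should be listed, old.txt should not
```

I've used --newer-mtime here because it looks only at the modification time, which is what touch changes.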

In or Out?
There's a whole stack of folders and files I don't want to back up because they are volatile and temporary – temp files, cache files, package files and the like. I explicitly don't want to tar the mount points for other devices, especially not the storage media holding my backups – I could end up with a tar file that tries to include itself and the entire network! Tar has an --exclude option so I can specify what to leave out. I usually string together a stack of exclusions:
--exclude="/proc/*" : a virtual folder of kernel and process information, not real files
--exclude="/lost+found/*" : recovered file segments and trash
--exclude="/sys/*" : the kernel's view of your hardware
--exclude="cache" : cache folder for browsers and the like
--exclude="/tmp/*" : temporary files
--exclude="/var/cache/apt/*" : cache for installable package files

The next three all look like folders in the file system, so you need to tell tar not to include them:
--exclude="/dev/*" : device files for your hardware and plug-in devices
--exclude="/mnt/*" : mount points for your drives
--exclude="/media/*" : mount points for storage media

Remember the /* at the end to exclude everything beneath the folder specified, while keeping the (empty) folder itself in the archive.
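The two styles of exclusion behave slightly differently, which is easier to see on a small test folder. The sketch below assumes GNU tar; the folder layout is invented purely to show a path ending in /* next to a bare name like cache.

```shell
#!/bin/sh
# Sketch only: compare --exclude='folder/*' with --exclude='name'.
set -eu

work=$(mktemp -d)
mkdir -p "$work/src/docs" "$work/src/tmp" "$work/src/cache"
echo "keep me" > "$work/src/docs/notes.txt"
echo "junk"    > "$work/src/tmp/scratch.tmp"
echo "junk"    > "$work/src/cache/page.html"

# Put the exclusions before the folder to back up, so they apply to it.
tar czf "$work/bu.tar.gz" -C "$work" \
    --exclude='src/tmp/*' \
    --exclude='cache' \
    src

listing=$(tar tzf "$work/bu.tar.gz")
echo "$listing"
# src/tmp/ is kept as an empty folder; anything named cache is dropped entirely.
```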

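After you have made a backup, you should check, or verify, it using the --compare (-d) option, specifying verbose and the filename. Tar compares each archived file with the file of the same name under the current folder and reports any differences, as in: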
tar dvf myPC_ubuntu710_bu010408.tar.gz
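Here is a small sketch of what a passing and a failing verify look like, assuming GNU tar. The file name and contents are invented; the point is only that tar returns a non-zero exit status once the file on disk no longer matches the archive.

```shell
#!/bin/sh
# Sketch only: verify an archive before and after the source file changes.
set -eu

work=$(mktemp -d)
mkdir "$work/src"
echo "balance: 100" > "$work/src/accounts.txt"
tar czf "$work/bu.tar.gz" -C "$work" src

# First check: nothing has changed, so the compare should succeed.
if tar dzf "$work/bu.tar.gz" -C "$work"; then first=match; else first=differ; fi

# Edit the file, then check again: tar should now report differences.
echo "balance: 250 (edited)" >> "$work/src/accounts.txt"
if tar dzf "$work/bu.tar.gz" -C "$work"; then second=match; else second=differ; fi

echo "before edit: $first, after edit: $second"
```

In a backup script you could test that exit status and refuse to delete older backups unless the verify passed.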

Restoring files with tar
The --extract (-x) option for tar extracts files from an archive:
tar xpvf myPC_ubuntu710_bu010408.tar.gz

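You can also specify files or complete folders (with sub-folders) to extract. Tar strips the leading / from path names when it creates an archive, so name them without it:
tar xpvf myPC_ubuntu710_bu010408.tar.gz home/Documents    extracts the whole Documents folder
tar xpvf myPC_ubuntu710_bu010408.tar.gz --wildcards 'home/Pictures/*.png'    extracts only png image files in the Pictures folder

GNU tar needs the --wildcards option before it will treat * as a pattern when extracting, and the pattern should be quoted so the shell leaves it alone.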
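A selective restore is worth rehearsing before you need it in anger. This sketch assumes GNU tar (for --wildcards); the Documents and Pictures folders and the restore folder are stand-ins created just for the test.

```shell
#!/bin/sh
# Sketch only: restore one folder, then only the png files from another.
set -eu

work=$(mktemp -d)
mkdir -p "$work/src/Documents" "$work/src/Pictures" "$work/restore"
echo "letter" > "$work/src/Documents/letter.txt"
echo "png"    > "$work/src/Pictures/cat.png"
echo "jpg"    > "$work/src/Pictures/dog.jpg"

tar czf "$work/bu.tar.gz" -C "$work/src" Documents Pictures

# Restore the whole Documents folder into a separate restore area.
tar xpzf "$work/bu.tar.gz" -C "$work/restore" Documents

# Restore only the png files; the pattern is quoted so tar sees the *.
tar xpzf "$work/bu.tar.gz" -C "$work/restore" --wildcards 'Pictures/*.png'

ls -R "$work/restore"
```

Restoring into a separate folder with -C, rather than over the originals, lets you inspect the result before copying anything back.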

If you create several generations of incremental backups, then to restore the exact contents as at the time the last generation was created you will need to start with your full baseline and restore each increment in sequence.

Use the --list (-t) option, if you just want to see what files are in a tar file:
tar tf myPC_ubuntu710_bu010408.tar.gz

Do the Differential Shuffle
Hang on to your hats: remember the differential update described earlier? The option --listed-incremental instructs tar to operate on an incremental archive with additional reference data (or metadata) stored in a standalone file, called a snapshot file. This helps determine which files have been changed, added or deleted since the last backup, so that the next incremental backup will contain only modified files. The name of the snapshot file is given as a parameter:
tar cpfvz myPC_ubuntu710_bu010408.tar.gz \
  --listed-incremental=myPC_ubuntu710_snapshot_010408 /

This creates an incremental backup of the '/' filesystem in the tarfile 'myPC_ubuntu710_bu010408.tar.gz', using the metadata list in the snapshot 'myPC_ubuntu710_snapshot_010408'. Tar looks in the snapshot to work out which files have been modified since the snapshot was taken. If this file does not exist, it will be created as a new snapshot listing what is backed up in this increment, which is everything unless I add more parameters to be specific.

Note that the snapshot file will be updated with the list of modified files backed up this time around. The next time you run this, specifying the same snapshot file, it updates again, so you may want to keep working copies of your snapshot files corresponding to your backup increments.

The best way to get the feel of this is to create a scratch folder containing a handful of text files and create some incremental tars from this so you can see how it operates. Modify a couple of text files between increments and inspect your snapshot file after each one in a text editor; you'll see how it changes.
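That experiment might look something like the sketch below. It assumes GNU tar; the scratch folder, the three text files and the level0/level1 archive names are my own inventions, and the one-second pause is only there to keep the timestamps clearly apart.

```shell
#!/bin/sh
# Sketch only: a baseline plus one increment, driven by a snapshot file.
set -eu

work=$(mktemp -d)
mkdir "$work/scratch"
echo "one" > "$work/scratch/a.txt"
echo "two" > "$work/scratch/b.txt"

# Level 0: the snapshot does not exist yet, so everything is backed up.
tar czf "$work/level0.tar.gz" --listed-incremental="$work/snapshot" \
    -C "$work" scratch

# Keep a working copy of the snapshot as it stood after the baseline.
cp "$work/snapshot" "$work/snapshot.level0"

sleep 1
echo "changed" >> "$work/scratch/a.txt"    # modify one file
echo "three"   >  "$work/scratch/c.txt"    # add a new one

# Level 1: same snapshot file, so only the changes are backed up.
tar czf "$work/level1.tar.gz" --listed-incremental="$work/snapshot" \
    -C "$work" scratch

level1=$(tar tzf "$work/level1.tar.gz")
echo "$level1"                             # a.txt and c.txt, but not b.txt
```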

To extract from this type of archive, you need to use the extract option along with --listed-incremental. Tar doesn't need the snapshot file to extract a listed-incremental archive, only to create one, so you can give /dev/null as the snapshot name (--listed-incremental=/dev/null).

Important note: when extracting from these listed-incremental archives, tar attempts to restore the exact state of the file system according to the snapshot, so it will attempt to delete files and folders that did not exist when the archive was created.
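The delete-on-restore behaviour is easy to miss, so here is a sketch that restores a baseline and one increment in sequence into an empty folder. It assumes GNU tar; the data folder and its three files are invented, and /dev/null stands in for the snapshot file since none is needed to extract.

```shell
#!/bin/sh
# Sketch only: restoring increments in order also removes deleted files.
set -eu

work=$(mktemp -d)
mkdir "$work/data" "$work/restore"
echo "keep"   > "$work/data/keep.txt"
echo "doomed" > "$work/data/old.txt"

# Baseline (level 0).
tar czf "$work/level0.tar.gz" --listed-incremental="$work/snapshot" \
    -C "$work" data

sleep 1
rm "$work/data/old.txt"                    # a file is deleted...
echo "fresh" > "$work/data/new.txt"        # ...and another is added

# Increment (level 1).
tar czf "$work/level1.tar.gz" --listed-incremental="$work/snapshot" \
    -C "$work" data

# Restore in sequence: baseline first, then the increment.
tar xzf "$work/level0.tar.gz" --listed-incremental=/dev/null -C "$work/restore"
tar xzf "$work/level1.tar.gz" --listed-incremental=/dev/null -C "$work/restore"

ls "$work/restore/data"                    # old.txt should be gone again
```

After the first extract old.txt is back; the second extract removes it, because the increment records that the folder no longer contained it.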

There's more to the --listed-incremental and --incremental options if you want to research further.

Danger, Will Robinson...
This wouldn't be a proper How-to without some words of warning:
  • Perhaps stating the obvious, but extracting a tar file will overwrite any folders/files of the same name on the target drive. If you overwrite a whole folder structure you are effectively doing a 'roll-back' in time. You may want to be selective rather than extract /home/* and wipe your latest baby photos!
  • Tar doesn't do file-compares when extracting, say on modification dates or permission bits that may have changed. You may need other tools to find and compare the existing filesystem state with lists of files that have been previously backed up. Scripts and programs for doing this can be found on Linux ftp sites.
  • Be aware of file size limits when writing your tar files to your destination drive: you may run into problems at 2 gigs on some old kernels and filesystems. Tar used to have a limit of 8 gigs or so, assuming the underlying kernel/filesystem would allow it. A DOS/Windows FAT32 filesystem, for instance, will only allow files up to 4 GB.
  • Extracting a tarfile restores from the backup exactly what you tell it, including those compromising images you deleted from public folders three months ago...
  • Punctuation characters in file names: some programs like tar assume / and . are delimiters between command options and throw back strange, even dangerous results. Underscores are safe to use as separators.
  • Compression failure: previously, there were reports of problems with older versions of tar where just a single corrupted bit in a compressed tar file could render it unusable. This is more a symptom of bad backup media – duff disks or DVD writers for instance. I've not had this problem in any current version.

What's in an Increment?
Bear in mind the relative benefits of full versus incremental backups. Too long a series of incremental backups will start to eat up storage space and lengthen the time taken to restore the whole series, so remember to create a new baseline periodically.  RC
