On making backups
Nov. 25th, 2007 10:48 pm

So, a while back I wrote about external hard drives, partly because I was interested in performing backups. I also noticed some speed increases when I did backups to my external hard drive, so I decided to look into performing regular backups onto it in addition to the semi-regular backups I do onto DVDs.
Why back up to an external hard drive?
I don't have to waste a DVD every time I perform a backup, especially if I am making backups on a daily basis. Plus, the same amount of data can be backed up to my external hard drive in less time.
Note that I still back up to DVDs, since those are more durable and can easily be taken offsite. But those backups are done every few weeks at best, so backups to my external hard drive happen more frequently -- usually every few days.
What is backed up?
Various documents, my photography (I take a lot of pictures), source code for projects I am working on, my Moneydance financial data, and tarred/gzipped backups of websites that I manage.
What is not backed up?
Any movies and music that I have -- since those files are static (i.e., they never change), I just burn them to DVD when I have enough content to actually fill a DVD. There's simply no need for me to keep backing them up over and over. Sure, it's nice to have multiple copies of this stuff, but I simply do not have that much of a need for my MP3s. (And it's not like I can't re-rip them from my CD collection.) Any pictures older than a year are also burned to DVD and removed from my Pictures/ directory, since I am no longer working with them on a day-to-day basis.
Full directory structures, or tarballs?
For external hard drives, I discovered that copying full directory structures (i.e., cp -r wildcard dest) doesn't work out so well. First, there is overhead for writing each file and directory at the destination, and second, external hard drives don't handle large numbers of files gracefully. After I made a few file-based backups to the external hard drive, I discovered that whenever I connected it, mdimport would run for a full minute while the entire directory structure was traversed. Not fun.
I found creating tarballs to be a superior choice here, since only a single file is opened for writing, and it is just a bunch of data that keeps getting appended to it. Very simple and efficient. While getting individual files from the tarball might be a bit complicated, I do not anticipate having to do this often. And I always have the option of just extracting the entire tarball to a tmp directory on my machine and picking through the files.
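To make the comparison concrete, here is roughly what the two approaches look like on the command line. This is only a sketch -- the volume name, directories, and file name are examples, not my actual setup:

```sh
# Copying the directory tree: every file becomes a separate write on the
# external drive, and a separate item for mdimport to index later.
cp -R "$HOME/Data" /Volumes/External/backup/

# Writing a tarball: one file is opened on the external drive, and data is
# simply appended to it as tar walks the source directories.
tar cf /Volumes/External/backup-20071125.tar -C "$HOME" Data
```

Pulling a single file back out later is still possible with something like tar xf backup-20071125.tar Data/some/file, so nothing is really lost by going this route.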
The hardware and software
Just to clarify what kind of hardware and software I'm using for this little experiment:
- iMac G5 20" 2.0 Ghz
- 1 GB of RAM
- 250 Gig (really 232.89 GB) internal IDE drive
- OS/X 10.4.10
- External drive: Firewire Interface: OWC Mercury Elite, 76.69 GB
The results!
I wrote a shell script to perform backups. It processes command-line options (such as whether to use compression, and the destination directory), builds the flags for the tar command, and then runs the tar command under the UNIX time command, which reports real time (wall-clock seconds), user time (CPU time the program itself spent executing), and system time (CPU time spent in system calls).
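For reference, each test essentially boils down to a run like the following. The archive name and mount point here are illustrative; the real thing is the script at the end of this post:

```sh
# "real" is wall-clock time, "user" is CPU time spent in tar/gzip themselves,
# and "sys" is CPU time spent in system calls (mostly disk I/O).
time tar cfz /Volumes/OWC/dmuth.local-20071125-224800.tar.gz Data local
```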
For my tests, my variables included both the destination directory (my home directory versus a directory on the removable drive) and whether compression was used or not. Here are the results:
| Destination | Compression with gzip | No compression |
|---|---|---|
| Home directory | Time: 13m40.982s (real), 7m21.610s (user), 1m3.942s (sys). Size: 3.09 GB | Time: 7m2.057s (real), 0m3.260s (user), 0m52.204s (sys). Size: 3.51 GB |
| External hard drive | Time: 11m41.163s (real), 7m16.400s (user), 1m1.679s (sys). Size: 3.09 GB | Time: 3m53.975s (real), 0m2.793s (user), 0m41.010s (sys). Size: 3.51 GB |
Conclusions
The better performance on the external hard drive can be explained by looking at what is happening on each drive. When I back up to a tarball in my home directory, the tarball is being written to the same drive that the files are being read from. Since a hard drive cannot read and write at the same time, the tarball can only be written when no reading is going on. While modern OSes have gotten very good at caching and scheduling disk activity for idle periods, the OS can only do so much. Contrast this with backing up to an external hard drive, where only reads are done on the internal drive and only writes are done on the external drive.
For the type of data I was backing up, trying to compress it ended up being a big time waster. This is obvious from the difference in user time: for both the local and external hard drive tests, compression added over 7 minutes of CPU time. And the space savings was a mere 0.42 GB, or about 12% of the size of the uncompressed tarball.
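If you want a rough idea of how compressible your own data is before settling on one approach or the other, a quick pipe through gzip will tell you without writing anything to disk. The directory names below are just my own backup sources:

```sh
# Byte count of the raw tar stream vs. the gzipped stream.
tar cf - Data local | wc -c
tar cf - Data local | gzip | wc -c
```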
Back when I first got fed up with Retrospect and tried making tarballs of my data, I originally used the method of compressed tarballs in my home directory, with occasional backups to DVD. But based on this, it looks like uncompressed backups to my external hard drive are going to be the way to go from now on.
The shell script
Finally, if you made it this far, I might as well share the shell script that I used for running these tests.
#!/bin/sh
#
# Perform a backup of our stickies and our system
#
set -e
#
# What directory will the file go into?
#
DIR=$HOME
DATE=`date +%Y%m%d-%H%M%S`
if test ! "$HOSTNAME"
then
HOSTNAME="dmuth.local"
fi
#
# Parameters that can be specified on the command line
#
P_VERBOSE=""
P_COMPRESS=""
P_TARGET=""
#
# Print out the program's syntax
#
function print_syntax() {
echo "Syntax: $0 [--verbose] [--compress] [target directory]"
} # End of print_syntax()
#
# Parse our arguments and populate config variables
#
function parse_args() {
while test "$1"
do
CURRENT=$1
shift
if test "$CURRENT" == "--verbose"
then
P_VERBOSE=1
elif test "$CURRENT" == "--compress"
then
P_COMPRESS=1
elif test "$CURRENT" == "--help"
then
print_syntax
exit
elif test "$CURRENT" == "-h"
then
print_syntax
exit
else
P_TARGET=$CURRENT
#
# Check our target for sanity
#
if test ! -d "$P_TARGET"
then
echo "$0: Target '$P_TARGET' is not a directory!"
exit 1
fi
if test ! -w "$P_TARGET"
then
echo "$0: Target '$P_TARGET is not writable!"
exit 1
fi
fi
done
#
# If not specified, assume that the home directory is writable
#
if test ! "$P_TARGET"
then
P_TARGET=$HOME
fi
} # End of parse_args()
#
# Get the flags for our tar command.
# They are printed out, so this function should be called via the backtick
# operators so that the output can be captured.
#
function get_tar_flags() {
if test "$P_VERBOSE"
then
echo -n "v"
fi
if test "$P_COMPRESS"
then
echo -n "z"
fi
} # End of get_tar_flags()
#
# Main program
#
parse_args "$@"
#echo "TEST: Verbose: $P_VERBOSE, Compress: $P_COMPRESS, Target: $P_TARGET"
#
# Backup our stickies, since they don't work nicely with symlinks.
#
cp $HOME/Library/StickiesDatabase $HOME/Data/Stage1/Library
#
# Our target file
#
TARGET=${P_TARGET}/${HOSTNAME}-${DATE}.tar.gz
#
# Make the tarball
#
cd $HOME
#
# Get our tar flags
#
FLAGS=`get_tar_flags`
#
# Our source folders to back up
#
SOURCES="Data local"
#
# Run the tar command inside of time so we know how long things took.
#
time {
#
# We're not creating the tar command ahead of time because of issues I
# had with quotes and spaces in the target name.
#
tar cf${FLAGS} "${TARGET}" ${SOURCES} || true
}
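Assuming the script is saved as something like backup.sh and the external drive mounts at /Volumes/OWC (both of those are my guesses, not anything the script dictates), the four test cases above correspond to invocations like these:

```sh
./backup.sh --compress                 # gzipped tarball in my home directory
./backup.sh                            # uncompressed tarball in my home directory
./backup.sh --compress /Volumes/OWC    # gzipped tarball on the external drive
./backup.sh /Volumes/OWC               # uncompressed tarball on the external drive
```

Note that as written, the script names the output file .tar.gz whether or not compression was actually requested.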
Comments

(no subject)
Date: 2007-11-26 04:57 am (UTC)

I have a Linux box, a MacBook Pro, and a Mac Pro. Backups are as follows:
Linux box:
- Cron job runs incremental dumps (using dump) to a disk in the Mac Pro. I take a level 0 every now and then when the level 1s start getting too big.
- Occasional rsync to a hard drive taken offsite.
Mac Pro:
- Time machine backups to a second internal drive.
- Occasional Carbon Copy Cloner backup to an external drive kept offsite.
MacBook Pro:
- Time machine backups over the network to the Mac Pro.
A bit of a kludge, but it works for now. You might consider using Time Machine when you move to Leopard; for day-to-day backups it works great. Just remember to keep an offsite backup; fires and equipment theft can happen at any time!
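For what it's worth, a cron-driven dump job like the one described above might look something like this. The device name, dump level, and target path are all invented for illustration; I don't know the commenter's actual setup:

```sh
# Run nightly from cron: a level 1 incremental dump of the root filesystem,
# written to a directory mounted from the Mac Pro. The -u flag updates
# /etc/dumpdates so the next incremental knows what changed since this one.
/sbin/dump -1 -u -f /mnt/macpro-backups/root.level1.dump /dev/hda1
```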
(no subject)
Date: 2007-11-26 05:08 am (UTC)

Time Machine is just a fancy wrapper for rsync, right? If so, that still involves creating lots of files and directories. See my complaint about mdimport...
(no subject)
Date: 2007-11-26 05:15 am (UTC)

It does this through a rather clever system of hard links, which means that a typical hourly backup when not much has changed executes in under five seconds, yet creates a directory tree every hour that is effectively a snapshot of your system at that exact moment.
To avoid mdimport woes, I dragged the directory where Time Machine keeps its backups to the Spotlight exclusion list so that mdimport ignores it. I think it already ignores it by default but I wanted to be sure. ;)
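For the curious, the same hard-link idea can be approximated by hand with rsync's --link-dest option. This is just a sketch of the general technique (the paths and the "latest" symlink are invented), not what Time Machine actually runs under the hood:

```sh
# Each snapshot directory looks complete, but unchanged files are hard links
# back into the previous snapshot, so only changed files consume new space.
SNAP=/Volumes/Backups/`date +%Y%m%d-%H%M%S`
rsync -a --link-dest=/Volumes/Backups/latest "$HOME/" "$SNAP/"
rm -f /Volumes/Backups/latest
ln -s "$SNAP" /Volumes/Backups/latest
```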
(no subject)
Date: 2007-11-26 06:02 pm (UTC)

Time Machine seems to do well enough for incrementals, though. So I'm sticking with it for now.
Re: For those who are less programming-inclined, and more Windows-inclined....
Date: 2007-11-26 05:07 am (UTC)

Does it write stuff in proprietary formats?

(no subject)
Date: 2007-11-26 05:06 am (UTC)

I find that his method actually works quite well because it gives you a bootable disk that you can just slap back in to get rolling again, assuming that you chose earlier to leave the RAID array in a machine not at your desk.
In addition, you find out that your backups are bad before it's too late, so there's less bit rot because of neglected media.
(no subject)
Date: 2007-11-26 04:58 pm (UTC)

I'm also on a RAID 1, so the monthly backups are sufficient for my needs. And the fact that I can just grab the external drive and run in an emergency is a plus as well.
(no subject)
Date: 2007-11-27 06:40 pm (UTC)

Using an rsync filter, I can selectively exclude/include things without having to copy items to a staging location first.
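As an illustration of that approach, a filter-based run might look something like this. The directory names and rules are placeholders, not the commenter's actual setup:

```sh
# Back up only Data/ and local/ from the home directory; everything else is
# excluded, so there is no need to copy files to a staging area first.
rsync -a --delete \
    --include='/Data/' --include='/Data/**' \
    --include='/local/' --include='/local/**' \
    --exclude='*' \
    "$HOME/" /Volumes/External/backup/
```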
(no subject)
Date: 2007-11-27 06:41 pm (UTC)

Please reread the part of my post about mdimport.
(no subject)
Date: 2007-11-27 06:50 pm (UTC)

From the research I did, I think Spotlight is trying to be clever.
That being said, I never was a big fan of writing out thousands of files to disk and dealing with the overhead. For the frequency of my backups and the data involved, I find it more efficient to write out a single file.
(no subject)
Date: 2007-11-27 07:15 pm (UTC)

Websites that are file-based are so 2001. Database-driven CMS, baby!
Also, see my previous comments about mdimport issues.
(no subject)
Date: 2007-11-28 11:35 am (UTC)

As for the mdimport issue, I thought it only mattered on the first full backup (as opposed to each incremental backup), but apparently not. Gosh, that's bothersome. You'd think they'd have come up with a way to have the metadata subsystem ignore whole directories or even drives...
(no subject)
Date: 2007-11-28 02:01 pm (UTC)

There probably is a way to make mdimport ignore arbitrary paths. I just haven't figured it out yet.
And unless I'm backing up large amounts of data on an hourly/daily basis, I really have no desire to have thousands of files and directories. A single tarball will do quite nicely.
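For what it's worth, one thing that should work (though it operates per volume rather than per directory, and I haven't tried it on this particular setup) is turning Spotlight indexing off for the backup volume entirely. The mount point below is just an example:

```sh
# Disable Spotlight indexing on the external backup volume...
sudo mdutil -i off /Volumes/OWC

# ...and confirm the indexing status afterwards.
mdutil -s /Volumes/OWC
```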
(no subject)
Date: 2007-12-06 03:12 am (UTC)

Last time I used it, it wrote backups in a proprietary format that nothing else could read.
Please correct me if this is no longer the case.