Backup Comparison Criteria
Most simple backup schemes are based on a two-part process: one initial full backup, followed by a series of incremental backups. The incremental backups contain only the files that have changed since the last backup. While this sounds simple enough, how can you tell that a file has changed?
Files are modified in many different ways: you might add/remove/change text in a document, edit a digital photo, overwrite a file with a newer version, change a file's metadata, etc. A file is actually represented by two parts: the main content and the operating system attributes or metadata. Changes may affect one or both of these parts.
All backup programs that support incremental backups must rely on some means to determine whether each file in the backup set has changed since the last backup. The following snapshot shows the typical list of comparison criteria offered by a backup program (Backup4All in this example). All of these are based on the file system attributes with the exception of the CRC32 option, which depends on the main content.
Each file comparison method has its pros and cons. In each case, be mindful of situations where changes are not recognized, because the backup program will then skip a file that should have been backed up.
Comparison: Size Changes
One of the simplest comparison methods is to see if the file size has changed. While this may catch most file updates, it is generally a poor method on its own. It is very easy to make small modifications to a file that don't affect the overall length. This can happen if you swap out some content with an equal-length replacement (for example, correcting a spelling mistake in a text document). It can also happen when the old and new values are different lengths but the file format stores them in a fixed-length field. For these reasons, identification of size changes is often combined with other methods.
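The blind spot described above can be reproduced in a few lines. This is only a sketch: the file name, the text, and the `size_changed` helper are made up for illustration and are not how any particular backup program works.

```python
import os
import tempfile

def size_changed(path, recorded_size):
    """Return True if the file's current size differs from the recorded size."""
    return os.path.getsize(path) != recorded_size

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")  # hypothetical file
    with open(path, "w") as f:
        f.write("teh quick brown fox")
    recorded_size = os.path.getsize(path)  # what a backup catalog might store

    # Correct the spelling mistake: the replacement has the same length.
    with open(path, "w") as f:
        f.write("the quick brown fox")

    missed = not size_changed(path, recorded_size)
    print("edit missed by size comparison:", missed)
```

The content is different, but the size is identical, so a size-only comparison would skip this file.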
Comparison: Date / Time Changes
The expectation is that every time a file is modified, its Last Modified date is updated to reflect the time of the most recent change. While the majority of software applications and Windows Explorer do this, there are cases where the Last Modified timestamp is not affected, even though the main content is rewritten. In general, this happens in photo-editing / organization applications that provide an option to explicitly avoid updating the timestamp.
You may want to consider turning off such an option if one exists. As this change identification method is dependent upon all software applications to "behave normally", there is a chance that files can be missed with this method.
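To illustrate how a program that preserves the timestamp defeats this method, here is a rough sketch. It assumes a backup catalog that records size and Last Modified time per file; the `file_info` helper and the file contents are hypothetical, and `os.utime` stands in for whatever the photo application does internally.

```python
import os
import tempfile

def file_info(path):
    """Return the (size, last-modified) pair a catalog might record."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "photo.jpg")  # hypothetical file, not a real JPEG
    with open(path, "wb") as f:
        f.write(b"caption=Sunset")
    before = os.stat(path)
    recorded = file_info(path)

    # Rewrite the content with an equal-length value, then restore the
    # original timestamps, as an application that "preserves" them would.
    with open(path, "wb") as f:
        f.write(b"caption=Sunsat")
    os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))

    missed = file_info(path) == recorded
    print("edit missed by size + date comparison:", missed)
```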
Comparison: Archive Bit
A few years back, nearly all backup programs relied on the Archive Bit. The Archive Bit is yet another file system flag that marks a file as changed. Any program that changes a file is supposed to set the archive bit (meaning: should be archived). To do an incremental backup, a backup program scans your disk for any files with the archive bit set. All files with the bit set are backed up, and then the bit is cleared. Performing a full backup simply means either ignoring the archive bit or setting it for all files.
Unfortunately, the bit is simply a binary flag: on or off. If multiple backup processes / programs rely on this flag, problems will occur as there will be contention for the single bit. Because not all programs use the archive bit properly, and because multiple backup jobs will interfere with each other, I don't recommend using the archive bit method (on its own).
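The contention problem is easy to see in a toy model. The sketch below does not touch real file attributes; it simply represents each file's archive bit as an entry in a dictionary, and the file names and the `incremental_backup` function are invented for the example.

```python
def incremental_backup(archive_bits):
    """Select every file whose archive bit is set, then clear the bit."""
    selected = sorted(name for name, bit in archive_bits.items() if bit)
    for name in selected:
        archive_bits[name] = False
    return selected

# True means "changed since the last backup".
archive_bits = {"a.jpg": True, "b.jpg": False, "c.jpg": True}

job_a = incremental_backup(archive_bits)  # first backup job runs
job_b = incremental_backup(archive_bits)  # second, independent job runs later

print("job A backed up:", job_a)
print("job B backed up:", job_b)
```

Job A clears the bits it finds, so job B sees nothing to back up even though it has never copied the changed files.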
Comparison: Content (Byte-by-Byte)
As shown above, relying on file system attributes or properties is not guaranteed to detect changes in file content. The only guaranteed method is to perform a byte-by-byte comparison between the original and latest file versions. Unfortunately, this is not at all practical, as it means that a copy of the original file must be accessed in its entirety every time an incremental backup is made. This would be extremely slow and very inefficient, requiring all of the backup sets to be redownloaded each time a backup run is initiated. For this reason, byte-by-byte comparisons are not used for incremental backup jobs.
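For reference, a byte-by-byte comparison might look something like the following sketch. The `files_identical` function and the sample files are assumptions made for the example; the point to notice is that both the backed-up copy and the current file have to be read in full.

```python
import os
import tempfile

def files_identical(path_a, path_b, chunk_size=1024 * 1024):
    """Compare two files chunk by chunk; both are read until a difference or EOF."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files reached end-of-file together
                return True

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.bin")
    copy = os.path.join(tmp, "copy.bin")
    edited = os.path.join(tmp, "edited.bin")
    data = bytes(range(256)) * 100
    for path, content in [(original, data), (copy, data),
                          (edited, data[:-1] + b"\x00")]:  # last byte altered
        with open(path, "wb") as f:
            f.write(content)

    same = files_identical(original, copy)
    different = not files_identical(original, edited)
    print("copy identical:", same)
    print("one-byte edit detected:", different)
```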
Comparison: Content (CRC32)
Clearly a byte-by-byte comparison is not practical, and so another method is almost always used instead, offering many of the benefits while reducing some of the performance penalty. A 32-bit value (checksum) is calculated that represents a "hash" of a file's entire data content. As a gross simplification, think of a CRC as a sum of all the bytes within the file, condensed into a single 32-bit value (the actual calculation is a polynomial division rather than a plain sum). It is an extremely simple matter to store this 32-bit value in the database catalog for comparison purposes. However, it still takes one complete pass through the data file (byte-by-byte) to generate the CRC32 value itself. So, CRC32 doesn't require the original files to be downloaded, but it does require a full read of the entire current file / data set to identify the differences.
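As a rough idea of what this involves, the sketch below computes a CRC32 with Python's standard `zlib.crc32`, reading the file in chunks so that large files need not fit in memory. The `file_crc32` helper and the sample file are assumptions for the example; a real backup program will have its own implementation.

```python
import os
import tempfile
import zlib

def file_crc32(path, chunk_size=1024 * 1024):
    """Return the CRC32 of a file's content, reading the whole file once."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)  # continue from the running value
    return crc

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")  # hypothetical file
    with open(path, "wb") as f:
        f.write(b"teh quick brown fox")
    crc_before = file_crc32(path)  # the value a catalog would store

    with open(path, "wb") as f:
        f.write(b"the quick brown fox")  # same length, different content
    crc_after = file_crc32(path)

    detected = crc_before != crc_after
    print(f"before: {crc_before:08X}  after: {crc_after:08X}")
    print("equal-length edit detected:", detected)
```

Only the stored 32-bit value from the previous run is needed for the comparison, not the previously backed-up file.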
IPTC Metadata Update
What is IPTC Metadata?
As photographers expend considerable effort in cataloging their photos (adding keywords & tags), there is a strong desire to store this metadata within the image files themselves, rather than simply in the photo catalog program's internal database. IPTC provides an industry-standard set of fields that can be embedded within JPEG images, allowing keywords and other information to be stored directly within each image file. Doing so provides not only an extra level of redundancy (should your catalog software database become corrupted), but also the ability to share these extra details with others (who may not have access to your image database).
Example of problem with directory-based comparison
The following example is from a JPEG image file that was originally backed up by Backup4All. Later, in a photo catalog program, I updated one of the IPTC Metadata fields, being careful not to change the length of the field (simply changed one character). When I saved the changes, the photo catalog program wrote back the embedded IPTC metadata to the original JPEG file (this behavior is standard practice).
But when I viewed the general file characteristics from within Windows XP Explorer, I saw that the file size and last modified date hadn't changed. In fact, nothing appeared to have changed in the file at all, from the perspective of the file properties view.
Then, I compared the before and after versions in a hex editor (I used Beyond Compare) and confirmed that only 1 byte in the file had changed, and nothing else.
Turn off the preservation of date/time
To work around the performance problems associated with relying on CRC32, I strongly recommend that you consider turning off the option to preserve the date and timestamps when updating metadata within your photo catalog software. Doing so will allow you to use the much faster file property comparisons (e.g. compare based on size & last modified date).
Turning off the preservation of date/time in IMatch
If you configure your backup program to use directory-based comparisons for the incremental backup, the scanning process is extremely quick. Only the file directory tables need to be read in and traversed. The amount of data being read is very small.
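A directory-based scan along these lines might be sketched as follows. It assumes the catalog stores size and Last Modified time per file; the `scan_file_info` and `changed_files` functions are hypothetical names, and no file is ever opened, only its directory entry is read.

```python
import os
import tempfile

def scan_file_info(root):
    """Map each file's relative path to (size, mtime) without opening any file."""
    info = {}
    stack = [root]
    while stack:
        folder = stack.pop()
        with os.scandir(folder) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    st = entry.stat(follow_symlinks=False)
                    rel = os.path.relpath(entry.path, root)
                    info[rel] = (st.st_size, st.st_mtime_ns)
    return info

def changed_files(catalog, current):
    """Return paths that are new or whose (size, mtime) differs from the catalog."""
    return sorted(p for p, meta in current.items() if catalog.get(p) != meta)

with tempfile.TemporaryDirectory() as root:
    for name in ("a.txt", "b.txt"):
        with open(os.path.join(root, name), "w") as f:
            f.write("original")
    catalog = scan_file_info(root)  # state recorded at the last backup

    with open(os.path.join(root, "b.txt"), "a") as f:
        f.write(" plus an edit")  # the size grows, so the change is visible

    to_back_up = changed_files(catalog, scan_file_info(root))
    print("files to back up:", to_back_up)
```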
On the other hand, if you add in a file-based comparison (such as CRC32), then the program must read in each file in its entirety. The time taken will be proportional to the size of your entire backup set. Many hard drives (if not fragmented) will read at approximately 20 MB/sec. If most of your files are very small, then the read rate will be even lower as the overhead associated with reading each individual file will become more significant.
Backup set: 84,381 files @ 89.0 Gigabytes
|Comparison Criteria||Time||Files Identified||Rate|
|File Info||01 min 23 sec||11||1017 files / sec|
|CRC32||70 min 00 sec||352||21.2 MB / sec|
In the above table, notice the huge difference in performance. Also recognize that the CRC32 method identified hundreds of additional files that had changed, but that were not identified with the File Info method!
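The rates in the table follow directly from the backup set size and the elapsed times. The short calculation below reproduces them, assuming 1 Gigabyte is counted as 1000 MB (the figures in the table are consistent with that assumption).

```python
files = 84_381
size_mb = 89.0 * 1000          # assumption: 1 GB = 1000 MB
file_info_sec = 1 * 60 + 23    # 01 min 23 sec
crc32_sec = 70 * 60            # 70 min 00 sec

files_per_sec = files / file_info_sec
mb_per_sec = size_mb / crc32_sec
slowdown = crc32_sec / file_info_sec

print(f"File Info: {files_per_sec:.0f} files / sec")
print(f"CRC32:     {mb_per_sec:.1f} MB / sec")
print(f"CRC32 scan took {slowdown:.1f}x as long as the File Info scan")
```

In this test the CRC32 pass took roughly fifty times as long as the File Info pass.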