I have been using rdiff-backup and cron to back up everything important to me for a long time. I love how the latest full backup is available as plain files on the disk. This makes for easy recovery. I also love how every backup after the first one is an incremental backup, which works well for my large photo collection (at 300GB, it's too large to ever push a full backup over the Internet to my offsite backup). Also, history can be kept forever. It uses the rsync algorithm to only store the changes to files, which means it can store VM images and database dumps very efficiently, even if they are dumped every hour.
However, there are a few downsides, some of which have caused me significant pain:
- It doesn't resume failed backups, causing large files to be transferred again.
- It doesn't support manually culling a file that you didn't mean to back up.
- It doesn't have a mechanism for using the sneakernet to transfer a large changeset.
- It creates a ton of files in its rdiff-backup-data directory, which quickly becomes unwieldy.
- It doesn't detect renamed files, which is catastrophic if someone decides to reorganize the samba share at work.
- It doesn't deduplicate files.
- Restores take a long time, especially for old files or when the job has been run frequently, since every intervening increment has to be applied.
I have 34 rdiff-backup jobs, and managing them is a pain. There is certainly some data that I'm not backing up that I should be. Since hard drive space is cheap, I'm looking for a simpler solution. Something where I can just back up everything and get reasonable results.
At my current employer, the videographers will sometimes add a few hundred gigabytes of video files in a single day. They also like to move and rearrange the file share. Another programmer wrote a small backup system, called reverse, that is simple to understand. My employer was kind enough to let us release the code as open source.
While it needs a lot of polish, I feel the core concept is solid and has some interesting properties that other backup systems don't have.
The target directory is scanned and the MD5 checksum is calculated for new files and any files where the size or modification time has changed. (I am aware that MD5 is broken, but it doesn't matter for my employer's use case, where hash speed was more important than preventing a collision attack.) The files are then uploaded via SFTP and stored with the checksum of the file as the filename, unless they already exist on the remote end.
Once all the files are uploaded, a YAML manifest mapping each filename to its checksum is also uploaded, tagged with the current timestamp.
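A minimal sketch of that scan-and-upload pass. All the names here are my own, not from the released code, the manifest format is my assumption (one `path: checksum` line per file), and a local destination directory stands in for the real SFTP upload:

```python
import hashlib
import os
import shutil
import time

def file_md5(path, chunk=1 << 20):
    """MD5 of a file, read in chunks so large video files don't fill RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def backup(src, repo, state=None):
    """One backup pass: hash new/changed files, store missing content as
    repo/blobs/<md5>, then write a timestamped manifest of path: checksum
    lines. `state` maps path -> (size, mtime, md5) from the previous run,
    so files whose size and mtime are unchanged are never re-hashed."""
    state = dict(state or {})
    blobs = os.path.join(repo, "blobs")
    os.makedirs(blobs, exist_ok=True)
    manifest = {}
    for root, _, names in os.walk(src):
        for name in names:
            path = os.path.join(root, name)
            st = os.stat(path)
            prev = state.get(path)
            if prev and prev[:2] == (st.st_size, st.st_mtime):
                digest = prev[2]              # unchanged since last run
            else:
                digest = file_md5(path)
            state[path] = (st.st_size, st.st_mtime, digest)
            blob = os.path.join(blobs, digest)
            if not os.path.exists(blob):      # dedup: content already stored
                shutil.copyfile(path, blob)
            manifest[os.path.relpath(path, src)] = digest
    stamp = time.strftime("%Y%m%d-%H%M%S")
    mpath = os.path.join(repo, "manifest-%s.yaml" % stamp)
    with open(mpath, "w") as f:
        for rel in sorted(manifest):
            f.write("%s: %s\n" % (rel, manifest[rel]))
    return mpath, manifest, state
```

Because the store is keyed by content, renaming or copying a 50GB video file costs nothing on the second run: the blob already exists, so only the new manifest entry is written.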
This backup system is better than rdiff-backup for our use case, since the raw video files aren't ever changed, even if they are renamed, copied, or moved. All files are deduplicated and renames are handled on the server side. It is easy to sneakernet the latest changes to our offsite backup server when the videographers bring in a large set of new assets.
It is easy to create a read-only view of any revision of the backup data by simply creating hard links to the backed-up files. (Creating hard links is a very fast operation on Linux systems.)
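A sketch of that hard-link view, assuming the repository stores content under `blobs/<md5>` and a manifest has been read into a path-to-checksum dict (the function name and layout are my assumptions):

```python
import os

def materialize(repo, manifest, dest):
    """Create a view of one revision: for every path -> md5 entry,
    hard-link repo/blobs/<md5> into place under dest. A hard link shares
    the blob's data, so the view costs almost no extra disk or time."""
    for rel, digest in manifest.items():
        target = os.path.join(dest, rel)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        os.link(os.path.join(repo, "blobs", digest), target)
```

In a real tool the view would also be made read-only, since writing through a hard link would modify the shared blob underneath every other revision.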
Resuming an interrupted backup is easy: any file whose checksum already exists on the remote end is simply skipped on the next run. It is also easy to check whether files are corrupted by re-running the checksum on the remote server.
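Content-addressed storage makes that corruption check almost a one-liner; a sketch, again assuming blobs are stored under their MD5 as the filename:

```python
import hashlib
import os

def verify(repo):
    """Return the names of corrupted blobs: a blob is damaged exactly
    when its MD5 no longer matches the filename it was stored under."""
    bad = []
    blobs = os.path.join(repo, "blobs")
    for name in os.listdir(blobs):
        h = hashlib.md5()
        with open(os.path.join(blobs, name), "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        if h.hexdigest() != name:
            bad.append(name)
    return bad
```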
There are a couple of obvious downsides. If a large file is only partially changed, the whole file has to be uploaded again, and stored in its entirety. Also, cleaning up old backups requires a garbage collection pass, which, while straightforward, isn't as simple as other repository operations.
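That garbage collection pass amounts to a mark-and-sweep over the repository. A sketch under my assumed layout (timestamped `manifest-*.yaml` files of `path: checksum` lines, blobs named by checksum):

```python
import os

def collect_garbage(repo):
    """Mark-and-sweep: every checksum named in any manifest is live; any
    blob no manifest mentions is unreferenced and can be deleted. Old
    backups are dropped by first deleting their manifest files."""
    live = set()
    for name in os.listdir(repo):
        if name.startswith("manifest-"):
            with open(os.path.join(repo, name)) as f:
                for line in f:
                    # rsplit tolerates ':' inside the path itself
                    live.add(line.rsplit(":", 1)[1].strip())
    removed = []
    blobs = os.path.join(repo, "blobs")
    for name in os.listdir(blobs):
        if name not in live:
            os.remove(os.path.join(blobs, name))
            removed.append(name)
    return removed
```

The subtlety is only operational, not algorithmic: the sweep must not run concurrently with a backup, or it could delete a blob whose manifest hasn't been written yet.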
One fantastic property of reverse is that the backups it creates can themselves be backed up with reverse without significant bloat. Let me explain. Let's say you have a computer at home and a computer at work. You want to back up both, and you want to have an offsite backup of both. You also have data that you're backing up from miscellaneous devices around your house, such as your cell phone.
You can make sure that everything is backed up, and that you have an offsite copy of your backups, by doing something like the following:
```
# on cell phone each day
reverse /sdcard home-computer:/backup

# on home computer each day
reverse /home /backup
reverse /backup work-computer:/backup

# and on the work computer each day
reverse /home /backup
reverse /backup home-computer:/backup
```
It would be easy to enhance the system in a number of other ways:
- Backup to S3, Glacier, and other systems
- Automatic verification
- Shared backup repository: better support for having all backup jobs write to a single repository.
- Disk space usage analysis. The historical index of files is not only a fast way to see what is currently taking up space; it can also show the velocity of changes over time. When trying to free up space on the samba server at work, the change in directory usage over time has been more valuable than the total space consumed by each directory.
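That last idea falls naturally out of the manifests. A sketch, assuming the `path: checksum` manifest format and checksum-named blobs (function names are mine): summing blob sizes per top-level directory gives a usage snapshot for one revision, and diffing two snapshots gives the growth rate.

```python
import os
from collections import Counter

def dir_usage(repo, manifest):
    """Bytes per top-level directory for one revision, summing the size
    of each referenced blob. A blob shared by several paths counts once
    per reference, i.e. this is logical usage as the users see it."""
    usage = Counter()
    for rel, digest in manifest.items():
        top = rel.split(os.sep, 1)[0] if os.sep in rel else "."
        usage[top] += os.path.getsize(os.path.join(repo, "blobs", digest))
    return usage
```

Since `Counter` subtraction keeps only positive counts, `dir_usage(repo, today) - dir_usage(repo, last_month)` directly yields the directories that grew, and by how many bytes.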
The code so far is on GitHub.