Spring cleaning, dealing with duplicate files, and vacuumpack

I was doing some spring cleaning of my computer, part of which was removing duplicate files. I'm a packrat, and I subscribe to the copy-first-and-ask-questions-later school of thought, so I have a few duplicates floating around, just taking up disk space. (I think I have somewhere a filesystem copy of my previous computer, which, in turn, contains a filesystem copy of my computer before that.)

There are plenty of tools that will help you remove duplicate files (this page lists no fewer than 16), but, disappointingly, none that I could find seems to give you a high-level understanding of where and how duplicate files appear in your filesystem. So I wrote vacuumpack, a short Python tool to help me really see what was going on in my filesystem. Nothing revolutionary, but it helped me through my spring cleaning.

(Aside 1: I noticed the parallel with coding theory: you want to remove undesirable redundancy, by removing duplicate files on the same disk, and add desirable redundancy, by using backups or RAID.)

(Aside 2: This is simple enough that undoubtedly someone will send a pointer to a tool written circa 1988 that does what I am describing. C'est la vie; unfortunately it is usually easier, and more fun, to write code than to find it.)

vacuumpack is probably best explained by example. First, vacuumpack can scan a directory and identify duplicated files:

$ ./vacuumpack.py --target=/home/phil --cache=/home/phil/.contentindex

The output shows clusters of identical files. For example, oh hey, it's the GPLv3 (the cluster is prefaced by the sha256 of the shared content):

[...]
8ceb4b9ee5adedde47b31e975c1d90c73ad27b6b165a1dcd80c7c545eb65b903: 140588 bytes wasted
  /home/phil/projects/raytracer/COPYING (35147 bytes)
  /home/phil/projects/rdiff-snapshot-fs/source/COPYING (35147 bytes)
  /home/phil/projects/syzygy/COPYING (35147 bytes)
  /home/phil/projects/vacuumpack/COPYING (35147 bytes)
  /home/phil/source/emacs/COPYING (35147 bytes)
[...]

The clusters are sorted in decreasing order of the amount of space wasted, so you can fry the big fish first. Having multiple copies of the GPL floating around isn't really a cause for concern; let's have a look at another cluster:

6940e64dd91f1ac77c43f1f326f2416f4b54728ff47a529b190b7dadde78ea23: 714727 bytes wasted
  /home/phil/photos/20060508/may-031.jpg (714727 bytes)
  /home/phil/album1/c.jpg (714727 bytes)
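
The clustering itself boils down to hashing every file and grouping paths by digest. Here's a minimal sketch of the idea (not vacuumpack's actual code; the function names are made up for illustration):

import hashlib
import os
import sys
from collections import defaultdict

def sha256_of(path, chunk_size=1 << 20):
    # Hash file contents in chunks so large files don't blow up memory.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def build_index(root):
    # Map content digest -> list of paths having that content.
    by_digest = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_digest[sha256_of(path)].append(path)
    return by_digest

def find_clusters(by_digest):
    # Clusters of identical files, ranked by the space the extra copies waste.
    clusters = []
    for digest, paths in by_digest.items():
        if len(paths) > 1:
            size = os.path.getsize(paths[0])
            wasted = size * (len(paths) - 1)  # every copy beyond the first is waste
            clusters.append((wasted, digest, size, sorted(paths)))
    return sorted(clusters, reverse=True)     # big fish first

if __name__ == '__main__':
    for wasted, digest, size, paths in find_clusters(build_index(sys.argv[1])):
        print('%s: %d bytes wasted' % (digest, wasted))
        for path in paths:
            print('  %s (%d bytes)' % (path, size))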

So far, this is essentially what duff and many other tools do. But most of the tools out there are focused on semi-mechanically helping you delete files, even though cleaning up often requires moving or copying them as well.

Duplicated photos, like the ones above, are probably something I want to resolve. Since I recall that /home/phil/photos is the canonical location for most of my photos, the other directory looks somewhat suspect. So I'll ask vacuumpack to tell me more about it:

$ ./vacuumpack.py --target=/home/phil --cache=/home/phil/.contentindex /home/phil/album1

Now, for each file in that directory, vacuumpack tells me whether it knows about any duplicates:

/home/phil/album1/a.jpg         == /home/phil/photos/20060508/may-010.jpg
/home/phil/album1/b.jpg         == /home/phil/photos/20060508/may-019.jpg
/home/phil/album1/c.jpg         == /home/phil/photos/20060508/may-031.jpg
/home/phil/album1/d.jpg         == /home/phil/photos/20060508/may-033.jpg
/home/phil/album1/e.jpg         == /home/phil/photos/20060508/may-048.jpg
/home/phil/album1/f.jpg         -- NO DUPLICATES
/home/phil/album1/g.jpg         == /home/phil/photos/20060508/may-077.jpg
/home/phil/album1/h.jpg         == /home/phil/photos/20060508/may-096.jpg

It looks like the files in /home/phil/album1 are a subset of the photos in /home/phil/photos... except that /home/phil/photos is missing a file! I need to copy that file back; once I do, the directory album1 is safe to delete.

In this mode vacuumpack is behaving like a directory diff tool, except that it uses content rather than filenames to match up files.
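
Under the same assumptions as the sketch above, this report mode is just a lookup into the digest index: for each file in the directory you ask about, either print some other path with the same content, or note that there is none. (sha256_of and build_index are the hypothetical helpers from the earlier sketch, not vacuumpack's real API.)

import os

def report_directory(directory, by_digest):
    # For each regular file in the directory, show one duplicate elsewhere,
    # or flag the file as having no duplicates.
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        others = [p for p in by_digest.get(sha256_of(path), []) if p != path]
        if others:
            print('%-40s == %s' % (path, others[0]))
        else:
            print('%-40s -- NO DUPLICATES' % path)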

Most of the code in vacuumpack is actually devoted to identifying duplicates efficiently. vacuumpack stores the file hashes and metadata in a cache (specified by --cache=...) and automatically rereads files when (and only when) they have been modified. So after the initial run, vacuumpack runs very quickly (just seconds on my entire homedir) while always producing up-to-date reports. It's fast enough that you can run it semi-interactively, using it to check your work continuously while you're reorganizing and cleaning your files. You can drill down and ask lots of questions about different directories without having to wait twenty minutes for each answer.
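
The caching logic is roughly what you'd expect: remember each file's size, mtime, and digest, and only rehash when size or mtime has changed. A sketch of that idea (the on-disk format here, a JSON file, is my invention for illustration; vacuumpack's actual cache format may differ):

import hashlib
import json
import os

def load_cache(cache_path):
    # The cache maps absolute path -> {'size': ..., 'mtime': ..., 'digest': ...}.
    try:
        with open(cache_path) as f:
            return json.load(f)
    except (IOError, ValueError):
        return {}

def save_cache(cache_path, cache):
    with open(cache_path, 'w') as f:
        json.dump(cache, f)

def cached_sha256(path, cache):
    # Rehash a file only if its size or mtime differs from the cached entry.
    st = os.stat(path)
    key = os.path.abspath(path)
    entry = cache.get(key)
    if entry and entry['size'] == st.st_size and entry['mtime'] == st.st_mtime:
        return entry['digest']
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    digest = h.hexdigest()
    cache[key] = {'size': st.st_size, 'mtime': st.st_mtime, 'digest': digest}
    return digest

Swapping something like cached_sha256 in for the plain hashing in the earlier sketches is what makes repeat runs cheap: an unchanged file costs one stat rather than a full read.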

I've posted a git repo containing the vacuumpack source:

git clone http://web.psung.name/git/vacuumpack.git

This has been tested on Ubuntu 11.04 under Python 2.7.1. You may use the code under the terms of the GNU GPL v3 or (at your option) any later version.

2 comments:

  1. With the copy-first-and-ask-questions-later approach, I am surprised you delete at all. Why not hard/sym link the equal files?

  2. Harder to find stuff until it's organized properly.
