I was doing some spring cleaning of my computer, part of which was removing
duplicate files. I'm a packrat, and I subscribe to the
copy-first-and-ask-questions-later school of thought, so I have a few
duplicates floating around, just taking up disk space. (Somewhere, I think, I have
a filesystem copy of my previous computer, which, in turn, contains
a filesystem copy of the computer before that.)
There are plenty of tools that will help you remove duplicate files
(this page lists no fewer than
16), but, disappointingly, none of them that I could find seem to give
you a high-level understanding of where and how duplicate files appear in
your filesystem. So I wrote vacuumpack, a short Python tool
to help me really see what was going on in my filesystem. Nothing
revolutionary, but it helped me through my spring cleaning.
(Aside 1: I noticed a parallel with coding theory: you want to remove
undesirable redundancy, by deduplicating files on the same disk, and to add
desirable redundancy, by using backups or RAID.)
(Aside 2: This is simple enough that undoubtedly someone will send a pointer to a tool written circa 1988 that does what I am describing. C'est la vie; unfortunately it is usually easier, and more fun, to write code than to find it.)
vacuumpack is probably best explained by example.
First, vacuumpack can scan a directory and identify duplicated
files:
$ ./vacuumpack.py --target=/home/phil --cache=/home/phil/.contentindex
The output shows clusters of identical files, each prefaced by the SHA-256
hash of the shared content. For example, oh hey, it's the GPLv3:
[...]
8ceb4b9ee5adedde47b31e975c1d90c73ad27b6b165a1dcd80c7c545eb65b903: 140588 bytes wasted
/home/phil/projects/raytracer/COPYING (35147 bytes)
/home/phil/projects/rdiff-snapshot-fs/source/COPYING (35147 bytes)
/home/phil/projects/syzygy/COPYING (35147 bytes)
/home/phil/projects/vacuumpack/COPYING (35147 bytes)
/home/phil/source/emacs/COPYING (35147 bytes)
[...]
The clusters are sorted in decreasing order of the amount of space wasted,
so you can fry the big fish first. Having multiple copies of the GPL floating
around isn't really a cause for concern; let's have a look at another
cluster:
6940e64dd91f1ac77c43f1f326f2416f4b54728ff47a529b190b7dadde78ea23: 714727 bytes wasted
/home/phil/photos/20060508/may-031.jpg (714727 bytes)
/home/phil/album1/c.jpg (714727 bytes)
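The mechanics behind a report like this are straightforward. Here's a rough sketch in Python of how the clusters might be built; it's an illustration rather than the code as it appears in vacuumpack, and the function names are mine: hash every file under the target, group paths by content hash, and sort the clusters by how much space the extra copies waste.

import hashlib
import os
from collections import defaultdict

def file_sha256(path, blocksize=1 << 20):
    # Hash the file's contents incrementally, one block at a time.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(blocksize), b''):
            h.update(block)
    return h.hexdigest()

def duplicate_clusters(root):
    # Map each content hash to every regular file under root with that content.
    by_hash = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_hash[file_sha256(path)].append(path)
    return by_hash

def wasted(paths):
    # Space wasted by a cluster: the size of every copy beyond the first.
    return sum(os.path.getsize(p) for p in paths[1:])

def report(root):
    by_hash = duplicate_clusters(root)
    clusters = [(h, paths) for h, paths in by_hash.items() if len(paths) > 1]
    clusters.sort(key=lambda cluster: wasted(cluster[1]), reverse=True)
    for h, paths in clusters:
        print("%s: %d bytes wasted" % (h, wasted(paths)))
        for p in sorted(paths):
            print("    %s (%d bytes)" % (p, os.path.getsize(p)))

(A real tool can avoid hashing every file from scratch: comparing sizes first and hashing only files whose sizes collide is a common shortcut, and the --cache mechanism described below attacks the same cost from another angle.)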
Thus far, duff and many
other tools do essentially the same thing. But most of the tools out there
are focused on semi-mechanically helping you delete files, even
though cleaning up often requires moving or copying them as
well.
Duplicated photos, like the ones above, are probably something I want to
resolve. Since I recall that /home/phil/photos is the canonical
location for most of my photos, the other directory looks somewhat suspect.
So I'll ask vacuumpack to tell me more about it:
$ ./vacuumpack.py --target=/home/phil --cache=/home/phil/.contentindex /home/phil/album1
Now, for each file in that directory, vacuumpack tells me whether
it knows about any duplicates:
/home/phil/album1/a.jpg == /home/phil/photos/20060508/may-010.jpg
/home/phil/album1/b.jpg == /home/phil/photos/20060508/may-019.jpg
/home/phil/album1/c.jpg == /home/phil/photos/20060508/may-031.jpg
/home/phil/album1/d.jpg == /home/phil/photos/20060508/may-033.jpg
/home/phil/album1/e.jpg == /home/phil/photos/20060508/may-048.jpg
/home/phil/album1/f.jpg -- NO DUPLICATES
/home/phil/album1/g.jpg == /home/phil/photos/20060508/may-077.jpg
/home/phil/album1/h.jpg == /home/phil/photos/20060508/may-096.jpg
It looks like the files in /home/phil/album1 are a subset of the
photos in /home/phil/photos... except
that /home/phil/photos is missing a file! I need to copy that file
back; once I do, the directory album1 is safe to delete.
In this mode vacuumpack is behaving like a directory diff tool,
except that it uses content rather than filenames to match up files.
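That mode is easy to sketch as well. Again, this is an illustration rather than vacuumpack's actual code: look up each file in the directory by its content hash in the index from the previous sketch, and print whatever other paths share that hash.

def report_directory(directory, by_hash):
    # by_hash is the content-hash -> paths index built by duplicate_clusters()
    # in the previous sketch; file_sha256() also comes from there.
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        twins = [p for p in by_hash.get(file_sha256(path), [])
                 if not os.path.samefile(p, path)]
        if twins:
            for twin in sorted(twins):
                print("%s == %s" % (path, twin))
        else:
            print("%s -- NO DUPLICATES" % path)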
Most of the code in vacuumpack is actually devoted to
identifying duplicates efficiently. vacuumpack stores the file
hashes and metadata in a cache (specified by --cache=...) and
automatically rereads files when (and only when) they have been modified. So
after the initial run, vacuumpack runs very quickly (e.g. in just
seconds on my entire homedir) while always producing up-to-date reports. It's
fast enough that you can run it semi-interactively, using it to check your
work continuously while you're reorganizing and cleaning your files. You can drill down and ask lots of questions about different directories without having to wait twenty minutes for each answer.
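The caching idea is worth spelling out. Here's a minimal sketch of the scheme, simplified rather than lifted from vacuumpack itself: remember each file's size, mtime, and hash, and rehash only when the size or mtime no longer matches the record.

import hashlib
import json
import os

def cached_sha256(path, cache):
    # cache maps path -> [size, mtime, hexdigest]. A stat() is enough to tell
    # whether the cached hash is still trustworthy.
    st = os.stat(path)
    entry = cache.get(path)
    if entry is not None and entry[0] == st.st_size and entry[1] == st.st_mtime:
        return entry[2]
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1 << 20), b''):
            h.update(block)
    cache[path] = [st.st_size, st.st_mtime, h.hexdigest()]
    return cache[path][2]

def load_cache(cache_path):
    # The cache file is just a JSON dictionary; a missing or corrupt cache
    # only costs a full rescan.
    try:
        with open(cache_path) as f:
            return json.load(f)
    except (IOError, ValueError):
        return {}

def save_cache(cache_path, cache):
    with open(cache_path, 'w') as f:
        json.dump(cache, f)

The obvious tradeoff in a scheme like this is that an edit which preserves both size and mtime goes unnoticed; in exchange, an unchanged file costs only a stat() on later runs, which is where the seconds-instead-of-minutes turnaround comes from.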
I've posted a git repo containing the vacuumpack source:
git clone http://web.psung.name/git/vacuumpack.git
This has been tested on Ubuntu 11.04 under Python 2.7.1. You may use the
code under the terms of the GNU GPL v3 or (at your option) any later
version.