Spring cleaning, dealing with duplicate files, and vacuumpack

I was doing some spring cleaning of my computer, part of which was removing duplicate files. I'm a packrat, and I subscribe to the copy-first-and-ask-questions-later school of thought, so I have a few duplicates floating around, just taking up disk space. (I think I have somewhere a filesystem copy of my previous computer, which, in turn, contains a filesystem copy of my computer before that.)

There are plenty of tools that will help you remove duplicate files (this page lists no fewer than 16), but, disappointingly, none that I could find gives you a high-level understanding of where and how duplicate files appear in your filesystem. So I wrote vacuumpack, a short Python tool to help me really see what was going on in my filesystem. Nothing revolutionary, but it helped me through my spring cleaning.

(Aside 1: I noticed the parallel with coding theory: you want to remove undesirable redundancy (duplicate files on the same disk) and add desirable redundancy (backups or RAID).)

(Aside 2: This is simple enough that undoubtedly someone will send a pointer to a tool written circa 1988 that does what I am describing. C'est la vie; unfortunately it is usually easier, and more fun, to write code than to find it.)

vacuumpack is probably best explained by example. First, vacuumpack can scan a directory and identify duplicated files:

$ ./vacuumpack.py --target=/home/phil --cache=/home/phil/.contentindex

The output shows clusters of identical files. For example, oh hey, it's the GPLv3 (the cluster is prefaced by the sha256 of the shared content):

8ceb4b9ee5adedde47b31e975c1d90c73ad27b6b165a1dcd80c7c545eb65b903: 140588 bytes wasted
  /home/phil/projects/raytracer/COPYING (35147 bytes)
  /home/phil/projects/rdiff-snapshot-fs/source/COPYING (35147 bytes)
  /home/phil/projects/syzygy/COPYING (35147 bytes)
  /home/phil/projects/vacuumpack/COPYING (35147 bytes)
  /home/phil/source/emacs/COPYING (35147 bytes)

The clusters are sorted in decreasing order of the amount of space wasted, so you can fry the big fish first. Having multiple copies of the GPL floating around isn't really a cause for concern; let's have a look at another cluster:

6940e64dd91f1ac77c43f1f326f2416f4b54728ff47a529b190b7dadde78ea23: 714727 bytes wasted
  /home/phil/photos/20060508/may-031.jpg (714727 bytes)
  /home/phil/album1/c.jpg (714727 bytes)

Thus far, duff and many other tools do essentially the same thing. But most of the tools out there are focused on semi-mechanically helping you delete files, even though cleaning up often requires moving or copying them as well.
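For the curious, the clustering step can be sketched in a few lines of Python. This is only an illustration of the idea (hash every file, group by hash, sort by wasted bytes), not vacuumpack's actual code; the function names sha256_of and find_clusters are made up for this example, and it is written for Python 3.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_clusters(root):
    """Group regular files under root by content hash.

    Returns a list of (wasted_bytes, digest, paths) tuples for every
    digest shared by two or more files, largest waste first.
    """
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files; only real content counts.
            if os.path.isfile(path) and not os.path.islink(path):
                by_hash[sha256_of(path)].append(path)

    clusters = []
    for digest, paths in by_hash.items():
        if len(paths) > 1:
            size = os.path.getsize(paths[0])
            # One copy is needed; every additional copy is waste.
            wasted = size * (len(paths) - 1)
            clusters.append((wasted, digest, sorted(paths)))
    clusters.sort(key=lambda c: (-c[0], c[1]))
    return clusters
```

With five 35147-byte copies of the same file, wasted comes out to 4 × 35147 = 140588 bytes, which matches the GPLv3 cluster above.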

Duplicated photos, like the ones above, are probably something I want to resolve. Since I recall that /home/phil/photos is the canonical location for most of my photos, the other directory looks somewhat suspect. So I'll ask vacuumpack to tell me more about it:

$ ./vacuumpack.py --target=/home/phil --cache=/home/phil/.contentindex /home/phil/album1

Now, for each file in that directory, vacuumpack tells me whether it knows about any duplicates:

/home/phil/album1/a.jpg         == /home/phil/photos/20060508/may-010.jpg
/home/phil/album1/b.jpg         == /home/phil/photos/20060508/may-019.jpg
/home/phil/album1/c.jpg         == /home/phil/photos/20060508/may-031.jpg
/home/phil/album1/d.jpg         == /home/phil/photos/20060508/may-033.jpg
/home/phil/album1/e.jpg         == /home/phil/photos/20060508/may-048.jpg
/home/phil/album1/f.jpg         -- NO DUPLICATES
/home/phil/album1/g.jpg         == /home/phil/photos/20060508/may-077.jpg
/home/phil/album1/h.jpg         == /home/phil/photos/20060508/may-096.jpg

It looks like the files in /home/phil/album1 are a subset of the photos in /home/phil/photos... except that /home/phil/photos is missing a file! I need to copy that file back; once I do, the directory album1 is safe to delete.

In this mode vacuumpack is behaving like a directory diff tool, except that it uses content rather than filenames to match up files.
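A rough sketch of how such a content-based directory report could work, assuming you already have an index mapping each file path to its content hash. The index format and the names directory_report and format_report are hypothetical, chosen for this example rather than taken from vacuumpack:

```python
import os
from collections import defaultdict


def directory_report(index, directory):
    """Report duplicates for every indexed file under directory.

    index is a dict mapping absolute file path -> content digest
    (a hypothetical format for this sketch). Returns a sorted list of
    (path, duplicates), where duplicates are the other indexed paths
    with the same digest.
    """
    by_digest = defaultdict(list)
    for path, digest in index.items():
        by_digest[digest].append(path)

    # A trailing separator keeps /home/phil/album1 from matching /home/phil/album10.
    prefix = os.path.join(directory, "")
    report = []
    for path in sorted(index):
        if path.startswith(prefix):
            others = sorted(p for p in by_digest[index[path]] if p != path)
            report.append((path, others))
    return report


def format_report(report):
    """Render the report in roughly the style shown above."""
    lines = []
    for path, others in report:
        if others:
            lines.append("%s == %s" % (path, ", ".join(others)))
        else:
            lines.append("%s -- NO DUPLICATES" % path)
    return lines
```

Because files are matched on their digest rather than their name, a.jpg and may-010.jpg pair up even though nothing in their paths suggests they are related.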

Most of the code in vacuumpack is actually devoted to identifying duplicates efficiently. vacuumpack stores the file hashes and metadata in a cache (specified by --cache=...) and automatically rereads files when (and only when) they have been modified. So after the initial run, vacuumpack runs very quickly (e.g. in just seconds on my entire homedir) while always producing up-to-date reports. It's fast enough that you can run it semi-interactively, using it to check your work continuously while you're reorganizing and cleaning your files. You can drill down and ask lots of questions about different directories without having to wait twenty minutes for each answer.
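Here is a minimal sketch of that kind of cache. It assumes "modified" means "size or modification time changed" and stores the cache as JSON; both are assumptions made for this example, as are the function names, and vacuumpack's real on-disk format may well differ.

```python
import hashlib
import json
import os


def load_cache(cache_path):
    """Load the cache, or start empty if it is missing or unreadable."""
    try:
        with open(cache_path) as f:
            return json.load(f)
    except (OSError, ValueError):
        return {}


def save_cache(cache_path, cache):
    with open(cache_path, "w") as f:
        json.dump(cache, f)


def hash_file(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(cache, paths):
    """Rehash only files whose size or mtime differs from the cached entry.

    Updates cache in place and returns the list of paths that were rehashed.
    """
    rehashed = []
    for path in paths:
        st = os.stat(path)
        entry = cache.get(path)
        if (entry is not None
                and entry["size"] == st.st_size
                and entry["mtime_ns"] == st.st_mtime_ns):
            continue  # unchanged since last run: trust the cached hash
        cache[path] = {
            "size": st.st_size,
            "mtime_ns": st.st_mtime_ns,
            "sha256": hash_file(path),
        }
        rehashed.append(path)
    return rehashed
```

On a second run over an unchanged tree, refresh only has to stat each file, and that is where the speedup comes from. The tradeoff of this particular check is that a file rewritten with the same size and the same mtime would go unnoticed.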

I've posted a git repo containing the vacuumpack source:

git clone http://web.psung.name/git/vacuumpack.git

This has been tested on Ubuntu 11.04 under Python 2.7.1. You may use the code under the terms of the GNU GPL v3 or (at your option) any later version.

Assorted notes

  • Public service announcement: earlier this year Google announced optional 2-factor authentication for Google accounts. Please use it: it's one of the least painful ways to make your data safer (most people are toast if their email gets compromised). And the implementation seems fairly well thought out:
    • You download an app to your smartphone (or smartphone-like device) that generates one-time passwords (OTPs), to be used in conjunction with your regular password when needed. A single OTP can authenticate one computer for up to 30 days. Yes, the app is open source. It runs on any Android, BlackBerry, or iOS device.
    • The app works offline, without a data connection, because the method for generating OTPs is standardized (yes, standardized and everything): sequence-based OTPs are specified by RFC 4226 (HOTP), and time-based OTPs by RFC 6238 (TOTP), which builds on it.
    • Failing that, if you don't have a smartphone, or it's busted, you can also receive an OTP via SMS to a designated number (though, obviously, then you need phone reception).
    • Failing that, if you don't have a cell phone, or it's busted, you can also receive an OTP via a voice call to a designated landline.
    • Failing that... if you know you'll be somewhere where you have no phone at all, you can print a list of OTPs to carry with you that will enable you to log in.
    • Apps that authenticate via just a password (e.g. the phone itself, or most desktop apps, like Picasa) get a dedicated automatically generated password. You don't get the benefit of 2-factor auth here, but these passwords are less likely to be phished because you're not typing them in all the time, and you can revoke them individually.
  • Good lord, Ubuntu 11.04 (Natty) is fast. My laptop (ThinkPad X201 with Intel SSD) boots from disk unlock screen (LUKS full-disk encryption) to a working Openbox desktop in about four seconds.
  • I've been playing with Blender (the Free 3D modeling tool) for a personal project to be 3D printed; it's a lot of fun and quite rewarding. I'm still a noob at this stuff, but already I get some of these "in the zone" moments that are so rarely attained in software (Emacs being the other exception) where I feel like I'm manipulating a thing directly rather than using a software program. The Blender UI looks like an airplane cockpit, but there is a method to its madness! The other neat thing is that most of the time when you do creative work on the computer you are not rewarded with anything nearly so tangible as a 3D printed piece.
  • A clever thing I noticed on Android the other week: when you use voice dictation in a text entry field, and you move the cursor back to previous words, above the keyboard it shows not the nearest alternatives based on the keyboard layout (as it would if you were typing), but the nearest alternatives based on sound, e.g. "wreck" ... "a nice beach" as suggested replacements for "recognize speech".
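Back on the two-factor note above: the reason an OTP app can work offline is that a code is just a function of a shared secret and a counter (or the current time). Here is a minimal sketch of the RFC 4226 computation, with the time-based variant from RFC 6238 layered on top. It is an illustration of the published algorithm, not Google's app code; the 30-second step and 6 digits are the RFC defaults, and the secret used in the usage note below is the RFC's test secret.

```python
import hashlib
import hmac
import struct
import time


def hotp(secret, counter, digits=6):
    """Counter-based OTP per RFC 4226 (HMAC-SHA-1 with dynamic truncation)."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # low 4 bits of the last byte pick a window
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret, now=None, step=30, digits=6):
    """Time-based OTP per RFC 6238: HOTP with counter = time // step."""
    if now is None:
        now = time.time()
    return hotp(secret, int(now) // step, digits)
```

With the RFC 4226 test secret b"12345678901234567890", hotp(secret, 0) gives "755224" and hotp(secret, 1) gives "287082", matching the test vectors in the RFC's appendix. Neither side needs a network connection: phone and server each compute the same value and compare.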