I was doing some spring cleaning of my computer, part of which was removing
  duplicate files. I'm a packrat, and I subscribe to the
  copy-first-and-ask-questions-later school of thought, so I have a few
duplicates floating around, just taking up disk space. (I think that somewhere I have a filesystem copy of my previous computer, which, in turn, contains a filesystem copy of the computer before that.)
There are plenty of tools that will help you remove duplicate files
  (this page lists no fewer than
16), but, disappointingly, none that I could find seems to give you a high-level understanding of where and how duplicate files appear in
  your filesystem. So I wrote vacuumpack, a short Python tool
  to help me really see what was going on in my filesystem. Nothing
  revolutionary, but it helped me through my spring cleaning.
(Aside 1: I noticed the parallel with coding theory: you want to remove undesirable redundancy, by deleting duplicate files on the same disk, and add desirable redundancy, by using backups or RAID.)
(Aside 2: This is simple enough that undoubtedly someone will send a pointer to a tool written circa 1988 that does what I am describing. C'est la vie; unfortunately it is usually easier, and more fun, to write code than to find it.)
vacuumpack is probably best explained by example.
  First, vacuumpack can scan a directory and identify duplicated
  files:
$ ./vacuumpack.py --target=/home/phil --cache=/home/phil/.contentindex
The output shows clusters of identical files. For example, oh hey, it's
    the GPLv3 (the cluster is prefaced by the sha256 of the
    shared content):
[...]
8ceb4b9ee5adedde47b31e975c1d90c73ad27b6b165a1dcd80c7c545eb65b903: 140588 bytes wasted
  /home/phil/projects/raytracer/COPYING (35147 bytes)
  /home/phil/projects/rdiff-snapshot-fs/source/COPYING (35147 bytes)
  /home/phil/projects/syzygy/COPYING (35147 bytes)
  /home/phil/projects/vacuumpack/COPYING (35147 bytes)
  /home/phil/source/emacs/COPYING (35147 bytes)
[...]
The clusters are sorted in decreasing order of the amount of space wasted,
  so you can fry the big fish first. Having multiple copies of the GPL floating
  around isn't really a cause for concern; let's have a look at another
  cluster:
6940e64dd91f1ac77c43f1f326f2416f4b54728ff47a529b190b7dadde78ea23: 714727 bytes wasted
  /home/phil/photos/20060508/may-031.jpg (714727 bytes)
  /home/phil/album1/c.jpg (714727 bytes)
Thus far, duff and many
  other tools do essentially the same thing. But most of the tools out there
  are focused on semi-mechanically helping you delete files, even
  though cleaning up often requires moving or copying them as
  well.
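As an aside, the clustering report above doesn't require much machinery: hash every file, group paths by content digest, and sort the groups by how many bytes the redundant copies waste. Here's a rough sketch of that idea (my illustration, not vacuumpack's actual code), using only the standard library:

import hashlib
import os

def sha256_of(path, blocksize=1 << 20):
    # Hash the file in chunks so large files don't need to fit in memory.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(blocksize), b''):
            h.update(block)
    return h.hexdigest()

def duplicate_clusters(root):
    # Map each content digest to the list of paths with that content.
    by_hash = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_hash.setdefault(sha256_of(path), []).append(path)
    # Keep only digests shared by more than one path; the wasted space is
    # the size of the redundant copies, i.e. all but one of them.
    clusters = []
    for digest, paths in by_hash.items():
        if len(paths) > 1:
            wasted = os.path.getsize(paths[0]) * (len(paths) - 1)
            clusters.append((wasted, digest, paths))
    return sorted(clusters, reverse=True)

(A real tool would also want to skip symlinks and compare file sizes before bothering to hash anything, but that's the basic shape.)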
Duplicated photos, like the ones above, are probably something I want to
  resolve. Since I recall that /home/phil/photos is the canonical
  location for most of my photos, the other directory looks somewhat suspect.
  So I'll ask vacuumpack to tell me more about it:
$ ./vacuumpack.py --target=/home/phil --cache=/home/phil/.contentindex /home/phil/album1
Now, for each file in that directory, vacuumpack tells me whether
  it knows about any duplicates:
/home/phil/album1/a.jpg         == /home/phil/photos/20060508/may-010.jpg
/home/phil/album1/b.jpg         == /home/phil/photos/20060508/may-019.jpg
/home/phil/album1/c.jpg         == /home/phil/photos/20060508/may-031.jpg
/home/phil/album1/d.jpg         == /home/phil/photos/20060508/may-033.jpg
/home/phil/album1/e.jpg         == /home/phil/photos/20060508/may-048.jpg
/home/phil/album1/f.jpg         -- NO DUPLICATES
/home/phil/album1/g.jpg         == /home/phil/photos/20060508/may-077.jpg
/home/phil/album1/h.jpg         == /home/phil/photos/20060508/may-096.jpg
It looks like the files in /home/phil/album1 are a subset of the
  photos in /home/phil/photos... except
  that /home/phil/photos is missing a file! I need to copy that file
  back; once I do, the directory album1 is safe to delete.
In this mode vacuumpack is behaving like a directory diff tool,
  except that it uses content rather than filenames to match up files.
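Given a mapping from content digest to known paths, like the by_hash dictionary in my sketch above, this content-based diff falls out almost for free. Something along these lines would do it (again, an illustration with made-up names rather than vacuumpack's actual code):

def report_directory(by_hash, directory):
    # For each regular file in the directory, print the other paths that
    # share its content, or note that no duplicates are known.
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        twins = [p for p in by_hash.get(sha256_of(path), []) if p != path]
        if twins:
            for twin in twins:
                print('%-40s == %s' % (path, twin))
        else:
            print('%-40s -- NO DUPLICATES' % path)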
A majority of the code in vacuumpack is actually devoted to
  identifying duplicates efficiently. vacuumpack stores the file
  hashes and metadata in a cache (specified by --cache=...) and
  automatically rereads files when (and only when) they have been modified. So
  after the initial run, vacuumpack runs very quickly (e.g. in just
  seconds on my entire homedir) while always producing up-to-date reports. It's
  fast enough that you can run it semi-interactively, using it to check your
  work continuously while you're reorganizing and cleaning your files. You can drill down and ask lots of questions about different directories without having to wait twenty minutes for each answer.
I've posted a git repo containing the vacuumpack source:
git clone http://web.psung.name/git/vacuumpack.git
This has been tested on Ubuntu 11.04 under Python 2.7.1. You may use the
  code under the terms of the GNU GPL v3 or (at your option) any later
  version.