GNU Parallel

I figured this was worth sharing because I myself had written two (fairly lame) clones of this program before I discovered it.

Sometimes I find myself composing and running huge shell scripts, like the following:

$ cat process-files.sh
sox input/foo.ogg output/foo.ogg channels 1
sox input/bar.ogg output/bar.ogg channels 1
sox input/baz.ogg output/baz.ogg channels 1
sox input/quux.ogg output/quux.ogg channels 1
# more of the same, for perhaps hundreds of lines...

(Aside: why not xargs? For complicated tasks, it can be error-prone or just plain insufficient. Moreover, there's a lot of value in being able to just look at the script and see exactly what is going to be executed on your behalf, especially for one-off tasks. If you know emacs macros, scripts like this are not onerous at all to generate anyway.)

If you have a sequence of tasks like this that can run independently (and they are CPU-bound), then it pays to distribute the tasks over all your CPU cores. Here's where GNU Parallel comes in handy. Just pipe into it the commands you want to execute:

$ parallel -j4 < process-files.sh

Now parallel runs up to 4 tasks concurrently, starting a new one whenever a running one finishes (just as if you had a queue and a pool of 4 workers). What an elegant interface.
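If you would rather generate the command list with a loop than with editor macros, here is a minimal sketch. The function name gen_commands and the directory arguments are made up for illustration, and it assumes file names without spaces or shell metacharacters, since the generated lines are not quoted.

```shell
#!/bin/sh
# Print one sox command per .ogg file found in the given input directory.
# gen_commands and its two arguments are illustrative names, not part of
# GNU Parallel or sox.
gen_commands() {
    in_dir=$1
    out_dir=$2
    for f in "$in_dir"/*.ogg; do
        # Skip the literal, unexpanded pattern when no file matches.
        [ -e "$f" ] || continue
        printf 'sox %s %s/%s channels 1\n' "$f" "$out_dir" "$(basename "$f")"
    done
}

# Typical use (commented out so the sketch has no side effects):
#   gen_commands input output > process-files.sh
#   parallel -j4 < process-files.sh
```

The generated file is still a plain list of commands, so you can read it over before handing it to parallel.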

GNU Parallel has a bunch of more advanced features that are worth checking out, for example, preserving the proper ordering of standard output across tasks with --keep-order (to maintain the illusion of sequential-ness), or showing an ETA with --eta.
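To see output ordering and the ETA display in action, here is a small sketch with toy sleep tasks standing in for real work. The function names demo_keep_order and demo_eta are made up for illustration, and the ::: argument syntax is GNU Parallel's own, so this assumes the parallel on your PATH is GNU Parallel and not a different tool of the same name.

```shell
#!/bin/sh
# Three tasks that finish in reverse order of submission (sleep 3, 2, 1).
# With -k (--keep-order), their output is still printed in input order:
# "slept 3", then "slept 2", then "slept 1".
demo_keep_order() {
    parallel -k -j3 'sleep {}; echo "slept {}"' ::: 3 2 1
}

# --eta reports an estimated time to completion on stderr while jobs run.
demo_eta() {
    parallel --eta -j2 'sleep {}' ::: 1 1 1 1
}

# Run the demos only when a parallel binary is on PATH.
if command -v parallel >/dev/null 2>&1; then
    demo_keep_order
    demo_eta
fi
```

Without -k, each task's output is printed as soon as that task finishes, so the three lines would typically come out in the opposite order.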

GNU Parallel is not in the official Debian/Ubuntu repos (as far as I can tell) but it is a snap to build from source, and it's the sort of thing I'd want floating around in my ~/bin everywhere I work.

8 comments:

  1. Consider:

    cd input
    ls | parallel -j+0 sox {} ../output/{} channels 1

    This would save you making the huge process-files script.

  2. Thanks! I sometimes use xargs and the xargs-style syntax for parallel, but in the general case I sometimes fall back to a Huge Shell Script. For example, if you want to do something like

    convert foo.jpg [...] foo_small.jpg

    then I don't know whether it's simple (or possible) to automate that using the xargs-style syntax.

  3. xargs cannot, but GNU Parallel can:

    parallel convert {} {.}_small.jpg ::: *.jpg

  4. Wow, that's really neat. Thanks!

  5. You may also have a look at http://sf.net/projects/paexec

    This tool does similar tasks (distributes your tasks over CPUs or hosts) but in a different way. It also provides features absent in GNU Parallel.

  6. Thanks for this post.

    Just a note for those who might accidentally install the 'moreutils' package on Ubuntu: this package provides a *different* 'parallel' binary that is not GNU Parallel. It has different syntax and generally won't work as described here.

  7. It is 2013 and GNU Parallel is as easy as 'apt-get install parallel -y' in Debian Squeeze.

  8. I have prepared a shell script to check URL status, but when I executed the script to check 1000 URLs, it took at least 30 minutes to finish.

    Could you please help me out? Is there a way I can reduce the run time?
