
GNU Parallel

I figured this was worth sharing because I myself had written two (fairly lame) clones of this program before I discovered it.

Sometimes I find myself composing and running huge shell scripts, like the following:

$ cat process-files.sh
sox input/foo.ogg output/foo.ogg channels 1
sox input/bar.ogg output/bar.ogg channels 1
sox input/baz.ogg output/baz.ogg channels 1
sox input/quux.ogg output/quux.ogg channels 1
# more of the same, for perhaps hundreds of lines...

(Aside: why not xargs? For complicated tasks, it can be error-prone or just plain insufficient. Moreover, there's a lot of value in being able to just look at the script and see exactly what is going to be executed on your behalf, especially for one-off tasks. If you know emacs macros, scripts like this are not onerous at all to generate anyway.)
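
If you'd rather not use editor macros, a quick shell loop can generate a script like this too. Here's a rough sketch, assuming the input/ and output/ layout from the example above:

$ for f in input/*.ogg; do echo "sox $f output/${f##*/} channels 1"; done > process-files.sh

(The ${f##*/} just strips the input/ prefix, leaving the bare file name.)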

If you have a sequence of tasks like this that can run independently (and they are CPU-bound), then it pays to distribute the tasks over all your CPU cores. Here's where GNU Parallel comes in handy. Just pipe into it the commands you want to execute:

$ parallel -j4 < process-files.sh

Now parallel runs up to 4 tasks concurrently, starting a new one whenever one finishes (just as if you had a queue and a pool of 4 workers). What an elegant interface.

GNU Parallel has a bunch of more advanced features that are worth checking out, for example, preserving the proper ordering of standard output across tasks (to maintain the illusion of sequential-ness), or showing an ETA.
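
For instance, something like this works for me (the exact flags may differ between versions, so check parallel --help):

$ parallel -k --eta -j4 < process-files.sh

Here -k (--keep-order) prints each task's output in the order the tasks were read rather than the order they happen to finish in, and --eta prints a running estimate of the time remaining. As far as I can tell, if you leave out -j entirely, parallel defaults to one job per CPU core, which is usually what you want anyway.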

GNU Parallel is not in the official Debian/Ubuntu repos (as far as I can tell) but it is a snap to build from source, and it's the sort of thing I'd want floating around in my ~/bin everywhere I work.

Network transparency makes distance irrelevant

The most recent version of Ubuntu and the upcoming version of Fedora both ship with PulseAudio, a sound server which supports, among other things, network-transparent operation: you can take any program that generates sound and redirect that sound to be played on any other machine with a PulseAudio server.

PulseAudio is part of a long and venerable history of using computers remotely. People have been using SSH and X11 forwarding for ages now. CUPS and SANE allow you to access a printer or a scanner from halfway around the world. x2x lets you move the cursor on one computer with the mouse on another. And SSH's port-forwarding feature builds bridges to enable pretty much any service to be used from anywhere, even if the server is behind a firewall or is only serving to localhost. Features like these aren't unique to the modern Unix-like operating systems, but it is only there that they are widely enough used that people actually rely on them to get things done. Perhaps more critically, it is only there that they are easy enough to configure that people can use them on an ad-hoc basis.

In contrast, on Windows or Mac OS, users are very much tied to individual machines. It does not really occur to people that they could use more than one computer, or that they would even want to.

I think this reflects one of the philosophical differences between Unix-like operating systems and others. Under Unix, a computer provides a number of resources (such as software, data, computation capacity, or specialized hardware) from which you can pick and choose. Because of network transparency, physical proximity is unimportant. Channels of communication can be layered and rerouted arbitrarily. You can make computers work together as ensembles. Individual computers are fungible.

Just a few weeks ago, for my research work, I wrote a script to distribute the processing of hundreds of files among a bunch of free machines. SSH is set up with public-key authentication there, so all the distribution of work happens without human intervention. The whole thing, including parsing the input, distributing those jobs, and load balancing, is about 50 lines of Python, using software which is standard on most any actual GNU/Linux system.

I do not think it is accidental that this kind of flexibility started in OSes built on free software. People who are trying to sell you software licenses do not have as much incentive to allow individual computers to be used in whatever way you please.

Setting up reverse SSH

With SSH, you can use local forwarding to access a service that's available from a well-known host. For example:

localhost$ ssh -L 8080:remotehost:80 corporate

asks localhost to take requests to http://localhost:8080/ and relay them through corporate to http://remotehost:80/. This is handy if the host remotehost is accessible from corporate but not from localhost. Note that the client, localhost, can be anywhere (e.g. behind a NAT or a firewall) because it is the one initiating the SSH connection.
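
A quick way to convince yourself the tunnel works is to fetch something through it from another terminal on localhost (just a sanity check):

localhost$ wget -O - http://localhost:8080/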

You can also do the reverse (also known as reverse SSH), forwarding in the other direction to allow a service behind a NAT/firewall to be accessible from a well-known host. For example:

localhost$ ssh -R 8080:localserver:80 corporate

asks corporate to take requests to http://corporate:8080/ and relay them through localhost to http://localserver:80/. This is handy if localserver is a host which is accessible from localhost but not from corporate. (One caveat: by default, sshd binds remote forwards to the loopback interface only, so you may need GatewayPorts yes in corporate's sshd_config before machines other than corporate itself can reach corporate:8080.) Note that the proxy and server, localhost and localserver, can be anywhere (e.g. behind a NAT or a firewall) because localhost is the one initiating the SSH connection. This setup can be used by trusted users inside a firewall to circumvent the firewall.

I frequently connect from my laptop to a desktop machine to do work. I've configured reverse SSH to allow SSHing from my desktop to my laptop so that I can, for example, save files to my laptop using a program on my desktop. Again, this works whenever I connect, even if my laptop connects from behind a NAT/firewall, or if I'm just on some network where I haven't bothered to look up my IP address.

To do this manually, add a remote port-forward and use that port to connect from the remote end:

laptop$ ssh -R 8889:localhost:22 desktop
desktop$ ssh -p 8889 localhost
laptop$

Now, this is sort of unwieldy, so I've configured my .ssh/config to fill in all the boring parts automatically. On laptop:

Host desktop
  RemoteForward 8889 localhost:22

Then, on desktop:

Host laptop
  HostName localhost
  HostKeyAlias laptop
  Port 8889

Now the login sequence is much more streamlined:

laptop$ ssh desktop
desktop$ ssh laptop
laptop$

Notice the HostKeyAlias field. Old versions of OpenSSH would barf if, on desktop, you SSH'd to both localhost:22 and to localhost:8889, because it would see two different host keys associated with the same host and think that there was a man-in-the-middle attack going on. This is a common problem whenever port-forwarding is used. The HostKeyAlias is your way of telling SSH, "no, this connection I've set up has its own host key which you should remember, but that host key is different from that of localhost". (I think newer versions of OpenSSH are smart enough to remember the host key on a per host/port basis, so they aren't subject to this problem.)
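
For the curious, here is roughly what the relevant entries end up looking like in ~/.ssh/known_hosts on desktop. This is only illustrative: the keys are elided, and if your client hashes host names (HashKnownHosts) the entries won't be human-readable at all.

localhost ssh-rsa AAAA...        # learned from: ssh localhost
laptop ssh-rsa AAAA...           # stored under the HostKeyAlias from the config above
[localhost]:8889 ssh-rsa AAAA... # what newer OpenSSH records instead if you skip HostKeyAlias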

How to rescue a dying hard drive

If you suspect your hard drive is dying (unusual amounts of disk thrashing, long access times, frequent file corruption, failure to boot) and you have a spare disk, it's not hard to clone the disk. It will be easier to recover data from the clone, since continued use of the bad disk can make things worse.

  1. If the disk is your boot disk, move it to another computer. Otherwise, be sure to unmount it first.
  2. Determine what device and partition the disk appears on, perhaps using mount or fdisk.
  3. Use ddrescue to make an image of the disk:
    $ ddrescue /dev/sdb1 IMAGEFILE LOGFILE
    where sdb1 is the node for your source disk. ddrescue is like dd (which writes a copy of all the disk's raw data to a file), but works better for damaged disks: it fills in zeros for parts of the disk it can't read; you can run ddrescue as many times as you want, and if you provide the same LOGFILE it will attempt to fill in the gaps in the image that it didn't get before.
  4. Write the image to a new disk (or see the note after this list if you'd rather mount the image directly):
    $ dd if=IMAGEFILE of=/dev/sdc1
    where sdc1 is the node for your target disk.
  5. Mount the filesystem and run fsck or an equivalent. For NTFS volumes, moving the drive to a Windows computer and running chkdsk /F works wonders.
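
If you just want the data back and don't have a spare disk of the right size, you can often skip step 4 and mount the image directly through a loop device instead (a sketch; run as root, and add filesystem-specific options as needed):

$ mkdir /mnt/rescue
$ mount -o loop,ro IMAGEFILE /mnt/rescue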

Makefiles

You may know make as a tool for building software, but it's actually a general utility for performing tasks based on a dependency graph, even if those tasks have nothing to do with compiling software. In many situations, files are processed in some way to create other files or perform certain tasks. For example:

  • A .c file is compiled to make a .o file, which is then linked to other object files to make an executable.
  • A .tex file is processed by LaTeX to make a PDF file.
  • A file containing data is processed by a script to generate graphs.
  • A web page needs to be uploaded to a remote server after it is modified.
  • A piece of software needs to be installed after it is compiled.
In each case, the modification of a source file demands that some task be performed to ensure that the corresponding resource is up-to-date (i.e., it reflects the changes made to the source file).

make automates the process of keeping resources up-to-date: you tell it about what resources depend on what source files, and what needs to be done to those source files to make the resources. It then runs only those tasks that actually need to be run. No more obsessive-compulsive internal dialogues that go like this: "Did I remember to compile that program again after I was done changing it?" "Well, I don't remember, so I might as well do it again." Running a properly-configured make will compile your program if it's necessary, and do nothing if your program is up-to-date. This can save you seconds, minutes, or hours, depending on what you're doing.

make determines what to do by comparing the modification times of the resources and the source files. For example, if thesis.pdf (a resource) is created from thesis.tex (a source file), then pdflatex needs to be run if thesis.pdf doesn't exist or if it's older than thesis.tex, but nothing needs to be done if thesis.pdf is newer than thesis.tex.

To use make to manage some files in a directory, create a file called Makefile in that directory. The syntax for the Makefile is refreshingly simple. For each resource to generate (these are called targets), type:

TARGET-FILE: DEPENDENCY-FILE-1 DEPENDENCY-FILE-2 ...
       COMMAND-TO-RUN
       ANOTHER-COMMAND

Each command to be run should be indented with a single tab. Here's an example Makefile, for a system where two files are to be compiled, then linked to make an executable:

helloworld: hello1.o hello2.o
       gcc -o helloworld hello1.o hello2.o
hello1.o: hello1.c
       gcc -c hello1.c
hello2.o: hello2.c
       gcc -c hello2.c

Running $ make helloworld tells make to resolve the dependencies necessary to generate the file helloworld: make first compiles the two source files, then links them. If I were to subsequently change hello1.c and run $ make helloworld again, make would recompile that file and relink the program, but it would not recompile hello2.c.

By default, $ make (without a target name) does whatever is necessary to make the first target that appears in your Makefile (in this case, helloworld).
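
A quick session shows the bookkeeping in action (the exact messages depend on your version of make):

$ make
gcc -c hello1.c
gcc -c hello2.c
gcc -o helloworld hello1.o hello2.o
$ make
make: 'helloworld' is up to date.
$ touch hello1.c      # pretend we just edited hello1.c
$ make
gcc -c hello1.c
gcc -o helloworld hello1.o hello2.o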

In fact, your make targets do not have to refer to files. You can use make to define groups of commands to be run together under an (arbitrary) easy-to-remember target name. For example, I could define a publish target which would upload my web pages:

.PHONY: publish
publish: index.html stuff.html
       scp index.html stuff.html phil@remoteserver:~/www/

(The .PHONY directive tells make to not worry that a file named publish is not being generated.) Now publishing my web page is as easy as running $ make publish.

Here are some common applications of phony targets:

  • After you compile a piece of software, you can frequently install it with $ make install.
  • $ make clean usually has the job of removing compiled files from a directory (so you are left with the "clean" source files). It can often be defined similarly to this:
    .PHONY: clean
    clean:
           rm -f helloworld *.o
  • If a source directory contains multiple independent targets, an all target is usually created to force building the individual parts, and placed first so that $ make invokes $ make all:
    .PHONY: all
    all: part1 part2 part3

As you can see, you can use make to automate any well-defined process you're going to want to do repeatedly.

Getting more screen real estate with x2x

x2x is a program to transfer keyboard and mouse input from one X display to another. But in X, displays can be controlled by remote mice and keyboards (using X-forwarding), so what x2x really does is let you control a desktop remotely.

One particularly neat feature of x2x is the directional mode, which can essentially make two monitors behave like a single large screen: when you move your mouse off one edge of the first display, it appears on the second, even if the displays are running on two separate computers. This is really handy if you have two computers which are near each other and you need to use both of them at the same time, or if you just want some more screen area for certain tasks (for example, writing on one display while reading on another). It's actually easy enough that you can set up this kind of sharing ad-hoc, whenever you need it.

If you can SSH from one computer to the other, then x2x is very easy to configure. Suppose you have two computers named LAPTOP and DESKTOP. Then on LAPTOP do the following:

laptop$ ssh -X desktop
desktop$ x2x -east -from $DISPLAY -to :0

Now LAPTOP's mouse and keyboard can control DESKTOP when you move the mouse off the right-hand edge of the screen. To make DESKTOP's mouse and keyboard control LAPTOP instead, do the following on LAPTOP:

laptop$ ssh -X desktop
desktop$ x2x -west -from :0 -to $DISPLAY
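
As an aside, each of these pairs can be collapsed into a single command run from LAPTOP. The single quotes matter: they make $DISPLAY expand on desktop, where SSH's X forwarding sets it, rather than on the laptop, and -f sends ssh to the background once x2x is running. (A sketch; adjust the direction to taste.)

laptop$ ssh -X -f desktop 'x2x -east -from $DISPLAY -to :0'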

If neither of the computers you're using can SSH to the other, you'll need a third computer that both of them can access. Suppose you want to connect the displays of LAPTOP1 and LAPTOP2. In this case, the key is to connect both to SERVER, determine which forwarded X display each one is assigned, and then invoke x2x on one of them. On LAPTOP1, do this:

laptop1$ ssh -X server
server$ echo $DISPLAY
:10

Then on LAPTOP2, running

laptop2$ ssh -X server
server$ x2x -east -from :10 -to $DISPLAY

allows LAPTOP1 to control LAPTOP2 (provided you have replaced :10 with what LAPTOP1's $DISPLAY variable actually is). To make LAPTOP2 control LAPTOP1:

laptop2$ ssh -X server
server$ x2x -west -from $DISPLAY -to :10

Of course, you can use -north, -south, -east, or -west for the direction argument, depending on how the monitors are actually arranged relative to each other.

Everything is a text file

One of the things which makes UNIX systems so powerful is the ease with which one can move data around. What makes this possible is the fact that, with few exceptions, everything is a text stream:

  • Configuration files are plain text.
  • Data are usually stored as flat text files.
  • Even executables, which most Windows users consider to be synonymous with "binaries", are frequently text files: shell scripts, Perl scripts, Python scripts, and the like.
  • Most of the UNIX core programs produce output as text streams or text files.
(On Windows, all of the above except configuration are usually represented as binary data, and the configuration data stored in the registry is not really amenable to editing by hand.)

What this means is that any text tool you learn, from less to emacs, can be put to use in almost any situation; you don't have to learn specialized tools for every new task you want to perform. Moreover, it means that applications which understand text automatically get an interface they can use to talk to each other. Text is the universal language of computing.

This model is so useful that Linux even creates text interfaces for many system internals which are not naturally represented as text files or streams. The /proc filesystem, inspired by Plan 9, is one such "virtual filesystem" which exposes certain system vital signs. For example, /proc/cpuinfo provides information about the CPU(s):

prompt$ head /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 9
model name      : Intel(R) Pentium(R) M processor 1400MHz
stepping        : 5
cpu MHz         : 1400.000
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no

The /proc filesystem contains a wealth of system information (current processes, memory usage, hardware and software configuration), all in the guise of text files that you can read. You can write to some of these files to change configurations as well. For example, when executed as root,

echo www.example.com > /proc/sys/kernel/hostname

changes the host name, and

echo 1 > /proc/sys/vm/drop_caches

drops the system page cache.

So what? It means that when you want to write a program to read or manipulate some aspect of the system, you don't have to rely on bindings which are fragile, or require special headers, or are unavailable in your language of choice. All your program needs to do is read from or write to a file, which is (usually) a piece of cake.
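
A couple of one-liners, using nothing but the standard text tools, illustrate the point (field names may vary a little between kernel versions):

$ grep 'model name' /proc/cpuinfo | head -1     # what CPU am I running on?
$ awk '/MemTotal|MemFree/ {print $1, $2, $3}' /proc/meminfo    # memory, in kB
$ ls -d /proc/[0-9]* | wc -l                    # how many processes are running?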

Sometimes flat text isn't enough. What if you need structured or hierarchical data? As it turns out, a filesystem provides a fine hierarchical storage mechanism in the form of directories. For example, the /proc filesystem stores information about the process with PID N inside files in /proc/N/. But when structure is stored in directories instead of, say, nested XML elements, or in the keys of a Windows registry file, you can bring to bear all the tools you already know that operate on files. If you're deploying an application, it's trivial to copy or extract a configuration directory to ~/.appname/. It's not quite as easy to unzip an XML configuration fragment into a larger XML configuration file.
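
To poke at this, pick a process and look inside its directory; $$ expands to the current shell's PID, so for example:

$ ls /proc/$$/
$ tr '\0' ' ' < /proc/$$/cmdline; echo          # its command line (NUL-separated on disk)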

The idea of universal interfaces has found traction even as flat text files have lost ground. In the OpenDocument format, each document (extensions .odt, .odp, etc.) is really a ZIP file containing an XML description of the file and any extra media. (On the other hand, prior to Microsoft Office 2007, essentially all knowledge of the Office file format was obtained through reverse-engineering.)

The other week, I found myself in a situation where I had to save all the (numerous) images in a Word document to individual files. The fastest way was in fact to use OpenOffice to convert to an OpenDocument; when I unzipped that, one of the directories inside contained all the pictures which were in the original document. Common interfaces and tools help you to break free from the limitations of specialized tools when necessary.
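
In case it saves someone a few minutes: the extraction itself is a one-liner (assuming the converted file is called document.odt; in the documents I've looked at, the embedded images land in a directory named Pictures):

$ unzip document.odt -d document-contents
$ ls document-contents/Pictures/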

Further reading: Unix philosophy, proc filesystem. The /sys filesystem is worth a look too.

The Unix Pipe

Any Unix installation comes with a bunch of little programs: grep, tar, and scores of others.

Here are some examples of using these tools in combination by using the Unix pipe. These combinations really amount to one-line programs. (The theme is IM log analysis.)

(Gaim saves IM logs in files named with the date and time.)

cd ~/.gaim/logs/aim/my_sn/; find . | egrep '2006-08-26' | xargs head

"Show me the beginning of every conversation I had yesterday."

(Within each account, Gaim stores the logs for each of your buddies in a directory named using the buddy's screen name.)

cd ~/.gaim/logs/aim/my_sn/; du -sk * | sort -gr | cut -f2 | head

"Print the screen names of the people I talk to the most, in descending order of total log size."

cd ~/.gaim/logs/aim/my_sn/; egrep -i -r "my_sn:.*linux" * | wc -l

"How many times have I mentioned 'linux' in conversation?" (Answer: 83.)

Here is some stuff I actually use on a regular basis:

wget -O - http://www.example.com/file.tar.gz | tar -xz

Download and decompress a file in one step. This way, I don't have to make a temporary directory in which to download the file, I don't have to remember where I put the file, and I don't have to delete it when I'm done.

ssh phil@remotehost 'cat ~/filelist | xargs tar -C ~/ -cz' > ./backup`date +%Y%m%d`.tar.gz

Make backups of selected files over the network from a remote host. This command reads a file I keep (named filelist) which contains a list of all the files/directories I want to back up, one per line.

For your edification, or if you're insomniac, here are full explanations for what the above commands are doing:

  • find recursively lists every file in the log directory; egrep does filtering to pass on only those lines (filenames) which contain that particular date; xargs constructs and executes the command "head file1 file2 ..." where file1, file2, etc. are the lines it gets from stdin; head, in turn, prints the beginning (first 10 lines) of each file argument.
  • du prints each directory named, preceded by its disk usage; sort sorts all the rows by the first column (the disk usage) in reverse (decreasing) order; cut trims off the size so that only the names remain (the second column, hence -f2) and head limits the output to the first 10 lines.
  • egrep -i -r searches recursively over all lines in all files contained in the directory; wc -l takes in all the lines and prints as output only the number of lines.
  • wget -O - downloads the file and outputs to stdout instead of to a file; tar -xz extracts a .tar.gz from stdin to the current directory (the absence of -f FILE means use stdin instead of reading from a file).
  • The quoted command is run on the remote host: "cat ~/filelist | xargs tar -C ~/ -cz" constructs and executes the command "tar -C ~/ -cz file1 file2 ...", supplying to tar all the files I've listed in filelist (the -C ~/ makes tar interpret those paths relative to my home directory). This compresses all the files I named and writes the archive to stdout. The archive is then written to a file named something like backup20060827.tar.gz. (The date command is executed and outputs something like "20060827"; this string is then pasted into the command.)

Bonus: In the last example, the stdout of the last command on the remote host can be immediately redirected to a file on the local host. ssh is in general capable of connecting pipes between programs on different hosts. It automatically streams that information over the network (encrypted, of course) so that the connection is transparent to everyone involved!
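
The pipe works just as well in the other direction. For example, something like this pushes a compressed copy of a local directory to the remote host (a sketch with made-up paths; the explicit -f - tells tar to write the archive to stdout):

tar -C ~ -czf - Documents | ssh phil@remotehost 'cat > documents-backup.tar.gz'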

Further reading: on a GNU system, typing info coreutils will bring up information about the base GNU tools, like cat, head, and more.