January 2008

A wonderful abuse of regular expressions

As it turns out, you can use regular expressions to test primality.

You'll need a regular expression engine that supports backreferences to do this. (Intuitively, a more restricted matcher that can be implemented as a finite state machine cannot count up to arbitrarily large numbers.)

I tried this in GNU egrep. I made this input file, which contains a number on each line followed by that number of stars:

$ cat numbers.txt
 2 **
 3 ***
 4 ****
 5 *****
 6 ******
 7 *******
 8 ********
 9 *********
10 **********
11 ***********
12 ************
13 *************
14 **************
15 ***************
16 ****************
17 *****************
18 ******************
19 *******************
20 ********************

Then, to find only those lines corresponding to prime numbers:

$ egrep -v " (\*{2,})\1+$" numbers.txt
 2 **
 3 ***
 5 *****
 7 *******
11 ***********
13 *************
17 *****************
19 *******************

Intuitively, the regular expression identifies composite numbers by seeing whether they can be written as a group of two or more stars, followed by one or more repetitions of that group. Writing the line in such a way corresponds of finding a factorization of the corresponding number into two other integers, both of which are greater than 1. The implementation basically carries out trial division by letting the first group be equal to **, then to ***, ****, etc. The -v option to egrep only prints those lines which do not match the regular expression, that is, the prime numbers.

Here's the catch:

Back-references are very slow, and may require exponential time.

(man page for GNU egrep)

Thank you, HP and Intel

I recently set up an HP all-in-one printer/scanner/fax (an Officejet 5610) on my house network.

Thanks, HP, for providing free drivers for the printer and the scanner. I set up the device directly through my OS and I don't have to use a crappy vendor-specific frontend to scan and print. I've had a bad experience with a Brother scanner (which didn't provide an open source driver, only a really flaky binary), so this is quite refreshing.

In Ubuntu, the printer drivers (PPDs) are in the package hpijs-ppds and the scanner drivers (SANE backends) are in the package hplip, I think. Our Officejet now happily scans and prints over the network with SANE and CUPS. Standards, I love you.

Which reminds me, I'm also thanking my lucky stars for the free drivers for components on my Thinkpad X61s. The Intel integrated graphics (965GM) and the Intel wireless chipset (4965AGN) just work out of the box on Ubuntu. Thanks, Intel!

Setting up reverse SSH

With SSH, you can use local forwarding to access a service that's available from a well-known host. For example:

localhost$ ssh -L 8080:remotehost:80 corporate

asks localhost to take requests to http://localhost:8080/ and relay them through corporate to http://remotehost:80/. This is handy if the host remotehost is accessible from corporate but not from localhost. Note that the client, localhost, can be anywhere (e.g. behind a NAT or a firewall) because it is the one initiating the SSH connection.

You can also do the reverse (also known as reverse SSH), forwarding in the other direction to allow a service behind a NAT/firewall to be accessible from a well-known host. For example:

localhost$ ssh -R 8080:localserver:80 corporate

asks corporate to take requests to http://corporate:8080/ and relay them through localhost to http://localserver:80/. This is handy if localserver is a host which is accessible from localhost but not from corporate. Note that the proxy and server, localhost and localserver, can be anywhere (e.g. behind a NAT or a firewall) because localhost is the one initiating the SSH connection. This setup can be used by trusted users inside a firewall to circumvent the firewall.

I frequently connect from my laptop to a desktop machine to do work. I've configured reverse SSH to allow SSHing from my desktop to my laptop so that I can, for example, save files to my laptop using a program on my desktop. Again, this works whenever I connect, even if my laptop connects from behind a NAT/firewall, or if I'm just on some network where I haven't bothered to look up my IP address.

To do this manually, add a remote port-forward and use that port to connect from the remote end:

laptop$ ssh -R 8889:localhost:22 desktop desktop$ ssh -p 8889 localhost laptop$

Now, this is sort of unwieldy, so I've configured my .ssh/config to fill in all the boring parts automatically. On laptop:

Host desktop RemoteForward 8889 localhost:22

Then, on desktop:

Host laptop HostName localhost HostKeyAlias laptop Port 8889

Now the login sequence is much more streamlined:

laptop$ ssh desktop desktop$ ssh laptop laptop$

Notice the HostKeyAlias field. Old versions of OpenSSH would barf if, on desktop, you SSH'd to both localhost:22 and to localhost:8889, because it would see two different host keys associated with the same host and think that there was a man-in-the-middle attack going on. This is a common problem whenever port-forwarding is used. The HostKeyAlias is your way of telling SSH, "no, this connection I've set up has its own host key which you should remember, but that host key is different from that of localhost". (I think newer versions of OpenSSH are smart enough to remember the host key on a per host/port basis, so they aren't subject to this problem.)

2007 in Review

What did I do and learn last year?

Emacs

Emacs is a Lisp-based operating system which I occasionally use for editing text files. When you type in a text box in a normal application, you are just typing in a little box. When you work in Emacs, the thing you are editing an extension of your mind. After becoming acclimated to this, using a regular editor is incredibly frustrating in that you can feel the lag between what you are thinking and what you are actually capable of doing with your hands.

I taught a class on Emacs last IAP and have been ramping up my Emacs usage since then. Using Emacs leads to a virtuous cycle of productivity: you do your work in Emacs, discover patterns in the things you do, then automate those patterns using macros and code. In Emacs, it's trivial to find out how the keys you press translate into function calls, so unlike most automation languages, you never have to look up some crazy API in order to code an equivalent for what you were already doing.

For coding, some people swear by Eclipse (for Java and C++ work, at least). What I think they don't realize is that time spent refactoring and remembering symbol names is a drop in the bucket compared to the time spent typing and moving text around, and no IDE comes close to Emacs as an editor.

I'd characterize Emacs as having a text UI for input combined with a graphical UI for output, taking advantage of the high-bandwidth link in both directions. Unlike typical GUI programs, Emacs isn't used to whitewash the underlying OS and programs; rather, it augments them. Emacs gives you the full power of command-line tools without the limitations of shell interaction. The biggest of these limitations is what I call "linearity". Once you run a command, you can look at its output, but the output is dead: you can't edit it (unless it occurred to you to save it to a file first), and you can't act on it directly except in very restricted ways. In Emacs, every significant piece of text you see appears in a buffer, where you can bring to bear all your editing skills and execute context-specific commands in the buffer. Read the Emacs guided tour for more of my thoughts on this (with examples).

Git

Git is a versatile content tracker which I've started using mostly as a local VCS and as file synchronization software. It is astonishingly easy to set up and use locally (at least after trying to use CVS and SVN), so I've started using it for managing almost all of my projects.

Git has changed the way I write software. When version control is fast, no change is too small to commit by itself. Smaller and logically coherent changes make it easier to isolate bugs. In addition, always having a nearby baseline version makes it easier to think about what I am trying to do at any particular time. When branching is fast enough to use regularly, taking a detour to try something new and ambitious doesn't interfere with the rest of your work.

Having really easy replication is quite reassuring. I actually trust Git to synchronize my files correctly, regardless of when I modified my files on which hosts since the last synchronization, something which is not true of, say, rsync.

Free software

I came for the power, but I stayed for the freedom.

I think at some point in 2006 I switched over completely to using GNU/Linux. In 2007 I stopped using nonfree software entirely on my primary work computer (now a Thinkpad X61s I got this fall, which works great, by the way).

I love the ethos associated with free software. Astonishingly, some people have software that doesn't allow them to (1) use it in any way they wish (2) share a good thing with their friends (3) learn from it by taking it apart (4) tweak it to suit their needs (or pay someone else to do so). How backwards is that?

What bothers me the most about proprietary software is its use in education. Many proprietary software makers "subsidize" students by offering their software cheaply. In reality, by teaching proprietary software in schools, the education system is subsidizing the software makers: if all I've used in school is Microsoft Word, then I am beholden to Microsoft when I graduate. It goes against the very idea that education is supposed to make people self-reliant, but for public money to be spent this way is particularly horrifying.

Computers are tools of empowerment. That's why free software is important: as we've seen in recent years, if you are running proprietary software on your computer, it is in a sense no longer "your computer". That's why the OLPC project excites me: it's an attempt to empower children rather than to give them a cheap handout (not to mention, it created a market segment where none existed two years ago, but that's another story).

I realize that some people do not respond to arguments like the above in favor of free software, which is why I think it is also important for those among us with the knowhow to make free software as good as it can be. It's not just hackers who benefit from free software, but hackers will have to lead the way.

Comparison of regexp syntax

To help myself learn Emacs regular expressions, I've put together this cheat sheet comparing regular expression syntax in Emacs, Python, and egrep.

Note that in Emacs, the syntax described below is for regular expressions that are entered directly (e.g. in isearch-forward-regexp and occur). When you provide a regexp in a string (e.g. in Lisp code or in re-builder) you need to double all backslashes.

	Emacs	Python	egrep
Any character	`.`	`.`	`.`
Beginning of line	`^`	`^`	`^`
End of line	`$`	`$`	`$`
0 or more repetitions	`*`	`*`	`*`
1 or more repetitions	`+`	`+`	`+`
3-5 repetitions	`\{3,5\}`	`{3,5}`	`{3,5}`
Optional	`?`	`?`	`?`
Character set	`[...]`	`[...]`	`[...]`
Alternatives	`\\|`	`\|`	`\|`
Group	`$...$`	`(...)`	`(...)`
Named group		`(?P<name>...)`
Non-capturing group¹	`$?:...$`	`(?:...)`
Word boundary	`\b`	`\b`	`\b`
Digit	`[[:digit:]]`	`\d`	`[[:digit:]]`
Whitespace char	`\s-`	`\s`
Alphanumeric char	`\w`	`\w`	`\w`
Back reference	`\1`		`\1`
Named back reference		`(?P=name)`

¹ Also referred to as "shy groups".

Using Git: review and analysis

A version control system (VCS) performs two major functions:

It saves snapshots of your project for comparison and debugging purposes.
It publishes your project for use by others.

Early VCS like RCS performed did only (1). As sharing code over a network became more common, systems like CVS and Subversion were developed, which performed both (1) and (2).

Tragically, CVS and Subversion use the same command ('commit') to perform both operations. And that means a user who is unable to perform (2) for whatever reason (say, he's on an airplane, or has no commit privileges) loses out on all the advantages of (1) as well. And a user can't do (1) unless he is also willing to do (2).

This is where distributed version control systems (DVCS) like Git come in.

In Git, (1) and (2) are decoupled. While you're working, you can snapshot your project as often as you want. But you do this without publishing your work. If and when you do decide to publish, the complete change history is transplanted to a public repository. Others can see the individual changes you've made and understand your development process. If you don't have permission to commit to the original repository, someone who does can commit on your behalf after reviewing your work. But if your work didn't pan out, you can blow it away and no one else is the wiser.

For a project, the chief advantage of using a DVCS is that it allows many contributors to work asynchronously, so that everyone who wants to can get all the usual version control tools, without the blessing of the managers and without any centralized coordination needed. Use of a DVCS dramatically lowers the barrier for contributors.

Now, it's not like I spend most of my time working on the Linux kernel. But what I've realized is that a DVCS changes the game even for single-user projects. Git's fast branching operations encourage users to proceed down experimental avenues. Whenever you work on two things at once, branches can help you to keep them separated. Rule of thumb: anytime you implement anything even moderately complex, do it on a new branch. This has the following advantages:

You can delete the branch if you can't get it to work.
You can pause work and continue working on the original branch if something urgent or unrelated comes up.

If you use branches for features in development, and only merge them back into your master (mainline) branch when you're finished, then you know that master is never in a half-working state. If work continues on master, you can transplant ('merge') those changes into your experimental branch. Your experimental branch can have the latest updates from master, but the master branch itself is never tainted by experimental code. Branching liberally removes a lot of the uncertainly associated with changing things.

Because Git provides a superset of the features of CVS, you can use Git in a CVS-like way, if you want to. But because it's so lightweight (easy to configure; no need to set up a server), low-overhead, and fast (especially in handling branches), I've found myself using Git to manage content that I would never have bothered to configure CVS for. Say goodbye to files named thesis-backup, thesis-backup2, etc.

Nowadays, I use Git even for personal projects that don't contain source code— anywhere I want to keep content synchronized between computers. Here I'm merely using Git as a file synchronization and backup tool. Git does conflict resolution when it's necessary (I don't need it frequently, but it happens). Every clone is itself a bona fide respository from which I can make another clone. And if I clone B from A, and then in turn clone C from B, that C and A have enough information to synchronize with each other directly. In these respects a DVCS is much more robust than ordinary file synchronization software. To top it all off, every working copy knows the full history of the project, and acquiring updates is as simple as git pull. There is overhead associated with storing all past history, but modern DVCS are good at keeping it small, and hard disk space being as cheap as it is, it's a small price to pay for easy and reliable backups.

Git is billed not as a VCS per se but as a "content tracker". Depending on how you use it, it's a local VCS, a VCS for sharing, a file synchronizer, or a time-travel backup system. Not only is it convenient that Git can fill all those needs, it's reassuring to know that as my projects change and grow, it is very unlikely that they will outgrow Git.

Further reading: Git homepage, Git tutorial

Version control with Git: publishing your work

To publish your work in Git, you need to define a remote repository and ask Git to push your changes there.

To add a new remote:

git-remote add mywebsite ssh://psung.name/~/path/to/repo

This registers a remote under the name mywebsite.

When you do git push mywebsite, Git will take all branches that are present on both ends and push changes from your repo to the remote. If other people have pushed to the repository since you last read from it, you will need to merge their changes locally with git pull mywebsite before pushing.

If you just created a bare repository at the remote and are now looking to do your first push, you can name a single branch to push:

git push mywebsite mybranch

Or you can push all branches:

git push --all mywebsite

To see what state the remote is in before or after you push, it's helpful to use gitk --all (which will show markers for the positions of the local and remote branch heads) or git-show-branch --all (which shows some of that information in a terminal).

When pushing to a location that's accessible by HTTP, you need to tell Git to update some cache information on each push. (HTTP is what Git calls a dumb transport mode.) To do this, you just need to enable one of the hooks which Git has provided for this purpose:

chmod +x hooks/post-update

(That assumes you are running that from the root of a bare repo.)

Then, you are all set for others to clone that repo from an HTTP path.

Version control with Git: tracking remote branches

When you clone a repository, Git only creates a branch corresponding to the remote's master. For each other branch that exists at the remote that you wish to work on locally, you need to create a local branch to track the remote branch. You can do this with:

git-checkout --track -b mybranch origin/mybranch

Subsequently, when you work on branch mybranch, git pull will know to acquire and merge changes from the specified remote branch.

Coming soon to Emacs...

I've recently read about these new features that are making their way into Emacs CVS. (You can check out many of these by compiling from CVS or using a recent snapshot.) It may be a while before Emacs 23 is released, but never let it be said that Emacs is stagnant:

Emacs is now a viewer for PDF, PS and DVI files (screencast). The new doc-view mode is automatically activated when you visit any of those file types.
Emacs is getting D-Bus support. D-Bus provides a channel for Emacs to talk to other desktop and system applications that support it.
Tramp (remote file access) is now faster than ever because it caches all the diagnostics it performs on each host.
As I've previously mentioned, VC-mode is being reworked to better deal with changeset-oriented and distributed VCS.
With the merging of the Unicode branch, Emacs will get, among other things, gorgeous anti-aliased font rendering.
The multi-TTY branch allows a currently running Emacs to open frames on other displays and TTYs. This means emacsclient is now a fast and proper editor for terminal programs that invoke an editor.

Update: Emacs is also getting support for status area icons and notifications. Hopefully this will lead to better integration with the rest of the desktop for Emacs-based applications.

Manipulating changeset-based VCS from within Emacs

VC mode in Emacs is currently getting a facelist to better support changeset-oriented version control systems, including distributed VCS. I told myself I would attempt to start using these new features rather than continuing to use Git from the command line. Here's what I've learned so far (just enough to get started):

The new VC mode was committed to Emacs CVS after the 22.1 release; you can get it by building Emacs (from CVS or from the Git mirror) or getting a precompiled snapshot (Ubuntu, Debian).

Previously, VC operations in Emacs operated on a single file (whatever file was in the current buffer). Now, you can select multiple files to operate on, as well as get an overview of your project's status, by using VC-Dired mode:

Open a directory with C-x v d. It looks a lot like a regular Dired listing, but has space to show some VC-specific information.
By default, VC-Dired only shows changed files. Type v t to toggle between displaying only changed files and displaying all files.
In VC-Dired, v is the VC prefix key, which does what C-x v does in file buffers. For example, you can diff the file at point with v = or annotate the file at point with v g.
Hack a bit. When you're ready to commit, mark one or more files you want to commit in VC-Dired with m.
Commit the files with v v. As with single-file commits, VC prompts you for a log message (type C-c C-c to finish the commit.