Showing posts with label git. Show all posts

git grab bag

Time to flush my buffers again before the new year. Happy new year, everyone!

In no particular order here are some useful git things I've run across recently.

git stash --keep-index

git's index (or staging area) is useful, but it comes with a liability: you run into the danger of committing a state that doesn't work (compile, pass tests, etc.) because you never actually "saw", on your disk, the stuff that was going to be committed, in isolation.

That's what git stash --keep-index is for: it stashes the changes you haven't staged for commit, so you are left with only what will get committed. Do your builds/testing/verification, then git stash pop to continue working.
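Here is a minimal sketch of that workflow in a throwaway repository. The file names and contents are made up for illustration, and the "run your tests" step is only a placeholder comment.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User" && git config user.email "user@example.com"

echo "v1" > staged.txt
echo "v1" > unstaged.txt
git add staged.txt unstaged.txt && git commit -q -m "initial"

echo "v2" > staged.txt      # the change we intend to commit
echo "v2" > unstaged.txt    # work in progress, not ready yet
git add staged.txt

# Stash the changes, but keep what is staged in the index and on disk.
git stash -q --keep-index
during_staged=$(cat staged.txt)       # what will be committed
during_unstaged=$(cat unstaged.txt)   # back to the committed state

# ... build and run tests here against exactly what will be committed ...

git commit -q -m "update staged.txt"
git stash pop -q                      # bring the work in progress back
```

While the stash is in place, the working tree holds only the staged change, so a build or test run exercises exactly what the commit will contain.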

git clone --reference

If you're trying to clone a large remote repo, it's a waste of time to re-download objects that are already present in another repository on your local disk or local network. You can use the --reference option to obtain objects from this local repository when possible (any objects not found are pulled as usual from the real source):

git clone --reference LOCAL.git REMOTE.git my_new_clone

(This is sort of like cloning LOCAL.git, changing its remote to point to REMOTE.git, and fetching again, but much easier, and doesn't pollute your clone with branches and commits that are only present in LOCAL.git but not in REMOTE.git. From a content perspective it is exactly a clone of REMOTE.git.)

Note: if LOCAL.git is on the same filesystem, git sets up the alternates file so that object IDs can be resolved just using the object files in LOCAL.git. This may be undesirable because you won't actually have the objects in your new repo(!) and references in your new repo alone will not(!) be sufficient to keep the objects from being GC'd if LOCAL.git changes. To break this dependency, forcing the required objects to actually be copied into your new clone, do the following:

git repack -a; rm .git/objects/info/alternates
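To try the whole sequence without a real remote, here is a rough sketch that uses two local bare repositories as stand-ins for REMOTE.git and LOCAL.git. The seed repository and its contents are invented for the demo, and the file:// URL is only there so the clone goes through the regular transport instead of the local-clone shortcut.

```shell
set -e
work=$(mktemp -d) && cd "$work"

# Build a tiny "remote" repository and an older local copy of it.
git init -q seed && cd seed
git config user.name "Example User" && git config user.email "user@example.com"
echo "some content" > file.txt
git add file.txt && git commit -q -m "initial"
cd "$work"
git clone -q --bare seed REMOTE.git
git clone -q --bare seed LOCAL.git

# Clone from the "remote", borrowing objects from LOCAL.git where possible.
git clone -q --reference LOCAL.git "file://$work/REMOTE.git" my_new_clone
cd my_new_clone
borrowed_from=$(cat .git/objects/info/alternates)
echo "borrowing objects from: $borrowed_from"

# Copy the borrowed objects into this clone, then cut the dependency.
git repack -a -d -q
rm .git/objects/info/alternates
git fsck    # the clone should still be complete on its own
```

Newer versions of Git also accept git clone --reference LOCAL.git --dissociate, which performs the copy-and-detach step as part of the clone.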

gfind

To make git print out all the files it is tracking:

git ls-tree -r --name-only HEAD

This is a useful base for searching your codebase, since it automatically lets you search all of your actual code and avoid looking at compiled code, generated code, downloaded libraries and documentation, etc.

I have aliased this to gfind and do things like this to search within a repo:

gfind | xargs grep FOO
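One caveat with the plain pipeline is that file names containing spaces get split by xargs. Here is a small sketch of a NUL-delimited variant; gfind0 is a hypothetical name, and it is written as a shell function rather than an alias so that it also works inside scripts.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User" && git config user.email "user@example.com"

echo "FOO lives here" > "tracked file.txt"
git add "tracked file.txt" && git commit -q -m "initial"
echo "FOO again" > untracked.txt       # never added, so it should be ignored

# Like gfind, but NUL-terminated so odd file names survive the pipe.
gfind0() { git ls-tree -r --name-only -z HEAD; }

matches=$(gfind0 | xargs -0 grep -l FOO)
echo "$matches"
```

For simple searches, git grep FOO covers much the same ground, since it also searches only tracked files.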

git-filter-branch

git-filter-branch is your general purpose tool for rewriting repositories wholesale, for example, to extract a single subdirectory retaining all its history, or to excise a subdirectory, file, secret key, password, etc. that you've put in the repository.

Matthew McCullough has put together a good starting point with examples of how to use this tool. (Unfortunately, some of the links are broken; I will post alternate links once I track down copies.)

git diff --color-words

Word based diffs in git! Great for LaTeX, Markdown, etc. More info here.
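To see the idea without relying on terminal colors, here is a small sketch using --word-diff=plain, which marks the same word-level changes with [-removed-] and {+added+} instead of coloring them. The file name and sentence are made up.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User" && git config user.email "user@example.com"

echo "The quick brown fox" > note.txt
git add note.txt && git commit -q -m "initial"
echo "The quick red fox" > note.txt

# Same comparison as --color-words, but with textual markers.
out=$(git diff --word-diff=plain)
echo "$out"
```

Only the changed word is flagged; a line-based diff would report the whole sentence as removed and re-added.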

Reducing merge headaches: git meets diff3

git has an option to display merge conflicts in diff3 format (by default it only displays the two files to be merged). You can enable it like so:

$ git config --global merge.conflictstyle diff3

Now, when you have to resolve merge conflicts, git shows your side, the side being merged, and (here's what's new) the common ancestor in between them. Here's an example of the diff3-formatted output:

cauliflower
<<<<<<< HEAD
peas
potatoes
||||||| merged common ancestors
peas
=======
>>>>>>> topic
tomatoes

Having the merge ancestor readily available helps you to quickly determine what the correct merge is, since you can infer from it the changes that were made on both sides. Here you can see that the original state was peas. On your branch potatoes was added (compare the middle section to the top) and on the other branch peas was removed (compare the middle section to the bottom). Therefore the correct change is to both add potatoes and remove peas, leaving you with just potatoes in the conflicted section.

There's really no reason you shouldn't enable the diff3 style, because you frequently need the ancestor to determine what the correct merge is.

To see that this is true, even in the simple example above, look at what the conflict looks like under the standard style:

cauliflower
<<<<<<< HEAD
peas
potatoes
=======
>>>>>>> topic
tomatoes

There's an asymmetry between peas and potatoes: one was added and one was deleted, but this merge conflict doesn't tell you anything at all about which was which! You can't determine the correct merge unless you remember the sequence of changes that led up to this point. And why should you have to rack your brain to do that? That's exactly the sort of thing that your computer can, and should, help you with.
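If you'd like to reproduce the soup example yourself, here is a rough sketch in a scratch repository. The conflict style is set only for that repository rather than globally, and the label printed after the ||||||| marker may differ between Git versions.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User" && git config user.email "user@example.com"
git config merge.conflictstyle diff3      # local to this scratch repo

printf 'cauliflower\npeas\ntomatoes\n' > soup
git add soup && git commit -q -m "base"

git checkout -q -b topic
printf 'cauliflower\ntomatoes\n' > soup              # topic removes peas
git commit -q -am "remove peas"

git checkout -q -                                    # back to the original branch
printf 'cauliflower\npeas\npotatoes\ntomatoes\n' > soup   # our side adds potatoes
git commit -q -am "add potatoes"

git merge topic || true    # the merge is expected to stop with a conflict
cat soup
```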

Bonus tip: rerere (reuse recorded resolution)

If your workflow finds you redoing the same merges over and over again you might also find git's rerere (reuse recorded resolution) feature to be useful.

One of the things that is wonderful about rerere is that it provides hardly any UI surface at all. Just set it...

$ git config --global rerere.enabled 1

...and forget it. Although there is a git rerere command, you can get a lot done without using it at all.

After enabling rerere, whenever you resolve a merge conflict, git automatically squirrels away the resolution in its database. You'll see a message like this one:

$ git commit
Recorded resolution for 'soup'
[...]

And the next time you encounter the same conflict, where you would have expected git to spit out a file with conflict markers, you will instead find that it has automatically resolved the merge for you, and printed the following message:

$ git merge topic
Auto-merging soup
CONFLICT (content): Merge conflict in soup
Resolved 'soup' using previous resolution.

Just double-check to make sure nothing has gone awry, add, and commit. Save your blood, sweat, and tears for other, more interesting problems than redoing merges.
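Here is a sketch of that cycle end to end, reusing the soup conflict from the diff3 example: merge, resolve by hand once, throw the merge away, and merge again. rerere is enabled only in the scratch repository, and the exact messages Git prints may vary by version.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User" && git config user.email "user@example.com"
git config rerere.enabled true

printf 'cauliflower\npeas\ntomatoes\n' > soup
git add soup && git commit -q -m "base"
git checkout -q -b topic
printf 'cauliflower\ntomatoes\n' > soup
git commit -q -am "remove peas"
git checkout -q -
printf 'cauliflower\npeas\npotatoes\ntomatoes\n' > soup
git commit -q -am "add potatoes"

# First merge: conflict. Resolve by hand; rerere records it at commit time.
git merge topic || true
printf 'cauliflower\npotatoes\ntomatoes\n' > soup
git add soup && git commit -q -m "Merge branch 'topic'"

# Undo the merge and redo it: the recorded resolution is replayed.
git reset -q --hard HEAD^
git merge topic || true
cat soup
```

After the second merge the file already holds the resolved text, but the merge is still left for you to inspect, add, and commit.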

Further reading: Pro Git on rerere

Git for researchers

In my previous job—as a grad student, doing computational/biomedical research—I used Git to manage my projects.

For small projects, people usually treat CVS/SVN as checkpointing tools—tools to get you back to a known good state when you've screwed up. Git, however, provides a whole new vocabulary you can use to talk about creating, altering, composing, combining, splitting, undoing, and otherwise manipulating changes to code (commits). It helps you get stuff done faster every day, not just when you mess up.

Here are a couple of reflections and "lessons learned" on really using VCS to your advantage in a research environment, where some of the rules of thumb are a bit different from those in industry.

(They seem so stunningly obvious now that I've committed them to writing, but they seemed much less so when I first articulated them to myself.)

Retaining history, all of it. I have found git merge -s ours to be very handy. It produces a merge commit and merge topology, tying in the history of the other branch, but without applying any of the changes produced in that branch.

Typically, if a feature doesn't pan out, you delete the corresponding branch and destroy all evidence that you tried. But in exploratory or research contexts, the details of your failed experiments can be quite important. You might need to revisit some past state in order to perform further investigation. Or maybe you want to obtain some numbers for a paper or presentation.

Graphically: imagine you have a "successful" branch feature1 and a "failed" branch feature2 (first graph below). You don't want to git branch -D feature2, since that could cause its history to be lost. If you instead git merge -s ours feature2, you get a topology where the states from both branches appear in your git log (second graph below), but the state at the tip is the same as that at feature1.

Before:

* ddddddd (refs/heads/feature1)
* ccccccc
* bbbbbbb
| * 2222222 (refs/heads/feature2)
| * 1111111
|/
* aaaaaaa

After git merge -s ours feature2:

* eeeeeee "Merge branch 'feature2'."
|\
* | ddddddd (refs/heads/feature1)
* | ccccccc
* | bbbbbbb
| * 2222222 (refs/heads/feature2)
| * 1111111
|/
* aaaaaaa
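A minimal sketch of that topology in a scratch repository follows. The branch names match the ones above; the file name, contents, and commit messages are invented. Here the merge is made while on feature1, so feature1 itself ends up pointing at the merge commit.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User" && git config user.email "user@example.com"

echo "base" > code.txt
git add code.txt && git commit -q -m "common base"
git branch feature2

git checkout -q -b feature1
echo "approach that worked" > code.txt
git commit -q -am "feature1 work"

git checkout -q feature2
echo "approach that failed" > code.txt
git commit -q -am "feature2 work"

# Tie feature2's history in without taking any of its changes.
git checkout -q feature1
git merge -q -s ours -m "Merge branch 'feature2' (kept for the record)" feature2
git log --graph --oneline
```

The tree at the merge commit is identical to the tree of its first parent, yet feature2's commits are now reachable from the tip and will survive deleting the feature2 branch.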

This kind of setup makes tracking your progress super easy. My git log basically becomes the scaffolding for my research notebook. I have bare-bones notes like the following:

Commit 2222222: this change did not improve quality at all. Furthermore it runs much slower, probably because blah blah blah blah. See full output in /home/phil/logs/2222222.

The great thing is that now every result (whether a success or a failure) has, attached to it, a commit name: a pointer to the exact code that generated that result. If I hadn't had complete change history so easily available, I would have spent half of my time second-guessing results I'd already obtained.

This application also demonstrates the strengths of DVCS versus CVCS. Research and software development do not happen in a clean linear way. There is lots of backtracking, and sometimes you cannot expect to work effectively with a VCS whose basic model is "one damn commit after another."

Summary: 90% of everything ends in failure. Keeping your failure history (as well as your success history) around is something that is underemphasized.

Long-lived branches vs. refactoring. If you know what you're going to do in advance, then it's not called research. In my work, what I ended up writing on a day-to-day basis depended more on experimentation and testing than on planning and specs. Here's some sample code for illustrative purposes:

# (1)
def my_function(a, b):
   foo = random_sample() # Random heuristic
   something(foo)
   ...

I want to find out how the following code stacks up against (1). Does it perform better? Is it faster?

# (2)
def my_function(a, b):
   foo = shortest_path(a, b) # A better(?) heuristic
   something(foo)
   ...

In reality we might be evaluating alternative heuristics (as here), different numeric parameters, alternative algorithms, or an alternative data source (e.g., training vs. testing data).

Sometimes, when there are a number of alternatives, the right thing to do is to refactor to parameterize the code, for example,

# (3)
def my_function(a, b, heuristic = 'shortest_path'):
   if heuristic == 'random':
       foo = random_sample()
   elif heuristic == 'shortest_path':
       foo = shortest_path(a, b)
   else:
       foo = ... # Additional logic...
   something(foo)
   ...

But every parameterization increases complexity. The new argument is something you have to think about every time you or someone else tries to read your code. Your function is longer, leaks more implementation details, and provides less abstraction. So you don't want to go down this route unless it's necessary. If one choice is a clear winner, and every invocation is going to pass the same argument, then the extra generality you introduced is a liability, not an asset. To do that refactoring can be a lot of work without much reward.

So you want to run and evaluate the alternatives before refactoring. People who find themselves in this situation often write code like this:

# (4)
def my_function(a, b):
   foo = random_sample()
   ## Uncomment the next line if blah blah blah
   # foo = shortest_path(a, b)
   something(foo)
   ...

which is convenient to write, but setting all the switches by hand whenever you want to run it is rather error-prone, especially if the difference is more complicated than one line.

Branching saves the day by letting your tools manage what you were doing by hand in (4). You can compare alternatives like (1) and (2) above against each other if you keep them in parallel branches (granted, you can't select between the alternatives at runtime, but that may be OK). Maintenance is a breeze: with git merge it's easy to maintain multiple parallel source trees, differing by just that one line, for as long as you please. And because you're committing every merge commit, your results are 100% reproducible (if you were messing with your files by hand, in order to reproduce a code state you would have to not only specify a commit name, but also what lines you had commented and uncommented).

After branching, you can mull it over and obtain data on all the alternatives. When you've made your decision, you either drop one implementation and end up with (1) or (2), or, if you need the generality, then you refactor so you can choose between them at runtime (3).

Summary: lightweight branches allow you to defer the work of refactoring rather than having to pay for it up front. They greatly improve the hackability of code, by letting you try out many different alternatives reliably and without much hassle.

A command-line substitute for gitk

gitk is indispensable for viewing repo histories and understanding the relationships between different branches. However, using a GUI is a bit heavyweight if you are working remotely or only need to see the last few commits. Under these circumstances, git log --graph, introduced in git 1.5.6, is a pretty good fake.

My preferred invocation is

git log --graph --abbrev-commit --pretty=oneline --decorate

which I've aliased to gl in my shell.

The subsequent options, respectively: only show short commit names, for compactness; only display one line per commit, for compactness; and show where your branches are. Here's some sample output:

* f95f34e... (refs/remotes/origin/master, refs/heads/master) Acknowledge Alexey.
* c691bc7... Allow unmarking the marked commit.
* 7640638... Fixed visualization of marked commits
* c3830ed... Make it work better on Windows.  Thanks to Jeff Dik.
*   3433556... Merge commit 'fdr/sign-off'
|\  
| * 68344a2... Add signoff customization option
* |   3e29059... Merge commit 'cymacs/master'
|\ \  
| * | 45fb865... Fix incorrect diff hightlighting of lines beginning with "+" or "-".
| * | b7fe745... Disable undo in all magit-mode buffers.
| |/  
* | 10fe99a... Ambiguity in call to git log fixed
* | 3d34a7c... Make buffer saving behavior customizable.
* | 64b8265... Removed unused threshold machinery.
* | b430add... Make sure that point never ends up in an invisible region.
|/  
*   b30faeb... Merge commit 'voins/voins'
|\  
| * 7386af1... Use "medium" git log format when visiting commit
* | 5fb7327... Mention autogen.sh
* | f055b18... Typo.
|/  
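If you'd rather not depend on a shell alias, one alternative (an assumption on my part, not how I originally set it up) is a Git alias, which then works as git gl. The sketch below sets it only in a scratch repository; adding --global would make it available everywhere.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User" && git config user.email "user@example.com"
echo "hello" > README
git add README && git commit -q -m "initial"

# Define the alias for this repository only.
git config alias.gl "log --graph --abbrev-commit --pretty=oneline --decorate"
out=$(git gl)
echo "$out"
```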

Magit

Magit is a spectacular Emacs add-on for interacting with git. Magit was designed with git in mind (unlike VC mode, which is a more generic utility), so git commands map quite straightforwardly onto Magit commands. M-x magit-status tells you about the current state of your repo and gives you one-key access to many common git commands. However, what really sold me on Magit was its patch editor, which completely obsoletes my use of git add, git add --interactive, and git add --patch. If Magit had this patch editor and nothing else, I would still use it. That's how great this is.

M-x magit-status (which I've bound to C-c i) tells you about your working tree and the index, kind of like a combination of git diff, git diff --cached, and git status. It shows some number of sections (e.g. Staged changes, Unstaged changes, etc.); within each section you can see what files have been modified; within each file you can view the individual hunks. Within any of these containers you can press TAB to expand or collapse the heading. Moving your cursor into a file header or a diff hunk header selects the changes in that file or hunk, respectively. You can then press s to stage those changes, as shown in these before-and-after pictures:

Once you're satisfied with your staged changes, you can press c to commit, which prompts you for a log message. After you've typed a message, C-c C-c performs the actual commit.

This is already much faster than using git add --interactive or git add --patch to stage parts of a file. You just find the hunk you want rather than having git ask you yes/no about every hunk.

However, Magit also allows staging changes at an even finer granularity. If you highlight some lines in a hunk and then press s, Magit only stages the selected lines, as shown in these before-and-after pictures:

When in doubt, it's a good idea to make small commits rather than large commits. It's easy to revert (cherry-pick, explain, etc.) more than one commit, but hard to revert half a commit. Kudos to Magit for making small commits easier to create.

Finally, Magit comes with a fine manual, which you can read online.

Installing Magit

It doesn't get too much easier than this for external Emacs packages.

Check out Magit:

git clone git://gitorious.org/magit/mainline.git

Make sure that magit.el from that checkout, or a copy, is on your load path. For example:

(add-to-list 'load-path (expand-file-name "~/.emacs.d/lisp"))

Autoload Magit and bind magit-status:

(autoload 'magit-status "magit" nil t)
(global-set-key "\C-ci" 'magit-status)

Linus Torvalds on Git

I finally got around to watching the video of the tech talk that Linus gave at Google discussing the design of Git.

In this video, Linus explains a lot of the advantages of using a distributed system. But it is also enlightening because it's a window into Linus's motivations: he discusses the ways in which his own needs— as a system maintainer— drove the design of the system, in particular in the areas of workflow, speed, and data integrity.

One interesting idea is that in DVCS, the preferred development workflow (you pull from a small group of people you trust, who in turn pull from people they trust...) mirrors the way humans are wired to think about social situations. You cannot directly trust a huge group of people, but you can transitively trust many people via a web of trust, a familiar concept from security. A centralized system cannot scale because there are n^2 pairs of conflicts waiting to happen, and they will happen, because groups of people are distributed (not everyone is in the same room at the same time on the same LAN). But a DVCS workflow can scale, because it is fundamentally based on interactions between people and not on the artificial technical requirement that there has to be a single canonical place for everything.

Warning: Linus has strong opinions. I think he refers to at least three different groups of people as "ugly and stupid" in the course of his 70-minute talk.

Managing dotfiles with git, continued

Previously, I commented on my setup for keeping my dotfiles (.emacs, .bashrc, etc.) synchronized using git. Here is one refinement I've made to this process in the meantime.

Allowing local customizations

Occasionally there are changes I'd like to keep local to one machine. These may be on a permanent basis (for example, if there are certain things I'd like to happen, or not happen, on my laptop but not my desktop) or on a temporary basis (if I want to test out some change locally for some time before pushing it to my canonical repo). In version control a setup like this is best represented using branches. Here's how I've done this:

The master branch contains customizations that are supposed to be common to all machines and are appropriate to apply everywhere. Most changes are of this form. But on each machine I maintain a local branch named, for example, laptop-custom. This branch contains all the changes in master, plus usually no more than a couple of changes specific to that machine.

To initially set this up, after making a clone, I create a new branch and switch to it. Most of the time I stay on this branch.

git checkout -b laptop-custom

When I make changes, they initially go in to laptop-custom as local changes:

emacs # make some changes...
git add ...
git commit

If I decide a change is appropriate to apply everywhere, I put it on the master branch by using git-cherry-pick. I then rebase the local branch so the local patches always are at the "end". When you cherry-pick a patch to master and then rebase the other branch, git recognizes that the patch has already been applied on master and does not attempt to apply it again. So as you move changes over, the number of patches which are exclusive to the local branch decreases.

git checkout master
git cherry-pick ccddeef
git checkout laptop-custom
git rebase master

Pushing and pulling the master branch (containing the common customizations) is done in the same way as before, except that I always rebase the local branch afterwards.

git checkout master
git pull
git push
git checkout laptop-custom
git rebase master
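To check the claim that rebase skips a patch that has already been cherry-picked, here is a small sketch in a scratch repository. The file names and commit messages are made up, and the main branch name is read from Git instead of assuming master, since newer installations may use a different default.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User" && git config user.email "user@example.com"
trunk=$(git symbolic-ref --short HEAD)    # "master" in the text above

echo "common" > .bashrc
git add .bashrc && git commit -q -m "common settings"

git checkout -q -b laptop-custom
echo "laptop only" > .laptop
git add .laptop && git commit -q -m "laptop-only tweak"
echo "useful everywhere" > .emacs
git add .emacs && git commit -q -m "generally useful change"
before=$(git rev-list --count "$trunk"..laptop-custom)

# Promote the second change to the shared branch, then rebase.
pick=$(git rev-parse laptop-custom)
git checkout -q "$trunk"
git cherry-pick "$pick" > /dev/null
git checkout -q laptop-custom
git rebase -q "$trunk"
after=$(git rev-list --count "$trunk"..laptop-custom)

echo "commits exclusive to laptop-custom: $before -> $after"   # 2 -> 1
```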

For the benefit of posterity

index-pack died of signal 25

may occur when you try to pull from a repo created with git 1.5.x using git 1.4.x. Just wanted to put that out there for Google.

Content-addressable storage

In Git, every file is stored under a hash of its content. One consequence is that a particular file is only ever stored once in the repository, regardless of how many versions it appears in and under how many names. Each of those instances is a pointer to the same blob of data.

I'm using Git to manage a large collection of binary files (scanned images). I'm using version control so I can add new data, replace bad scans, and rename or reorder files reversibly. I've noticed two great things about this:

First, since I frequently add new files and rename files, but only occasionally delete or modify them, the repository (containing the entire revision history) is only slightly larger than the actual current state of the repository. This is true for many kinds of binary data (e.g. photos, music), so it makes using Git very attractive: why would you keep one backup copy when, for about the same amount of space, you could have a complete version history?

Second, pushing changes is very fast. Even after I rename hundreds of files, Git doesn't need to push huge amounts of binary data (just a few kilobytes) because every file is already in the repository, except possibly under a different name. This is far more economical than rsync, which would attempt to re-transfer every file that had been changed.
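A quick way to convince yourself of this is to look at the blob ID before and after a rename. In the sketch below the "scan" is just a line of text standing in for binary data, and the file names are hypothetical.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User" && git config user.email "user@example.com"

printf 'pretend this is image data\n' > scan-001.bin
git add scan-001.bin && git commit -q -m "add scan"
before=$(git rev-parse HEAD:scan-001.bin)

git mv scan-001.bin page-01.bin
git commit -q -m "rename scan"
after=$(git rev-parse HEAD:page-01.bin)

# Same content, same blob: the rename added no new file data.
echo "$before"
echo "$after"
```

Both commits point at the same blob, so the rename costs only a new tree and a new commit object.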

Using git-svn to interact with SVN repositories

Imagine you are a Git user and you have to work with an SVN repository. Tragically, you are now incapable of working without cheap and easy branching, disconnected operation, and all the other things that make Git great.

git-svn is a gateway that lets you interact with SVN repositories while using Git for all your local commits. It has the advantage of needing no extra configuration on the SVN server end.

To create a repository:

$ mkdir myproject; cd myproject
$ git svn clone svn+ssh://me@remotehost/path/to/svn/repo

This may take a while, but you now have a bona fide Git archive with the full project history (which is not something that an SVN checkout contains).

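Note: this is for a repository with no trunk and branches directories. For a repository that uses the conventional trunk/branches/tags layout, git svn clone accepts the --stdlayout (-s) option.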

Your typical workflow will now look like this.

Get remote changes:

$ git svn rebase

Do all your Git business locally: edit, diff, commit, branch, and merge.

To push your changes back to the SVN repo:

$ git svn dcommit

Individual commits in Git are pushed in order as separate commits to the SVN repo.

Further reading: git-svn documentation

Perhaps the only person who makes more extravagant claims about Git than I do

"Git is the next Unix":

Git was originally not a version control system; it was designed to be the infrastructure so that someone else could build one on top. And they did; nowadays there are more than 100 git-* commands installed along with git. It's scary and confusing and weird, but what that means is git is a platform. It's a new set of nouns and verbs that we never had before. Having new nouns and verbs means we can invent entirely new things that we previously couldn't do.

With git, we've invented a new world where revision history, checksums, and branches don't make your filesystem slower: they make it faster. They don't make your data bigger: they make it smaller. They don't risk your data integrity; they guarantee integrity. They don't centralize your data in a big database; they distribute it peer to peer.

Much like Unix itself, git's actual software doesn't matter; it's the file format, the concepts, that change everything.

The author describes a whole bunch of projects he worked on or heard about where Git would basically have satisfied all the needs of the project while being faster, smaller, and more secure.

Essentially, Git could serve as the base for many more version control system-type tools than just the ones that are called Git today. Myself, I'm looking forward to version control that's suitable for entire disks (perhaps with automatic history pruning) and better ways to be able to easily deal with large projects that contain smaller modules. But there are undoubtedly better ways of doing things than our untrained minds are capable of even imagining right now.

Using Git: review and analysis

A version control system (VCS) performs two major functions:

  1. It saves snapshots of your project for comparison and debugging purposes.
  2. It publishes your project for use by others.

Early VCS like RCS did only (1). As sharing code over a network became more common, systems like CVS and Subversion were developed, which performed both (1) and (2).

Tragically, CVS and Subversion use the same command ('commit') to perform both operations. And that means a user who is unable to perform (2) for whatever reason (say, he's on an airplane, or has no commit privileges) loses out on all the advantages of (1) as well. And a user can't do (1) unless he is also willing to do (2).

This is where distributed version control systems (DVCS) like Git come in.

In Git, (1) and (2) are decoupled. While you're working, you can snapshot your project as often as you want. But you do this without publishing your work. If and when you do decide to publish, the complete change history is transplanted to a public repository. Others can see the individual changes you've made and understand your development process. If you don't have permission to commit to the original repository, someone who does can commit on your behalf after reviewing your work. But if your work didn't pan out, you can blow it away and no one else is the wiser.

For a project, the chief advantage of using a DVCS is that it allows many contributors to work asynchronously, so that everyone who wants to can get all the usual version control tools, without the blessing of the managers and without any centralized coordination needed. Use of a DVCS dramatically lowers the barrier for contributors.

Now, it's not like I spend most of my time working on the Linux kernel. But what I've realized is that a DVCS changes the game even for single-user projects. Git's fast branching operations encourage users to proceed down experimental avenues. Whenever you work on two things at once, branches can help you to keep them separated. Rule of thumb: anytime you implement anything even moderately complex, do it on a new branch. This has the following advantages:

  • You can delete the branch if you can't get it to work.
  • You can pause work and continue working on the original branch if something urgent or unrelated comes up.

If you use branches for features in development, and only merge them back into your master (mainline) branch when you're finished, then you know that master is never in a half-working state. If work continues on master, you can transplant ('merge') those changes into your experimental branch. Your experimental branch can have the latest updates from master, but the master branch itself is never tainted by experimental code. Branching liberally removes a lot of the uncertainty associated with changing things.

Because Git provides a superset of the features of CVS, you can use Git in a CVS-like way, if you want to. But because it's so lightweight (easy to configure; no need to set up a server), low-overhead, and fast (especially in handling branches), I've found myself using Git to manage content that I would never have bothered to configure CVS for. Say goodbye to files named thesis-backup, thesis-backup2, etc.

Nowadays, I use Git even for personal projects that don't contain source code: anywhere I want to keep content synchronized between computers. Here I'm merely using Git as a file synchronization and backup tool. Git does conflict resolution when it's necessary (I don't need it frequently, but it happens). Every clone is itself a bona fide repository from which I can make another clone. And if I clone B from A, and then in turn clone C from B, then C and A have enough information to synchronize with each other directly. In these respects a DVCS is much more robust than ordinary file synchronization software. To top it all off, every working copy knows the full history of the project, and acquiring updates is as simple as git pull. There is overhead associated with storing all past history, but modern DVCS are good at keeping it small, and hard disk space being as cheap as it is, it's a small price to pay for easy and reliable backups.

Git is billed not as a VCS per se but as a "content tracker". Depending on how you use it, it's a local VCS, a VCS for sharing, a file synchronizer, or a time-travel backup system. Not only is it convenient that Git can fill all those needs, it's reassuring to know that as my projects change and grow, it is very unlikely that they will outgrow Git.

Further reading: Git homepage, Git tutorial

Version control with Git: publishing your work

To publish your work in Git, you need to define a remote repository and ask Git to push your changes there.

To add a new remote:

git remote add mywebsite ssh://psung.name/~/path/to/repo

This registers a remote under the name mywebsite.

When you do git push mywebsite, Git will take all branches that are present on both ends and push changes from your repo to the remote. (This "matching" behavior was the default before Git 2.0; newer versions default to push.default=simple and push only the current branch.) If other people have pushed to the repository since you last read from it, you will need to merge their changes locally with git pull mywebsite before pushing.

If you just created a bare repository at the remote and are now looking to do your first push, you can name a single branch to push:

git push mywebsite mybranch

Or you can push all branches:

git push --all mywebsite
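Here is a sketch of a first push that you can run entirely on one machine. A local bare repository stands in for the one on the server, so the remote is a plain path rather than an ssh:// URL; the names mywebsite and mybranch are the ones used above, and everything else is invented.

```shell
set -e
work=$(mktemp -d) && cd "$work"
git init -q --bare remote.git            # stand-in for the repo on the server

git init -q local && cd local
git config user.name "Example User" && git config user.email "user@example.com"
echo "hello" > README
git add README && git commit -q -m "initial"
git checkout -q -b mybranch

git remote add mywebsite "$work/remote.git"
git push -q mywebsite mybranch           # first push: name the branch explicitly

# Show which branch heads the remote now has.
git ls-remote --heads mywebsite
```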

To see what state the remote is in before or after you push, it's helpful to use gitk --all (which will show markers for the positions of the local and remote branch heads) or git-show-branch --all (which shows some of that information in a terminal).

When pushing to a location that's accessible by HTTP, you need to tell Git to update some cache information on each push. (HTTP is what Git calls a dumb transport mode.) To do this, you just need to enable one of the hooks which Git has provided for this purpose:

chmod +x hooks/post-update

(That assumes you are running that from the root of a bare repo.)

Then, you are all set for others to clone that repo from an HTTP path.
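For the curious, the stock post-update hook does little more than run git update-server-info. The sketch below writes an equivalent hook by hand into a scratch bare repository and checks that a push produces info/refs, the file a dumb HTTP client reads first. In newer Git versions the shipped file is named post-update.sample and has to be renamed to post-update before the chmod has any effect.

```shell
set -e
work=$(mktemp -d) && cd "$work"
git init -q --bare remote.git

# Hand-written equivalent of the stock hook.
mkdir -p remote.git/hooks
printf '#!/bin/sh\nexec git update-server-info\n' > remote.git/hooks/post-update
chmod +x remote.git/hooks/post-update

git init -q local && cd local
git config user.name "Example User" && git config user.email "user@example.com"
echo "hello" > README
git add README && git commit -q -m "initial"
git push -q "$work/remote.git" HEAD:refs/heads/master

# The hook ran on the receiving side and refreshed the cache files.
cat "$work/remote.git/info/refs"
```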

Version control with Git: tracking remote branches

When you clone a repository, Git only creates a branch corresponding to the remote's master. For each other branch that exists at the remote that you wish to work on locally, you need to create a local branch to track the remote branch. You can do this with:

git checkout --track -b mybranch origin/mybranch

Subsequently, when you work on branch mybranch, git pull will know to acquire and merge changes from the specified remote branch.
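What the command actually records is a pair of config entries saying which remote and which remote branch to merge from. Here is a minimal sketch with a local repository standing in for the remote; the directory names upstream and clone are made up.

```shell
set -e
work=$(mktemp -d) && cd "$work"

# A stand-in "remote" with an extra branch.
git init -q upstream && cd upstream
git config user.name "Example User" && git config user.email "user@example.com"
echo "hi" > file.txt
git add file.txt && git commit -q -m "initial"
git branch mybranch
cd "$work"

git clone -q upstream clone && cd clone
git checkout -q --track -b mybranch origin/mybranch

# The tracking setup lives in the repository's config.
tracked_remote=$(git config branch.mybranch.remote)
tracked_ref=$(git config branch.mybranch.merge)
echo "$tracked_remote $tracked_ref"      # origin refs/heads/mybranch
```

In newer Git versions a plain git checkout mybranch usually sets up the same tracking automatically when exactly one remote has a branch of that name.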

Manipulating changeset-based VCS from within Emacs

VC mode in Emacs is currently getting a facelift to better support changeset-oriented version control systems, including distributed VCS. I told myself I would attempt to start using these new features rather than continuing to use Git from the command line. Here's what I've learned so far (just enough to get started):

The new VC mode was committed to Emacs CVS after the 22.1 release; you can get it by building Emacs (from CVS or from the Git mirror) or getting a precompiled snapshot (Ubuntu, Debian).

Previously, VC operations in Emacs operated on a single file (whatever file was in the current buffer). Now, you can select multiple files to operate on, as well as get an overview of your project's status, by using VC-Dired mode:

  • Open a directory with C-x v d. It looks a lot like a regular Dired listing, but has space to show some VC-specific information.
  • By default, VC-Dired only shows changed files. Type v t to toggle between displaying only changed files and displaying all files.
  • In VC-Dired, v is the VC prefix key, which does what C-x v does in file buffers. For example, you can diff the file at point with v = or annotate the file at point with v g.
  • Hack a bit. When you're ready to commit, mark one or more files you want to commit in VC-Dired with m.
  • Commit the files with v v. As with single-file commits, VC prompts you for a log message (type C-c C-c to finish the commit).

Version control with Git: git-rebase --interactive

Git 1.5.3 introduces git-rebase --interactive, which lets you alter the commit history in various ways, including splitting, squashing (combining), inserting, and removing patches. In each case git-rebase rewrites the subsequent commit history so no one else is the wiser.

Start by doing:

git rebase -i f00bab

where f00bab is the commit before the first commit you want to change.

Git opens an editor describing the commits since that commit, in chronological order, in the following format:

pick 1e4dfd7 Foo bar commit.
pick 6b78037 Baz quuz commit.
:

You can edit this list to tell Git to do certain things:

  • Remove a line to delete the corresponding commit.
  • Move lines around to reorder commits.
  • Change pick to squash on a line to combine that commit with the previous commit.
  • Change pick to edit to modify or split that commit (see below).
  • Git will attempt to reapply all other commits (lines which are still labeled pick).

When you squash a commit, Git prompts you for a new message for the combined commit.
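If you want to try a squash without typing into an editor, the todo list can be edited by a command instead. This is only a sketch: it assumes a Git recent enough to honor the GIT_SEQUENCE_EDITOR environment variable and a sed that supports -i, and the repository, file, and commit messages are made up.

```shell
set -e
demo=$(mktemp -d)
git init -q "$demo/repo"
cd "$demo/repo"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"

for n in 1 2 3; do
    echo "line $n" >> notes.txt
    git add notes.txt
    git commit -q -m "Commit $n"
done
before=$(git rev-parse HEAD)

# The todo list for HEAD~2 holds "Commit 2" and "Commit 3", oldest first.
# Change the second line from pick to squash, and let GIT_EDITOR=true
# accept the default combined log message.
GIT_SEQUENCE_EDITOR="sed -i -e '2s/^pick/squash/'" GIT_EDITOR=true \
    git rebase -q -i HEAD~2

git log --oneline                         # two commits remain
after=$(git rev-parse HEAD)
[ "$before" != "$after" ] && echo "the rewritten commit has a new identifier"
```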

When you choose edit, Git applies that commit but pauses the rebasing process so you can edit your tree. There are a couple of ways to proceed:

  • To edit the commit message only, just do git commit --amend.
  • To edit the commit itself, make some changes, git add them, and do git commit --amend.
  • To split the commit into multiple commits, do git reset HEAD^ to rewind the branch without touching the working copy. Stage and commit (git add ...; git commit) your first change, and repeat until the branch catches up to your working copy. git add --interactive can be useful if you want to pick and choose parts of files to stage. Notice that if you only use git add and git add --interactive, your working tree never changes. If you really want to work with the intermediate states (for example, to run unit tests), use git stash to put away your working copy temporarily.
  • You can also make additional commits here, which will appear right after the commit being edited.

In any case, when you're finished at that point in history, use git rebase --continue to move on.
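The splitting steps can be rehearsed on the most recent commit, where no rebase is needed. That is a simplification of my own for the demo. During a real edit stop you would do the same thing and then run git rebase --continue. File names and messages are placeholders.

```shell
set -e
demo=$(mktemp -d)
git init -q "$demo/repo"
cd "$demo/repo"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"
echo base > README
git add README
git commit -q -m "Base"

# One commit that really contains two unrelated changes.
echo feature > feature.txt
echo bugfix > bugfix.txt
git add feature.txt bugfix.txt
git commit -q -m "Feature and bugfix together"

git reset -q HEAD^                        # branch rewinds; both files stay on disk
git add feature.txt
git commit -q -m "Add feature"
git add bugfix.txt
git commit -q -m "Fix bug"

git log --oneline                         # Fix bug, Add feature, Base (newest first)
```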

Whenever you change a commit, Git applies the patches from subsequent commits (at least to the best of its ability; if a change you make causes a subsequent patch to not apply cleanly, then Git will stop to ask you to resolve the conflict). However, these new commits will have different identifiers.

Further reading: git-rebase documentation

Version control with Git: git-add --interactive

Git 1.5.0 introduces git-add --interactive, which lets you stage changes at a finer granularity than the file level. This is useful if you've made a number of changes in a file before you realize that they logically should go in as separate commits. It works by allowing you to pick and choose hunks of the diff to stage.

To start:

$ git add --interactive
           staged     unstaged path
  1:    unchanged       +26/-6 path/to/file1
  2:    unchanged        +1/-0 some/other/file

*** Commands ***
  1: status   2: update   3: revert   4: add untracked
  5: patch    6: diff     7: quit     8: help
What now>

Git shows you all the files which have changes from HEAD. To pick and choose diffs, select 5 (patch) at the menu. Git prompts you to select one of the listed files by number.

Git then shows you hunks of the diff between HEAD and your working copy. For each hunk, you can choose whether or not you want to stage the hunk. For large hunks, Git will also offer to split the hunk into smaller hunks which you can stage independently. (There are other options, such as options to stage or unstage all remaining hunks.)

Once you're done, you can select 7 to quit, do git diff --cached to verify the staged changes, and do git commit as usual to commit.

Update: you can also use git add --patch FILENAME, which skips the menu and jumps directly to the hunk selection part. I usually use this instead of git add --interactive now and I've aliased it to gap in my shell. If you use Emacs, magit is a great extension that also supports staging and unstaging individual hunks.

Further reading: git-add documentation

Version control with Git: branches

In git, branches are a lightweight way to manage multiple lines of development. You might use a separate branch for work on an experimental feature, or to maintain parallel lines of code with just a few differences. In either case, it's often essential to copy changes ("merge") from one branch to another to keep them synchronized, and git handles this very well.

To create a new branch named mynewfeature and switch to it:

git branch mynewfeature
git checkout mynewfeature

Now, commits you make will appear on the mynewfeature branch, but will not affect the default branch (which is named master). Running gitk --all conveniently shows you commits and branches, and lets you see graphically when your branches diverged or were synchronized.

To resume working on the master branch:

git checkout master

Git will put away all the file versions associated with your branch and restore the file versions associated with the master.

Typically, after you branch, new changes will be committed both on the branch and on the master (main line). Merging changes on the master into your branch regularly will ensure that your branch contains the latest updates from the main line (but does not apply to the master any changes you've made on the branch). Assuming you have checked out a branch, you can merge changes from the master with:

git merge master

Git applies all the changes made on the master since your branch diverged from it (or since the last time you merged). It then (usually) creates a new commit representing the merged state. If there is a merge conflict, Git will not make the commit: you will be asked to fix up the merge and then git commit -a the result.

Once you are done with your new feature, it's time to merge it back into the master! Check out the master and then merge your branch like so:

git merge mynewfeature

You can then delete the branch:

git branch -d mynewfeature
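Putting the whole branch life cycle together, here is a sketch you can run in a scratch directory. The file names, commit messages, and identity are placeholders; the branch names are the ones used above.

```shell
set -e
demo=$(mktemp -d)
git init -q "$demo/repo"
cd "$demo/repo"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"
echo base > main.txt
git add main.txt
git commit -q -m "Base"

git branch mynewfeature
git checkout -q mynewfeature
echo feature > feature.txt
git add feature.txt
git commit -q -m "Start feature"

git checkout -q master
ls                                        # feature.txt is put away while master is checked out
echo more >> main.txt
git commit -q -a -m "Main-line work"

git checkout -q mynewfeature
git merge -q -m "Merge master into mynewfeature" master   # bring the branch up to date

git checkout -q master
git merge -q mynewfeature                 # master simply fast-forwards to the branch
git branch -d mynewfeature
```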

Version control with Git: remote repositories

Previously, I wrote about Git usage for single-user single-location projects. However, where Git really shines is in managing a project when changes are made on multiple machines (whether by one person or by multiple people).

Unlike centralized version control systems and file synchronization software, Git and other distributed version control systems actually have good support for disconnected operation:

  • You can perform your commits locally and without talking to a central server. You can push all your changes to another location whenever you get a chance. (CVS and SVN don't support making commits in isolation, so people who work offline end up submitting huge patches.)
  • When you (or you and other people) perform independent changes in parallel on different machines, Git knows how to gracefully merge those changes the next time you synchronize.

However, I've found Git to be a lot easier to get started with than other VCS's with the same features. All you need is a git init to start managing a simple project in Git: like RCS (but unlike CVS and SVN), there's no need to create a separate repository and make a checkout of it. And yet, when your project matures, Git will happily (and with just a couple of additional commands) move that project to multiple machines or share it over the internet so you can take advantage of its distributed features.

One of my projects is used solely to manage my dotfiles. (I'm constantly tweaking my .emacs, .bashrc, etc.) I use Git to keep my dotfiles synchronized on all the computers I use. I'll discuss the basics of distributed operation for a generic project before talking about some of the wrinkles associated with using Git to manage your dotfiles.

Suppose my Git repository is in ~/testproj on the machine bigphil. To start working on that project on another machine (the equivalent of "svn checkout"), do:

git clone ssh://bigphil/~/testproj

You can also clone a repository from elsewhere on a local disk, or over HTTP:

git clone ~/testproj
git clone http://example.com/git/testproj

You now have the complete version history of the project, and Git can work completely independently of the original repository, if you'd like. You can add files, make your own commits, etc. on your new repository locally.

However, typically you will want to continue incorporating commits that are made to the source repository. You can do this with:

git pull

(When you cloned the repository, Git remembered the original location; git pull retrieves updates from that same location.) This is the equivalent of "svn update".
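A quick way to watch clone and pull work together is to use a second directory on the same disk as the "other machine". Everything below lives in a scratch directory, and the file name and identity are made up.

```shell
set -e
demo=$(mktemp -d)
git init -q "$demo/testproj"
cd "$demo/testproj"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"
echo v1 > file.txt
git add file.txt
git commit -q -m "Version 1"

git clone -q "$demo/testproj" "$demo/copy"
cd "$demo/copy"
git config remote.origin.url              # the location git pull will go back to

# Meanwhile, work continues in the original repository.
cd "$demo/testproj"
echo v2 > file.txt
git commit -q -a -m "Version 2"

cd "$demo/copy"
git pull -q
cat file.txt                              # prints: v2
```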

Merging

When you pull from a remote repository and there have been changes on both your local copy and the remote copy since you last synchronized, Git needs to merge those changes. Usually, it will do this automatically and make a new commit which incorporates the changes from both the local and the remote repositories.

After a merge has occurred, if you look at the project history with gitk --all, you'll see a place where two history lines diverged (representing development in the local and remote repositories) and then were merged back together. (However, the output of git log flattens these nonlinearities.)

If the local and remote changes made modifications to the same pieces of code, Git may have trouble performing a merge. In this case, it will do its best, but it will leave conflict markers in the code and not commit the final result. You should fix up the conflicts and then commit the merge with git commit -a.
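Here is a sketch that provokes a conflict on purpose and then resolves it. The file and its contents are invented. I pass --no-rebase to git pull because newer Git versions may refuse to choose between merging and rebasing when the histories have diverged; on older versions the flag should be harmless.

```shell
set -e
demo=$(mktemp -d)
git init -q "$demo/origin"
cd "$demo/origin"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"
echo "color = red" > settings.txt
git add settings.txt
git commit -q -m "Initial settings"

git clone -q "$demo/origin" "$demo/clone"

# Both sides now change the same line.
echo "color = green" > settings.txt
git commit -q -a -m "Switch to green"

cd "$demo/clone"
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "color = blue" > settings.txt
git commit -q -a -m "Switch to blue"

git pull -q --no-rebase || true           # exits non-zero because of the conflict
grep '^<<<<<<<' settings.txt              # the conflict markers are in the file

echo "color = blue" > settings.txt        # fix it up by hand (here: keep the local value)
git commit -q -a -m "Merge origin, keeping blue"
```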

Pushing, and bare repositories

To take your local modifications and push them back to the original repository from which you made your clone, do this:

git push

To push to a different repository:

git push path/to/other/repo

However, if you push your local modifications to a regular repository, a person who is using that repository to do work may get confused because the state of the repository is changing right under him. So typically it's better to push to a bare repository, which is a repository without a working copy (essentially, what's in the .git subdirectory of a regular repository).

To initialize a bare repository:

git --bare init

Then use git push PATH to push to it. Bare repositories look different at the file level, but cloning from and pushing to them is otherwise the same.
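To make the difference concrete, this sketch creates a bare repository, pushes to it by path, and clones it back out. The names are placeholders. I use the git init --bare spelling, which I believe is equivalent to the one above, and the symbolic-ref line only pins the bare repo's default branch to master so the final clone checks it out.

```shell
set -e
demo=$(mktemp -d)
git init -q --bare "$demo/shared.git"
git --git-dir="$demo/shared.git" symbolic-ref HEAD refs/heads/master
ls "$demo/shared.git"                     # HEAD, config, objects, refs, ... and no working files
git --git-dir="$demo/shared.git" rev-parse --is-bare-repository   # prints: true

git init -q "$demo/work"
cd "$demo/work"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"
echo hello > hello.txt
git add hello.txt
git commit -q -m "Say hello"
git push -q "$demo/shared.git" master     # push by path

git clone -q "$demo/shared.git" "$demo/second"
ls "$demo/second"                         # hello.txt
```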

Managing dotfiles with Git

To manage my dotfiles, I've made my home directory the root of a Git repo. I only add the files I'm interested in managing (.emacs, .bashrc, etc.), and Git ignores the rest of them. I push changes from that repo into a bare repository on one of my machines, and pull from that repository to get the latest versions.

The only complication is when I wish to bring my dotfiles to a new computer. Git does not allow you to clone a repo into an existing non-empty directory (as I would wish to do to clone my dotfiles into my home directory). However, things will work if you clone to a new directory, and then move the contents of that directory (the .git subdirectory, and all the files of interest) back to your home directory:

git clone ssh://bigphil/~/projects/dotfiles
mv dotfiles/.[a-zA-Z]* ~

(Note that dotfiles/* doesn't work because * doesn't usually select dotfiles.)
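You can check what each pattern matches before pointing it at your real home directory. In this sketch a scratch directory named home stands in for ~, and the dotfiles directory is faked with a few empty entries rather than a real clone.

```shell
set -e
demo=$(mktemp -d)
mkdir -p "$demo/dotfiles/.git" "$demo/home"   # "$demo/home" stands in for ~
touch "$demo/dotfiles/.emacs" "$demo/dotfiles/.bashrc"

cd "$demo"
echo dotfiles/*                           # nothing matches, so a POSIX shell prints the pattern itself
echo dotfiles/.[a-zA-Z]*                  # the three dot entries

mv dotfiles/.[a-zA-Z]* home/
ls -A home                                # .bashrc, .emacs and .git have moved
```

One caveat: .[a-zA-Z]* skips names whose second character is not a letter (something like ._cache), so adjust the pattern if you keep any of those.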

Version control with Git

I've recently started using Git for version control of all my personal projects. It works so smoothly that I don't have any reservations about using version control. That means I commit small changes very often; as a result, I'm never afraid of leaving my projects in a wedged state, even if I'm making big changes.

This post only discusses basic Git usage. I'll write about its distributed features in future posts.

To start managing a directory with Git:

  1. Do the following to initialize the directory:

    git init

  2. If you have existing files you want to start managing, do

    git add .

    to add all the files in the directory, or

    git add FILE ...

    to add only particular files.

  3. Then do

    git commit

    Git prompts you for a log message and records your first commit. It will print a message of the following form:

    Created commit 1e169f6: Add new file.
     1 files changed, 15 insertions(+), 0 deletions(-)

    Congratulations!

Like SVN, Git stores per-tree versions rather than per-file versions. However, instead of assigning version numbers like SVN does, Git assigns a unique hexadecimal identifier to each commit. Although these identifiers are long (40 characters), when you wish to refer to a particular commit, you only need to type as many characters as are needed to make a unique prefix.
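A short sketch of prefix lookup, in a scratch repository with made-up contents. Seven characters is an arbitrary choice that is comfortably unique in a repo this small.

```shell
set -e
demo=$(mktemp -d)
git init -q "$demo/repo"
cd "$demo/repo"
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"
echo hello > file.txt
git add file.txt
git commit -q -m "Add new file."

full=$(git rev-parse HEAD)                # the complete identifier
short=$(printf '%s' "$full" | cut -c1-7)  # just its first seven characters
git rev-parse "$short"                    # expands back to the full identifier
git log -1 --oneline "$short"             # any command that takes a commit accepts the prefix
```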

My typical workflow looks like this:

  1. Do some editing:

    emacs FILE ...

  2. Add files to Git's staging area:

    git add FILE ...

    Do this whether the files are new files you've added, or existing files you've modified.

  3. Commit the files in the staging area:

    git commit

    Again, Git prompts you for a log message and records a commit.

As a shortcut, git commit -a is equivalent to running git add with any modified files before running git commit. (It does not, however, pick up newly created files.)
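The caveat about new files is easy to trip over, so here is a sketch that shows it. The file names and identity are placeholders.

```shell
set -e
demo=$(mktemp -d)
git init -q "$demo/repo"
cd "$demo/repo"
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"
echo start > tracked.txt
git add tracked.txt
git commit -q -m "Initial commit"

echo change >> tracked.txt                # modify a file Git already knows about
echo new > untracked.txt                  # create a file Git has never seen

git commit -q -a -m "Commit everything modified"
git status --porcelain                    # prints: ?? untracked.txt
```

The modified file went into the commit; the new one is still waiting for an explicit git add.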

The following commands are used to explore the project's history and current state:

  • git log shows recent commits.
  • git status shows which files are in the staging area, which files have been modified, and which newly created files are not managed by git.
  • git diff shows changes in your working copy.
  • git diff --cached shows changes in the staging area, that is, what will be committed when you do git commit.
  • git diff cd12..ab34 diffs the revisions cd12 and ab34.
  • git reset --hard restores your working copy state to that of the last commit.
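The difference between git diff and git diff --cached is clearest when one file has both staged and unstaged edits. This sketch sets that up in a scratch repository with invented contents, then shows git reset --hard throwing both edits away.

```shell
set -e
demo=$(mktemp -d)
git init -q "$demo/repo"
cd "$demo/repo"
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"
echo one > file.txt
git add file.txt
git commit -q -m "Initial commit"

echo two >> file.txt
git add file.txt                          # "two" is now in the staging area
echo three >> file.txt                    # "three" exists only in the working copy

git diff --cached                         # shows the added line "two"
git diff                                  # shows the added line "three"

git reset -q --hard                       # discards staged and unstaged changes alike
cat file.txt                              # prints: one
```

Note that git reset --hard does not ask for confirmation, so uncommitted work is gone once it runs.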

The CVS version of GNU Emacs supports Git in VC, so it knows when your file is being managed by Git. When it is, you can use the following VC commands:

  • C-x v v commits the current file.
  • C-x v = shows a diff between your working copy and the last committed version (or between any two committed versions).
  • C-x v g displays an annotated version of the file showing, for each line, when that line was last modified, and a heat-map displaying older and newer code in different colors.

To install git on Ubuntu: type sudo apt-get install git-core.

Further reading: Everyday GIT with 20 commands or so, A tour of git.