Version-Control Systems

Code evolves. As a project moves from first-cut prototype to deliverable, it goes through multiple cycles in which you explore new ground, debug, and then stabilize what you've accomplished. And this evolution doesn't stop when you first deliver for production. Most projects will need to be maintained and enhanced past the 1.0 stage, and will be released multiple times. Tracking all that detail is just the sort of thing computers are good at and humans are not.

Code evolution raises several practical problems that can be major sources of friction and drudgery — thus a serious drain on productivity. Every moment spent on these problems is a moment not spent on getting the design and function of your project right.

Perhaps the most important problem is reversion. If you make a change, and discover it's not viable, how can you revert to a code version that is known good? If reversion is difficult or unreliable, it's hard to risk making changes at all (you could trash the whole project, or make many hours of painful work for yourself).

Almost as important is change tracking. You know your code has changed; do you know why? It's easy to forget the reasons for changes and step on them later. If you have collaborators on a project, how do you know what they have changed while you weren't looking, and who was responsible for each change?


Amazingly often, it is useful to ask what you have changed since the last known-good version, even if you have no collaborators. This often uncovers unwanted changes, such as forgotten debugging code. I now do this routinely before checking in a set of changes.

-- Henry Spencer  

Another issue is bug tracking. It's quite common to get new bug reports for a particular version after the code has mutated away from it considerably. Sometimes you can recognize immediately that the bug has already been stomped, but often you can't. Suppose it doesn't reproduce under the new version. How do you get back the state of the code for the old version in order to reproduce and understand it?

To address these problems, you need procedures for keeping a history of your project, and annotating it with comments that explain the history. If your project has more than one developer, you also need mechanisms for making sure developers don't overwrite each others' versions.

The most primitive (but still very common) method is all hand-hacking. You snapshot the project periodically by manually copying everything in it to a backup. You include history comments in source files. You make verbal or email arrangements with other developers to keep their hands off certain files while you hack them.

The hidden costs of this hand-hacking method are high, especially when (as frequently happens) it breaks down. The procedures take time and concentration; they're prone to error, and tend to get slipped under pressure or when the project is in trouble — that is, exactly when they are most needed.

As with most hand-hacking, this method does not scale well. It restricts the granularity of change tracking, and tends to lose metadata details such as the order of changes, who did them, and why. Reverting just a part of a large change can be tedious and time consuming, and often developers are forced to back up farther than they'd like after trying something that doesn't work.

To avoid these problems, you can use a version-control system (VCS), a suite of programs that automates away most of the drudgery involved in keeping an annotated history of your project and avoiding modification conflicts.

Most VCSs share the same basic logic. To use one, you start by registering a collection of source files — that is, telling your VCS to start archive files describing their change histories. Thereafter, when you want to edit one of these files, you have to check out the file — assert an exclusive lock on it. When you're done, you check in the file, adding your changes to the archive, releasing the lock, and entering a change comment explaining what you did.

The history of the project is not necessarily linear. All VCSs in common use actually allow you to maintain a tree of variant versions (for ports to different machines, say) with tools for merging branches back into the main “trunk” version. This feature becomes important as the size and dispersion of the development group increases. It needs to be used with care, however; multiple active variants of the code base can be very confusing (just associated bug reports to the right version are not necessarily easy), and automated merging of branches does not guaranteed that the combined code works.

Most of the rest of what a VCS does is convenience: labeling, and reporting features surrounding these basic operations, and tools which allow you to view differences between versions, or to group a given set of versions of files as a named release that can be examined or reverted to at any time without losing later changes.

VCSs have their problems. The biggest one is that using a VCS involves extra steps every time you want to edit a file, steps that developers in a hurry tend to want to skip if they have to be done by hand. Near the end of this chapter we'll discuss a way to solve this problem.

Another problem is that some kinds of natural operations tend to confuse VCSs. Renaming files is a notorious trouble spot; it's not easy to automatically ensure that a file's version history will be carried along with it when it is renamed. Renaming problems are particularly difficult to resolve when the VCS supports branching.

Despite these difficulties, VCSs are a huge boon to productivity and code quality in many ways, even for small single-developer projects. They automate away many procedures that are just tedious work. They help a lot in recovering from mistakes. Perhaps most importantly, they free programmers to experiment by guaranteeing that reversion to a known-good state will always be easy.

(VCSs, by the way, are not merely good for program code; the manuscript of this book was maintained as a collection of files under RCS while it was being written.)

Historically, three VCSs have been of major significance in the Unix world, and we'll survey them here. For an extended introduction and tutorial, consult Applying RCS and SCCS [Bolinger-Bronson].

CVS (Concurrent Version System) began life as a front end to RCS developed in the early 1990s, but the model of version control it uses was different enough that it immediately qualified as a new design. Modern implementations don't rely on RCS.

Unlike RCS and SCCS, CVS doesn't exclusively lock files when they're checked out. Instead, it tries to reconcile nonconflicting changes mechanically when they're checked back in, and requests human help on conflicts. The design works because patch conflicts are much less common than one might intuitively think.

The interface of CVS is significantly more complex than that of RCS, and it needs a lot more disk space. These properties make it a poor choice for small projects. On the other hand, CVS is well suited to large multideveloper efforts distributed across several development sites connected by the Internet. CVS tools on a client machine can easily be told to direct their operations to a repository located on a different host.

The open-source community makes heavy use of CVS for projects such as GNOME and Mozilla. Typically, such CVS repositories allow anyone to check out sources remotely. Anyone can, therefore, make a local copy of a project, modify it, and mail change patches to the project maintainers. Actual write access to the repository is more limited and has to be explicitly granted by the project maintainers. A developer who has such access can perform a commit option from his modified local copy, which will cause the local changes to get made directly to the remote repository.

You can see an example of a well-run CVS repository, accessible over the Internet, at the GNOME CVS site. This site illustrates the use of CVS-aware browsing tools such as Bonsai, which are useful in helping a large and decentralized group of developers coordinate their work.

The social machinery and philosophy accompanying the use of CVS is as important as the details of the tools. The assumption is that projects will be open and decentralized, with code subject to peer review and inspection even by developers who are not officially members of the project group.

Just as importantly, CVS's nonlocking philosophy means that projects can't be blocked by a lock if a programmer disappears in the middle of making some changes. CVS thus allows developers to avoid the “single person point of failure” problem; in turn, this means that project boundaries can be fluid, casual contributions are relatively easy, and projects are not required to have an elaborate hierarchy of control.

The CVS sources are maintained and distributed by the FSF.

CVS has significant problems. Some are merely implementation bugs, but one basic problem is that your project's file namespace is not versioned in the same way changes to files themselves are. Thus, CVS is easily confused by file renamings, deletions, and additions. Also, CVS records changes on a per-file basis, rather than as sets of changes made to files. This makes it harder to back out to specific versions, and harder to handle partial check-ins. Fortunately, none of these problems are intrinsic to the nonlocking style, and they have been successfully addressed by newer version-control systems.

CVS's design problems are sufficient to have created demand for a better open-source VCS. Several such efforts are under way as of 2003. The most notable of these are Aegis and Subversion.

Aegis has the longest history of any of these alternatives, has hosted its own development since 1991, and is a mature production system. It features a heavy emphasis on regression-testing and validation.

Subversion is positioned as “CVS done right”, with the known design problems fully addressed, and in 2003 probably has the best near-term prospect of replacing CVS.

The BitKeeper project explores some interesting design ideas related to change-sets and multiple distributed code repositories. Linus Torvalds uses Bitkeeper for the Linux kernel sources. Its non-open-source license is, however, controversial, and has significantly retarded the acceptance of the product.