Sunday, September 24, 2006

As long as I've been writing web pages, I've experimented with ways to manage them. Long ago I used the C preprocessor to give me server-side includes and simple macros. It was a mess, since the C preprocessor parses single quotes and double slashes (for C character constants and C++ comments), and both of those occur in other contexts in web pages. Later on I built something that could automatically build navigation trees for each page. I also experimented with, but never fully adopted, a system that would let me write HTML in a simpler syntax. I built tools that would take multiple text files and assemble them into a larger page. I also tried using third-party tools, like LaTeX2HTML. I eventually abandoned all of these systems. They were too complex and introduced dependencies between input files, and I then had to manage those dependencies.

The last time I had the web page management itch, I wrote down what I wanted out of any new system I set up:

  • Portable. I want to depend on as few external tools as possible, so that I can run this on a variety of systems, including Windows, Linux, Mac, and the restricted environment where my web pages are hosted.
  • Straightforward. I had abandoned several of my previous systems because they imposed restrictions on what I could put into my document. I'm comfortable with HTML, and I'd like to just write HTML as much as possible. The more abstractions I put in between my system and the final output, the more restrictive and complex it will be.
  • Fast. Several of my previous systems had to analyze all of my documents in order to rebuild any of them. In particular, navigation trees require analyzing other nodes in order to create the links. If I only edit one document, I want my system to regenerate only one HTML file. Therefore this rule requires that I do not put in navigation trees.
  • Simple. I'm lazy. I don't want to write a complicated system to manage my pages. I just want to get the low-hanging fruit and not worry about solving all the problems.
  • Static. I have to produce static HTML; I don't have a web host where I can run web apps or CGI scripts.
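
As a rough illustration of the "Fast" requirement, a per-page staleness check might look like this in Python. This is only a sketch, not the system described in this post: the function name `needs_rebuild` and the idea of one source file plus one shared template per page are assumptions made for the example.

```python
import os

def needs_rebuild(source, output, template):
    """Return True if `output` is missing or older than `source` or `template`.

    Hypothetical helper: each page depends only on its own source file and
    the shared template, so editing one page regenerates one HTML file.
    """
    if not os.path.exists(output):
        return True
    out_time = os.path.getmtime(output)
    # Rebuild if any dependency was modified after the output was written.
    return any(os.path.getmtime(dep) > out_time for dep in (source, template))
```

Because no page depends on any other page, the check never has to read the rest of the site. That is also why navigation trees are ruled out: they would make every page depend on every other page.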

I ended up learning and using XSLT. I do not particularly like XSLT, but it's a reasonable tool to start with. When HTML and XML documents are viewed as trees, XSLT is used to transform (rearrange, erase, and add) tree nodes.
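
To make the tree view concrete, here is a small sketch of the same three operations (erase, rearrange, add) written in Python with the standard library's `xml.etree.ElementTree` rather than in XSLT. The element names and the `transform` function are made up for illustration; an XSLT stylesheet would express these steps declaratively as templates instead.

```python
import xml.etree.ElementTree as ET

def transform(root):
    """Erase <script> nodes, move the <h1> to the front, and add a footer."""
    body = root.find("body")
    # Erase: drop every <script> child of <body>.
    for script in body.findall("script"):
        body.remove(script)
    # Rearrange: move the first <h1> to the start of <body>.
    h1 = body.find("h1")
    if h1 is not None:
        body.remove(h1)
        body.insert(0, h1)
    # Add: append a new <footer> node.
    footer = ET.SubElement(body, "footer")
    footer.text = "Generated page"
    return root

doc = ET.fromstring(
    "<html><body><p>Hello</p><script>x()</script><h1>Title</h1></body></html>"
)
result = ET.tostring(transform(doc), encoding="unicode")
print(result)
```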

The first step was to extract the content out of my existing web pages. Each page has a mix of template and content. The extractor separated them. I had written these pages in different styles over a period of ten years, so the headers, navigation, HTML style, etc. are not consistent. Some of the pages were generated by other tools I had used. While looking at the old HTML, I decided that I would not be able to treat them uniformly. I added to my requirements:

  • Optional. Some pages will use the new system and some pages will not. I will not migrate everything at once.

I wrote XSLT to extract content out of some of my web pages. I grouped the pages by their implementation and style. I handled the extraction in four ways:

  1. If the page already fit my extractor, I left it alone.
  2. If the page would fit my extractor after minor changes to the page, I made those changes to the page.
  3. If the page would fit after minor changes to the extractor, I made those changes to the extractor.
  4. If the page would require major changes to either it or the extractor, I excluded it.

I thus had a set of pages (some modified), a set of pages to exclude, and an extractor. The extracted content was in the form of XML files (mostly HTML, with some XML meta information). I tested them by inspection, but it was hard to tell whether I got things right until the second step: injection. In the second step, I combined the content of each page and a new template I had written, producing HTML. I then compared the new pages with the old pages, side by side, in several browsers, until I was reasonably happy with the results. I rapidly iterated, each time fixing the extractor, the injector, or the old HTML (so that it'd extract better). Any pages I couldn't fix at this step I wrote down for later fixing.

To summarize, during this stage of development I had old pages, an extractor, a set of content pages (the results of the extractor), an injector, and new pages (the results of the injector). Some pages I had excluded from the entire process, and others I had marked as needing repair. I also had a set of things I'd like to do that XSLT couldn't handle.
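
The extract-then-inject pipeline can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the XSLT used for this site: it assumes the old page is well-formed XML, that its content lives in a `<div id="content">`, and that the content file uses made-up `<page>`, `<title>`, and `<content>` elements. The names `extract` and `inject` are hypothetical too.

```python
import xml.etree.ElementTree as ET

def extract(old_page):
    """Separate content from the old template, returning a <page> element."""
    root = ET.fromstring(old_page)
    content = root.find(".//div[@id='content']")
    if content is None:
        # Corresponds to excluding pages that do not fit the extractor.
        raise ValueError("page does not fit the extractor")
    page = ET.Element("page")
    # Meta information: keep the page title alongside the content.
    ET.SubElement(page, "title").text = root.findtext("head/title", default="")
    body = ET.SubElement(page, "content")
    body.text = content.text
    body.extend(list(content))
    return page

def inject(page, template):
    """Combine a content <page> with a new template, producing HTML text."""
    root = ET.fromstring(template)
    root.find("head/title").text = page.findtext("title")
    slot = root.find(".//div[@id='content']")
    source = page.find("content")
    slot.text = source.text
    slot.extend(list(source))
    return ET.tostring(root, encoding="unicode")

OLD = ('<html><head><title>Maps</title></head><body><table><tr><td>'
       '<div id="content"><p>Hello</p></div></td></tr></table></body></html>')
TEMPLATE = ('<html><head><title/></head><body><div id="nav">Home</div>'
            '<div id="content"/></body></html>')

new_page = inject(extract(OLD), TEMPLATE)
print(new_page)
```

Once the switch is made, only `inject` keeps running, and the `<page>` files become the documents that get edited by hand.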

After testing the output extensively, I was finally ready to make the switch. Tonight I replaced the old pages with the new ones. I no longer run the extractor. This means the content pages are no longer being overwritten, so I can now edit those pages instead of the old web pages. I went through the list of pages that needed repairs and fixed them. I tested every page on the new site, fixed up a few minor leftover problems, and pushed the pages to the live site.

I'm much happier with the new system. It's not a series of hacks and it's not custom code. It's using XML and a very simple shell script. It runs on Windows, Mac, and Linux. There is still more to do though. Not all of my pages use the new system or my new style sheet. Some parts of my site, including the blog, will continue to use an external content management system, so I will apply my new stylesheet and template without using the XML injector. There are a few more minor features that I want to implement, and there are more pages to clean up. I'm not in a rush to do any of this; it'll probably take a month or two. I've also been reluctant to edit my pages until now, because any changes I made had to be duplicated on the development pages (which I had modified to make the extractor work). Now that the new pages are up, I can resume working on the content of my pages.

I do recommend that people look into XSLT, but I think it's not sufficient for most needs. It does handle a large set of simple cases though; I'll fill in the gaps with some Python or Ruby scripts. If you find things on my site that don't work properly, let me know; I'm sure there are bugs remaining.



Ryan wrote at Tuesday, October 31, 2006 at 1:05:00 PM PST

ah, migrating. it's a fact of life, and a pain in the ass. you clearly picked interoperability winners in XML and XSLT, which will be nice for both routine maintenance and [shudder] the next migration.

i've written just enough XSLT to know that i like it in spite of its godawful syntax. the unapologetic functional programming style is refreshing, and importing libraries over HTTP is just plain cool.

as you noted, though, HTML/XML/etc. are all pretty painful for authoring. i had to do a similar migration recently, from SnipSnap to PyBlosxom, and i decided to switch to Markdown.

so far, i'm loving it. it's a breeze for simple blog posts, and even lengthy articles with complex layouts seem to Just Work. i spend more time thinking about the content, instead of fighting the markup and publishing framework.

just like you, porting all of my old content took some work, even with html2text's help. it was worth it, though.

Amit wrote at Wednesday, November 1, 2006 at 10:30:00 AM PST

Oh, yes, the NEXT migration. I had completely forgotten about that! :-)

Amit wrote at Tuesday, August 19, 2014 at 10:29:00 PM PDT

It's been a long time since I posted this, and I'm still using the XSLT system. Dealing with XSLT's rules has been a mild annoyance because I keep forgetting them, but overall, it's been rather flexible and I've been able to adapt it as my needs have changed. Most recently, I've started inlining my CSS into the HTML output instead of using an external CSS file, and XSLT was able to handle that too, including properly HTML-escaping the CSS.