Amit's Thoughts: Text to HTML

Saturday, August 28, 2004

I’ve been looking through my old notes and have realized that, like many other people, I follow certain conventions when writing text files. I’d like to publish some of these notes to a web site. The simplest thing to do would be to upload the notes as text. The second simplest would be to upload them as HTML, with a <pre> around the entire note. However, it would be cooler if I could take advantage of the conventions I’ve used to produce something richer than plain text.

There are lots of systems that allow you to write text and produce HTML. I wrote one a few years ago to turn this text into this set of HTML pages. ReStructuredText turns text into HTML; see the sample page. Wikis turn text into HTML as well. Of these, the goals of structured text were the closest to my own: intuitive, simple, readable. However, none of these actually worked with my own text formatting conventions. My A* pages use { ... } for formatting. Wikis use lots of ' single quotes '. For example, see this set of rules, including “Use five single-quotes, or triples within doubles, for some other kind of emphasis.” ReStructuredText conventions are the closest to what I’ve used in the past.

My goal was to have a system that could read my existing notes from the past 15 years. I didn’t want to go back and edit all those notes to fit some formatting convention. In particular, these are some differences between my conventions and ReStructuredText:

- I often use indentation for preformatted text. ReStructuredText uses indentation for indented paragraphs (using <blockquote>, ew!) and :: for preformatted text.
- I use *bold*, /italics/; ReStructuredText uses **bold**, *italics*. I don’t think this would really matter though.
- I don’t have a good convention for sections. In the last 5 years I’ve been using [header] on a line by itself, but I don’t have any way to have subsections. I have been writing short notes, not documents, so I never really needed subsections. ReStructuredText uses text with a line underneath or on top to mark section headings. It seems like a nice convention, but my existing notes don’t use it.
- I don’t have a way to include images, because I am just writing simple notes. ReStructuredText has this:

  ```
  .. image:: images/biohazard.png
     :height: 100
     :width: 200
     :scale: 50
     :alt: alternate text
  ```

  At this point, I started wondering why I would want to write ReStructuredText instead of HTML: <img src="images/biohazard.png" height=100 width=200 alt="alternate text">.
- I have no way to include links with anchor text. I do include URLs and email addresses, often but not always surrounded by <...>. ReStructuredText uses ` and _ characters, with several nice ways to avoid writing URLs inline in the document body.
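
To make the conventions listed above concrete, here is a minimal Python sketch of how a single line using them might be turned into HTML. It is only an illustration, not the converter described in this post: the function name `convert_inline`, the choice of `<h2>` for a `[header]` line, and the exact regular expressions are all assumptions, and email addresses are not handled.

```python
import html
import re

# All patterns below are illustrative guesses at the conventions, not a spec.
HEADER = re.compile(r"\[(.+)\]")                       # [header] alone on a line
URL_IN_BRACKETS = re.compile(r"<(https?://[^\s<>]+)>")  # <http://...>
ITALIC = re.compile(r"(?<![\w/])/(\w[^/]*?)/(?![\w/])")  # /italics/
BOLD = re.compile(r"(?<!\w)\*(\w[^*]*?)\*(?!\w)")        # *bold*


def convert_inline(line):
    """Convert one line of note text to an HTML fragment."""
    header = HEADER.fullmatch(line.strip())
    if header:
        # The heading level is an arbitrary choice for this sketch.
        return "<h2>%s</h2>" % html.escape(header.group(1))

    # re.split with one capturing group alternates: text, url, text, url, ...
    pieces = URL_IN_BRACKETS.split(line)
    out = []
    for i, piece in enumerate(pieces):
        if i % 2 == 1:
            url = html.escape(piece)
            out.append('<a href="%s">%s</a>' % (url, url))
        else:
            text = html.escape(piece, quote=False)
            # Italics first, so the "/" in the </b> closing tag is never seen.
            text = ITALIC.sub(r"<i>\1</i>", text)
            text = BOLD.sub(r"<b>\1</b>", text)
            out.append(text)
    return "".join(out)


print(convert_inline("This is *important* and /subtle/."))
```

The lookbehind and lookahead checks are there so that file paths such as /usr/bin and arithmetic such as 2 * 3 * 4 are left alone. Old notes will almost certainly contain cases these patterns get wrong.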

Looking over my notes, I’ve realized that this task is harder than it initially seemed. For example, my notes mix plain text with code/output. I’d like plain text to be rendered in proportional font and code/output to be rendered in monospace font. I don’t have a good way to solve this for my existing notes. For future notes, I could use ReStructuredText’s convention of ``text``.
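
For new notes, the double-backquote convention is easy to recognize mechanically. A small sketch, assuming a hypothetical helper named `mark_inline_code` and assuming that everything between a pair of double backquotes should become a `<code>` element:

```python
import html
import re

DOUBLE_BACKQUOTES = re.compile(r"``(.+?)``")


def mark_inline_code(text):
    """Wrap ``...`` spans in <code> and escape everything for HTML."""
    # Splitting on a pattern with one group alternates: prose, code, prose, ...
    pieces = DOUBLE_BACKQUOTES.split(text)
    out = []
    for i, piece in enumerate(pieces):
        escaped = html.escape(piece, quote=False)
        out.append("<code>%s</code>" % escaped if i % 2 else escaped)
    return "".join(out)


print(mark_inline_code("Run ``ls -l <dir>`` to list files"))
```

This does nothing for existing notes, where code and prose are mixed without any marker.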

I have two problems to solve: how do I read my existing notes, and what do I do for new notes? I’m not trying to write documents (I am comfortable writing HTML); I’m looking for something simpler for quick notes. The ReStructuredText format for images made me realize that it’s really a new markup language. Simple things like paragraphs and lists are close to what I’d write naturally, but most of the formatting conventions are not. Then it hit me: if I need a markup language anyway, why not use HTML?

The problem with HTML (especially for quick note taking) is that it’s cumbersome for the simple things: text, paragraphs, lists, code snippets. For less common things (images, tables, anchor text), it isn’t as much of a win to invent a new language. So my plan is to parse text with the most common text formatting conventions (paragraphs, lists, etc.), but allow any HTML tag to “escape” into HTML mode. That way if I want to do anything complicated, I can drop into HTML, but if I want to stay simple, I can write plain text.
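
The plan above can be sketched in a few dozen lines of Python. This is a rough illustration under several assumptions, not a finished parser: blocks are separated by blank lines, a fully indented block is preformatted, a block of lines starting with "- " or "* " is a list, and a block that begins with something shaped like an HTML tag is passed through untouched. The names `text_to_html`, `convert_block`, and `escape_text` are made up for this example.

```python
import html
import re
import textwrap

# Something shaped like <tag>, </tag>, or <tag attr=...>; <http://...> does not match.
HTML_TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9]*(?:\s[^<>]*)?/?>")
LIST_ITEM = re.compile(r"[-*] +")


def escape_text(text):
    """Escape plain text, but let recognizable HTML tags through unchanged."""
    out, pos = [], 0
    for match in HTML_TAG.finditer(text):
        out.append(html.escape(text[pos:match.start()], quote=False))
        out.append(match.group(0))
        pos = match.end()
    out.append(html.escape(text[pos:], quote=False))
    return "".join(out)


def convert_block(block):
    lines = block.split("\n")
    # Indented block: preformatted, with everything escaped.
    if all(line.startswith((" ", "\t")) for line in lines):
        return "<pre>%s</pre>" % html.escape(textwrap.dedent(block), quote=False)
    # Block starting with a tag: "HTML mode", passed through as written.
    if HTML_TAG.match(block):
        return block
    # Every line is a list item.
    if all(LIST_ITEM.match(line) for line in lines):
        items = ["<li>%s</li>" % escape_text(LIST_ITEM.sub("", line, count=1))
                 for line in lines]
        return "<ul>\n%s\n</ul>" % "\n".join(items)
    # Anything else is an ordinary paragraph.
    return "<p>%s</p>" % escape_text(" ".join(line.strip() for line in lines))


def text_to_html(text):
    blocks = re.split(r"\n\s*\n", text.strip("\n"))
    return "\n".join(convert_block(b) for b in blocks if b.strip())


print(text_to_html("A note with <b>inline</b> HTML.\n\n- one\n- two"))
```

One known weakness of this sketch: plain text that happens to look like a tag, such as <y> in a math note, is treated as HTML, which is exactly the kind of ambiguity old notes are likely to contain.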

Why doesn’t ReStructuredText do the same thing? StructuredTextNG includes a rule to let people use SGML, but it didn’t seem to get much attention. I suspect it’s because they do not want to assume that HTML is the output format (it might be PDF or XML) and the target audience is less comfortable with HTML. TWiki also has both structured text and HTML. The original Wiki does not, and there is a discussion about pros and cons.

It will take some time to play with this system before I know whether I like it. I might end up finding that I do want to avoid HTML altogether, or that I want to write entire documents in text. I am likely to find that some of my old notes are ambiguous and cannot be parsed easily. It may be easier to go through them and fix them up instead of writing a more complicated parser.

Update [2011]: It seems that Markdown has won, not only for me but for lots of other people. It also allows embedding HTML for when Markdown isn’t enough, and it is being used by an increasing number of tools and apps.

– Amit – Saturday, August 28, 2004

1 comment:

Amit wrote at Sunday, September 11, 2005 at 2:07:00 PM PDT: Update: It looks like Markdown might fit my needs. It allows embedding HTML inside the text and most of the conventions are similar to the ones I wanted.