Amit's Thoughts: Text to HTML: syntax

Saturday, August 28, 2004

In any discussion about a new markup language (like ReStructuredText), there will be some debate about syntax. Here’s what I’ve been reading:

Setext archives.
A plan for structured text.
Problems with structured text.
ReStructuredText reference.
Four years of doc-sig debates.
TWiki formatting rules (much closer to structured text than to Wiki formatting).
Wikipedia formatting rules, which include using inline HTML tags (which is different from what I had been thinking of doing—treating everything between tag pairs as raw HTML).
Alternative syntax for links, with the main variants being inline links, such as [http://www.ibm.com/ IBM Corporation] and out of line links, such as `IBM Corporation`_ and later .. _IBM Corporation: http://www.ibm.com/. This page also has some comments about character substitutions, such as -- turning into — and `` ... '' turning into “ ... ”. I’d like to have such features, but I don’t yet know how easy this will be.
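To get a feel for how hard the link and character-substitution features might be, here is a minimal Python sketch using regular expressions. The function names are made up for illustration, and the rules are simplified guesses rather than what ReStructuredText or any wiki engine actually does. For example, the `--` rule would also mangle a URL that happens to contain two hyphens.

```python
import re

# Inline link: [http://www.ibm.com/ IBM Corporation]
INLINE_LINK = re.compile(r"\[(\w+://\S+)\s+([^\]]+)\]")
# Out-of-line target: .. _IBM Corporation: http://www.ibm.com/
TARGET = re.compile(r"^\.\. _([^:\n]+):[ \t]*(\S+)[ \t]*$", re.MULTILINE)
# Out-of-line reference: `IBM Corporation`_
REFERENCE = re.compile(r"`([^`]+)`_")


def convert_inline_links(text):
    """Turn [url text] into an HTML anchor."""
    return INLINE_LINK.sub(r'<a href="\1">\2</a>', text)


def convert_named_links(text):
    """Resolve `name`_ references against '.. _name: url' target lines."""
    targets = dict(TARGET.findall(text))
    text = TARGET.sub("", text)  # target lines do not appear in the output

    def replace(match):
        name = match.group(1)
        url = targets.get(name)
        # Leave the reference untouched if no target was defined for it.
        return f'<a href="{url}">{name}</a>' if url else match.group(0)

    return REFERENCE.sub(replace, text)


def smart_punctuation(text):
    """Apply the two substitutions mentioned above (simplified)."""
    text = re.sub(r"``(.*?)''", "\u201c\\1\u201d", text)  # curly quotes
    return text.replace("--", "\u2014")  # long dash
```

The inline form needs only one pass, while the out-of-line form needs the whole note in hand before any reference can be resolved, which is one practical difference between the two variants.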

There are lots of good ideas out there, and there are a lot of issues I hadn’t thought about. My own notes tend to be formatted so that I can easily edit them with XEmacs and filladapt-mode. Where I don’t have particular conventions already established (like named links), I would like to use what’s already out there.

For allowing HTML in my notes, I was originally thinking that I’d require XHTML, but this note about XHTML made me realize that all I wanted was a way to find closing tags, and not all of XHTML. I plan to require all tags to be closed (<X> ... </X>) or self-closing (<X />), but I do not plan to require other aspects of XHTML.
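The "closed or self-closing" rule can be checked without a full XHTML validator. The following is a rough sketch built on Python's standard `html.parser` module. The class and function names are hypothetical, and it assumes strict nesting, which is stricter than simply being able to find each closing tag.

```python
from html.parser import HTMLParser


class TagBalanceChecker(HTMLParser):
    """Track open tags on a stack and record any mismatched closing tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # <X /> opens and closes itself, so the stack is unchanged.
        pass

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.errors.append(f"unexpected </{tag}>")


def tags_are_closed(text):
    """Return True if every tag is closed (<X>...</X>) or self-closing (<X />)."""
    checker = TagBalanceChecker()
    checker.feed(text)
    checker.close()
    return not checker.open_tags and not checker.errors
```

With this check, `<p>Hello <br> world</p>` is rejected because `<br>` is never closed, while `<p>Hello <br /> world</p>` is accepted.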

I envision three levels of parsing:

At the top level, a notes file is composed of several notes separated by some marker. The marker may contain some information. For my notes files, the marker is [bracketed text] on a line by itself, and the text inside brackets is the name of the note.
At the block level, there are four types of blocks:
1. Lists begin with a bullet (*, 1., etc.).
2. HTML blocks begin with an HTML block-level tag (<p>, <div>, etc.).
3. Preformatted text begins with spaces for indentation. All subsequent lines indented at least that number of spaces will be part of the preformatted text block.
4. Any other text (including HTML inline tags) begins a paragraph.
At the inline level, there are three types of spans:
1. HTML spans begin with an HTML inline-level tag (<img>, <a>, <abbr>, etc.) and end at the corresponding closing tag.
2. Markup spans begin with some markup character (*, `, \) and typically end with the same markup character. However, the rules are tricky because markup characters sometimes occur in contexts where they are not being used for markup. The \ character is used for escaping in contexts where the following character would otherwise be treated as a markup character. (This seems complicated.)
3. Text spans begin with any text.
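The three levels could be sketched in Python roughly as follows. This is only an illustration under several assumptions: the function names are invented, the set of block-level tags is a guess, HTML spans are left out of the inline pass, and a markup character inside a span cannot be escaped. It does show how a backslash, or a markup character with no partner, falls back to plain text.

```python
import re

NOTE_MARKER = re.compile(r"^\[(.+)\]\s*$")  # [note name] on a line by itself
BULLET = re.compile(r"^(\*|\d+\.)\s+")      # "* item" or "1. item"
HTML_BLOCK = re.compile(r"^<(p|div|ul|ol|table|pre|blockquote|h[1-6])\b", re.I)
MARKUP_CHARS = "*`"


def split_notes(text):
    """Top level: map each note name to the list of lines in its body."""
    notes, current = {}, None
    for line in text.splitlines():
        marker = NOTE_MARKER.match(line)
        if marker:
            current = notes.setdefault(marker.group(1), [])
        elif current is not None:
            current.append(line)
    return notes


def classify_block(first_line):
    """Block level: decide the block type from the block's first line."""
    if BULLET.match(first_line):
        return "list"
    if HTML_BLOCK.match(first_line):
        return "html"
    if first_line.startswith(" "):
        return "preformatted"
    return "paragraph"


def split_spans(text):
    """Inline level: return (kind, content) pairs for text and markup spans."""
    spans, buf, i = [], [], 0

    def flush():
        if buf:
            spans.append(("text", "".join(buf)))
            buf.clear()

    while i < len(text):
        ch = text[i]
        end = text.find(ch, i + 1) if ch in MARKUP_CHARS else -1
        if ch == "\\" and i + 1 < len(text):
            buf.append(text[i + 1])  # escaped character is literal text
            i += 2
        elif end != -1:
            flush()
            spans.append((ch, text[i + 1:end]))  # span delimited by ch
            i = end + 1
        else:
            buf.append(ch)  # ordinary text, or markup char with no partner
            i += 1
    flush()
    return spans
```

For example, `split_spans(r"a *b* \*c")` yields a text span `"a "`, a `*` span containing `"b"`, and a text span `" *c"`, while `"2 * 3"` stays a single text span because the `*` has no partner.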

Indentation plays a key role in parsing preformatted and list blocks. Implementing this may be trickier than I initially expected, especially because my notes were not written with these rules in mind.
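As a small illustration of the indentation rule for preformatted text, here is a hedged Python sketch. The helper names are hypothetical, and it assumes that a blank line ends the block and that indentation uses spaces only, neither of which is settled above.

```python
def indent_of(line):
    """Number of leading spaces on a line."""
    return len(line) - len(line.lstrip(" "))


def take_preformatted(lines, start):
    """Collect the preformatted block that begins at lines[start].

    Returns (block_lines, next_index). The block continues while lines are
    non-blank and indented at least as far as the first line.
    """
    base = indent_of(lines[start])
    end = start
    while end < len(lines) and lines[end].strip() and indent_of(lines[end]) >= base:
        end += 1
    # Strip the common indentation but keep any extra, deeper indentation.
    return [line[base:] for line in lines[start:end]], end
```

Notes that were written without this rule in mind, such as a paragraph that just happens to be indented, would be picked up as preformatted text by this sketch, which is exactly the kind of trouble anticipated above.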

ReStructuredText and TWiki have had a lot of good work put into them, and it’s silly to reinvent that work unless I’m trying to solve a different problem. I have to remind myself that my main goal is to parse my existing notes, and my secondary goal is to be able to write new notes easily. If the syntax of old and new documents is going to be different, then perhaps my approach should be to write a parser for my old notes that converts everything to my new note format, or perhaps I should manually convert everything.

– Amit – Saturday, August 28, 2004

2 comments:

Anonymous wrote at Wednesday, January 7, 2009 at 10:26:00 AM PST: Did you or anyone else write a translator? I am looking for TWiki to StructuredText or TWiki to reStructuredText.

Thanks,
Larry Dickson
ldickson@cuttedge.com

Amit wrote at Saturday, January 10, 2009 at 2:23:00 PM PST: Larry, I've recently started using Markdown.py, which is a Python version of Markdown. There's also Markdown2.py, but I haven't tried it.