In any discussion about a new markup language (like reStructuredText), there will be some debate about syntax. Here’s what I’ve been reading:

There are lots of good ideas out there, and there are a lot of issues I hadn’t thought about. My own notes tend to be formatted so that I can easily edit them with XEmacs and filladapt-mode. Where I don’t have particular conventions already established (like named links), I would like to use what’s already out there.

For allowing HTML in my notes, I was originally thinking that I’d require XHTML, but this note about XHTML made me realize that all I wanted was a way to find closing tags, and not all of XHTML. I plan to require all tags to be closed (<X> ... </X>) or self-closing (<X />), but I do not plan to require other aspects of XHTML.
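
A rough sketch of that "find the closing tags" check, in Python, might look like the following. The function name `unclosed_tags` and the regular expression are illustrative assumptions, not part of any existing tool, and the pattern deliberately ignores edge cases such as a `>` inside an attribute value or HTML comments.

```python
import re

# Matches <tag ...>, </tag>, and <tag ... />.
# Assumption: attribute values never contain "<" or ">".
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*?)(/?)>")

def unclosed_tags(text):
    """Return a list of problems; an empty list means every tag is closed."""
    stack = []      # names of tags that are open and still waiting for a close
    problems = []
    for match in TAG_RE.finditer(text):
        closing, name, _attrs, self_closing = match.groups()
        name = name.lower()
        if self_closing:
            continue                      # <X /> needs no partner
        if not closing:
            stack.append(name)            # <X> opens a tag
        elif stack and stack[-1] == name:
            stack.pop()                   # </X> closes the innermost open tag
        else:
            problems.append("unexpected </%s>" % name)
    # Anything left on the stack was never closed; report innermost first.
    problems.extend("unclosed <%s>" % name for name in reversed(stack))
    return problems
```

With this sketch, `unclosed_tags("<p>hi <br /> there</p>")` returns an empty list, while `unclosed_tags("<p>hi <img src='a.png'>")` reports both `<img>` and `<p>` as unclosed. Nothing else about XHTML (lowercase names, quoted attributes, namespaces) is enforced.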

I envision three levels of parsing:

  • At the top level, a notes file is composed of several notes separated by some marker. The marker may contain some information. For my notes files, the marker is [bracketed text] on a line by itself, and the text inside brackets is the name of the note.
  • At the block level, there are four types of blocks:
    1. Lists begin with a bullet (*, 1., etc.).
    2. HTML blocks begin with an HTML block-level tag (<p>, <div>, etc.).
    3. Preformatted text begins with spaces for indentation. All subsequent lines indented at least that number of spaces will be part of the preformatted text block.
    4. Any other text (including HTML inline tags) begins a paragraph.
  • At the inline level, there are three types of spans:
    1. HTML spans begin with an HTML inline-level tag (<img>, <a>, <abbr>, etc.) and end at the corresponding closing tag.
    2. Markup spans begin with some markup character (*, `, \) and typically end with the same markup character. However, the rules are tricky because markup characters sometimes occur in contexts where they are not being used for markup. The \ character is used for escaping in contexts where the following character would otherwise be treated as a markup character. (This seems complicated.)
    3. Text spans begin with any text.
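
The top two levels above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `split_notes` and `classify_block` are made up here, the set of block-level tags is an arbitrary short list, list bullets are only recognized at the start of a line (so an indented bullet would count as preformatted text), and any lines before the first marker are dropped.

```python
import re

# Top level: [note name] on a line by itself.
MARKER_RE = re.compile(r"^\[([^\[\]]+)\]\s*$")
# Block level: "*" or "1." followed by whitespace starts a list.
LIST_RE = re.compile(r"^(\*|\d+\.)\s+")
# Block level: an assumed, incomplete set of HTML block-level tags.
HTML_BLOCK_RE = re.compile(r"^<(p|div|pre|ul|ol|table|blockquote)\b", re.I)

def split_notes(text):
    """Return a list of (name, body_lines) pairs, one per marker line."""
    notes = []
    for line in text.splitlines():
        match = MARKER_RE.match(line)
        if match:
            notes.append((match.group(1), []))   # start a new note
        elif notes:
            notes[-1][1].append(line)            # body of the current note
        # Lines before the first marker are ignored in this sketch.
    return notes

def classify_block(first_line):
    """Decide the block type from the first line of the block."""
    if LIST_RE.match(first_line):
        return "list"
    if HTML_BLOCK_RE.match(first_line):
        return "html"
    if first_line.startswith(" "):
        return "preformatted"
    return "paragraph"                           # includes inline HTML tags
```

Because the list pattern requires whitespace after the bullet, a line such as `*bold* text` falls through to "paragraph" rather than being mistaken for a list item. The inline level, where escaping makes the rules harder, is left out here.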

Indentation plays a key role in parsing preformatted and list blocks. Implementing this may be trickier than I initially expected, especially because my notes were not written with these rules in mind.
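
To make the indentation rule for preformatted text concrete, here is a minimal Python sketch. The function name `take_preformatted` is hypothetical, and the sketch assumes that indentation uses spaces only (no tabs) and that a blank line ends the block; the rule as described does not say what happens at blank lines, so that part is a guess.

```python
def leading_spaces(line):
    """Count the spaces at the start of a line."""
    return len(line) - len(line.lstrip(" "))

def take_preformatted(lines, start):
    """Collect the preformatted block that begins at lines[start].

    Returns (block_lines, next_index). The indentation of the first line
    sets the minimum; the block continues while lines are indented at
    least that much. Assumption: a blank line ends the block.
    """
    indent = leading_spaces(lines[start])
    end = start
    while end < len(lines):
        line = lines[end]
        if line.strip() == "" or leading_spaces(line) < indent:
            break
        end += 1
    # Strip the common indentation but keep any extra, deeper indentation.
    return [line[indent:] for line in lines[start:end]], end
```

Given the lines `["text", "    a = 1", "      b = 2", "    c", "back"]` and a start index of 1, this returns `(["a = 1", "  b = 2", "c"], 4)`: the deeper-indented middle line stays in the block and keeps its two extra spaces. List blocks would need a similar but more involved rule, since continuation lines of a list item are also indented.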

reStructuredText and TWiki have had a lot of good work put into them, and it’s silly to reinvent that work unless I’m trying to solve a different problem. I have to remind myself that my main goal is to parse my existing notes, and my secondary goal is to be able to write new notes easily. If the syntax of old and new documents is going to be different, then perhaps my approach should be to write a parser for my old notes that converts everything to my new note format, or perhaps I should manually convert everything.

2 comments:

Anonymous wrote at Wednesday, January 7, 2009 at 10:26:00 AM PST

Did you or anyone else write a translator? I am looking for TWiki to StructuredText or TWiki to reStructuredText.

Thanks,
Larry Dickson
ldickson@cuttedge.com

Amit wrote at Saturday, January 10, 2009 at 2:23:00 PM PST

Larry, I've recently started using Markdown.py, which is a Python version of Markdown. There's also Markdown2.py, but I haven't tried it.