[Update: I abandoned this project for several reasons, the main one being that I didn't have an immediate use for a parser system at the time.]

I've been working on a major upgrade to Yapps (Yet Another Python Parser System). Yapps was designed with two things in mind: it should be easy to use, and its output should be easy to understand. I think I got those two pretty well. However, as I try to push the system to do more, I'm finding that ease of use and understandable output are coming into conflict:

  • LL(1) recursive descent parsers are easy to understand. But they make grammars harder to write, because you have to left-factor them manually (see the sketch after this list).
  • Better error messages would make Yapps easier to use. But tracking the information needed for better error messages clutters up the parser, making it harder to understand.
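
To make the first point concrete, here's the kind of hand-written LL(1) recursive descent code a generated parser boils down to. This is an illustrative sketch, not actual Yapps output, and the grammar and token types are made up for the example:

    # The grammar  stmt: ID '=' NUM | ID '(' ')'  is not LL(1) as written:
    # both alternatives begin with ID, so one token of lookahead cannot
    # choose between them.  You must left-factor it by hand into
    #   stmt: ID ('=' NUM | '(' ')')
    # before the straightforward code below can be written.

    def parse_stmt(tokens):
        """tokens is a list of (type, text) pairs."""
        pos = [0]   # boxed index so the nested helpers can advance it

        def peek():
            return tokens[pos[0]][0] if pos[0] < len(tokens) else None

        def scan(expected):
            ttype, text = tokens[pos[0]]
            if ttype != expected:
                raise SyntaxError('expected %s, got %s' % (expected, ttype))
            pos[0] += 1
            return text

        name = scan('ID')      # the factored-out common prefix
        if peek() == '=':      # one token of lookahead picks the branch
            scan('=')
            return ('assign', name, scan('NUM'))
        else:
            scan('(')
            scan(')')
            return ('call', name)

    print(parse_stmt([('ID', 'x'), ('=', '='), ('NUM', '3')]))  # ('assign', 'x', '3')
    print(parse_stmt([('ID', 'f'), ('(', '('), (')', ')')]))    # ('call', 'f')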

So far, I've been leaning towards more ease of use and less readable parsers. Some more things I want to implement towards that goal:

  • Abstract rules: a way to reuse common patterns used in writing rules.

    If you want to parse a comma-separated list of numbers in Yapps 2, you'd write a rule like this:

    rule number_list:
          {{ numbers = [] }}
          (
             number {{ numbers.append(number) }}
             ( ',' number {{ numbers.append(number) }}
             )*
          )?
          {{ return numbers }}
    

    If you then want to parse a colon-separated list of strings, you have to write a very similar rule with a different delimiter and element pattern. By the time you need to parse a semicolon-separated list of statements, you're wondering how to avoid writing this pattern over and over.
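
    In Python terms, here's roughly what each such rule boils down to -- a sketch, not actual Yapps output, with the scanner faked as a token list and error checks omitted. The copy-and-paste problem is on full display:

     def parse_number_list(tokens):
         numbers, pos = [], 0
         if pos < len(tokens) and tokens[pos][0] == 'NUM':        # optional part
             numbers.append(tokens[pos][1]); pos += 1
             while pos < len(tokens) and tokens[pos][0] == ',':   # (',' number)*
                 pos += 1
                 numbers.append(tokens[pos][1]); pos += 1
         return numbers

     # The colon-separated string list is the same function with three
     # token types renamed -- exactly the copy and paste that abstract
     # rules are meant to remove.
     def parse_string_list(tokens):
         strings, pos = [], 0
         if pos < len(tokens) and tokens[pos][0] == 'STR':
             strings.append(tokens[pos][1]); pos += 1
             while pos < len(tokens) and tokens[pos][0] == ':':
                 pos += 1
                 strings.append(tokens[pos][1]); pos += 1
         return strings

     print(parse_number_list([('NUM', '1'), (',', ','), ('NUM', '2')]))  # ['1', '2']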

    Abstract rules are rules with named placeholders. Generalizing the number_list rule, you'd write:

    rule delimiter_separated_list[delimiter][element]:
          {{ results = [] }}
          (
             element {{ results.append(element) }}
             ( delimiter element {{ results.append(element) }}
             )*
          )?
          {{ return results }}
    

    You'd then be able to invoke that rule with delimiter_separated_list[','][number], delimiter_separated_list[':'][string], and delimiter_separated_list[';'][statement]. I'm running out of bracket characters, though: I've already used [ A B C ] for optional matches. I'll probably switch to the standard EBNF form for optional matches, (A B C)?, to free up square brackets for abstract rule parameters.
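
    In Python terms, an abstract rule is essentially a higher-order function over parsing procedures: the placeholders become parameters. Here's a rough sketch of the idea (a hypothetical interface, not a committed design):

     # Each element parser takes (tokens, pos) and returns (value, new_pos),
     # raising SyntaxError on failure.
     def delimiter_separated_list(delimiter, parse_element):
         def parse(tokens, pos):
             results = []
             try:
                 value, pos = parse_element(tokens, pos)       # first element, if any
             except SyntaxError:
                 return results, pos                           # the whole list is optional
             results.append(value)
             while pos < len(tokens) and tokens[pos] == delimiter:
                 value, pos = parse_element(tokens, pos + 1)   # element after each delimiter
                 results.append(value)
             return results, pos
         return parse

     def parse_number(tokens, pos):
         if pos < len(tokens) and tokens[pos].isdigit():
             return int(tokens[pos]), pos + 1
         raise SyntaxError('number expected')

     number_list = delimiter_separated_list(',', parse_number)
     print(number_list(['1', ',', '2', ',', '3'], 0))   # ([1, 2, 3], 5)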

  • Inheritance of Parsers: the ability to reuse an existing parser's rules when defining a new parser.

    In Yapps 2, I find that I often copy and paste tokens and rules from one parser to another. The most common ones are strings, numbers, identifiers, and delimiter-separated lists. I'd like to be able to define these in one place and then inherit them in the parsers I'm writing.

    To implement this, Yapps needs to be able to decipher an existing parser's First and Follow sets. I can either do this via reflection at the Python class level or I can implement inheritance directly at the grammar level. Either way, parsers would be more useful if you could reuse them in larger parsers.
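
    At the Python class level, the reflection route might look something like this sketch. All the names here are hypothetical, and a real generated parser also carries scanner and token-table state that would have to be merged, with the new grammar's First and Follow sets recomputed:

     class BaseParser:
         """Rules common to many grammars, generated once and reused."""
         def __init__(self, tokens):
             self.tokens = tokens
             self.pos = 0

         def peek(self):
             return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

         def scan(self, expected):
             ttype, text = self.tokens[self.pos]
             if ttype != expected:
                 raise SyntaxError('expected %s, got %s' % (expected, ttype))
             self.pos += 1
             return text

         def number(self):
             return int(self.scan('NUM'))

     class IniParser(BaseParser):
         """Inherits number() and the scanner plumbing; adds its own rules."""
         def setting(self):
             name = self.scan('ID')
             self.scan('=')
             return (name, self.number())

     p = IniParser([('ID', 'width'), ('=', '='), ('NUM', '80')])
     print(p.setting())   # ('width', 80)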

I've also been thinking about some more radical changes to Yapps:

  • Backtracking: abandoning the LL(1) restriction.

    The Toy Parser Generator (don't be fooled by the name -- it's quite powerful!) supports full backtracking. You don't have to worry about whether your grammar is LL(1) or not; you can write it in a more natural style. However, I'm not sure how backtracking interacts with imperative actions. I've thought about using something like Haskell monads, but without decent support for blocks, it's hard to do this properly in Python. I might be able to delay all the actions by constructing a string at parse time and then running it with exec once the parse is complete. However, that has the disadvantage that the actions can't influence the parser. (On the other hand, without backtracking, the actions can't meaningfully influence the parser anyway.)
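
    Here's a toy sketch of the delayed-action idea, using recorded closures rather than an exec'd string: each action is appended to a list during the parse, the list is truncated whenever a branch is abandoned, and only the survivors run once the parse succeeds. (All the names and the grammar are made up for illustration.)

     def parse_with_backtracking(tokens):
         actions = []    # delayed actions; none run while parsing decisions are made
         results = []

         def try_alternatives(pos, alternatives):
             for alt in alternatives:
                 mark = len(actions)
                 try:
                     return alt(pos)
                 except SyntaxError:
                     del actions[mark:]    # discard the abandoned branch's actions
             raise SyntaxError('no alternative matched')

         def keyword(word):
             def match(pos):
                 if pos < len(tokens) and tokens[pos] == word:
                     actions.append(lambda: results.append(word))  # delayed action
                     return pos + 1
                 raise SyntaxError('expected ' + word)
             return match

         # stmt: 'print' 'x' | 'print' 'y' -- both alternatives begin with
         # 'print', so the first may partially match before rolling back.
         def alt1(pos):
             return keyword('x')(keyword('print')(pos))
         def alt2(pos):
             return keyword('y')(keyword('print')(pos))

         try_alternatives(0, [alt1, alt2])
         for act in actions:    # the parse succeeded: now run the actions
             act()
         return results

     print(parse_with_backtracking(['print', 'y']))   # ['print', 'y']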

  • Run Time Parser Generation: abandoning the model of generating a Python source file.

    One annoying feature of most parser generators is that they generate source code, which then has to be compiled/run in a separate process. It'd be much nicer to build a parser within a running program. With Python, this doesn't seem hard to do. Even in the most static of languages -- C++ -- you can generate parsers without creating a separate source file, using the Spirit parser library.
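
    Here's a minimal sketch of the mechanics in Python: build the parser's source as a string and exec it into a namespace inside the running program, with no intermediate file. (The toy "generator" below handles only a fixed keyword sequence, but the mechanism is the same for real generated code.)

     def make_parser(rule_name, keywords):
         # Generate the source for one rule function as a string...
         lines = ['def %s(tokens):' % rule_name, '    pos = 0']
         for kw in keywords:
             lines.append('    assert tokens[pos] == %r, "expected %s"' % (kw, kw))
             lines.append('    pos += 1')
         lines.append('    return pos')
         # ...then compile it within the running program instead of
         # writing it out to a .py file.
         namespace = {}
         exec('\n'.join(lines), namespace)
         return namespace[rule_name]

     greet = make_parser('greet', ['hello', 'world'])
     print(greet(['hello', 'world']))   # 2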

I'm not yet sure whether I'll implement the more radical changes.
