How would you go about writing a parser for a new syntax in Python?
7 votes by dis.k — 7 votes, 13 comments

Say I want to come up with a new lightweight/super-high-level syntax for [ [ [hierarchically] [structured] ] data], that I want to turn automagically into JSON or YAML etc for further manipulation later.

What's the smart way to go about it? Join() and split() strings is the first thing that comes to mind. But what would be the path you'd recommend I look into? (I want to avoid reinventing something that's already well established).

For a very simple parser you could get away with simple string manipulation functions and regular expressions. But if you want to parse a more complex syntax you should look into Flex and Bison.

Hm.. If that's it, it doesn't look too hard.

I've never used PyBison. When I did it, I did it in C. It's not something hardcore to do, but Flex/Bison do take time to learn. I think a Python binding can help make things easier. However it all depends on what kind of grammar you want to implement. Parsing a +-*/ calculator is easy (even with Flex/Bison the grammar is short); parsing a C++ source file is a lot more difficult.

Unless you're doing this as a hobby project or you think there is something really wrong with existing formats, my personal suggestion is that you look into (and possibly fork/modify) one of the existing formats. For example TOML seems to have a format similar to your example although I don't know if there is any parser available yet.

The most difficult thing probably is to parse your input into tokens (for instance an array). Once you have the tokens Python already has got JSON and YAML serialization libraries.

Is your grammar context free? (It looks like it) Then you don't need tokens or string splitting. Just parse each character and use a state machine generate JSON.

My input will be human languages (with the structural info provided by a syntax I will devise), so afaik it is not as clear-cut. Human languages are context-sensitive...ish?

But maybe the notation used to describe human languages is context-free? My idea will be an extremely lightweight alternative to the bracket notation linguistics typically use.

Is this what bracket notation looks like? How would you simplify that notation?

Edit: I've found another example of the same notation. Looking at example (3) it seems already easy to parse as is, unless you want to replace the square brackets with something else to make it easier on the eye :P

My design principles will be

a. no need for matching brackets

b. optional and flexible labelling

c. left-to-right expansion of the structure (since sentences start with one object (eg a word) and merge it with another object, which results into a new level of structure each time you merge).

The problem I face is that bracket notation is not convenient for taking notes during a lecture, because it requires planning and it goes contrary to the direction of writing (the first object is at the end of the line, and you add new objects to the right of the old ones.


So, for each point

a1. [visit [the [ [white] [house] ] ] ] 
a2. house white | the | visit (objects separated by whitespace merge, | closes of their projection and moves up a level)

b1. [DP [D the] [AP [A white] [N house] ...
b2. house white |<AP the |<DP   (not everything has to be labelled, if the parser doesn't find a label, it assumes continues building up)

c1/2 already demonstrated above

Is this the same problem of serializing a tree data structure into a string (file)?

I was not familiar with the terminology, but it looks related if that's it.

These look like "phrase markers". Writing a parser for a different version of this shouldn't be difficult, but the question/problem is if when removing the brackets or any other symbol you are losing information or you are introducing an "invariant". For example A | B | C could mean [ [A, B], C ] or [ A, [B, C] ]. If you remove the brackets from the syntax and the original meaning changes, then you cannot remove them. Either you lose information or you need to introduce a new symbol to convey the meaning that was lost by removing the brackets.

If you can remove the brackets and at the same time not lose meaning, then yes you can give a try to flex&bison.

I'm not familiar with the syntax (never used before) but wouldn't flex/bison be overkill here? Words are simply split at every white space. It looks to me that brackets here only provide order, so it's only a matter of defining a separator (for example |) and then agree (beforehand) on the order how to rebuild the tree from the single labels.

I think I can solve it with some more assumptions which will work for a specific circumstance (a specific linguistic framework), like strict binary branching (so that's one ambiguity gone), and some assumptions for what element heads the level of projection it is in (eg optional labeling of the phrase, and in absence of that, assume that the element immediately before the seperator is the head).