Easy Lossless Trees with Nom and Rowan
In the past few months I've been idly working on a few utilities in Rust to facilitate writing tooling for the Julia programming language. The main reason is that I don't think it makes sense for things like code formatters to require the Julia runtime. I want quick little CLI tools for those tasks, not giant binaries with warmup time.
To make any progress, I needed to write a Julia parser in Rust. Julia's syntax is mostly standard recursive descent, with a few strange gotchas along the way. It seems the best approach to start with is to borrow from Rust Analyzer's wisdom and build linked trees between the concrete realization (Green Tree) and the syntax (Red Tree). That way we can easily work with both: the CST for syntax highlighting, formatting, etc., and the AST for linting and other analysis.
The Rust Analyzer team has a library, rowan, that is built for this purpose. It has a few rough edges, but seems to be ubiquitous. Every example and blog post I have found seems to indicate that one should write a parser by hand when working with rowan. So, I tried this along with the logos lexer by following the fantastic blog posts by Luna. This worked, but there was so much boilerplate and minutiae that I was losing my mind. Not only that, but logos wasn't quite expressive enough for some of the tricky edge cases in the language.
So instead, I chose to use the parser combinator library nom. Before the hand-written parser attempt, though, I tried to use a few of the parser generators (LALRPOP and pest), but they were either not expressive enough or too slow.
The only hiccup here is that nom seems geared towards producing typed ASTs. For example, the function separated_list0 returns a vector of matches that are separated by the result of another parser. This would be perfect for something like the contents of a vector, but this function throws out the commas, which we need if we are building a lossless tree. The functions in nom are generic in their return type, so we're fine to emit GreenNodes, we just have to be careful not to throw anything out.
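To make the problem concrete, here is a minimal, hypothetical illustration (not from the original post) of why separated_list0 doesn't fit: the separators are parsed and then dropped, so the original text can't be reconstructed from the result.
use nom::{bytes::complete::tag, character::complete::digit1, multi::separated_list0, IResult};
// separated_list0 keeps only the element matches; the commas are consumed and
// discarded, so the output is no longer lossless with respect to the input.
fn csv(input: &str) -> IResult<&str, Vec<&str>> {
    separated_list0(tag(","), digit1)(input)
}
// csv("1,2,3") evaluates to Ok(("", vec!["1", "2", "3"])), and the "," separators are gone.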
Additionally, the examples for rowan use a builder abstraction, which makes sense if we are building the tree top down. Because the nom use case is structured as a combination of smaller parsers, we want to build bottom up.
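For contrast, here is roughly what the top-down builder style looks like, as a sketch using rowan's GreenNodeBuilder and the SyntaxKind variants we define below (this is not the approach we'll take)
use rowan::{GreenNode, GreenNodeBuilder};
// Top-down: open a node, push its tokens, then close it.
fn builder_style() -> GreenNode {
    let mut builder = GreenNodeBuilder::new();
    builder.start_node(LIST.into());
    builder.token(L_PAREN.into(), "(");
    builder.token(ATOM.into(), "+");
    builder.token(R_PAREN.into(), ")");
    builder.finish_node();
    builder.finish()
}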
After a bit of tinkering, I've managed to find a nice workflow that allows us to build trees in Rowan with combinators from nom. The rest of this post is how I did that, by building a very simple S-expression parser.
Working with Rowan
The Green tree in Rowan is untyped. That is, the nodes of the tree are all the same struct with a type tag and a vector of children. Rowan has its own internal type for this tag, rowan::SyntaxKind, which we need to
- Define all our node types
- Connect to the internal representation
Rowan settled on struct SyntaxKind(u16) for its internal type, which means we need a way to create unique IDs for each node. This is a perfect fit for enums, except there isn't a safe, vanilla Rust way to convert back from a u16 to an enum variant. In fact, the official rowan example does
unsafe { std::mem::transmute::<u16, SyntaxKind>(raw.0) }
Following Luna's advice from the Make a Language series, we'll make use of the num-derive crate to avoid these.
So, we'll start by importing those macros and traits and setting up our node types
use num_derive::{FromPrimitive, ToPrimitive};
use num_traits::{FromPrimitive, ToPrimitive};
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash, FromPrimitive, ToPrimitive)]
#[allow(non_camel_case_types)]
#[repr(u16)]
enum SyntaxKind {
L_PAREN = 0, // '('
R_PAREN, // ')'
ATOM, // '+', '15'
WHITESPACE, // whitespace is explicit
// composite nodes
LIST, // `(+ 2 3)`
ROOT, // top-level node
}
// Bring the variants into our scope
use SyntaxKind::*;
We'll want to use .into() in certain cases, which is easy thanks to num-derive
impl From<SyntaxKind> for rowan::SyntaxKind {
fn from(kind: SyntaxKind) -> Self {
rowan::SyntaxKind(kind.to_u16().unwrap())
}
}
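As a quick, hypothetical sanity check (not in the original post), the conversion is now a plain .into()
#[test]
fn kind_into_raw() {
    // LIST is the fifth variant, so its raw value is 4.
    let raw: rowan::SyntaxKind = LIST.into();
    assert_eq!(raw.0, 4);
}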
Rowan's Language trait is how we connect the two trees together. Here, we'll not worry about anything but mapping the SyntaxKind variants back and forth. Again, this is made easy because of num-derive.
#[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq, Hash)]
enum Lang {}
impl rowan::Language for Lang {
type Kind = SyntaxKind;
fn kind_from_raw(raw: rowan::SyntaxKind) -> Self::Kind {
Self::Kind::from_u16(raw.0).unwrap()
}
fn kind_to_raw(kind: Self::Kind) -> rowan::SyntaxKind {
rowan::SyntaxKind(kind.to_u16().unwrap())
}
}
// Node type for the red tree
type SyntaxNode = rowan::SyntaxNode<Lang>;
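A quick, illustrative test (not from the post) confirms that the two conversions round-trip
#[test]
fn kind_roundtrip() {
    // Convert to the raw u16-backed kind and back again.
    let raw = <Lang as rowan::Language>::kind_to_raw(LIST);
    assert_eq!(<Lang as rowan::Language>::kind_from_raw(raw), LIST);
}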
Connecting to Nom
This is where we diverge from the examples: we now want to write the parsers in a way that builds a rowan::GreenNode.
We'll start with importing the things we need
use rowan::{GreenNode, GreenToken, NodeOrToken};
In GreenNode::new, the children can be either a GreenToken (terminal) or another GreenNode (non-terminal). They have this exposed as NodeOrToken, which we'll make a type alias for.
type Child = NodeOrToken<GreenNode, GreenToken>;
Nom has a ton of imports, here are all the ones we'll use
use nom::{
branch::alt,
bytes::complete::tag,
character::complete::{multispace1, none_of},
combinator::{all_consuming, map, opt, recognize},
multi::{many0, many1},
sequence::{pair, tuple},
};
Every parser will return a Child, so we can type alias that as well, to make our function signatures a little cleaner.
type IResult<'a> = nom::IResult<&'a str, Child>;
Last little bit of boilerplate: here is a function that takes a SyntaxKind and some &str and builds the NodeOrToken::Token out of it.
fn leaf(kind: SyntaxKind, input: &str) -> Child {
NodeOrToken::Token(GreenToken::new(kind.into(), input))
}
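For example, here is a hypothetical check (not in the original post) that a single '(' becomes a one-character token
#[test]
fn leaf_builds_a_token() {
    let tok = leaf(L_PAREN, "(");
    // The Child is the Token variant, carrying the kind and the exact text.
    assert_eq!(tok.into_token().unwrap().text(), "(");
}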
Lexing
The line between lexing and parsing is blurred in nom, which makes things a little strange, but all of our terminals are basically lexemes. So, for each we write a simple function that does the match and creates the leaf.
fn lparen(input: &str) -> IResult {
map(tag("("), |s| leaf(L_PAREN, s))(input)
}
fn rparen(input: &str) -> IResult {
map(tag(")"), |s| leaf(R_PAREN, s))(input)
}
fn whitespace(input: &str) -> IResult {
map(multispace1, |s| leaf(WHITESPACE, s))(input)
}
The atoms (in this example) are anything that isn't a space or a paren. This is of course not quite true, as there are string literals and other more complicated things, but we'll keep it simple here.
fn atom(input: &str) -> IResult {
map(recognize(many1(none_of("() \t"))), |s| leaf(ATOM, s))(input)
}
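A few quick, hypothetical checks of the lexing parsers (not from the post)
#[test]
fn lexers_match() {
    assert!(lparen("(").is_ok());
    assert!(rparen(")").is_ok());
    assert!(whitespace("  \t").is_ok());
    // atom stops at the closing paren, leaving it as remaining input.
    assert_eq!(atom("foo)").unwrap().0, ")");
}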
Parsing
Now the fun part, lists! S-expressions have a very simple grammar. We need to match an open paren, then zero or more atoms or other lists separated by spaces, and then a closing paren. As discussed before, we can't use the separated_list0 function since we need to keep track of whitespace. We use alt for the two cases of an atom or a recursive list, and many0 of pairs for the inner body. The sticky bit here is that there is optional whitespace before the first element and after the last element. The opt(whitespace) at the end takes care of the latter, but we need to be explicit about the first optional whitespace and list element. Nom is greedy, so this all works as you'd expect. To then build the tree, we can simply build a children vector from the matches of many0 and build the GreenNode directly.
fn list(input: &str) -> IResult {
map(
tuple((
lparen,
opt(whitespace),
opt(alt((list, atom))),
many0(pair(whitespace, alt((list, atom)))),
opt(whitespace),
rparen,
)),
|(l, ws1, li1, body, ws2, r)| {
let mut children: Vec<Child> = vec![l];
if let Some(s) = ws1 {
children.push(s)
}
if let Some(n) = li1 {
children.push(n)
}
for (ws, elem) in body {
children.push(ws);
children.push(elem);
}
if let Some(s) = ws2 {
children.push(s)
}
children.push(r);
NodeOrToken::Node(GreenNode::new(LIST.into(), children))
},
)(input)
}
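A quick, hypothetical check (not in the original post) that list consumes one complete list and leaves any trailing input untouched
#[test]
fn list_leaves_trailing_input() {
    let (rest, node) = list("(+ 1 2) tail").unwrap();
    assert_eq!(rest, " tail");
    assert!(matches!(node, NodeOrToken::Node(_)));
}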
Wrapping Up
Finally, we can create a top-level parse function that panics if we don't consume all the input and builds the associated SyntaxNode.
fn parse(input: &str) -> SyntaxNode {
let body = all_consuming(list)(input).unwrap().1;
let cst = GreenNode::new(ROOT.into(), [body]);
SyntaxNode::new_root(cst)
}
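As a quick, hypothetical check that the tree really is lossless (not in the original post), the text of the red tree should reproduce the input exactly
#[test]
fn roundtrip() {
    let input = "(+ 1 (* 2 3))";
    assert_eq!(parse(input).text().to_string(), input);
}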
We can also inspect the result using the SyntaxNode debug pretty print
println!("{:#?}", parse("(defn foo (x) (+ x x))"));
et voilà, a beautiful lossless tree!
ROOT@0..22
LIST@0..22
L_PAREN@0..1 "("
ATOM@1..5 "defn"
WHITESPACE@5..6 " "
ATOM@6..9 "foo"
WHITESPACE@9..10 " "
LIST@10..13
L_PAREN@10..11 "("
ATOM@11..12 "x"
R_PAREN@12..13 ")"
WHITESPACE@13..14 " "
LIST@14..21
L_PAREN@14..15 "("
ATOM@15..16 "+"
WHITESPACE@16..17 " "
ATOM@17..18 "x"
WHITESPACE@18..19 " "
ATOM@19..20 "x"
R_PAREN@20..21 ")"
R_PAREN@21..22 ")"
From here, we would add more syntax into our grammar to get more semantic meaning as well as create some more specific atoms.
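To give a feel for what consuming the red tree looks like, here is a hypothetical sketch (not part of the gist) that collects the text of every atom
// Walk the red tree and collect the text of every ATOM token.
fn atoms(node: &SyntaxNode) -> Vec<String> {
    node.descendants_with_tokens()
        .filter_map(|element| element.into_token())
        .filter(|token| token.kind() == ATOM)
        .map(|token| token.text().to_string())
        .collect()
}
// atoms(&parse("(+ x x)")) would yield ["+", "x", "x"].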
The code for all of this is in a gist here. If you'd like to follow along with my Julia parser, that's happening here.
Thanks for reading!