
Topiary: A code formatting engine leveraging Tree-sitter

120 points | Xophmeister | 2 years ago | tweag.io

33 comments

[+] hardwaregeek|2 years ago|reply
I'd be curious to know how effective this is at getting formatting close to a Prettier or rustfmt level of quality. I know that for Prettier there was a heavy layer of custom logic around the core printer just to get output that looked decent. Of course even if it only gets 80-90% of the way there, that's still a massive achievement.

It's fascinating seeing these tools that facilitate building better programming language experiences. I've called them "Tooling for Tooling", basically tools that make it easier to create tools like formatters, linters, etc.

[+] ErinvanderVeen|2 years ago|reply
Hi!

We share the same curiosity about the effectiveness of our approach! Right now, we want to make Topiary great for languages with less complex formatting rules, and "good enough" for languages that are a bit more complex, so that for one-off use you don't feel the need to get the dedicated formatter.

We don't yet have any ambitions to compete directly with Prettier and rustfmt among others.

Having said that, we are quite proud of how the OCaml rules turned out, and even had some great results with the Rust rules.

As we explore more, and expand the complexity of our tree-sitter scopes, who knows what kind of things we might be able to format!

It's all very exciting!

[+] afiori|2 years ago|reply
It might be that JS/TS produces code that is very hard to handle, but I find Prettier's choices quite disappointing.
[+] aidos|2 years ago|reply
Nice work!

I keep wondering, is there a reason everything isn’t just based off treesitter these days? If I were tasked with writing tsserver, my instinct would be to layer it over treesitter. Does anyone know if there are practical reasons that doesn’t happen, or is it just legacy?

The neovim world is slowly converging on using it for syntax, and this project uses it for formatting. I personally use it in neovim for things like highlighting and formatting SQL within strings in Python.

[+] hardwaregeek|2 years ago|reply
It's really annoying to produce an AST from tree-sitter. I tried writing my programming language parser with tree-sitter and it was a huge pain[1]. Anything with error recovery or good error messages is hard to customize too. If you want the ability to work with partial pieces of code in a homogeneous syntax tree and not an AST, then tree-sitter is great. Otherwise it's definitely rough around the edges.

[1]: https://uptointerpretation.com/posts/vicuna-update/

[+] junon|2 years ago|reply
Tree-sitter (TS) isn't exactly the most ergonomic API or structure to use. In my opinion, the Neovim stuff moving entirely to TS made it worse than the existing tools that were out there. But for something like this, I think TS fits pretty nicely into the use case.
[+] IshKebab|2 years ago|reply
Tree-sitter is easy to use, pretty language-neutral, and tolerant of errors. Plus, at this point people have already implemented support for a ton of languages.

My only issue with it is that it really only does half the job. You get a CST of sorts, but if you want to do anything with it you pretty much have to hand write another parser for that node tree.

In contrast parser combinator libraries like Nom and Chumsky give you "the final output".
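To make the "second parser" point concrete, here is a rough sketch of the kind of hand-written pass that turns a generic node tree into typed values. `CstNode` is a hypothetical stand-in, not the tree-sitter API, and the node kinds (`number`, `binary_expression`, `parenthesized_expression`) are assumed for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class CstNode:
    """Hypothetical generic syntax-tree node (NOT the tree-sitter API)."""
    kind: str
    text: str = ""
    children: list["CstNode"] = field(default_factory=list)


@dataclass
class Num:
    value: int


@dataclass
class BinOp:
    op: str
    left: "Num | BinOp"
    right: "Num | BinOp"


def lower(node: CstNode):
    """Hand-written pass: match on node kinds and build a typed AST."""
    if node.kind == "number":
        return Num(int(node.text))
    if node.kind == "parenthesized_expression":
        # Punctuation tokens are part of the CST but carry no meaning in the AST.
        inner = [c for c in node.children if c.kind not in ("(", ")")]
        return lower(inner[0])
    if node.kind == "binary_expression":
        left, op, right = node.children
        return BinOp(op.text, lower(left), lower(right))
    raise ValueError(f"unhandled node kind: {node.kind}")
```

Every node kind in the grammar needs a branch like these, which is the extra work a parser combinator library lets you fold into the parser itself.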

[+] yewenjie|2 years ago|reply
VS Code and Monaco still don't support tree-sitter.
[+] solarkraft|2 years ago|reply
The tree sitter ecosystem is very cool. I'm happy it exists.

My research into programming language parsing started with a very specific problem: I like folding code, and I like disabling ("commenting out") code to test behavior. But, with rare exceptions (Xcode and, nowadays, some languages in VS Code), "commenting out" code breaks folding. I never got around to really solving it, but the learning involved (including about tree sitter) was very cool.

[+] junon|2 years ago|reply
Neat. I tried to do this exact thing a while back, leveraging TS as well, and struggled to find a generalized rule engine for it. I'll give this a try later, been hoping for something like this.
[+] xupybd|2 years ago|reply
Can someone explain why semantic whitespace wouldn't work with a tool like this?
[+] Xophmeister|2 years ago|reply
Topiary contributor here: in theory, I think a simple semantic-whitespace language could work, provided the Tree-sitter grammar for that language is adequate. Python, for example, might be possible, as long as we ignore things like line continuations.
[+] mhh__|2 years ago|reply
You don't want an AST or even a full parser for formatting most languages.

Tree-sitter deals with errors better than most parser generators but if you just lex and separate into chunks then you can much more flexibly format broken code.
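A minimal sketch of the lex-and-chunk idea, assuming a toy language made only of identifiers, commas, and parentheses. The function name `format_lexed` and the always-break, trailing-comma style are choices made for this illustration, not anything Topiary does.

```python
import re

# Toy lexer (an assumption for illustration): identifiers, commas, parentheses.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[(),]")


def format_lexed(source: str, indent: str = "  ") -> str:
    """Format from a token stream and a bracket-depth counter; no parse tree."""
    lines = []
    depth = 0
    current = ""  # text accumulated for the line being built
    for tok in TOKEN_RE.findall(source):
        if tok == "(":
            lines.append(indent * depth + current + "(")
            current = ""
            depth += 1
        elif tok == ")":
            if current:
                # Flush the last argument with a trailing comma.
                lines.append(indent * depth + current + ",")
            depth = max(depth - 1, 0)  # tolerate unbalanced input
            current = ")"
        elif tok == ",":
            lines.append(indent * depth + current + ",")
            current = ""
        else:
            current += tok
    if current:
        lines.append(indent * depth + current)
    return "\n".join(lines)


print(format_lexed("aaa(bbb, ccc(ddd, eee), fff)"))
print(format_lexed("aaa(bbb, ccc("))  # truncated input still gets indented
```

Because nothing here requires a complete parse, unbalanced input still formats. The sketch breaks after every argument; deciding whether a call fits on one line, or telling apart constructs that reuse the same tokens, takes more context than a depth counter carries.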

[+] ErinvanderVeen|2 years ago|reply
Hej!

I agree with you in that there are many languages where skipping parsing altogether could still result in a good formatter, and I would love to see a Topiary-like project attempt it.

I don't feel confident saying that this holds for most languages, however; I worry that it can lead to a lot of ambiguity in languages with more complex formatting conventions.

Regardless, the eventual goal of Topiary is to be able to format the widest possible spectrum of languages, and so limiting ourselves to just lexing didn't seem like the right choice at the time.

As you mention, this does mean we give up being able to format broken code. In fact, we currently even ensure that Tree-sitter is able to parse the entire input before formatting. This is a shame, but ultimately it is what we decided was the best approach for Topiary to achieve its goal.

[+] xigoi|2 years ago|reply
How do you produce something like this with just lexing?

    aaa(
      bbb,
      ccc(
        ddd,
        eee,
      ),
      fff,
    )
[+] jensenbox|2 years ago|reply
Can this implement/emulate something like Python Black?
[+] ErinvanderVeen|2 years ago|reply
Hi!

We are not sure right now because Topiary is still very much an experiment.

Having said that, we are constantly surprised by what we can do with Topiary. So with a dedicated Python developer willing to draft a set of rules, it might just be possible!