Let's write a treesitter major mode for Emacs

209 points| nanna | 2 years ago |masteringemacs.org | reply

84 comments

[+] G3rn0ti|2 years ago|reply

BTW:

While Emacs 29.1 comes with "treesitter" built-in, you still need to manually build and install a treesitter language plugin implementing the actual language-specific parser. Doing this yourself can be fiddly and frustrating.

I had quick success using this convenience script: https://github.com/casouri/tree-sitter-module/. It provides fully automated builds for the most popular languages (including TypeScript, C and C++).

This is how it works for "typescript":

1. Clone the repository: https://github.com/casouri/tree-sitter-module/

2. Install "build-essential" (which provides a C/C++ compiler) if you're on Linux.

3. Run "./build.sh typescript" from within the repo.

4. Copy the resulting shared library from "dist/libtree-sitter-typescript.so" into your "~/.emacs.d/tree-sitter/".

5. Open a random typescript file and try "M-x typescript-ts-mode" which should not give you any error but instead nice syntax highlighting.
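As a rough sketch of steps 4 and 5, the copy can be replaced by pointing Emacs at the build output and asking whether the grammar loads. This assumes an Emacs 29 built with tree-sitter support; the checkout path is hypothetical.

```elisp
(require 'treesit)

;; Hypothetical checkout location (an assumption, not from the steps above);
;; adjust it to wherever you cloned tree-sitter-module.
(add-to-list 'treesit-extra-load-path
             (expand-file-name "~/src/tree-sitter-module/dist"))

;; Returns non-nil once Emacs can find and load the typescript grammar.
(treesit-language-available-p 'typescript)
```

If the last form returns nil, "M-x typescript-ts-mode" will most likely complain about a missing grammar as well.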

You might find that a treesitter plugin exists for your language, and is even supported by "tree-sitter-module", but that there is no major mode for it yet. This happened to me with Perl 5.

[+] nanna|2 years ago|reply

Technically in Emacs 29.1 tree-sitter is still only an optional build option, which a given package maintainer may have 'built in' to your package. It isn't actually a default. If you build it from source you need to pass the --with-tree-sitter flag to ./configure. See:

https://www.masteringemacs.org/article/how-to-get-started-tr...

What I read from this is that tree-sitter isn't considered quite ready by the Emacs maintainers, perhaps because of the restricted number of actual treesitter modes, or maybe because the treesitter support itself is not quite considered there yet?
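A small sketch for checking which situation a given Emacs is in. It only assumes that `treesit-available-p` is undefined in releases before 29; the messages are illustrative.

```elisp
(cond
 ;; Emacs 28 and earlier do not define the function at all.
 ((not (fboundp 'treesit-available-p))
  (message "This Emacs predates built-in tree-sitter support"))
 ;; Emacs 29+, configured with --with-tree-sitter.
 ((treesit-available-p)
  (message "tree-sitter support is compiled in"))
 ;; Emacs 29+, but built without the library.
 (t
  (message "Built without tree-sitter; rebuild with --with-tree-sitter")))
```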

[+] treeblah|2 years ago|reply

I found this snippet in one of Mickey's earlier tree-sitter posts that works great. It does require searching through the tree-sitter repo to make sure your paths are correct:

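  ;; Where to fetch each grammar from: (LANG URL REVISION SOURCE-DIR).
  (setq treesit-language-source-alist
        '((typescript "https://github.com/tree-sitter/tree-sitter-typescript" "master" "typescript/src")
          (tsx "https://github.com/tree-sitter/tree-sitter-typescript" "master" "tsx/src")))

  ;; Clone, compile and install every grammar listed above.
  ;; Evaluate this once rather than on every startup: each call rebuilds the grammar.
  (mapc #'treesit-install-language-grammar (mapcar #'car treesit-language-source-alist))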

[+] yougane|2 years ago|reply

Or you can just "M-x treesit-install-language-grammar" then follow the prompts.

[+] shanusmagnus|2 years ago|reply

Is there anything that returns a parse tree of an org document? A while ago I wrote some super hacky elisp to navigate around the structure of a giant org mode doc, but it was rickety and terrible and constantly breaking.

Part of this is surely that I don't know wtf I'm doing, but it seemed like there was not an underlying data structure held in memory that you could conveniently query / manipulate, but rather, most of the existing org functionality built some kind of structure each time you did an operation.

Would appreciate any pointers, code examples, tutorials that show how to effectively navigate / manipulate an org structure and have it reflected in the buffer, if there is such a thing.

[+] elviejo|2 years ago|reply

Organice is org-mode as a React app. They have a pretty complete parser. https://github.com/200ok-ch/organice#background-information

[+] morelisp|2 years ago|reply

https://orgmode.org/worg/dev/org-element-api.html

But even with this I found it pretty awful.
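For a sense of what that API looks like, here is a minimal sketch that parses the current buffer down to headline granularity and collects each headline's level and title. The function name is made up for illustration.

```elisp
(require 'org-element)

;; Hypothetical helper name; not part of Org itself.
(defun my/org-headline-outline ()
  "Return a list of (LEVEL . TITLE) for each headline in the current Org buffer."
  (org-element-map (org-element-parse-buffer 'headline) 'headline
    (lambda (hl)
      (cons (org-element-property :level hl)
            (org-element-property :raw-value hl)))))

;; Usage on a throwaway buffer:
(with-temp-buffer
  (insert "* One\n** Two\n")
  (org-mode)
  (my/org-headline-outline))
```

Note that the tree is rebuilt on every call, which matches the impression above that there is no persistent in-memory structure to query.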

[+] Brentward|2 years ago|reply

In org-alert we use `org-map-entries` and a simple `org-alert--parse-entry` function for stripping out the details we're looking for. Depending on what you want, it's not exactly a data structure, but maybe it will help you get started!

https://github.com/spegoraro/org-alert/blob/master/org-alert...
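The general shape of that approach, as a sketch rather than org-alert's actual code: `org-map-entries` visits each matching entry with point on its heading and collects whatever the function returns. The helper name and the match string are illustrative assumptions.

```elisp
(require 'org)

;; Hypothetical helper; org-alert's own parsing function is more involved.
(defun my/org-todo-headings ()
  "Return the heading text of every entry whose TODO keyword is \"TODO\"."
  (org-map-entries
   ;; Called with point at the start of each matching heading;
   ;; the four t arguments strip tags, keyword, priority and COMMENT.
   (lambda () (org-get-heading t t t t))
   "TODO=\"TODO\""))
```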

[+] imaltont|2 years ago|reply

While it doesn't properly understand the structure, you can move around pretty well with Imenu or (configured) org-goto. I assume it's also possible to make something for it so that it takes nesting into consideration, like it does for some programming languages. My org files are only a couple of thousand lines though, so I don't know how they perform when a file gets larger than that.

[+] mark_l_watson|2 years ago|reply

This is from the author of the excellent book Mastering Emacs.

I am very far from being knowledgeable about programming on the Emacs platform, but I am trying to learn. I grabbed the name M-x-AI.com a while back with the goal of integrating other people’s Emacs packages with some of my own hacks into a better AI dev work environment and writing a short book on it. I have been using Emacs since, I think, 1982. There are so many good new packages for integrating Copilot, GPT-4, etc., as well as major Emacs platform improvements, too many to list.

[+] confounded|2 years ago|reply

Out of interest, do you use Emacs as an alternative to Jupyter for interactive work (examining plots etc.)?

If so, which modes and packages do you use?

[+] raincole|2 years ago|reply

I'm not trying to bash Emacs or treesitter or anyone. But I find it mildly amusing that after so many decades, parsing and syntax highlighting aren't a perfectly solved problem, considering programming languages are the most used tools for developers.

[+] vidarh|2 years ago|reply

Parsing of a correct program is a pretty "solved" problem.

But fast enough re-parsing of fragments and recovery from errors is a much more complex problem that often doesn't have a single correct answer. It's also a much newer problem, in as much as syntax highlighting is a much newer feature, preceded largely by "offline" pretty-printers with very different constraints.

The extent to which modern compilers try to parse past errors still varies greatly, with a whole lot not even trying to.

But having just any recovering parser does not mean the problem is solved either. E.g. you've typed "foo". Now you type "(". It'd be very annoying if your editor now re-colored everything as an error, so you typically want some error recovery. But how soon? Do you assume the tokens immediately afterwards are part of what was a valid expression until you typed "foo(", or are they a valid part of an argument list? And where do they end? Do you just delay re-parsing until the user has typed more? Or left the line? Sometimes that can help, sometimes it will just make things worse.

For example, parsing methods that work fine when you can "reset" the parse at many different points (which constrains the area considered an error and so reduces the size of a typical re-parse) will fail badly if you want stricter re-parsing that may frequently trigger re-parsing most of a file.

A lot of this is subjective, and picking the "right" way of handling it largely comes down to unpacking humans' unstated preferences, and trying to reconcile competing and possibly contradictory preferences.

[+] mickeyp|2 years ago|reply

I clarified exactly why it is the way it is here:

https://www.masteringemacs.org/article/tree-sitter-complicat...

And also why using LSP to furnish your editor with highlight markers is an inelegant solution for many languages.

[+] troupo|2 years ago|reply

The problem is that a good editor-compatible tool for parsing and syntax highlighting is at cross-purposes with what you want from a compiler.

A good overview here: https://matklad.github.io/2022/04/25/why-lsp.html

[+] petepete|2 years ago|reply

Until TreeSitter came along the effort to add support for new languages to your editor would be gargantuan.

Now it's much, much easier providing there's a TreeSitter parser for your language.

I don't know of anything else that bridges the gap like this.

[+] olau|2 years ago|reply

It's a tough problem. Steve Yegge blogged about the complexities involved when he wrote js2-mode:

https://steve-yegge.blogspot.com/2008/03/js2-mode-new-javasc...

I guess comp. sci. people studying languages have been more interested in syntactically valid programs than the opposite.

[+] kristopolous|2 years ago|reply

The language extension strategies for CSS prettifiers are usually pretty reasonable.

In editors this always seems extremely esoteric by comparison: I've tried doing it in a few.

I'm sure brilliant people find it easy, but I'm merely average on a good day.

I haven't tried extending any of these modern Electron-based editors; can anyone speak to that?

[+] PurpleRamen|2 years ago|reply

There will never be "perfectly" solved problems at that complexity level. There are always changing requirements and room for improvement. Make it faster, add new features, use new hardware abilities, follow the flavors of this decade: it is an eternal game of catching up.

[+] mbork_pl|2 years ago|reply

Well, there are more problems like that. You'd think diffing is a solved problem, and yet we still struggle with syntax-aware diffs (I use difftastic, which is great, but doesn't always work well, and is under constant development).

[+] hardwaregeek|2 years ago|reply

It's not really a solved problem in general. Most editors appear to use TextMate grammars which nobody likes. Otherwise you have to implement it using whatever custom setup your specific editor uses. It just happens that most languages have some poor soul who set this up already. Emacs is actually on the better side because tree-sitter is a much better setup for writing grammars.

[+] nerdponx|2 years ago|reply

This is that. Tree Sitter has become one of the foundational advances that is allowing us to make progress on solving that problem.

[+] thih9|2 years ago|reply

What would you consider a perfectly solved problem in this case? I.e. how is the current development experience bad, and how could it be better?

[+] Difwif|2 years ago|reply

Is anyone using treesitter with lsp-mode?

I see some people say it's possible and use both together but I thought for the most part language servers offer the same set of features, and probably better? My current mental model for how to use them together is that the majority of the languages I quickly read I set up treesitter for speed. For languages I read extensively or write I set up a language server.

[+] BaculumMeumEst|2 years ago|reply

they are mostly used for different things.

lsp (and lsp-mode) are mostly concerned with IDE functionality- go-to definition, show references, displaying project errors in real time without explicitly building, etc.

tree-sitter builds a syntax tree of your source code; its applications are things like syntax highlighting and structural navigation of your code.

there is some overlap in functionality, lsp has somewhat supported mechanisms for syntax highlighting iirc, but they are fairly orthogonal overall

so yes, it makes sense to use them together
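to make the "syntax tree" part concrete, here is a small sketch that walks from the node at point up to the root and collects the node types. it assumes an Emacs built with tree-sitter and a buffer in some tree-sitter major mode (e.g. typescript-ts-mode); the function name is made up.

```elisp
;; Hypothetical helper; assumes the current buffer already has a
;; tree-sitter parser, e.g. because it is in a *-ts-mode.
(defun my/treesit-ancestry-at-point ()
  "Return the node types from the smallest node at point up to the root."
  (let ((node (treesit-node-at (point)))
        types)
    ;; Climb parent links until we run off the top of the tree.
    (while node
      (push (treesit-node-type node) types)
      (setq node (treesit-node-parent node)))
    ;; `push' built the list root-first; flip it to innermost-first.
    (nreverse types)))
```

structural navigation and highlighting are built on this kind of query; none of it needs a language server running.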

[+] nequo|2 years ago|reply

I use both. In my experience, syntax highlighting with language servers is slower than with tree-sitter.

It stands to reason: a language server often does way more than just incremental parsing of the source code into a concrete syntax tree. By limiting itself to syntax, tree-sitter can be much faster.

[+] wiz21c|2 years ago|reply

I've been using it a bit, but it's still not on par with, well, vscode. It tends to be a bit slow on big files (say 10000+ lines): when you type an f-string in Python such as 'print(f"p={', it can get noticeably slow once the opening brace is typed.

But well, I still love emacs :-)

[+] jdblair|2 years ago|reply

It's so hard to give up your custom environment and keyboard muscle memory! (25 year emacs user here)