Looking at the architecture, they will probably run into some issues. We are doing something similar with SemanticDiff [1] and also started out using tree-sitter grammars for parsing and GumTree for matching. Both choices turned out to be problematic.

Tree-sitter grammars are primarily written to support syntax highlighting and often use a best-effort approach to parsing. This is perfectly fine for syntax highlighting, since the worst that can happen is that a few characters are highlighted incorrectly. However, when diffing or modifying code you really want the code to be parsed according to the upstream grammar, not something that mostly resembles it. We are currently in the process of moving away from tree-sitter and instead using the parsers provided by the languages themselves where possible.

GumTree is good at returning a result quickly, but there are quite a few cases where it always returned bad matches for us, no matter how many follow-up papers with improvements we tried to implement. In the end we switched over to a Dijkstra-based approach that tries to minimize the cost of the mapping, which is more computationally expensive but gives much better results. Difftastic uses a similar approach as well.
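To make the cost-minimizing idea concrete, here is a minimal sketch of Dijkstra's algorithm over an edit graph. It is a simplification for illustration, not the actual SemanticDiff or Difftastic implementation: it aligns flat token lists instead of syntax trees, and the function name `min_cost_alignment` and the unit costs (0 for keeping a token, 1 for an insert or delete) are assumptions.

```python
import heapq

def min_cost_alignment(old, new):
    """Dijkstra over the edit graph; state (i, j) = tokens consumed so far."""
    start, goal = (0, 0), (len(old), len(new))
    dist = {start: 0}
    prev = {}  # state -> (previous state, operation that led here)
    heap = [(0, start)]
    while heap:
        d, (i, j) = heapq.heappop(heap)
        if (i, j) == goal:
            break
        if d > dist[(i, j)]:
            continue  # stale heap entry
        steps = []
        if i < len(old) and j < len(new) and old[i] == new[j]:
            steps.append((0, (i + 1, j + 1), "keep"))
        if i < len(old):
            steps.append((1, (i + 1, j), "delete"))
        if j < len(new):
            steps.append((1, (i, j + 1), "insert"))
        for weight, nxt, op in steps:
            if d + weight < dist.get(nxt, float("inf")):
                dist[nxt] = d + weight
                prev[nxt] = ((i, j), op)
                heapq.heappush(heap, (d + weight, nxt))
    # Walk back from the goal to recover the cheapest edit script.
    ops, node = [], goal
    while node != start:
        node, op = prev[node]
        ops.append(op)
    return dist[goal], ops[::-1]

total, script = min_cost_alignment(["a", "b", "c"], ["a", "x", "b", "c"])
print(total, script)
```

Every state is explored in order of accumulated cost, which is why this is slower than a greedy matcher but cannot get stuck with an early bad match.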

[1]: https://semanticdiff.com/

wetneb|1 year ago

Thanks for the insightful comments! You surely have a lot more experience than me there, but my impression was that producing visual diffs and merging files are tasks that put different requirements on the tree matching algorithms, and Dijkstra-style approaches felt more fitting for diffs than for merging, so that's why I went for GumTree as it seemed to be the state of the art for merging. Does SemanticDiff offer a merge driver? I could only find documentation about diffing on the website.

As to mismatches: yes, they are bound to happen in some cases. Even for line-based diffing, Git uses rather convoluted heuristics to avoid them (with the "histogram" diff algorithm), but they can't be completely ruled out there either. I hope that with enough safeguards (helper to review merges, downstream consistency checks with local fall-back to line-based diffing) they can be lived with. I'm happy to try other matching algorithms if they are more promising though (there isn't much coupling with the rest of the pipeline).

Concerning tree-sitter, I have noticed some small issues, but nothing that was a show-stopper so far. I actually like it that it's designed for syntax highlighting, because it's really helpful that the representations it gives stay faithful to the original source, to avoid introducing reformatting noise in the merging process. Parsers written for a specific language can sometimes be too zealous (stripping comments out, doing some normalizations behind your back). That's a problem in Spork (which uses Spoon, a pretty advanced Java parser). And the uniform API tree-sitter offers over all those parsers is just too good to give up, in my opinion.

DarkPlayer|1 year ago

I don't think that different algorithms are better for merging or diffing. In both cases, the first step is to match identical nodes, and the quality of the final result depends heavily on this step. The main problem with GumTree is that it is a greedy algorithm. One incorrectly matched node can completely screw up the rest of the matches. A typical example we encountered was adding a decorator to a function in Python. When other functions with the same decorator followed, the algorithm would often map the newly added decorator to an existing decorator, causing all other decorator mappings to be "off-by-one". GumTree has a tendency to come up with more changes than there actually are.
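The decorator case can be reproduced with a toy model. This is only a sketch under stated assumptions: each decorator is a `(text, owning function)` pair, `greedy_match` stands in for a greedy first-fit matcher (it is not GumTree itself), the cost function (1 per unmatched new node, 1 per match whose owning function differs) is made up for illustration, and `best_match` uses brute force instead of a real search.

```python
from itertools import permutations

# Old file: b and c are decorated. New file: a also gets the same decorator.
old = [("@cache", "b"), ("@cache", "c")]
new = [("@cache", "a"), ("@cache", "b"), ("@cache", "c")]

def greedy_match(old, new):
    """Map each new node to the first unused old node with the same text."""
    used, mapping = set(), {}
    for i, (text, _) in enumerate(new):
        for j, (old_text, _) in enumerate(old):
            if j not in used and text == old_text:
                used.add(j)
                mapping[i] = j
                break
    return mapping

def cost(mapping, old, new):
    """1 per unmatched new node (insert) plus 1 per match that changes owner."""
    inserts = len(new) - len(mapping)
    moves = sum(1 for i, j in mapping.items() if new[i][1] != old[j][1])
    return inserts + moves

def best_match(old, new):
    """Try every assignment of new nodes to old nodes (or to nothing)."""
    slots = list(range(len(old))) + [None] * len(new)
    candidates = []
    for perm in set(permutations(slots, len(new))):
        candidates.append({i: j for i, j in enumerate(perm)
                           if j is not None and new[i][0] == old[j][0]})
    return min(candidates, key=lambda m: cost(m, old, new))

print(greedy_match(old, new), cost(greedy_match(old, new), old, new))
print(best_match(old, new), cost(best_match(old, new), old, new))
```

The greedy mapping pairs the new decorator on `a` with the old one on `b` and shifts everything after it (cost 3), while the cheapest mapping leaves the new decorator unmatched and keeps the others in place (cost 1).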

We are trying to really nail down the diff quality before going after merges. We don't have merge functionality in SemanticDiff yet.

The main issue we have with tree-sitter is that the grammars are often written from scratch and not based on the upstream grammar definition. Sometimes they only cover the most likely cases, which can lead to parsing errors or incorrectly parsed code. When you encounter parsing errors it can be difficult to fix them, because the upstream grammar is structured completely differently. To give you an example, try to compare the tree-sitter Go grammar for types [1] with the upstream grammar [2]. It is similar, but the way the rules are structured is somewhat inverted.

We use separate executables for the parsers (this also helps to secure them using seccomp on Linux), and they all use the same JSON schema for their output. This allows us to write the parser executable in the most appropriate language for the target language. Building all of them statically and cross-platform for our VS Code extension isn't easy though ;)
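As a rough sketch of that setup (not the real SemanticDiff code or schema): the host starts a parser process, feeds it source on stdin and reads a JSON tree from stdout. The node shape (`type`, `line`, `children`) and the helper `parse_with_external` are assumptions, and the child here is just an inline Python script using the standard `ast` module, which yields an abstract rather than a concrete syntax tree.

```python
import json
import subprocess
import sys

# Stand-in for a per-language parser executable (hypothetical output schema).
PARSER_SCRIPT = r'''
import ast, json, sys

def to_node(n):
    return {
        "type": type(n).__name__,
        "line": getattr(n, "lineno", None),
        "children": [to_node(c) for c in ast.iter_child_nodes(n)],
    }

json.dump(to_node(ast.parse(sys.stdin.read())), sys.stdout)
'''

def parse_with_external(source, command):
    """Run a parser process and decode the JSON tree it prints on stdout."""
    result = subprocess.run(command, input=source, capture_output=True,
                            text=True, check=True)
    return json.loads(result.stdout)

tree = parse_with_external("x = 1\n", [sys.executable, "-c", PARSER_SCRIPT])
print(tree["type"], [child["type"] for child in tree["children"]])
```

Because the host only sees JSON, the command could just as well point at a Go or Rust binary, and the process boundary is where a sandbox such as seccomp would be applied.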

[1]: https://github.com/tree-sitter/tree-sitter-go/blob/master/gr...

[2]: https://go.dev/ref/spec#Types

abathur|1 year ago

> We are currently in the process of moving away from tree-sitter and instead using the parsers provided by the languages themselves where possible.

I imagine this means you're trying to abstract over those parsers somehow? How well is that going, and have you written about your approach?

(I wrote `resholve` to identify and rewrite references to external dependencies in bash/posixy Shell scripts to absolute paths. This is helpful in the Nix ecosystem to confirm the dependencies are known, specified, present, don't shift when run from a service with a different PATH, etc.

It builds on the mostly-bash-compatible OSH parser from the oilshell/oils-for-unix project for the same reasons you're citing.

It would be ~nice to eventually generalize out something that can handle scripts for other shell languages like fish, zsh, nushell, elvish, the ysh part of the oils-for-unix project, etc., but I suspect that'll be a diminishing-return sort of slog and haven't had any lightbulb-moments to make it feel tractable yet.

We also have some ~related needs here around identifying hardcoded or user-controlled exec...)

DarkPlayer|1 year ago

Our parsers simply return the concrete syntax trees in a JSON format. We do not unify all the different syntax constructs into a common AST if that is what you are looking for. The languages and file formats we support are too diverse for that.

The language specific logic does not end with the parsers though. The core of SemanticDiff also contains language specific rules that are picked up by the matching and visualization steps. For example, the HTML module might add a rule that the order of attributes within a tag is irrelevant. So it all comes down to writing a generic rule system that makes it easy to add new languages.
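A minimal sketch of what such a rule could look like, assuming nodes are plain dicts and rules are normalizers registered per node type. The names `RULES`, `rule`, `equivalent` and the `html_tag` node shape are made up for illustration; the actual rule system is not described here.

```python
RULES = {}  # node type -> normalizer applied before two nodes are compared

def rule(node_type):
    """Decorator that registers a language-specific normalizer."""
    def register(fn):
        RULES[node_type] = fn
        return fn
    return register

@rule("html_tag")
def ignore_attribute_order(node):
    # Attribute order within a tag is irrelevant, so compare them sorted.
    return {**node, "attributes": sorted(node["attributes"])}

def equivalent(a, b):
    """Generic comparison step that picks up whatever rules are registered."""
    if a["type"] != b["type"]:
        return False
    normalize = RULES.get(a["type"], lambda n: n)
    return normalize(a) == normalize(b)

old_tag = {"type": "html_tag", "name": "a",
           "attributes": [["href", "/"], ["class", "nav"]]}
new_tag = {"type": "html_tag", "name": "a",
           "attributes": [["class", "nav"], ["href", "/"]]}
print(equivalent(old_tag, new_tag))  # True
```

The matching code stays generic; adding a language means registering more rules rather than touching `equivalent`.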

Sesse__|1 year ago

An important point here is that for certain languages, using the original grammar is pretty much impossible. In particular, for C, you want to do diffing and merging on the un-preprocessed source, but the language's grammar very much assumes the source has gone through the preprocessor.

Of course, the existence of the preprocessor means there are situations where it's completely impossible to know what the correct parse is; it will necessarily be heuristic in some cases.

yencabulator|1 year ago

You can ask Clang for an AST that contains macro definitions and invocations as nodes.

https://docs.rs/clang/latest/clang/struct.Parser.html#method...

https://docs.rs/clang/latest/clang/enum.EntityKind.html#vari...

herrington_d|1 year ago

Hi! ast-grep[1] author here. It is a tree-sitter based tool for syntax-aware code search.

I wonder how you are transitioning from tree-sitter to the languages' built-in parsers? Tree-sitter gives a unified interface to all languages. Using language-native parsers will require significant FFI work for each of them, if I am not wrong.

[1]: https://ast-grep.github.io/

alexpovel|1 year ago

Not the OP, but you raise good points. Performance might also be a concern, thinking of languages like Python and its ast package (not sure that’s accessible without going through the interpreter).

For a tool I’m writing, the tree-sitter query language is a core piece of the puzzle as well. Once you only have JSON of the concrete syntax trees, you’re back to implementing a query language yourself. Not that OP needs it, but ast-grep might?

OJFord|1 year ago

> best effort approach to parsing. This is perfectly fine for syntax highlighting, since the worst that can happen is that a few characters are highlighted incorrectly. However, when diffing or modifying code you really want the code to be parsed according to the upstream grammar, not something that mostly resembles it.

But surely you need to support code that doesn't parse correctly by the actual language's grammar anyway? 'Merge branch fix-syntax-error'

wetneb|1 year ago

In Mergiraf, as soon as there is a parsing error in any of the revisions, it falls back on line-based merging, even though tree-sitter is generally good at isolating the error. It felt like the safest thing to do (maybe we detected the language wrong), but I'm definitely open to reconsidering…
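The fallback logic described above boils down to something like the following sketch. It is not Mergiraf's code: Python's `ast.parse` stands in for the tree-sitter parse, and `choose_strategy` only returns the name of the strategy instead of performing a merge.

```python
import ast

def parses_cleanly(source):
    """Stand-in for 'the parser reported no error nodes' (uses Python's ast)."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def choose_strategy(base, left, right):
    """Use structured merging only if every revision parses without errors."""
    if all(parses_cleanly(rev) for rev in (base, left, right)):
        return "structured"
    return "line-based"

print(choose_strategy("x = 1\n", "x = 2\n", "x = 3\n"))
print(choose_strategy("x = 1\n", "x = (\n", "x = 3\n"))
```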

Gibbon1|1 year ago

Small brained primate comment.

I've wondered if you could add annotation keywords to languages to convert them into something that could be parsed reliably with a tree-sitter grammar.

I say this as someone that feels like you really want diffs that say, changed 'struct x name from y to z' instead of here's a huge list of files with ten changes each.

drawnwren|1 year ago

This may or may not be on your radar, but crypto is desperate for a product like this. Smart contracts are often forks or rewrites (obfuscated or otherwise) of others and an easy interface for end users to be able to see changes between two forks would probably provide a lot of value.

gritzko|1 year ago

Interesting. My students used Language Server Protocol data to make syntax-aware diffs. Very promising project. Unfortunately, everyone moved on. I am currently looking for ways to revitalize it.

https://github.com/shishyando/tokenized-myers-diff

Sparkyte|1 year ago

Good example of when adding an abstraction is more problematic than the process it replaces, which costs maybe an extra minute or two in the flow.