top | item 47022679

(no title)

gfody | 14 days ago

I've had this idea too, and think about it everytime I'm on a PR with lots of whitespace/non-functional noise how nice it would be if source code wern't just text and I could be looking at a cleaner higher level diff instead.. I think you have to go higher than AST though, it should at least be language-aware

discuss

gritzko|14 days ago

(Author) In my current codebase, I preserve the whitespace nodes. Whitespace changes would not affect the other nodes though. My first attempt to recover whitespace algorithmically not exactly failed, but more like I was unable to verify it is OK enough. We clang-format or go fmt the entire thing anyway, and whitespace changes are mostly noise, but I did not find 100% sure approach yet.

gfody|14 days ago

I think about eg the "using" section at the top of a .cs file where order doesn't matter and it's common for folks to use the "Remove and Sort Usings" feature in VS.. if that were modeled as a set then diffs would consist only of added/removed items and a re-ordering wouldn't even be representable. And then every other manner of refactor that noises up a PR: renaming stuff, moving code around, etc. in my fantasies some perfect high-level model would separate everything that matters from everything that doesn't and when viewing PRs or change history we could tick "ignore superficial changes" to cut thru all the noise when looking for something specific

..to my mind such a thing could only be language-specific and the model for C# is probably something similar to Roslyn's interior (it keeps "trivia" nodes separate but still models the using section as a list for some reason) and having it all in a queryable database would be glorious for change analysis

zelphirkalt|14 days ago

Some languages are unfortunately whitespace sensitive, so a generic VCS cannot discard whitespace at all. But maybe the diffing tools themselves could be made language aware and hide not meaningful changes.

em-bee|14 days ago

hiding not meaningful changes is not enough. when a block in python changes the indentation, i want to see that the block is otherwise unchanged. so indentation changes simply need to be marked differently. if a tool can to that then it will also work with code where indentation is optional, allowing me to cleanly indent code without messing up the diff.

i saw a diff tool that marked only the characters that changed. that would work here.

procaryote|14 days ago

You can build a mergetool (https://git-scm.com/docs/git-mergetool)

WorldMaker|13 days ago

In my investigations ages ago [1] I felt the trick was to go lower that an AST. ASTs by nature generally have to be language-aware and vary so much from language to language that trying to generally diff them is rough. I didn't solve the language-aware part, but I did have some really good luck with using tokenizers intended for syntax highlighting. Because they are intended for syntax highlighting they are fast, efficient, and generally work well with "malformed"/in-progress works (which is what you want for source control where saving in progress steps can be important/useful/necessary).

It still needs to be language-aware to know which token grammar to use, but syntax highlighting as a field has a relatively well defined shared vocabulary of output token types, which lends to some flexibility in changing the language on the fly with somewhat minimal shifts (particularly things like JS to TS where the base grammars share a lot of tokens).

I didn't do much more with it than generate simple character-based diffs that seemed like improvements of comparative line-based diffs, but I got interesting results in my experiments and beat some simple benchmarks in comparing to other character-based diff tools of the time.

(That experiment was done in the context of darcs exploring character-based diffs as a way to improve its CRDT-like source control. I still don't think darcs has the proposed character-based patch type. In theory, I could update the experiment and attempt to use it as a git mergetool, but I don't know if it provides as many benefits as a git mergetool than it might in a patch theory VCS like darcs or pijul.)

[1] https://github.com/WorldMaker/tokdiff