What if Git worked with programming languages?

170 points| LukeEF | 4 years ago |github.com

191 comments

[+] mumblemumble|4 years ago|reply
I would maybe be interested in Git allowing you to plug in your own diff generators for different file types.

But I would not want Git itself trying to understand the contents of files. That seems to me to be an idea that lives on a misconception of the "things programmers believe about names" variety. Not every file in source control is source code. Not every programming language's grammar maps to an abstract syntax tree. In some files, such as makefiles, the difference between tabs and spaces is semantically significant. Some languages (such as Fortran and Racket) have variable syntax. And so on and so forth.

So I think that we really don't want the source control system itself trying to get too smart about the contents of files. That will inevitably make the source control system less compatible with the various kinds of things you might want to put into source control. And it will also make the source control system a lot more complicated than it would otherwise be, in return for a largely theoretical payoff.

But if we want to delegate the work of generating diffs off to other people, so that Git can allow for syntax or semantics-aware diffing without having to personally wade into that quagmire (and perhaps also allowing language communities to support multiple source control systems, a bit like how it works with LSP), that might be an interesting thing to experiment with.

[+] saurik|4 years ago|reply
> I would maybe be interested in Git allowing you to plug in your own diff generators for different file types.

This is already supported.

[+] ffwacom|4 years ago|reply
> Not every programming language's grammar maps to an abstract syntax tree

Are there some examples of this?

[+] madmax96|4 years ago|reply
I disagree. Many engineers want to refactor across a sequence of small PRs, for example. Small PRs are a good thing, because they’re easier to understand. But today, Git makes this painful. Also, understanding how the meaning of code changes over time can help reduce bugs.

The solution will have to be pluggable. But I think it is possible, and there are sane things to do (e.g. fall back to vanilla git) when there are missing plugs.

[+] ironmagma|4 years ago|reply
Not only that, but imagine you realize there is a bug in the parsing tool. Now you have to go back and re-parse the code, or otherwise just deal with a bad history forever. Suddenly you’re storing text again.
[+] afavour|4 years ago|reply
I do kind of love the idea of Git using ASTs instead of source code. It makes a ton of sense.

Even just in the immediate term I wish I could make Git(hub) tabs/2 spaces/4 spaces/whatever agnostic. Seems crazy to me that in 2021 we still have to make opinionated choices across orgs about what to use... why can't we pull the code down, view it in whatever setup we want, then commit a normalized version?

[whispers] this is actually something tabs allow you to do natively by setting custom tab widths in text editors but I've given up trying to sell people on tabs at this point and just want to be able to do my own thing

[+] williamdclt|4 years ago|reply
It's not that you're going too far, it's that you're not going far enough!

It's not a Git question, it's a programming language question. There's no reason source code needs to be stored as plain text[1]! Editors show it as text, we edit it as text, but why wouldn't it be _stored_ as an AST? Not only does formatting become an editor concern, but code could even be edited as a tree, as a graph, as whatever you want.

[1] - well, actually there's plenty of reasons: chiefly because plaintext is very interoperable

[+] enriquto|4 years ago|reply
[whispers] don't give up! There's quite a bunch of us. Our day will come! Long live glorious tabs!
[+] fstrthnscnd|4 years ago|reply
Tabs do work as long as they aren't fixed width (I don't know what you mean by "custom").

For instance, in many languages, one will sometimes have to split a function call to many lines, and in most languages function names aren't of fixed length, thus in order to get a correct alignment for parameters, the tab width at that point will have to match the function name length.

    #include<stdio.h>

    int main(int argc, char* argv[]) {
        printf("%s %d %s %s\n",
               __FILE__,
               __LINE__,
               __DATE__,
               __TIME__);

        return 0;
    }

I agree with your idea of storing a normalized version of the code in the repo: it wouldn't then matter whether that version contains characters to align the code properly, they would just be inserted by the editor/linter as needed. The difficulty is that sometimes linting isn't enough, and some manual formatting is needed. Or perhaps the formatting rules are underspecified?

Another issue with AST diffing is when languages allow some form of syntactic sugar as preprocessing: the compiler might just see the simplified tree, not the one with the "sugary" forms. A tool capable of parsing such languages should also be able to handle these extensions.

[+] Anon_troll|4 years ago|reply
The whitespace and formatting are not significant to the compiler, but they can provide a lot of information to the reader of the code.

You can often see where the writer put the most effort and thought by just seeing how they wrote it. This can help analyzing a codebase considerably.

If everything is normalized, you lose those valuable cues.

[+] thrwyoilarticle|4 years ago|reply
You can also write git hooks to turn their spaces into your tabs & vice versa.
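
A minimal sketch of the conversion such a hook (or a clean/smudge filter) could call, assuming a fixed indent width of 4. The function names `spaces_to_tabs` and `tabs_to_spaces` are made up for illustration. Only leading whitespace is touched, and leftover spaces used for alignment are kept as spaces.

```python
import re


def spaces_to_tabs(text, width=4):
    """Replace each full group of `width` leading spaces with one tab.

    Leftover spaces (alignment) are kept after the tabs.
    """
    def convert(match):
        count = len(match.group(0))
        return "\t" * (count // width) + " " * (count % width)

    return re.sub(r"^ +", convert, text, flags=re.MULTILINE)


def tabs_to_spaces(text, width=4):
    """Expand each leading tab to `width` spaces."""
    return re.sub(
        r"^\t+",
        lambda match: " " * (width * len(match.group(0))),
        text,
        flags=re.MULTILINE,
    )
```

With these two functions a round trip leaves the file unchanged as long as every author uses the same `width`, which is the assumption that makes the normalized version in the repository stable.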
[+] pbiggar|4 years ago|reply
fwiw, this is what we do in Dark [1]. We store (serialized) ASTs, then we pretty print them in the editor. This converts the AST into tokens that you see on your screen, complete with configurable* indentation, line-length, etc. Code would be displayed according to your config* and the same code displayed differently to a different developer looking at the same code.

[1] https://darklang.com

* I haven't actually enabled users to configure this, but it's just some variables called `indent` and `lineLength` in the code

[+] geofft|4 years ago|reply
One of the practical issues here is, if your code fails to compile in CI with an error like

    /home/ci/src/foo.c:123:45: error: use of undeclared identifier 'a'
or

    /home/ci/src/bar.py:50: syntax error in type comment
or crashes in production with an error like

    java.lang.NullPointerException
        at com.example.Baz.doThings(Baz.java:1337)
you really want to be able to find line 123 column 45, line 50, or line 1337 in your editor, and have that be the same line as what your CI compiled and deployed.

On its own, tabs vs. spaces only affects columns, and you can probably figure things out without columns (although it's a shame to lose them). But different tab sizes affect how long your lines are, and line wrapping is a thing that people care about at least as much as tabs vs. spaces (people with different-size monitors or fonts will easily see too-long or too-short lines on their display; if your spaces are equivalent to the tab stop, the distinction is literally invisible). And once you start rewrapping lines, everyone's line numbers are different.

I think it's possible to solve this by using some sort of AST-based index into the file and teaching IDEs to let you seek based on that, but it's suddenly a more complex problem.
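
One way to picture such an AST-based index, sketched with Python's `ast` module (Python 3.9 or newer assumed): record the chain of child positions leading to the node under a line/column, then follow the same chain in a differently formatted copy. The helpers `node_path` and `resolve` are hypothetical names, and the sketch assumes both copies parse to the same tree shape.

```python
import ast


def covers(node, lineno, col):
    """True if the node's source span contains (lineno, col)."""
    if not hasattr(node, "lineno"):
        return False  # e.g. operators and argument lists carry no position
    start = (node.lineno, node.col_offset)
    end = (node.end_lineno, node.end_col_offset)
    return start <= (lineno, col) < end


def node_path(tree, lineno, col):
    """Child indices from the root to the innermost node at (lineno, col).

    lineno is 1-based and col is 0-based, as reported by `ast`.
    """
    path, node = [], tree
    while True:
        for index, child in enumerate(ast.iter_child_nodes(node)):
            if covers(child, lineno, col):
                path.append(index)
                node = child
                break
        else:
            return path


def resolve(tree, path):
    """Follow a path produced by node_path in another tree of the same shape."""
    node = tree
    for index in path:
        node = list(ast.iter_child_nodes(node))[index]
    return node


ci_copy = "def f():\n    return a + b\n"
my_copy = "def f():\n\treturn (a +\n\t\tb)\n"
path = node_path(ast.parse(ci_copy), 2, 15)  # the `b` in CI's formatting
local = resolve(ast.parse(my_copy), path)    # the same `b` in mine
```

Here `local.lineno` is the line to jump to in the locally formatted file. An error message would have to carry the path (or CI would have to translate it) for this to be usable, which is the extra complexity mentioned above.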

[+] BiteCode_dev|4 years ago|reply
Yes, but only if it falls back to text diff as soon as there is the smallest doubt it can't provide a good AST diff.
[+] thefreeman|4 years ago|reply
If you append `?w=1` to the diff view URL on a pull request it makes it whitespace agnostic just FYI
[+] convolvatron|4 years ago|reply
having presentation be flexible and different from the underlying model is a great idea for code

but admit it, tabs are fragile and a pretty weak implementation

[+] mabbo|4 years ago|reply
Reading this article, I feel as though the author doesn't deeply understand git.

git works on blobs of data, not files, and not lines of text. It doesn't just happen to also work on binary files- that's all it works on.

Now, if the author is suggesting that git-diff ought to have a language specific mode that parses changed files as ASTs to compare, now I'm interested. Let's do that. I'll help!

But git does not need to change how it works for that to happen. Git does not even need git-diff to exist to serve its main purpose.
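
To make the "blobs of data" point concrete, here is a small sketch of how a blob's object ID is derived in a repository using the default SHA-1 object format: a short header plus the raw bytes, with no notion of lines or encoding. The function name `git_blob_id` is made up for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob holding these bytes."""
    header = b"blob %d\x00" % len(content)  # "blob <size>" followed by a NUL byte
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Two files that differ only in line endings or indentation hash to different IDs, so at this layer they are simply different objects.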

[+] tux3|4 years ago|reply
Note that git does work with diffs a lot.

Rebases and cherry-picks work by applying diffs, not by copying blobs. Auto-merging also needs to look at file content as text, you can't auto-merge a binary file with git.

It's an often repeated fact that if you look inside Git, it doesn't work with diffs, it works with blobs. But if you look closer, it's often diffs again!

[+] munk-a|4 years ago|reply
There's also a historical angle here that's important to inspect - Git was designed to specifically be content agnostic. There are some predecessors in the SCM space (like VSS) that are specifically language aware and allow the checking out of line ranges (pinning them so that no one else will make a conflicting change specifically) and even entire functions - these systems can cause a lot of grief while failing to protect the logic they're specifically trying to protect. As the warts on SVN got more and more visible I think the general assumption was that the replacement SCM would come out of this code aware space - but it didn't and in retrospect we all dodged a huge bullet when that happened.

I absolutely adore tooling around git that makes diffs more visible - one thing I absolutely gush over is anything that can detect and highlight function reordering... however, the core process of merging and rebasing and all that jazz - I don't think we're going to find anything automated that I'll ever trust when I'm not working on a ridiculously clean codebase - minor changes can have echo effects and when two people are coding in the same general area they need to be aware of what the other person is trying to do.

[+] hardwaregeek|4 years ago|reply
I dunno I feel like you're focusing on a detail that's not particularly relevant. The author's main thrust is precisely what you described about parsing changed files as ASTs.
[+] nerdponx|4 years ago|reply
Storing AST instead of source code is one of the goals of the very interesting Unison programming language: https://www.unisonweb.org/

Part of what's nice about Git (and plain text in general) is that it's the lowest common denominator for a lot of things. This is why traditional Unix tools are built oriented around streams of bytes. Text is a low level carrier protocol; you can encode almost anything in it, but you need to agree on some kind of format.

The good part is that you can use very very generic tools on almost arbitrary pieces of data. The bad part is that you might have to do a lot of parsing and re-parsing of the same data, and you have to contend with the dangers of underspecified formats.

Git follows the Unix tradition in this regard. As a result, it is nearly universal in what it can store. You can use it to store pretty much anything, but you are now at the lowest common denominator of support for any particular data format.

Git-for-ASTs will no longer have this universality property, but will gain a lot more power in the covered domain. This is a design tradeoff.

One thing that's nice about Git is that you can specify arbitrary diff drivers with the "attributes" system. So even if the Git database is storing plain text, your diff driver can parse your source code into ASTs and present AST diffs to you when you run `git diff`. Perhaps more impressive, you can configure custom merge drivers, so you can (theoretically) implement semantic merging of ASTs right inside Git.

There are probably some fundamental limitations of this system, because the underlying data is still stored as blobs of bytes. But you can get pretty far as long as you don't mind parsing and re-parsing the same text over and over.

[+] ssivark|4 years ago|reply
Has this approach been tried? (Unison or otherwise…)
[+] Jensson|4 years ago|reply
I don't see how this could ever work on evolving languages: different Git versions would produce different commits and read commits differently based on the latest C++ standard. This would potentially lead to version control bugs where different Git versions create different results from the same commit. That is horrible; version control needs to be 100% bug free in that regard.

The only reasonable application would be to use a language AST parser to better identify relevant text diffs, but the commits still need to be stored as text.

[+] dboreham|4 years ago|reply
This doesn't really make sense, because in order to have those code changes compile correctly, there must be a corresponding commit to CI config that changes the compiler version or compiler switches for the new language version. The "semantic-diff-er" can also be driven by that commit such that it uses the correct language version.
[+] shepherdjerred|4 years ago|reply
Commits could be stored as is, the difference would be that diffs are clearer when presented to a human.
[+] pkghost|4 years ago|reply
How is this different from any other problem that is already solved by version pegging?
[+] Karellen|4 years ago|reply
`git` generally doesn't work with lines of text. Mostly it works with opaque file blobs and directory trees.

`git diff` and `git merge` work with lines of text by default - but they don't have to. You can supply your own `diff` and `merge` tools with the `difftool.*` and `mergetool.*` config options, try them out with `git-difftool` and `git-mergetool` commands, and set the default with the `diff.tool` and `merge.tool` config options.

If someone wanted to create AST-based diff and merge tools for a given language, they could be plugged right into the existing `git` infrastructure and it would work with them absolutely fine.

[+] bspammer|4 years ago|reply
This feature is useful in so many different places. I use it to diff small encrypted files in my repo - just add `gpg -d` as a diff configuration and now I can use git log, diff etc in a meaningful way with binary files.

I've heard of people using it with pdfs as well - a pdf to html converter lets you get a good idea of what changed in the document.

[+] dTal|4 years ago|reply
What if generating a diff is nontrivial? Say you rename an identifier. That might be a single command in an IDE. A sufficiently high-level "diff" format could easily capture that intent. But working backwards from hundreds of touched lines across many files to deduce that single semantic edit is not trivial. Git assumes that arbitrary diffs can be deduced from "before" and "after" files, but this isn't the case - it may be that you'd rather generate the new file from the diff!
[+] colonwqbang|4 years ago|reply
Yes, I think this article is coming at it from the wrong end. Git is hardly the problem here, nor is it going to provide the solution.

The problem seems to be that we are lacking the format and the toolchain to manipulate it, and that is not the fault of git.

What is the state of the art in this area? Does somebody know of a viable format and toolchain, or any interesting projects looking to build them?

[+] kapep|4 years ago|reply
> If someone wanted to create AST-based diff and merge tools for a given language, they could be plugged right into the existing `git` infrastructure and it would work with them absolutely fine.

There's a lot of tooling in the Eclipse modelling ecosystem which could be easily used for this. Storing XML-based models in git is no problem and there's tooling for diffing and merging models via a GUI or programmatically. Combined with the fact that xtext DSLs use EMF models to represent ASTs, it wouldn't be too hard to glue together an AST-based diff/merge tool for an xtext DSL.

[+] kmeisthax|4 years ago|reply
Indeed. The Composer merge driver is critical for being able to work with modern PHP frameworks without tearing your hair out on every merge.

Merge drivers are Git's most powerful and least known feature, and I really wish they were more common.

[+] rileymat2|4 years ago|reply
> `git` generally doesn't work with lines of text. Mostly it works with opaque file blobs and directory trees.

I am not sure this is true.

In the past it gave me problems with line ending normalization between windows/mac/linux, in and out. In those cases it definitely had a lines of text view of things.

[+] zomglings|4 years ago|reply
I maintain a free/open source project that does exactly what the author asks for: https://github.com/bugout-dev/locust.

Our tool uses git as the foundation of its functionality. It superimposes git diffs on top of ASTs.

It is insanely powerful.

For example, we use it to power semantic code search and currently support Python, Javascript, and Java. We generate a JSON object describing the AST differences between initial and terminal commits on GitHub PRs. A full text search on the JSON objects performs surprisingly well when we want to answer questions like, "When did we add dateutils as a dependency?" or "When did we last change the /journals handler on the API?"

The Python integration currently sees the most use but if you are interested in other languages, we would be happy to support them.

Do drop me a DM if you want help getting started with Locust.
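
For readers wondering what a JSON description of AST differences might look like, here is a rough sketch for Python sources (Python 3.9 or newer assumed). It is not Locust's implementation or output format; the names `top_level_defs` and `summarize_changes` and the three keys are invented for illustration.

```python
import ast
import json


def top_level_defs(source: str) -> dict[str, str]:
    """Map each top-level function or class name to a formatting-independent dump."""
    kinds = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    return {
        node.name: ast.dump(node)
        for node in ast.parse(source).body
        if isinstance(node, kinds)
    }


def summarize_changes(old_src: str, new_src: str) -> str:
    """Describe added, removed and changed definitions as a JSON document."""
    old, new = top_level_defs(old_src), top_level_defs(new_src)
    summary = {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(
            name for name in old.keys() & new.keys() if old[name] != new[name]
        ),
    }
    return json.dumps(summary, indent=2)
```

Because the comparison uses `ast.dump`, a definition that was only reindented or rewrapped does not show up under `changed`.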

[+] tombert|4 years ago|reply
I would definitely support a Lisp-centric Git.

Whenever I do Clojure, something that can get difficult when working with multiple people is how the parentheses/brackets/braces stack up, especially when everyone seems to have different opinions on how that works. As a result, if you're not careful, when there's a merge conflict you can have a ton of extra parentheses, which can be irritating to debug.

Obviously this is at some level an issue inherent to Lisps (and to be clear, I love Lisps, and these small headaches are worth it), but I think problems like that could be reduced if our source controls were aware of the ASTs.

[+] ClassAndBurn|4 years ago|reply
Git is designed to require human oversight. This is usually a feature, but in recent years has become a bug with things like GitOps.

It's important to remember that Git is a terrible database because of its lack of semantic structure. All conflicts require a human who has the context. This is why almost no one builds a system that uses Git as a two-way interface. And when they do, it's via GitHub Pull Requests (which go to humans) and not Git itself.

In all, this makes it a wonderful general purpose shared filesystem. And that's about it.

[+] tomxor|4 years ago|reply
> The fact that git works on lines of text [...] we could be looking at the alterations to the abstract syntax tree.

Fundamentally git does not operate on text, it operates on files (content addressed SCM not a ledger of text diffs); diffs are generated upon request between arbitrary Merkle trees. So there is no need to implicate git in such a tool, it can be independent:

       GIT_EXTERNAL_DIFF
           When the environment variable GIT_EXTERNAL_DIFF is set, the program
           named by it is called to generate diffs, and Git does not use its
           builtin diff machinery. For a path that is added, removed, or
           modified, GIT_EXTERNAL_DIFF is called with 7 parameters:

               path old-file old-hex old-mode new-file new-hex new-mode
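
A minimal sketch of a program that could be named by `GIT_EXTERNAL_DIFF`, assuming the files are Python source and Python 3.9 or newer is available. It compares indented AST dumps instead of raw lines and falls back to a plain line diff when a file does not parse. The function names are made up, and the output format is just `difflib`'s unified diff, not anything Git prescribes.

```python
#!/usr/bin/env python3
import ast
import difflib
import sys


def ast_lines(source: str) -> list[str]:
    """Render source as an indented AST dump, one entry per line."""
    return ast.dump(ast.parse(source), indent=2).splitlines()


def structural_diff(old_src: str, new_src: str, path: str) -> list[str]:
    """Diff two versions by AST; fall back to text lines if parsing fails."""
    try:
        old, new = ast_lines(old_src), ast_lines(new_src)
    except (SyntaxError, ValueError):
        old, new = old_src.splitlines(), new_src.splitlines()
    return list(
        difflib.unified_diff(old, new, "a/" + path, "b/" + path, lineterm="")
    )


def read_text(filename: str) -> str:
    with open(filename, encoding="utf-8", errors="replace") as handle:
        return handle.read()


# Git passes: path old-file old-hex old-mode new-file new-hex new-mode
if __name__ == "__main__" and len(sys.argv) >= 8:
    path, old_file, _old_hex, _old_mode, new_file, _new_hex, _new_mode = sys.argv[1:8]
    print("\n".join(structural_diff(read_text(old_file), read_text(new_file), path)))
```

A change that only touches whitespace produces identical dumps, so the script prints nothing for it.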
[+] maweki|4 years ago|reply
Working on the AST is quite an interesting idea, until your comments aren't in the AST and you want to commit work in progress that contains a syntax error.

Not to mention changing ASTs (while maintaining concrete syntax) in different versions of the language.
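
Both problems are easy to reproduce with Python's own `ast` module (Python 3.9 or newer for `ast.unparse`). This is only a demonstration for one language; other parsers may keep comments or tolerate errors.

```python
import ast


def roundtrip(source: str) -> str:
    """Parse to an AST and print it back; comments and layout are not kept."""
    return ast.unparse(ast.parse(source))


def parses(source: str) -> bool:
    """True if the source can be turned into an AST at all."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


with_comment = "x = 1  # the unit is seconds\n"
print(roundtrip(with_comment))              # prints: x = 1
print(parses("def unfinished(:\n    pass\n"))  # prints: False
```

A store built purely on this AST would lose the comment and could not accept the unfinished function at all.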

[+] shoo|4 years ago|reply
There's a good blog post about auto-merging JSON/XML structured data files (for game content) on the bitsquid blog from 2010:

> having content conflicts is no fun either. A level designer wants to work in the level editor, not manage strange content conflicts in barely understandable XML-files. The level designer should never have to mess with WinMerging the engine's file formats.

> And conflicts shouldn't be necessary. Most content conflicts are not actual conflicts. It is not that often that two people have moved the exact same object or changed the exact same settings parameter. Rather, the conflicts occur because a line-based merge tool tries to merge hierarchical data (XML or JSON) and messes up the structure.

> In those rare cases when there is an actual conflict, the content people don't want to resolve it in WinMerge. If two level designers have moved the same object, we don't really help them address the issue by bringing up a dialog box with a ton of XML mumbo-jumbo. Instead, it is much better to just pick one of the two locations and go ahead with merging the file. Then, the level designers can fix any problems that might have occurred in the level editor -- the right tool for the job.

-- http://bitsquid.blogspot.com/2010/06/avoiding-content-locks-...
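
The strategy in the quote can be sketched as a three-way merge over parsed JSON: merge key by key, and when both sides really changed the same value, just pick one. This is an illustration and not the bitsquid tool; `merge3` is a made-up name, the choice of "theirs" is arbitrary, and lists are treated as single values.

```python
_MISSING = object()  # marks a key that is absent on one side


def merge3(base, ours, theirs):
    """Three-way merge of JSON-like values; on a true conflict keep `theirs`."""
    if ours == theirs:
        return ours      # both sides agree
    if ours == base:
        return theirs    # only they changed it
    if theirs == base:
        return ours      # only we changed it
    if all(isinstance(value, dict) for value in (base, ours, theirs)):
        merged = {}
        for key in sorted(base.keys() | ours.keys() | theirs.keys()):
            value = merge3(
                base.get(key, _MISSING),
                ours.get(key, _MISSING),
                theirs.get(key, _MISSING),
            )
            if value is not _MISSING:  # a deleted key stays deleted
                merged[key] = value
        return merged
    return theirs        # real conflict: pick one and move on


base = {"crate": {"pos": [0, 0], "hp": 10}}
ours = {"crate": {"pos": [5, 5], "hp": 10}}
theirs = {"crate": {"pos": [0, 0], "hp": 20}, "lamp": {"pos": [2, 2]}}
merged = merge3(base, ours, theirs)
```

Here one designer moved the crate and the other changed its `hp` and added a lamp; a line-based merge might flag a conflict, while `merged` contains all three edits.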

[+] CodeIsTheEnd|4 years ago|reply
I don't understand why GitHub hasn't solved the issue of diffs starting with a '}' (or ')' or 'end'). Just slide the diff over while it starts with a closing token! I suppose it's an artifact of the diffing algorithm, but aren't there better diffing algorithms, even built-in within git?

This is by far the most obvious example of "git doesn't understand programming languages", but it also seems like the most straightforward to fix.
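
The "slide the diff over" idea can be sketched in a few lines. An inserted block may move down one line without changing the resulting file whenever the line leaving its top equals the line entering at its bottom. This is an illustration only, not the heuristic Git actually implements (Git has its own `--indent-heuristic` option for a similar purpose); `slide_insertion` and `CLOSING_TOKENS` are made-up names.

```python
CLOSING_TOKENS = {"}", ")", "]", "end"}


def slide_insertion(lines, start, end):
    """Shift the inserted block lines[start:end] down while it starts with a closer.

    Sliding by one line is only valid when lines[start] == lines[end], because
    then removing either block from `lines` yields the same old file.
    """
    while (
        start < end < len(lines)
        and lines[start].strip() in CLOSING_TOKENS
        and lines[start] == lines[end]
    ):
        start += 1
        end += 1
    return start, end


new_file = ["void a() {", "}", "", "void b() {", "}"]
# A diff might report the insertion as lines 1..3: "}", "", "void b() {"
print(slide_insertion(new_file, 1, 4))  # prints: (2, 5)
```

After sliding, the reported insertion is the blank line plus the whole of `b()`, which is how a person would describe the change.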

[+] aardvark179|4 years ago|reply
I’ve done quite a lot of work on version management on structured data (in my case this was for a version managed GIS database) and it’s not an easy problem, and is likely even harder with something like an AST that is generated from a text file and so does not preserve the identity of nodes. I’m not saying that it’s impossible, but it is more work and requires more tooling around it than people think, and it keeps coming up here and other places as a, “really good idea.”
[+] ufo|4 years ago|reply
I'm trying to remember the citation, but I remember seeing a presentation once from someone who studied this and they said that the thing that worked best was a hybrid approach: use structured diff at the top level of the program (modules / methods) but use line-based for statements and expressions. According to them, the structured diff can give unintuitive results if applied at the lowest syntactic levels.
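
A rough sketch of that hybrid idea for Python sources (Python 3.9 or newer assumed), not taken from the presentation mentioned above: match top-level definitions by name using the AST, then fall back to an ordinary line diff inside each definition that changed. `top_level_chunks` and `hybrid_diff` are hypothetical names.

```python
import ast
import difflib


def top_level_chunks(source: str) -> dict[str, list[str]]:
    """Split a module into named top-level definitions, keeping their source lines."""
    lines = source.splitlines()
    kinds = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    return {
        node.name: lines[node.lineno - 1 : node.end_lineno]
        for node in ast.parse(source).body
        if isinstance(node, kinds)
    }


def hybrid_diff(old_src: str, new_src: str) -> list[str]:
    """Structured matching at the top level, line-based diff inside definitions."""
    old, new = top_level_chunks(old_src), top_level_chunks(new_src)
    report = []
    for name in sorted(old.keys() | new.keys()):
        if name not in old:
            report.append(f"added {name}")
        elif name not in new:
            report.append(f"removed {name}")
        elif old[name] != new[name]:
            report.append(f"changed {name}")
            report.extend(
                difflib.unified_diff(old[name], new[name], lineterm="", n=0)
            )
    return report
```

Since definitions are matched by name rather than position, simply reordering functions produces an empty report.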
[+] alkonaut|4 years ago|reply
I’d give anything just to get a few basic merge modes. For example “this file can treat two one line additions as unordered”.

So any shared append-only file (a change log, an enumeration,…) doesn’t automatically conflict.

Syntax aware diffing would be great too, but I’d take something much simpler. For syntax aware stuff I’d love something that could tell semantic changes from noise.
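
The "unordered one-line additions" mode can be sketched as follows. Git ships a built-in `union` merge driver that can be selected per path in `.gitattributes` and behaves in a similar spirit; the code below is only an illustration of the idea, with the made-up name `union_merge`, and it assumes an append-only file with no duplicate or deleted lines.

```python
def union_merge(base, ours, theirs):
    """Keep every line from base, then append lines added by either side, once."""
    merged = list(base)
    seen = set(base)
    for line in ours + theirs:
        if line not in seen:
            merged.append(line)
            seen.add(line)
    return merged


base = ["1.0: initial release"]
ours = base + ["1.1: fix crash on empty input"]
theirs = base + ["1.1: add --verbose flag"]
print(union_merge(base, ours, theirs))
```

Both new changelog entries survive without a conflict; the order between them is whatever the merge happens to pick, which is acceptable exactly when the additions are unordered.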