Abstract Syntax Tree for Patching Code and Assessing Code Quality

163 points | sidcool | 4 years ago | engineering.soroco.com

58 comments

[+] neilv|4 years ago|reply
AST-based programmatic source code transformation is a very useful approach to have in your software engineering toolbox.

The authors correctly identify in section "Limitations" that you want to `unparse` only minimal changes, and avoid losing comments and formatting.

I wrote a library to support doing similar things, with this in mind:

https://www.neilvandyke.org/racket/progedit/

Not clear from the documentation, but, although this particular library runs in Racket, it uses Racket's "syntax objects", so it can process not only Racket source code, but pretty much any language with a parser implemented in Racket. Also not clear from the documentation is that (although the example doesn't use it), syntax objects can also represent comments, and exact positional information, and can be annotated with things like type information and identifier resolution, plus there's always pretty printers... so you can in theory do very nice programmatic source code manipulations.

I originally wrote `progedit` in 2012, for updating a package metadata file that's updated both manually and automatically. (And then made a point to factor out this functionality into a separate generalized open source library, and release it, as good reuse practice.) I later got a nice email from someone who said they'd used `progedit` to refactor millions of lines of code, which they thought would've been nonviable using regexps.

[+] heinrichhartman|4 years ago|reply
For me, this is the critical bit here:

> One of the problems of patching code using ast package of Python is that it loses all the formatting and comments of the original source code. This can be solved by making the patch script a little smarter. Instead of having it unparse the entire patched AST and write that to disk, we can make it unparse only the nodes it modified and insert the modified code at the corresponding line number in the file. The ast nodes have lineno attribute that can be used to retrieve the line number of the file to be injected the patched code with.

Parsing & Patching the AST is relatively easy, with the provided tooling. I always struggled to write the modifications out in a minimal way. Looks like line-number information does get you over this hump.
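A minimal sketch of that lineno-based splice with the stdlib `ast` module (Python 3.9+ for `ast.unparse`; the `TIMEOUT` constant and new value are made up for illustration):

```python
import ast

src = """# configuration
TIMEOUT = 30
RETRIES = 3
"""

tree = ast.parse(src)
lines = src.splitlines()
for node in tree.body:
    # Find the TIMEOUT assignment and change its value in the AST.
    if (isinstance(node, ast.Assign)
            and isinstance(node.targets[0], ast.Name)
            and node.targets[0].id == "TIMEOUT"):
        node.value = ast.Constant(value=60)
        # Unparse only the modified node and splice it back at node.lineno,
        # leaving every other line (and its comments) untouched.
        lines[node.lineno - 1] = ast.unparse(node)

patched = "\n".join(lines) + "\n"
```

Comments on untouched lines survive; a comment on the modified line itself would still be lost, which is exactly the limitation the article mentions.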

EDIT: Looks like the Python community already has a package for this: https://github.com/PyCQA/redbaron

[+] shpx|4 years ago|reply
Autoformatting the code with Black before and after unparsing is how I avoided this problem.
[+] megous|4 years ago|reply
I wrote a tool for refactoring JS a few years ago, and the way I solved this was to not modify the AST and regenerate the code from it, but to use only its position information and patch the source code directly based on that.
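The same trick in Python, using the AST purely for positions (Python 3.8+ for `end_col_offset`; the `compute`/`compute_v2` names are invented, and a single-line source is assumed so column offsets index straight into the string):

```python
import ast

src = "total = compute(a, b)  # keep this comment\n"

# Use the AST only for position information; patch the text directly.
call = ast.parse(src).body[0].value   # the compute(...) Call node
name = call.func                      # the ast.Name node for "compute"
start, end = name.col_offset, name.end_col_offset
patched = src[:start] + "compute_v2" + src[end:]
```

Because the surrounding text is never regenerated, trailing comments and spacing come through untouched.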
[+] oefrha|4 years ago|reply
> One of the problems of patching code using ast package of Python is that it loses all the formatting and comments of the original source code.

What you actually want here is a concrete syntax tree preserving all the comments and whitespace. There’s a sort of built-in one as part of lib2to3, used by the likes of black and yapf. There are also good third-party options now, such as https://github.com/Instagram/LibCST.

There’s been talk about adding a CST module to the Python standard library, but I don’t think the proposal went anywhere.

https://bugs.python.org/issue33337

[+] 1337shadow|4 years ago|reply
Python's ast module is fantastic; we use it both to compile Python to Michelson[0], an assembly language for the Tezos blockchain, and Python to JavaScript[1]. However, for refactoring one might want to consider redbaron[2].

[0] https://yourlabs.io/pyratzlabs/pymich [1] https://yourlabs.io/oss/ryzom/#javascript [2] https://redbaron.readthedocs.io/en/latest/

[+] erezsh|4 years ago|reply
> Python to JavaScript

That's cool. I'm working on something very similar, with the goal of generating idiomatic Javascript, which seems to be a goal for your code as well.

I did make a few different choices.

- Instead of worrying about formatting and indentation, I just output whatever I feel like and run Prettier afterwards. That way I don't have to worry about whitespace, indents, or superfluous parentheses.

- I integrated it with type analysis, so that operators like `+=` will produce different code for each type (ints vs lists, for example).

- I also include the comments from the Python source in the target JavaScript.

But these things aside, I can see a lot of similarities!

[+] stevekemp|4 years ago|reply
Agreed. I've not used it much, but I've done some fun things with it - mostly relating to adding linting checks for pipelines.

For example, we use troposphere to generate CloudFormation templates, so I wrote a linter which would ensure that every time we tried to create an elastic container registry on AWS we explicitly set up "security scanning" to be "on".

That kind of thing is trivial to write when you have the AST to manipulate:

* Find all calls to function "foo(...)".

* Error out if there is no parameter "scan_on_push = True".
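A sketch of such a check with Python's `ast` module (the `Repository` name and `scan_on_push` keyword stand in for the real troposphere API, which is more involved):

```python
import ast

def lint_scan_on_push(source, func_name="Repository"):
    """Return linenos of calls to func_name missing scan_on_push=True.

    func_name and the keyword name are simplified stand-ins for the
    actual troposphere resource and property names.
    """
    problems = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == func_name):
            ok = any(
                kw.arg == "scan_on_push"
                and isinstance(kw.value, ast.Constant)
                and kw.value.value is True
                for kw in node.keywords
            )
            if not ok:
                problems.append(node.lineno)
    return problems
```

Wire the returned line numbers into a CI failure message and you have a pipeline lint in a few dozen lines.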

[+] anitil|4 years ago|reply
I love it as well. I'm working on a Python AST viewer tool at the moment [0] so I've been spending a bit of time playing with it lately.

I'm thinking of calling it Pasta (Python AST view....A? I don't have a word that starts with A).

[0] I only really have a gif - https://imgur.com/a/4YYdfmZ

[+] eska|4 years ago|reply
Trivial AST editing is what I severely miss when using languages that aren't based on s-expressions. Diffing ASTs, searching for patterns in ASTs, patching ASTs, ... can literally be implemented from scratch in a few lines of code.
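In Python, at least the structural-comparison half of that stays short, since `ast.dump` omits position attributes by default:

```python
import ast

def ast_equal(src_a, src_b):
    """True if two snippets parse to the same tree, ignoring formatting."""
    return ast.dump(ast.parse(src_a)) == ast.dump(ast.parse(src_b))
```

A whitespace-insensitive diff or pattern search falls out of the same idea, though nothing about it is as uniform as walking s-expressions.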

Meanwhile I tried the same with C++, but it was a pain every time. For example, I remember using the LLVM Python API and it was just unstable (functions were removed, etc.) and mostly undocumented.

[+] dasyatidprime|4 years ago|reply
Tangential curiosity: S-expression languages with lexical binding seem like they still require context-sensitive codewalkers in order to reliably search for patterns that involve any literal symbol that isn't constrained out of that mechanism¹, don't they? What's your strategy for this, just ignore it and hope for the best, or punt to interactive handling for any cases that look slightly tricky? (Or is that still small enough to count as "few lines"?)

¹ For instance, I believe CL forbids lexically binding symbols from the CL package in ways that might conflict with the original definitions. But if you want to find uses of a particular library call from some other non-special package in a sound way, then you have to assume someone might flet that name, due to the way locals work. The underlying codewalker issue is roughly dual to the functioning of non-hygienic macros.

[+] Aissen|4 years ago|reply
I did this recently with both Lua (luaparse) and JavaScript (babel), and this is a fundamental tool to have in your toolbox when working with large codebases. JavaScript has great tooling for this (see AST Explorer and all the different libs it supports).

Before the 1.0 promise, Go used "go fix" to automatically update old code, and the standard library even has tools to manipulate an AST.

The Linux kernel uses Coccinelle for C code, which goes one step further and lets you write "semantic patches": a DSL for doing this compactly, without having to write code.

[+] ievans|4 years ago|reply
Sounds like you know the language-specific AST tooling for your stack well, but if you are looking for AST patching that uses a generic approach across many languages, I'm a maintainer for a project named Semgrep (https://en.wikipedia.org/wiki/Semgrep#History) which was created by one of the original coccinelle authors.

It supports many more languages (~17 at various stages of development) and being able to do AST patching as in the original is one of the capabilities we're experimenting with: https://semgrep.dev/docs/experiments/overview/#autofix

Would love your feedback!

[+] chgibb|4 years ago|reply
What was your solution to writing out your parsed Lua AST?
[+] sausage_dog|4 years ago|reply
If you like this idea, take a look at Nim. It has a syntax familiar to Python users with very capable AST metaprogramming built in.
[+] efitz|4 years ago|reply
In the security space we frequently have to mitigate potential injection attacks and parser vulnerabilities. Often this takes the form of checking for unexpected keywords or patterns in input, a la web application firewalls. For the past few years I’ve been advising teams that I work with that using allow and deny lists against ASTs prior to execution is a more effective technique than scanning input for patterns, because you sidestep encoding and fragmentation issues, which are common TTPs in bypassing input filters.
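A toy version of the allowlist idea, using Python's `ast` to vet an arithmetic expression before execution (the allowed node set is deliberately tiny; a real filter would need careful review of every node type admitted):

```python
import ast

# Allowlist of AST node types for simple arithmetic only.
ALLOWED = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
           ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub)

def safe_eval(expr):
    """Evaluate expr only if every node in its AST is allowlisted.

    Because the check runs on the parsed tree, encoding tricks and
    fragmented input can't smuggle a Call or Attribute node past it.
    """
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
    return eval(compile(tree, "<expr>", "eval"))
```

So `safe_eval("2 + 3 * 4")` evaluates normally, while `safe_eval("__import__('os')")` is rejected at the tree level, before anything runs.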
[+] grdvnl|4 years ago|reply
I have done something similar using https://pybowler.io/. This library wraps the LibCST library and abstracts away some of the refactoring operations into nice wrapper functions.
[+] flakiness|4 years ago|reply
This kind of programmatic transformation is believed to be easier with statically typed languages. It's interesting to see how these folks identified patterns to effectively figure out the types of specific bits of the code.

I guess "good code" in a dynamic language tends to have this kind of uniformity / unsurprising nature, which helps you navigate the code without the types' help. On the other hand, code in statically typed languages tends to get smarter about types to make it shorter, sometimes ending up with hard-to-read (or at least hard-to-skim) lines.

I won't say which is worse, but this is a good counterexample to the still-widely-believed idea that dynamically typed languages lead to shorter code while statically typed languages produce longer (more verbose) code.

[+] radicalbyte|4 years ago|reply
I've taken to writing statically typed code with the IDE in mind - I almost always use one, and they've become amazing and are no slower than text editors nowadays (thanks to Electron making text editors slow).
[+] maxdw101|4 years ago|reply
Absolutely agree! Doing this with dynamic languages is much harder. Unless, of course, one works on ASTs at run-time.
[+] the_mitsuhiko|4 years ago|reply
If someone wants to do something similar: it’s better to use the 2to3 machinery than ast as it can retain comments.
[+] ianlevesque|4 years ago|reply
They address that with a bit of a hack.

> One of the problems of patching code using ast package of Python is that it loses all the formatting and comments of the original source code. This can be solved by making the patch script a little smarter. Instead of having it unparse the entire patched AST and write that to disk, we can make it unparse only the nodes it modified and insert the modified code at the corresponding line number in the file. The ast nodes have lineno attribute that can be used to retrieve the line number of the file to be injected the patched code with.

[+] archi42|4 years ago|reply
Neat! The comments are an issue, though. But it can be solved: just parse the comments as AST nodes, too.

I did just that for a converter/transpiler which "updated" code written in an old DSL into a much newer variant of that DSL: a lexer, a parser, a transformer, and a printer. The result was not as perfect as doing a mindful, manual transformation for each individual case - but it is still quite maintainable. And, most importantly, the whole task took us merely months instead of years.

Also, Perl/regexes are awesome for this: With a one-shot transformation it does not matter if transforming 100000 lines takes 5m or 10s, so I could save a lot of time by letting lexer & parser try regexes (essentially guessing and some backtracking = stupid long lookahead) instead of writing a minimal LR parser ;-)

[+] ay|4 years ago|reply
I had the problem of applying patches to files that had been indented differently in the meantime.

So I wrote a tool that takes a context diff and rebuilds the before and after text. It parses both as vectors of (leading whitespace, token) pairs, creates a diff of those vectors ignoring the whitespace, parses the target file into a similar vector, and applies the diff to it (again ignoring/copying whitespace), after which it reconstructs the text.

The results are surprisingly robust, despite the relatively naive tokenization heuristic.

https://github.com/ayourtch/tbpatch
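A much-simplified sketch of that idea in Python (the real tbpatch diffs token vectors; here we just match and replace one hunk while ignoring and then restoring indentation):

```python
def strip_indent(lines):
    return [l.lstrip() for l in lines]

def apply_reindented(old, new, target):
    """Replace the first whitespace-insensitive match of `old` in
    `target` with `new`, re-indented to fit the target's own style."""
    bare_target = strip_indent(target)
    bare_old = strip_indent(old)
    n = len(bare_old)
    for i in range(len(bare_target) - n + 1):
        if bare_target[i:i + n] == bare_old:
            # Reuse the indentation of the first matched target line.
            first = target[i]
            indent = first[:len(first) - len(first.lstrip())]
            return target[:i] + [indent + l.lstrip() for l in new] + target[i + n:]
    raise ValueError("hunk not found")
```

So a hunk written against 4-space-indented code still applies cleanly to a copy that was re-indented to 8 spaces, since only the stripped content has to match.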

[+] rikatee|4 years ago|reply
We use python's AST to power https://pypi.org/project/django-doctor/ - it's a Django linter that suggests the fix and makes the change it suggests if you approve the change.

Eventually we will replace the AST with a CST (Concrete Syntax Tree), because the tricks we used to maintain newlines and comments indicate a CST was what we actually needed.

[+] dwheeler|4 years ago|reply
This is a useful technique. But it'd be nice if library developers made it much less necessary. Removing interfaces makes updating hard, even with AST-based techniques, and encourages developers to never upgrade. Then, when a vulnerability is found, it's hard to upgrade.
[+] 1nj3ct0r|4 years ago|reply
As others have said in the comments, it may be better to use the 2to3 machinery. Here's the 2to3 doc: https://docs.python.org/3.9/library/2to3.html
[+] punnerud|4 years ago|reply
Isn't 2to3 only for converting Python2 to Python3 code? Could possibly be used for code replacement in Python3, but does not seem like that is the purpose. From the bottom of the page on your link: "The lib2to3 module may be removed from the standard library in a future Python version."
[+] maxdw101|4 years ago|reply
How scalable is this approach of using ASTs? Can you scale this to a single codebase with 500K+ lines? The article talks about multiple different code bases that together added up to 100K.

Second, does one need to worry about getting this right in dynamically typed languages?

[+] shados|4 years ago|reply
To both your questions: a lot of large companies do thousands of pull requests a month/year using this methodology - so millions of lines of code, in any and all programming languages (including JavaScript, which covers your dynamically typed language question).

The tooling (generally built internally on top of a couple of open source packages) matters a lot here. You generally don't want to create a single pull request with hundreds of thousands of changes.

The way I've done it and seen it done, is using tools that stagger/schedule pull requests, generally across packages (libraries, apps).

You build a database of every package in your system. If they're across multiple repositories (as opposed to a monorepo) you keep track of that.

When you need to make a changeset, you first narrow down your targets (by language, using GitHub Search, OpenGrok, Sourcegraph, whatever). Then for each match, you run your AST transform. If the transform returns a non-null change, you create a pull request for each one (usually using some kind of cron-job system or something clever on top of k8s).

Teams that own the package can then review and merge them, or the system can auto-merge after tests pass or something.

You really want to go 80/20 with your AST transform at large scale (building a script that will get it 100% right in every case would take too long), then do dry runs to see where it gets it wrong. Either tweak things or handle those cases manually.

For dynamic languages, you can rely on rich AST parsers to give you more info (eg: for JavaScript you can use TypeScript to infer some info, even if the original code is not in TypeScript). You also rely heavily on tests to know if you're breaking something or not.

[+] Agingcoder|4 years ago|reply
We do the same thing, but with a statically typed language.

We had started by distributing the changes manually across the team, which was slow and introduced bugs. We then switched to the fully automated approach, which was significantly faster and bug-free.