Author of HumanifyJS here! I've created an LLM-based tool specifically for this; it uses LLMs at the AST level to guarantee that the code keeps working after the unminification step:
Would it be difficult to add a 'rename from scratch' feature? I mean a feature that takes normal code (as opposed to minified code) and (1) scrubs all of the user's meaningful names, (2) chooses names based on the algorithm and the remaining names (i.e. the built-in names).
Sometimes when I refactor, I do this manually with an LLM. It is useful in at least two ways: it can reveal better (more canonical) terminology for names (eg: 'antiparallel_line' instead of 'parallel_line_opposite_direction'), and it can also reveal names that could be generalized (eg: 'find_instance_in_list' instead of 'find_animal_instance_in_animals').
What kind of question does it ask the LLM? Giving it a whole function and asking "What should we rename <variable 1>?" repeatedly until everything has been renamed?
Asking it to do it on the whole thing, then parsing the output and checking that the AST still matches?
Does it work with huge files? I'm talking about something like 50k lines.
Edit: I'm currently trying it with a mere 1.2k-line JS file (openai mode) and it's only 70% done after 20 minutes. Even if it theoretically works with a 50k LOC file, I don't think you should try.
It does work with files of any size, although it is quite slow if you're using the OpenAI API. HumanifyJS processes each variable name separately, which keeps the context size manageable for the LLM.
I'm currently working on parallelizing the rename process, which should give orders of magnitude faster processing times for large files.
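The parallelization itself could follow a standard bounded-concurrency pattern: the per-variable LLM queries run in parallel, while the renames they produce are still applied sequentially afterwards. A generic sketch (not HumanifyJS's actual implementation):

```javascript
// Run fn over items with at most `limit` calls in flight at once.
// Results come back in the original order.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++; // safe: JS is single-threaded between awaits
      results[i] = await fn(items[i], i);
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```

With, say, `limit = 10`, ten rename queries would be in flight at any time instead of one, which is where the order-of-magnitude speedup would come from.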
> Large files may take some time to process and use a lot of tokens if you use ChatGPT. For a rough estimate, the tool takes about 2 tokens per character to process a file:
> echo "$((2 * $(wc -c < yourscript.min.js)))"
> So for reference: a minified bootstrap.min.js would take about $0.50 to un-minify using ChatGPT.
> Using humanify local is of course free, but it may take more time, be less accurate, and may not be possible with your existing hardware.
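Plugging the ~2 tokens per character estimate into a tiny calculator (the per-token price below is a placeholder, not a quoted rate; check current API pricing):

```javascript
// Rough cost estimate, assuming ~2 tokens per character of input.
// usdPerMillionTokens is a placeholder, not an official price.
function estimateUnminifyCost(fileSizeBytes, usdPerMillionTokens) {
  const tokens = 2 * fileSizeBytes;
  return { tokens, usd: (tokens / 1_000_000) * usdPerMillionTokens };
}

// e.g. a ~60 kB minified file at $4 per million tokens:
// estimateUnminifyCost(60_000, 4) → ~120,000 tokens, about $0.48
```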
It uses smart feedback to fix the code when the LLM occasionally hiccups. You could also have a "supervisor LLM" that asserts that the resulting code matches the specification and gives feedback if it doesn't.
It's a shame this loses one of the most useful aspects of LLM un-minifying: making sure the result is actually how a person would write it. E.g. GPT-4o directly gives the exact same code (plus contextual comments), with the exception of writing the for loop in the example in a natural way:
for (var index = 0; index < inputLength; index += chunkSize) {
Comparing the ASTs is useful, though. Perhaps there's a way to combine the approaches: have the LLM convert, compare the ASTs, have the LLM explain the practical differences (if any) in the context of the actual implementation, and give it a chance to make any changes "more correct". Still not guaranteed to be perfect, but it would produce significantly more natural code.
As someone who has spent countless hours and days deobfuscating malicious Javascript by hand (manually and with some scripts I wrote), your tool is really, really impressive. Running it locally on a high end system with a RTX 4090 and it's great. Good work :)
How do you make an LLM work at the AST level? Do you just feed a normal LLM a text representation of the AST, or do you build an LLM where the basic data structure is an AST node rather than a character string (human-language word)?
The frontier models can all work with both source code and ASTs as a result of their standard training.
Knowing this raises the question: which is better to feed an LLM, source code or ASTs?
The answer is that it really depends on the use case; there are tradeoffs. For example, keeping comments intact possibly gives the model hints that help it reason better. On the other hand, it can be argued that a pure AST has less noise to confuse the model.
There are other tradeoffs as well. For example, any analysis relating to coding styles would require the full source code.
At the structural level it's exactly 1:1: HumanifyJS only does renames, no refactoring. It may come up with better names for variables than the original code had, though.
Came here to say Humanify is awesome, both as a specific tool and, in my opinion, as a really great way to think about getting the most from inherently high-temperature activities like modern decoder nucleus sampling.
thomassmith65|1 year ago
jehna1|1 year ago
1. I ask the LLM to describe the meaning of the variable in the surrounding code
2. Given just that description, I ask the LLM to come up with the best possible variable name
You can check the source code for the actual prompts:
https://github.com/jehna/humanify/blob/eeff3f8b4f76d40adb116...
firtoz|1 year ago
I'm still waiting for AST-level version control, tbh
jansvoboda11|1 year ago
rightonbrother|1 year ago
sebstefan|1 year ago
jehna1|1 year ago
1. It asks the LLM to write a description of what the variable does
2. It asks for a good variable name based on the description from 1.
3. It uses a custom Babel plugin to do a scope-aware rename
This way the LLM only decides the name, but the actual renaming is done with traditional and reliable tools.
thrdbndndn|1 year ago
jehna1|1 year ago
kingsloi|1 year ago
punkpeye|1 year ago
jehna1|1 year ago
cryptoz|1 year ago
jehna1|1 year ago
https://arxiv.org/pdf/2405.15793
zamadatix|1 year ago
ouraf|1 year ago
Making the code, fully commenting it, and also giving an example after that might cost three times as much.
strictnein|1 year ago
boltzmann-brain|1 year ago
WhitneyLand|1 year ago
dunham|1 year ago
jehna1|1 year ago
Babel first parses the code into an AST, and for each variable the tool:
1. Gets the variable name and surrounding scope as code
2. Asks the LLM to come up with a good name for the given variable, by looking at the scope where the variable lives
3. Uses Babel to apply the context-aware rename to the AST based on the LLM's response
bgirard|1 year ago
jehna1|1 year ago
fny|1 year ago
jehna1|1 year ago
API access is needed for ChatGPT mode, as there are many round trips and it uses advanced API-only tricks to force the LLM's output format.
KolmogorovComp|1 year ago
jehna1|1 year ago
For small scripts I've found the output to be very similar between small local models and GPT-4o (judging by a human eye).
anticensor|1 year ago
jehna1|1 year ago
benreesman|1 year ago
+1
neoOpus|1 year ago