jehna1 | 1 year ago

Author of HumanifyJS here! I've created an LLM-based tool specifically for this, which uses LLMs at the AST level to guarantee that the code keeps working after the unminification step:

https://github.com/jehna/humanify

thomassmith65 | 1 year ago

Would it be difficult to add a 'rename from scratch' feature? I mean a feature that takes normal code (as opposed to minified code) and (1) scrubs all the user's meaningful names, (2) chooses names based on the algorithm and remaining names (ie: the built-in names).

Sometimes when I refactor, I do this manually with an LLM. It is useful in at least two ways: it can reveal better (more canonical) terminology for names (eg: 'antiparallel_line' instead of 'parallel_line_opposite_direction'), and it can also reveal names that could be generalized (eg: 'find_instance_in_list' instead of 'find_animal_instance_in_animals').

jehna1 | 1 year ago

Yes, I think you could use HumanifyJS for that. The way it works is that:

1. I ask the LLM to describe the meaning of the variable in the surrounding code

2. Given just the description, I ask the LLM to come up with the best possible variable name

You can check the source code for the actual prompts:

https://github.com/jehna/humanify/blob/eeff3f8b4f76d40adb116...
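
For a rough picture of what those two steps look like in practice (the prompt wording and model name here are my own placeholders, not the tool's actual prompts; those are in the linked file):

    const OpenAI = require("openai");
    const client = new OpenAI();

    // Step 1: ask the LLM what the variable means in its surrounding code.
    async function describeVariable(name, surroundingCode) {
      const res = await client.chat.completions.create({
        model: "gpt-4o-mini", // placeholder model
        messages: [{
          role: "user",
          content: `Describe the meaning of the variable \`${name}\` in this code:\n\n${surroundingCode}`,
        }],
      });
      return res.choices[0].message.content;
    }

    // Step 2: given only that description, ask for the best possible name.
    async function suggestName(description) {
      const res = await client.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [{
          role: "user",
          content: `Suggest the best JavaScript variable name for a variable described as:\n\n${description}\n\nReply with the name only.`,
        }],
      });
      return res.choices[0].message.content.trim();
    }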

firtoz | 1 year ago

More tools should be built on ASTs, great work!

I'm still waiting for AST-level version control tbh

sebstefan | 1 year ago

What kind of question does it ask the LLM? Giving it a whole function and asking "What should we rename <variable 1>?" repeatedly until everything has been renamed?

Asking it to do it on the whole thing, then parsing the output and checking that the AST still matches?

jehna1 | 1 year ago

For each variable:

1. It asks the LLM to write a description of what the variable does

2. It asks for a good variable name based on the description from 1.

3. It uses a custom Babel plugin to do a scope-aware rename

This way the LLM only decides the name, but the actual renaming is done with traditional and reliable tools.
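
To illustrate step 3: this isn't HumanifyJS's actual plugin, just a minimal sketch of what a scope-aware rename looks like with Babel's standard APIs:

    const { parse } = require("@babel/parser");
    const traverse = require("@babel/traverse").default;
    const generate = require("@babel/generator").default;

    // Rename a variable only within the scope that binds it, so
    // same-named variables in other scopes are left untouched.
    function renameVariable(code, oldName, newName) {
      const ast = parse(code);
      traverse(ast, {
        Scopable(path) {
          if (path.scope.hasOwnBinding(oldName)) {
            path.scope.rename(oldName, newName);
            path.stop();
          }
        },
      });
      return generate(ast).code;
    }

    // Example: renameVariable("var a = 1; console.log(a);", "a", "count")
    // renames both the declaration and the reference.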

thrdbndndn | 1 year ago

Does it work with huge files? I'm talking about something like 50k lines.

Edit: I'm currently trying it with a mere 1.2k-line JS file (OpenAI mode) and it's only 70% done after 20 minutes. Even if it theoretically works with a 50k LOC file, I don't think you should try.

jehna1 | 1 year ago

It does work with files of any size, although it is quite slow if you're using the OpenAI API. HumanifyJS processes each variable name separately, which keeps the context size manageable for an LLM.

I'm currently working on parallelizing the rename process, which should give orders-of-magnitude faster processing times for large files.

kingsloi | 1 year ago

It has this in the README:

> Large files may take some time to process and use a lot of tokens if you use ChatGPT. For a rough estimate, the tool takes about 2 tokens per character to process a file:

> echo "$((2 * $(wc -c < yourscript.min.js)))"

> So for reference: a minified bootstrap.min.js would take about $0.5 to un-minify using ChatGPT.

> Using humanify local is of course free, but may take more time, be less accurate, and may not be possible with your existing hardware.
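
To put numbers on that (assuming bootstrap.min.js is roughly 60 kB, which is about right for recent versions): 60,000 characters × 2 tokens per character ≈ 120,000 tokens, which is the scale behind the ~$0.5 figure quoted above.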

punkpeye | 1 year ago

Looks useful! I will update the article to link to this tool. Thanks for sharing!

jehna1 | 1 year ago

Super, thank you for adding the link! It really helps people find the tool.

cryptoz | 1 year ago

Finally someone else using ASTs while working with LLMs and modifying code! This is such an under-utilized area. I am also doing this with good results: https://codeplusequalsai.com/static/blog/prompting_llms_to_m...

jehna1 | 1 year ago

Super interesting! Since you're generating code with LLMs, you should check out this paper:

https://arxiv.org/pdf/2405.15793

It uses smart feedback to fix the code when the LLM occasionally hiccups. You could also have a "supervisor LLM" that asserts that the resulting code matches the specification and gives feedback if it doesn't.

zamadatix | 1 year ago

It's a shame this loses one of the most useful aspects of LLM un-minifying - making sure it's actually how a person would write it. E.g. GPT-4o directly gives the exact same code (plus contextual comments), with the exception of writing the for loop in the example in a natural way:

    for (var index = 0; index < inputLength; index += chunkSize) {

Comparing the ASTs is useful though. Perhaps there's a way to combine the approaches: have the LLM convert, compare the ASTs, have the LLM explain the practical differences (if any) in the context of the actual implementation, and give it a chance to make any changes "more correct". Still not guaranteed to be perfect, but the resulting code would be significantly more "natural".
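
A hedged sketch of that "compare the ASTs" step using @babel/parser (my own illustration, not something HumanifyJS ships): parse both versions, strip positions and comments, and deep-compare. Note this treats renames as differences too; normalizing identifier names is left out for brevity.

    const { parse } = require("@babel/parser");

    // Drop location/comment fields so only structure is compared.
    function normalize(node) {
      if (Array.isArray(node)) return node.map(normalize);
      if (node && typeof node === "object") {
        const out = {};
        for (const key of Object.keys(node)) {
          if (["start", "end", "loc", "range", "leadingComments",
               "trailingComments", "innerComments"].includes(key)) continue;
          out[key] = normalize(node[key]);
        }
        return out;
      }
      return node;
    }

    function sameAst(codeA, codeB) {
      const a = normalize(parse(codeA).program);
      const b = normalize(parse(codeB).program);
      return JSON.stringify(a) === JSON.stringify(b);
    }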

ouraf | 1 year ago

Depends on how many tokens you want to spend.

Making the code, fully commenting it, and also giving an example after that might cost three times as much.

strictnein | 1 year ago

As someone who has spent countless hours and days deobfuscating malicious JavaScript by hand (manually and with some scripts I wrote), your tool is really, really impressive. Running it locally on a high-end system with an RTX 4090 and it's great. Good work :)

boltzmann-brain | 1 year ago

how do you make an LLM work on the AST level? do you just feed a normal LLM a text representation of the AST, or do you make an LLM where the basic data structure is an AST node rather than a character string (human-language word)?

WhitneyLand | 1 year ago

The frontier models can all work with both source code and ASTs as a result of their standard training.

Knowing this raises the question: which is better to feed an LLM, source code or ASTs?

The answer really depends on the use case; there are tradeoffs. For example, keeping comments intact possibly gives the model hints to reason better. On the other side, it can be argued that a pure AST has less noise for the model to be confused by.

There are other tradeoffs as well. For example, any analysis relating to coding style would require the full source code.
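
On the question upthread: in practice you usually feed a plain-text serialization of the AST. A minimal sketch with @babel/parser (the sample snippet and names are just for illustration):

    const { parse } = require("@babel/parser");

    // Serialize the AST to JSON text that can go into a prompt,
    // optionally dropping comments (one side of the tradeoff above).
    const code = "const total = items.reduce((sum, x) => sum + x.price, 0);";
    const ast = parse(code, { attachComment: false });
    const astText = JSON.stringify(ast.program.body, null, 2);
    // astText is then included in the prompt instead of, or alongside,
    // the raw source.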

dunham | 1 year ago

It looks like they're running `webcrack` to deobfuscate/unminify and then asking the LLM for better variable names.

jehna1 | 1 year ago

I'm using both a custom Babel plugin and LLMs to achieve this.

Babel first parses the code to AST, and for each variable the tool:

1. Gets the variable name and surrounding scope as code

2. Asks the LLM to come up with a good name for the variable, based on the scope where it is used

3. Uses Babel to apply the scope-aware rename to the AST, based on the LLM's response
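
For step 1, a minimal sketch of grabbing the surrounding scope as code with Babel's standard APIs (my own illustration, not the tool's actual implementation):

    const { parse } = require("@babel/parser");
    const traverse = require("@babel/traverse").default;
    const generate = require("@babel/generator").default;

    // Find the scope that binds the variable and print that scope's
    // node back to source; this snippet becomes the LLM's context.
    function scopeCodeFor(code, variableName) {
      const ast = parse(code);
      let snippet = null;
      traverse(ast, {
        Scopable(path) {
          if (path.scope.hasOwnBinding(variableName)) {
            snippet = generate(path.scope.block).code;
            path.stop();
          }
        },
      });
      return snippet;
    }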

bgirard | 1 year ago

How well does the result compare to the original un-minified code if you run minify + humanify on it? Would be neat if it could improve mediocre code.

jehna1 | 1 year ago

On a structural level it's exactly 1:1: HumanifyJS only does renames, no refactoring. It may come up with better names for variables than the original code, though.

fny | 1 year ago

Is it possible to add a mode that doesn't depend on API access (e.g. copy and paste this prompt to get your answer)? Or do you make roundtrips?

jehna1 | 1 year ago

There is a fully local mode that does not use ChatGPT at all – everything happens on your local machine.

API access is needed for ChatGPT mode, as there are many round trips and it uses advanced API-only tricks to force the LLM's output.
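
The thread doesn't spell those tricks out, but one common API-only way to force the shape of the output is a forced tool/function call with a JSON schema. A hypothetical sketch (prompt, model, and schema are my own, not necessarily what HumanifyJS does):

    const OpenAI = require("openai");
    const client = new OpenAI();

    // Force the model to reply via a function call whose schema only
    // allows a single "newName" string.
    async function forcedRename(description) {
      const res = await client.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [{ role: "user", content: `Suggest a variable name for: ${description}` }],
        tools: [{
          type: "function",
          function: {
            name: "rename",
            parameters: {
              type: "object",
              properties: { newName: { type: "string" } },
              required: ["newName"],
            },
          },
        }],
        tool_choice: { type: "function", function: { name: "rename" } },
      });
      const call = res.choices[0].message.tool_calls[0];
      return JSON.parse(call.function.arguments).newName;
    }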

KolmogorovComp | 1 year ago

Thanks for your tool. Have you been able to quantify the gap between your local model and chatgpt in terms of ‘unminification performance’?

jehna1 | 1 year ago

At the moment I haven't found good ways of measuring the quality difference between models. Please share if you have any ideas!

For small scripts I've found the output to be very similar between small local models and GPT-4o (judging by eye).

anticensor | 1 year ago

Thanks for creating this megafier, can you add support for local LLMs?

jehna1 | 1 year ago

Better yet, it already has support for local LLMs! You can use them via `humanify local`.

benreesman | 1 year ago

Came here to say Humanify is awesome, both as a specific tool and, in my opinion, as a really great way to think about how to get the most out of inherently high-temperature activities like modern decoder nucleus sampling.

+1