Quickly looking at the source code, mostly treeBuilder and tokenizer, I do see several possible improvements:
- Use TypeScript instead of JavaScript
- Use perfect hashes instead of ["a", "b", "c"].includes() idioms, string equalities, Sets, etc.
- Use a single perfect hash to match all tags/attribute names and then use enums in the rest of the codebase
- Use a single if (token.kind === Tag.START) instead of repeating that check for 10 consecutive conditionals
- Don't return the "reprocess" constant, but use an enum or perhaps nothing if "reprocess" is the only option
- Try tail recursion instead of a switch over the state in the tokenizer
- Use switches (best after a perfect hash lookup) instead of multiple ifs on characters in the tokenizer
- "treeBuilder.openElements = treeBuilder.open_elements;" can't possibly be good code

Perhaps the agent can find these themselves if told to make the code perfect and not just pass tests
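To make a few of the points above concrete, here is a rough sketch in plain JavaScript. It is not taken from the project: `TagName`, `TAG_IDS`, `tagId` and `handleToken` are made-up names, the returned strings are placeholders, and a `Map` stands in for a generated perfect hash (it is an ordinary hash table, not a perfect one). The idea is just to resolve each name to an integer once, check the token kind once, and then switch on the integer.

```javascript
// Hypothetical token kinds and tag ids; a real table would cover every known tag.
const Tag = Object.freeze({ START: 0, END: 1 });
const TagName = Object.freeze({ UNKNOWN: 0, HTML: 1, HEAD: 2, BODY: 3, P: 4, LI: 5 });

// Single lookup table, built once. A Map is used here as a stand-in for a
// generated perfect hash function.
const TAG_IDS = new Map([
  ["html", TagName.HTML],
  ["head", TagName.HEAD],
  ["body", TagName.BODY],
  ["p", TagName.P],
  ["li", TagName.LI],
]);

// Resolve a tag name to its integer id; unknown names map to UNKNOWN.
function tagId(name) {
  return TAG_IDS.get(name) ?? TagName.UNKNOWN;
}

// Check the token kind once, then branch on the integer id instead of
// repeating string comparisons in every conditional.
function handleToken(token) {
  if (token.kind !== Tag.START) return "ignored";
  switch (token.tagId) {
    case TagName.P:
    case TagName.LI:
      return "close-implied-then-insert";
    case TagName.HTML:
    case TagName.HEAD:
    case TagName.BODY:
      return "structural";
    default:
      return "insert";
  }
}
```

With this shape the tokenizer would call `tagId(name)` once when it emits a token, and the tree builder would only ever compare integers.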
simonw|2 months ago
I didn't include the TypeScript bit though - it didn't use TypeScript because I don't like adding a build step to my JavaScript projects if I can possibly avoid it. The agent would happily have used TypeScript if I had let it.
I don't like that openElements = open_elements pattern either - it did that because I asked it for a port of a Python library and it decided to support the naming conventions for both Python and JavaScript at once. I told it to remove all of those.
I had it run a micro benchmark too against the before and after - here's the code it used for that: https://github.com/simonw/justjshtml/blob/a9dbe2d7c79522a76f...
After applying your suggestions: It pushed back against the tail recursion suggestion:
> The current implementation uses a switch statement in step(). JavaScript doesn’t have proper tail call optimization (only Safari implements it), so true tail recursion would cause stack overflow on large documents.
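To illustrate the trade-off in that quote, here is a toy two-state tokenizer written both ways. This is only a sketch, not the project's step() function: the states, the function names and the "count characters inside <...>" task are all invented for the example.

```javascript
// Hypothetical states for a toy tokenizer.
const DATA = 0;
const TAG = 1;

// Loop + switch over the state: stack depth stays constant for any input size.
function countTagCharsLoop(input) {
  let state = DATA;
  let count = 0;
  for (let i = 0; i < input.length; i++) {
    const ch = input[i];
    switch (state) {
      case DATA:
        if (ch === "<") state = TAG;
        break;
      case TAG:
        if (ch === ">") state = DATA;
        else count++;
        break;
    }
  }
  return count;
}

// Tail-call style: each state is a function that returns a call to the next
// state. Without proper tail calls in the engine, every character consumed
// adds a stack frame, so the depth grows with the input length.
function dataState(input, i, count) {
  if (i >= input.length) return count;
  return input[i] === "<"
    ? tagState(input, i + 1, count)
    : dataState(input, i + 1, count);
}

function tagState(input, i, count) {
  if (i >= input.length) return count;
  return input[i] === ">"
    ? dataState(input, i + 1, count)
    : tagState(input, i + 1, count + 1);
}
```

Both versions agree on small inputs. On a large document the recursive one is expected to hit the engine's call stack limit wherever tail calls are not eliminated, while the loop is unaffected.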