I know uutils/coreutils uses the old tests, which makes sense: that way you cover most of the intentional behavior. A more thorough method could be to generate a large set of random scripts with a capable LLM like GPT-4, run them in identical VMs with the two different binaries, and then log and diff each script's behavior.
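To make the log/diff step concrete, here is a rough sketch of what I have in mind. It is not anything uutils actually ships. The names `Observation`, `observe`, and `diff_behavior` are made up for illustration, and I'm assuming each environment can be reached by some command that reads a script on stdin (how you get into each VM is up to you).

```python
import difflib
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Observable behavior of one run: exit code, stdout, and stderr."""
    returncode: int
    stdout: str
    stderr: str


def observe(argv, script, timeout=10.0):
    """Run `argv`, feed it `script` on stdin, and capture what it does.

    `argv` is whatever command executes a script in one environment
    (hypothetical: a shell inside the VM that has one set of binaries).
    """
    proc = subprocess.run(
        argv, input=script, capture_output=True, text=True, timeout=timeout
    )
    return Observation(proc.returncode, proc.stdout, proc.stderr)


def diff_behavior(reference_argv, candidate_argv, script):
    """Return a list of differences; an empty list means identical behavior."""
    ref = observe(reference_argv, script)
    cand = observe(candidate_argv, script)

    report = []
    if ref.returncode != cand.returncode:
        report.append(f"exit code: {ref.returncode} != {cand.returncode}")

    # Compare each output stream line by line and keep a unified diff.
    for stream in ("stdout", "stderr"):
        a = getattr(ref, stream).splitlines()
        b = getattr(cand, stream).splitlines()
        report.extend(
            difflib.unified_diff(
                a, b, f"reference {stream}", f"candidate {stream}", lineterm=""
            )
        )
    return report
```

You would call `diff_behavior` once per generated script and keep only the scripts with a non-empty report. This only compares exit code, stdout, and stderr; filesystem side effects would need a separate snapshot and diff of the VM state, and error messages will probably need some normalization since the wording differs between implementations.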