I've been thinking about a benchmark designed this way for a while. It doesn't even need to be code in particular; it could be basic reasoning problems. The key is that you define a new, random language that has never been seen before (maybe it has statistical similarity to existing languages, maybe not), create a translation key, then ask a question in that language, so the model can't lean on anything it memorized in training.
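A minimal sketch of the idea, assuming the "new language" is nothing more than a word-for-word substitution with randomly generated tokens (a real benchmark would probably want its own grammar and word order too). The names `make_key`, `encode`, and `decode` are made up for illustration, and the sample question is just a placeholder:

```python
import random
import string


def make_key(vocab, seed, min_len=4, max_len=8):
    """Map each source word to a unique, randomly generated token.

    Assumption: a plain substitution vocabulary stands in for the
    'never before seen' language. The seed makes the key reproducible.
    """
    rng = random.Random(seed)
    words = sorted(set(vocab))
    used = set(words)  # avoid tokens that collide with real source words
    key = {}
    for word in words:
        while True:
            length = rng.randint(min_len, max_len)
            token = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
            if token not in used:
                break
        used.add(token)
        key[word] = token
    return key


def encode(text, key):
    """Translate a whitespace-separated sentence into the invented language."""
    return " ".join(key[w] for w in text.lower().split())


def decode(text, key):
    """Translate back using the inverted key (used for grading answers)."""
    reverse = {token: word for word, token in key.items()}
    return " ".join(reverse[t] for t in text.split())


# Hypothetical reasoning question; any problem set would do.
question = "if all cats are animals and tom is a cat is tom an animal"
key = make_key(question.split(), seed=42)
encoded = encode(question, key)

# The prompt would contain the translation key plus the encoded question.
prompt_key = "\n".join(f"{token} = {word}" for word, token in key.items())
```

You'd hand the model `prompt_key` and `encoded`, then decode its answer with the same key. Generating a fresh key per run (a new seed) means no fixed version of the language can end up in a training set.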
CuriouslyC|1 year ago
ape4|1 year ago
CuriouslyC|1 year ago