Some notes:
- Based on GPT-3.5. Essentially, the test was "how well can GPT produce ML code" (e.g., tuning hyperparameters, working from case studies).
- Did not compare against human performance, only against other ML models (unless "human" is taken to mean a perfect score, in which case GPT reached 86%; though I doubt a human would actually score 100% on the benchmark).