
tshadley | 2 years ago

From the article:

"April 3, 2023 - Real Humans Can’t Tell the Difference Between a 13B Open Model and ChatGPT

Berkeley launches Koala, a dialogue model trained entirely using freely available data.

They take the crucial step of measuring real human preferences between their model and ChatGPT. While ChatGPT still holds a slight edge, more than 50% of the time users either prefer Koala or have no preference. Training Cost: $100."


adalyac | 2 years ago

Exciting to see open source catching up, but there still seems to be a big difference in factuality and common sense between ChatGPT and Koala. E.g.:

Who built AlexNet on the ImageNet dataset?

ChatGPT: AlexNet was built by a team of researchers led by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton at the University of Toronto. They entered the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 and won by a large margin, revolutionizing the field of computer vision with their deep convolutional neural network architecture. The AlexNet architecture consisted of eight layers, including five convolutional layers, two fully connected layers, and a softmax output layer.

Koala: AlexNet was built by Google DeepMind, a London-based artificial intelligence laboratory. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is an annual large-scale image and video dataset that was created for the ImageNet project by the ImageNet Large Scale Visual Recognition Challenge, which is an annual competition for image and video analysis.

crazyedgar | 2 years ago

This is hugely misleading. If your bot just memorizes Shakespeare and outputs segments from memory, of course nobody can tell the difference. But as soon as you start interacting with the models, the difference couldn't be more pronounced.

e63f67dd-065b | 2 years ago

The test was conducted as follows:

>With these two evaluation sets, we conducted a blind pairwise comparison by asking approximately 100 evaluators on Amazon Mechanical Turk platform to compare the quality of model outputs on these held-out sets of prompts. In the ratings interface, we present each rater with an input prompt and the output of two models. They are then asked to judge which output is better (or that they are equally good) using criteria related to response quality and correctness.

No, it's not just memorising Shakespeare; real humans interacted with the models and rated them.
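
For anyone curious what "blind pairwise comparison" looks like in practice, here's a minimal sketch (my own illustration, not the Koala team's code; the function names are made up) of how such ratings are collected blind and aggregated into the ">50%" headline figure:

    from collections import Counter
    import random

    def blind_pair(prompt, koala_out, chatgpt_out):
        """Shuffle the two outputs so the rater can't tell which model is which."""
        pair = [("koala", koala_out), ("chatgpt", chatgpt_out)]
        random.shuffle(pair)
        # The rater sees only the prompt and the two texts; labels stay hidden
        # until their judgment is recorded.
        return prompt, pair

    def aggregate(ratings):
        """ratings: list of 'koala', 'chatgpt', or 'tie' judgments from raters."""
        counts = Counter(ratings)
        n = len(ratings)
        return {
            "koala_win_rate": counts["koala"] / n,
            "chatgpt_win_rate": counts["chatgpt"] / n,
            "tie_rate": counts["tie"] / n,
            # The headline number: Koala preferred OR judged equally good.
            "koala_at_least_as_good": (counts["koala"] + counts["tie"]) / n,
        }

    print(aggregate(["koala", "tie", "chatgpt", "koala", "tie"]))
    # {'koala_win_rate': 0.4, 'chatgpt_win_rate': 0.2, 'tie_rate': 0.4,
    #  'koala_at_least_as_good': 0.8}

The key point is the shuffle: raters judge anonymized output pairs, so style memorization alone wouldn't explain the result unless it actually made the responses preferable on real prompts.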